Job Description

Mandatory Skills

Python, Site Reliability Engineer, ELK

Skills to Evaluate

Python, Site Reliability Engineer, ELK, AWS, GCP, Kubernetes, Docker, Ansible, Packer, Jenkins, Splunk, Cribl, Terraform, Vector, Prometheus, Linux, Helm, Datadog

Job Description

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, on Google Cloud Platform (GCP, preferred) or AWS.

Senior Site Reliability Engineer

Roles & Responsibilities

Reliability & Operations

Design, implement, and maintain highly available and resilient systems in Kubernetes-based environments
Define and enforce SLOs, SLIs, and error budgets
Lead incident response, RCA, and postmortems
Drive reliability improvements through automation
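
As an illustration of the SLO and error-budget work described above, here is a minimal Python sketch for a request-based availability SLO. The function name, field names, and sample numbers are hypothetical and not taken from any existing internal tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of the budget already spent (values above 1.0 mean it is exhausted).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated, so 250 failures consume about a quarter of the budget.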

Observability (Core Focus)

Architect and operate observability platforms for metrics, logging, tracing, and alerting
Work with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch
Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
Establish actionable alerting standards
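
One common way to keep alerts actionable is multi-window burn-rate alerting, where a page fires only if both a long and a short window are burning error budget quickly. The Python sketch below assumes a 30-day SLO window and uses the widely cited 14.4 threshold for a 1-hour/5-minute window pair; the function names and defaults are illustrative assumptions, not a prescribed standard for this role.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed: 1h burn at this rate spends ~2% of a 30-day budget
) -> bool:
    """Page only when both windows exceed the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging on resolved spikes.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical error ratios for the 1-hour and 5-minute windows.
    print(should_page(0.02, 0.03, slo_target=0.999))
```

In practice the error ratios would come from a metrics backend such as Prometheus; here they are passed in directly to keep the sketch self-contained.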

Cloud & Platform Engineering

Build and manage infrastructure on GCP (preferred) or AWS
Operate Kubernetes clusters (GKE preferred)
Deploy services using Helm
Manage containerized workloads using Docker

Automation & Tooling

Strong Python skills with emphasis on reliability, automation, and observability tooling
Develop automation and tooling using Python
Create internal reliability and monitoring tools
Integrate CI/CD pipelines with observability and reliability checks
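
A reliability check in a CI/CD pipeline might look like the following Python sketch, which compares a canary's error rate against the current baseline before a rollout proceeds. The function name, thresholds, and sample counts are hypothetical; a real gate would pull these counts from the team's monitoring stack.

```python
from typing import Tuple


def canary_gate(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_ratio: float = 1.5,      # assumed: canary may be at most 1.5x the baseline error rate
    min_requests: int = 100,     # assumed: minimum canary traffic for a meaningful decision
    rate_floor: float = 0.001,   # assumed: always tolerate up to 0.1% errors
) -> Tuple[bool, str]:
    """Decide whether a canary release is healthy enough to promote."""
    if canary_total < min_requests:
        return False, "insufficient canary traffic"

    baseline_rate = baseline_errors / baseline_total if baseline_total > 0 else 0.0
    canary_rate = canary_errors / canary_total

    # The floor keeps a near-zero baseline from making the gate impossibly strict.
    allowed_rate = max(baseline_rate * max_ratio, rate_floor)

    if canary_rate <= allowed_rate:
        return True, f"canary error rate {canary_rate:.4f} within allowed {allowed_rate:.4f}"
    return False, f"canary error rate {canary_rate:.4f} exceeds allowed {allowed_rate:.4f}"


if __name__ == "__main__":
    passed, reason = canary_gate(10, 10_000, 2, 1_000)
    print(reason)
    # A non-zero exit code fails the pipeline step.
    raise SystemExit(0 if passed else 1)
```

Returning a reason string alongside the boolean keeps the pipeline log readable when a deploy is blocked.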

Collaboration & Leadership