Job Description

We are seeking a highly skilled Site Reliability Engineering (SRE) Observability Engineer to join our Monitoring and Observability team. This role focuses on designing, implementing, and managing scalable observability solutions across a global enterprise environment. The ideal candidate will have strong expertise in Kubernetes/OpenShift, modern observability stacks, and automation, along with the ability to collaborate across teams and influence strategic technical decisions.

Responsibilities

Operate and support observability platforms in a global, enterprise-scale environment
Collaborate with cross-functional teams to design and implement observability solutions for large-scale deployments
Manage and maintain legacy monitoring systems within the Production Management organization
Drive the strategic development and delivery of end-to-end observability solutions
Analyze complex system behaviors to identify issues and develop innovative solutions
Influence business and technical decisions through expert guidance and recommendations
Communicate effectively with stakeholders, demonstrating strong interpersonal and diplomacy skills
Develop and maintain documentation for systems, processes, and operational procedures
Perform additional duties as required

Requirements

Experience with OpenShift/Kubernetes administration, including deploying, managing, and troubleshooting containerized applications
Strong knowledge of observability tools and practices:

Grafana (dashboards, alerting, user and data source management)
Prometheus and PromQL for metrics collection and querying
Familiarity with Grafana ecosystem tools such as Mimir (metrics), Loki (logs), and Tempo (traces)
Experience administering ITRS Geneos at scale

Hands-on experience with Helm for application deployment and management (including chart creation and maintenance)
Proficiency in scripting (Bash or Python) for automation of operational tasks
Strong technical documentation skills
Excellent communication and collaboration abilities

Nice to have

Experience with application deployment using Lightspeed Enterprise
Familiarity with Google Cloud operations and services
Experience working in hybrid or multi-cloud environments
Exposure to large-scale enterprise monitoring and observability transformations

We offer

Opportunity to work on bleeding-edge projects
Work with a highly motivated and dedicated team
Competitive salary
Flexible schedule
Benefits package: medical insurance, sports
Corporate social events
Professional development opportunities
Well-equipped office

About Us

Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.