We are seeking a highly skilled Site Reliability Engineering (SRE) Observability Engineer to join our Monitoring and Observability team. This role focuses on designing, implementing, and managing scalable observability solutions across a global enterprise environment. The ideal candidate will have strong expertise in Kubernetes/OpenShift, modern observability stacks, and automation, along with the ability to collaborate across teams and influence strategic technical decisions.
Responsibilities
- Operate and support observability platforms in a global, enterprise-scale environment
- Collaborate with cross-functional teams to design and implement observability solutions for large-scale deployments
- Manage and maintain legacy monitoring systems within the Production Management organization
- Drive the strategic development and delivery of end-to-end observability solutions
- Analyze complex system behaviors to identify issues and develop innovative solutions
- Influence business and technical decisions through expert guidance and recommendations
- Communicate effectively with stakeholders, demonstrating strong interpersonal and diplomacy skills
- Develop and maintain documentation for systems, processes, and operational procedures
- Perform additional duties as required
Requirements
- Experience with OpenShift/Kubernetes administration, including deploying, managing, and troubleshooting containerized applications
- Strong knowledge of observability tools and practices:
  - Grafana (dashboards, alerting, user and data source management)
  - Prometheus and PromQL for metrics collection and querying
  - Familiarity with Grafana ecosystem tools such as Mimir (metrics), Loki (logs), and Tempo (traces)
- Experience administering ITRS Geneos at scale
- Hands-on experience with Helm for application deployment and management (including chart creation and maintenance)
- Proficiency in scripting (Bash or Python) for automation of operational tasks
- Strong technical documentation skills
- Excellent communication and collaboration abilities
Nice to have
- Experience with application deployment using Lightspeed Enterprise
- Familiarity with Google Cloud operations and services
- Experience working in hybrid or multi-cloud environments
- Exposure to large-scale enterprise monitoring and observability transformations
We offer
- Opportunity to work on bleeding-edge projects
- Work with a highly motivated and dedicated team
- Competitive salary
- Flexible schedule
- Benefits package (medical insurance, sports)
- Corporate social events
- Professional development opportunities
- Well-equipped office
About Us
Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.