Job Description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to learn and grow continuously. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you reach your full potential.

We are seeking a Lead Site Reliability Engineer (SRE) – Azure to champion service health, reliability, and performance as we launch and expand services for our client.

This pivotal position calls for deep expertise in incident response, troubleshooting, and evolving cloud reliability practices in fast-paced, high-stakes settings with limited process maturity.

Responsibilities

Create and automate workflows to improve system reliability, scalability, and performance
Work closely with development and operations teams to embed reliability best practices throughout the software development lifecycle
Respond rapidly to service incidents in the Azure environment, minimizing downtime and customer impact
Lead root cause investigations and post-incident reviews, ensuring that lessons learned translate into actionable improvements
Design, implement, and maintain comprehensive monitoring, alerting, and observability solutions for all critical services
Proactively identify and mitigate reliability risks before they affect customers
Establish and refine SRE processes, including incident management and defining service level objectives (SLOs)
Mentor and guide team members in SRE methodologies and effective use of Azure tools
Analyze patterns in incidents and outages to drive long-term reliability enhancements
Promote a culture of reliability, accountability, and continuous improvement

Requirements

Minimum of 5 years’ experience in SRE, DevOps, or related roles, with a strong background in cloud environments, especially Azure
At least one year in a leadership or team management role
Advanced troubleshooting skills in distributed systems, networking, and cloud-native architectures
Practical experience with Azure tools such as Azure Monitor, Log Analytics, and Application Insights, and with infrastructure-as-code tooling such as ARM templates, Bicep, and Terraform
Proficiency in scripting or programming languages like Python, PowerShell, or Bash
Solid understanding of incident management processes and post-incident analysis
Experience implementing observability solutions and defining service level indicators (SLIs)
Excellent communication skills and ability to collaborate effectively in high-pressure situations
English proficiency at B2 level or higher

Nice to have

Advanced skills in Python
Azure certifications such as Azure Solutions Architect Expert or DevOps Engineer Expert
Experience establishing SRE practices in environments with low process maturity
Familiarity with CI/CD pipelines and infrastructure-as-code practices
Background in mentoring or leading SRE/DevOps teams

We offer

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling, and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek, and LinkedIn

