Lead Site Reliability Engineer (SRE) - Azure


EPAM Systems


Job Summary



Tech Stacks
Python, Azure, CI, Analytics, Terraform, PowerShell

Job Description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.


We have recently launched services for our client in Azure, and ensuring service health is our highest priority. As we establish our reputation through dependable, high-performing cloud solutions, we are looking for a Lead Site Reliability Engineer (SRE) who can make an immediate impact on incident response, troubleshooting, and the ongoing enhancement of our cloud reliability. This is a hands-on opportunity for someone who excels in high-pressure situations, can operate effectively in an environment with minimal SRE process maturity, and is passionate about both rapid incident response and building resilient systems for the future.


Responsibilities

  • Develop and automate operational workflows to enhance system reliability, scalability, and performance
  • Work closely with development and operations teams to integrate reliability best practices throughout the software development lifecycle
  • Respond quickly to service incidents in the Azure environment and resolve them, minimizing downtime and customer disruption
  • Lead root cause investigations and post-incident reviews, implementing actionable improvements
  • Design, deploy, and maintain comprehensive monitoring, alerting, and observability solutions for all critical services
  • Proactively identify and mitigate reliability risks before they affect customers
  • Help define and mature SRE practices, including incident management, blameless postmortems, and SLO/SLI development
  • Mentor and train team members in SRE methodologies and Azure best practices
  • Analyze incident and outage trends to drive long-term reliability improvements
  • Foster a culture of reliability, accountability, and continuous learning within the team

Requirements

  • At least 5 years of experience in SRE, DevOps, or related roles, with a proven track record in cloud environments (Azure experience required)
  • Minimum of one year in a leadership or team management role
  • Advanced troubleshooting skills in distributed systems, networking, and cloud-native architectures
  • Hands-on experience with Azure monitoring, logging, and automation tools such as Azure Monitor, Log Analytics, Application Insights, ARM, Bicep, and Terraform
  • Proficiency in at least one scripting or programming language, such as Python, PowerShell, or Bash
  • Strong understanding of incident management, on-call operations, and post-incident analysis
  • Experience implementing observability solutions and defining service level objectives (SLOs) and service level indicators (SLIs)
  • Excellent communication skills and the ability to collaborate effectively in high-pressure, cross-functional environments
  • English proficiency at B2 level or higher

Nice to have

  • Advanced proficiency in Python
  • Azure certifications such as Azure Solutions Architect or Azure DevOps Engineer
  • Experience building SRE practices from the ground up in environments with low process maturity
  • Familiarity with CI/CD pipelines and infrastructure as code
  • Experience mentoring or leading SRE/DevOps teams


We offer

  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
