Site Reliability Engineer – Compute Operations

IBM

Job Summary


Salary
₹79,581 - ₹114,312 / Monthly EST

Job Type
-

Seniority

Years of Experience
Information not provided

Tech Stacks
Linux

Job Description

Introduction

At IBM Infrastructure & Technology, we design and operate the systems that keep the world running. From high-resiliency mainframes and hybrid cloud platforms to networking, automation, and site reliability, our teams ensure the performance, security, and scalability that clients and industries depend on every day. Working in Infrastructure & Technology means tackling complex challenges with curiosity and collaboration. You’ll work with diverse technologies and colleagues worldwide to deliver resilient, future-ready solutions that power innovation. With continuous learning, career growth, and a supportive culture, IBM provides the opportunities to build expertise and shape the infrastructure that drives progress.

Site Reliability Engineers apply software engineering principles to perform infrastructure management tasks more efficiently. They focus on reliability and resiliency, and build systems that proactively detect issues before they cause customer impact. They are responsible for maintaining a high-performance, secure, and stable infrastructure for our clients.

Additionally, SREs resolve customer issues and problems detected through monitoring. They participate in datacenter build and configuration activities, perform tests, and deploy new features and capacity.

Your Role And Responsibilities

As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting, to deploying the latest software updates & fixes.

Site Reliability Engineering (SRE) professionals are engineers who specialize in reliability and resiliency, with the right mix of knowledge and skills in software and systems. They are responsible for analyzing business needs, determining problems, and advising on, designing, building, testing, deploying, changing, and maintaining well-engineered information systems and ecosystems.

Responsibilities

As a Compute Operations Site Reliability Engineer working US shift timings, you will perform the following tasks:

  • Monitor provisioning tests and investigate/resolve any failures
  • Perform code stack updates on infrastructure systems (VIOS, firmware, PowerVC, HMC, NovaLink, NIM servers) as well as cloud supporting systems (jump servers, sobox, network nodes, gateways, TSM servers)
  • Upload/maintain stock images
  • Maintain user IDs (add/delete) and passwords
  • Monitor daily/weekly backups to ensure they are working
  • Manage and maintain the Nagios monitoring environment, and troubleshoot scripts/plug-ins when issues arise
  • Perform periodic LPMs, inactive migrations, or remote restarts of customer VMs to perform system maintenance, balance workloads, or free up resources
  • Monitor and provide details of capacity utilized in each data center
  • Attend scheduled meetings planned by customer for cutover/maintenance windows
  • Verify capacity requirements when customers encounter provisioning failures
  • Work with customers to resolve any RSCT issues so that LPM activities can be performed without impacting customer workloads.

Preferred Education

Bachelor's Degree

Required Technical And Professional Expertise

The candidate should be willing to work in US shift timings.

Relevant Industry work experience of 7-12 years

  • In-depth knowledge of Power server hardware (models, I/O adapters, etc.)
  • HMC knowledge and operating experience
  • In-depth knowledge of PowerVM, including installation, configuration, and operation
  • Experience with PowerVC, including installation, configuration, and operation
  • Experience with Linux administration, commands and networking
  • Knowledge of NovaLink, including minimal installation/configuration
  • High-level knowledge of Power Systems supported operating systems (AIX and IBM i)
  • In-depth knowledge of how storage is connected and allocated to Power systems via NPIV connections
  • Good understanding of Power Systems network configuration at the system level

Preferred Technical And Professional Experience

  • Experience with configuring and tuning PowerVS
  • Experience training new personnel on tooling and processes
  • Storage & Power RTS, MVS Network for Cisco, Juniper; general support skills
