Job Description

Introduction

At IBM Infrastructure & Technology, we design and operate the systems that keep the world running. From high-resiliency mainframes and hybrid cloud platforms to networking, automation, and site reliability, our teams ensure the performance, security, and scalability that clients and industries depend on every day. Working in Infrastructure & Technology means tackling complex challenges with curiosity and collaboration. You’ll work with diverse technologies and colleagues worldwide to deliver resilient, future-ready solutions that power innovation. With continuous learning, career growth, and a supportive culture, IBM provides the opportunities to build expertise and shape the infrastructure that drives progress.

Site Reliability Engineers apply Software Engineering principles to perform infrastructure management tasks more efficiently. They are focused on reliability and resiliency, and build systems which proactively detect issues before they cause customer impact. They are responsible for maintaining a high-performance, secure, and stable infrastructure for our clients.

Additionally, SREs resolve customer issues and problems detected through monitoring. They participate in datacenter build and configuration activities, perform tests, and deploy new features and capacity.

Your Role And Responsibilities

As a Site Reliability Engineer, you will work in an agile, collaborative environment to build, deploy, configure, and maintain systems for the IBM client business. In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting, to deploying the latest software updates & fixes.

Site Reliability Engineering (SRE) professionals are engineers who specialize in reliability and resiliency, with the right mix of knowledge and skills in software and systems. They are responsible for analyzing business needs, determining problems, and advising on, designing, building, testing, deploying, changing, and maintaining well-engineered information systems and ecosystems.

Responsibilities

As a Compute Operations Site Reliability Engineer working US shift hours, you will perform the following tasks:

Monitor provisioning tests and investigate/resolve any failures
Perform code stack updates on infrastructure systems (VIOS, firmware, PowerVC, HMC, NovaLink, NIM servers) as well as cloud supporting systems (jump servers, sobox, network nodes, gateways, TSM servers)
Upload/maintain stock images
Maintain user IDs (add/delete) and passwords
Monitor daily/weekly backups to ensure they are working
Manage and maintain the Nagios monitoring environment, and troubleshoot scripts/plug-ins when issues arise
Perform periodic LPMs, inactive migrations, or remote restarts of customer VMs to perform system maintenance, balance workloads, or free up resources
Monitor and report the capacity utilized in each data center
Attend scheduled meetings planned by customer for cutover/maintenance windows
Verify capacity requirements when customers report provisioning failures
Work with customers to resolve any RSCT issues so that LPM activities can be performed without impacting customer workloads

Preferred Education

Bachelor's Degree

Required Technical And Professional Expertise

The candidate should be willing to work in US shift timings.

7-12 years of relevant industry work experience

In-depth knowledge of Power server hardware (models, I/O adapters, etc.)
Knowledge of HMC and experience operating it
In-depth knowledge of PowerVM including installation/configuration and operating
Experience with PowerVC including installation/configuration and operating
Experience with Linux administration, commands and networking
Knowledge of NovaLink including minimal installation/configuration
High-level knowledge of Power Systems supported operating systems (AIX and IBM i)
In-depth knowledge of how storage is connected and allocated to Power systems via NPIV connections
Good understanding of Power Systems network configuration at the system level

Preferred Technical And Professional Experience

Experience with configuring and tuning PowerVS
Experience training new personnel on tooling and processes
Storage & Power RTS, MVS Network for Cisco, Juniper; general support skills

Site Reliability Engineer – Compute Operations

IBM