Job Description

Position: Junior Site Reliability Engineer

Location: Bangsar South

Type: Permanent

Mode: Hybrid

Benefits: EPF, SOCSO, Medical, Bonus, Visa

Responsibilities

• Support and oversee the availability, reliability, resilience, performance, security, and monitoring of applications on Azure Cloud and various supporting platforms to ensure that business operational SLAs and SLOs are met.

• Work closely with the service delivery function to undertake incident management, operational cost management, service improvement, and ongoing application health monitoring.

• Create a link between development and operations by applying a software engineering mindset to application reliability activities and instilling this culture in agile development teams.

• Maintain and improve the resiliency of core applications and infrastructure platforms through a continuous improvement backlog.

• Continuously improve the platform infrastructure through automation and standardization.

• Drive best-practice operational excellence for secure, high-performing, resilient, and efficient infrastructure, and for cost-optimized applications and workloads.

• Maintain existing automation infrastructure used to identify risks in areas such as performance, reliability, capability, and scalability.

• Take a modern approach aligned with practices such as Infrastructure as Code, Configuration as Code, and DevOps.

• Be responsible for the deployment, maintenance, and enhancement of a tribe's DevSecOps automation workflows, working closely with developers.

• Conduct root cause analysis on incidents and provide code fixes for permanent remediation.

• Document automation processes and technical administration tasks across all environments.

• Champion the adoption and culture change required for continuous improvement in application reliability, and embed these in day-to-day development practices.

Requirements

• You are an engineer with an interest in service reliability, automation, monitoring, scalability, and high-availability systems.

• You have an analytical mindset, natural curiosity, initiative, and willingness to think outside the box to solve problems, using engineering approaches to run better production systems.

• You have experience in a support function for customer-facing products and services, handling incidents under a service management framework and agile methodologies.

• You have a basic understanding of how Docker and Kubernetes work end to end.

• You enjoy working with the latest monitoring and metrics platforms such as Dynatrace, Azure Monitor, and Splunk.

• You have basic coding or scripting experience to support our systems: when problems need to be fixed, we can rely on your code or scripts to resolve them.

• You have experience in driving technology, people, processes, and culture change to instill site reliability in development practices.