Lead Engineer/ Engineer, MLOps / SRE (Developer Experience), xCloud

HTX

Job Description

What The Role Is

HTX is the first Science and Technology Agency of its kind in the world, bringing together science and engineering capabilities across the Home Team Departments to transform Singapore's homeland security landscape. We are a statutory board under the Ministry of Home Affairs, dedicated to developing cutting-edge technologies that empower our Home Team to solve crimes, save lives, secure borders, and safeguard public spaces.

As the MLOps/SRE Engineer for HTX's developer experience squad, you will be responsible for deploying, operating, and optimizing a production LLM system in our secure infrastructure. You will ensure the agentic code assistant is reliable, performant, and cost-effective, managing the full stack from LLM inference to vector databases, orchestration services, and observability. This role combines deep MLOps expertise with SRE discipline to support the Home Team's critical AI infrastructure.

What You Will Be Working On

  • LLM Deployment: Deploy and manage LLMs using vLLM/TensorRT-LLM on our GPU infrastructure, optimizing for throughput, latency, and GPU utilization
  • Infrastructure Management: Provision and maintain the supporting infrastructure, including vector databases (for RAG), orchestration services, Redis/queue systems, and API gateways
  • Performance Optimization: Profile and tune LLM inference performance, experimenting with batching strategies, context caching, and quantization techniques to maximize throughput within GPU constraints
  • Observability: Implement comprehensive monitoring using Prometheus, Grafana, DCGM exporters, and Elastic Stack to track inference latency, token throughput, cache hit rates, and system health
  • Reliability Engineering: Establish SLOs/SLIs, implement auto-scaling policies, design failure recovery mechanisms, and conduct chaos engineering to ensure high uptime
  • Cost Optimization: Monitor GPU utilization and inference costs, identify optimization opportunities, and implement strategies to reduce token usage and compute spend
  • Security & Compliance: Ensure all components operate within secure network boundaries, manage secrets and credentials securely, and maintain audit logs for compliance
  • Incident Response: Participate in on-call rotation, troubleshoot production incidents, conduct root cause analysis, and implement preventive measures
  • Capacity Planning: Model future load, forecast GPU requirements, and work with infrastructure teams to scale the platform as adoption grows

What We Are Looking For

  • 4+ years of experience in MLOps, SRE, or DevOps roles, with at least 1 year working with ML/AI systems
  • Hands-on experience deploying and operating LLMs in production (vLLM, TGI, TensorRT-LLM, or similar)
  • Strong Kubernetes expertise including operators, StatefulSets, and GPU scheduling
  • Deep understanding of GPU architecture and inference optimization techniques
  • Experience with observability tools (Prometheus, Grafana, ELK/Elastic Stack)
  • Solid Python and Bash scripting skills for automation
  • Knowledge of vector databases (Milvus, Weaviate, Qdrant, or Pinecone)
  • Experience with infrastructure-as-code (Terraform, Helm, Kustomize)

  • Experience with NVIDIA GPUs (A100/H100/B200) and DCGM monitoring
  • Understanding of LLM inference concepts: KV cache, continuous batching, PagedAttention
  • Familiarity with Ray clusters, Kubeflow, or MLflow
  • Background in SRE practices: SLO/SLI definition, error budgets, incident management
  • Experience with secure or regulated environments
  • Knowledge of LiteLLM, Kong Gateway, or API management platforms

Competencies:
  • Systems thinking with ability to diagnose complex issues across the ML stack
  • Data-driven decision making using metrics and telemetry
  • Proactive mindset focused on reliability, automation, and preventive measures
  • Strong debugging skills for GPU, networking, and distributed systems issues
  • Clear incident communication and documentation
  • Collaborative approach to working with data scientists, ML engineers, and platform teams



All new hires are appointed on a two-year contract in the first instance and will be assessed and considered for permanent tenure over time, based on performance.

As part of the shortlisting process for this role, you may be required to complete a medical declaration and/or undergo further assessment.



