Job Description

What The Role Is

HTX is the first Science and Technology Agency of its kind in the world, bringing together science and engineering capabilities across the Home Team Departments to transform Singapore's homeland security landscape. We are a statutory board under the Ministry of Home Affairs, dedicated to developing cutting-edge technologies that empower our Home Team to solve crimes, save lives, secure borders, and safeguard public spaces.

As the MLOps/SRE Engineer for HTX's developer experience squad, you will be responsible for deploying, operating, and optimizing a production LLM system in our secure infrastructure. You will ensure the agentic code assistant is reliable, performant, and cost-effective, managing the full stack from LLM inference to vector databases, orchestration services, and observability. This role combines deep MLOps expertise with SRE discipline to support the Home Team's critical AI infrastructure.

What You Will Be Working On

LLM Deployment: Deploy and manage LLMs using vLLM/TensorRT-LLM on our GPU infrastructure, optimizing for throughput, latency, and GPU utilization
Infrastructure Management: Provision and maintain the supporting infrastructure, including vector databases (for RAG), orchestration services, Redis/queue systems, and API gateways
Performance Optimization: Profile and tune LLM inference performance; experiment with batching strategies, context caching, and quantization techniques to maximize throughput within GPU constraints
Observability: Implement comprehensive monitoring using Prometheus, Grafana, DCGM exporters, and Elastic Stack to track inference latency, token throughput, cache hit rates, and system health
Reliability Engineering: Establish SLOs/SLIs, implement auto-scaling policies, design failure recovery mechanisms, and conduct chaos engineering to ensure high uptime
Cost Optimization: Monitor GPU utilization and inference costs, identify optimization opportunities, and implement strategies to reduce token usage and compute spend
Security & Compliance: Ensure all components operate within secure network boundaries, manage secrets and credentials securely, and maintain audit logs for compliance
Incident Response: Participate in on-call rotation, troubleshoot production incidents, conduct root cause analysis, and implement preventive measures
Capacity Planning: Model future load, forecast GPU requirements, and work with infrastructure teams to scale the platform as adoption grows

What We Are Looking For

4+ years of experience in MLOps, SRE, or DevOps roles, with at least 1 year working with ML/AI systems
Hands-on experience deploying and operating LLMs in production (vLLM, TGI, TensorRT-LLM, or similar)
Strong Kubernetes expertise including operators, StatefulSets, and GPU scheduling
Deep understanding of GPU architecture and inference optimization techniques
Experience with observability tools (Prometheus, Grafana, ELK/Elastic Stack)
Solid Python and Bash scripting skills for automation
Knowledge of vector databases (Milvus, Weaviate, Qdrant, or Pinecone)
Experience with infrastructure-as-code (Terraform, Helm, Kustomize)
Experience with NVIDIA GPUs (A100/H100/B200) and DCGM monitoring
Understanding of LLM inference concepts: KV cache, continuous batching, PagedAttention
Familiarity with Ray clusters, Kubeflow, or MLflow
Background in SRE practices: SLO/SLI definition, error budgets, incident management
Experience with secure or regulated environments
Knowledge of LiteLLM, Kong Gateway, or API management platforms

Competencies:
Systems thinking with ability to diagnose complex issues across the ML stack
Data-driven decision making using metrics and telemetry
Proactive mindset focused on reliability, automation, and preventive measures
Strong debugging skills for GPU, networking, and distributed systems issues
Clear incident communication and documentation
Collaborative approach to working with data scientists, ML engineers, and platform teams

All new hires are appointed on a two-year contract in the first instance and will be assessed and considered for permanent tenure over time, based on performance.

As part of the shortlisting process for this role, you may be required to complete a medical declaration and/or undergo further assessment.


Lead Engineer/ Engineer, MLOps / SRE (Developer Experience), xCloud

HTX
