Zalo Business Solution (ZBS) is a powerful suite of business solutions that enables enterprises to connect with customers through Zalo Official Accounts and send secure, personalized notifications, enhancing engagement and communication efficiency at a scale of tens of millions of users.
We are looking for a (Lead) Senior Software Engineer to spearhead our DevOps team. In this role, engineering excellence comes first. You will be responsible for architecting and building high-performance, mission-critical backend systems that integrate cutting-edge AI capabilities. You are someone who believes that AI is only as good as the system it runs on. Your mission is to lead a team in transforming ZBS into an AI-driven platform by applying rigorous software engineering principles to AI integration.
What you will do
- System Architecture: Lead the design and development of high-performance, scalable backend services using Java (Spring Boot) that serve as the backbone for AI features;
- AI Integration & Orchestration: Architect how AI models (LLMs, Agents) are integrated into the ZBS microservices ecosystem, focusing on reliability, low latency, and efficient resource management;
- Lead Innovation Workflows: Build internal developer platforms and AI-powered automation toolkits to optimize the engineering organization's entire development lifecycle;
- Production-Grade AI: Move AI prototypes from "it works on my machine" to "it works for millions of users." This includes owning the billing, monitoring, and scaling logic of AI services;
- Engineering Leadership: Set the bar for code quality, system design, and technical documentation. Mentor engineers on how to build robust software that leverages AI without sacrificing stability;
- Strategic AI Implementation: Collaborate with business leads to identify where AI can solve complex engineering bottlenecks or create new product value, and then execute the technical roadmap.
What you will need
- Software Mastery: 5+ years of experience in Backend Engineering, with expert-level proficiency in Java and the Spring ecosystem;
- Systems Thinking: Deep understanding of distributed systems, microservices architecture, concurrency, and high-traffic platform engineering;
- AI-Enhanced Engineering: Hands-on experience with AI tools (Cursor, Claude, etc.) and a proven track record of using them to accelerate software delivery and solve complex bugs;
- Practical AI Expertise: Solid grasp of LLM integration, RAG architectures, and Agentic workflows (Python/LangGraph/LangChain) with the ability to bridge these into a Java-based production environment;
- Infrastructure Excellence: Strong experience with Linux, Docker, Kubernetes, and building sophisticated CI/CD pipelines that handle both code and model deployments;
- Data & Performance: Proficiency in SQL/NoSQL, Redis, and experience with data systems like Kafka, Apache Spark, or Apache Doris for processing high-volume event streams;
- Ownership Mindset: Experience owning systems end-to-end, from the first line of code to production monitoring and incident response.
Nice to have
- AI Observability & Evaluation Frameworks: Experience building automated evaluation pipelines for LLMs (using RAGAS, DeepEval, or Arize Phoenix) and implementing specialized monitoring for "AI health" (hallucination rates, tokens per second, and context window efficiency);
- Self-hosting & Model Adaptation: Proven track record of self-hosting and fine-tuning state-of-the-art open-source models (DeepSeek, Llama 3, Mistral) on private infrastructure to ensure data privacy and reduce dependency on external APIs;
- Observability & Security: Hands-on experience with Prometheus, Grafana, and ELK; strong awareness of the OWASP Top 10 and data privacy.