Job Description

AI / ML Engineer – SLM & RAG Specialist

Location: Trivandrum (Kerala)

Company: NeoITO

Experience: 5+ Years

About the Role

NeoITO is hiring an AI / ML Engineer to build and own an AI-powered Proposal & RFP generation system designed to transform meeting notes into structured, client-ready proposals within minutes.

You will be responsible for designing and managing the core AI layer, including the inference engine, RAG pipeline, embedding models, and compliance validation system.

You will collaborate closely with backend (Node.js) and frontend (React) engineers to deliver a production-ready AI system within a defined delivery timeline.

Key Responsibilities

Model Deployment & Inference

Deploy and manage Small Language Models (SLMs) on on-premise GPU infrastructure.
Configure and optimize LLM inference pipelines using frameworks such as vLLM or HuggingFace Transformers.
Implement token streaming, continuous batching, and optimized sampling strategies for reliable text generation.
Apply quantization techniques (GPTQ/AWQ) to reduce GPU memory footprint while maintaining inference performance.
Monitor GPU health and performance metrics, including VRAM usage, latency, and throughput.

Retrieval-Augmented Generation (RAG)

Design and implement RAG pipelines to enable context-aware proposal generation.
Build text chunking pipelines and generate embeddings using sentence-transformer models.
Store and retrieve vector embeddings using PostgreSQL with pgvector.
Implement semantic similarity search to retrieve relevant historical proposal data.
Continuously evaluate and optimize retrieval quality and performance.
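
For candidates who want a feel for the retrieval work, the chunk-and-retrieve flow above can be sketched roughly as follows. This is an illustration only, not NeoITO's implementation: `chunk_text`, `embed`, `cosine`, and `top_k` are hypothetical names, the word-count `embed` function is a toy stand-in for a sentence-transformer model, and the in-memory list stands in for a pgvector table.

```python
import math
from collections import Counter


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: a sparse word-count vector.

    Assumption: a real pipeline would call a sentence-transformer model here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]
```

In production the ranking step would normally be pushed down into the database rather than done in Python, which is why PostgreSQL with pgvector appears in the stack.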

AI-Driven Proposal Generation

Design structured pipelines to generate multi-section proposals including:
Executive Summary
Project Scope
Technical Approach
Implementation Timeline
Investment Summary
Risk Mitigation
Create section-specific prompts and templates for high-quality generation.
Implement real-time streaming responses to backend services.
Support partial regeneration of sections for iterative proposal refinement.
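
Partial regeneration, as described above, could look something like the sketch below. It is only one possible shape under stated assumptions: `generate_proposal`, `regenerate_sections`, and the `generate_section` callable are hypothetical names, and the callable stands in for whatever model call actually produces a section.

```python
from typing import Callable, Dict, Iterable

# Section names taken from the list above.
SECTION_ORDER = [
    "Executive Summary",
    "Project Scope",
    "Technical Approach",
    "Implementation Timeline",
    "Investment Summary",
    "Risk Mitigation",
]

# Hypothetical signature: (section_name, meeting_notes) -> generated text.
SectionGenerator = Callable[[str, str], str]


def generate_proposal(notes: str, generate_section: SectionGenerator) -> Dict[str, str]:
    """Generate every section once, in the fixed order."""
    return {name: generate_section(name, notes) for name in SECTION_ORDER}


def regenerate_sections(
    proposal: Dict[str, str],
    notes: str,
    targets: Iterable[str],
    generate_section: SectionGenerator,
) -> Dict[str, str]:
    """Regenerate only the requested sections, leaving the rest untouched."""
    targets = list(targets)
    unknown = set(targets) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"Unknown sections: {sorted(unknown)}")
    updated = dict(proposal)  # copy so the caller's draft is not mutated
    for name in targets:
        updated[name] = generate_section(name, notes)
    return updated
```

Returning a new dictionary instead of mutating the draft keeps the previous version available for comparison during iterative refinement.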

AI Quality, Validation & Compliance

Develop a validation engine to ensure generated content meets compliance and quality standards.
Implement rule-based checks including:
Client name verification
Budget reference validation
Section completeness
Sensitive data detection
Support an optional AI-based review layer for deeper quality checks.
Deliver structured feedback and annotations for use within editing workflows.
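
To make the rule-based checks above more concrete, here is a minimal sketch of what such a validator might look like. Everything in it is an assumption made for illustration: the `Finding` structure, the `validate_proposal` function, the exact-string budget match, and the use of an email pattern as the only sensitive-data rule are hypothetical simplifications, not the actual validation engine.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Section names taken from the proposal structure listed earlier in this posting.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Project Scope",
    "Technical Approach",
    "Implementation Timeline",
    "Investment Summary",
    "Risk Mitigation",
]

# Assumption: email addresses are treated as one example of sensitive data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Finding:
    """One structured annotation that an editing workflow could display."""
    rule: str
    message: str


def validate_proposal(sections: dict[str, str], client_name: str, approved_budget: str) -> list[Finding]:
    """Run simple rule-based checks and return the findings in a fixed order."""
    findings: list[Finding] = []
    full_text = "\n".join(sections.values())

    # Section completeness: every required section exists and is non-empty.
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            findings.append(Finding("section_completeness", f"Missing or empty section: {name}"))

    # Client name verification: the name must appear somewhere in the proposal.
    if client_name.lower() not in full_text.lower():
        findings.append(Finding("client_name", f"Client name not found: {client_name}"))

    # Budget reference validation: the approved figure must appear verbatim.
    if approved_budget not in sections.get("Investment Summary", ""):
        findings.append(Finding("budget_reference", f"Approved budget not referenced: {approved_budget}"))

    # Sensitive data detection: flag anything that looks like an email address.
    for match in EMAIL_RE.finditer(full_text):
        findings.append(Finding("sensitive_data", f"Possible email address: {match.group()}"))

    return findings
```

An empty result means every rule passed. Each `Finding` carries the rule name, so downstream tools can group annotations by check.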

Prompt Engineering & Model Optimization

Design and maintain structured prompts for classification, generation, and validation tasks.
Conduct iterative prompt optimization to improve accuracy, tone, and consistency.
Maintain prompt versioning and regression testing frameworks.
Evaluate output quality through structured human evaluation metrics.

Fine-Tuning & Model Improvement

Lead fine-tuning initiatives to improve model performance over time.
Prepare and curate training datasets from finalized proposals.
Implement LoRA / QLoRA fine-tuning strategies for efficient model updates.
Track experiments and model versions using tools such as MLflow.

Collaboration & Engineering Practices

Expose AI capabilities via FastAPI services consumed by backend applications.
Collaborate with backend teams on job orchestration, queue processing, and event streaming.
Implement unit tests and quality checks for ML pipelines.
Contribute to containerized deployment environments using Docker.
Support CI/CD pipelines with automated testing and linting workflows.

Required Skills & Experience

Large Language Models & AI Systems

Hands-on experience with LLMs or SLMs
Experience deploying models using vLLM, HuggingFace Transformers, or similar frameworks
Knowledge of quantization techniques and inference optimization

RAG & Vector Search

Experience building Retrieval-Augmented Generation pipelines
Knowledge of vector databases such as pgvector, FAISS, or similar
Familiarity with embedding models and semantic search

Programming & Frameworks

Strong Python development experience
Experience with FastAPI, Pydantic, and PyTorch
Knowledge of libraries such as sentence-transformers, LangChain, or LlamaIndex

Infrastructure & GPU Systems

Experience working with GPU-based model deployment
Familiarity with CUDA environments and GPU monitoring
Experience deploying applications with Docker on Linux environments

Databases & Storage

Experience with PostgreSQL
Familiarity with vector extensions or vector search databases
Knowledge of object storage solutions such as S3 or MinIO

MLOps & Model Lifecycle

Experience with LoRA / QLoRA fine-tuning
Familiarity with experiment tracking tools
Knowledge of dataset preparation and model evaluation

Nice to Have

Experience working with Meta Llama models
Familiarity with document generation systems
Experience with queue-based ML pipelines
Exposure to secure enterprise environments requiring strict data governance
Knowledge of observability tools such as Prometheus

In this role, you will:

Deliver a fully functional AI proposal generation system running entirely on-premise
Achieve high-quality, structured proposal outputs
Ensure stable performance under concurrent usage
Establish a foundation for continuous model improvement through fine-tuning

Tech Stack

Primary Language: Python

API Framework: FastAPI

LLM Inference: vLLM / Transformers

Embedding Models: Sentence Transformers

Vector Database: PostgreSQL + pgvector

GPU Infrastructure: NVIDIA GPU environments

Containerization: Docker

Monitoring: Prometheus

Testing: Pytest