As a Data Engineer, you will own the end-to-end speech data lifecycle that powers advanced speech and speech-to-speech models. Your primary responsibility is to build, curate, validate, and deliver high-quality datasets that enable robust speech understanding, generation, and conversational interaction.
This role is critical to ensuring that models are trained on clean, diverse, well-annotated, and model-ready data.
Key Responsibilities
Speech Data Curation
· Build datasets supporting:
o Speech recognition and understanding
o Multilingual and code-mixed speech
o Conversational and dialog-style speech
o Speech generation and synthetic voice data
o Audio-to-audio conversational scenarios
· Prepare datasets that capture:
o Speaker variation and continuity
o Emotional and expressive speech cues
o Real-world noise and acoustic conditions
o Conversational turn structure and timing
Ensemble-Based Data Curation
· Implement data curation pipelines using outputs from:
o Multiple in-house speech models
o External or open-source speech models
· Aggregate, reconcile, and validate model outputs to:
o Generate reliable annotations
o Filter low-confidence samples
o Detect inconsistencies and label noise
· Apply rule-based and confidence-driven selection strategies.
Validation & Quality Control
· Perform automated validation for:
o Audio integrity and format consistency
o Transcript alignment and correctness
o Language and speaker metadata accuracy
· Run sampling-based manual audits.
· Produce dataset quality reports and summaries.
Engineering & Operations
· Build scalable, reproducible data pipelines in Python and C++.
· Handle large audio corpora efficiently on Linux systems.
· Generate training-ready manifests and metadata.
· Maintain dataset versions, lineage, and reproducibility.
Required Skills
· 3+ years of experience in data engineering or ML data pipelines.
· Strong Python skills for large-scale data processing.
· Experience working with audio or speech datasets.
· Familiarity with annotation formats and metadata schemas.
· Knowledge of Linux, Bash, and Git workflows.