Building a High-Quality ASR Training Data Pipeline with Multiple Models — Design Philosophy and Implementation of Whisper × Qwen3 × LLM

What you'll learn: The design philosophy and implementation details of a two-stage pipeline that combines Whisper, Qwen3-ASR, and LLMs to generate high-quality ASR training data that no single model can achieve alone. Topics include Forced Alignment, diff-based merging, and chunking strategies — explained from theory to practice.

Data quality is the biggest bottleneck in ASR model fine-tuning. Training on raw automatic transcriptions as-is introduces noise such as transcription errors, timestamp drift, and excessive filler words, making training unstable.

This article details the design and implementation of a pipeline that links three models — Whisper, Qwen3-ASR, and an LLM — so that each compensates for the others' weaknesses, yielding high-quality ASR training data. The full two-stage architecture, including the downstream chunk-splitting script, is covered to share practical know-how for ASR dataset creation.


Why a Single Model Is Not Enough

Whisper (OpenAI) excels at robust multilingual transcription, and is particularly strong in kanji conversion accuracy and proper noun rendering. For audio such as news broadcasts or domain-heavy speech, Whisper's kanji output tends to be more accurate than that of other models. However, it is prone to drift (repetitive, looping generation) on long-form audio, which can hurt overall CER.

Qwen3-ASR, with its LLM-based architecture, tends to achieve lower overall CER than Whisper, but its input length constraints make it difficult to handle long audio on its own, and it struggles with joining sentences that span chunk boundaries.

Ensembling these two models and using an LLM to arbitrate their differences is the core design philosophy of this pipeline.


Pipeline Overview

The pipeline is divided into two main scripts.

text
[Step 1] High-Accuracy Transcription + Integration
  Audio file
    → Whisper transcription (text + timestamps)
    → Qwen3-ASR transcription (chunking strategy)
    → LLM diff merge (unified text generation)
    → Forced Aligner (word-level timestamps)
    → JSON output

[Step 2] Chunk Splitting and Quality Filtering
  JSON (merged text + alignment)
    → Sentence-level segmentation
    → 20–30 second chunking
    → ffmpeg audio extraction
    → Whisper quality check (CER calculation)
    → manifest.jsonl output

Step 1: High-Accuracy Transcription + Integration

Whisper Transcription

Running Whisper via the Hugging Face Transformers pipeline is the standard approach. With openai/whisper-large-v3-turbo, decoding parameters such as temperature, num_beams, and repetition_penalty are tuned to suppress drift.
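
A minimal sketch of this step is shown below, using the standard Transformers ASR pipeline; the specific parameter values and file name are illustrative, not the pipeline's exact settings.

python
# Minimal sketch of the Whisper step (values are illustrative).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",
)

result = asr(
    "episode001.mp3",
    chunk_length_s=30,        # long-form audio processed in 30 s windows
    return_timestamps=True,   # segment-level (start, end) pairs
    generate_kwargs={
        "language": "ja",
        "temperature": 0.0,          # greedy decoding reduces drift
        "num_beams": 1,
        "repetition_penalty": 1.1,   # discourages repeated n-grams
    },
)
# result["chunks"] → [{"text": ..., "timestamp": (start, end)}, ...]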

Whisper's greatest strength lies in kanji and proper noun conversion accuracy. Its end-to-end encoder-decoder architecture, trained on large-scale multilingual corpora, tends to produce more accurate kanji representations for place names, personal names, and technical terms compared to other models. That said, overall CER often lags behind LLM-based Qwen3-ASR, making it insufficient as training data on its own in many scenarios.

Output consists of text paired with segment-level timestamps (start_time / end_time). These timestamps serve as the reference for subsequent Qwen3 chunking.

Qwen3-ASR Chunking Strategy

Because Qwen3-ASR is LLM-based, it has input length constraints. Naively cutting at fixed lengths causes chunks to end mid-sentence, losing surrounding context and degrading accuracy.

This is solved by Whisper-timestamp-aligned chunking:

text
Approach:
1. Respect segment boundaries (near sentence ends) from Whisper output
2. Group multiple segments to approach the target chunk duration (default: 25s)
3. Acceptable range: 70%–130% of target (17.5–32.5s)
4. Fall back to fixed-length splitting if no boundary is found within range

This strategy avoids mid-sentence chunk breaks while maximizing Qwen3's contextual understanding capability.

Key insight: Since Qwen3 struggles with joining sentences across chunk boundaries, aligning chunk boundaries to "semantically safe breaks" directly improves accuracy. Whisper's segment boundaries serve as a practical approximation for this.
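
As a rough illustration of this strategy, the sketch below greedily groups Whisper segments toward the target duration. The names and the greedy boundary choice are illustrative; the real logic prefers the boundary closest to the target within the acceptance window and falls back to fixed-length splits past the upper cap.

python
# Rough sketch of Whisper-timestamp-aligned chunking (illustrative).
def chunk_by_whisper_segments(segments, target_sec=25.0):
    """segments: [{"start": s, "end": e, "text": t}, ...] from Whisper."""
    # Acceptance window is 70%–130% of the target (17.5–32.5 s at 25 s).
    chunks, current = [], []
    for seg in segments:
        current.append(seg)
        duration = current[-1]["end"] - current[0]["start"]
        # Greedily close the chunk at the first boundary past the target;
        # chunks still too short at the end would be merged upstream.
        if duration >= target_sec:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks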

LLM-Based Diff Merge

The transcription results from Whisper and Qwen3 are passed to an LLM in diff format, which determines which expression to adopt.

Diff notation rules:

text
Plain text         → Both models agree (adopted as-is)
(w: text)          → Whisper only
(q: text)          → Qwen3 only
(w: text|q: text)  → Both models differ (substitution candidate)
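
For illustration, this notation can be produced from character-level opcodes via difflib; the sketch below is illustrative (whether the actual pipeline diffs at character or word granularity is not specified here).

python
# Produce the diff notation from two transcriptions (illustrative sketch).
import difflib

def make_diff(whisper_text: str, qwen_text: str) -> str:
    sm = difflib.SequenceMatcher(None, whisper_text, qwen_text)
    parts = []
    for op, w0, w1, q0, q1 in sm.get_opcodes():
        w, q = whisper_text[w0:w1], qwen_text[q0:q1]
        if op == "equal":
            parts.append(w)                   # both models agree
        elif op == "delete":
            parts.append(f"(w: {w})")         # Whisper only
        elif op == "insert":
            parts.append(f"(q: {q})")         # Qwen3 only
        else:
            parts.append(f"(w: {w}|q: {q})")  # substitution candidate
    return "".join(parts)

print(make_diff("こんにちは、今日は", "こんにちわ、今日は"))
# → こんにち(w: は|q: わ)、今日は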

Example LLM instructions (summary):

text
Read the following diff-format transcription and output the most natural,
accurate Japanese transcription.
- No need to explain selection rationale; output final text only.
- Retain or remove fillers (uh, um) based on context.
- Add punctuation (。、) at natural positions.
- Prioritize whichever model is more accurate for proper nouns.

Long texts are split into chunks (default: 5,000 characters) before being sent to the LLM, with parallel processing for speed when using Gemini.
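
A minimal sketch of the chunked, parallel merge call, assuming a caller-supplied call_llm function; the function name, worker count, and naive splitting are illustrative.

python
# Chunked, parallel LLM merge (call_llm and max_workers are illustrative).
from concurrent.futures import ThreadPoolExecutor

CHUNK_CHARS = 5000  # default chunk size from the text above

def merge_diff_text(diff_text, call_llm):
    # NOTE: a real implementation should avoid splitting inside a
    # (w: ...|q: ...) marker; fixed slicing here is for brevity.
    chunks = [diff_text[i:i + CHUNK_CHARS]
              for i in range(0, len(diff_text), CHUNK_CHARS)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        merged = pool.map(call_llm, chunks)  # results come back in order
    return "".join(merged)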

Warning: Since the LLM is a generative AI, it may occasionally add content not present in the transcription or summarize sentences. Explicitly state "output transcription text only" and "no summarization or paraphrasing" in the prompt, and verify quality via CER after generation.

Forced Alignment

The merged text produced by the LLM loses its connection to the original audio timestamps. Forced Alignment is the technique of treating the text as fixed and mapping each word or character to its timestamp in the audio.

text
Standard ASR:     Audio → Text (with timestamps)
Forced Alignment: Audio + Text → Timestamps

This pipeline uses the Qwen Forced Aligner to obtain word-level timestamps. A fallback strategy is also in place for OOM (out-of-memory) situations.

text
Fallback priority:
1. Forced Aligner available → use it (highest accuracy)
2. OOM occurs → predict from Qwen3 timestamps via linear interpolation
3. Aligner not used → use Qwen3 timestamps directly

The predicted alignment uses difflib.SequenceMatcher to match the source text (Qwen3 transcription) to the target text (merged), then assigns timestamps via linear interpolation over character positions. Accuracy decreases, but this practical design allows processing to continue even on long audio with frequent GPU OOM.
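
The fallback can be sketched as follows; the character-level granularity, the helper's internals, and the output shape are assumptions for illustration.

python
# Sketch of the linear-interpolation fallback (illustrative).
import difflib

def predict_alignment_from_qwen(qwen_items, merged_text):
    """qwen_items: [{"text", "start_time", "end_time"}, ...] from Qwen3."""
    source = "".join(it["text"] for it in qwen_items)
    # Spread each item's time range linearly over its characters.
    char_times = []
    for it in qwen_items:
        n = max(len(it["text"]), 1)
        step = (it["end_time"] - it["start_time"]) / n
        char_times += [it["start_time"] + k * step
                       for k in range(len(it["text"]))]
    if not char_times:
        return []

    aligned = []
    for op, s0, s1, m0, m1 in difflib.SequenceMatcher(
            None, source, merged_text).get_opcodes():
        if m1 == m0:
            continue  # nothing on the merged side to stamp
        start = char_times[min(s0, len(char_times) - 1)]
        end = char_times[s1 - 1] if s1 > s0 else start
        # Interpolate linearly over the merged characters in this span.
        for j in range(m1 - m0):
            t = start + (j / (m1 - m0)) * (end - start)
            aligned.append({"text": merged_text[m0 + j], "start_time": t})
    return aligned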

Step 1 Output JSON Structure

json
{
  "audio_path": "/path/to/audio.mp3",
  "status": "ok",
  "qwen_whisper_diff": "(w: こんにちは|q: こんにちわ)、今日は...",
  "metadata": {
    "audio_duration_sec": 123.456,
    "cer": {
      "pairwise": {
        "merged_vs_whisper": 0.04,
        "merged_vs_qwen": 0.03,
        "qwen_vs_whisper": 0.06
      }
    }
  },
  "whisper": { "text": "...", "timestamps": [] },
  "qwen":    { "text": "...", "timestamps": [] },
  "merged":  { "text": "...", "provider": "openai", "model": "gpt-4o-mini" },
  "alignment": {
    "items": [{ "text": "こんにちは", "start_time": 0.0, "end_time": 0.8 }],
    "source": "qwen_forced_alignment_merged"
  }
}

The cer.pairwise values reveal the degree of disagreement between all three models at a glance. If merged_vs_whisper is high and merged_vs_qwen is low, the LLM leaned toward Qwen3's judgment.


Step 2: Chunk Splitting and Quality Filtering

Sentence-Level Segmentation

alignment.items provides word-level timestamps, which are difficult to chunk directly. The first step is to restructure them into sentence-level segments.

Segments are split at sentence-ending punctuation (。!??!). If a single sentence exceeds the max duration, it is soft-split at clause-level punctuation (、,), and if still too long, force-split at the character level. Timestamps for each split are assigned via character-count-proportional linear interpolation.

python
# Conceptual illustration of proportional timestamp assignment:
# each character's time is interpolated between the span's start/end.
total_chars = max(len(text), 1)  # guard against empty text
char_times = []
for i, char in enumerate(text):
    ratio = i / total_chars
    char_times.append(start + ratio * (end - start))

Chunking (20–30 seconds)

ASR models (especially Whisper-based ones) are designed for audio under 30 seconds, so training data must be aligned to the same length.

Chunk split scoring:

text
Score = abs(chunk_duration - target_sec)   # deviation from target duration
Ends with sentence-final punctuation (。!?) → -0.35 bonus
Large gap before next segment → -0.1/sec bonus (up to 1 sec)

Selecting the split point with the lowest score prioritizes natural sentence endings at lengths close to the target. Chunks that are too short are merged with neighbors; chunks that are too long are recursively split at punctuation.

Design elegance: Rather than simply "cutting every N seconds," combining sentence-end scores with gap scores avoids unnatural mid-sentence breaks — which directly impacts training data quality.
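
A compact sketch of the scoring rule above; the constants follow the listed bonuses, while the function signature and the exact punctuation set are illustrative.

python
# Split-point scoring: lower is better (illustrative sketch).
def split_score(chunk_duration, chunk_last_text, gap_to_next_sec,
                target_sec=25.0):
    score = abs(chunk_duration - target_sec)        # deviation from target
    if chunk_last_text.endswith(("。", "!", "?", "!", "?")):
        score -= 0.35                               # sentence-final bonus
    score -= 0.1 * min(gap_to_next_sec, 1.0)        # gap bonus, capped at 1 s
    return score

# Candidate boundaries are compared and the lowest score wins.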

Moving Leading Punctuation to the Previous Chunk

If a chunk starts with punctuation (e.g., 。おはよう), that punctuation is moved to the end of the previous chunk.

text
Before:
  Chunk 1: "こんにちは"
  Chunk 2: "。今日もいい天気ですね"

After:
  Chunk 1: "こんにちは。"
  Chunk 2: "今日もいい天気ですね"

This may seem like a minor detail, but it impacts both the naturalness of the text side and audio extraction accuracy (since the audio typically begins immediately after the punctuation).
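
A minimal sketch of this fix, assuming chunks are simple dicts with a "text" field; the punctuation set is illustrative.

python
# Move leading punctuation to the end of the previous chunk (sketch).
PUNCT = ("。", "、", "!", "?", "!", "?")

def move_leading_punct(chunks):
    for prev, curr in zip(chunks, chunks[1:]):
        while curr["text"] and curr["text"][0] in PUNCT:
            prev["text"] += curr["text"][0]   # append to previous chunk
            curr["text"] = curr["text"][1:]   # drop from current chunk
    return chunks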

Audio Extraction and Whisper Quality Check

Each chunk is extracted with ffmpeg:

bash
ffmpeg -y -ss {start} -i {source} -t {duration} \
       -ac 1 -ar 16000 {output.wav}

Mono conversion (-ac 1) and 16kHz sampling (-ar 16000) match the input specifications of Whisper-based models.

The extracted audio is re-transcribed by Whisper and CER is computed against the merged text. Chunks with high CER are suspected of timestamp drift or forced alignment failures and can be filtered in post-processing.

Text normalization for CER computation:

python
import unicodedata

def normalize_for_cer(text):
    # Unicode NFKC normalization (full-width/half-width unification)
    text = unicodedata.normalize("NFKC", text)
    # Keep only Letter/Number category characters
    # → covers kanji, hiragana, katakana, alphanumerics
    return "".join(c for c in text if unicodedata.category(c)[0] in ("L", "N"))

Note: Punctuation and symbols are excluded from CER computation to prevent score fluctuation based on punctuation presence. However, punctuation quality should still be verified separately.
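
For reference, a self-contained CER over the normalized strings can be computed with a standard Levenshtein dynamic program; this is a sketch, not necessarily the pipeline's implementation (a dedicated library may be used instead).

python
# Character error rate via Levenshtein distance (illustrative sketch).
def cer(ref, hyp):
    ref, hyp = normalize_for_cer(ref), normalize_for_cer(hyp)
    prev = list(range(len(hyp) + 1))      # DP row for the empty ref prefix
    for i, rc in enumerate(ref, 1):
        curr = [i]
        for j, hc in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (rc != hc), # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)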

manifest.jsonl Structure

json
{
  "sample_name": "dataset_0001",
  "audio_path": "/output/audio/dataset_0001.wav",
  "text": "今日はとてもいい天気ですね。散歩日和です。",
  "start_time": 12.5,
  "end_time": 37.6,
  "duration_sec": 25.1,
  "source_audio_path": "/input/episode001.mp3",
  "source_json_path": "/input/episode001.json",
  "cer": 0.021,
  "whisper_text": "今日はとてもいい天気ですね。散歩日和です。",
  "error_metric": "cer"
}

Key Design Decisions in Implementation

Memory Management

GPU OOM is unavoidable in long-form audio processing. This pipeline detects OOM at each stage and automatically switches to a lightweight fallback.

python
try:
    result = forced_aligner.align(audio, text)
except torch.cuda.OutOfMemoryError:
    _clear_torch_memory()  # free cached VRAM before falling back
    result = predict_alignment_from_qwen(qwen_timestamps, merged_text)

Combining torch.cuda.empty_cache() with explicit model object deletion (del model) prevents VRAM leaks.
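
The helper referenced above is not shown in the article; a minimal version, assuming the standard gc plus torch.cuda combination, might look like this. Note that del only drops the caller's own reference, so it has to happen before the call.

python
# Minimal sketch of the memory-clearing helper (assumed implementation).
import gc
import torch

def _clear_torch_memory():
    gc.collect()                      # reclaim unreferenced Python objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached blocks to the driver

# Callers drop their own references first, e.g.:
#   del model
#   _clear_torch_memory()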

Text Normalization

Common noise in Japanese ASR data:

  • Full-width spaces ( ) → normalized to half-width
  • Ellipsis (…) → removed
  • Consecutive whitespace → collapsed to single space
  • Typical hallucinations ("Thank you for watching", etc.) → removed via exclusion list

In particular, boilerplate exclusion is a critical filter to prevent repeated text from video or program endings from contaminating training data.
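
Putting these rules together, a sketch of the normalization pass might look like the following; the hallucination list shown is a tiny illustrative sample, not the pipeline's actual exclusion list.

python
import re

# Illustrative sample only — the real exclusion list is larger.
HALLUCINATION_PATTERNS = ["ご視聴ありがとうございました"]  # "Thank you for watching"

def normalize_text(text):
    text = text.replace("\u3000", " ")        # full-width space → half-width
    text = text.replace("…", "")              # drop ellipses
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    for pat in HALLUCINATION_PATTERNS:
        text = text.replace(pat, "")          # strip boilerplate endings
    return text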


Broader Applications of This Architecture

The design philosophy of this pipeline — using an LLM to arbitrate differences between multiple models — extends well beyond ASR data creation.

  • Multilingual ASR: Change the language hint (--language zh, etc.) to support Chinese, English, and more
  • Speaker Diarization: Attach speaker IDs to alignment.items to repurpose as speaker separation training data
  • Low-resource Languages: Expand coverage by mixing models based on which language each handles best

From firsthand experience: The accuracy of the Forced Aligner depends heavily on the model and audio domain. Alignment reliability varies significantly between drama audio (BGM + sound effects) and meeting audio (close-mic). It is strongly recommended to inspect CER distributions as histograms and exclude outlier chunks from training.


FAQ

Q1. Which is more accurate, Whisper or Qwen3-ASR?

Qwen3-ASR tends to achieve lower overall CER (see this benchmark article). However, Whisper has an edge in kanji conversion and proper noun accuracy, and its output is more likely to be adopted as the final result when domain-specific terms, names, or place names are frequent. Since each model excels in different areas, files with higher qwen_vs_whisper CER benefit more from the LLM merge step.

Q2. How much does the LLM API cost?

When using Gemini Flash, costs came to roughly ¥1,000–2,000 for a few hours of audio. If cost is a concern, running an open-source LLM locally is a practical alternative.

Q3. Under what conditions does Forced Alignment tend to fail?

Primarily three cases: ① Audio with loud BGM or sound effects (music programs, dramas, etc.); ② Audio with extremely fast or slow speech; ③ Audio where the merged text diverges significantly from Qwen3's transcription (when the LLM made substantial rewrites). When using the linear interpolation fallback, timestamp errors of a few hundred milliseconds can occur.

Q4. Why is 25 seconds the target chunk duration?

The primary reason is Whisper's 30-second maximum input length. Targeting 25 seconds keeps chunks comfortably under that limit, so almost none are skipped (those longer than about 29 seconds) during the Whisper re-transcription quality check. Conversely, chunks that are too short yield many sentence fragments, which is disadvantageous for training the model on long-range dependencies.

Q5. Can this be used for languages other than Japanese?

Changing the --language argument switches the target language. However, the sentence segmentation logic (SENTENCE_END_RE) assumes punctuation used in Japanese, Chinese, English, and similar languages. For scripts with their own sentence-ending conventions, such as the Arabic question mark (؟) or the Devanagari danda (।), the regular expressions will need to be customized.
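
For illustration, the pattern's shape might resemble the following; the pipeline's actual SENTENCE_END_RE may differ.

python
import re

# Illustrative only; covers Japanese/Chinese/English sentence enders.
SENTENCE_END_RE = re.compile(r"[。.!?!?]")
# For Hindi, one might add the danda: r"[。.!?!?।]"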

Q6. How much GPU VRAM is required?

Approximate requirements: ~6 GB for Whisper large-v3-turbo, 4–16 GB for Qwen3-ASR (depending on model size), and 2–4 GB for the Forced Aligner. The pipeline loads and unloads models stage by stage rather than running everything simultaneously, so sequential processing on a single GPU is feasible. That said, 24 GB of VRAM provides comfortable headroom.


Summary

Key points of this two-stage pipeline:

| Phase | Technology | Problem Solved |
| --- | --- | --- |
| Whisper transcription | Whisper large-v3-turbo | Stable long-form transcription; timestamp reference |
| Qwen3 transcription | Whisper-boundary-aligned chunking | Sentence-boundary alignment; LLM context utilization |
| LLM merge | Diff format + LLM arbitration | Resolves inter-model disagreements into the best output |
| Forced alignment | Qwen Forced Aligner + fallback | Word-level timestamp assignment |
| Chunking | Score-based sentence-end splitting | Natural 20–30 second chunks |
| Quality filtering | Whisper re-transcription + CER | Detects and excludes low-quality chunks |

The essence of this pipeline is breaking through the accuracy ceiling that single-model data creation inevitably hits — through multi-model ensembling and LLM arbitration. We hope this serves as a useful reference for anyone struggling with ASR fine-tuning dataset quality.
