Japanese speech recognition is more challenging than in many other languages because of the wide variety of kanji and proper noun representations. This benchmark article implements a morphological-analysis-based F1-score evaluation using Sudachi to quantify the kanji and proper-noun recognition accuracy that WER and CER struggle to capture, and re-evaluates nine open-source ASR models.
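As a concrete illustration of the metric, here is a minimal sketch assuming SudachiPy (`sudachipy` plus `sudachidict_core`). Split mode C and the multiset-F1 formulation are assumptions for illustration, not necessarily the article's exact setup.

```python
from collections import Counter

from sudachipy import dictionary, tokenizer

# Split mode C yields the longest units, which keeps proper nouns
# (company names, service names) as single tokens when the dictionary knows them.
_tok = dictionary.Dictionary().create()
_MODE = tokenizer.Tokenizer.SplitMode.C


def morphemes(text: str) -> Counter:
    """Tokenize with Sudachi and return a multiset of surface forms."""
    return Counter(m.surface() for m in _tok.tokenize(text, _MODE))


def morpheme_f1(reference: str, hypothesis: str) -> float:
    """F1 over morpheme multisets: rewards producing the right words
    (kanji spellings, proper nouns), which character-level CER blurs."""
    ref, hyp = morphemes(reference), morphemes(hypothesis)
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A hypothesis that garbles one proper noun but keeps everything else loses exactly one morpheme of overlap, so the score degrades word by word rather than character by character.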
A deep dive into a two-stage pipeline that merges Whisper and Qwen3-ASR transcriptions via LLM diff-merging to generate ASR training data surpassing what either model achieves alone. The pipeline combines Whisper's strength on kanji and proper nouns with Qwen3's lower overall CER, then applies forced alignment to assign word-level timestamps. The downstream chunk-splitting stage uses sentence-end scoring to automatically produce natural 20–30 second chunks, followed by Whisper re-transcription for CER-based quality filtering.
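To make the chunking and filtering stages concrete, here is a minimal sketch under assumed data structures: `Word` carries the timestamps produced by forced alignment, `boundary_score` is a hypothetical stand-in for the article's sentence-end scoring rules, and `cer` is the standard character error rate used by the downstream filter.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment
    end: float


def boundary_score(word: Word) -> float:
    # Hypothetical scoring: sentence-final punctuation ranks highest,
    # weaker clause markers rank lower; the article's actual rules may differ.
    if word.text.endswith(("。", "？", "！")):
        return 2.0
    if word.text.endswith("、"):
        return 0.5
    return 0.0


def split_chunks(words: list[Word], min_len: float = 20.0, max_len: float = 30.0):
    """Greedily cut at the best-scoring word whose end time falls
    20-30 s after the current chunk start."""
    chunks, i = [], 0
    while i < len(words):
        t0 = words[i].start
        window = [j for j in range(i, len(words))
                  if min_len <= words[j].end - t0 <= max_len]
        if not window:  # no candidate boundary in range: emit the remainder
            chunks.append(words[i:])
            break
        cut = max(window, key=lambda j: boundary_score(words[j]))
        chunks.append(words[i:cut + 1])
        i = cut + 1
    return chunks


def cer(ref: str, hyp: str) -> float:
    """Character error rate = Levenshtein distance / len(ref)."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # delete rc
                           cur[j - 1] + 1,             # insert hc
                           prev[j - 1] + (rc != hc)))  # substitute
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

In the filtering step, each chunk's merged text would be compared against its Whisper re-transcription and dropped when `cer` exceeds some threshold; the article's actual cutoff is not reproduced here.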
This article presents a first-hand benchmark evaluating nine ASR models on real audio data containing IT company names, service names, and abbreviations. Unlike general-purpose Japanese speech recognition benchmarks, this evaluation takes a practical perspective, asking whether models can correctly transcribe domain-specific terms like "openclaw" and "ARR."
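A minimal sketch of what such a term-level check might look like, assuming a simple normalized substring hit rate; the `term_recall` helper and NFKC normalization are illustrative, not the article's actual scoring.

```python
import unicodedata


def normalize(text: str) -> str:
    # NFKC folds full-width ASCII (e.g. ＡＲＲ -> ARR); lower-case
    # so Latin-alphabet terms match regardless of casing.
    return unicodedata.normalize("NFKC", text).lower()


def term_recall(terms: list[str], transcript: str) -> float:
    """Fraction of expected domain terms that appear in the transcript."""
    t = normalize(transcript)
    return sum(normalize(term) in t for term in terms) / len(terms)


print(term_recall(["openclaw", "ARR"], "OpenClawのＡＲＲは順調に伸びています"))  # 1.0
```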
I finetuned IBM Granite Speech (`granite-4.0-1b-speech`) on 100 hours of Japanese speech data and reduced CER from 0.37 to 0.14. The official script's Projector+LoRA-only training hits a ceiling on accuracy gains; the key breakthrough was additionally training `lm_head` and the last 8 layers of the language model. The result matches Qwen3-ASR-1.7B (CER 0.14) with only 1B parameters.
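A hedged sketch of that unfreezing step. It matches parameters by name rather than hard-coding the module tree, but the `language_model`/`lm_head` name patterns are assumptions in the HF multimodal style; verify them against `model.named_parameters()` for this checkpoint.

```python
import re

# Assumed HF-style parameter names like "language_model.model.layers.23.self_attn...".
LAYER_IDX = re.compile(r"language_model.*\.layers\.(\d+)\.")


def unfreeze_lm_top(model, num_layers: int = 8) -> None:
    """Unfreeze lm_head plus the last `num_layers` decoder layers of the LM,
    on top of the Projector+LoRA parameters the official script trains."""
    indices = sorted({int(m.group(1))
                      for name, _ in model.named_parameters()
                      if (m := LAYER_IDX.search(name))})
    top = {str(i) for i in indices[-num_layers:]}
    for name, param in model.named_parameters():
        m = LAYER_IDX.search(name)
        if "lm_head" in name or (m and m.group(1) in top):
            param.requires_grad = True


# Usage, after building the model as in the official finetuning script:
# unfreeze_lm_top(model, num_layers=8)
# trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
# print(f"trainable params: {trainable:,}")
```

Deriving the layer indices from the parameter names keeps the snippet working even if the 1B checkpoint has a different layer count than expected.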