Japanese ASR Model Comparison 2026: Whisper, Qwen3, Voxtral & ReazonSpeech Benchmarked on RTX5090
::: message
Update History
2026/03/23 Added benchmark results for ibm/granite-4.0-1b-speech.
:::
What you'll learn: First-hand benchmark data comparing 9 Japanese speech recognition (ASR) models under identical conditions. We compare WER, CER, and RTF to answer "which model is best for which use case" — from real hands-on testing.
Summary
Using an RTX5090, we evaluated 9 ASR models on 20 audio clips (580 seconds total, ~10 minutes) of natural conversational speech. For overall accuracy and stability, openai/whisper-large-v3-turbo and qwen/qwen3-asr-1.7b stand clearly ahead of the pack. If speed is your top priority, nvidia/parakeet-tdt-0.6b-v3 (RTF=0.002) is the winner. For Japanese-specific domain adaptation, reazon-research/reazonspeech-espnet-v2 is a strong contender.
We also added ibm/granite-4.0-1b-speech to this round. It showed impressive accuracy on clean audio, but exhibited a clear weakness in noisy and multi-speaker environments — particularly a hallucination loop problem.
| Model | WER↓ | CER↓ | RTF↓ | Overall |
|---|---|---|---|---|
| qwen3-asr-1.7b | 0.1899 | 0.1473 | 0.034 | ⭐⭐⭐⭐⭐ |
| whisper-large-v3-turbo | 0.2099 | 0.1782 | 0.012 | ⭐⭐⭐⭐⭐ |
| reazonspeech-espnet-v2 | 0.2342 | 0.2007 | 0.589 | ⭐⭐⭐⭐ |
| voxtral-mini-4b | 0.2437 | 0.2187 | 0.207 | ⭐⭐⭐ |
| parakeet-tdt-0.6b-v3 | 0.3465 | 0.3235 | 0.002 | ⭐⭐⭐ |
| reazonspeech-nemo-v2 | 0.3546 | 0.3346 | 0.020 | ⭐⭐⭐ |
| granite-4.0-1b-speech | 0.4058 | 0.3715 | 0.061 | ⭐⭐⭐ |
| reazonspeech-k2-v2 | 0.4636 | 0.4486 | 0.026 | ⭐⭐ |
| kotoba-whisper-v2.0 | 0.5402 | 0.5050 | 0.008 | ⭐ |
Granite shows a highly polarized behavior — strong on clean audio, brittle on noise. Its average ranking is mid-tier, but it can be a viable option if your domain is well-controlled.
Test Environment & Evaluation Data
Hardware & Settings
GPU: RTX5090
temperature: 0.0
max_new_tokens: 512
language: ja
Dataset (20 clips, ~29 seconds each)
What sets this benchmark apart from surface-level roundups is that we used real media audio, not synthetic or read-speech datasets.
- News narration: Standard announcer-style delivery
- Entertainment (comedy / variety shows): Fillers, laughter, fast speech
- Drama / anime: Emotional delivery, slang (e.g., "temee", "babaa!")
- Business / reality TV: Overlapping multi-speaker segments
By including overlapping speech, laughter, and background noise, we replicated challenging, production-like scenarios.
Metrics Explained
WER (Word Error Rate)
WER = (Substitutions + Deletions + Insertions) / Reference Word Count
Lower is better. Japanese text has no word-boundary spaces, so we use MeCab (a morphological analyzer) to tokenize text before computing WER. Punctuation (、。!?) is also stripped before evaluation, since different models punctuate differently. See the "Reproducibility" section for details.
CER (Character Error Rate)
Error rate measured at the character level. Generally considered more reliable than WER for Japanese evaluation. In our results, WER and CER rankings were almost identical, confirming the consistency of both metrics.
RTF (Real-Time Factor)
RTF = Processing Time / Audio Duration
RTF < 1.0 means real-time capable. At RTF=0.002, parakeet processes 29 seconds of audio in roughly 0.06 seconds.
Per-Model Analysis
🥇 qwen/qwen3-asr-1.7b — Highest Accuracy (WER: 0.1899)
The standout feature of this model is its "completion ability."
Rather than simply transcribing sounds, it completes sentences in a way that reads naturally — with punctuation placed at natural breath points, like output from a professional transcriptionist.
Observed behavior:
- Appropriately drops fillers ("uh", "um") while preserving grammatical particles
- Only 1 of 20 clips exceeded WER 0.5; all others stayed within 0.2 — exceptional consistency
Caveats:
- Occasional "mishearing" (creative reinterpretation) of internet slang and proper nouns
- Average inference time 0.99s (RTF=0.034) — roughly 3x slower than Whisper
🥈 openai/whisper-large-v3-turbo — Reliable All-Rounder (WER: 0.2099)
Its contextual understanding is head and shoulders above the rest.
Segments where ReazonSpeech models fell silent or dropped the tail end of utterances were transcribed near-perfectly by Whisper.
Strengths:
- Drops fillers while reconstructing grammatical structure
- Holds up well under multi-speaker and noisy conditions
Weakness: Hallucination
Whisper can generate text where no speech exists. In our test, the word "私" (I/me) was appended at the end without being present in the audio. For long-form audio, a post-processing hallucination detection step is advisable.
reazon-research/reazonspeech-espnet-v2 — Japanese Domain Adaptation (WER: 0.2342)
Trained on Japanese broadcast media data, this model demonstrated the ability to correctly interpret culturally specific terms — for example, distinguishing the colloquial "Tō-yoko" (a youth hangout spot named after the Tōyoko train line) from the literal place name. This is a significant advantage for news and media transcription.
It also actively inserts exclamation (!) and question (?) marks, attempting to capture the energy and emotion of speech.
Weaknesses:
- RTF=0.589 — the slowest in this benchmark (about 17 seconds to process a 29-second clip)
- Aggressively drops fillers, making it unsuitable when filler preservation is needed
mistralai/voxtral-mini-4b-realtime-2602 — Readability-First Design (WER: 0.2437)
This model clearly prioritizes readability over verbatim transcription.
Examples:
- ぐらい → くらい (orthographic normalization)
- っていうか → というのか (spoken-to-written register conversion)
Well-suited for subtitle production and meeting minutes where clean, readable output matters more than exact reproduction.
In one file, the model suddenly output what appeared to be Arabic script. Language validation is mandatory before using this in production.
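Such language validation can be as simple as scanning the output for characters outside the scripts you expect. A minimal sketch (the function names and the 5% threshold are our illustrative choices, not part of any model's API):

```python
import unicodedata

def unexpected_script_ratio(text: str) -> float:
    """Fraction of letter characters outside the scripts expected in
    Japanese ASR output (kana, kanji, Latin)."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    unexpected = 0
    for ch in letters:
        name = unicodedata.name(ch, "")
        expected = (
            "HIRAGANA" in name
            or "KATAKANA" in name
            or "CJK" in name
            or "LATIN" in name
        )
        if not expected:
            unexpected += 1
    return unexpected / len(letters)

def looks_valid_japanese(text: str, threshold: float = 0.05) -> bool:
    """Reject transcripts where e.g. Arabic or Cyrillic script dominates."""
    return unexpected_script_ratio(text) <= threshold
```

A transcript that suddenly switches to Arabic script would fail this check, while normal Japanese with embedded Latin acronyms (GPS, BGM) passes.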
🆕 ibm/granite-4.0-1b-speech — Polarized Accuracy (WER: 0.4058)
Developed by IBM Research, this is a compact 1B-parameter speech recognition model with a practical RTF of 0.061. However, our evaluation revealed a strongly polarized behavior: excellent on clean audio, severely degraded on noisy or multi-speaker input.
Performance on Clean Audio
On news narration and clear business conversation samples, the model achieved CER around 10% — a level that rivals the top models. Within a controlled domain, Granite can be a competitive option.
Issue: Hallucination and Loop Behavior
On noisy or multi-speaker samples, CER exceeded 70–90% in multiple cases. This "catastrophic failure" on a subset of samples is the primary driver of the low overall average.
Here's a breakdown of the failure modes we observed:
① Repetition Loop
"...iine iine iine iine iine..."
"hai hai hai hai hai hai hai..."
A known failure mode in Transformer-based ASR models, where non-speech sounds (silence, BGM, laughter) cause the decoder to enter a loop attempting to verbalize them. Whisper also suffers from this, but the occurrence rate in Granite was notably higher.
② Early Stopping
Several clips showed extremely short outputs despite long reference transcripts. Likely caused by overlapping speakers or failed VAD (Voice Activity Detection) at transition points, leading the model to incorrectly judge that speech has ended.
③ Proper Noun / Kanji Conversion Errors
| Reference | Granite Output | Note |
|---|---|---|
| マル暴 (police slang) | 丸棒 (round stick) | Phonetically similar, wrong kanji |
| 霜降り明星 (comedian duo) | 下振り明星 | Near-homophone misconversion |
Even when phoneme-level recognition succeeds, the kanji conversion step frequently fails on Japanese slang and proper nouns. Japanese-specific vocabulary handling appears to be underdeveloped.
④ Instability Around Fillers
In multiple clips, important words immediately following fillers ("uh", "um") were dropped entirely. The filler processing appears to disrupt the subsequent recognition context.
Granite Positioning Summary
✅ Good fit
- News / announcer-style narration
- Single-speaker clean audio batch processing
- Edge / low-resource environments (small 1B model footprint)
❌ Poor fit
- Comedy / variety / emotionally expressive audio
- Multi-speaker meetings or debates
- Entertainment content with laughter or BGM
- Natural conversation with fillers
If using Granite in production, you must implement loop detection post-processing (repetition pattern detection and truncation) and output length validation. Using raw output as-is is risky.
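The kind of post-processing described above can be sketched as follows. The n-gram window, repeat limit, and characters-per-second bounds are illustrative values we chose for this sketch, not thresholds from any official toolkit:

```python
def truncate_repetition(text: str, max_repeats: int = 4, max_ngram: int = 6) -> str:
    """Cut the output where an n-gram of characters repeats back-to-back
    more than max_repeats times (a decoder loop), keeping one occurrence."""
    for n in range(1, max_ngram + 1):
        i = 0
        while i + n * (max_repeats + 1) <= len(text):
            unit = text[i:i + n]
            run = 1
            while text[i + run * n:i + (run + 1) * n] == unit:
                run += 1
            if run > max_repeats:
                return text[:i + n]  # keep first occurrence, drop the loop
            i += 1
    return text

def output_length_plausible(text: str, audio_sec: float,
                            min_cps: float = 1.0, max_cps: float = 15.0) -> bool:
    """Flag outputs that are implausibly short (early stop) or long (loop)
    relative to audio duration, via characters-per-second bounds."""
    cps = len(text) / max(audio_sec, 0.1)
    return min_cps <= cps <= max_cps
```

Note that passing both checks only means the output is structurally sane; as discussed in the FAQ, removing a loop does not recover the speech that was lost inside it.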
nvidia/parakeet-tdt-0.6b-v3 — Blazing Fast (RTF: 0.002)
RTF=0.002 is the fastest in this benchmark. It processes 29 seconds of audio in roughly 0.06 seconds.
The TDT (Token-and-Duration Transducer) architecture predicts each token's duration alongside the token itself, allowing the decoder to skip over frames — this aggressive skipping drives both the speed and the WER of 0.3465.
Good for:
- Real-time caption drafts
- High-volume audio indexing where speed beats accuracy
Not suitable for:
- Detailed meeting minutes or drama/conversation transcription
reazonspeech-nemo-v2 vs. k2-v2
Both share the ReazonSpeech brand, but show clear differences in output style.
| Feature | nemo-v2 | k2-v2 |
|---|---|---|
| Numbers | 13歳 (Arabic numerals) | 十三歳 (kanji numerals) |
| Latin characters | gps (lowercase) | GPS (uppercase) |
| Speed | RTF=0.020 | RTF=0.026 |
| Truncation pattern | Drops the tail | Drops the middle |
nemo-v2 also demonstrated domain-specific strength, correctly converting the police slang "マル暴 (Marubo)" to proper kanji.
kotoba-whisper-v2.0 — Lowest Score in This Benchmark (WER: 0.5402)
Handles clear, scripted speech adequately, but struggles significantly with natural conversation, emotional delivery, and multiple speakers.
Characteristic issues:
- Repeats the same phrase when audio is unclear
- Drops important words immediately after fillers ("etto", "ano", "maa")
- Fails to transcribe the beginning of audio in some clips
Cross-Model Comparison: Loop / Repetition Behavior
One of the most notable findings across this benchmark was repetition loop behavior observed in multiple models.
| Model | Loop Occurrence | Mitigation |
|---|---|---|
| whisper-large-v3-turbo | Rare (appended hallucination) | VAD pre-processing, no_speech_threshold tuning |
| granite-4.0-1b-speech | Frequent (BGM / laughter triggers) | Post-processing repetition detection is mandatory |
| kotoba-whisper-v2.0 | Occurs on unclear audio | — |
Loop behavior typically stems not from model architecture, but from how the training data handles non-speech sounds (laughter, BGM, breath). Japanese variety and entertainment content is particularly prone to triggering this, as it contains high concentrations of such sounds.
Recommended Models by Use Case
Maximum accuracy (transcripts, subtitles)
→ qwen3-asr-1.7b or whisper-large-v3-turbo
Japanese media / broadcast
→ reazonspeech-espnet-v2
Speed-first (real-time / batch)
→ parakeet-tdt-0.6b-v3
Readability / written-style output
→ voxtral-mini (language validation required)
Clean audio only / lightweight model needed
→ granite-4.0-1b-speech (pair with loop detection post-processing)
Models to avoid for certain use cases
→ kotoba-whisper-v2.0 (natural conversation)
→ voxtral-mini (without language validation)
→ granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)
Reproducibility: The WER Evaluation Pipeline
The most critical design decision in this benchmark is how to compute WER fairly. For Japanese ASR, simply calling jiwer.wer(ref, hyp) is not sufficient — it requires careful normalization and tokenization.
Step 1: Text Normalization (Strip Punctuation, Symbols, Spaces)
Different models punctuate very differently. For example, reazonspeech-espnet-v2 aggressively inserts ! and ?, while reference transcripts often have none. To prevent WER from being unfairly skewed by punctuation differences, we strip all punctuation before scoring.
```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC normalization: full-width → half-width, variant unification
    text = unicodedata.normalize("NFKC", text)
    chars: list[str] = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("Z"):  # Separators (spaces, ideographic spaces)
            continue
        if cat.startswith("P"):  # Punctuation (、。!?「」etc.)
            continue
        if cat.startswith("S"):  # Symbols (¥, © etc.)
            continue
        chars.append(ch)
    return "".join(chars).strip()
```
This normalization prevents unfair penalization from surface-level differences like:
| Transform | Example |
|---|---|
| NFKC normalization | 13歳 → 13歳, ㎞ → km |
| Punctuation removal | 行って、叫んで。 → 行って叫んで |
| Space removal | 東京 都 → 東京都 |
Step 2: Japanese Tokenization with MeCab (WER only)
Japanese has no word-boundary spaces, so English-style WER computation doesn't apply directly. We use MeCab morphological analysis (wakati-gaki) to split text into words before computing WER.
```python
import MeCab

_MECAB_TAGGER = None

def _get_mecab_tagger():
    global _MECAB_TAGGER
    if _MECAB_TAGGER is None:
        # -Owakati: outputs space-separated morphemes
        _MECAB_TAGGER = MeCab.Tagger("-Owakati")
    return _MECAB_TAGGER

def _contains_japanese(text: str) -> bool:
    for ch in text:
        code = ord(ch)
        if (
            0x3040 <= code <= 0x30FF     # Hiragana / Katakana
            or 0x4E00 <= code <= 0x9FFF  # CJK Unified Ideographs
            or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        ):
            return True
    return False

def tokenize_for_wer(text: str) -> list[str]:
    # Already space-separated (e.g., English)
    if " " in text:
        return [tok for tok in text.split(" ") if tok]
    # Use MeCab for Japanese
    if _contains_japanese(text):
        tagger = _get_mecab_tagger()
        parsed = str(tagger.parse(text) or "").strip()
        return [w for w in parsed.split() if w]
    # Fallback: character-level
    return [ch for ch in text if not ch.isspace()]
```
Why MeCab? For a word like "行ってきました", naive character splitting produces one token per character. MeCab splits it into morphemes: 行っ / て / き / まし / た — much closer to what a human would count as "word errors."
Step 3: CER Computation (Character Level)
CER is straightforward: split into individual characters, excluding spaces.
```python
def tokenize_for_cer(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]
```
Step 4: Error Rate Calculation with jiwer
Token lists are joined into space-separated strings and passed to jiwer for substitution/deletion/insertion counts.
```python
import jiwer

def compute_error_rate(ref: str, hyp: str, mode: str) -> tuple[float, int, int]:
    ref_n = normalize_text(ref)
    hyp_n = normalize_text(hyp)
    if mode == "wer":
        ref_tokens = tokenize_for_wer(ref_n)
        hyp_tokens = tokenize_for_wer(hyp_n)
    elif mode == "cer":
        ref_tokens = tokenize_for_cer(ref_n)
        hyp_tokens = tokenize_for_cer(hyp_n)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    out = jiwer.process_words(
        " ".join(ref_tokens),
        " ".join(hyp_tokens),
    )
    err = int(out.substitutions + out.deletions + out.insertions)
    total = int(out.hits + out.substitutions + out.deletions)
    return (err / max(total, 1)), err, total
```
Final WER/CER is computed as a corpus-level aggregate (Σerr / Σtotal across all 20 samples), not an average of per-sample WERs. This prevents short clips from having disproportionate influence.
Step 5: RTF Measurement
Each ASR model runs as a subprocess and returns a JSON payload with inference time and audio duration. RTF is their ratio.
JSON format returned by each model's `transcribe.py`:

```json
{
  "text": "transcribed text here",
  "transcribe_elapsed_sec": 0.34,
  "audio_duration_sec": 29.0,
  "rtf": 0.012
}
```

Corpus-level RTF is the ratio of the sums:

```python
rtf = transcribe_elapsed_sum / audio_duration_sum  # Σ elapsed / Σ duration
```
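Assuming each subprocess emits one such JSON payload per clip, the corpus-level aggregation can be sketched like this (the payload list here is hand-built for illustration):

```python
import json

def corpus_rtf(payloads: list[dict]) -> float:
    """Corpus-level RTF: total inference time over total audio time."""
    elapsed = sum(p["transcribe_elapsed_sec"] for p in payloads)
    duration = sum(p["audio_duration_sec"] for p in payloads)
    return elapsed / duration

# Two clips' payloads, as parsed from subprocess stdout:
payloads = [
    json.loads('{"text": "...", "transcribe_elapsed_sec": 0.34,'
               ' "audio_duration_sec": 29.0, "rtf": 0.012}'),
    json.loads('{"text": "...", "transcribe_elapsed_sec": 0.58,'
               ' "audio_duration_sec": 29.0, "rtf": 0.020}'),
]
```

Summing before dividing (rather than averaging per-clip RTFs) keeps clips weighted by their duration, mirroring the corpus-level WER/CER computation.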
Dependencies
```text
jiwer        # WER/CER computation
MeCab        # Japanese morphological analysis
unicodedata  # Text normalization (Python stdlib)
```
Installing MeCab (Ubuntu):
```shell
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
pip install mecab-python3 jiwer
```
FAQ
Q1. Which model has the best accuracy for Japanese ASR?
In this benchmark, qwen/qwen3-asr-1.7b (WER=0.1899) achieved the highest accuracy. That said, whisper offers comparable stability in noisy or multi-speaker environments. The right choice depends on your use case.
Q2. Should I use WER or CER?
For Japanese, WER can be artificially inflated due to ambiguities in morphological segmentation. Since we use MeCab-based tokenization here, our WER is more reliable than naive character-split WER — but we still recommend CER as the primary metric. In our results, the rankings were nearly identical between the two.
Q3. Which models support real-time transcription?
A practical threshold is RTF < 0.1. In this benchmark, parakeet-tdt-0.6b-v3 (RTF=0.002), whisper (RTF=0.012), nemo-v2 (RTF=0.020), and granite-4.0-1b-speech (RTF=0.061) all qualify. Factor in accuracy tradeoffs — Granite meets the speed bar but has stability issues in noisy conditions.
Q4. Which model best preserves fillers ("uh", "um")?
qwen3-asr-1.7b and whisper drop some fillers but are relatively faithful. reazonspeech-espnet-v2 aggressively removes them and is not suitable when filler preservation matters.
Q5. Which model handles overlapping multi-speaker audio best?
whisper and qwen3-asr-1.7b maintained stable accuracy even in multi-speaker and noisy conditions. ReazonSpeech models tend to drop the tail end of utterances when speakers overlap. Granite and kotoba-whisper are not suitable for multi-speaker audio.
Q6. How do I mitigate Whisper hallucinations?
Three approaches: ① pre-processing with a VAD (Voice Activity Detection) model to strip silent segments, ② post-processing to detect repetitive patterns, ③ tuning the no_speech_threshold parameter. Pay extra attention when processing long-form audio.
Q7. Why did voxtral-mini output what appeared to be Arabic?
The exact cause is unclear, but this is consistent with language-mixing issues common in multilingual models. Automatic language validation of output is strongly recommended for production use.
Q8. Why is granite-4.0-1b-speech so polarized in accuracy?
Granite appears to be built on a primarily English-trained base model with speech recognition capability added. It handles standard Japanese (news, announcements) reasonably well, but in environments with heavy non-speech audio (laughter, BGM, overlapping speech), the decoder is prone to looping. This likely reflects insufficient coverage of such difficult cases in training data.
Q9. Does post-processing to remove loops actually improve WER?
Removing loops does improve CER/WER scores, but the audio content that should have been transcribed is still lost. Early stopping problems persist after loop removal, so it's not a real fix — it's output sanitization, not accuracy improvement. Understand this distinction before relying on it.
Q10. How reliable is this benchmark?
The 20-clip dataset covers diverse genres, but sample size is limited. For domain-specific deployments (medical, legal, etc.), we recommend running additional evaluation on domain-matched data.
Conclusion
Across 9 models evaluated under identical conditions:
- Best accuracy + stability: qwen3-asr-1.7b > whisper-large-v3-turbo
- Japanese media / broadcast: reazonspeech-espnet-v2
- Speed-first: parakeet-tdt-0.6b-v3
- Clean audio / lightweight: granite-4.0-1b-speech (loop post-processing required)
- Use with caution: kotoba-whisper-v2.0 (natural conversation), voxtral-mini (without language validation), granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)
What this Granite evaluation reinforced is that picking a model based on average WER alone is dangerous. Granite rivals top models on clean audio, but catastrophically fails under specific conditions. Understanding your actual audio characteristics — and pairing the right model with appropriate pre-processing (noise reduction, speaker separation) and post-processing (hallucination detection, loop removal, language validation) — is what actually determines ASR pipeline quality.
Contact
For project inquiries and collaboration, contact us here.
If you are considering a new project, product development, or other collaboration, please get in touch.
Related Articles
Explore more articles connected to this topic.
How I Finetuned IBM Granite Speech 1B on Japanese Audio and Improved CER from 0.37 to 0.14
I finetuned IBM Granite Speech (`granite-4.0-1b-speech`) on 100 hours of Japanese speech data and reduced CER from 0.37 to 0.14. The official script's Projector+LoRA-only training has a ceiling on accuracy gains. The key breakthrough was additionally training `lm_head` and the last 8 layers of the Language Model. The result matches Qwen3-ASR-1.7B (CER 0.14) with only 1B parameters.
Read article →
Open-Source TTS Model Showdown 2026: Hands-On Testing of Qwen3-TTS, Chatterbox, Fish Speech, CosyVoice & IndexTTS2
A hands-on comparison of five open-source TTS models, with a focus on commercial license pitfalls in Fish Speech and IndexTTS2 — covering features, licenses, and working code all in one place.
Read article →