Japanese ASR Model Comparison 2026: Whisper, Qwen3, Voxtral & ReazonSpeech Benchmarked on RTX5090
What you'll learn: First-party benchmark data comparing 8 Japanese automatic speech recognition (ASR) models under identical conditions. We evaluate WER, CER, and RTF to answer the question: which model is best for which use case?
Summary (Conclusion First)
We evaluated 8 ASR models on 20 natural conversation audio clips (29 seconds each) using an RTX5090. openai/whisper-large-v3-turbo and qwen/qwen3-asr-1.7b clearly outperform the rest in both accuracy and stability. For speed-critical applications, nvidia/parakeet-tdt-0.6b-v3 (RTF=0.002) is unmatched; for Japanese-domain adaptation, reazon-research/reazonspeech-espnet-v2 is a strong contender.
| Model | WER↓ | CER↓ | RTF↓ | Overall |
|---|---|---|---|---|
| qwen/qwen3-asr-1.7b | 0.1899 | 0.1473 | 0.034 | ⭐⭐⭐⭐⭐ |
| openai/whisper-large-v3-turbo | 0.2099 | 0.1782 | 0.012 | ⭐⭐⭐⭐⭐ |
| reazonspeech-espnet-v2 | 0.2342 | 0.2007 | 0.589 | ⭐⭐⭐⭐ |
| voxtral-mini-4b | 0.2437 | 0.2187 | 0.207 | ⭐⭐⭐ |
| parakeet-tdt-0.6b-v3 | 0.3465 | 0.3235 | 0.002 | ⭐⭐⭐ |
| reazonspeech-nemo-v2 | 0.3546 | 0.3346 | 0.020 | ⭐⭐⭐ |
| reazonspeech-k2-v2 | 0.4636 | 0.4486 | 0.026 | ⭐⭐ |
| kotoba-whisper-v2.0 | 0.5402 | 0.5050 | 0.008 | ⭐ |
Test Environment & Dataset
Hardware & Settings
GPU: RTX5090
temperature: 0.0
max_new_tokens: 512
language: ja
Dataset (20 clips, 29 seconds each)
What sets this benchmark apart from superficial roundup articles is that we used real Japanese media audio — not clean read-speech or synthetic data.
- News narration: Standard broadcaster speech
- Entertainment (comedy/variety shows): Heavy fillers, laughter, rapid-fire speech
- Drama & anime: Emotional delivery, slang (e.g., "temee", "babaa!")
- Business reality shows: Overlapping speakers
By including overlapping speakers and background noise such as audience laughter, the dataset reflects conditions close to real-world production use.
Metrics Explained
WER (Word Error Rate)
WER = (Substitutions + Deletions + Insertions) / Reference word count
Lower is better. Unlike English, Japanese has no spaces between words, so we use MeCab (a morphological analyzer) to tokenize the text before computing WER. Punctuation marks (、。!?) also vary widely between models, so they are stripped before evaluation to ensure fair comparison. See the Reproducibility section for full implementation details.
CER (Character Error Rate)
Error rate at the character level. For Japanese, CER is generally considered more reliable than WER. In our results, the ranking was nearly identical between WER and CER, confirming consistency between the two metrics.
RTF (Real-Time Factor)
RTF = Processing time / Audio duration
RTF < 1.0 means faster than real-time. Parakeet's RTF of 0.002 means it processes a 29-second clip in roughly 0.06 seconds.
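The relationship can be sketched in a few lines (the helper below is illustrative, not part of the benchmark harness):

```python
def real_time_factor(processing_sec: float, audio_sec: float) -> float:
    """RTF = processing time / audio duration; < 1.0 means faster than real-time."""
    return processing_sec / audio_sec

# Illustrative numbers matching the Parakeet row above:
# a 29-second clip processed in ~0.06 seconds
print(real_time_factor(0.058, 29.0))  # -> roughly 0.002
```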
DR (Detection Rate)
The proportion of inputs that produced a valid output. All models achieved 100% — zero complete failures.
Per-Model Deep Dive
🥇 qwen/qwen3-asr-1.7b — Best Accuracy (WER: 0.1899)
The standout feature of this model is its ability to "complete" the transcript contextually.
Rather than simply mapping sounds to characters, it infers what the speaker most likely said and produces output that reads like it was written by a professional transcriptionist, with punctuation placed at natural breath boundaries.
Observed behavior:
- Appropriately drops filler words ("えー", "あの") while preserving grammatically important particles
- Only 1 out of 20 clips exceeded WER=0.5; all others came in under 0.2 — exceptional stability
Caveats:
- Tends to "mishear" internet slang and unusual proper nouns with plausible-sounding alternatives
- Average inference time 0.99s (RTF=0.034) — roughly 3× slower than Whisper
🥈 openai/whisper-large-v3-turbo — Reliable All-Rounder (WER: 0.2099)
Its contextual understanding is unmatched among the models tested.
Clips #17 and #18 — casual conversation snippets like "went to the ocean and screamed…" — were nearly perfectly transcribed by Whisper, while all ReazonSpeech variants either went silent or truncated the back half.
Strengths:
- Intelligently trims filler words while filling in dropped particles to maintain sentence structure
- Holds up well under overlapping speakers and ambient noise
Weakness: Hallucination
Whisper occasionally generates text for silent portions of audio. In clip #13, the word "私" (I/me) was appended at the end despite not appearing in the reference. For long-form audio, a post-processing hallucination detection step is advisable.
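One lightweight mitigation is a repetition check on the output text. A minimal sketch, assuming a simple heuristic of our own design (the function name and thresholds are illustrative, not part of any Whisper API):

```python
def has_repetition_loop(text: str, max_repeats: int = 3) -> bool:
    """Return True if any short substring repeats more than max_repeats
    times back-to-back -- a common signature of ASR hallucination loops."""
    for n in range(2, 11):                 # candidate loop lengths in characters
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            repeats = 1
            j = i + n
            while text[j:j + n] == gram:   # count consecutive copies
                repeats += 1
                j += n
            if repeats > max_repeats:
                return True
    return False

print(has_repetition_loop("ありがとうありがとうありがとうありがとうありがとう"))  # True
print(has_repetition_loop("海に行って叫んできました"))  # False
```

Transcripts flagged by a check like this can be re-run with different decoding settings or dropped from downstream processing.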
reazon-research/reazonspeech-espnet-v2 — Japanese Domain Expert (WER: 0.2342)
Trained on Japanese broadcast media data, this model demonstrated an impressive ability to distinguish between formal and colloquial Japanese. For example, it correctly differentiated between "東横" (Toyoko, a place name) and "トー横" (Tōyoko, a youth slang term for a district in Shinjuku) based on context alone — a practical advantage for news and media applications.
It also tends to actively use exclamation marks (!) and question marks (?) to capture conversational energy.
Weaknesses:
- RTF=0.589 — the slowest model tested (~17 seconds to process a 29-second clip)
- Aggressively removes filler words, making it unsuitable when filler preservation is required
mistralai/voxtral-mini-4b-realtime-2602 — Readability Over Fidelity (WER: 0.2437)
This model clearly prioritizes "readable output" over verbatim transcription accuracy.
Concrete examples observed:
- ぐらい → くらい (orthographic unification)
- っていうか → というのか (colloquial → written-form conversion)
This makes it appealing for subtitle production and meeting minutes where clean text matters more than phonetic accuracy.
Critical concern:
One file triggered output that appeared to be Arabic or a garbled non-Japanese script. Language validation is mandatory before any production deployment.
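Such a validation step can be as cheap as a character-ratio check on the output. A minimal sketch, assuming a heuristic and threshold of our own choosing (not a vendor recommendation):

```python
def looks_japanese(text: str, min_ratio: float = 0.5) -> bool:
    """Return True if at least min_ratio of non-space characters are
    Japanese (kana/CJK) or ASCII -- a cheap guard against language
    confusion such as unexpected Arabic script in the output."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    ok = 0
    for ch in chars:
        code = ord(ch)
        if code < 0x80 or 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            ok += 1
    return ok / len(chars) >= min_ratio

print(looks_japanese("今日は海に行きました"))  # True
print(looks_japanese("مرحبا بالعالم"))        # False
```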
nvidia/parakeet-tdt-0.6b-v3 — Extreme Speed (RTF: 0.002)
RTF=0.002 is the fastest result in this benchmark — a 29-second clip processed in ~0.06 seconds.
The TDT (Token-and-Duration Transducer) architecture jointly predicts each token and how many frames it spans, letting the decoder skip over audio it deems uninformative. This "aggressive omission" behavior is the likely root cause of the WER=0.3465 score.
Good fit for:
- Real-time caption drafts
- High-volume audio indexing where speed trumps accuracy
Poor fit for:
- Detailed meeting transcripts or complete drama/dialogue transcription
reazonspeech-nemo-v2 vs. k2-v2: A Tale of Two Styles
Both share the ReazonSpeech brand but differ significantly in output formatting:
| Feature | nemo-v2 | k2-v2 |
|---|---|---|
| Number style | 13歳 (Arabic numerals) | 十三歳 (Kanji numerals) |
| English letters | gps (lowercase) | GPS |
| Speed | RTF=0.020 | RTF=0.026 |
| Truncation pattern | Drops the second half | Drops the middle ("hollow" output) |
nemo-v2 also correctly transcribed the police jargon "マル暴" (Marubo, slang for organized crime division) as proper kanji — demonstrating solid specialized vocabulary coverage.
kotoba-whisper-v2.0 — Last Place in This Test (WER: 0.5402)
This model handles clean, clearly articulated speech (such as news scripts) adequately, but struggles significantly with natural conversation, emotionally expressive delivery, and multi-speaker scenarios.
Characteristic failure modes:
- Loops the same phrase when audio becomes unclear
- Loses words immediately following filler tokens ("えっと", "あの", "まー")
- Frequently misses the opening portion of an audio clip entirely
Model Recommendations by Use Case
Best accuracy (minutes, subtitles)
→ qwen3-asr-1.7b or whisper-large-v3-turbo
Japanese media / broadcast domain
→ reazonspeech-espnet-v2
Speed-first (real-time, batch processing)
→ parakeet-tdt-0.6b-v3
Readability-first (written-style output)
→ voxtral-mini (language validation required)
Reproducibility: How WER Was Computed
The most critical design decision in this benchmark was how to fairly compute WER for Japanese. Simply calling `jiwer.wer(ref, hyp)` out of the box does not produce accurate or fair results for Japanese ASR.
Step 1: Text Normalization (Strip Punctuation, Symbols & Whitespace)
Different models insert punctuation very differently — reazonspeech-espnet-v2 aggressively adds ! and ?, while reference annotations may have none. To prevent punctuation habits from unfairly skewing WER scores, we strip all of it before evaluation.
```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC normalization: full-width digits/letters -> half-width, variant forms unified
    text = unicodedata.normalize("NFKC", text)
    chars: list[str] = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("Z"):  # Separators (space, ideographic space, etc.)
            continue
        if cat.startswith("P"):  # Punctuation (、。!?「」etc.)
            continue
        if cat.startswith("S"):  # Symbols (¥, ©, etc.)
            continue
        chars.append(ch)
    return "".join(chars).strip()
```
This normalization prevents the following types of unfair scoring:
| Transform | Example |
|---|---|
| NFKC normalization | １３歳 → 13歳, ｋｍ → km |
| Punctuation removal | 行って、叫んで。 → 行って叫んで |
| Whitespace removal | 東京 都 → 東京都 |
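The NFKC folding can be verified directly with Python's standard library:

```python
import unicodedata

# Full-width digits and Latin letters fold to their half-width ASCII forms
print(unicodedata.normalize("NFKC", "１３歳"))  # 13歳
print(unicodedata.normalize("NFKC", "ｋｍ"))    # km
```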
Step 2: MeCab Morphological Tokenization (for WER)
Japanese has no word boundaries marked by spaces, so we cannot split on whitespace the way English WER is computed. We use MeCab in wakati (space-separated) mode to perform morphological analysis before computing WER.
```python
import MeCab

_MECAB_TAGGER = None

def _get_mecab_tagger():
    global _MECAB_TAGGER
    if _MECAB_TAGGER is None:
        # -Owakati: output in space-separated word form
        _MECAB_TAGGER = MeCab.Tagger("-Owakati")
    return _MECAB_TAGGER

def _contains_japanese(text: str) -> bool:
    for ch in text:
        code = ord(ch)
        if (
            0x3040 <= code <= 0x30FF     # Hiragana / Katakana
            or 0x4E00 <= code <= 0x9FFF  # CJK Unified Ideographs
            or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        ):
            return True
    return False

def tokenize_for_wer(text: str) -> list[str]:
    # Already space-separated (e.g. English): split on spaces
    if " " in text:
        return [tok for tok in text.split(" ") if tok]
    # Contains Japanese: use MeCab morphological analysis
    if _contains_japanese(text):
        tagger = _get_mecab_tagger()
        parsed = str(tagger.parse(text) or "").strip()
        return [w for w in parsed.split() if w]
    # Otherwise (pure ASCII/numerals): character-level split
    return [ch for ch in text if not ch.isspace()]
```
Why MeCab? Take the word "行ってきました" (I went and came back). Character-level splitting would treat each character as an independent token. MeCab correctly segments it into morphemes: 行っ / て / き / まし / た. This makes substitution/deletion/insertion counts much more meaningful.
Step 3: CER Tokenization (Character Level)
CER is straightforward — split into individual characters, excluding whitespace.
```python
def tokenize_for_cer(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]
```
Step 4: Error Rate Computation via jiwer
Token sequences are joined with spaces and passed to jiwer to obtain substitution, deletion, and insertion counts.
```python
import jiwer

def compute_error_rate(ref: str, hyp: str, mode: str) -> tuple[float, int, int]:
    ref_n = normalize_text(ref)
    hyp_n = normalize_text(hyp)
    if mode == "wer":
        ref_tokens = tokenize_for_wer(ref_n)
        hyp_tokens = tokenize_for_wer(hyp_n)
    elif mode == "cer":
        ref_tokens = tokenize_for_cer(ref_n)
        hyp_tokens = tokenize_for_cer(hyp_n)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # jiwer expects whitespace-delimited strings
    out = jiwer.process_words(
        " ".join(ref_tokens),
        " ".join(hyp_tokens),
    )
    err = int(out.substitutions + out.deletions + out.insertions)
    total = int(out.hits + out.substitutions + out.deletions)
    return (err / max(total, 1)), err, total
```
The final WER and CER figures are corpus-level aggregates across all 20 samples — computed as Σerr / Σtotal rather than averaging per-sample rates. This prevents short clips from having disproportionate influence on the overall score.
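The aggregation described above can be sketched as follows (`corpus_error_rate` is our illustrative name; the per-sample `(errors, reference_tokens)` pairs would come from Step 4):

```python
def corpus_error_rate(per_sample: list[tuple[int, int]]) -> float:
    """Aggregate (errors, reference_tokens) pairs as sum(err) / sum(total),
    so short clips do not dominate the overall score."""
    err_sum = sum(e for e, _ in per_sample)
    total_sum = sum(t for _, t in per_sample)
    return err_sum / max(total_sum, 1)

# Illustrative: a 5-token clip with 4 errors and a 100-token clip with 5 errors.
# Averaging per-sample rates would give (0.8 + 0.05) / 2 = 0.425;
# corpus-level aggregation gives 9 / 105, roughly 0.086.
print(corpus_error_rate([(4, 5), (5, 100)]))
```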
Step 5: RTF Measurement
Each ASR model runs as a subprocess and returns timing metadata as JSON. RTF is derived from the ratio of inference time to audio duration.
```python
# JSON format returned by each model's transcribe.py
{
    "text": "transcribed text here",
    "transcribe_elapsed_sec": 0.34,  # wall-clock inference time
    "audio_duration_sec": 29.0,      # length of input audio
    "rtf": 0.012                     # transcribe_elapsed / audio_duration
}

# Corpus-level RTF = Σtranscribe_elapsed / Σaudio_duration
rtf = transcribe_elapsed_sum / audio_duration_sum
```
Dependencies
```
jiwer        # WER/CER computation
MeCab        # Japanese morphological analysis (tokenization)
unicodedata  # Text normalization (Python standard library)
```

Installing MeCab (Ubuntu/Debian):

```bash
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
pip install mecab-python3 jiwer
```
FAQ
Q1. Which Japanese ASR model has the best accuracy?
In this benchmark, qwen/qwen3-asr-1.7b achieved the best WER (0.1899). That said, whisper is essentially on par in terms of stability across diverse audio conditions — the right choice depends on your specific use case.
Q2. Should I use WER or CER for Japanese ASR evaluation?
Due to the ambiguity of word segmentation in Japanese, WER can be unfairly inflated. We used MeCab to mitigate this, which improves reliability compared to character-level splitting, but we still recommend CER as your primary metric. In our results, the model rankings were nearly identical under both metrics.
Q3. Which model is best for real-time transcription?
RTF < 0.1 is a practical threshold for real-time use. In our results, parakeet-tdt-0.6b-v3 (RTF=0.002), whisper (RTF=0.012), and nemo-v2 (RTF=0.020) all qualify — but be sure to weigh the accuracy tradeoffs.
Q4. Which model best preserves filler words (えー, あのー)?
qwen3-asr-1.7b and whisper both trim fillers somewhat but are relatively faithful. reazonspeech-espnet-v2 aggressively removes them, making it unsuitable for filler-preserving use cases like linguistic analysis.
Q5. Which model handles overlapping speakers best?
whisper and qwen3-asr-1.7b both maintained stable accuracy even with overlapping speakers and background noise. ReazonSpeech variants tended to truncate or drop the latter half of the transcript when speakers overlapped.
Q6. How do I deal with Whisper hallucinations?
Three approaches are effective: ① pre-processing with VAD (Voice Activity Detection) to skip non-speech segments; ② post-processing to detect repetition patterns in the output text; ③ tuning the no_speech_threshold parameter. Extra care is needed for long-form audio.
Q7. Why did voxtral-mini output Arabic-looking text?
The exact cause is unclear, but it is likely a language-confusion artifact common in multilingual models. Implement automatic language validation before using this model in production.
Q8. How reliable is this benchmark?
The 20-clip test set covers a diverse range of genres, but 20 samples is still a limited set. For domain-specific applications (medical, legal, etc.), we strongly recommend supplementing with in-domain evaluation data before making a final model selection.
Conclusion
To summarize 8 models benchmarked under identical conditions:
- Best accuracy + stability: qwen3-asr-1.7b > whisper
- Japanese media domain: reazonspeech-espnet-v2
- Speed-first: parakeet-tdt-0.6b-v3
- Use with caution: kotoba-whisper-v2.0 (natural conversation), voxtral-mini (without language validation)
ASR performance depends not only on model selection, but on the full pipeline — preprocessing (noise reduction, speaker separation) and postprocessing (hallucination detection, language validation) can dramatically shift real-world accuracy. Use the data in this article as a foundation for building the optimal pipeline for your use case.