Japanese ASR Model Comparison 2026: Whisper, Qwen3, Voxtral & ReazonSpeech Benchmarked on RTX5090

::: message
Update History
2026/03/23 Added benchmark results for ibm/granite-4.0-1b-speech.
:::

What you'll learn: first-hand benchmark data comparing 9 Japanese speech recognition (ASR) models under identical conditions. We compare WER, CER, and RTF to answer the question: which model is best for which use case?


Summary

Using an RTX5090, we evaluated 9 ASR models on 20 audio clips (580 seconds total, ~10 minutes) of natural conversational speech. For overall accuracy and stability, openai/whisper-large-v3-turbo and qwen/qwen3-asr-1.7b stand clearly ahead of the pack. If speed is your top priority, nvidia/parakeet-tdt-0.6b-v3 (RTF=0.002) is the winner. For Japanese-specific domain adaptation, reazon-research/reazonspeech-espnet-v2 is a strong contender.

We also added ibm/granite-4.0-1b-speech to this round. It showed impressive accuracy on clean audio, but exhibited a clear weakness in noisy and multi-speaker environments — particularly a hallucination loop problem.

| Model | WER↓ | CER↓ | RTF↓ | Overall |
|---|---|---|---|---|
| qwen3-asr-1.7b | 0.1899 | 0.1473 | 0.034 | ⭐⭐⭐⭐⭐ |
| whisper-large-v3-turbo | 0.2099 | 0.1782 | 0.012 | ⭐⭐⭐⭐⭐ |
| reazonspeech-espnet-v2 | 0.2342 | 0.2007 | 0.589 | ⭐⭐⭐⭐ |
| voxtral-mini-4b | 0.2437 | 0.2187 | 0.207 | ⭐⭐⭐ |
| granite-4.0-1b-speech | 0.4058 | 0.3715 | 0.061 | ⭐⭐⭐ |
| parakeet-tdt-0.6b-v3 | 0.3465 | 0.3235 | 0.002 | ⭐⭐⭐ |
| reazonspeech-nemo-v2 | 0.3546 | 0.3346 | 0.020 | ⭐⭐⭐ |
| reazonspeech-k2-v2 | 0.4636 | 0.4486 | 0.026 | ⭐⭐ |
| kotoba-whisper-v2.0 | 0.5402 | 0.5050 | 0.008 | |

Granite shows a highly polarized behavior — strong on clean audio, brittle on noise. Its average ranking is mid-tier, but it can be a viable option if your domain is well-controlled.


Test Environment & Evaluation Data

Hardware & Settings

```text
GPU: RTX5090
temperature: 0.0
max_new_tokens: 512
language: ja
```

Dataset (20 clips, ~29 seconds each)

What sets this benchmark apart from surface-level roundups is that we used real media audio, not synthetic or read-speech datasets.

  • News narration: Standard announcer-style delivery
  • Entertainment (comedy / variety shows): Fillers, laughter, fast speech
  • Drama / anime: Emotional delivery, slang (e.g., "temee", "babaa!")
  • Business / reality TV: Overlapping multi-speaker segments

By including overlapping speech, laughter, and background noise, we replicated challenging, production-like scenarios.


Metrics Explained

WER (Word Error Rate)

```text
WER = (Substitutions + Deletions + Insertions) / Reference Word Count
```

Lower is better. Japanese text has no word-boundary spaces, so we use MeCab (a morphological analyzer) to tokenize text before computing WER. Punctuation (、。!?) is also stripped before evaluation, since different models punctuate differently. See the "Reproducibility" section for details.

CER (Character Error Rate)

Error rate measured at the character level. Generally considered more reliable than WER for Japanese evaluation. In our results, WER and CER rankings were almost identical, confirming the consistency of both metrics.

RTF (Real-Time Factor)

```text
RTF = Processing Time / Audio Duration
```

RTF < 1.0 means real-time capable. At RTF=0.002, parakeet processes 29 seconds of audio in roughly 0.06 seconds.
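For intuition, here is the formula in numbers, using the parakeet figures from the summary table:

```python
def rtf(processing_sec: float, audio_sec: float) -> float:
    """Real-Time Factor: processing time divided by audio duration."""
    return processing_sec / audio_sec

# parakeet-class speed: ~0.06 s of compute for a 29 s clip
print(round(rtf(0.06, 29.0), 3))  # 0.002
```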


Per-Model Analysis

🥇 qwen/qwen3-asr-1.7b — Highest Accuracy (WER: 0.1899)

The standout feature of this model is its "completion ability."

Rather than simply transcribing sounds, it completes sentences in a way that reads naturally — with punctuation placed at natural breath points, like output from a professional transcriptionist.

Observed behavior:

  • Appropriately drops fillers ("uh", "um") while preserving grammatical particles
  • Only 1 of 20 clips exceeded WER 0.5; all others stayed within 0.2 — exceptional consistency

Caveats:

  • Occasional "mishearing" (creative reinterpretation) of internet slang and proper nouns
  • Average inference time 0.99s (RTF=0.034) — roughly 3x slower than Whisper

🥈 openai/whisper-large-v3-turbo — Reliable All-Rounder (WER: 0.2099)

Its contextual understanding is head and shoulders above the rest.

Segments where ReazonSpeech models fell silent or dropped the tail end of utterances were transcribed near-perfectly by Whisper.

Strengths:

  • Drops fillers while reconstructing grammatical structure
  • Holds up well under multi-speaker and noisy conditions

Weakness: Hallucination

Whisper can generate text where no speech exists. In our test, the word "私" (I/me) was appended at the end without being present in the audio. For long-form audio, a post-processing hallucination detection step is advisable.


reazon-research/reazonspeech-espnet-v2 — Japanese Domain Adaptation (WER: 0.2342)

Trained on Japanese broadcast media data, this model demonstrated the ability to correctly interpret culturally specific terms — for example, distinguishing the colloquial "Tō-yoko" (a youth hangout spot named after the Tōyoko train line) from the literal place name. This is a significant advantage for news and media transcription.

It also actively inserts exclamation (!) and question (?) marks, attempting to capture the energy and emotion of speech.

Weaknesses:

  • RTF=0.589 — the slowest in this benchmark (about 17 seconds to process a 29-second clip)
  • Aggressively drops fillers, making it unsuitable when filler preservation is needed

mistralai/voxtral-mini-4b-realtime-2602 — Readability-First Design (WER: 0.2437)

This model clearly prioritizes readability over verbatim transcription.

Examples:

  • ぐらい → くらい (orthographic normalization)
  • っていうか → というのか (spoken → written register conversion)

Well-suited for subtitle production and meeting minutes where clean, readable output matters more than exact reproduction.

In one file, the model suddenly output what appeared to be Arabic script. Language validation is mandatory before using this in production.
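Language validation can be as simple as checking the Unicode scripts in the output. A minimal sketch (the function name and the script whitelist are illustrative choices, not part of any model's API):

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters outside the scripts expected
    in a Japanese transcript (kana, kanji, Latin)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    expected = ("HIRAGANA", "KATAKANA", "CJK", "LATIN")
    foreign = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith(expected)
    )
    return foreign / len(letters)

print(foreign_script_ratio("GPSの精度"))  # 0.0
print(foreign_script_ratio("مرحبا"))      # 1.0
```

A threshold (e.g. reject outputs above 20% foreign characters) would have caught the Arabic-script output before it reached users; tune the cutoff per domain.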


🆕 ibm/granite-4.0-1b-speech — Polarized Accuracy (WER: 0.4058)

Developed by IBM Research, this is a compact 1B-parameter speech recognition model with a practical RTF of 0.061. However, our evaluation revealed a strongly polarized behavior: excellent on clean audio, severely degraded on noisy or multi-speaker input.

Performance on Clean Audio

On news narration and clear business conversation samples, the model achieved CER around 10% — a level that rivals the top models. Within a controlled domain, Granite can be a competitive option.

Issue: Hallucination and Loop Behavior

On noisy or multi-speaker samples, CER exceeded 70–90% in multiple cases. This "catastrophic failure" on a subset of samples is the primary driver of the low overall average.

Here's a breakdown of the failure modes we observed:

① Repetition Loop

```text
"...iine iine iine iine iine..."
"hai hai hai hai hai hai hai..."
```

A known failure mode in Transformer-based ASR models, where non-speech sounds (silence, BGM, laughter) cause the decoder to enter a loop attempting to verbalize them. Whisper also suffers from this, but the occurrence rate in Granite was notably higher.

② Early Stopping

Several clips showed extremely short outputs despite long reference transcripts. Likely caused by overlapping speakers or failed VAD (Voice Activity Detection) at transition points, leading the model to incorrectly judge that speech has ended.

③ Proper Noun / Kanji Conversion Errors

| Reference | Granite Output | Note |
|---|---|---|
| マル暴 (police slang) | 丸棒 (round stick) | Phonetically similar, wrong kanji |
| 霜降り明星 (comedian duo) | 下振り明星 | Near-homophone misconversion |

Even when phoneme-level recognition succeeds, the kanji conversion step frequently fails on Japanese slang and proper nouns. Japanese-specific vocabulary handling appears to be underdeveloped.

④ Instability Around Fillers

In multiple clips, important words immediately following fillers ("uh", "um") were dropped entirely. The filler processing appears to disrupt the subsequent recognition context.

Granite Positioning Summary

```text
✅ Good fit
  - News / announcer-style narration
  - Single-speaker clean audio batch processing
  - Edge / low-resource environments (small 1B model footprint)

❌ Poor fit
  - Comedy / variety / emotionally expressive audio
  - Multi-speaker meetings or debates
  - Entertainment content with laughter or BGM
  - Natural conversation with fillers
```

If using Granite in production, you must implement loop detection post-processing (repetition pattern detection and truncation) and output length validation. Using raw output as-is is risky.
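The loop detection and length validation described above can be sketched with a regex that collapses consecutive repeats plus a length sanity check. Function names, the 10-character window, and the thresholds are illustrative choices, not from the benchmark code:

```python
import re

def truncate_repetition(text: str, max_repeats: int = 3) -> str:
    """Collapse any 1-10 character chunk repeated consecutively more
    than max_repeats times down to max_repeats occurrences."""
    pattern = re.compile(r"(.{1,10}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

def looks_truncated(hyp: str, audio_sec: float,
                    min_chars_per_sec: float = 2.0) -> bool:
    """Flag early-stopped outputs that are implausibly short
    for the audio duration."""
    return len(hyp) < audio_sec * min_chars_per_sec

print(truncate_repetition("はいはいはいはいはいはいはい"))  # はいはいはい
print(looks_truncated("はい。", audio_sec=29.0))            # True
```

Clips flagged by either check can then be routed to a fallback model rather than shipped as-is.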


nvidia/parakeet-tdt-0.6b-v3 — Blazing Fast (RTF: 0.002)

RTF=0.002 is the fastest in this benchmark. It processes 29 seconds of audio in roughly 0.06 seconds.

The TDT (Token-and-Duration Transducer) architecture jointly predicts tokens and their durations, allowing the decoder to skip over input frames — this aggressive frame skipping likely contributes to the WER of 0.3465.

Good for:

  • Real-time caption drafts
  • High-volume audio indexing where speed beats accuracy

Not suitable for:

  • Detailed meeting minutes or drama/conversation transcription

reazonspeech-nemo-v2 vs. k2-v2

Both share the ReazonSpeech brand, but show clear differences in output style.

| Feature | nemo-v2 | k2-v2 |
|---|---|---|
| Numbers | 13歳 (Arabic numerals) | 十三歳 (kanji numerals) |
| Latin characters | gps (lowercase) | GPS (uppercase) |
| Speed | RTF=0.020 | RTF=0.026 |
| Truncation pattern | Drops the tail | Drops the middle |

nemo-v2 also demonstrated domain-specific strength, correctly converting the police slang "マル暴 (Marubo)" to proper kanji.


kotoba-whisper-v2.0 — Lowest Score in This Benchmark (WER: 0.5402)

Handles clear, scripted speech adequately, but struggles significantly with natural conversation, emotional delivery, and multiple speakers.

Characteristic issues:

  • Repeats the same phrase when audio is unclear
  • Drops important words immediately after fillers ("etto", "ano", "maa")
  • Fails to transcribe the beginning of audio in some clips

Cross-Model Comparison: Loop / Repetition Behavior

One of the most notable findings across this benchmark was repetition loop behavior observed in multiple models.

| Model | Loop Occurrence | Mitigation |
|---|---|---|
| whisper-large-v3-turbo | Rare (appended hallucination) | VAD pre-processing, no_speech_threshold tuning |
| granite-4.0-1b-speech | Frequent (BGM / laughter triggers) | Post-processing repetition detection is mandatory |
| kotoba-whisper-v2.0 | Occurs on unclear audio | |

Loop behavior typically stems not from model architecture, but from how the training data handles non-speech sounds (laughter, BGM, breath). Japanese variety and entertainment content is particularly prone to triggering this, as it contains high concentrations of such sounds.


Use-Case Recommendations

```text
Maximum accuracy (transcripts, subtitles)
  → qwen3-asr-1.7b or whisper-large-v3-turbo

Japanese media / broadcast
  → reazonspeech-espnet-v2

Speed-first (real-time / batch)
  → parakeet-tdt-0.6b-v3

Readability / written-style output
  → voxtral-mini (language validation required)

Clean audio only / lightweight model needed
  → granite-4.0-1b-speech (pair with loop detection post-processing)

Models to avoid for certain use cases
  → kotoba-whisper-v2.0 (natural conversation)
  → voxtral-mini (without language validation)
  → granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)
```

Reproducibility: The WER Evaluation Pipeline

The most critical design decision in this benchmark is how to compute WER fairly. For Japanese ASR, simply calling jiwer.wer(ref, hyp) is not sufficient — it requires careful normalization and tokenization.

Step 1: Text Normalization (Strip Punctuation, Symbols, Spaces)

Different models punctuate very differently. For example, reazonspeech-espnet-v2 aggressively inserts punctuation such as ! and ?, while reference transcripts often have none. To prevent WER from being unfairly skewed by punctuation differences, we strip all punctuation before scoring.

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC normalization: full-width → half-width, variant unification
    text = unicodedata.normalize("NFKC", text)
    chars: list[str] = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("Z"):  # Separators (spaces, ideographic spaces)
            continue
        if cat.startswith("P"):  # Punctuation (、。!?「」etc.)
            continue
        if cat.startswith("S"):  # Symbols (¥, © etc.)
            continue
        chars.append(ch)
    return "".join(chars).strip()
```

This normalization prevents unfair penalization from surface-level differences like:

| Transform | Example |
|---|---|
| NFKC normalization | １３歳 → 13歳, ㎞ → km |
| Punctuation removal | 行って、叫んで。 → 行って叫んで |
| Space removal | 東京 都 → 東京都 |
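As a quick sanity check, the table's transforms can be reproduced with a condensed version of the normalization function (same logic as the full listing above, compressed for illustration):

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC first (１３歳 → 13歳, ㎞ → km), then drop separators (Z),
    # punctuation (P), and symbols (S) by Unicode general category.
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in "ZPS"
    ).strip()

print(normalize_text("行って、叫んで。"))  # 行って叫んで
print(normalize_text("東京 都"))          # 東京都
print(normalize_text("１３歳"))           # 13歳
```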

Step 2: Japanese Tokenization with MeCab (WER only)

Japanese has no word-boundary spaces, so English-style WER computation doesn't apply directly. We use MeCab morphological analysis (wakati-gaki) to split text into words before computing WER.

```python
import MeCab

_MECAB_TAGGER = None

def _get_mecab_tagger():
    global _MECAB_TAGGER
    if _MECAB_TAGGER is None:
        # -Owakati: outputs space-separated morphemes
        _MECAB_TAGGER = MeCab.Tagger("-Owakati")
    return _MECAB_TAGGER

def _contains_japanese(text: str) -> bool:
    for ch in text:
        code = ord(ch)
        if (
            0x3040 <= code <= 0x30FF  # Hiragana / Katakana
            or 0x4E00 <= code <= 0x9FFF  # CJK Unified Ideographs
            or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        ):
            return True
    return False

def tokenize_for_wer(text: str) -> list[str]:
    # Already space-separated (e.g., English)
    if " " in text:
        return [tok for tok in text.split(" ") if tok]
    # Use MeCab for Japanese
    if _contains_japanese(text):
        tagger = _get_mecab_tagger()
        parsed = str(tagger.parse(text) or "").strip()
        return [w for w in parsed.split() if w]
    # Fallback: character-level
    return [ch for ch in text if not ch.isspace()]
```

Why MeCab? For a word like "行ってきました", naive character splitting produces one token per character. MeCab splits it into morphemes: 行っ / て / き / まし / た — much closer to what a human would count as "word errors."


Step 3: CER Computation (Character Level)

CER is straightforward: split into individual characters, excluding spaces.

```python
def tokenize_for_cer(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]
```

Step 4: Error Rate Calculation with jiwer

Token lists are joined into space-separated strings and passed to jiwer for substitution/deletion/insertion counts.

```python
import jiwer

def compute_error_rate(ref: str, hyp: str, mode: str) -> tuple[float, int, int]:
    ref_n = normalize_text(ref)
    hyp_n = normalize_text(hyp)

    if mode == "wer":
        ref_tokens = tokenize_for_wer(ref_n)
        hyp_tokens = tokenize_for_wer(hyp_n)
    elif mode == "cer":
        ref_tokens = tokenize_for_cer(ref_n)
        hyp_tokens = tokenize_for_cer(hyp_n)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    out = jiwer.process_words(
        " ".join(ref_tokens),
        " ".join(hyp_tokens)
    )
    err = int(out.substitutions + out.deletions + out.insertions)
    total = int(out.hits + out.substitutions + out.deletions)
    return (err / max(total, 1)), err, total
```

Final WER/CER is computed as a corpus-level aggregate (Σerr / Σtotal across all 20 samples), not an average of per-sample WERs. This prevents short clips from having disproportionate influence.
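The difference matters in practice. A sketch of the corpus-level aggregation, where the (errors, tokens) tuples stand in for the per-clip values compute_error_rate returns:

```python
def corpus_error_rate(per_clip: list[tuple[int, int]]) -> float:
    """Corpus-level rate: sum errors and reference tokens across all
    clips, then divide once - short clips get no extra weight."""
    err_sum = sum(err for err, _ in per_clip)
    total_sum = sum(total for _, total in per_clip)
    return err_sum / max(total_sum, 1)

# One 2-word clip with 1 error, one 100-word clip with 5 errors:
clips = [(1, 2), (5, 100)]
print(corpus_error_rate(clips))                   # 6/102 ≈ 0.0588
print(sum(e / t for e, t in clips) / len(clips))  # 0.275 if naively averaged
```

The per-clip average would let the tiny 2-word clip contribute as much as the 100-word clip; corpus-level aggregation avoids that distortion.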


Step 5: RTF Measurement

Each ASR model runs as a subprocess and returns a JSON payload with inference time and audio duration. RTF is their ratio.

```python
# JSON format returned by each model's transcribe.py
{
    "text": "transcribed text here",
    "transcribe_elapsed_sec": 0.34,
    "audio_duration_sec": 29.0,
    "rtf": 0.012
}

# Corpus-level RTF = Σtranscribe_elapsed / Σaudio_duration
rtf = transcribe_elapsed_sum / audio_duration_sum
```

Dependencies

```text
jiwer          # WER/CER computation
MeCab          # Japanese morphological analysis
unicodedata    # Text normalization (Python stdlib)
```

Installing MeCab (Ubuntu):

```bash
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
pip install mecab-python3 jiwer
```

FAQ

Q1. Which model has the best accuracy for Japanese ASR?

In this benchmark, qwen/qwen3-asr-1.7b (WER=0.1899) achieved the highest accuracy. That said, whisper offers comparable stability in noisy or multi-speaker environments. The right choice depends on your use case.

Q2. Should I use WER or CER?

For Japanese, WER can be artificially inflated due to ambiguities in morphological segmentation. Since we use MeCab-based tokenization here, our WER is more reliable than naive character-split WER — but we still recommend CER as the primary metric. In our results, the rankings were nearly identical between the two.

Q3. Which models support real-time transcription?

A practical threshold is RTF < 0.1. In this benchmark, parakeet-tdt-0.6b-v3 (RTF=0.002), whisper (RTF=0.012), nemo-v2 (RTF=0.020), and granite-4.0-1b-speech (RTF=0.061) all qualify. Factor in accuracy tradeoffs — Granite meets the speed bar but has stability issues in noisy conditions.

Q4. Which model best preserves fillers ("uh", "um")?

qwen3-asr-1.7b and whisper drop some fillers but are relatively faithful. reazonspeech-espnet-v2 aggressively removes them and is not suitable when filler preservation matters.

Q5. Which model handles overlapping multi-speaker audio best?

whisper and qwen3-asr-1.7b maintained stable accuracy even in multi-speaker and noisy conditions. ReazonSpeech models tend to drop the tail end of utterances when speakers overlap. Granite and kotoba-whisper are not suitable for multi-speaker audio.

Q6. How do I mitigate Whisper hallucinations?

Three approaches: ① pre-processing with a VAD (Voice Activity Detection) model to strip silent segments, ② post-processing to detect repetitive patterns, ③ tuning the no_speech_threshold parameter. Pay extra attention when processing long-form audio.

Q7. Why did voxtral-mini output what appeared to be Arabic?

The exact cause is unclear, but this is consistent with language-mixing issues common in multilingual models. Automatic language validation of output is strongly recommended for production use.

Q8. Why is granite-4.0-1b-speech so polarized in accuracy?

Granite appears to be built on a primarily English-trained base model with speech recognition capability added. It handles standard Japanese (news, announcements) reasonably well, but in environments with heavy non-speech audio (laughter, BGM, overlapping speech), the decoder is prone to looping. This likely reflects insufficient coverage of such difficult cases in training data.

Q9. Does post-processing to remove loops actually improve WER?

Removing loops does improve CER/WER scores, but the audio content that should have been transcribed is still lost. Early stopping problems persist after loop removal, so it's not a real fix — it's output sanitization, not accuracy improvement. Understand this distinction before relying on it.

Q10. How reliable is this benchmark?

The 20-clip dataset covers diverse genres, but sample size is limited. For domain-specific deployments (medical, legal, etc.), we recommend running additional evaluation on domain-matched data.


Conclusion

Across 9 models evaluated under identical conditions:

  • Best accuracy + stability: qwen3-asr-1.7b > whisper
  • Japanese media / broadcast: reazonspeech-espnet-v2
  • Speed-first: parakeet-tdt-0.6b-v3
  • Clean audio / lightweight: granite-4.0-1b-speech (loop post-processing required)
  • Use with caution: kotoba-whisper-v2.0 (natural conversation), voxtral-mini (without language validation), granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)

What this Granite evaluation reinforced is that picking a model based on average WER alone is dangerous. Granite rivals top models on clean audio, but catastrophically fails under specific conditions. Understanding your actual audio characteristics — and pairing the right model with appropriate pre-processing (noise reduction, speaker separation) and post-processing (hallucination detection, loop removal, language validation) — is what actually determines ASR pipeline quality.

Contact

If you are considering a new project, product development, or other collaboration, please get in touch.
