Japanese ASR Model Comparison 2026: Whisper, Qwen3, Voxtral & ReazonSpeech Benchmarked on RTX5090

::: message
Update History
2026/03/23 Added benchmark results for ibm/granite-4.0-1b-speech.
:::

What you'll learn: first-hand benchmark data comparing 9 Japanese speech recognition (ASR) models under identical conditions. We compare WER, CER, and RTF to answer the question: which model is best for which use case?


Summary

Using an RTX5090, we evaluated 9 ASR models on 20 audio clips (580 seconds total, ~10 minutes) of natural conversational speech. For overall accuracy and stability, openai/whisper-large-v3-turbo and qwen/qwen3-asr-1.7b stand clearly ahead of the pack. If speed is your top priority, nvidia/parakeet-tdt-0.6b-v3 (RTF=0.002) is the winner. For Japanese-specific domain adaptation, reazon-research/reazonspeech-espnet-v2 is a strong contender.

We also added ibm/granite-4.0-1b-speech to this round. It showed impressive accuracy on clean audio, but exhibited a clear weakness in noisy and multi-speaker environments — particularly a hallucination loop problem.

| Model | WER↓ | CER↓ | RTF↓ | Overall |
|---|---|---|---|---|
| qwen3-asr-1.7b | 0.1899 | 0.1473 | 0.034 | ⭐⭐⭐⭐⭐ |
| whisper-large-v3-turbo | 0.2099 | 0.1782 | 0.012 | ⭐⭐⭐⭐⭐ |
| reazonspeech-espnet-v2 | 0.2342 | 0.2007 | 0.589 | ⭐⭐⭐⭐ |
| voxtral-mini-4b | 0.2437 | 0.2187 | 0.207 | ⭐⭐⭐ |
| granite-4.0-1b-speech | 0.4058 | 0.3715 | 0.061 | ⭐⭐⭐ |
| parakeet-tdt-0.6b-v3 | 0.3465 | 0.3235 | 0.002 | ⭐⭐⭐ |
| reazonspeech-nemo-v2 | 0.3546 | 0.3346 | 0.020 | ⭐⭐⭐ |
| reazonspeech-k2-v2 | 0.4636 | 0.4486 | 0.026 | ⭐⭐ |
| kotoba-whisper-v2.0 | 0.5402 | 0.5050 | 0.008 | |

Granite shows a highly polarized behavior — strong on clean audio, brittle on noise. Its average ranking is mid-tier, but it can be a viable option if your domain is well-controlled.


Test Environment & Evaluation Data

Hardware & Settings

```text
GPU: RTX5090
temperature: 0.0
max_new_tokens: 512
language: ja
```

Dataset (20 clips, ~29 seconds each)

What sets this benchmark apart from surface-level roundups is that we used real media audio, not synthetic or read-speech datasets.

  • News narration: Standard announcer-style delivery
  • Entertainment (comedy / variety shows): Fillers, laughter, fast speech
  • Drama / anime: Emotional delivery, slang (e.g., "temee", "babaa!")
  • Business / reality TV: Overlapping multi-speaker segments

By including overlapping speech, laughter, and background noise, we replicated challenging, production-like scenarios.


Metrics Explained

WER (Word Error Rate)

```text
WER = (Substitutions + Deletions + Insertions) / Reference Word Count
```

Lower is better. Japanese text has no word-boundary spaces, so we use MeCab (a morphological analyzer) to tokenize text before computing WER. Punctuation (、。!?) is also stripped before evaluation, since different models punctuate differently. See the "Reproducibility" section for details.

CER (Character Error Rate)

Error rate measured at the character level. Generally considered more reliable than WER for Japanese evaluation. In our results, WER and CER rankings were almost identical, confirming the consistency of both metrics.

RTF (Real-Time Factor)

```text
RTF = Processing Time / Audio Duration
```

RTF < 1.0 means real-time capable. At RTF=0.002, parakeet processes 29 seconds of audio in roughly 0.06 seconds.
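For intuition, here is the formula in numbers, using the parakeet figures from the summary table:

```python
def rtf(processing_sec: float, audio_sec: float) -> float:
    """Real-Time Factor: processing time divided by audio duration."""
    return processing_sec / audio_sec

# parakeet-class speed: ~0.06 s of compute for a 29 s clip
print(round(rtf(0.06, 29.0), 3))  # 0.002
```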


Per-Model Analysis

🥇 qwen/qwen3-asr-1.7b — Highest Accuracy (WER: 0.1899)

The standout feature of this model is its "completion ability."

Rather than simply transcribing sounds, it completes sentences in a way that reads naturally — with punctuation placed at natural breath points, like output from a professional transcriptionist.

Observed behavior:

  • Appropriately drops fillers ("uh", "um") while preserving grammatical particles
  • Only 1 of 20 clips exceeded WER 0.5; all others stayed within 0.2 — exceptional consistency

Caveats:

  • Occasional "mishearing" (creative reinterpretation) of internet slang and proper nouns
  • Average inference time 0.99s (RTF=0.034) — roughly 3x slower than Whisper

🥈 openai/whisper-large-v3-turbo — Reliable All-Rounder (WER: 0.2099)

Its contextual understanding is head and shoulders above the rest.

Segments where ReazonSpeech models fell silent or dropped the tail end of utterances were transcribed near-perfectly by Whisper.

Strengths:

  • Drops fillers while reconstructing grammatical structure
  • Holds up well under multi-speaker and noisy conditions

Weakness: Hallucination

Whisper can generate text where no speech exists. In our test, the word "私" (I/me) was appended at the end without being present in the audio. For long-form audio, a post-processing hallucination detection step is advisable.


reazon-research/reazonspeech-espnet-v2 — Japanese Domain Adaptation (WER: 0.2342)

Trained on Japanese broadcast media data, this model demonstrated the ability to correctly interpret culturally specific terms — for example, distinguishing the colloquial "Tō-yoko" (a youth hangout spot named after the Tōyoko train line) from the literal place name. This is a significant advantage for news and media transcription.

It also actively inserts exclamation (!) and question (?) marks, attempting to capture the energy and emotion of speech.

Weaknesses:

  • RTF=0.589 — the slowest in this benchmark (about 17 seconds to process a 29-second clip)
  • Aggressively drops fillers, making it unsuitable when filler preservation is needed

mistralai/voxtral-mini-4b-realtime-2602 — Readability-First Design (WER: 0.2437)

This model clearly prioritizes readability over verbatim transcription.

Examples:

  • ぐらい → くらい (orthographic normalization)
  • っていうか → というのか (spoken → written register conversion)

Well-suited for subtitle production and meeting minutes where clean, readable output matters more than exact reproduction.

In one file, the model suddenly output what appeared to be Arabic script. Language validation is mandatory before using this in production.
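Language validation can be as simple as checking the Unicode scripts in the output. A minimal sketch (the function name and the script whitelist are illustrative choices, not part of any model's API):

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters outside the scripts expected
    in a Japanese transcript (kana, kanji, Latin)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    expected = ("HIRAGANA", "KATAKANA", "CJK", "LATIN")
    foreign = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith(expected)
    )
    return foreign / len(letters)

print(foreign_script_ratio("GPSの精度"))  # 0.0
print(foreign_script_ratio("مرحبا"))      # 1.0
```

A threshold (e.g. reject outputs above 20% foreign characters) would have caught the Arabic-script output before it reached users; tune the cutoff per domain.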


🆕 ibm/granite-4.0-1b-speech — Polarized Accuracy (WER: 0.4058)

Developed by IBM Research, this is a compact 1B-parameter speech recognition model with a practical RTF of 0.061. However, our evaluation revealed a strongly polarized behavior: excellent on clean audio, severely degraded on noisy or multi-speaker input.

Performance on Clean Audio

On news narration and clear business conversation samples, the model achieved CER around 10% — a level that rivals the top models. Within a controlled domain, Granite can be a competitive option.

Issue: Hallucination and Loop Behavior

On noisy or multi-speaker samples, CER exceeded 70–90% in multiple cases. This "catastrophic failure" on a subset of samples is the primary driver of the low overall average.

Here's a breakdown of the failure modes we observed:

① Repetition Loop

```text
"...iine iine iine iine iine..."
"hai hai hai hai hai hai hai..."
```

A known failure mode in Transformer-based ASR models, where non-speech sounds (silence, BGM, laughter) cause the decoder to enter a loop attempting to verbalize them. Whisper also suffers from this, but the occurrence rate in Granite was notably higher.

② Early Stopping

Several clips showed extremely short outputs despite long reference transcripts. Likely caused by overlapping speakers or failed VAD (Voice Activity Detection) at transition points, leading the model to incorrectly judge that speech has ended.

③ Proper Noun / Kanji Conversion Errors

| Reference | Granite Output | Note |
|---|---|---|
| マル暴 (police slang) | 丸棒 (round stick) | Phonetically similar, wrong kanji |
| 霜降り明星 (comedian duo) | 下振り明星 | Near-homophone misconversion |

Even when phoneme-level recognition succeeds, the kanji conversion step frequently fails on Japanese slang and proper nouns. Japanese-specific vocabulary handling appears to be underdeveloped.

④ Instability Around Fillers

In multiple clips, important words immediately following fillers ("uh", "um") were dropped entirely. The filler processing appears to disrupt the subsequent recognition context.

Granite Positioning Summary

```text
✅ Good fit
  - News / announcer-style narration
  - Single-speaker clean audio batch processing
  - Edge / low-resource environments (small 1B model footprint)

❌ Poor fit
  - Comedy / variety / emotionally expressive audio
  - Multi-speaker meetings or debates
  - Entertainment content with laughter or BGM
  - Natural conversation with fillers
```

If using Granite in production, you must implement loop detection post-processing (repetition pattern detection and truncation) and output length validation. Using raw output as-is is risky.
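The loop detection and length validation described above can be sketched with a regex that collapses consecutive repeats plus a length sanity check. Function names, the 10-character window, and the thresholds are illustrative choices, not from the benchmark code:

```python
import re

def truncate_repetition(text: str, max_repeats: int = 3) -> str:
    """Collapse any 1-10 character chunk repeated consecutively more
    than max_repeats times down to max_repeats occurrences."""
    pattern = re.compile(r"(.{1,10}?)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

def looks_truncated(hyp: str, audio_sec: float,
                    min_chars_per_sec: float = 2.0) -> bool:
    """Flag early-stopped outputs that are implausibly short
    for the audio duration."""
    return len(hyp) < audio_sec * min_chars_per_sec

print(truncate_repetition("はいはいはいはいはいはいはい"))  # はいはいはい
print(looks_truncated("はい。", audio_sec=29.0))            # True
```

Clips flagged by either check can then be routed to a fallback model rather than shipped as-is.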


nvidia/parakeet-tdt-0.6b-v3 — Blazing Fast (RTF: 0.002)

RTF=0.002 is the fastest in this benchmark. It processes 29 seconds of audio in roughly 0.06 seconds.

The TDT (Token-and-Duration Transducer) architecture jointly predicts tokens and their durations, allowing the decoder to skip over input frames — this aggressive frame skipping likely contributes to the WER of 0.3465.

Good for:

  • Real-time caption drafts
  • High-volume audio indexing where speed beats accuracy

Not suitable for:

  • Detailed meeting minutes or drama/conversation transcription

reazonspeech-nemo-v2 vs. k2-v2

Both share the ReazonSpeech brand, but show clear differences in output style.

| Feature | nemo-v2 | k2-v2 |
|---|---|---|
| Numbers | 13歳 (Arabic numerals) | 十三歳 (kanji numerals) |
| Latin characters | gps (lowercase) | GPS (uppercase) |
| Speed | RTF=0.020 | RTF=0.026 |
| Truncation pattern | Drops the tail | Drops the middle |

nemo-v2 also demonstrated domain-specific strength, correctly converting the police slang "マル暴 (Marubo)" to proper kanji.


kotoba-whisper-v2.0 — Lowest Score in This Benchmark (WER: 0.5402)

Handles clear, scripted speech adequately, but struggles significantly with natural conversation, emotional delivery, and multiple speakers.

Characteristic issues:

  • Repeats the same phrase when audio is unclear
  • Drops important words immediately after fillers ("etto", "ano", "maa")
  • Fails to transcribe the beginning of audio in some clips

Cross-Model Comparison: Loop / Repetition Behavior

One of the most notable findings across this benchmark was repetition loop behavior observed in multiple models.

| Model | Loop Occurrence | Mitigation |
|---|---|---|
| whisper-large-v3-turbo | Rare (appended hallucination) | VAD pre-processing, no_speech_threshold tuning |
| granite-4.0-1b-speech | Frequent (BGM / laughter triggers) | Post-processing repetition detection is mandatory |
| kotoba-whisper-v2.0 | Occurs on unclear audio | |

Loop behavior typically stems not from model architecture, but from how the training data handles non-speech sounds (laughter, BGM, breath). Japanese variety and entertainment content is particularly prone to triggering this, as it contains high concentrations of such sounds.


Use-Case Recommendations

```text
Maximum accuracy (transcripts, subtitles)
  → qwen3-asr-1.7b or whisper-large-v3-turbo

Japanese media / broadcast
  → reazonspeech-espnet-v2

Speed-first (real-time / batch)
  → parakeet-tdt-0.6b-v3

Readability / written-style output
  → voxtral-mini (language validation required)

Clean audio only / lightweight model needed
  → granite-4.0-1b-speech (pair with loop detection post-processing)

Models to avoid for certain use cases
  → kotoba-whisper-v2.0 (natural conversation)
  → voxtral-mini (without language validation)
  → granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)
```

Reproducibility: The WER Evaluation Pipeline

The most critical design decision in this benchmark is how to compute WER fairly. For Japanese ASR, simply calling jiwer.wer(ref, hyp) is not sufficient — it requires careful normalization and tokenization.

Step 1: Text Normalization (Strip Punctuation, Symbols, Spaces)

Different models punctuate very differently. For example, reazonspeech-espnet-v2 aggressively inserts punctuation such as ! and ?, while reference transcripts often have none. To prevent WER from being unfairly skewed by punctuation differences, we strip all punctuation before scoring.

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC normalization: full-width → half-width, variant unification
    text = unicodedata.normalize("NFKC", text)
    chars: list[str] = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("Z"):  # Separators (spaces, ideographic spaces)
            continue
        if cat.startswith("P"):  # Punctuation (、。!?「」etc.)
            continue
        if cat.startswith("S"):  # Symbols (¥, © etc.)
            continue
        chars.append(ch)
    return "".join(chars).strip()
```

This normalization prevents unfair penalization from surface-level differences like:

| Transform | Example |
|---|---|
| NFKC normalization | １３歳 → 13歳, ㎞ → km |
| Punctuation removal | 行って、叫んで。 → 行って叫んで |
| Space removal | 東京 都 → 東京都 |
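As a quick sanity check, the table's transforms can be reproduced with a condensed version of the normalization function (same logic as the full listing above, compressed for illustration):

```python
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC first (１３歳 → 13歳, ㎞ → km), then drop separators (Z),
    # punctuation (P), and symbols (S) by Unicode general category.
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in "ZPS"
    ).strip()

print(normalize_text("行って、叫んで。"))  # 行って叫んで
print(normalize_text("東京 都"))          # 東京都
print(normalize_text("１３歳"))           # 13歳
```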

Step 2: Japanese Tokenization with MeCab (WER only)

Japanese has no word-boundary spaces, so English-style WER computation doesn't apply directly. We use MeCab morphological analysis (wakati-gaki) to split text into words before computing WER.

```python
import MeCab

_MECAB_TAGGER = None

def _get_mecab_tagger():
    global _MECAB_TAGGER
    if _MECAB_TAGGER is None:
        # -Owakati: outputs space-separated morphemes
        _MECAB_TAGGER = MeCab.Tagger("-Owakati")
    return _MECAB_TAGGER

def _contains_japanese(text: str) -> bool:
    for ch in text:
        code = ord(ch)
        if (
            0x3040 <= code <= 0x30FF  # Hiragana / Katakana
            or 0x4E00 <= code <= 0x9FFF  # CJK Unified Ideographs
            or 0x3400 <= code <= 0x4DBF  # CJK Extension A
        ):
            return True
    return False

def tokenize_for_wer(text: str) -> list[str]:
    # Already space-separated (e.g., English)
    if " " in text:
        return [tok for tok in text.split(" ") if tok]
    # Use MeCab for Japanese
    if _contains_japanese(text):
        tagger = _get_mecab_tagger()
        parsed = str(tagger.parse(text) or "").strip()
        return [w for w in parsed.split() if w]
    # Fallback: character-level
    return [ch for ch in text if not ch.isspace()]
```

Why MeCab? For a word like "行ってきました", naive character splitting produces one token per character. MeCab splits it into morphemes: 行っ / て / き / まし / た — much closer to what a human would count as "word errors."


Step 3: CER Computation (Character Level)

CER is straightforward: split into individual characters, excluding spaces.

```python
def tokenize_for_cer(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]
```

Step 4: Error Rate Calculation with jiwer

Token lists are joined into space-separated strings and passed to jiwer for substitution/deletion/insertion counts.

```python
import jiwer

def compute_error_rate(ref: str, hyp: str, mode: str) -> tuple[float, int, int]:
    ref_n = normalize_text(ref)
    hyp_n = normalize_text(hyp)

    if mode == "wer":
        ref_tokens = tokenize_for_wer(ref_n)
        hyp_tokens = tokenize_for_wer(hyp_n)
    elif mode == "cer":
        ref_tokens = tokenize_for_cer(ref_n)
        hyp_tokens = tokenize_for_cer(hyp_n)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    out = jiwer.process_words(
        " ".join(ref_tokens),
        " ".join(hyp_tokens)
    )
    err = int(out.substitutions + out.deletions + out.insertions)
    total = int(out.hits + out.substitutions + out.deletions)
    return (err / max(total, 1)), err, total
```

Final WER/CER is computed as a corpus-level aggregate (Σerr / Σtotal across all 20 samples), not an average of per-sample WERs. This prevents short clips from having disproportionate influence.
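The difference matters in practice. A sketch of the corpus-level aggregation, where the (errors, tokens) tuples stand in for the per-clip values compute_error_rate returns:

```python
def corpus_error_rate(per_clip: list[tuple[int, int]]) -> float:
    """Corpus-level rate: sum errors and reference tokens across all
    clips, then divide once - short clips get no extra weight."""
    err_sum = sum(err for err, _ in per_clip)
    total_sum = sum(total for _, total in per_clip)
    return err_sum / max(total_sum, 1)

# One 2-word clip with 1 error, one 100-word clip with 5 errors:
clips = [(1, 2), (5, 100)]
print(corpus_error_rate(clips))                   # 6/102 ≈ 0.0588
print(sum(e / t for e, t in clips) / len(clips))  # 0.275 if naively averaged
```

The per-clip average would let the tiny 2-word clip contribute as much as the 100-word clip; corpus-level aggregation avoids that distortion.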


Step 5: RTF Measurement

Each ASR model runs as a subprocess and returns a JSON payload with inference time and audio duration. RTF is their ratio.

```python
# JSON format returned by each model's transcribe.py
{
    "text": "transcribed text here",
    "transcribe_elapsed_sec": 0.34,
    "audio_duration_sec": 29.0,
    "rtf": 0.012
}

# Corpus-level RTF = Σtranscribe_elapsed / Σaudio_duration
rtf = transcribe_elapsed_sum / audio_duration_sum
```

Dependencies

```text
jiwer          # WER/CER computation
MeCab          # Japanese morphological analysis
unicodedata    # Text normalization (Python stdlib)
```

Installing MeCab (Ubuntu):

```bash
sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
pip install mecab-python3 jiwer
```

FAQ

Q1. Which model has the best accuracy for Japanese ASR?

In this benchmark, qwen/qwen3-asr-1.7b (WER=0.1899) achieved the highest accuracy. That said, whisper offers comparable stability in noisy or multi-speaker environments. The right choice depends on your use case.

Q2. Should I use WER or CER?

For Japanese, WER can be artificially inflated due to ambiguities in morphological segmentation. Since we use MeCab-based tokenization here, our WER is more reliable than naive character-split WER — but we still recommend CER as the primary metric. In our results, the rankings were nearly identical between the two.

Q3. Which models support real-time transcription?

A practical threshold is RTF < 0.1. In this benchmark, parakeet-tdt-0.6b-v3 (RTF=0.002), whisper (RTF=0.012), nemo-v2 (RTF=0.020), and granite-4.0-1b-speech (RTF=0.061) all qualify. Factor in accuracy tradeoffs — Granite meets the speed bar but has stability issues in noisy conditions.

Q4. Which model best preserves fillers ("uh", "um")?

qwen3-asr-1.7b and whisper drop some fillers but are relatively faithful. reazonspeech-espnet-v2 aggressively removes them and is not suitable when filler preservation matters.

Q5. Which model handles overlapping multi-speaker audio best?

whisper and qwen3-asr-1.7b maintained stable accuracy even in multi-speaker and noisy conditions. ReazonSpeech models tend to drop the tail end of utterances when speakers overlap. Granite and kotoba-whisper are not suitable for multi-speaker audio.

Q6. How do I mitigate Whisper hallucinations?

Three approaches: ① pre-processing with a VAD (Voice Activity Detection) model to strip silent segments, ② post-processing to detect repetitive patterns, ③ tuning the no_speech_threshold parameter. Pay extra attention when processing long-form audio.

Q7. Why did voxtral-mini output what appeared to be Arabic?

The exact cause is unclear, but this is consistent with language-mixing issues common in multilingual models. Automatic language validation of output is strongly recommended for production use.

Q8. Why is granite-4.0-1b-speech so polarized in accuracy?

Granite appears to be built on a primarily English-trained base model with speech recognition capability added. It handles standard Japanese (news, announcements) reasonably well, but in environments with heavy non-speech audio (laughter, BGM, overlapping speech), the decoder is prone to looping. This likely reflects insufficient coverage of such difficult cases in training data.

Q9. Does post-processing to remove loops actually improve WER?

Removing loops does improve CER/WER scores, but the audio content that should have been transcribed is still lost. Early stopping problems persist after loop removal, so it's not a real fix — it's output sanitization, not accuracy improvement. Understand this distinction before relying on it.

Q10. How reliable is this benchmark?

The 20-clip dataset covers diverse genres, but sample size is limited. For domain-specific deployments (medical, legal, etc.), we recommend running additional evaluation on domain-matched data.


Conclusion

Across 9 models evaluated under identical conditions:

  • Best accuracy + stability: qwen3-asr-1.7b > whisper
  • Japanese media / broadcast: reazonspeech-espnet-v2
  • Speed-first: parakeet-tdt-0.6b-v3
  • Clean audio / lightweight: granite-4.0-1b-speech (loop post-processing required)
  • Use with caution: kotoba-whisper-v2.0 (natural conversation), voxtral-mini (without language validation), granite-4.0-1b-speech (noisy / emotional / multi-speaker audio)

What this Granite evaluation reinforced is that picking a model based on average WER alone is dangerous. Granite rivals top models on clean audio, but catastrophically fails under specific conditions. Understanding your actual audio characteristics — and pairing the right model with appropriate pre-processing (noise reduction, speaker separation) and post-processing (hallucination detection, loop removal, language validation) — is what actually determines ASR pipeline quality.

Contact

If you are considering a new project, product development, or other collaboration, please get in touch.
