Japanese ASR Model Comparison 2026: Whisper, Qwen3, Voxtral & ReazonSpeech Benchmarked on RTX5090
In a benchmark of 8 Japanese ASR models run under identical conditions on an RTX5090, qwen/qwen3-asr-1.7b (WER: 0.1899) and whisper (WER: 0.2099) lead in both accuracy and stability. For speed-critical workloads, parakeet-tdt-0.6b-v3 (RTF: 0.002) is the fastest by a wide margin, while reazonspeech-espnet-v2 is the strongest choice for Japanese broadcast and media content. To keep the comparison fair across models, WER was computed with MeCab morphological tokenization and punctuation stripping applied before scoring.
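
Because Japanese text has no spaces between words, WER depends heavily on how the text is split into tokens. The following is a minimal sketch of how such a score could be computed, not the exact script used for this benchmark. The function names (`strip_punctuation`, `edit_distance`, `wer`, `mecab_tokenizer`) are hypothetical, the choice to strip punctuation before tokenizing is an assumption, and `mecab_tokenizer` assumes the `mecab-python3` package with a dictionary such as `unidic-lite` is installed.

```python
import unicodedata
from typing import Callable, List


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str,
        tokenize: Callable[[str], List[str]]) -> float:
    """WER = token edit distance / number of reference tokens."""
    ref = tokenize(strip_punctuation(reference))
    hyp = tokenize(strip_punctuation(hypothesis))
    if not ref:
        raise ValueError("reference has no tokens after normalization")
    return edit_distance(ref, hyp) / len(ref)


def mecab_tokenizer() -> Callable[[str], List[str]]:
    """Build a MeCab-backed tokenizer (assumes mecab-python3 is installed)."""
    import MeCab  # imported lazily so the rest of the sketch runs without it

    tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens
    return lambda text: tagger.parse(text).split()
```

With MeCab installed, a score would be obtained with `wer(reference, hypothesis, mecab_tokenizer())`. Passing the tokenizer as an argument keeps the scoring logic independent of the tokenizer, so the same function can be checked with pre-segmented text and `str.split`.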