Japanese ASR Models for IT Terminology [2026]: Benchmarking Whisper, Qwen, and More on Real-World Accuracy
This article presents an original benchmark of nine ASR models on real audio data containing IT company names, service names, and abbreviations. Unlike general Japanese speech recognition benchmarks, this evaluation takes a practical perspective: can a model correctly transcribe terms like "openclaw" and "ARR"?
Conclusion: Choosing for IT Domain Use Cases
| Use Case | Recommended Model | Reason |
|---|---|---|
| Overall balance | whisper | Best balance of CER, speed, and stability. Consistent English proper noun retention |
| Character accuracy | qwen3-asr-1.7b | CER_EN 0.0589, best across all models. Fewest dropped Japanese characters |
| Real-time processing | nvidia/parakeet | Lowest RTF (0.003). Overwhelmingly the fastest |
| English vocabulary retention | voxtral-mini | CER 0.1354. High retention of English proper nouns even in long utterances |
| Avoid for | kotoba-whisper / ibm/granite | High CER in this IT domain; notable dropping and collapse |
Why an IT Domain-Specific Benchmark Is Necessary
General Japanese ASR benchmarks (CommonVoice, ReazonSpeech corpus, etc.) focus primarily on everyday conversation and read-aloud speech. However, real-world ASR use cases — especially AI and IT-related meetings and podcasts — contain domain-specific vocabulary.
For example, utterances like this:
"Our Go to Market targets enterprise SMBs, aiming for 20 million in ARR while pursuing an LLM-core distribution strategy…"
Terms like Go to Market, ARR, LLM, and distribution are classic examples of vocabulary that general-purpose ASR struggles with.
- ARR (Annual Recurring Revenue) sounds like "A-R-R" in Japanese, but some models transcribe it as "Earalley" or "Eararisu"
- SIer (System Integrator) has ambiguous pronunciation in Japanese — "S-I-ya", "S-I-a", etc. — with multiple valid readings
- English service names (OpenAI, Anthropic, Sansan, DeepSeek, etc.) mix katakana and Roman characters, making output dependent on each model's training data distribution
This benchmark evaluates which ASR models hold up against this "IT domain reality."
Benchmark Design
Test Data
- 15 audio clips × ~30 seconds, each with a different speaker
- IT company names (Anthropic, OpenAI, Sansan, DeepSeek, etc.)
- Service names (including fictional names like "openclaw")
- IT abbreviations (ARR, LLM, GUI, DHCP, HTML, SIer)
- Katakana loanwords (distribution, infrastructure)
- Product names (Windows 95, Mac Mini, PowerPoint, Excel, iPhone)
Evaluation Metrics
| Metric | Description |
|---|---|
| CER | Character Error Rate. Error rate at the character level. Lower is better. Since Japanese is not space-delimited, CER reflects real-world accuracy better than word-level error rate |
| CER_EN | CER after normalizing English tokens to their Japanese readings |
| AvgTx(s) | Average transcription processing time (seconds) |
| RTF | Real Time Factor. Ratio of processing time to audio duration. Below 0.1 is generally considered real-time capable |
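For reference, the two core metrics can be computed in a few lines of Python. This is a minimal sketch of the standard definitions, not the benchmark's exact scoring code (which also strips punctuation before comparison):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance at the character level."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance / reference length. Lower is better."""
    return edit_distance(ref, hyp) / len(ref)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real Time Factor: processing time / audio duration. Below 0.1 is real-time capable."""
    return processing_seconds / audio_seconds
```

Because Japanese text is not space-delimited, the character level is the natural unit here; no tokenizer is needed.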
What is CER_EN normalization?
This is a normalization process that treats English text (e.g., "OpenClaw") and its Japanese katakana equivalent (e.g., "オープンクロー") as the same thing. Even if an ASR writes English words in Roman characters rather than katakana, it counts as correct if the meaning matches. This prevents evaluation scores from fluctuating based purely on stylistic differences in how English is written.
Excerpt from the normalization dictionary:
[
{ "src": "OpenClaw", "dst": ["オープンクロー"] },
{ "src": "Java", "dst": ["ジャバ", "ジャヴァ"] },
{ "src": "Ruby", "dst": ["ルビー"] },
{ "src": "SIer", "dst": ["エスアイヤー", "エスアイアー"] },
{ "src": "DHCP", "dst": ["ディーエイチシーピー"] },
{ "src": "ARR", "dst": ["エーアールアール"] }
]
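One way to apply such a dictionary before scoring is to map every surface form (the English `src` and each katakana `dst`) onto one canonical string, so that script choice no longer affects CER_EN. A minimal sketch, using a truncated copy of the entries above; the benchmark's actual normalization code may differ:

```python
NORM_DICT = [
    {"src": "OpenClaw", "dst": ["オープンクロー"]},
    {"src": "SIer", "dst": ["エスアイヤー", "エスアイアー"]},
    {"src": "ARR", "dst": ["エーアールアール"]},
]

def build_canonical_map(entries):
    """Map every surface form (English src and each katakana dst)
    to a single canonical form, so 'OpenClaw' and 'オープンクロー'
    compare as equal."""
    mapping = {}
    for e in entries:
        canonical = e["dst"][0]
        mapping[e["src"]] = canonical
        for d in e["dst"]:
            mapping[d] = canonical
    return mapping

def normalize(text, mapping):
    # Replace longer keys first so shorter keys cannot clobber substrings.
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text
```

After normalizing both the reference and the hypothesis this way, the regular CER computation yields CER_EN.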
Full Results
Standard Evaluation (After Punctuation Removal, Ranked by CER)
| ASR | CER↓ | CER_EN↓ | AvgTx(s)↓ | RTF↓ |
|---|---|---|---|---|
| qwen3-asr-1.7b | 0.0820 | 0.0589 | 1.43 | 0.048 |
| voxtral-mini-4b | 0.1354 | 0.1151 | 6.19 | 0.208 |
| whisper | 0.1565 | 0.1339 | 0.47 | 0.016 |
| cohere-transcribe | 0.1701 | 0.1566 | 0.33 | 0.011 |
| parakeet-tdt-0.6b | 0.1960 | 0.1696 | 0.10 | 0.003 |
| reazonspeech-nemo-v2 | 0.2332 | 0.2022 | 0.62 | 0.021 |
| reazonspeech-k2-v2 | 0.2855 | 0.2714 | 0.82 | 0.028 |
| granite-4.0-1b | 0.3542 | 0.3300 | 1.76 | 0.059 |
| kotoba-whisper-v2.0 | 0.6072 | 0.5859 | 0.29 | 0.010 |
qwen3-asr-1.7b ranks #1 by CER
qwen3-asr-1.7b achieves CER=0.0820 and CER_EN=0.0589, outperforming all other models by a wide margin. In terms of "hearing accurately at the character level," it is the top model in this benchmark. However, there is a separate risk of semantic errors — see the detailed analysis below.
After English Normalization (Ranked by CER_EN Improvement)
| ASR | CER | CER_EN | Improvement |
|---|---|---|---|
| reazonspeech-nemo-v2 | 0.2332 | 0.2022 | -0.0310 |
| parakeet-tdt-0.6b | 0.1960 | 0.1696 | -0.0264 |
| granite-4.0-1b | 0.3542 | 0.3300 | -0.0242 |
| qwen3-asr-1.7b | 0.0820 | 0.0589 | -0.0231 |
| whisper | 0.1565 | 0.1339 | -0.0226 |
| kotoba-whisper-v2.0 | 0.6072 | 0.5859 | -0.0213 |
| voxtral-mini-4b | 0.1354 | 0.1151 | -0.0203 |
| reazonspeech-k2-v2 | 0.2855 | 0.2714 | -0.0141 |
| cohere-transcribe | 0.1701 | 0.1566 | -0.0135 |
Improvement margins across all models fall within the 0.013–0.031 range, meaning the benefit of normalization is limited. This suggests that whether the model outputs English in Roman characters or katakana matters less than whether it correctly heard the word in the first place.
Per-Model Detailed Analysis
whisper (Recommended: Overall Balance)
Score: CER 0.1565 / CER_EN 0.1339 / RTF 0.016
whisper offered the best overall balance in this IT domain benchmark. Product names like Windows 95, GUI, and iPhone were transcribed consistently, and there was no repetitive collapse even in long utterances.
Its weakness is handling the "semantic restoration" of abbreviations. Cases were observed where ARR was written as "Earalley" — phonetically similar but semantically incorrect. That said, this is a common challenge across all models, and whisper is among those with the lowest occurrence rate.
voxtral-mini-4b-realtime (High English Vocabulary Retention)
Score: CER 0.1354 / CER_EN 0.1151 / RTF 0.208
In terms of natural output and information retention, voxtral-mini is on par with whisper. It retains English vocabulary well and is less prone to dropping information mid-utterance in longer text.
However, RTF 0.208 makes it the heaviest model in this benchmark. With an average processing time of around 6 seconds, it is not suited for real-time use cases. It is a good fit for batch transcription or asynchronous processing where accuracy is prioritized, but the cost implications must be understood before adoption.
Some proper nouns showed localized corruption (e.g., DeepSeek → DFC-like output), so noise can still creep in even when English vocabulary is largely retained.
qwen3-asr-1.7b (Best CER / Top Character Accuracy)
Score: CER 0.0820 / CER_EN 0.0589 / RTF 0.048
This model has the most distinctive profile. CER is the best among all models (0.0820), and CER_EN at 0.0589 far outpaces the second-place voxtral (0.1151).
This means it "hears accurately at the character level." The Japanese text output reads naturally, and output length tends to match the original.
⚠️ Watch out for semantic errors
qwen3-asr-1.7b shows noticeable errors where phonetically similar but semantically wrong substitutions occur:
- SIer → SRE
- ARR → Eararisu
- Sansan → Sanson

These are character-level near-misses that are factually wrong from an IT knowledge standpoint. If using this model for meeting minutes or subtitles, always include a downstream LLM correction step.
cohere-transcribe-03-2026 (English Vocabulary Retention + Fast)
Score: CER 0.1701 / CER_EN 0.1566 / RTF 0.011
English product and tag names like PowerPoint, Excel, and HTML are handled reliably, and speed (RTF 0.011) is excellent.
The major weakness is dropping the beginning of long utterances. Rather than transcribing the whole thing, it tends to produce a clean transcription of the latter half while omitting the first. It also has a habit of condensing content, making it unsuitable for use cases requiring faithful verbatim transcription. For extracting key points from meeting notes, this behavior may be acceptable.
nvidia/parakeet-tdt-0.6b-v3 (Speed First)
Score: CER 0.1960 / CER_EN 0.1696 / RTF 0.003
RTF 0.003 is the fastest in this benchmark, making it the strongest candidate for real-time transcription and large-scale batch processing. Repetitive collapse is also minimal, and output stability is high.
Weaknesses in the IT domain involve misreading technical vocabulary:
- Excel → x7
- HTML → hShimeru
- Various corruptions of ARR-type terms
The underlying Japanese sentence structure is preserved, but technical proper nouns are unreliable. A practical design would pair this speed with a downstream LLM correction stage.
reazonspeech-nemo-v2 / k2-v2
nemo-v2: CER 0.2332 / CER_EN 0.2022 — k2-v2: CER 0.2855 / CER_EN 0.2714
Both preserve the basic structure of Japanese text. The ReazonSpeech series has a solid reputation for general Japanese, but abbreviation corruption appears in the IT domain (e.g., HTML → h7).
k2-v2 produces shorter average outputs, with a noticeable tendency to abbreviate or drop content in longer utterances. In utterances containing Go to Market, the first half was largely dropped in some cases.
ibm/granite-4.0-1b-speech (Stability Concerns)
Score: CER 0.3542 / CER_EN 0.3300 / RTF 0.059
Output is readable on good samples, but repetitive collapse in long utterances is pronounced — looping patterns like "so so so so…" were observed. Quality variance is the highest among all models in this benchmark. Reproducibility is uncertain, and additional evaluation is needed before any production deployment.
kotoba-tech/kotoba-whisper-v2.0 (Struggles Under These Conditions)
Score: CER 0.6072 / CER_EN 0.5859 / RTF 0.010
Speed is good (RTF 0.010), but quality under these IT domain conditions was the worst in this benchmark.
- 12 of 15 samples produced output less than 75% of the original length
- Repetitive collapse ("Rorocal…", "Mamama…") was observed in multiple samples
- Proper noun retention was the lowest among all models
Despite sharing the same whisper base, results diverged dramatically from the original whisper. The training data or distillation design appears to be poorly matched to this domain, and the model requires careful use case selection.
Implementation Guidance: How to Integrate IT Domain ASR
Pattern 1: Accuracy-First Pipeline
Audio input
└─ whisper or voxtral-mini (transcription)
└─ LLM-based abbreviation and proper noun check (post-processing)
└─ Final text output
whisper alone is stable, but semantic errors remain for IT abbreviations like ARR and SIer. Adding a downstream LLM for contextual correction improves practical accuracy.
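The LLM check in this pattern could be driven by a prompt like the following. This is a hypothetical sketch: `KNOWN_TERMS`, the helper name, and the prompt wording are all illustrative rather than what this benchmark used, and the actual model call is left to whichever chat API you deploy.

```python
# Hypothetical terminology list; in practice, seed it from your own domain glossary.
KNOWN_TERMS = ["ARR", "SIer", "LLM", "GUI", "DHCP", "HTML", "OpenAI", "Anthropic"]

def build_correction_prompt(transcript: str) -> str:
    """Build a prompt asking an LLM to fix ASR mis-hearings of
    IT abbreviations without otherwise rewriting the text."""
    terms = ", ".join(KNOWN_TERMS)
    return (
        "The following is an ASR transcript of an IT-domain recording. "
        f"It may contain mis-heard versions of these terms: {terms}. "
        "Correct only clear mis-hearings of those terms; do not "
        "paraphrase, shorten, or otherwise rewrite anything else.\n\n"
        f"Transcript:\n{transcript}"
    )
```

Constraining the LLM to term-level fixes matters: an unconstrained "clean this up" prompt tends to paraphrase, which defeats verbatim transcription.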
Pattern 2: Speed-First Pipeline
Audio input
└─ parakeet (transcription at RTF 0.003)
└─ LLM completion for low-confidence tokens
└─ Final text output
parakeet is fast but weak on technical vocabulary. A hybrid approach using LLM correction for outputs like x7 (Excel) and hShimeru (HTML) is practical.
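Before reaching for a full LLM pass, the recurring corruptions can also be patched with a plain substitution table. A sketch seeded with the pairs observed in this benchmark; note that a blind string replace like this is only safe for outputs (such as "x7") that cannot plausibly occur as legitimate text, so a real deployment would grow this table from error analysis and add context checks:

```python
# Corrupted ASR output -> intended term, as observed with parakeet in this benchmark.
KNOWN_CORRUPTIONS = {
    "x7": "Excel",
    "hShimeru": "HTML",
}

def patch_known_corruptions(text: str) -> str:
    """Apply fixed, high-confidence substitutions before any LLM stage."""
    for bad, good in KNOWN_CORRUPTIONS.items():
        text = text.replace(bad, good)
    return text
```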
Pattern 3: CER-First (Text Media Use Cases)
Audio input
└─ qwen3-asr-1.7b (best CER_EN)
└─ LLM flagging for semantic errors like SIer→SRE
└─ Subtitle / text output
qwen3 has high character accuracy but carries semantic error risk, so pairing it with an IT terminology dictionary and context check is recommended.
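The flagging step can start as simply as scanning for terms known to be confused with one another. A sketch: `CONFUSABLE_PAIRS` is seeded from the qwen3 errors above, and a real system would add contextual checks (via an LLM or a terminology dictionary), since a term like "SRE" can also be a correct transcription in its own right:

```python
# (observed output, possibly intended term) pairs, from this benchmark's qwen3 errors.
CONFUSABLE_PAIRS = [
    ("SRE", "SIer"),
    ("Sanson", "Sansan"),
]

def flag_suspects(transcript: str):
    """Return (found, possibly_intended) pairs worth human or LLM review."""
    return [(found, intended) for found, intended in CONFUSABLE_PAIRS
            if found in transcript]
```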
Summary: Decision Criteria for IT Domain ASR Selection
- Speed is top priority → parakeet-tdt-0.6b (RTF 0.003)
- English vocabulary retention → voxtral-mini or whisper
- Character accuracy (CER) → qwen3-asr-1.7b (semantic check required)
- All-around balance → whisper (CER 0.1565, high stability)
- Long-form / full-text retention → voxtral-mini (but slow)
- Speed-accuracy balance → cohere-transcribe (watch for front-half dropout)
No model offers "perfect IT domain support" — abbreviation and proper noun handling should always be treated as a design problem involving downstream correction. For phonetically ambiguous abbreviations like ARR and SIer in particular, it is more realistic not to expect accurate restoration from ASR alone.
FAQ
Q. What hardware was used to run this benchmark?
An RTX 5090 was used. Absolute RTF values depend on the GPU environment, so numbers will differ in other setups. However, the relative ordering (parakeet fastest, voxtral heaviest) is generally stable across environments.
Q. Which version of whisper was used (large / medium / small)?
This benchmark used whisper large v3 turbo. Scores vary significantly by version, so it is recommended to fix the version when reproducing results.
Q. Where can I get the CER_EN normalization dictionary?
The conversion dictionary used here was created in-house. Terms with multiple valid readings — such as SIer's "S-I-ya" / "S-I-a" — are fully enumerated. Plans to release the dictionary publicly are under consideration.
Q. Which model is best for real-time subtitle generation?
parakeet with RTF 0.003 is the most realistic option. Technical vocabulary corruption remains, but can be mitigated with downstream LLM correction. voxtral-mini at RTF 0.208 is not suited for real-time use.
Q. Why do results differ from general Japanese benchmarks like CommonVoice?
General benchmarks focus on everyday conversation and read-aloud speech, and rarely include IT abbreviations, English-mixed utterances, or technical jargon. A model's general benchmark score and its IT domain score can diverge significantly — the kotoba-whisper results in this benchmark are a prime example.
Q. Can qwen3-asr be expected to improve for IT domain use in the future?
Its exceptionally strong CER indicates high baseline capability. Most semantic errors likely stem from insufficient IT-specific training data, so there is significant room for improvement through fine-tuning or post-processing dictionaries.