Japanese ASR Models for IT Terminology [2026]: Benchmarking Whisper, Qwen, and More on Real-World Accuracy

This article presents a first-party benchmark evaluating nine ASR models on real audio data containing IT company names, service names, and abbreviations. Unlike general Japanese speech recognition benchmarks, this evaluation takes a practical perspective — assessing whether models can correctly transcribe terms like "openclaw" and "ARR."

Conclusion: Choosing for IT Domain Use Cases

| Use Case | Recommended Model | Reason |
|---|---|---|
| Overall balance | whisper | Best balance of CER, speed, and stability. Consistent English proper noun retention |
| Character accuracy | qwen3-asr-1.7b | CER_EN 0.0589, best across all models. Fewest dropped Japanese characters |
| Real-time processing | nvidia/parakeet | Lowest RTF (0.003). Overwhelmingly the fastest |
| English vocabulary retention | voxtral-mini | CER 0.1354. High retention of English proper nouns even in long utterances |
| Avoid for | kotoba-whisper / ibm/granite | High CER in this IT domain; notable dropping and collapse |

Why an IT Domain-Specific Benchmark Is Necessary

General Japanese ASR benchmarks (CommonVoice, ReazonSpeech corpus, etc.) focus primarily on everyday conversation and read-aloud speech. However, real-world ASR use cases — especially AI and IT-related meetings and podcasts — contain domain-specific vocabulary.

For example, utterances like this:

"Our Go to Market targets enterprise SMBs, aiming for 20 million in ARR while pursuing an LLM-core distribution strategy…"

Terms like Go to Market, ARR, LLM, and distribution are classic examples of vocabulary that general-purpose ASR struggles with.

  • ARR (Annual Recurring Revenue) sounds like "A-R-R" in Japanese, but some models transcribe it as "Earalley" or "Eararisu"
  • SIer (System Integrator) has ambiguous pronunciation in Japanese — "S-I-ya", "S-I-a", etc. — with multiple valid readings
  • English service names (OpenAI, Anthropic, Sansan, DeepSeek, etc.) mix katakana and Roman characters, making output dependent on each model's training data distribution

This benchmark evaluates which ASR models hold up against this "IT domain reality."


Benchmark Design

Test Data

  • 15 audio clips × ~30 seconds, each with a different speaker
  • IT company names (Anthropic, OpenAI, Sansan, DeepSeek, etc.)
  • Service names (including fictional names like "openclaw")
  • IT abbreviations (ARR, LLM, GUI, DHCP, HTML, SIer)
  • Katakana loanwords (distribution, infrastructure)
  • Product names (Windows 95, Mac Mini, PowerPoint, Excel, iPhone)

Evaluation Metrics

| Metric | Description |
|---|---|
| CER | Character Error Rate. Error rate at the character level. Lower is better. Since Japanese is not space-delimited, CER reflects real-world accuracy better than word-level error rate |
| CER_EN | CER after normalizing English tokens to their Japanese readings |
| AvgTx(s) | Average transcription processing time (seconds) |
| RTF | Real Time Factor. Ratio of processing time to audio duration. Below 0.1 is generally considered real-time capable |
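Both core metrics are simple to compute. The following is a minimal sketch (not the benchmark's actual scoring code) using character-level Levenshtein distance:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings at the character level."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance / reference length. Lower is better."""
    return levenshtein(ref, hyp) / len(ref)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real Time Factor: processing time / audio duration. Below ~0.1 is real-time capable."""
    return processing_seconds / audio_seconds
```

Because Japanese text has no word boundaries, the character-level formulation above is what makes CER a better fit than word error rate for this benchmark.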

What is CER_EN normalization?
This is a normalization process that treats English text (e.g., "OpenClaw") and its Japanese katakana equivalent (e.g., "オープンクロー") as the same thing. Even if an ASR writes English words in Roman characters rather than katakana, it counts as correct if the meaning matches. This prevents evaluation scores from fluctuating based purely on stylistic differences in how English is written.

Excerpt from the normalization dictionary:

```json
[
  { "src": "OpenClaw", "dst": ["オープンクロー"] },
  { "src": "Java",     "dst": ["ジャバ", "ジャヴァ"] },
  { "src": "Ruby",     "dst": ["ルビー"] },
  { "src": "SIer",     "dst": ["エスアイヤー", "エスアイアー"] },
  { "src": "DHCP",     "dst": ["ディーエイチシーピー"] },
  { "src": "ARR",      "dst": ["エーアールアール"] }
]
```
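Applying a dictionary of this shape before scoring can be sketched as follows. This is an illustrative assumption of how CER_EN normalization might work, not the benchmark's actual code; token-boundary handling is omitted for brevity:

```python
import json
import re

# Excerpt of the normalization dictionary from the article
NORM = json.loads("""
[ { "src": "OpenClaw", "dst": ["オープンクロー"] },
  { "src": "SIer",     "dst": ["エスアイヤー", "エスアイアー"] },
  { "src": "ARR",      "dst": ["エーアールアール"] } ]
""")

def normalize_en(text: str) -> str:
    """Map each English token and all its katakana variants to one canonical
    form, so Roman-vs-katakana styling no longer counts as an error."""
    for entry in NORM:
        canonical = entry["dst"][0]
        # English spelling -> canonical katakana (case-insensitive; a real
        # implementation would also require token boundaries)
        text = re.sub(re.escape(entry["src"]), canonical, text, flags=re.IGNORECASE)
        # alternative katakana readings -> the same canonical form
        for alt in entry["dst"][1:]:
            text = text.replace(alt, canonical)
    return text
```

Running both the reference and the hypothesis through `normalize_en` before computing CER yields CER_EN: "SIer" and both of its katakana readings collapse to a single string and are scored as identical.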

Full Results

Standard Evaluation (After Punctuation Removal, Ranked by CER)

| ASR | CER↓ | CER_EN↓ | AvgTx(s)↓ | RTF↓ |
|---|---|---|---|---|
| qwen3-asr-1.7b | 0.0820 | 0.0589 | 1.43 | 0.048 |
| voxtral-mini-4b | 0.1354 | 0.1151 | 6.19 | 0.208 |
| whisper | 0.1565 | 0.1339 | 0.47 | 0.016 |
| cohere-transcribe | 0.1701 | 0.1566 | 0.33 | 0.011 |
| parakeet-tdt-0.6b | 0.1960 | 0.1696 | 0.10 | 0.003 |
| reazonspeech-nemo-v2 | 0.2332 | 0.2022 | 0.62 | 0.021 |
| reazonspeech-k2-v2 | 0.2855 | 0.2714 | 0.82 | 0.028 |
| granite-4.0-1b | 0.3542 | 0.3300 | 1.76 | 0.059 |
| kotoba-whisper-v2.0 | 0.6072 | 0.5859 | 0.29 | 0.010 |

qwen3-asr-1.7b ranks #1 by CER
qwen3-asr-1.7b achieves CER=0.0820 and CER_EN=0.0589, outperforming all other models by a wide margin. In terms of "hearing accurately at the character level," it is the top model in this benchmark. However, there is a separate risk of semantic errors — see the detailed analysis below.

After English Normalization (CER_EN Improvement)

| ASR | CER | CER_EN | Improvement |
|---|---|---|---|
| whisper | 0.1565 | 0.1339 | -0.0226 |
| voxtral-mini-4b | 0.1354 | 0.1151 | -0.0203 |
| cohere-transcribe | 0.1701 | 0.1566 | -0.0135 |
| parakeet-tdt-0.6b | 0.1960 | 0.1696 | -0.0264 |
| reazonspeech-nemo-v2 | 0.2332 | 0.2022 | -0.0310 |
| reazonspeech-k2-v2 | 0.2855 | 0.2714 | -0.0141 |
| granite-4.0-1b | 0.3542 | 0.3300 | -0.0242 |
| qwen3-asr-1.7b | 0.0820 | 0.0589 | -0.0231 |
| kotoba-whisper-v2.0 | 0.6072 | 0.5859 | -0.0213 |

Improvement margins across all models fall within the 0.013–0.031 range, meaning the benefit of normalization is limited. This suggests that whether the model outputs English in Roman characters or katakana matters less than whether it correctly heard the word in the first place.


Per-Model Detailed Analysis

whisper (Best Overall Balance)

Score: CER 0.1565 / CER_EN 0.1339 / RTF 0.016

whisper offered the best overall balance in this IT domain benchmark. Product names like Windows 95, GUI, and iPhone were transcribed consistently, and there was no repetitive collapse even in long utterances.

Its weakness is handling the "semantic restoration" of abbreviations. Cases were observed where ARR was written as "Earalley" — phonetically similar but semantically incorrect. That said, this is a common challenge across all models, and whisper is among those with the lowest occurrence rate.


voxtral-mini-4b-realtime (High English Vocabulary Retention)

Score: CER 0.1354 / CER_EN 0.1151 / RTF 0.208

In terms of natural output and information retention, voxtral-mini is on par with whisper. It retains English vocabulary well and is less prone to dropping information mid-utterance in longer text.

However, RTF 0.208 makes it the heaviest model in this benchmark. With an average processing time of around 6 seconds, it is not suited for real-time use cases. It is a good fit for batch transcription or asynchronous processing where accuracy is prioritized, but the cost implications must be understood before adoption.

Some proper nouns showed localized corruption (e.g., DeepSeek → DFC-like output), so noise can still creep in even when English vocabulary is largely retained.


qwen3-asr-1.7b (Best CER / Top Character Accuracy)

Score: CER 0.0820 / CER_EN 0.0589 / RTF 0.048

This model has the most distinctive profile. CER is the best among all models (0.0820), and CER_EN at 0.0589 far outpaces the second-place voxtral (0.1151).

This means it "hears accurately at the character level." The Japanese text output reads naturally, and output length tends to match the original.

⚠️ Watch out for semantic errors

qwen3-asr-1.7b shows noticeable errors where phonetically similar but semantically wrong substitutions occur:

  • SIer → SRE
  • ARR → Eararisu
  • Sansan → Sanson

These are character-level near-misses that are factually wrong from an IT knowledge standpoint. If using this model for meeting minutes or subtitles, always include a downstream LLM correction step.


cohere-transcribe-03-2026 (English Vocabulary Retention + Fast)

Score: CER 0.1701 / CER_EN 0.1566 / RTF 0.011

English product and tag names like PowerPoint, Excel, and HTML are handled reliably, and speed (RTF 0.011) is excellent.

The major weakness is dropping the beginning of long utterances. Rather than transcribing the whole thing, it tends to produce a clean transcription of the latter half while omitting the first. It also has a habit of condensing content, making it unsuitable for use cases requiring faithful verbatim transcription. For extracting key points from meeting notes, this behavior may be acceptable.


nvidia/parakeet-tdt-0.6b-v3 (Speed First)

Score: CER 0.1960 / CER_EN 0.1696 / RTF 0.003

RTF 0.003 is the fastest in this benchmark, making it the strongest candidate for real-time transcription and large-scale batch processing. Repetitive collapse is also minimal, and output stability is high.

Weaknesses in the IT domain involve misreading technical vocabulary:

  • Excel → x7
  • HTML → hShimeru
  • Various corruptions of ARR-type terms

The underlying Japanese sentence structure is preserved, but technical proper nouns are unreliable. A practical design would pair this speed with a downstream LLM correction stage.


reazonspeech-nemo-v2 / k2-v2

nemo-v2: CER 0.2332 / CER_EN 0.2022 — k2-v2: CER 0.2855 / CER_EN 0.2714

Both preserve the basic structure of Japanese text. The ReazonSpeech series has a solid reputation for general Japanese, but abbreviation corruption appears in the IT domain (e.g., HTML → h7).

k2-v2 produces shorter average outputs, with a noticeable tendency to abbreviate or drop content in longer utterances. In utterances containing Go to Market, the first half was largely dropped in some cases.


ibm/granite-4.0-1b-speech (Stability Concerns)

Score: CER 0.3542 / CER_EN 0.3300 / RTF 0.059

Output is readable on good samples, but repetitive collapse in long utterances is pronounced — looping patterns like "so so so so…" were observed. Quality variance is the highest among all models in this benchmark. Reproducibility is uncertain, and additional evaluation is needed before any production deployment.


kotoba-tech/kotoba-whisper-v2.0 (Struggles Under These Conditions)

Score: CER 0.6072 / CER_EN 0.5859 / RTF 0.010

Speed is good (RTF 0.010), but quality under these IT domain conditions was the worst in this benchmark.

  • 12 of 15 samples produced output less than 75% of the original length
  • Repetitive collapse ("Rorocal…", "Mamama…") was observed in multiple samples
  • Proper noun retention was the lowest among all models

Despite sharing the same whisper base, results diverged dramatically from the original whisper. The training data or distillation design appears to be poorly matched to this domain, and the model requires careful use case selection.


Implementation Guidance: How to Integrate IT Domain ASR

Pattern 1: Accuracy-First Pipeline

```text
Audio input
  └─ whisper or voxtral-mini (transcription)
       └─ LLM-based abbreviation and proper noun check (post-processing)
            └─ Final text output
```

whisper alone is stable, but semantic errors remain for IT abbreviations like ARR and SIer. Adding a downstream LLM for contextual correction improves practical accuracy.
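One way to frame the LLM correction step is to hand the raw transcript to a general-purpose LLM alongside a domain glossary. The sketch below only builds the prompt; the glossary contents and prompt wording are illustrative assumptions, and the actual chat-completion call (any provider works) is left out:

```python
# Illustrative glossary of terms this benchmark found error-prone
GLOSSARY = ["ARR (Annual Recurring Revenue)", "SIer", "Go to Market", "LLM", "DHCP"]

def build_correction_prompt(transcript: str) -> str:
    """Build a prompt asking an LLM to repair likely ASR errors on IT terms,
    while explicitly forbidding paraphrasing of the rest of the text."""
    terms = "\n".join(f"- {t}" for t in GLOSSARY)
    return (
        "The following is an ASR transcript of an IT-domain meeting. "
        "Fix only obvious misrecognitions of the terms listed below; "
        "do not rephrase anything else.\n\n"
        f"Known terms:\n{terms}\n\n"
        f"Transcript:\n{transcript}"
    )
```

Constraining the LLM to the glossary keeps the correction step from introducing its own rewordings, which matters when the downstream use case is verbatim minutes.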

Pattern 2: Speed-First Pipeline

```text
Audio input
  └─ parakeet (transcription at RTF 0.003)
       └─ LLM completion for low-confidence tokens
            └─ Final text output
```

parakeet is fast but weak on technical vocabulary. A hybrid approach using LLM correction for outputs like x7 (Excel) and hShimeru (HTML) is practical.
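Before reaching for an LLM at all, the recurring corruptions are cheap to patch with a lookup table. A minimal sketch, where the corruption pairs are the ones observed in this benchmark (a real deployment would maintain and extend this table from logged errors):

```python
# Observed corruption -> intended term, taken from this benchmark's
# parakeet output; extend as new corruptions are logged
FIXUPS = {
    "x7": "Excel",
    "hShimeru": "HTML",
}

def patch_transcript(text: str) -> str:
    """Cheap first-pass repair of known corruptions, applied before
    (or instead of) a heavier LLM correction stage."""
    for bad, good in FIXUPS.items():
        text = text.replace(bad, good)
    return text
```

This catches only exact recurring errors, so it complements rather than replaces the LLM stage, but at RTF 0.003 it preserves parakeet's speed advantage.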

Pattern 3: CER-First (Text Media Use Cases)

```text
Audio input
  └─ qwen3-asr-1.7b (best CER_EN)
       └─ LLM flagging for semantic errors like SIer→SRE
            └─ Subtitle / text output
```

qwen3 has high character accuracy but carries semantic error risk, so pairing it with an IT terminology dictionary and context check is recommended.


Summary: Decision Criteria for IT Domain ASR Selection

```text
Speed is top priority             → parakeet-tdt-0.6b (RTF 0.003)
English vocabulary retention      → voxtral-mini or whisper
Character accuracy (CER)          → qwen3-asr-1.7b (semantic check required)
All-around balance                → whisper (CER 0.1565, high stability)
Long-form / full-text retention   → voxtral-mini (but slow)
Speed-accuracy balance            → cohere-transcribe (watch for front-half dropout)
```

No model offers "perfect IT domain support" — abbreviation and proper noun handling should always be treated as a design problem involving downstream correction. For phonetically ambiguous abbreviations like ARR and SIer in particular, it is more realistic not to expect accurate restoration from ASR alone.


FAQ

Q. What hardware was used to run this benchmark?

An RTX 5090 was used. Absolute RTF values depend on the GPU environment, so numbers will differ in other setups. However, the relative ordering (parakeet fastest, voxtral heaviest) is generally stable across environments.

Q. Which version of whisper was used (large / medium / small)?

This benchmark used whisper large v3 turbo. Scores vary significantly by version, so it is recommended to fix the version when reproducing results.

Q. Where can I get the CER_EN normalization dictionary?

The conversion dictionary used here was created in-house. Terms with multiple valid readings — such as SIer's "S-I-ya" / "S-I-a" — are fully enumerated. Plans to release the dictionary publicly are under consideration.

Q. Which model is best for real-time subtitle generation?

parakeet with RTF 0.003 is the most realistic option. Technical vocabulary corruption remains, but can be mitigated with downstream LLM correction. voxtral-mini at RTF 0.208 is not suited for real-time use.

Q. Why do results differ from general Japanese benchmarks like CommonVoice?

General benchmarks focus on everyday conversation and read-aloud speech, and rarely include IT abbreviations, English-mixed utterances, or technical jargon. A model's general benchmark score and its IT domain score can diverge significantly — the kotoba-whisper results in this benchmark are a prime example.

Q. Can qwen3-asr be expected to improve for IT domain use in the future?

Its exceptionally strong CER indicates high baseline capability. Most semantic errors likely stem from insufficient IT-specific training data, so there is significant room for improvement through fine-tuning or post-processing dictionaries.


Related article: Japanese ASR Benchmark (Previous)

Contact

For project inquiries and collaboration, contact us here.

If you are considering a new project, product development, or other collaboration, please get in touch.

Related Articles

Explore more articles connected to this topic.