Are WER and CER Not Enough? Re-evaluating Japanese ASR with Noun and Proper Noun F1 Scores
This article is a first-hand benchmark report: using Sudachi-based morphological analysis, we implement an F1-score evaluation that quantifies the kanji and proper-noun recognition accuracy that WER and CER alone struggle to capture, and re-evaluate 9 open-source ASR models.
Summary: What You'll Learn
In Japanese ASR (Automatic Speech Recognition) accuracy evaluation, WER (Word Error Rate) and CER (Character Error Rate) are widely used as standard metrics. However, "how critical a single kanji substitution or a proper noun error is in practice" is difficult to reflect in these metrics.
To address this, we used the morphological analysis library Sudachi to extract "Common Nouns" and "Proper Nouns" from ASR output, then re-evaluated each using Precision, Recall, and F1 scores. The target models are the same 9 models from our previously published IT-domain benchmark.
Key findings:
- Qwen3-ASR, which leads on CER, falls behind Whisper on proper noun F1 (0.58 vs 0.60)
- Whisper ranks mid-range on CER but tops the field on both common noun and proper noun F1
- Judging by CER alone risks mistakenly underrating Whisper
- Parakeet and ReazonSpeech variants score 0.15–0.25 on proper noun F1, making them unsuitable for professional use involving technical terminology
Why WER and CER Alone Are Insufficient for Japanese ASR Evaluation
In English ASR evaluation, WER (Word Error Rate) is the standard, but Japanese has no "word segmentation" (spaces between words), making word boundaries ambiguous. As a result, CER (Character Error Rate) has become the de facto standard for Japanese evaluation.
However, CER has a fundamental problem of its own.
| Reference Text | ASR Output | CER |
|---|---|---|
| "会議の議事録を作成する" (Create meeting minutes) | "会議の議事論を作成する" (1 char error) | 0.09 |
| "田中一郎が承認した" (Ichiro Tanaka approved) | "多中一郎が承認した" (1 char error) | 0.11 |
The CER is similar in both cases, but the real-world impact is completely different:
- "議事録 → 議事論": Recoverable from context to some degree
- "田中一郎 → 多中一郎": A person name error is a serious mistake legally and operationally
Furthermore, a sentence with a missing proper noun and one with a wrong particle may have similar CER scores, but the information loss is vastly different.
"Implemented using Anthropic's Claude" → "Implemented using Claude"
Within a longer utterance, the CER penalty for this deletion is only a few percent. Yet the company name, a primary piece of information, has vanished.
Motivated by this problem, we implemented a noun- and proper-noun-specific F1 evaluation metric.
Evaluation Methodology
Text Normalization
Before noun matching, both the reference text and ASR output are normalized. This includes NFKC Unicode normalization to unify full-width/half-width variants, punctuation removal, and standardization of kanji numerals and numeric expressions. This preprocessing prevents unfair score degradation due to surface-form variation.
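A minimal sketch of this normalization step using only the standard library (the kanji-numeral standardization mentioned above is omitted here for brevity; `normalize_text` is our illustrative name):

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize a reference or ASR string before noun matching."""
    # NFKC unifies full-width/half-width variants, e.g. "ＡＰＩ" -> "API"
    text = unicodedata.normalize("NFKC", text)
    # Drop Japanese and ASCII punctuation plus whitespace; only the
    # token content should be compared, not surface-form decoration.
    return re.sub(r"[、。・「」『』!-/:-@\[-`{-~\s]", "", text)
```

Running both the reference and the hypothesis through the same function ensures that a full-width "ＡＰＩ" and a half-width "API" count as a match rather than a spurious error.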
WER / CER Calculation
We use jiwer's edit-distance-based implementation: the total number of substitution, deletion, and insertion errors is divided by the number of units in the reference (tokens for WER, characters for CER). Since Japanese has no space delimiters, WER is calculated by first tokenizing with MeCab before token comparison; CER is computed character by character.
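The article's pipeline relies on jiwer, but the underlying computation is plain Levenshtein distance; a dependency-free sketch of the same CER calculation (function names are ours):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs for S/D/I errors."""
    dp = list(range(len(hyp) + 1))  # one-row DP table
    for i, rc in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (rc != hc))  # substitution
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit errors divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For the minutes example above, `cer("会議の議事録を作成する", "会議の議事論を作成する")` is one substitution over eleven characters, about 0.09.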
Noun Extraction (Sudachi)
Using Sudachi's part-of-speech information (UniDic-based), each morpheme is classified: 名詞-普通名詞 with subcategory 一般 or サ変可能 counts as a common noun, and 名詞-固有名詞 counts as a proper noun (including all subcategories such as personal names, place names, and organization names).
Rather than using the dictionary-normalized form, we use the surface form for extracted tokens. The normalized form sometimes converts English-mixed terms in unintended ways, whereas the surface form lets us evaluate "exactly what the ASR actually transcribed."
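A sketch of the classification logic, written over `(surface, pos)` pairs so it stays self-contained; in the actual pipeline these pairs would come from SudachiPy's `Morpheme.surface()` and `Morpheme.part_of_speech()`, and `classify_nouns` is our illustrative name:

```python
# In the real pipeline the pairs come from SudachiPy, roughly:
#   tok = dictionary.Dictionary(dict="full").create()
#   morphemes = [(m.surface(), m.part_of_speech())
#                for m in tok.tokenize(text, tokenizer.Tokenizer.SplitMode.C)]

def classify_nouns(morphemes):
    """Split (surface, pos_tuple) pairs into common and proper nouns."""
    common, proper = [], []
    for surface, pos in morphemes:
        if pos[0] != "名詞":
            continue  # not a noun at all
        if pos[1] == "固有名詞":
            # all subcategories (人名, 地名, 組織名, ...) count as proper nouns
            proper.append(surface)
        elif pos[1] == "普通名詞" and pos[2] in ("一般", "サ変可能"):
            common.append(surface)
    return common, proper
```

Note that the surface form, not the dictionary-normalized form, is what gets appended, matching the design decision described above.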
Precision, Recall, and F1 in the ASR Context
- Precision: Of the nouns output by the ASR, the proportion that also appear in the reference text. Measures "how many spurious or hallucinated nouns does the ASR produce?"
- Recall: Of the nouns in the reference text, the proportion the ASR successfully captured. Measures "how many important nouns does the ASR miss?"
- F1 Score: The harmonic mean of Precision and Recall. Ranges from 0 to 1, where 1.0 is a perfect match.
F1 Score Calculation (Multiset Aggregation)
Noun sequences are tracked using a Counter (multiset) that preserves occurrence counts rather than a plain set. True positives are computed from the intersection of the reference and prediction multisets, so repeated mentions of the same noun are counted correctly — missing a noun that appears multiple times is penalized proportionally.
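The per-sample calculation can be sketched in a few lines with `collections.Counter` (the function name `noun_f1` is ours):

```python
from collections import Counter

def noun_f1(ref_nouns: list[str], hyp_nouns: list[str]) -> float:
    """Per-sample F1 over noun occurrences (multiset, not set)."""
    if not ref_nouns or not hyp_nouns:
        return 0.0
    ref_c, hyp_c = Counter(ref_nouns), Counter(hyp_nouns)
    # Counter '&' keeps the minimum count per key: a noun mentioned
    # twice in the reference must be produced twice for full credit.
    tp = sum((ref_c & hyp_c).values())
    if tp == 0:
        return 0.0
    precision = tp / len(hyp_nouns)
    recall = tp / len(ref_nouns)
    return 2 * precision * recall / (precision + recall)
```

With a plain `set` intersection, an ASR that captures one of two mentions of "議事録" would score the same as one that captures both; the multiset version penalizes the missed second mention.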
Corpus-Level Aggregation (Micro-Averaging)
For benchmark summaries, rather than averaging per-sample F1 scores, we use micro-averaging: summing true positives, reference noun counts, and predicted noun counts across all samples, then recomputing Precision and Recall. This represents "how many nouns were correctly captured across the entire corpus" on an absolute noun count basis, so results are not distorted by utterance count when utterances vary in noun density.
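The corpus-level aggregation described above can be sketched as follows (again with illustrative names; each sample is a pair of reference and predicted noun lists):

```python
from collections import Counter

def micro_prf1(samples):
    """Micro-averaged Precision/Recall/F1 over (ref_nouns, hyp_nouns) pairs."""
    tp = ref_total = hyp_total = 0
    for ref_nouns, hyp_nouns in samples:
        # multiset intersection: repeated mentions matched per occurrence
        tp += sum((Counter(ref_nouns) & Counter(hyp_nouns)).values())
        ref_total += len(ref_nouns)
        hyp_total += len(hyp_nouns)
    precision = tp / hyp_total if hyp_total else 0.0
    recall = tp / ref_total if ref_total else 0.0
    denom = precision + recall
    return precision, recall, 2 * precision * recall / denom if denom else 0.0
```

Because counts are summed before Precision and Recall are recomputed, a noun-dense utterance contributes proportionally more than a short one, which is the intended behavior here.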
Benchmark Results: WER, CER, and Noun F1 Comparison
| ASR Model | WER↓ | CER↓ | CN_F1↑ | PN_F1↑ | RTF↓ |
|---|---|---|---|---|---|
| openai/whisper | 0.3256 | 0.1565 | 0.8831 | 0.6000 | 0.016 |
| mistralai/voxtral-mini-4b | 0.4220 | 0.1354 | 0.8821 | 0.5854 | 0.231 |
| coherelabs/cohere-transcribe | 0.4837 | 0.1701 | 0.8658 | 0.5652 | 0.011 |
| nvidia/parakeet-tdt-0.6b-v3 | 0.5236 | 0.1957 | 0.7810 | 0.2222 | 0.003 |
| reazon/reazonspeech-nemo-v2 | 0.5352 | 0.2327 | 0.7576 | 0.1500 | 0.021 |
| reazon/reazonspeech-k2-v2 | 0.5838 | 0.2852 | 0.7959 | 0.2500 | 0.028 |
| ibm/granite-4.0-1b-speech | 0.6149 | 0.3545 | 0.6435 | 0.1500 | 0.060 |
| qwen/qwen3-asr-1.7b | 0.6418 | 0.0823 | 0.8642 | 0.5778 | 0.047 |
| kotoba-tech/kotoba-whisper-v2.0 | 0.8695 | 0.6072 | 0.4621 | 0.1818 | 0.010 |
CN_F1: Common Noun F1 / PN_F1: Proper Noun F1 / RTF: Real Time Factor (lower is faster)
Deep Dive: Why CER Rankings Reverse
Whisper vs Qwen3-ASR: The CER–PN_F1 Inversion
The most striking finding is the inversion between CER and PN_F1 rankings.
| Metric | Whisper | Qwen3-ASR | Difference |
|---|---|---|---|
| CER | 0.1565 (3rd of 9) | 0.0823 (1st) | Qwen ~50% lower |
| PN_F1 | 0.6000 (1st) | 0.5778 (3rd) | Whisper +0.02 |
| CN_F1 | 0.8831 (1st) | 0.8642 (4th) | Whisper marginally ahead |
Qwen3-ASR makes very few mistakes on particles, conjunctions, and hiragana, giving it high character-level accuracy and thus a low CER. For proper nouns, however, cases where the output collapses to a phonetically similar but different word are noticeable — and in surface-form-based evaluation, those outputs do not match the reference.
Whisper, trained on multilingual data, handles katakana and mixed-script proper nouns well and tends to preserve company and service names intact. This aligns with my own day-to-day experience — the intuition that "Whisper is strong on proper nouns" is now backed by numbers.
Practical interpretation: For use cases where "who said what, from which company, about which service" matters — such as meeting transcriptions or interview notes — relying solely on CER risks undervaluing Whisper.
The Speed King Parakeet's Hidden Weakness
NVIDIA's Parakeet (RTF=0.003) offers overwhelming inference speed, but its proper noun F1 of 0.2222 ranks near the bottom. Trained primarily on English data, it struggles with Japanese proper nouns and technical terms. It is well-suited to scenarios where speed is the top priority — real-time subtitles, large-scale batch preprocessing — but is not appropriate for transcriptions where personal names and company names must be accurate.
Voxtral Mini's CN_F1 Score
On common noun F1 (CN_F1), Whisper leads at 0.8831 with Voxtral Mini close behind at 0.8821 — essentially tied. The gap is just 0.001, meaning the two models are nearly equivalent at recognizing general technical vocabulary. However, Voxtral Mini's processing speed (RTF=0.231) is significantly slower than Whisper's (0.016), which strengthens the case for choosing Whisper when accuracy is comparable.
Recommended Models by Use Case (Incorporating Noun F1)
| Use Case | Recommended Model | Reason |
|---|---|---|
| Meeting minutes with names & organizations | Whisper | Highest PN_F1 (0.60), CN_F1 also #1 (0.88) |
| High-accuracy character-level transcription | Qwen3-ASR | Lowest CER (0.082), solid common noun score (0.86) |
| Real-time / live captions | Parakeet | RTF=0.003; note PN_F1 caveat |
| All-around balance | Whisper | Consistently top-tier across CER, CN_F1, and PN_F1 |
Future Directions: Beyond Noun F1
1. Semantic CER
Current CER requires exact character match. But "コーヒー", "珈琲", and "coffee" are synonymous and should ideally incur no penalty. Candidates include surface-variation absorption using morphological analysis with a thesaurus (e.g., WordNet), or semantic similarity correction via multilingual embedding models.
2. Term Hit Rate
Prepare a domain-specific term list in advance and measure the proportion of terms appearing in the reference that the ASR correctly transcribed. Particularly valuable for medical, legal, and technical ASR — enabling targeted evaluation such as "disease name hit rate" or "legal article number hit rate."
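A hypothetical sketch of what such a metric could look like (the function `term_hit_rate` and its matching rule, plain substring containment after normalization, are our assumptions, not an existing implementation):

```python
def term_hit_rate(terms, reference: str, hypothesis: str):
    """Fraction of domain terms present in the reference that the
    ASR output also contains. Returns None if no listed term occurs."""
    # only terms that actually occur in the reference are scored
    relevant = [t for t in terms if t in reference]
    if not relevant:
        return None
    hits = sum(1 for t in relevant if t in hypothesis)
    return hits / len(relevant)
```

A production version would likely match on normalized text and handle term variants, but even this simple form answers "of the terms we care about, how many survived transcription?"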
3. Dataset Expansion
The current benchmark covers only 15 utterances. We plan to add audio samples containing medical and legal terminology, and to introduce domain-specific subsets for more granular evaluation.
With a small dataset, recognition differences in 1–2 utterances can significantly swing scores. Treat the numbers in this benchmark as indicative, and supplement with small-scale validation on your own data for practical deployment decisions.
Conclusion
For Japanese ASR evaluation, WER and CER are necessary but not sufficient. When proper noun and technical term recognition accuracy is the practical bottleneck, supplementary metrics like the noun F1 score introduced here are essential.
The key design decisions: normalize both texts with NFKC + punctuation removal before comparison; use Sudachi's POS tags to separately extract 名詞-普通名詞 and 名詞-固有名詞; use surface forms rather than dictionary-normalized forms as the evaluation target; and aggregate at the corpus level using micro-averaging (summing TP, reference count, and prediction count across all samples before recomputing Precision and Recall).
FAQ
Q. Which should I use — WER or CER?
CER is more appropriate for Japanese ASR. Because Japanese lacks word delimiters, word boundary definitions are ambiguous and WER results fluctuate depending on library settings. In our implementation, Japanese WER is computed after MeCab tokenization, but MeCab's segmentation itself introduces variability. Since CER alone cannot capture the severity of noun misrecognition, we recommend using it alongside the F1 metrics introduced in this article.
Q. Would using MeCab or Janome instead of Sudachi produce the same results?
No — differences in dictionaries will affect results. Proper noun extraction accuracy is highest with Sudachi (SudachiDict-full). MeCab with ipadic has limited registration of neologisms and company names, so many proper nouns may not be classified as such. Additionally, our implementation depends on Sudachi's POS taxonomy (UniDic-based), so swapping in another tool changes the granularity of POS classification and will not yield the same scores.
Q. Does it matter much whether micro- or macro-averaging (mean of per-sample F1) is used?
It can. Macro-averaging gives every utterance equal weight, so a single proper noun error in a short utterance (with few nouns) can drive that sample's F1 to zero and drag down the corpus-level score. Micro-averaging weights results by actual noun counts, so utterances with more nouns have proportionally greater influence. Micro-averaging is appropriate when the goal is to straightforwardly represent "how many nouns were captured across the entire corpus."
Q. Can post-processing improve models with low proper noun F1?
Yes. LLM-based post-correction is effective. Passing ASR output to an LLM for context-aware proper noun correction can achieve higher accuracy than standalone ASR. Combining Whisper with LLM post-processing in particular has the potential to further improve both CER and proper noun F1.
Q. Can I reproduce this benchmark with my own data?
Installing Sudachi (sudachidict-full) and following the same pipeline will let you reproduce the noun F1 calculation. For ASR model inference code, refer to our previous benchmark article. The dataset is not publicly released, but the same pipeline can be applied directly to your own audio recordings and reference transcriptions.
Q. Can the same methodology be applied to specialized domains like medicine or law?
The core approach transfers, but technical terms not included in Sudachi's standard dictionary may not be correctly extracted as proper nouns. For medical domains, consider adding a domain-specific custom dictionary. When the goal is to evaluate specialized terminology directly, the Term Hit Rate approach described in this article serves as a useful complement.
Contact
For project inquiries and collaboration, contact us here.
Related Articles
Explore more articles connected to this topic.
Building a High-Quality ASR Training Data Pipeline with Multiple Models — Design Philosophy and Implementation of Whisper × Qwen3 × LLM
A deep-dive into a two-stage pipeline that integrates Whisper and Qwen3-ASR transcriptions via LLM diff-merging to generate ASR training data that surpasses what any single model can achieve. The pipeline combines Whisper's strength in kanji and proper nouns with Qwen3's lower overall CER, then applies Forced Alignment to assign word-level timestamps. The downstream chunk-splitting stage automates natural 20–30 second chunk generation using sentence-end scoring, followed by Whisper re-transcription for CER-based quality filtering.
Read article →
Japanese ASR Models for IT Terminology [2026]: Benchmarking Whisper, Qwen, and More on Real-World Accuracy
This article presents a first-hand benchmark evaluating nine ASR models on real audio data containing IT company names, service names, and abbreviations. Unlike general-purpose Japanese speech recognition benchmarks, this evaluation takes a practical perspective — asking whether models can correctly transcribe domain-specific terms like "openclaw" and "ARR."
Read article →