Are WER and CER Not Enough? Re-evaluating Japanese ASR with Noun and Proper Noun F1 Scores
This article is a first-hand benchmark report: using Sudachi-based morphological analysis, we implement an F1-score evaluation that quantifies the kanji and proper-noun recognition accuracy that WER and CER alone struggle to capture, and re-evaluate 9 open-source ASR models.
Summary: What You'll Learn
In Japanese ASR (Automatic Speech Recognition) accuracy evaluation, WER (Word Error Rate) and CER (Character Error Rate) are widely used as standard metrics. However, "how critical a single kanji substitution or a proper noun error is in practice" is difficult to reflect in these metrics.
To address this, we used the morphological analysis library Sudachi to extract "Common Nouns" and "Proper Nouns" from ASR output, then re-evaluated each using Precision, Recall, and F1 scores. The target models are the same 9 models from our previously published IT-domain benchmark.
Key findings:
- Qwen3-ASR, which leads on CER, falls behind Whisper on proper noun F1 (0.58 vs 0.60)
- Whisper ranks mid-range on CER but tops the field on both common noun and proper noun F1
- Judging by CER alone risks mistakenly underrating Whisper
- Parakeet and ReazonSpeech variants score 0.15–0.25 on proper noun F1, making them unsuitable for professional use involving technical terminology
Why WER and CER Alone Are Insufficient for Japanese ASR Evaluation
In English ASR evaluation, WER (Word Error Rate) is the standard, but Japanese has no "word segmentation" (spaces between words), making word boundaries ambiguous. As a result, CER (Character Error Rate) has become the de facto standard for Japanese evaluation.
However, CER has a fundamental problem of its own.
| Reference Text | ASR Output | CER |
|---|---|---|
| "会議の議事録を作成する" (Create meeting minutes) | "会議の議事論を作成する" (1 char error) | 0.09 |
| "田中一郎が承認した" (Ichiro Tanaka approved) | "多中一郎が承認した" (1 char error) | 0.11 |
The CER is similar in both cases, but the real-world impact is completely different:
- "議事録 → 議事論": Recoverable from context to some degree
- "田中一郎 → 多中一郎": A person name error is a serious mistake legally and operationally
Furthermore, a sentence with a missing proper noun and one with a wrong particle may have similar CER scores, but the information loss is vastly different.
"Implemented using Anthropic's Claude" → "Implemented using Claude"
Within a longer utterance, the CER penalty for this deletion is only a few percent. Yet the company name, a primary piece of information, has vanished.
Motivated by this problem, we implemented a noun- and proper-noun-specific F1 evaluation metric.
Evaluation Methodology
Text Normalization
Before noun matching, both the reference text and ASR output are normalized. This includes NFKC Unicode normalization to unify full-width/half-width variants, punctuation removal, and standardization of kanji numerals and numeric expressions. This preprocessing prevents unfair score degradation due to surface-form variation.
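A minimal sketch of this normalization step using only the standard library (the kanji-numeral standardization mentioned above is omitted here for brevity; `normalize_text` is our illustrative name):

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize a reference or ASR string before noun matching."""
    # NFKC unifies full-width/half-width variants, e.g. "ＡＰＩ" -> "API"
    text = unicodedata.normalize("NFKC", text)
    # Drop Japanese and ASCII punctuation plus whitespace; only the
    # token content should be compared, not surface-form decoration.
    return re.sub(r"[、。・「」『』!-/:-@\[-`{-~\s]", "", text)
```

Running both the reference and the hypothesis through the same function ensures that a full-width "ＡＰＩ" and a half-width "API" count as a match rather than a spurious error.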
WER / CER Calculation
We use jiwer's edit-distance-based implementation: the total number of substitution, deletion, and insertion errors is divided by the number of units in the reference (tokens for WER, characters for CER). Since Japanese has no space delimiters, WER is calculated by first tokenizing with MeCab before token comparison; CER is computed character by character.
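The article's pipeline relies on jiwer, but the underlying computation is plain Levenshtein distance; a dependency-free sketch of the same CER calculation (function names are ours):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs for S/D/I errors."""
    dp = list(range(len(hyp) + 1))  # one-row DP table
    for i, rc in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (rc != hc))  # substitution
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit errors divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For the minutes example above, `cer("会議の議事録を作成する", "会議の議事論を作成する")` is one substitution over eleven characters, about 0.09.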
Noun Extraction (Sudachi)
Using Sudachi's part-of-speech information (UniDic-based), each morpheme is classified: 名詞-普通名詞 with subcategory 一般 or サ変可能 counts as a common noun, and 名詞-固有名詞 counts as a proper noun (including all subcategories such as personal names, place names, and organization names).
Rather than using the dictionary-normalized form, we use the surface form for extracted tokens. The normalized form sometimes converts English-mixed terms in unintended ways, whereas the surface form lets us evaluate "exactly what the ASR actually transcribed."
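A sketch of the classification logic, written over `(surface, pos)` pairs so it stays self-contained; in the actual pipeline these pairs would come from SudachiPy's `Morpheme.surface()` and `Morpheme.part_of_speech()`, and `classify_nouns` is our illustrative name:

```python
# In the real pipeline the pairs come from SudachiPy, roughly:
#   tok = dictionary.Dictionary(dict="full").create()
#   morphemes = [(m.surface(), m.part_of_speech())
#                for m in tok.tokenize(text, tokenizer.Tokenizer.SplitMode.C)]

def classify_nouns(morphemes):
    """Split (surface, pos_tuple) pairs into common and proper nouns."""
    common, proper = [], []
    for surface, pos in morphemes:
        if pos[0] != "名詞":
            continue  # not a noun at all
        if pos[1] == "固有名詞":
            # all subcategories (人名, 地名, 組織名, ...) count as proper nouns
            proper.append(surface)
        elif pos[1] == "普通名詞" and pos[2] in ("一般", "サ変可能"):
            common.append(surface)
    return common, proper
```

Note that the surface form, not the dictionary-normalized form, is what gets appended, matching the design decision described above.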
Precision, Recall, and F1 in the ASR Context
- Precision: Of the nouns output by the ASR, the proportion that also appear in the reference text. Measures "how many spurious or hallucinated nouns does the ASR produce?"
- Recall: Of the nouns in the reference text, the proportion the ASR successfully captured. Measures "how many important nouns does the ASR miss?"
- F1 Score: The harmonic mean of Precision and Recall. Ranges from 0 to 1, where 1.0 is a perfect match.
F1 Score Calculation (Multiset Aggregation)
Noun sequences are tracked using a Counter (multiset) that preserves occurrence counts rather than a plain set. True positives are computed from the intersection of the reference and prediction multisets, so repeated mentions of the same noun are counted correctly — missing a noun that appears multiple times is penalized proportionally.
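The per-sample calculation can be sketched in a few lines with `collections.Counter` (the function name `noun_f1` is ours):

```python
from collections import Counter

def noun_f1(ref_nouns: list[str], hyp_nouns: list[str]) -> float:
    """Per-sample F1 over noun occurrences (multiset, not set)."""
    if not ref_nouns or not hyp_nouns:
        return 0.0
    ref_c, hyp_c = Counter(ref_nouns), Counter(hyp_nouns)
    # Counter '&' keeps the minimum count per key: a noun mentioned
    # twice in the reference must be produced twice for full credit.
    tp = sum((ref_c & hyp_c).values())
    if tp == 0:
        return 0.0
    precision = tp / len(hyp_nouns)
    recall = tp / len(ref_nouns)
    return 2 * precision * recall / (precision + recall)
```

With a plain `set` intersection, an ASR that captures one of two mentions of "議事録" would score the same as one that captures both; the multiset version penalizes the missed second mention.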
Corpus-Level Aggregation (Micro-Averaging)
For benchmark summaries, rather than averaging per-sample F1 scores, we use micro-averaging: summing true positives, reference noun counts, and predicted noun counts across all samples, then recomputing Precision and Recall. This represents "how many nouns were correctly captured across the entire corpus" on an absolute noun count basis, so results are not distorted by utterance count when utterances vary in noun density.
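The corpus-level aggregation described above can be sketched as follows (again with illustrative names; each sample is a pair of reference and predicted noun lists):

```python
from collections import Counter

def micro_prf1(samples):
    """Micro-averaged Precision/Recall/F1 over (ref_nouns, hyp_nouns) pairs."""
    tp = ref_total = hyp_total = 0
    for ref_nouns, hyp_nouns in samples:
        # multiset intersection: repeated mentions matched per occurrence
        tp += sum((Counter(ref_nouns) & Counter(hyp_nouns)).values())
        ref_total += len(ref_nouns)
        hyp_total += len(hyp_nouns)
    precision = tp / hyp_total if hyp_total else 0.0
    recall = tp / ref_total if ref_total else 0.0
    denom = precision + recall
    return precision, recall, 2 * precision * recall / denom if denom else 0.0
```

Because counts are summed before Precision and Recall are recomputed, a noun-dense utterance contributes proportionally more than a short one, which is the intended behavior here.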
Benchmark Results: WER, CER, and Noun F1 Comparison
| ASR Model | WER↓ | CER↓ | CN_F1↑ | PN_F1↑ | RTF↓ |
|---|---|---|---|---|---|
| openai/whisper | 0.3256 | 0.1565 | 0.8831 | 0.6000 | 0.016 |
| mistralai/voxtral-mini-4b | 0.4220 | 0.1354 | 0.8821 | 0.5854 | 0.231 |
| coherelabs/cohere-transcribe | 0.4837 | 0.1701 | 0.8658 | 0.5652 | 0.011 |
| nvidia/parakeet-tdt-0.6b-v3 | 0.5236 | 0.1957 | 0.7810 | 0.2222 | 0.003 |
| reazon/reazonspeech-nemo-v2 | 0.5352 | 0.2327 | 0.7576 | 0.1500 | 0.021 |
| reazon/reazonspeech-k2-v2 | 0.5838 | 0.2852 | 0.7959 | 0.2500 | 0.028 |
| ibm/granite-4.0-1b-speech | 0.6149 | 0.3545 | 0.6435 | 0.1500 | 0.060 |
| qwen/qwen3-asr-1.7b | 0.6418 | 0.0823 | 0.8642 | 0.5778 | 0.047 |
| kotoba-tech/kotoba-whisper-v2.0 | 0.8695 | 0.6072 | 0.4621 | 0.1818 | 0.010 |
CN_F1: Common Noun F1 / PN_F1: Proper Noun F1 / RTF: Real Time Factor (lower is faster)
Deep Dive: Why CER Rankings Reverse
Whisper vs Qwen3-ASR: The CER–PN_F1 Inversion
The most striking finding is the inversion between CER and PN_F1 rankings.
| Metric | Whisper | Qwen3-ASR | Difference |
|---|---|---|---|
| CER | 0.1565 (3rd of 9) | 0.0823 (1st) | Qwen ~50% lower |
| PN_F1 | 0.6000 (1st) | 0.5778 (3rd) | Whisper +0.02 |
| CN_F1 | 0.8831 (1st) | 0.8642 (4th) | Whisper marginally ahead |
Qwen3-ASR makes very few mistakes on particles, conjunctions, and hiragana, giving it high character-level accuracy and thus a low CER. For proper nouns, however, cases where the output collapses to a phonetically similar but different word are noticeable — and in surface-form-based evaluation, those outputs do not match the reference.
Whisper, trained on multilingual data, handles katakana and mixed-script proper nouns well and tends to preserve company and service names intact. This aligns with my own day-to-day experience — the intuition that "Whisper is strong on proper nouns" is now backed by numbers.
Practical interpretation: For use cases where "who said what, from which company, about which service" matters — such as meeting transcriptions or interview notes — relying solely on CER risks undervaluing Whisper.
The Speed King Parakeet's Hidden Weakness
NVIDIA's Parakeet (RTF=0.003) offers overwhelming inference speed, but its proper noun F1 of 0.2222 ranks near the bottom. Trained primarily on English data, it struggles with Japanese proper nouns and technical terms. It is well-suited to scenarios where speed is the top priority — real-time subtitles, large-scale batch preprocessing — but is not appropriate for transcriptions where personal names and company names must be accurate.
Voxtral Mini's CN_F1 Score
On common noun F1 (CN_F1), Whisper leads at 0.8831 with Voxtral Mini close behind at 0.8821 — essentially tied. The gap is just 0.001, meaning the two models are nearly equivalent at recognizing general technical vocabulary. However, Voxtral Mini's processing speed (RTF=0.231) is significantly slower than Whisper's (0.016), which strengthens the case for choosing Whisper when accuracy is comparable.
Recommended Models by Use Case (Incorporating Noun F1)
| Use Case | Recommended Model | Reason |
|---|---|---|
| Meeting minutes with names & organizations | Whisper | Highest PN_F1 (0.60), CN_F1 also #1 (0.88) |
| High-accuracy character-level transcription | Qwen3-ASR | Lowest CER (0.082), solid common noun score (0.86) |
| Real-time / live captions | Parakeet | RTF=0.003; note PN_F1 caveat |
| All-around balance | Whisper | Consistently top-tier across CER, CN_F1, and PN_F1 |
Future Directions: Beyond Noun F1
1. Semantic CER
Current CER requires exact character match. But "コーヒー", "珈琲", and "coffee" are synonymous and should ideally incur no penalty. Candidates include surface-variation absorption using morphological analysis with a thesaurus (e.g., WordNet), or semantic similarity correction via multilingual embedding models.
2. Term Hit Rate
Prepare a domain-specific term list in advance and measure the proportion of terms appearing in the reference that the ASR correctly transcribed. Particularly valuable for medical, legal, and technical ASR — enabling targeted evaluation such as "disease name hit rate" or "legal article number hit rate."
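A hypothetical sketch of what such a metric could look like (the function `term_hit_rate` and its matching rule, plain substring containment after normalization, are our assumptions, not an existing implementation):

```python
def term_hit_rate(terms, reference: str, hypothesis: str):
    """Fraction of domain terms present in the reference that the
    ASR output also contains. Returns None if no listed term occurs."""
    # only terms that actually occur in the reference are scored
    relevant = [t for t in terms if t in reference]
    if not relevant:
        return None
    hits = sum(1 for t in relevant if t in hypothesis)
    return hits / len(relevant)
```

A production version would likely match on normalized text and handle term variants, but even this simple form answers "of the terms we care about, how many survived transcription?"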
3. Dataset Expansion
The current benchmark covers only 15 utterances. We plan to add audio samples containing medical and legal terminology, and to introduce domain-specific subsets for more granular evaluation.
With a small dataset, recognition differences in 1–2 utterances can significantly swing scores. Treat the numbers in this benchmark as indicative, and supplement with small-scale validation on your own data for practical deployment decisions.
Conclusion
For Japanese ASR evaluation, WER and CER are necessary but not sufficient. When proper noun and technical term recognition accuracy is the practical bottleneck, supplementary metrics like the noun F1 score introduced here are essential.
The key design decisions: normalize both texts with NFKC + punctuation removal before comparison; use Sudachi's POS tags to separately extract 名詞-普通名詞 and 名詞-固有名詞; use surface forms rather than dictionary-normalized forms as the evaluation target; and aggregate at the corpus level using micro-averaging (summing TP, reference count, and prediction count across all samples before recomputing Precision and Recall).
FAQ
Q. Which should I use — WER or CER?
CER is more appropriate for Japanese ASR. Because Japanese lacks word delimiters, word boundary definitions are ambiguous and WER results fluctuate depending on library settings. In our implementation, Japanese WER is computed after MeCab tokenization, but MeCab's segmentation itself introduces variability. Since CER alone cannot capture the severity of noun misrecognition, we recommend using it alongside the F1 metrics introduced in this article.
Q. Would using MeCab or Janome instead of Sudachi produce the same results?
No — differences in dictionaries will affect results. Proper noun extraction accuracy is highest with Sudachi (SudachiDict-full). MeCab with ipadic has limited registration of neologisms and company names, so many proper nouns may not be classified as such. Additionally, our implementation depends on Sudachi's POS taxonomy (UniDic-based), so swapping in another tool changes the granularity of POS classification and will not yield the same scores.
Q. Does it matter much whether micro- or macro-averaging (mean of per-sample F1) is used?
It can. Macro-averaging gives every utterance equal weight, so a single proper noun error in a short utterance (with few nouns) can drive that sample's F1 to zero and drag down the corpus-level score. Micro-averaging weights results by actual noun counts, so utterances with more nouns have proportionally greater influence. Micro-averaging is appropriate when the goal is to straightforwardly represent "how many nouns were captured across the entire corpus."
Q. Can post-processing improve models with low proper noun F1?
Yes. LLM-based post-correction is effective. Passing ASR output to an LLM for context-aware proper noun correction can achieve higher accuracy than standalone ASR. Combining Whisper with LLM post-processing in particular has the potential to further improve both CER and proper noun F1.
Q. Can I reproduce this benchmark with my own data?
Installing Sudachi (sudachidict-full) and following the same pipeline will let you reproduce the noun F1 calculation. For ASR model inference code, refer to our previous benchmark article. The dataset is not publicly released, but the same pipeline can be applied directly to your own audio recordings and reference transcriptions.
Q. Can the same methodology be applied to specialized domains like medicine or law?
The core approach transfers, but technical terms not included in Sudachi's standard dictionary may not be correctly extracted as proper nouns. For medical domains, consider adding a domain-specific custom dictionary. When the goal is to evaluate specialized terminology directly, the Term Hit Rate approach described in this article serves as a useful complement.
Contact
For project inquiries and collaboration, contact us here.
Related Articles
Explore more articles connected to this topic.
Building a High-Quality ASR Training Data Pipeline with Multiple Models — Design Philosophy and Implementation of Whisper × Qwen3 × LLM
A deep-dive into a two-stage pipeline that integrates Whisper and Qwen3-ASR transcriptions via LLM diff-merging to generate ASR training data that surpasses what any single model can achieve. The pipeline combines Whisper's strength in kanji and proper nouns with Qwen3's lower overall CER, then applies Forced Alignment to assign word-level timestamps. The downstream chunk-splitting stage automates natural 20–30 second chunk generation using sentence-end scoring, followed by Whisper re-transcription for CER-based quality filtering.
Read article →
Japanese ASR Models for IT Terminology [2026]: Benchmarking Whisper, Qwen, and More on Real-World Accuracy
This article presents a first-hand benchmark evaluating nine ASR models on real audio data containing IT company names, service names, and abbreviations. Unlike general-purpose Japanese speech recognition benchmarks, this evaluation takes a practical perspective — asking whether models can correctly transcribe domain-specific terms like "openclaw" and "ARR."
Read article →