We Released the Highest-Accuracy Japanese ASR Model for Free

This article covers the technical background, evaluation methodology, and usage of a fine-tuned version of Qwen3-ASR-1.7B specialized for proper noun recognition. The model is freely available on Hugging Face.

Conclusion

There are two reasons this model qualifies as "state of the art."

The first is CER (Character Error Rate). As demonstrated in our previous benchmark article, Qwen3-ASR currently achieves the lowest CER for Japanese ASR. This has been verified in comparisons against major models including Whisper, ReazonSpeech, and Granite.

But CER alone is not enough. Even a low CER is useless in practice if the model can't correctly transcribe proper nouns.

The second reason is the proper noun F1 score. As shown in our proper noun evaluation article, Qwen3-ASR leads on CER but was outperformed by Whisper on proper noun recognition accuracy. The model released here addresses this weakness through fine-tuning, achieving scores that surpass Whisper on both CER and proper noun F1. This two-axis superiority is what makes it "state of the art."

👉 Model (free)
neosophie/Qwen3-ASR-1.7B-JA — Hugging Face

👉 Japanese ASR app (macOS)
Sonophie — Neosophie


Why Qwen3-ASR?

Our benchmark article covers this in depth, but the short answer is that Qwen3-ASR currently achieves the lowest CER among Japanese ASR models.

What is CER (Character Error Rate)?
A metric that measures the difference between recognized output and reference text at the character level. Lower is better.
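To make the metric concrete, here is a minimal character-level CER implementation using a standard Levenshtein distance. This is an illustrative sketch, not the evaluation code used in the benchmarks:

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance over characters
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: edits needed / reference length
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, one wrong character in a four-character reference yields a CER of 0.25.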

That said, CER alone cannot tell the whole story.


The Proper Noun Problem That CER Can't Measure

A New Evaluation Axis: Proper Noun F1 Score

CER compares output against a single reference at the character level. If the reference text spells it "バイトダンス" (the katakana transliteration), then outputting "バイトダンス" counts as fully correct, even when the document actually calls for "ByteDance." Conversely, if the document expects "バイトダンス" and the model outputs "ByteDance," post-processing is required. CER alone cannot capture whether the model chose the appropriate written form.

This is why we introduced an F1-score evaluation focused on proper nouns and common nouns.

How to read the metrics

  • WER (Word Error Rate): Word-level error rate. Lower is better.
  • CER (Character Error Rate): Character-level error rate. Lower is better.
  • CN_F1 (Common Noun F1): F1 score for common noun recognition. Higher is better.
  • PN_F1 (Proper Noun F1): F1 score for proper noun recognition. Higher is better.
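The F1 scores can be understood as precision/recall over the set of nouns extracted from the reference versus the hypothesis. The exact matching scheme used in the evaluation is not detailed here; the following is a minimal set-based sketch:

```python
def noun_f1(reference_nouns: list[str], predicted_nouns: list[str]) -> float:
    # Set-based F1: a noun counts as correct only on an exact string match
    ref, pred = set(reference_nouns), set(predicted_nouns)
    tp = len(ref & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, recovering one of two reference proper nouns with no spurious outputs gives an F1 of about 0.67.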

Benchmark Results

IT Domain Dataset (technical audio, meetings, and explainer videos)

| ASR | WER↓ | CER↓ | CN_F1↑ | PN_F1↑ |
|---|---|---|---|---|
| neosophie/qwen3-asr-1.7b-ja (ours) | 0.2162 | 0.0892 | 0.8754 | 0.6512 |
| qwen/qwen3-asr-1.7b | 0.3003 | 0.0823 | 0.8642 | 0.5909 |
| whisper | 0.3441 | 0.1565 | 0.8831 | 0.5641 |

Why Proper Noun Recognition Matters in Practice

In practice, a one-point improvement in CER is far less noticeable than whether proper nouns come out right; correct proper nouns are what make a transcription feel accurate.

For example:

  • You want "ByteDance" in meeting minutes, not "バイトダンス"
  • "Pull Request" written as "プルリクエスト" is harder to read in a technical doc
  • "position: absolute" transcribed as "ポジションアブソルート" becomes unsearchable

These small transcription errors are tedious to fix manually at scale.
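One stopgap is a glossary-based post-processing pass over the raw transcript. This only covers terms you enumerate in advance, which is exactly why fine-tuning is preferable; the glossary entries below are illustrative:

```python
import re

# Hypothetical glossary; real entries depend on your domain
GLOSSARY = {
    "バイトダンス": "ByteDance",
    "プルリクエスト": "Pull Request",
    "ポジションアブソルート": "position: absolute",
}

def apply_glossary(text: str) -> str:
    # Longest-first alternation so longer katakana spans win over substrings
    pattern = re.compile(
        "|".join(sorted(map(re.escape, GLOSSARY), key=len, reverse=True))
    )
    return pattern.sub(lambda m: GLOSSARY[m.group(0)], text)
```

This rewrites "バイトダンスのプルリクエスト" to "ByteDanceのPull Request", but any term missing from the dictionary still has to be fixed by hand.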


Why Japanese ASR Is Fundamentally Hard

Japanese has one of the most complex writing systems in the world. Hiragana, katakana, kanji, Arabic numerals, and the Latin alphabet coexist — and for a single spoken phrase, there are often multiple "correct" written forms.

Examples of Orthographic Variation

```text
Speech: "バイトダンス"
Correct A (general document): バイトダンス
Correct B (technical document): ByteDance
```

```text
Speech: "いちまんにせんさんびゃくよんじゅうご"
Correct A: 一万二千三百四十五
Correct B: 1万2,345 (more readable)
```

```text
Speech: "AイコールAプラス1"
Correct A: A=A+1
Correct B: A=A+1 (code context)
```

Which form is "correct" depends entirely on context. Engineers prefer ASCII and symbol notation; other fields may prefer kanji.

General-purpose ASR models are optimized to convert speech to text — not to decide which orthographic form is appropriate. This is exactly why domain-specific fine-tuning is necessary.


Concrete Examples of Improvements from Fine-Tuning

Here are examples of conversions the model now handles correctly.

Katakana → Appropriate Notation

| Speech (Katakana) | Output |
|---|---|
| バイトダンス | ByteDance |
| プルリクエスト | Pull Request |
| グーグルスライド | Google Slides |
| ポジションアブソルート | position: absolute |
| サードパーティークッキー | third-party cookie |

Japanese Numerals → Arabic Numerals

| Speech | Output |
|---|---|
| 一万二千三百四十五 | 1万2345 |
| 十人 | 10人 |
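The model learns this mapping from data rather than from rules, but the conversion itself can be sketched as a small standalone function for kanji numerals up to the 万 (10,000) place. This is purely illustrative and not part of the released model:

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def kanji_to_int(s: str) -> int:
    # Accumulate a section (below 万), then scale it when 万 appears
    total, section, digit = 0, 0, 0
    for ch in s:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit like 十 means 1 × unit
            section += (digit or 1) * UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section, digit = 0, 0
    return total + section + digit
```

For example, this maps 一万二千三百四十五 to 12345 and 十 to 10, matching the table above.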

Formulas and Code Notation

| Speech | Output |
|---|---|
| AイコールAプラス1 | A=A+1 |

How to Use the Model

Via Hugging Face

```bash
pip install transformers torch librosa
```

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

model_id = "neosophie/Qwen3-ASR-1.7B-JA"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio at 16 kHz (the model's expected sampling rate) and run inference
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

For detailed usage and licensing, see the Hugging Face model page.

Prefer a GUI?

For those who don't want to write code, the macOS app Sonophie ships with this model built in.

Sonophie — Japanese-first speech recognition for macOS

Launch with a global shortcut → transcribe → copy to clipboard. Works great for meeting notes and interview transcription.


Technical Notes: Fine-Tuning Design Philosophy

Why a 1.7B Model?

  • Designed for local inference and edge devices
  • CER is on par with or better than large-v3-class models
  • Low memory footprint; runs on CPU-only environments

Fine-Tuning Data Composition

We collected audio–text pairs centered on proper nouns and trained the model on orthographic variation patterns. The dataset focuses on technical terms, company names, product names, and programming vocabulary.

This model is optimized for Japanese technical and business domains. Separate evaluation is recommended for specialized fields such as medicine, law, and government.


FAQ

Q. Should I use Whisper or Qwen3-ASR?
A. If CER is your priority, Qwen3-ASR is currently the stronger choice. For use cases with many proper nouns, we recommend the fine-tuned model released here.

Q. Does it work without a GPU?
A. Yes — at 1.7B parameters, it runs on CPU only. Expect significantly slower inference compared to GPU environments. float16 and int8 quantization are also worth considering.

Q. Can it transcribe English audio?
A. This model was fine-tuned specifically for Japanese, so English audio performance is not guaranteed. Use the original Qwen3-ASR for English.

Q. Is commercial use allowed?
A. Check the model page on Hugging Face. Licensing follows the base Qwen3-ASR model.

Q. Can I add custom proper nouns?
A. Additional training support is not currently provided, but you can use the Hugging Face model as a base for your own fine-tuning.

Q. What platforms does Sonophie support?
A. macOS only, for now.


Summary

  • CER has limits for Japanese ASR — proper noun recognition requires a separate metric (F1 score)
  • Qwen3-ASR leads on CER but was outperformed by Whisper on proper nouns
  • Fine-tuning on proper noun data resolves this gap
  • The model is freely available on Hugging Face
  • A macOS GUI app (Sonophie) is also available for non-technical users

👉 Try the model
neosophie/Qwen3-ASR-1.7B-JA

Contact

If you are considering a new project, product development, or other collaboration, please get in touch.
