We Released the Highest-Accuracy Japanese ASR Model for Free
This article covers the technical background, evaluation methodology, and usage of a fine-tuned version of Qwen3-ASR-1.7B specialized for proper noun recognition. The model is freely available on Hugging Face.
Conclusion
There are two reasons this model qualifies as "state of the art."
The first is CER (Character Error Rate). As demonstrated in our previous benchmark article, Qwen3-ASR currently achieves the lowest CER for Japanese ASR. This has been verified in comparisons against major models including Whisper, ReazonSpeech, and Granite.
But CER alone is not enough. Even a low CER is useless in practice if the model can't correctly transcribe proper nouns.
The second reason is the proper noun F1 score. As shown in our proper noun evaluation article, Qwen3-ASR leads on CER but was outperformed by Whisper on proper noun recognition accuracy. The model released here addresses this weakness through fine-tuning, achieving scores that surpass Whisper on both CER and proper noun F1. This two-axis superiority is what makes it "state of the art."
👉 Model (free)
neosophie/Qwen3-ASR-1.7B-JA — Hugging Face
👉 Japanese ASR app (macOS)
Sonophie — Neosophie
Why Qwen3-ASR?
Our benchmark article covers this in depth, but the short answer is that Qwen3-ASR currently achieves the lowest CER among Japanese ASR models.
What is CER (Character Error Rate)?
A metric that measures the difference between recognized output and reference text at the character level. Lower is better.
That said, CER alone cannot tell the whole story.
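To make the metric concrete, character-level CER is just the Levenshtein edit distance between hypothesis and reference, divided by the reference length. The sketch below is a minimal illustration, not the evaluation code used for the benchmarks in this article:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with a rolling one-row DP table."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev = dp[0]
        dp[0] = i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                           # deletion
                dp[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),   # substitution (0 if match)
            )
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("こんにちは", "こんにちわ"))  # 0.2 — one substitution over five characters
```

Note that by this definition, outputting "バイトダンス" where the reference says "ByteDance" counts as a large error, while matching whatever form the reference happens to use counts as perfect — which is exactly why CER alone cannot judge orthographic choice.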
The Proper Noun Problem That CER Can't Measure
A New Evaluation Axis: Proper Noun F1 Score
CER compares output against the reference at the character level, so if the reference text itself uses the katakana transliteration "バイトダンス", a model that outputs "バイトダンス" scores perfectly. But if the document actually calls for "ByteDance," post-processing is still required — a gap CER never sees.
This is why we introduced an F1-score evaluation focused on proper nouns and common nouns.
How to read the metrics
- WER (Word Error Rate): Word-level error rate. Lower is better.
- CER (Character Error Rate): Character-level error rate. Lower is better.
- CN_F1 (Common Noun F1): F1 score for common noun recognition. Higher is better.
- PN_F1 (Proper Noun F1): F1 score for proper noun recognition. Higher is better.
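The actual benchmark extracts nouns with the Sudachi morphological analyzer (see the linked evaluation article); as a simplified illustration of the F1 computation itself, here is a set-based sketch over already-extracted noun lists:

```python
def noun_f1(ref_nouns: list[str], hyp_nouns: list[str]) -> float:
    """Set-based F1 over extracted nouns (simplified; the article's
    evaluation uses morphological analysis to extract the nouns)."""
    ref, hyp = set(ref_nouns), set(hyp_nouns)
    tp = len(ref & hyp)          # nouns recognized exactly as written in the reference
    if tp == 0:
        return 0.0
    precision = tp / len(hyp)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# "バイトダンス" does not match the reference form "ByteDance", so F1 drops
print(noun_f1(["ByteDance", "会議"], ["バイトダンス", "会議"]))  # 0.5
```

Because the match is on the written form, a katakana transliteration of a proper noun counts as a miss even when CER barely moves — this is the behavior the PN_F1 column captures.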
Benchmark Results
IT Domain Dataset (technical audio, meetings, and explainer videos)
| ASR | WER↓ | CER↓ | CN_F1↑ | PN_F1↑ |
|---|---|---|---|---|
| neosophie/qwen3-asr-1.7b-ja (ours) | 0.2162 | 0.0892 | 0.8754 | 0.6512 |
| qwen/qwen3-asr-1.7b | 0.3003 | 0.0823 | 0.8642 | 0.5909 |
| whisper | 0.3441 | 0.1565 | 0.8831 | 0.5641 |
Why Proper Noun Recognition Matters in Practice
In practice, a one-point improvement in CER is barely noticeable, while correctly transcribed proper nouns are what make a transcript feel accurate.
For example:
- You want "ByteDance" in meeting minutes, not "バイトダンス"
- "Pull Request" written as "プルリクエスト" is harder to read in a technical doc
- "position: absolute" transcribed as "ポジションアブソルート" becomes unsearchable
These small transcription errors are tedious to fix manually at scale.
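The manual-fix workflow amounts to maintaining a correction dictionary and applying it after transcription. A hypothetical sketch (the entries here are examples, not a shipped table) shows why this scales poorly — every new term means another entry, and substring collisions must be managed by hand:

```python
# Hypothetical post-editing table; real entries depend entirely on your domain.
CORRECTIONS = {
    "バイトダンス": "ByteDance",
    "プルリクエスト": "Pull Request",
    "ポジションアブソルート": "position: absolute",
}

def post_edit(text: str) -> str:
    # Longest-match-first so longer katakana runs win over their substrings.
    for kana in sorted(CORRECTIONS, key=len, reverse=True):
        text = text.replace(kana, CORRECTIONS[kana])
    return text

print(post_edit("バイトダンスのプルリクエスト"))  # ByteDanceのPull Request
```

A fine-tuned model moves this mapping into the model itself, which also lets it use acoustic and sentence context that a flat string table cannot.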
Why Japanese ASR Is Fundamentally Hard
Japanese has one of the most complex writing systems in the world. Hiragana, katakana, kanji, Arabic numerals, and the Latin alphabet coexist — and for a single spoken phrase, there are often multiple "correct" written forms.
Examples of Orthographic Variation
Speech: "バイトダンス"
Correct A (general document): バイトダンス
Correct B (technical document): ByteDance
Speech: "いちまんにせんさんびゃくよんじゅうご"
Correct A: 一万二千三百四十五
Correct B: 1万2,345 (more readable)
Speech: "AイコールAプラス1"
Correct A: Ａ＝Ａ＋１ (full-width, general document)
Correct B: A=A+1 (half-width ASCII, code context)
Which form is "correct" depends entirely on context. Engineers prefer ASCII and symbol notation; other fields may prefer kanji.
General-purpose ASR models are optimized to convert speech to text — not to decide which orthographic form is appropriate. This is exactly why domain-specific fine-tuning is necessary.
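One narrow slice of this variation — full-width versus half-width symbols — can be handled deterministically with Unicode NFKC normalization from the standard library. The harder choices, such as katakana versus Latin spellings of a proper noun, cannot, which is where fine-tuning comes in:

```python
import unicodedata

# NFKC folds full-width compatibility characters to their ASCII forms.
print(unicodedata.normalize("NFKC", "Ａ＝Ａ＋１"))  # A=A+1

# But it leaves katakana untouched — no rule maps it to "ByteDance".
print(unicodedata.normalize("NFKC", "バイトダンス"))  # バイトダンス
```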
Concrete Examples of Improvements from Fine-Tuning
Here are examples of conversions the model now handles correctly.
Katakana → Appropriate Notation
| Speech (Katakana) | Output |
|---|---|
| バイトダンス | ByteDance |
| プルリクエスト | Pull Request |
| グーグルスライド | Google Slides |
| ポジションアブソルート | position: absolute |
| サードパーティークッキー | third-party cookie |
Japanese Numerals → Arabic Numerals
| Speech | Output |
|---|---|
| 一万二千三百四十五 | 1万2345 |
| 十人 | 10人 |
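For intuition about what the model learns here, kanji numerals compose multiplicatively (digit × unit) with block boundaries at 万 and 億. A minimal standalone parser — an illustration of the conversion rule, not code from the model — looks like this:

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL = {"十": 10, "百": 100, "千": 1000}   # multipliers within a block
BIG = {"万": 10**4, "億": 10**8}            # block boundaries

def kanji_to_int(s: str) -> int:
    total = 0   # value accumulated across 万/億 blocks
    block = 0   # value of the current sub-万 block
    digit = 0   # pending digit awaiting a multiplier
    for ch in s:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL:
            block += (digit or 1) * SMALL[ch]   # bare 十 means 1 × 10
            digit = 0
        elif ch in BIG:
            total += (block + digit) * BIG[ch]
            block = digit = 0
    return total + block + digit

print(kanji_to_int("一万二千三百四十五"))  # 12345
print(kanji_to_int("十"))                  # 10
```

The model performs this conversion implicitly from audio, alongside the harder judgment of when Arabic numerals are the appropriate form at all.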
Formulas and Code Notation
| Speech | Output |
|---|---|
| AイコールAプラス1 | A=A+1 |
How to Use the Model
Via Hugging Face
```shell
pip install transformers torch librosa
```

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

model_id = "neosophie/Qwen3-ASR-1.7B-JA"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio at 16 kHz and run inference
audio, sr = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
For detailed usage and licensing, see the Hugging Face model page.
Prefer a GUI?
For those who don't want to write code, the macOS app Sonophie ships with this model built in.
Sonophie — Japanese-first speech recognition for macOS
Launch with a global shortcut → transcribe → copy to clipboard. Works great for meeting notes and interview transcription.
Technical Notes: Fine-Tuning Design Philosophy
Why a 1.7B Model?
- Designed for local inference and edge devices
- CER is on par with or better than large-v3-class models
- Low memory footprint; runs on CPU-only environments
Fine-Tuning Data Composition
We collected audio–text pairs centered on proper nouns and trained the model on orthographic variation patterns. The dataset focuses on technical terms, company names, product names, and programming vocabulary.
This model is optimized for Japanese technical and business domains. Separate evaluation is recommended for specialized fields such as medicine, law, and government.
FAQ
Q. Should I use Whisper or Qwen3-ASR?
A. If CER is your priority, Qwen3-ASR is currently the stronger choice. For use cases with many proper nouns, we recommend the fine-tuned model released here.
Q. Does it work without a GPU?
A. Yes — at 1.7B parameters, it runs on CPU only. Expect significantly slower inference compared to GPU environments. float16 and int8 quantization are also worth considering.
Q. Can it transcribe English audio?
A. This model was fine-tuned specifically for Japanese, so English audio performance is not guaranteed. Use the original Qwen3-ASR for English.
Q. Is commercial use allowed?
A. Check the model page on Hugging Face. Licensing follows the base Qwen3-ASR model.
Q. Can I add custom proper nouns?
A. Additional training support is not currently provided, but you can use the Hugging Face model as a base for your own fine-tuning.
Q. What platforms does Sonophie support?
A. macOS only, for now.
Summary
- CER has limits for Japanese ASR — proper noun recognition requires a separate metric (F1 score)
- Qwen3-ASR leads on CER but was outperformed by Whisper on proper nouns
- Fine-tuning on proper noun data resolves this gap
- The model is freely available on Hugging Face
- A macOS GUI app (Sonophie) is also available for non-technical users
👉 Try the model
neosophie/Qwen3-ASR-1.7B-JA
Contact
If you are considering a new project, product development, or other collaboration, please get in touch.
Related Articles
Explore more articles connected to this topic.
Japanese IT-Domain ASR Benchmark 2026: Granite 4.0 vs Qwen3 vs Whisper
We tested 9 ASR models on Japanese IT jargon — SaaS names, abbreviations, company terms. Benchmarks include Granite 4.0, Qwen3-ASR, Cohere Transcribe, and Whisper variants.
Read article →
Are WER and CER Not Enough? Re-evaluating Japanese ASR with Noun and Proper Noun F1 Scores
Japanese speech recognition is more challenging than many other languages due to the wide variety of kanji and proper noun representations. This primary benchmark article implements a morphological-analysis-based F1 score evaluation using Sudachi to quantify "kanji and proper noun recognition accuracy" that WER and CER struggle to capture, and re-evaluates 9 open-source ASR models.
Read article →