We Released the Highest-Accuracy Japanese ASR Model for Free

This article covers the technical background, evaluation methodology, and usage of a fine-tuned version of Qwen3-ASR-1.7B specialized for proper noun recognition. The model is freely available on Hugging Face.

Conclusion

There are two reasons this model qualifies as "state of the art."

The first is CER (Character Error Rate). As demonstrated in our previous benchmark article, Qwen3-ASR currently achieves the lowest CER for Japanese ASR. This has been verified in comparisons against major models including Whisper, ReazonSpeech, and Granite.

But CER alone is not enough. Even a low CER is useless in practice if the model can't correctly transcribe proper nouns.

The second reason is the proper noun F1 score. As shown in our proper noun evaluation article, Qwen3-ASR leads on CER but was outperformed by Whisper on proper noun recognition accuracy. The model released here addresses this weakness through fine-tuning, achieving scores that surpass Whisper on both CER and proper noun F1. This two-axis superiority is what makes it "state of the art."

👉 Model (free)
neosophie/Qwen3-ASR-1.7B-JA — Hugging Face

👉 Japanese ASR app (macOS)
Sonophie — Neosophie


Why Qwen3-ASR?

Our benchmark article covers this in depth, but the short answer is that Qwen3-ASR currently achieves the lowest CER among Japanese ASR models.

What is CER (Character Error Rate)?
A metric that measures the difference between recognized output and reference text at the character level. Lower is better.
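To make the metric concrete, here is a minimal character-level CER implementation using a standard Levenshtein distance. This is an illustrative sketch, not the evaluation code used in the benchmarks:

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance over characters
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: edits needed / reference length
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, one wrong character in a four-character reference yields a CER of 0.25.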

That said, CER alone cannot tell the whole story.


The Proper Noun Problem That CER Can't Measure

A New Evaluation Axis: Proper Noun F1 Score

CER compares output against a single reference at the character level. If the reference text spells it "バイトダンス" (the katakana transliteration), then outputting "バイトダンス" counts as fully correct, even when the document actually calls for "ByteDance." Conversely, if the document expects "バイトダンス" and the model outputs "ByteDance," post-processing is required. CER alone cannot capture whether the model chose the appropriate written form.

This is why we introduced an F1-score evaluation focused on proper nouns and common nouns.

How to read the metrics

  • WER (Word Error Rate): Word-level error rate. Lower is better.
  • CER (Character Error Rate): Character-level error rate. Lower is better.
  • CN_F1 (Common Noun F1): F1 score for common noun recognition. Higher is better.
  • PN_F1 (Proper Noun F1): F1 score for proper noun recognition. Higher is better.
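The F1 scores can be understood as precision/recall over the set of nouns extracted from the reference versus the hypothesis. The exact matching scheme used in the evaluation is not detailed here; the following is a minimal set-based sketch:

```python
def noun_f1(reference_nouns: list[str], predicted_nouns: list[str]) -> float:
    # Set-based F1: a noun counts as correct only on an exact string match
    ref, pred = set(reference_nouns), set(predicted_nouns)
    tp = len(ref & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, recovering one of two reference proper nouns with no spurious outputs gives an F1 of about 0.67.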

Benchmark Results

IT Domain Dataset (technical audio, meetings, and explainer videos)

| ASR | WER↓ | CER↓ | CN_F1↑ | PN_F1↑ |
|---|---|---|---|---|
| neosophie/qwen3-asr-1.7b-ja (ours) | 0.2162 | 0.0892 | 0.8754 | 0.6512 |
| qwen/qwen3-asr-1.7b | 0.3003 | 0.0823 | 0.8642 | 0.5909 |
| whisper | 0.3441 | 0.1565 | 0.8831 | 0.5641 |

Why Proper Noun Recognition Matters in Practice

In practice, a one-point improvement in CER is far less noticeable than whether proper nouns come out right; correct proper nouns are what make a transcription feel accurate.

For example:

  • You want "ByteDance" in meeting minutes, not "バイトダンス"
  • "Pull Request" written as "プルリクエスト" is harder to read in a technical doc
  • "position: absolute" transcribed as "ポジションアブソルート" becomes unsearchable

These small transcription errors are tedious to fix manually at scale.
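One stopgap is a glossary-based post-processing pass over the raw transcript. This only covers terms you enumerate in advance, which is exactly why fine-tuning is preferable; the glossary entries below are illustrative:

```python
import re

# Hypothetical glossary; real entries depend on your domain
GLOSSARY = {
    "バイトダンス": "ByteDance",
    "プルリクエスト": "Pull Request",
    "ポジションアブソルート": "position: absolute",
}

def apply_glossary(text: str) -> str:
    # Longest-first alternation so longer katakana spans win over substrings
    pattern = re.compile(
        "|".join(sorted(map(re.escape, GLOSSARY), key=len, reverse=True))
    )
    return pattern.sub(lambda m: GLOSSARY[m.group(0)], text)
```

This rewrites "バイトダンスのプルリクエスト" to "ByteDanceのPull Request", but any term missing from the dictionary still has to be fixed by hand.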


Why Japanese ASR Is Fundamentally Hard

Japanese has one of the most complex writing systems in the world. Hiragana, katakana, kanji, Arabic numerals, and the Latin alphabet coexist — and for a single spoken phrase, there are often multiple "correct" written forms.

Examples of Orthographic Variation

```text
Speech: "バイトダンス"
Correct A (general document): バイトダンス
Correct B (technical document): ByteDance
```

```text
Speech: "いちまんにせんさんびゃくよんじゅうご"
Correct A: 一万二千三百四十五
Correct B: 1万2,345 (more readable)
```

```text
Speech: "AイコールAプラス1"
Correct A: A=A+1
Correct B: A=A+1 (code context)
```

Which form is "correct" depends entirely on context. Engineers prefer ASCII and symbol notation; other fields may prefer kanji.

General-purpose ASR models are optimized to convert speech to text — not to decide which orthographic form is appropriate. This is exactly why domain-specific fine-tuning is necessary.


Concrete Examples of Improvements from Fine-Tuning

Here are examples of conversions the model now handles correctly.

Katakana → Appropriate Notation

| Speech (Katakana) | Output |
|---|---|
| バイトダンス | ByteDance |
| プルリクエスト | Pull Request |
| グーグルスライド | Google Slides |
| ポジションアブソルート | position: absolute |
| サードパーティークッキー | third-party cookie |

Japanese Numerals → Arabic Numerals

| Speech | Output |
|---|---|
| 一万二千三百四十五 | 1万2345 |
| 十人 | 10人 |
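The model learns this mapping from data rather than from rules, but the conversion itself can be sketched as a small standalone function for kanji numerals up to the 万 (10,000) place. This is purely illustrative and not part of the released model:

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def kanji_to_int(s: str) -> int:
    # Accumulate a section (below 万), then scale it when 万 appears
    total, section, digit = 0, 0, 0
    for ch in s:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit like 十 means 1 × unit
            section += (digit or 1) * UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section, digit = 0, 0
    return total + section + digit
```

For example, this maps 一万二千三百四十五 to 12345 and 十 to 10, matching the table above.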

Formulas and Code Notation

| Speech | Output |
|---|---|
| AイコールAプラス1 | A=A+1 |

How to Use the Model

Via Hugging Face

```bash
pip install transformers torch librosa
```

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

model_id = "neosophie/Qwen3-ASR-1.7B-JA"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio at 16 kHz (the model's expected sampling rate) and run inference
audio, sr = librosa.load("your_audio.wav", sr=16000)

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

For detailed usage and licensing, see the Hugging Face model page.

Prefer a GUI?

For those who don't want to write code, the macOS app Sonophie ships with this model built in.

Sonophie — Japanese-first speech recognition for macOS

Launch with a global shortcut → transcribe → copy to clipboard. Works great for meeting notes and interview transcription.


Technical Notes: Fine-Tuning Design Philosophy

Why a 1.7B Model?

  • Designed for local inference and edge devices
  • CER is on par with or better than large-v3-class models
  • Low memory footprint; runs on CPU-only environments

Fine-Tuning Data Composition

We collected audio–text pairs centered on proper nouns and trained the model on orthographic variation patterns. The dataset focuses on technical terms, company names, product names, and programming vocabulary.

This model is optimized for Japanese technical and business domains. Separate evaluation is recommended for specialized fields such as medicine, law, and government.


FAQ

Q. Should I use Whisper or Qwen3-ASR?
A. If CER is your priority, Qwen3-ASR is currently the stronger choice. For use cases with many proper nouns, we recommend the fine-tuned model released here.

Q. Does it work without a GPU?
A. Yes — at 1.7B parameters, it runs on CPU only. Expect significantly slower inference compared to GPU environments. float16 and int8 quantization are also worth considering.

Q. Can it transcribe English audio?
A. This model was fine-tuned specifically for Japanese, so English audio performance is not guaranteed. Use the original Qwen3-ASR for English.

Q. Is commercial use allowed?
A. Check the model page on Hugging Face. Licensing follows the base Qwen3-ASR model.

Q. Can I add custom proper nouns?
A. Additional training support is not currently provided, but you can use the Hugging Face model as a base for your own fine-tuning.

Q. What platforms does Sonophie support?
A. macOS only, for now.


Summary

  • CER has limits for Japanese ASR — proper noun recognition requires a separate metric (F1 score)
  • Qwen3-ASR leads on CER but was outperformed by Whisper on proper nouns
  • Fine-tuning on proper noun data resolves this gap
  • The model is freely available on Hugging Face
  • A macOS GUI app (Sonophie) is also available for non-technical users

👉 Try the model
neosophie/Qwen3-ASR-1.7B-JA

Contact

If you are considering a new project, product development, or other collaboration, please get in touch.
