Neosophie

Blog

Articles about Neosophie's product development and technology.

#asr #benchmark #books #dataset #diarization #finetune #llm #metrics #oss #rag #tts

May 1, 2026

Best Japanese ASR Models 2026: Qwen3-ASR vs Whisper vs ReazonSpeech Benchmark

We benchmarked 8 Japanese ASR models — Qwen3-ASR, Whisper, ReazonSpeech, and Parakeet — on WER and inference speed. See which wins for accuracy, speed, and broadcast audio.

Neosophie Team

May 1, 2026 #diarization

Best Open-Source Speaker Diarization Models 2026: NeMo vs Pyannote vs VibeVoice Benchmarked

VibeVoice tops our 2026 benchmark across 6–14 speaker scenarios. We tested NeMo (SortFormer, MSDD), Pyannote, and VibeVoice on real-world audio — full DER scores, latency, and a model picker by use case.

Neosophie Team

April 30, 2026 #asr #benchmark

Japanese IT-Domain ASR Benchmark 2026: Granite 4.0 vs Qwen3 vs Whisper

We tested 9 ASR models on Japanese IT jargon — SaaS names, abbreviations, company terms. Benchmarks include Granite 4.0, Qwen3-ASR, Cohere Transcribe, and Whisper variants.

Neosophie Team

April 29, 2026 #asr

Released the Highest-Accuracy Japanese ASR Model for Free

Fine-tuned Qwen3-ASR-1.7B for proper noun recognition. Free on Hugging Face. Outperforms Whisper on both CER and proper noun F1. Also available in Sonophie for macOS.

Neosophie Team

April 29, 2026 #llm #finetune

Custom LogitsProcessor for HuggingFace Transformers: Fixing JSON Repetition Hallucinations in Qwen

Step-by-step guide to building a custom LogitsProcessor in HuggingFace Transformers. Applies no-repeat-ngram control selectively inside JSON Content fields to suppress repetition hallucinations during fine-tuned Qwen inference.

Neosophie Team
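The no-repeat-ngram idea mentioned in the teaser above can be sketched in plain Python. This is a minimal illustration of the banning rule only, not the article's implementation: the function name `banned_next_tokens` is hypothetical, and the JSON Content-field gating is left out. In a real Transformers `LogitsProcessor`, the returned ids would presumably have their scores set to negative infinity, and only while decoding inside the target field.

```python
from typing import List, Set


def banned_next_tokens(prev_ids: List[int], n: int) -> Set[int]:
    """Return token ids that would complete an n-gram already seen in prev_ids.

    Hypothetical helper for illustration; not part of any library.
    """
    if n < 1 or len(prev_ids) < n - 1:
        return set()
    # The last n-1 generated tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(prev_ids[len(prev_ids) - (n - 1):]) if n > 1 else ()
    banned: Set[int] = set()
    # Scan every complete n-gram in the history; if its first n-1 tokens match
    # the current prefix, its last token would create a repeat.
    for i in range(len(prev_ids) - n + 1):
        if tuple(prev_ids[i:i + n - 1]) == prefix:
            banned.add(prev_ids[i + n - 1])
    return banned


if __name__ == "__main__":
    # History "1 2 3 1 2": emitting 3 next would repeat the trigram (1, 2, 3).
    print(banned_next_tokens([1, 2, 3, 1, 2], 3))
```

Applying the ban only inside selected fields keeps legitimate repetition elsewhere in the output, such as repeated JSON keys, untouched.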

April 29, 2026 #tts

Zero-Shot Voice Cloning with OpenVoice: Emotion Control Across 8 Styles from a 14-Second Reference

How to run OpenVoice for zero-shot voice cloning with emotion style control. Hands-on results across 8 emotion styles using a 14-second reference clip, including code with se_extractor.

Neosophie Team

April 29, 2026 #oss

whisper.cpp Local Inference on Mac: Offline Transcription with CoreML & Apple Silicon

Offline transcription on Mac using whisper.cpp + CoreML. Supports microphone and web conference audio; nothing is sent off the device, and it is completely free to use.

Neosophie Team

April 27, 2026 #diarization #metrics

Speaker Diarization Metrics: DER, JER, Purity & Boundary Error with Python

DER alone won't tell you why your diarization fails. Learn DER, JER, Purity, Coverage, and Boundary Error — with formulas and working Python code using pyannote.

Neosophie Team
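For readers new to the metric, DER is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by total reference speech time. The sketch below computes it from those four durations. The function name and the sample numbers are made up for illustration, and it does not use pyannote, which the article itself relies on.

```python
def diarization_error_rate(
    false_alarm: float,
    missed: float,
    confusion: float,
    total_reference: float,
) -> float:
    """DER = (false alarm + missed speech + speaker confusion) / reference speech.

    All arguments are durations in seconds. Hypothetical helper for illustration.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


if __name__ == "__main__":
    # Made-up durations: 2 s false alarm, 3 s missed, 5 s confusion over 100 s of speech.
    print(diarization_error_rate(2.0, 3.0, 5.0, 100.0))
```

Because the three error terms are summed, two systems with the same DER can fail in very different ways, which is why the finer-grained metrics matter.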

April 27, 2026 #tts

Best Open-Source TTS Models 2026: Chatterbox vs Fish Speech vs Qwen3

Hands-on benchmarks of Chatterbox, Qwen3-TTS, Fish Speech 1.5, CosyVoice 3.0 & IndexTTS2 — voice quality, latency, zero-shot cloning, emotion control, and license gotchas.

Neosophie Team

April 21, 2026 #asr #benchmark

Are WER and CER Not Enough? Re-evaluating Japanese ASR with Noun and Proper Noun F1 Scores

Japanese speech recognition is harder than speech recognition in many other languages because of the wide variety of kanji and proper noun spellings. This original benchmark implements an F1 evaluation based on morphological analysis with Sudachi to quantify the kanji and proper noun recognition accuracy that WER and CER struggle to capture, and re-evaluates 9 open-source ASR models.

Neosophie Team
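One way to picture a noun-level F1 score is as a multiset overlap between the nouns extracted from the reference transcript and those extracted from the ASR output. The sketch below assumes the nouns have already been extracted (for example by a morphological analyzer such as Sudachi) and uses simple bag-of-words matching. The function name, the matching scheme, and the sample words are assumptions, not the article's exact method.

```python
from collections import Counter
from typing import Sequence


def token_f1(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """F1 over two bags of tokens (e.g. nouns from reference and ASR output).

    Hypothetical helper; assumes order-insensitive multiset matching.
    """
    ref, hyp = Counter(reference), Counter(hypothesis)
    # Multiset intersection counts each token at most as often as it appears in both.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up noun lists: two of three nouns match on each side.
    ref_nouns = ["東京", "会議", "資料"]
    hyp_nouns = ["東京", "会議", "市場"]
    print(token_f1(ref_nouns, hyp_nouns))
```

Unlike CER, this score ignores errors in particles and inflections and reacts only when a content word is missing or wrong.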

April 16, 2026 #asr #dataset

Building a High-Quality ASR Training Data Pipeline with Multiple Models — Design Philosophy and Implementation of Whisper × Qwen3 × LLM

A deep-dive into a two-stage pipeline that integrates Whisper and Qwen3-ASR transcriptions via LLM diff-merging to generate ASR training data that surpasses what any single model can achieve. The pipeline combines Whisper's strength in kanji and proper nouns with Qwen3's lower overall CER, then applies Forced Alignment to assign word-level timestamps. The downstream chunk-splitting stage automates natural 20–30 second chunk generation using sentence-end scoring, followed by Whisper re-transcription for CER-based quality filtering.

Neosophie Team

April 8, 2026 #asr #finetune

How I Finetuned IBM Granite Speech 1B on Japanese Audio and Improved CER from 0.37 to 0.14

I finetuned IBM Granite Speech (`granite-4.0-1b-speech`) on 100 hours of Japanese speech data and reduced CER from 0.37 to 0.14. The official script's Projector+LoRA-only training has a ceiling on accuracy gains. The key breakthrough was additionally training `lm_head` and the last 8 layers of the Language Model. The result matches Qwen3-ASR-1.7B (CER 0.14) with only 1B parameters.

Neosophie Team

March 12, 2026 #books

5 Books a Working AI Engineer Actually Read for Real-World Accuracy Improvement and Product Development

The reason you can call AI APIs but still can't improve accuracy is a lack of understanding of what's happening inside the model. These 5 books — all O'Reilly titles written by practitioners at the forefront of the field, including Hugging Face engineers and Chip Huyen — are the standard reading list for ML engineers worldwide.

Neosophie Team

March 12, 2026 #rag

Replaced Embedding RAG with PageIndex on My Own Site — Here's What Actually Happened

After hitting a wall with hybrid BM25 + Embedding + Rerank search (wrong retrievals, high infra cost, Japanese embedding model limitations), I implemented **PageIndex** — a vector-free RAG architecture — as the AI assistant for my company site. The result: no VectorDB, no embedding model, no Reranker needed. Two-stage LLM-based filtering via a URL tree navigated to the correct pages for queries like *"What products do you have?"* and *"What ASR-related blog posts exist?"* — zero vector math involved.

Neosophie Team