Speaker Diarization Benchmark: NeMo vs VibeVoice vs Pyannote — Which Model Actually Works?

Summary

I tested three major speaker diarization frameworks — NVIDIA NeMo (Clustering, MSDD, Sortformer), Microsoft VibeVoice, and pyannote.audio — against real-world audio: a 10-minute drama clip with 14 speakers and two 10-minute debate clips with 6–10 speakers.

Key findings:

  • VibeVoice achieved DER 9.19% on debate-style audio — production-ready accuracy
  • Drama-style audio remains unsolved: best result was 27.41% (VibeVoice), still too high for practical use
  • Background sound separation via Demucs improved NeMo Clustering by 3.69 points on debate audio — but slightly hurt VibeVoice
  • Model architecture determines whether audio preprocessing helps or harms accuracy

This is a first-hand experiment report with full DER breakdowns, reproducible code, and practical model selection guidance based on real failure analysis.


What Is Speaker Diarization?

Speaker diarization is the task of answering: "Who spoke when?" — segmenting an audio stream by speaker identity across time.

It's distinct from automatic speech recognition (ASR): ASR answers "What was said?", while diarization answers "Who said it?"

Common use cases:

  • Automatic meeting minutes with speaker attribution
  • Subtitle/dubbing pipelines for video content
  • Call center conversation analytics
  • Multi-speaker podcast transcription

DER: The Standard Accuracy Metric

DER (Diarization Error Rate) is the primary evaluation metric:

```text
DER = (Confusion + Missed Detection + False Alarm) / Total Reference Duration
```
| Component | Meaning |
|---|---|
| Confusion | Audio assigned to the wrong speaker |
| Missed Detection | Speech that wasn't detected at all |
| False Alarm | Non-speech incorrectly labeled as speech |

Lower is better. DER below 10% is generally considered production-viable.
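As a quick sanity check, the formula is a one-liner (the numbers below are toy values, not figures from this benchmark):

```python
def der(confusion_s: float, missed_s: float, false_alarm_s: float,
        total_reference_s: float) -> float:
    """Diarization Error Rate: total error seconds over total reference speech."""
    return (confusion_s + missed_s + false_alarm_s) / total_reference_s

# 30s confusion + 10s missed + 10s false alarm over 500s of reference speech
print(f"DER: {der(30, 10, 10, 500):.2%}")  # DER: 10.00%
```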

All evaluations in this post use collar=0.0 and skip_overlap=False — the strictest standard settings, with no forgiveness window around speaker boundaries and overlap regions fully counted.

collar is a tolerance window (in seconds) applied around speaker-change boundaries. Setting it to 0.0 means even a 50ms misalignment counts as an error. Many published benchmarks use collar=0.25, which makes numbers look better but hides real boundary detection problems.
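To make the collar concrete, here is a deliberately simplified sketch (the real metric excises a window of ±collar seconds around each reference boundary from scoring entirely; this toy just measures how much boundary misalignment survives the tolerance):

```python
def boundary_error(ref_t: float, hyp_t: float, collar: float) -> float:
    """Seconds of boundary misalignment counted as error when a
    tolerance of +/- collar seconds is allowed around the reference."""
    return max(0.0, abs(hyp_t - ref_t) - collar)

# A hypothesis boundary 50 ms after the reference boundary at t=10.0 s:
print(f"{boundary_error(10.00, 10.05, collar=0.0):.2f}")   # 0.05 -> penalized
print(f"{boundary_error(10.00, 10.05, collar=0.25):.2f}")  # 0.00 -> forgiven
```

The same 50 ms slip that costs nothing at collar=0.25 is fully counted at collar=0.0, which is why strict settings expose boundary-detection quality.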


Test Environment

| Item | Detail |
|---|---|
| Audio A | Drama-style, 10 min, 14 speakers |
| Audio B | Debate-style #1, 10 min, 10 speakers |
| Audio C | Debate-style #2, 10 min, 6 speakers |
| Metric | DER (collar=0.0, skip_overlap=False) |
| Background separation tool | Demucs |
| Source code | https://github.com/y-dai20/speechlab |

The audio was sourced from Japanese-language video content. This matters because most models are trained primarily on English and Chinese — Japanese is inherently harder for these models without fine-tuning.


Models Tested

NVIDIA NeMo — Three Approaches in One Toolkit

NeMo ships with three distinct diarization methods, each with different tradeoffs.

① NeMo Clustering Diarizer

The standard pipeline-style approach. Speaker embeddings are extracted per audio segment, then clustered to assign speaker labels. Simple architecture, stable across different audio types, and reliable on clean dialogue. This is the go-to baseline for meeting and debate audio.
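The pipeline can be sketched in a few lines. Toy 3-dimensional embeddings and a greedy cosine-similarity threshold stand in here for real speaker embeddings (e.g. TitaNet's 192-dim vectors) and NeMo's actual spectral clustering, so treat this as an illustration of the idea, not NeMo's algorithm:

```python
import numpy as np

def cluster_embeddings(embs: np.ndarray, threshold: float = 0.7) -> list[int]:
    """Greedy clustering sketch: assign each segment embedding to the first
    cluster whose anchor is cosine-similar enough, else open a new cluster
    (each cluster index plays the role of a speaker label)."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    anchors: list[np.ndarray] = []
    labels: list[int] = []
    for e in embs:
        sims = [float(e @ a) for a in anchors]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            anchors.append(e)
            labels.append(len(anchors) - 1)
    return labels

# Two synthetic "speakers": segment embeddings near two different directions
rng = np.random.default_rng(0)
spk_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(3, 3))
spk_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(3, 3))
print(cluster_embeddings(np.vstack([spk_a, spk_b])))  # [0, 0, 0, 1, 1, 1]
```

The key property this illustrates: accuracy depends entirely on same-speaker embeddings landing close together, which is why embedding contamination (see the Demucs section) matters so much.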

② NeMo MSDD (Multi-Scale Diarization Decoder)

MSDD uses clustering to generate a rough initial segmentation, then refines boundaries using a neural decoder. In theory it's more accurate — in practice it can be unstable on audio that doesn't match its training distribution.

③ NeMo Sortformer (End-to-End)

A fully end-to-end architecture that skips clustering entirely. Its key advantage is native overlap detection — it can handle simultaneous speech. The current limitation: maximum 4 speakers. This makes it impractical for drama or large-panel content for now.

Microsoft VibeVoice — Integrated ASR + Diarization

VibeVoice handles ASR, timestamps, and speaker diarization in a single unified model. Because it was trained primarily on English and Chinese data, Japanese accuracy out of the box is lower — but fine-tuning on Japanese data significantly improves results.

All VibeVoice results in this benchmark are without fine-tuning — vanilla model weights on Japanese audio.

For fine-tuning VibeVoice with Japanese data: synthetic conversation datasets generated via GPT + Qwen work well. Key insight — don't use plain dialogue transcripts. Randomly inject overlaps, back-channels (aizuchi), and interruptions to simulate realistic speech. Clean scripted dialogue produces AI-sounding artifacts that hurt generalization.

pyannote.audio — The Open-Source Standard

pyannote.audio is the de facto open-source diarization library, with tight HuggingFace integration for easy local inference. A paid API ("pyannote precise") is also available for production use.


Background Sound Separation with Demucs

What Is Demucs?

Demucs is a deep learning audio source separation model developed by Meta (Facebook AI). It splits an audio file into four stems:

  • bass — low-frequency content
  • drums — percussion, often mixed with sound effects and ambient noise
  • other — other instruments
  • vocals — speech/singing only

The hypothesis: pass only the vocals stem to the diarization model, removing BGM and environmental noise that contaminates speaker embeddings.

```bash
pip install demucs
python -m demucs --two-stems=vocals your_audio.wav
# Outputs vocals.wav and no_vocals.wav under separated/htdemucs/your_audio/
```

Critical Finding: Demucs Helps Some Models, Hurts Others

This is the most important insight from the experiment: the effect of background sound separation is architecture-dependent.

VibeVoice: No improvement, slight degradation

VibeVoice was trained on real-world audio that includes background noise. Feeding it an overly clean vocals-only track moves the input out of its training distribution. On debate audio, DER actually increased slightly (9.19% → 9.31%) after separation.

Hypothesis: VibeVoice expects noisy, real-world input. Removing all background audio creates an unnaturally sterile signal that the model isn't calibrated for.

NeMo Clustering & MSDD: Clear improvement on debate audio

Clustering-based models rely heavily on the quality of speaker embeddings. When BGM or environmental noise contaminates the audio, those embeddings become inconsistent — the same speaker sounds different across segments, and clustering misassigns labels.

Demucs removes that contamination, producing cleaner embeddings and better cluster separation.
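The contamination effect is easy to reproduce with synthetic embeddings (pure numpy, illustrative magnitudes only): adding a different loud background component to each segment of the same speaker pulls their cosine similarity down, which is exactly what scatters one speaker across multiple clusters.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
speaker = rng.normal(size=64) * 3                  # one speaker's "voice direction"
seg1 = speaker + rng.normal(scale=0.3, size=64)    # two segments of that speaker
seg2 = speaker + rng.normal(scale=0.3, size=64)

# BGM varies over time, so each segment picks up a different noise component
bgm1 = rng.normal(size=64) * 4
bgm2 = rng.normal(size=64) * 4

print(f"clean same-speaker similarity:        {cos(seg1, seg2):.2f}")         # high
print(f"contaminated same-speaker similarity: {cos(seg1 + bgm1, seg2 + bgm2):.2f}")  # much lower
```

Removing the background components (which is what Demucs does, far less crudely) restores the high same-speaker similarity.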

| Model | Normal DER | After Demucs | Improvement |
|---|---|---|---|
| NeMo Clustering | 14.41% | 10.72% | ▲3.69pt |
| NeMo MSDD | 18.84% | 15.44% | ▲3.40pt |
| VibeVoice | 9.19% | 9.31% | ▼0.12pt (worse) |
| pyannote | 16.04% | 15.98% | ▲0.06pt (negligible) |

Don't blindly apply Demucs as a preprocessing step to all models. Always A/B test. For VibeVoice and models trained on noisy real-world data, it may hurt more than help.


Full Benchmark Results

Drama-Style Audio (10 min, 14 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 27.41% | 10 | 70.610s | 37.580s | 21.670s |
| VibeVoice + Demucs | 27.66% | 11 | 62.380s | 43.470s | 23.540s |
| NeMo Clustering | 48.37% | 4 | 111.480s | 103.250s | 11.520s |
| NeMo Clustering + Demucs | 49.28% | 4 | 113.315s | 102.810s | 14.370s |
| NeMo MSDD | 58.20% | 4 | 87.030s | 166.670s | 18.535s |
| NeMo MSDD + Demucs | 56.43% | 4 | 94.630s | 152.090s | 17.210s |
| pyannote | 49.51% | 8 | 57.844s | 157.042s | 16.705s |
| pyannote + Demucs | 52.54% | 9 | 73.272s | 154.827s | 17.632s |

All models exceeded DER 25% on drama audio. VibeVoice was the best performer at 27.41%, but this is not production-viable. Demucs provided no meaningful improvement.

Debate-Style Audio #1 (10 min, 10 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 9.19% | 7 | 45.590s | 4.100s | 4.120s |
| VibeVoice + Demucs | 9.31% | 8 | 49.600s | 0.760s | 4.120s |
| NeMo Clustering | 14.41% | 5 | 50.965s | 31.160s | 2.250s |
| NeMo Clustering + Demucs | 10.72% | 6 | 40.050s | 19.290s | 3.420s |
| NeMo MSDD | 18.84% | 5 | 58.060s | 47.460s | 4.770s |
| NeMo MSDD + Demucs | 15.44% | 6 | 48.765s | 38.620s | 2.980s |
| pyannote | 16.04% | 6 | 49.038s | 44.750s | 0.122s |
| pyannote + Demucs | 15.98% | 6 | 49.429s | 43.901s | 0.212s |

VibeVoice hit 9.19% — crossing the practical-use threshold. NeMo Clustering with Demucs reached 10.72%, closing the gap significantly.

Debate-Style Audio #2 (10 min, 6 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 15.52% | 7 | 75.340s | 16.210s | 0.980s |
| VibeVoice + Demucs | 14.62% | 7 | 69.850s | 16.300s | 1.040s |
| NeMo Clustering | 18.46% | 7 | 64.905s | 44.880s | 0.310s |
| NeMo Clustering + Demucs | 15.01% | 7 | 67.360s | 21.090s | 1.040s |
| pyannote | 10.16% | 6 | 9.791s | 50.496s | 0.299s |
| pyannote + Demucs | 10.05% | 6 | 10.076s | 49.660s | 0.204s |

A notable result: pyannote outperformed VibeVoice by 5 percentage points on this audio. The Confusion score (9.791s) is remarkably low — almost no speaker misidentification. This shows that model ranking can reverse depending on audio characteristics. There is no universal winner.


Deep Analysis

Why Is Drama So Much Harder?

1. Speaker count estimation failure

NeMo and pyannote estimated 4–8 speakers when the true count was 14. Automatic speaker count estimation is a major accuracy bottleneck — the number of clusters bounds what's even possible downstream.
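For intuition about how auto-estimation works, one standard technique in spectral-clustering diarizers is the eigengap heuristic: count the dominant eigenvalues of the segment-affinity matrix. Below is a toy version on synthetic, well-separated embeddings; NeMo's actual implementation is the more elaborate NME-SC algorithm, and real drama embeddings are far messier, which is exactly where this heuristic breaks down:

```python
import numpy as np

def estimate_num_speakers(embs: np.ndarray, max_speakers: int = 8) -> int:
    """Eigengap heuristic: the estimated speaker count is where the sorted
    eigenvalues of the cosine-affinity matrix drop the most."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    affinity = embs @ embs.T
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]   # descending
    gaps = eigvals[:max_speakers] - eigvals[1:max_speakers + 1]
    return int(np.argmax(gaps)) + 1

# Three well-separated synthetic "speakers", ten segments each
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64)) * 3
embs = np.vstack([c + rng.normal(scale=0.2, size=(10, 64)) for c in centers])
print(estimate_num_speakers(embs))
```

With 14 acted voices whose embeddings overlap and drift, the eigenvalue spectrum has no clean gap, and the estimate collapses to a handful of clusters, as seen in the drama results.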

2. Voice quality variation from acting

Actors deliberately vary pitch, emotion, and vocal character. Speaker embedding models trained on natural conversational speech are not calibrated for these intentional variations. The same actor in two emotional states may be assigned different speaker IDs.

3. Overlapping speech

Drama contains frequent scenes where multiple characters speak simultaneously. Clustering-based models have no native way to handle overlap — they must assign each time segment to exactly one speaker. Sortformer is the only tested model with overlap-aware design, though its 4-speaker limit makes it impractical here.

When Does Demucs Actually Help?

Based on the experiments, Demucs preprocessing is effective when all three conditions hold:

  1. Clustering-based model in use — the benefit flows through better speaker embedding quality
  2. Audio contains real background noise or BGM — there's actually something to remove
  3. Audio is dialogue-dominant — drama-style complexity introduces confounds that background removal can't address

Applying Demucs to VibeVoice or other models trained on real-world noisy data can degrade accuracy. Test on a held-out validation set before committing to this preprocessing step in production.


Reproducible Code

Running the Benchmark Pipeline

The full source is at https://github.com/y-dai20/speechlab. Here's the core structure:

Step 1: Background separation (optional)

```bash
pip install demucs
python -m demucs --two-stems=vocals input_audio.wav
# Output: separated/htdemucs/input_audio/vocals.wav
```

Step 2: NeMo Clustering Diarizer

```python
from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

# The YAML config points at the input manifest, output directory,
# embedding model, and clustering parameters.
cfg = OmegaConf.load("diarizer_config.yaml")
diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files to the configured output directory
```
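NeMo's diarizer does not take the audio path directly; the config's `diarizer.manifest_filepath` points at a JSON-lines manifest. Here is a minimal writer for one entry, following the manifest layout in NeMo's diarization documentation (field names may vary across NeMo versions, so treat this as a sketch and check your version's docs):

```python
import json

# One manifest entry per input file; null fields are allowed when the
# ground truth (speaker count, reference RTTM) is unknown.
entry = {
    "audio_filepath": "input_audio.wav",
    "offset": 0,
    "duration": None,       # None = process the full file
    "label": "infer",
    "text": "-",
    "num_speakers": None,   # set an int here if the true count is known
    "rttm_filepath": None,  # reference RTTM, only needed for scoring
    "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```

Setting `num_speakers` here is also how you pass a known speaker count to the pipeline, which (per the FAQ below) is one of the cheapest accuracy wins available.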

Step 3: Evaluate with pyannote.metrics

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# reference and hypothesis are pyannote.core.Annotation objects
# (e.g. loaded from RTTM files as in Step 4)
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, hypothesis)
print(f"DER: {der:.2%}")
```

Step 4: Parse RTTM output

```python
from pyannote.core import Annotation, Segment

def load_rttm(rttm_path):
    """Parse an RTTM file into a pyannote Annotation."""
    annotation = Annotation()
    with open(rttm_path) as f:
        for line in f:
            parts = line.strip().split()
            if not parts or parts[0] != "SPEAKER":
                continue  # skip blank lines and non-speaker records
            start = float(parts[3])     # segment onset (seconds)
            duration = float(parts[4])  # segment duration (seconds)
            speaker = parts[7]          # speaker label
            annotation[Segment(start, start + duration)] = speaker
    return annotation
```

RTTM (Rich Transcription Time Marked) is the standard file format for diarization output. Each line encodes a speaker segment: file ID, channel, start time, duration, and speaker label. Most diarization toolkits output and accept this format.
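Going the other direction, emitting a segment as RTTM is a one-line formatting job. The field layout below follows the standard RTTM SPEAKER record (type, file ID, channel, onset, duration, speaker label, with unused fields written as `<NA>`):

```python
def to_rttm_line(file_id: str, start: float, duration: float, speaker: str) -> str:
    """Format one RTTM SPEAKER record: type, file ID, channel, onset,
    duration, two unused fields, speaker label, two more unused fields."""
    return (f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

print(to_rttm_line("debate1", 12.340, 3.210, "speaker_02"))
# SPEAKER debate1 1 12.340 3.210 <NA> <NA> speaker_02 <NA> <NA>
```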


Model Selection Guide

For Debate, Meeting, and Podcast Audio

Primary choice: VibeVoice

  • DER 9.19% on real debate audio — production-ready
  • Built-in ASR integration gives you speaker-labeled transcripts in a single pipeline, with no separate ASR step

Cost-conscious or fully local: NeMo Clustering + Demucs

  • With preprocessing, reaches 10.72% — close to VibeVoice
  • Fully open-source, no API dependency
  • Demucs adds ~1–2 minutes of preprocessing time per 10-minute clip on GPU

For Drama and Video Content

No clear winner yet. VibeVoice leads at 27.41% but is not production-viable. Pending improvements:

  • NeMo Sortformer maturation — E2E overlap handling could address drama's core failure mode
  • Speaker Identification approach — pre-register voice profiles for known speakers, convert the task from unsupervised clustering to supervised search
  • Domain-specific fine-tuning — drama audio with ground-truth annotations for training

FAQ

Q: Does specifying the number of speakers in advance improve accuracy?

A: Yes, often significantly. Both NeMo and pyannote expose a num_speakers parameter. Automatic speaker count estimation is one of the biggest sources of error — if you know the true count, always pass it in. The catch: in real-world pipelines you usually don't know.

Q: Is Japanese speech harder to diarize than English?

A: In practice, yes — primarily because of training data volume and quality. VibeVoice was built on English and Chinese data; Japanese requires fine-tuning for good results. High-quality Japanese speaker diarization datasets are scarce across the industry, which affects every model tested here.

Q: Should I always apply Demucs preprocessing?

A: No. The effect is model-dependent. Clustering-based models (NeMo Clustering, MSDD) benefited by 3–4 points on debate audio. VibeVoice showed slight degradation. Run an A/B comparison on your target audio type before deciding.

Q: What's the difference between pyannote OSS and pyannote precise?

A: This benchmark used the open-source version. The paid API ("pyannote precise") uses additional fine-tuning and larger training data — accuracy gains are expected but not verified in this experiment. Testing both on your specific domain is recommended.

Q: Is real-time diarization possible with these models?

A: NeMo Clustering is offline-only. VibeVoice is exploring streaming, but batch processing of recorded audio is the primary supported use case. Real-time diarization at production quality remains an open research problem.

Q: Which model handles overlapping speech best?

A: NeMo Sortformer is the only tested model with explicit overlap detection. However, its current 4-speaker limit makes it impractical for most real content. All other models assign each time window to exactly one speaker, which makes simultaneous speech a structural blind spot.

Q: What's the fastest path to better accuracy on my specific audio?

A: In priority order: ① specify the speaker count if known, ② add Demucs preprocessing if using a clustering-based model, ③ fine-tune on domain-matched data, ④ switch to Speaker Identification if you have known speaker profiles.


Conclusion

Speaker diarization accuracy is fundamentally different between clean dialogue (debates, meetings) and complex media (drama, TV). The key findings from this benchmark:

  1. VibeVoice achieves DER 9.19% on debate audio — currently the strongest performer, with ASR integration as an additional advantage
  2. Demucs preprocessing is architecture-dependent — 3–4 point DER improvement for clustering-based NeMo models, but a slight degradation for VibeVoice
  3. Drama-style audio has no production-ready solution yet — failure is multi-causal (speaker count, voice variation, overlap), and background separation alone doesn't fix it
  4. pyannote can outperform VibeVoice on specific audio — model ranking reverses between Audio B and Audio C, confirming there is no universal winner

The practical takeaway: evaluate on audio that matches your actual production data. A model that ranks first on a published benchmark may rank third on your content.

Full source code: https://github.com/y-dai20/speechlab
