Speaker Diarization Benchmark: NeMo vs VibeVoice vs Pyannote — Which Model Actually Works?

Summary

I tested three major speaker diarization frameworks — NVIDIA NeMo (Clustering, MSDD, Sortformer), Microsoft VibeVoice, and pyannote.audio — against real-world audio: a 10-minute drama clip with 14 speakers and two 10-minute debate clips with 6–10 speakers.

Key findings:

  • VibeVoice achieved DER 9.19% on debate-style audio — production-ready accuracy
  • Drama-style audio remains unsolved: best result was 27.41% (VibeVoice), still too high for practical use
  • Background sound separation via Demucs improved NeMo Clustering by 3.69 points on debate audio — but slightly hurt VibeVoice
  • Model architecture determines whether audio preprocessing helps or harms accuracy

This is a first-hand experiment report with full DER breakdowns, reproducible code, and practical model selection guidance based on real failure analysis.


What Is Speaker Diarization?

Speaker diarization is the task of answering: "Who spoke when?" — segmenting an audio stream by speaker identity across time.

It's distinct from automatic speech recognition (ASR): ASR answers "What was said?", while diarization answers "Who said it?"

Common use cases:

  • Automatic meeting minutes with speaker attribution
  • Subtitle/dubbing pipelines for video content
  • Call center conversation analytics
  • Multi-speaker podcast transcription

DER: The Standard Accuracy Metric

DER (Diarization Error Rate) is the primary evaluation metric:

```text
DER = (Confusion + Missed Detection + False Alarm) / Total Reference Duration
```
| Component | Meaning |
|---|---|
| Confusion | Audio assigned to the wrong speaker |
| Missed Detection | Speech that wasn't detected at all |
| False Alarm | Non-speech incorrectly labeled as speech |

Lower is better. DER below 10% is generally considered production-viable.
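As a quick sanity check, the formula is a one-liner (the numbers below are toy values, not figures from this benchmark):

```python
def der(confusion_s: float, missed_s: float, false_alarm_s: float,
        total_reference_s: float) -> float:
    """Diarization Error Rate: total error seconds over total reference speech."""
    return (confusion_s + missed_s + false_alarm_s) / total_reference_s

# 30s confusion + 10s missed + 10s false alarm over 500s of reference speech
print(f"DER: {der(30, 10, 10, 500):.2%}")  # DER: 10.00%
```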

All evaluations in this post use collar=0.0 and skip_overlap=False — the strictest standard settings, with no forgiveness window around speaker boundaries and overlap regions fully counted.

collar is a tolerance window (in seconds) applied around speaker-change boundaries. Setting it to 0.0 means even a 50ms misalignment counts as an error. Many published benchmarks use collar=0.25, which makes numbers look better but hides real boundary detection problems.
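To make the collar concrete, here is a deliberately simplified sketch (the real metric excises a window of ±collar seconds around each reference boundary from scoring entirely; this toy just measures how much boundary misalignment survives the tolerance):

```python
def boundary_error(ref_t: float, hyp_t: float, collar: float) -> float:
    """Seconds of boundary misalignment counted as error when a
    tolerance of +/- collar seconds is allowed around the reference."""
    return max(0.0, abs(hyp_t - ref_t) - collar)

# A hypothesis boundary 50 ms after the reference boundary at t=10.0 s:
print(f"{boundary_error(10.00, 10.05, collar=0.0):.2f}")   # 0.05 -> penalized
print(f"{boundary_error(10.00, 10.05, collar=0.25):.2f}")  # 0.00 -> forgiven
```

The same 50 ms slip that costs nothing at collar=0.25 is fully counted at collar=0.0, which is why strict settings expose boundary-detection quality.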


Test Environment

| Item | Detail |
|---|---|
| Audio A | Drama-style, 10 min, 14 speakers |
| Audio B | Debate-style #1, 10 min, 10 speakers |
| Audio C | Debate-style #2, 10 min, 6 speakers |
| Metric | DER (collar=0.0, skip_overlap=False) |
| Background separation tool | Demucs |
| Source code | https://github.com/y-dai20/speechlab |

The audio was sourced from Japanese-language video content. This matters because most models are trained primarily on English and Chinese — Japanese is inherently harder for these models without fine-tuning.


Models Tested

NVIDIA NeMo — Three Approaches in One Toolkit

NeMo ships with three distinct diarization methods, each with different tradeoffs.

① NeMo Clustering Diarizer

The standard pipeline-style approach. Speaker embeddings are extracted per audio segment, then clustered to assign speaker labels. Simple architecture, stable across different audio types, and reliable on clean dialogue. This is the go-to baseline for meeting and debate audio.
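The pipeline can be sketched in a few lines. Toy 3-dimensional embeddings and a greedy cosine-similarity threshold stand in here for real speaker embeddings (e.g. TitaNet's 192-dim vectors) and NeMo's actual spectral clustering, so treat this as an illustration of the idea, not NeMo's algorithm:

```python
import numpy as np

def cluster_embeddings(embs: np.ndarray, threshold: float = 0.7) -> list[int]:
    """Greedy clustering sketch: assign each segment embedding to the first
    cluster whose anchor is cosine-similar enough, else open a new cluster
    (each cluster index plays the role of a speaker label)."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    anchors: list[np.ndarray] = []
    labels: list[int] = []
    for e in embs:
        sims = [float(e @ a) for a in anchors]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            anchors.append(e)
            labels.append(len(anchors) - 1)
    return labels

# Two synthetic "speakers": segment embeddings near two different directions
rng = np.random.default_rng(0)
spk_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(3, 3))
spk_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(3, 3))
print(cluster_embeddings(np.vstack([spk_a, spk_b])))  # [0, 0, 0, 1, 1, 1]
```

The key property this illustrates: accuracy depends entirely on same-speaker embeddings landing close together, which is why embedding contamination (see the Demucs section) matters so much.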

② NeMo MSDD (Multi-Scale Diarization Decoder)

MSDD uses clustering to generate a rough initial segmentation, then refines boundaries using a neural decoder. In theory it's more accurate — in practice it can be unstable on audio that doesn't match its training distribution.

③ NeMo Sortformer (End-to-End)

A fully end-to-end architecture that skips clustering entirely. Its key advantage is native overlap detection — it can handle simultaneous speech. The current limitation: maximum 4 speakers. This makes it impractical for drama or large-panel content for now.

Microsoft VibeVoice — Integrated ASR + Diarization

VibeVoice handles ASR, timestamps, and speaker diarization in a single unified model. Because it was trained primarily on English and Chinese data, Japanese accuracy out of the box is lower — but fine-tuning on Japanese data significantly improves results.

All VibeVoice results in this benchmark are without fine-tuning — vanilla model weights on Japanese audio.

For fine-tuning VibeVoice with Japanese data: synthetic conversation datasets generated via GPT + Qwen work well. Key insight — don't use plain dialogue transcripts. Randomly inject overlaps, back-channels (aizuchi), and interruptions to simulate realistic speech. Clean scripted dialogue produces AI-sounding artifacts that hurt generalization.

pyannote.audio — The Open-Source Standard

pyannote.audio is the de facto open-source diarization library, with tight HuggingFace integration for easy local inference. A paid API ("pyannote precise") is also available for production use.


Background Sound Separation with Demucs

What Is Demucs?

Demucs is a deep learning audio source separation model developed by Meta (Facebook AI). It splits an audio file into four stems:

  • bass — low-frequency content
  • drums — percussion, often mixed with sound effects and ambient noise
  • other — other instruments
  • vocals — speech/singing only

The hypothesis: pass only the vocals stem to the diarization model, removing BGM and environmental noise that contaminates speaker embeddings.

```bash
pip install demucs
python -m demucs --two-stems=vocals your_audio.wav
# Outputs vocals.wav and no_vocals.wav under separated/htdemucs/your_audio/
```

Critical Finding: Demucs Helps Some Models, Hurts Others

This is the most important insight from the experiment: the effect of background sound separation is architecture-dependent.

VibeVoice: No improvement, slight degradation

VibeVoice was trained on real-world audio that includes background noise. Feeding it an overly clean vocals-only track moves the input out of its training distribution. On debate audio, DER actually increased slightly (9.19% → 9.31%) after separation.

Hypothesis: VibeVoice expects noisy, real-world input. Removing all background audio creates an unnaturally sterile signal that the model isn't calibrated for.

NeMo Clustering & MSDD: Clear improvement on debate audio

Clustering-based models rely heavily on the quality of speaker embeddings. When BGM or environmental noise contaminates the audio, those embeddings become inconsistent — the same speaker sounds different across segments, and clustering misassigns labels.

Demucs removes that contamination, producing cleaner embeddings and better cluster separation.
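The contamination effect is easy to reproduce with synthetic embeddings (pure numpy, illustrative magnitudes only): adding a different loud background component to each segment of the same speaker pulls their cosine similarity down, which is exactly what scatters one speaker across multiple clusters.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
speaker = rng.normal(size=64) * 3                  # one speaker's "voice direction"
seg1 = speaker + rng.normal(scale=0.3, size=64)    # two segments of that speaker
seg2 = speaker + rng.normal(scale=0.3, size=64)

# BGM varies over time, so each segment picks up a different noise component
bgm1 = rng.normal(size=64) * 4
bgm2 = rng.normal(size=64) * 4

print(f"clean same-speaker similarity:        {cos(seg1, seg2):.2f}")         # high
print(f"contaminated same-speaker similarity: {cos(seg1 + bgm1, seg2 + bgm2):.2f}")  # much lower
```

Removing the background components (which is what Demucs does, far less crudely) restores the high same-speaker similarity.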

| Model | Normal DER | After Demucs | Improvement |
|---|---|---|---|
| NeMo Clustering | 14.41% | 10.72% | ▲3.69pt |
| NeMo MSDD | 18.84% | 15.44% | ▲3.40pt |
| VibeVoice | 9.19% | 9.31% | ▼0.12pt (worse) |
| pyannote | 16.04% | 15.98% | ▲0.06pt (negligible) |

Don't blindly apply Demucs as a preprocessing step to all models. Always A/B test. For VibeVoice and models trained on noisy real-world data, it may hurt more than help.


Full Benchmark Results

Drama-Style Audio (10 min, 14 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 27.41% | 10 | 70.610s | 37.580s | 21.670s |
| VibeVoice + Demucs | 27.66% | 11 | 62.380s | 43.470s | 23.540s |
| NeMo Clustering | 48.37% | 4 | 111.480s | 103.250s | 11.520s |
| NeMo Clustering + Demucs | 49.28% | 4 | 113.315s | 102.810s | 14.370s |
| NeMo MSDD | 58.20% | 4 | 87.030s | 166.670s | 18.535s |
| NeMo MSDD + Demucs | 56.43% | 4 | 94.630s | 152.090s | 17.210s |
| pyannote | 49.51% | 8 | 57.844s | 157.042s | 16.705s |
| pyannote + Demucs | 52.54% | 9 | 73.272s | 154.827s | 17.632s |

All models exceeded DER 25% on drama audio. VibeVoice was the best performer at 27.41%, but this is not production-viable. Demucs provided no meaningful improvement.

Debate-Style Audio #1 (10 min, 10 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 9.19% | 7 | 45.590s | 4.100s | 4.120s |
| VibeVoice + Demucs | 9.31% | 8 | 49.600s | 0.760s | 4.120s |
| NeMo Clustering | 14.41% | 5 | 50.965s | 31.160s | 2.250s |
| NeMo Clustering + Demucs | 10.72% | 6 | 40.050s | 19.290s | 3.420s |
| NeMo MSDD | 18.84% | 5 | 58.060s | 47.460s | 4.770s |
| NeMo MSDD + Demucs | 15.44% | 6 | 48.765s | 38.620s | 2.980s |
| pyannote | 16.04% | 6 | 49.038s | 44.750s | 0.122s |
| pyannote + Demucs | 15.98% | 6 | 49.429s | 43.901s | 0.212s |

VibeVoice hit 9.19% — crossing the practical-use threshold. NeMo Clustering with Demucs reached 10.72%, closing the gap significantly.

Debate-Style Audio #2 (10 min, 6 speakers)

| Model | DER | Est. Speakers | Confusion | Missed | FA |
|---|---|---|---|---|---|
| VibeVoice | 15.52% | 7 | 75.340s | 16.210s | 0.980s |
| VibeVoice + Demucs | 14.62% | 7 | 69.850s | 16.300s | 1.040s |
| NeMo Clustering | 18.46% | 7 | 64.905s | 44.880s | 0.310s |
| NeMo Clustering + Demucs | 15.01% | 7 | 67.360s | 21.090s | 1.040s |
| pyannote | 10.16% | 6 | 9.791s | 50.496s | 0.299s |
| pyannote + Demucs | 10.05% | 6 | 10.076s | 49.660s | 0.204s |

A notable result: pyannote outperformed VibeVoice by 5 percentage points on this audio. The Confusion score (9.791s) is remarkably low — almost no speaker misidentification. This shows that model ranking can reverse depending on audio characteristics. There is no universal winner.


Deep Analysis

Why Is Drama So Much Harder?

1. Speaker count estimation failure

NeMo and pyannote estimated 4–8 speakers when the true count was 14. Automatic speaker count estimation is a major accuracy bottleneck — the number of clusters bounds what's even possible downstream.
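For intuition about how auto-estimation works, one standard technique in spectral-clustering diarizers is the eigengap heuristic: count the dominant eigenvalues of the segment-affinity matrix. Below is a toy version on synthetic, well-separated embeddings; NeMo's actual implementation is the more elaborate NME-SC algorithm, and real drama embeddings are far messier, which is exactly where this heuristic breaks down:

```python
import numpy as np

def estimate_num_speakers(embs: np.ndarray, max_speakers: int = 8) -> int:
    """Eigengap heuristic: the estimated speaker count is where the sorted
    eigenvalues of the cosine-affinity matrix drop the most."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    affinity = embs @ embs.T
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]   # descending
    gaps = eigvals[:max_speakers] - eigvals[1:max_speakers + 1]
    return int(np.argmax(gaps)) + 1

# Three well-separated synthetic "speakers", ten segments each
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64)) * 3
embs = np.vstack([c + rng.normal(scale=0.2, size=(10, 64)) for c in centers])
print(estimate_num_speakers(embs))
```

With 14 acted voices whose embeddings overlap and drift, the eigenvalue spectrum has no clean gap, and the estimate collapses to a handful of clusters, as seen in the drama results.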

2. Voice quality variation from acting

Actors deliberately vary pitch, emotion, and vocal character. Speaker embedding models trained on natural conversational speech are not calibrated for these intentional variations. The same actor in two emotional states may be assigned different speaker IDs.

3. Overlapping speech

Drama contains frequent scenes where multiple characters speak simultaneously. Clustering-based models have no native way to handle overlap — they must assign each time segment to exactly one speaker. Sortformer is the only tested model with overlap-aware design, though its 4-speaker limit makes it impractical here.

When Does Demucs Actually Help?

Based on the experiments, Demucs preprocessing is effective when all three conditions hold:

  1. Clustering-based model in use — the benefit flows through better speaker embedding quality
  2. Audio contains real background noise or BGM — there's actually something to remove
  3. Audio is dialogue-dominant — drama-style complexity introduces confounds that background removal can't address

Applying Demucs to VibeVoice or other models trained on real-world noisy data can degrade accuracy. Test on a held-out validation set before committing to this preprocessing step in production.


Reproducible Code

Running the Benchmark Pipeline

The full source is at https://github.com/y-dai20/speechlab. Here's the core structure:

Step 1: Background separation (optional)

```bash
pip install demucs
python -m demucs --two-stems=vocals input_audio.wav
# Output: separated/htdemucs/input_audio/vocals.wav
```

Step 2: NeMo Clustering Diarizer

```python
from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

# The YAML config points at the input manifest, output directory,
# embedding model, and clustering parameters.
cfg = OmegaConf.load("diarizer_config.yaml")
diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files to the configured output directory
```
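NeMo's diarizer does not take the audio path directly; the config's `diarizer.manifest_filepath` points at a JSON-lines manifest. Here is a minimal writer for one entry, following the manifest layout in NeMo's diarization documentation (field names may vary across NeMo versions, so treat this as a sketch and check your version's docs):

```python
import json

# One manifest entry per input file; null fields are allowed when the
# ground truth (speaker count, reference RTTM) is unknown.
entry = {
    "audio_filepath": "input_audio.wav",
    "offset": 0,
    "duration": None,       # None = process the full file
    "label": "infer",
    "text": "-",
    "num_speakers": None,   # set an int here if the true count is known
    "rttm_filepath": None,  # reference RTTM, only needed for scoring
    "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```

Setting `num_speakers` here is also how you pass a known speaker count to the pipeline, which (per the FAQ below) is one of the cheapest accuracy wins available.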

Step 3: Evaluate with pyannote.metrics

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# reference and hypothesis are pyannote.core.Annotation objects
# (e.g. loaded from RTTM files as in Step 4)
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, hypothesis)
print(f"DER: {der:.2%}")
```

Step 4: Parse RTTM output

```python
from pyannote.core import Annotation, Segment

def load_rttm(rttm_path):
    """Parse an RTTM file into a pyannote Annotation."""
    annotation = Annotation()
    with open(rttm_path) as f:
        for line in f:
            parts = line.strip().split()
            if not parts or parts[0] != "SPEAKER":
                continue  # skip blank lines and non-speaker records
            start = float(parts[3])     # segment onset (seconds)
            duration = float(parts[4])  # segment duration (seconds)
            speaker = parts[7]          # speaker label
            annotation[Segment(start, start + duration)] = speaker
    return annotation
```

RTTM (Rich Transcription Time Marked) is the standard file format for diarization output. Each line encodes a speaker segment: file ID, channel, start time, duration, and speaker label. Most diarization toolkits output and accept this format.
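Going the other direction, emitting a segment as RTTM is a one-line formatting job. The field layout below follows the standard RTTM SPEAKER record (type, file ID, channel, onset, duration, speaker label, with unused fields written as `<NA>`):

```python
def to_rttm_line(file_id: str, start: float, duration: float, speaker: str) -> str:
    """Format one RTTM SPEAKER record: type, file ID, channel, onset,
    duration, two unused fields, speaker label, two more unused fields."""
    return (f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

print(to_rttm_line("debate1", 12.340, 3.210, "speaker_02"))
# SPEAKER debate1 1 12.340 3.210 <NA> <NA> speaker_02 <NA> <NA>
```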


Model Selection Guide

For Debate, Meeting, and Podcast Audio

Primary choice: VibeVoice

  • DER 9.19% on real debate audio — production-ready
  • Built-in ASR integration gives you speaker-labeled transcripts in a single pipeline, with no separate ASR step

Cost-conscious or fully local: NeMo Clustering + Demucs

  • With preprocessing, reaches 10.72% — close to VibeVoice
  • Fully open-source, no API dependency
  • Demucs adds ~1–2 minutes of preprocessing time per 10-minute clip on GPU

For Drama and Video Content

No clear winner yet. VibeVoice leads at 27.41% but is not production-viable. Pending improvements:

  • NeMo Sortformer maturation — E2E overlap handling could address drama's core failure mode
  • Speaker Identification approach — pre-register voice profiles for known speakers, convert the task from unsupervised clustering to supervised search
  • Domain-specific fine-tuning — drama audio with ground-truth annotations for training

FAQ

Q: Does specifying the number of speakers in advance improve accuracy?

A: Yes, often significantly. Both NeMo and pyannote expose a num_speakers parameter. Automatic speaker count estimation is one of the biggest sources of error — if you know the true count, always pass it in. The catch: in real-world pipelines you usually don't know.

Q: Is Japanese speech harder to diarize than English?

A: In practice, yes — primarily because of training data volume and quality. VibeVoice was built on English and Chinese data; Japanese requires fine-tuning for good results. High-quality Japanese speaker diarization datasets are scarce across the industry, which affects every model tested here.

Q: Should I always apply Demucs preprocessing?

A: No. The effect is model-dependent. Clustering-based models (NeMo Clustering, MSDD) benefited by 3–4 points on debate audio. VibeVoice showed slight degradation. Run an A/B comparison on your target audio type before deciding.

Q: What's the difference between pyannote OSS and pyannote precise?

A: This benchmark used the open-source version. The paid API ("pyannote precise") uses additional fine-tuning and larger training data — accuracy gains are expected but not verified in this experiment. Testing both on your specific domain is recommended.

Q: Is real-time diarization possible with these models?

A: NeMo Clustering is offline-only. VibeVoice is exploring streaming, but batch processing of recorded audio is the primary supported use case. Real-time diarization at production quality remains an open research problem.

Q: Which model handles overlapping speech best?

A: NeMo Sortformer is the only tested model with explicit overlap detection. However, its current 4-speaker limit makes it impractical for most real content. All other models assign each time window to exactly one speaker, which makes simultaneous speech a structural blind spot.

Q: What's the fastest path to better accuracy on my specific audio?

A: In priority order: ① specify the speaker count if known, ② add Demucs preprocessing if using a clustering-based model, ③ fine-tune on domain-matched data, ④ switch to Speaker Identification if you have known speaker profiles.


Conclusion

Speaker diarization accuracy is fundamentally different between clean dialogue (debates, meetings) and complex media (drama, TV). The key findings from this benchmark:

  1. VibeVoice achieves DER 9.19% on debate audio — currently the strongest performer, with ASR integration as an additional advantage
  2. Demucs preprocessing is architecture-dependent — 3–4 point DER improvement for clustering-based NeMo models, but a slight degradation for VibeVoice
  3. Drama-style audio has no production-ready solution yet — failure is multi-causal (speaker count, voice variation, overlap), and background separation alone doesn't fix it
  4. pyannote can outperform VibeVoice on specific audio — model ranking reverses between Audio B and Audio C, confirming there is no universal winner

The practical takeaway: evaluate on audio that matches your actual production data. A model that ranks first on a published benchmark may rank third on your content.

Full source code: https://github.com/y-dai20/speechlab
