Speaker Diarization Evaluation Metrics: Beyond DER — A Practical Diagnostic Guide
TL;DR — Read This First
Evaluating speaker diarization systems requires choosing the right metric for what you actually want to measure. DER is the industry standard, but it cannot diagnose why a system fails. To systematically improve a real system, combine DER with Confusion Matrix, Purity/Coverage, and Boundary Error for proper root-cause analysis. This article covers calculation formulas, working Python code, and real experimental data — not just textbook definitions.
Speaker diarization — automatically determining "who spoke when" — is a foundational task in speech processing. It powers transcription pipelines, meeting analytics, call center QA, and medical interview documentation. Yet when engineers actually sit down to evaluate a diarization system, a common experience is confusion: "What's the difference between DER and JER? Why is my DER low but the output still sounds wrong? Which metric should I even optimize?"
This guide answers those questions with precision. We cover six key evaluation metrics: their mathematical definitions, Python implementations using pyannote.metrics, and — critically — how to read the numbers together to diagnose what is actually broken in your system.
All experimental figures in this article come from first-hand benchmarking on a 15-speaker meeting recording under strict evaluation conditions (collar=0.0, skip_overlap=False).
1. DER (Diarization Error Rate) and JER (Jaccard Error Rate)
What Is DER?
DER is the de facto standard metric for speaker diarization, standardized and popularized through the NIST Rich Transcription (RT) evaluation benchmarks. It measures the total proportion of time that is incorrectly labeled.
| Component | Definition |
|---|---|
| FA (False Alarm) | Duration where the system outputs a speaker label but no one is actually speaking |
| MISS (Missed Detection) | Duration where someone is speaking but the system outputs no label |
| CONF (Confusion) | Duration where a speaker is correctly detected but assigned the wrong speaker label |
Key Terminology
- Speaker Diarization: The task of automatically segmenting an audio stream by speaker identity — answering the question "who spoke when."
- Reference (Ground Truth): Human-annotated speaker labels used as the gold standard.
- Hypothesis: The system's predicted speaker labels.
- Collar: A tolerance window (typically 250 ms) around reference boundaries that is excluded from scoring. Errors within the collar are not penalized.
- Optimal Mapping: Because system-assigned speaker IDs (e.g., `SPEAKER_1`) do not correspond to reference IDs (e.g., `SPEAKER_A`), DER uses the Hungarian algorithm to find the best one-to-one label mapping before scoring.
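The mapping step can be sketched with SciPy's linear_sum_assignment (an assumption for illustration — pyannote.metrics performs this step internally, and the overlap numbers below are invented):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = reference speakers, columns = hypothesis speakers;
# each cell = seconds of overlapping speech (hypothetical numbers).
overlap = np.array([
    [9.0, 1.0],   # SPEAKER_A vs (SPEAKER_1, SPEAKER_2)
    [0.5, 4.5],   # SPEAKER_B vs (SPEAKER_1, SPEAKER_2)
])

# The Hungarian algorithm minimizes cost, so negate to maximize matched time
row_idx, col_idx = linear_sum_assignment(-overlap)
mapping = {
    name: f"SPEAKER_{int(c) + 1}"
    for name, c in zip(["SPEAKER_A", "SPEAKER_B"], col_idx)
}
print(mapping)  # {'SPEAKER_A': 'SPEAKER_1', 'SPEAKER_B': 'SPEAKER_2'}
```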
Python Implementation
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.core import Annotation, Segment
# Ground truth annotation
reference = Annotation()
reference[Segment(0, 5)] = 'SPEAKER_A'
reference[Segment(5, 10)] = 'SPEAKER_B'
reference[Segment(10, 15)] = 'SPEAKER_A'
# System output
hypothesis = Annotation()
hypothesis[Segment(0, 4)] = 'SPEAKER_1' # Corresponds to SPEAKER_A
hypothesis[Segment(4, 11)] = 'SPEAKER_2' # Partially mixed
hypothesis[Segment(11, 15)] = 'SPEAKER_1'
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER: {der:.3f}") # e.g., 0.133 (13.3%)
# Decompose into components
detail = metric(reference, hypothesis, detailed=True)
print(f" Miss: {detail['missed detection']:.3f}")
print(f" FA: {detail['false alarm']:.3f}")
print(f" Conf: {detail['confusion']:.3f}")
On the Collar Setting
Most published DER results use collar=0.25 (250 ms exclusion around boundaries). Some strict benchmarks such as DIHARD use collar=0.0. The difference between these two settings can shift DER by 3–5 percentage points on the same system output. Always report your collar setting explicitly — it is not optional metadata.
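To build intuition for why the collar matters, here is a back-of-envelope calculation (the boundary count and durations are hypothetical, and it ignores overlapping collar windows):

```python
# How much audio does a collar exclude from scoring?
n_boundaries = 200     # reference segment boundaries in the recording (hypothetical)
collar = 0.25          # seconds excluded on each side of every boundary
total_audio = 1000.0   # seconds

# Upper bound: adjacent collar windows may overlap and shrink this figure
excluded = min(total_audio, n_boundaries * 2 * collar)
print(f"Up to {excluded:.0f} s ({excluded / total_audio:.0%}) of audio is unscored")
# → Up to 100 s (10%) of audio is unscored
```

With that much audio excluded near boundaries, a system with sloppy boundary placement can look considerably better than it is, which is exactly why strict benchmarks choose collar=0.0.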
What Is JER?
JER was introduced in the DIHARD II challenge to address a known weakness of DER: DER is dominated by speakers who talk the most. In a meeting where Speaker A talks for 900 seconds and Speaker B talks for 100 seconds, completely mis-labeling Speaker B barely moves DER. JER corrects for this.
JER = (1/|S|) × Σ_{i ∈ S} ( 1 − |R_i ∩ H_i| / |R_i ∪ H_i| )

Where:
- S: the set of all reference speakers
- R_i: the reference time intervals for speaker i
- H_i: the hypothesis intervals mapped to speaker i under the optimal mapping
This is essentially the complement of the per-speaker Jaccard similarity, averaged equally across all speakers regardless of their total speaking time.
# DER example: SPEAKER_A = 900s, SPEAKER_B = 100s
# Completely mislabeling SPEAKER_B → DER impact is small (~10%)
# JER treats both speakers equally → JER impact is large (~50%)
from pyannote.metrics.diarization import JaccardErrorRate
jer = JaccardErrorRate()
jer_score = jer(reference, hypothesis)
print(f"JER: {jer_score:.3f}")
When to use which:
- Use DER for standard benchmark comparisons; it is what most papers report.
- Use JER when fairness across speakers matters — multi-speaker meetings with unequal talk time, or any scenario where minority-speaker accuracy is important.
- In a 2-speaker scenario, JER and DER converge closely. JER becomes most informative at 4+ speakers with imbalanced talk time.
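The 900 s / 100 s example above can be worked through by hand. A minimal arithmetic sketch (no pyannote needed, assuming the system lumps all audio into one cluster that the optimal mapping assigns to Speaker A):

```python
# Toy scenario: A talks 900 s, B talks 100 s, and the system outputs a
# single cluster that the optimal mapping assigns to A.
total = 1000.0
ref_a, ref_b = 900.0, 100.0

# DER is time-weighted: only B's 100 s counts as confusion.
der = (0.0 + 0.0 + ref_b) / total   # (FA + MISS + CONF) / total

# JER averages per-speaker Jaccard error with equal weight.
jer_a = 1 - ref_a / total           # A's cluster also absorbed B's 100 s
jer_b = 1 - 0.0 / ref_b             # B has no matching cluster at all
jer = (jer_a + jer_b) / 2

print(f"DER: {der:.2f}")   # 0.10: barely penalized
print(f"JER: {jer:.2f}")   # 0.55: heavily penalized
```

The same mistake costs 10 DER points but 55 JER points, which is the fairness property the metric was designed for.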
2. Speaker Confusion Matrix
The confusion matrix answers: "Which speaker is being confused with which?" It is the primary tool for decomposing the CONF component of DER into actionable information.
Calculation
The key implementation detail is computing overlap between each reference speaker segment and each hypothesis segment. The result is a matrix of overlapping durations, from which you derive the dominant mapping and dominant share per reference speaker.
def compute_speaker_confusion_matrix(
reference: Annotation,
hypothesis: Annotation,
) -> dict:
ref_labels = sorted(map(str, reference.labels()))
hyp_labels = sorted(map(str, hypothesis.labels()))
# Compute pairwise overlap durations (seconds)
overlap_by_pair = {
ref_label: {hyp_label: 0.0 for hyp_label in hyp_labels}
for ref_label in ref_labels
}
for ref_seg, _, ref_label in reference.itertracks(yield_label=True):
for hyp_seg, _, hyp_label in hypothesis.itertracks(yield_label=True):
intersection = ref_seg & hyp_seg
if intersection:
overlap_by_pair[str(ref_label)][str(hyp_label)] += float(
intersection.duration
)
# Summarize per reference speaker
per_reference_speaker = {}
for ref_label, row in overlap_by_pair.items():
ref_duration = sum(
float(seg.duration)
for seg, _, lbl in reference.itertracks(yield_label=True)
if str(lbl) == ref_label
)
overlaps = sorted(
[
{
"hyp_speaker": h,
"overlap_sec": s,
"share": s / ref_duration if ref_duration > 0 else 0.0
}
for h, s in row.items() if s > 0.0
],
key=lambda x: -x["overlap_sec"],
)
dominant = overlaps[0] if overlaps else None
per_reference_speaker[ref_label] = {
"ref_duration_sec": ref_duration,
"hyp_count": len(overlaps),
"dominant": dominant["hyp_speaker"] if dominant else None,
"dominant_share": dominant["share"] if dominant else 0.0,
"overlaps": overlaps,
}
return {
"seconds": overlap_by_pair,
"per_reference_speaker": per_reference_speaker
}
Reading the Output
GT 6: hyp_count=4 dominant=speaker_9 share=49.6%
overlaps=[speaker_9:49.6%, speaker_7:43.9%, speaker_10:3.4%, speaker_1:1.3%]
- `dominant_share` below 50%: This reference speaker's time is split roughly equally between two hypothesis clusters. The system cannot consistently decide which cluster they belong to.
- `hyp_count` above 3: Severe fragmentation — one person's voice has been split into 3+ clusters.
- Shared `hyp_speaker` across multiple `GT` speakers: That hypothesis cluster is acting as a "garbage collector," absorbing fragments from multiple real speakers.
3. Fragmentation, Purity, and Coverage
While DER measures quantity of error, Purity and Coverage measure quality of clustering structure.
Definitions
- Fragmentation: A single reference speaker's speech is spread across multiple hypothesis clusters.
- Purity: Each hypothesis cluster contains speech from primarily one speaker (no mixing).
- Coverage: Each reference speaker's speech is captured by primarily one hypothesis cluster (no splitting).
Formulas

Cluster Purity

Purity = (1/T) × Σ_c max_s |c ∩ s|

Coverage

Coverage = (1/T) × Σ_s max_c |s ∩ c|

Where T is the total audio duration, c ranges over hypothesis clusters, s ranges over reference speakers, and |c ∩ s| is the duration of audio shared by cluster c and reference speaker s.
from pyannote.metrics.diarization import (
DiarizationPurity,
DiarizationCoverage,
)
# Always align collar/skip_overlap settings with your DER configuration
collar = 0.0
skip_overlap = False
purity_metric = DiarizationPurity(collar=collar, skip_overlap=skip_overlap)
coverage_metric = DiarizationCoverage(collar=collar, skip_overlap=skip_overlap)
purity = float(purity_metric(reference, hypothesis))
coverage = float(coverage_metric(reference, hypothesis))
print(f"Purity: {purity:.3f}") # 1.0 is ideal
print(f"Coverage: {coverage:.3f}") # 1.0 is ideal
Diagnostic Interpretation
| Pattern | Interpretation |
|---|---|
| High Purity, Low Coverage | Fragmentation dominant — each cluster is pure, but one speaker is split into many clusters |
| Low Purity, High Coverage | Merging dominant — clusters contain multiple speakers; the system under-segments |
| Both high | Good structure, DER errors likely from boundary imprecision |
| Both low | Systemic failure in both clustering and VAD |
Use Purity/Coverage First for Debugging
Before tuning hyperparameters, check whether you have a fragmentation problem (split) or a merging problem (merge). These require opposite interventions — more aggressive clustering vs. less aggressive clustering. Getting this diagnosis wrong wastes significant engineering effort.
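The split-vs-merge signatures in the table above can be reproduced with a few lines of plain Python. This is a simplified time-weighted purity/coverage sketch, not pyannote's exact implementation, and the labels and intervals are invented:

```python
def purity_coverage(ref, hyp, total):
    """ref/hyp map labels to lists of (start, end); returns time-weighted scores."""
    def overlap(a, b):
        return sum(
            max(0.0, min(e1, e2) - max(s1, s2))
            for s1, e1 in a for s2, e2 in b
        )
    purity = sum(
        max(overlap(h, r) for r in ref.values()) for h in hyp.values()
    ) / total
    coverage = sum(
        max(overlap(r, h) for h in hyp.values()) for r in ref.values()
    ) / total
    return purity, coverage

# Fragmentation: one speaker split into two pure clusters
print(purity_coverage({"A": [(0, 10)]},
                      {"h1": [(0, 5)], "h2": [(5, 10)]}, 10.0))  # (1.0, 0.5)

# Merging: two speakers absorbed into one cluster
print(purity_coverage({"A": [(0, 5)], "B": [(5, 10)]},
                      {"h1": [(0, 10)]}, 10.0))                  # (0.5, 1.0)
```

The two failure modes produce mirror-image scores, which is what makes the Purity/Coverage pair diagnostic rather than just descriptive.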
4. Count Error (Speaker Count Estimation)
Count Error measures how accurately the system estimates the number of active speakers at each moment in time. This is not simply comparing the total number of unique speakers — it is a time-weighted frame-level comparison.
Formula

CountError = (1/T) × ∫ |N_hyp(t) − N_ref(t)| dt

Where N_ref(t) and N_hyp(t) are the number of simultaneously active speakers at time t according to the reference and hypothesis respectively, and T is the total evaluated duration.
def compute_count_error(
reference: Annotation,
hypothesis: Annotation,
timeline,
) -> dict:
"""
Compute time-weighted speaker count error.
Uses the evaluation timeline from DER's uemify() for consistency.
"""
total_duration = 0.0
weighted_abs_err = 0.0
weighted_sgn_err = 0.0
exact_match_dur = 0.0
for segment in timeline:
duration = float(segment.duration)
ref_count = len(reference.get_labels(segment, unique=False))
hyp_count = len(hypothesis.get_labels(segment, unique=False))
diff = hyp_count - ref_count
total_duration += duration
weighted_abs_err += duration * abs(diff)
weighted_sgn_err += duration * diff
if diff == 0:
exact_match_dur += duration
return {
"mean_abs_speaker_count_error": weighted_abs_err / total_duration,
"mean_signed_speaker_count_error": weighted_sgn_err / total_duration,
"exact_count_match_ratio": exact_match_dur / total_duration,
}
# Retrieve the evaluation timeline from DER computation
der_metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
ref_proj, hyp_proj, timeline = der_metric.uemify(
reference, hypothesis,
collar=0.0, skip_overlap=False, returns_timeline=True,
)
result = compute_count_error(ref_proj, hyp_proj, timeline)
print(f"Mean abs count error: {result['mean_abs_speaker_count_error']:.4f}")
print(f"Exact count match: {result['exact_count_match_ratio']:.4f}")
Low Count Error Does Not Imply Low DER
A system can correctly estimate "3 speakers are active" while assigning completely wrong labels to all three. Count Error measures the accuracy of how many, not who. Use mean_signed_speaker_count_error to detect direction: positive means over-counting (splitting speakers), negative means under-counting (merging speakers).
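The signed-vs-absolute distinction is easy to see on toy frame counts (hypothetical one-second frames):

```python
# Active-speaker counts per one-second frame: reference vs hypothesis.
ref_counts = [1, 1, 2, 2, 1]
hyp_counts = [1, 2, 2, 3, 1]

diffs = [h - r for r, h in zip(ref_counts, hyp_counts)]
mean_abs = sum(abs(d) for d in diffs) / len(diffs)
mean_signed = sum(diffs) / len(diffs)

print(mean_abs)     # 0.4
print(mean_signed)  # 0.4 -> positive: the system over-counts (splits speakers)
```

If the errors were symmetric (over-counting half the time, under-counting the other half), mean_signed would collapse toward zero while mean_abs stayed high, which is why you need both numbers.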
5. Boundary Error
Boundary Error evaluates how accurately the system detects speaker change points — the moments when one speaker stops and another begins. DER's collar mechanism intentionally hides boundary errors from the main score, making Boundary Error necessary as a complementary metric.
The Two Types of Boundary Error
Reference: A----A----A | B-----B | A---A
Hypothesis: A----A------ | B----B-------- | A---
↑ ↑
Detection Error Localization Error
(boundary exists, (boundary found,
detected late) but offset by Δt ms)
- Detection Error: A reference boundary is missed entirely, or the system inserts a spurious boundary.
- Localization Error: The boundary is detected but its timestamp is off by some amount.
Implementation
The critical implementation detail is one-to-one boundary matching: each reference boundary is matched to at most one hypothesis boundary, and matched hypotheses are removed from the candidate pool. This prevents nearby false-positive detections from artificially inflating recall.
def extract_boundaries(annotation: Annotation) -> list[float]:
"""Return all segment start and end times as boundary timestamps."""
boundaries: set[float] = set()
for segment in annotation.itersegments():
boundaries.add(round(float(segment.start), 9))
boundaries.add(round(float(segment.end), 9))
return sorted(boundaries)
def compute_boundary_error(
reference: Annotation,
hypothesis: Annotation,
tolerance: float = 0.5, # Default: 500 ms
) -> dict:
ref_boundaries = extract_boundaries(reference)
hyp_boundaries = list(extract_boundaries(hypothesis)) # mutable for pop()
matched_pairs = []
hyp_idx = 0
for ref_b in ref_boundaries:
# Advance scan pointer to left edge of tolerance window
while (hyp_idx < len(hyp_boundaries)
and hyp_boundaries[hyp_idx] < ref_b - tolerance):
hyp_idx += 1
# Find the single closest hypothesis boundary within tolerance
candidates = []
for ci in (hyp_idx - 1, hyp_idx, hyp_idx + 1):
if 0 <= ci < len(hyp_boundaries):
dist = abs(hyp_boundaries[ci] - ref_b)
if dist <= tolerance:
candidates.append((dist, ci))
if not candidates:
continue
_, best_ci = min(candidates)
matched_pairs.append((ref_b, hyp_boundaries[best_ci]))
hyp_boundaries.pop(best_ci)
if best_ci < hyp_idx:
hyp_idx -= 1
matched = len(matched_pairs)
ref_total = len(ref_boundaries)
hyp_total = matched + len(hyp_boundaries) # Remaining = false positives
precision = matched / hyp_total if hyp_total else 1.0
recall = matched / ref_total if ref_total else 1.0
f1 = (2 * precision * recall / (precision + recall)
if precision + recall else 0.0)
errors = [abs(r - h) for r, h in matched_pairs]
return {
"precision": precision,
"recall": recall,
"f1": f1,
"mean_abs_error_sec": sum(errors) / matched if matched else None,
"max_abs_error_sec": max(errors) if errors else None,
}
result = compute_boundary_error(reference, hypothesis, tolerance=0.5)
print(f"Precision: {result['precision']:.3f}")
print(f"Recall: {result['recall']:.3f}")
print(f"F1: {result['f1']:.3f}")
if result["mean_abs_error_sec"] is not None:
print(f"Mean localization error: {result['mean_abs_error_sec']*1000:.1f} ms")
Precision < Recall means the system is inserting too many boundaries — fragmenting continuous speech. Recall < Precision means the system is missing real transitions — merging consecutive segments from different speakers.
Practical tolerance guidance:
- Real-time captioning / transcription: 100 ms
- Offline speaker labeling: 500 ms is generally acceptable
- Benchmark reporting (CHiME-6, DIHARD): 500 ms is common convention
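How strongly the tolerance setting moves the score can be shown with a tiny greedy matcher — a simplification of the one-to-one matching implemented above, with invented timestamps:

```python
def boundary_recall(ref, hyp, tol):
    """Fraction of reference boundaries with an unmatched hypothesis within tol."""
    pool = sorted(hyp)
    matched = 0
    for r in ref:
        best = min(pool, key=lambda h: abs(h - r), default=None)
        if best is not None and abs(best - r) <= tol:
            matched += 1
            pool.remove(best)   # one-to-one: consume the matched boundary
    return matched / len(ref)

ref = [0.0, 5.0, 10.0]
hyp = [0.12, 5.40, 9.30]
print(f"{boundary_recall(ref, hyp, 0.1):.2f}")  # 0.00
print(f"{boundary_recall(ref, hyp, 0.5):.2f}")  # 0.67
print(f"{boundary_recall(ref, hyp, 1.0):.2f}")  # 1.00
```

The same hypothesis boundaries score anywhere from 0% to 100% recall depending only on tolerance, which is why the tolerance value must be reported alongside the F1.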
6. Utterance Length Recall (ULR)
Utterance Length Recall measures how accurately the system recovers speech segments, broken down by the length of those segments. It specifically exposes the system's performance on short utterances — a dimension that DER systematically hides because short segments contribute little to total duration.
Why Short Utterances Matter
Meeting audio is full of short responses: backchannels ("mm-hm"), brief acknowledgments ("right"), interruptions, and rapid-fire exchanges. Systems that perform well on long monologues but fail on short utterances will produce transcripts that feel choppy and attribution-heavy in exactly the places where human conversation is most dynamic.
Calculation
After applying the optimal speaker mapping from DER, compute per-utterance recall as matched_duration / utterance_duration, then group by length bins.
UTTERANCE_LENGTH_BINS = [
("0_to_1s", 0.0, 1.0),
("1_to_2s", 1.0, 2.0),
("2_to_5s", 2.0, 5.0),
("5_to_10s", 5.0, 10.0),
("10s_plus", 10.0, None),
]
def compute_utterance_length_recall(
reference: Annotation,
mapped_hypothesis: Annotation, # hypothesis with optimal_mapping applied
) -> dict:
hyp_by_label: dict[str, list[tuple[float, float]]] = {}
for seg, _, lbl in mapped_hypothesis.itertracks(yield_label=True):
hyp_by_label.setdefault(str(lbl), []).append(
(float(seg.start), float(seg.end))
)
    grouped = {
        name: {"count": 0, "ref_dur": 0.0, "matched_dur": 0.0}
        for name, *_ in UTTERANCE_LENGTH_BINS
    }
    per_utterance_recalls: list[float] = []
    for ref_seg, _, ref_lbl in reference.itertracks(yield_label=True):
        dur = float(ref_seg.duration)
        lbl = str(ref_lbl)
        matched = sum(
            max(0.0, min(float(ref_seg.end), h_end)
                - max(float(ref_seg.start), h_start))
            for h_start, h_end in hyp_by_label.get(lbl, [])
        )
        if dur > 0:
            # Cap at 1.0: overlapping hypothesis segments could over-count
            per_utterance_recalls.append(min(1.0, matched / dur))
        bin_name = next(
            name for name, lo, hi in UTTERANCE_LENGTH_BINS
            if dur >= lo and (hi is None or dur < hi)
        )
        grouped[bin_name]["count"] += 1
        grouped[bin_name]["ref_dur"] += dur
        grouped[bin_name]["matched_dur"] += matched
    results = {}
    total_ref = total_matched = 0.0
    for name, stats in grouped.items():
        rd = stats["ref_dur"]
        results[name] = {
            "utterance_count": stats["count"],
            "duration_weighted_recall": stats["matched_dur"] / rd if rd > 0 else 0.0,
        }
        total_ref += rd
        total_matched += stats["matched_dur"]
    results["overall_duration_weighted"] = (
        total_matched / total_ref if total_ref else 0.0
    )
    # Macro recall: every utterance counts equally, regardless of length
    results["overall_macro"] = (
        sum(per_utterance_recalls) / len(per_utterance_recalls)
        if per_utterance_recalls else 0.0
    )
    return results
Interpreting duration_weighted > macro:
If the duration-weighted recall (e.g., 0.847) is significantly higher than the macro recall (e.g., 0.807), it means long utterances are being recovered well but short utterances are dragging down the per-utterance average. This directly points to a segmentation or assignment problem for short speech events.
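The gap between the two averages falls out of simple arithmetic (the durations and recalls below are invented for illustration):

```python
# Two long utterances recovered well, one short backchannel recovered poorly.
durs    = [10.0, 10.0, 0.5]
recalls = [0.9,  0.9,  0.2]

weighted = sum(d * r for d, r in zip(durs, recalls)) / sum(durs)
macro = sum(recalls) / len(recalls)

print(f"duration-weighted: {weighted:.3f}")   # 0.883
print(f"macro:             {macro:.3f}")      # 0.667
```

The 0.5 s backchannel barely registers in the weighted figure but drags the macro average down by over 20 points — the same asymmetry DER vs JER exhibits, replayed at the utterance level.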
Real-World Diagnostic Walkthrough: 15-Speaker Meeting Recording
To show how these metrics work together in practice, here is a complete diagnostic walkthrough using real system output. Evaluation conditions: collar=0.0 (strict mode), skip_overlap=False.
Step 1: Read the Full Picture
DER: 17.69% → confusion 12.61% / missed 2.67% / FA 2.42%
JER: 37.36%
Purity: 97.31%
Coverage: 88.63%
ref/hyp: 15 speakers / 17 speakers
Count Error: mean_abs=0.050 exact_match=95.0%
Boundary: precision=0.785 recall=0.886 f1=0.833
ULR: duration_weighted=0.847 macro=0.807
Two signals jump out immediately.
Signal 1 — DER 17.7% vs. JER 37.4%. This 20-point gap is not noise. DER is time-weighted; the two longest speakers (GT 1: 144.6 s, GT 2: 106.7 s) account for 56% of total audio. If those are handled reasonably well, DER stays moderate. JER weights all 15 speakers equally, so the catastrophically poor results for minority speakers (GT 11: 2.38 s total, GT 14: 2.76 s total) hit the score directly. The gap warns us that the system's behavior is very unequal across speakers.
Signal 2 — Confusion dominates DER at ~71%. MISS (2.67%) and FA (2.42%) are both small. The system is detecting speech; it just can't consistently decide whose speech it is.
This immediately rules out VAD as the primary problem.
When MISS and FA are both under 3%, the voice activity detection is functioning. The engineering effort should target speaker assignment quality, not VAD threshold tuning.
Step 2: Purity/Coverage Asymmetry Points to Fragmentation
Purity: 97.31% ← very high
Coverage: 88.63% ← meaningfully lower
This is a textbook fragmentation signature. High Purity means each cluster contains predominantly one speaker — no aggressive merging. Low Coverage means a single reference speaker's time is scattered across multiple clusters.
Fragmentation data confirms it:
GT 1: split_count=57 hyp_count=8 ← 144s speaker split into 8 clusters
GT 2: split_count=42 hyp_count=4
GT 6: split_count=17 hyp_count=4
GT 1 (the longest speaker at 144 seconds) has been fragmented into 8 hypothesis clusters across 57 segment boundaries. This is consistent with speaker embeddings that drift due to vocal variation — changes in pitch, emotional state, microphone distance, or room acoustics — causing the clustering algorithm to assign consecutive utterances from the same person to different centroids.
Step 3: Hypothesis Speaker Count Corroborates the Direction
ref_speakers: 15
hyp_speakers: 17 ← 2 extra clusters
count_error: mean_abs=0.050 exact_match_ratio=0.950
The system creates 2 more clusters than exist. A merging-dominant system would produce the opposite (hyp < ref). The directional signal — hyp > ref — is consistent with fragmentation. Meanwhile, the excellent Count Error (95% exact frame-level match) tells us the system understands "how many speakers are active right now" almost perfectly. The problem is not the count — it is the consistency of identity assignment across time.
Good Count Error + High DER: What This Combination Means
"The system knows there are 3 speakers active. It just can't decide which of its 5 clusters they are." The bottleneck is not speaker detection or counting — it is the clustering algorithm's ability to maintain consistent speaker identity over a long recording.
Step 4: The Confusion Matrix Reveals the Mixing Structure
Examining the confusion matrix output:
GT 6: dominant=speaker_9 share=49.6%
overlaps=[speaker_9:49.6%, speaker_7:43.9%, speaker_10:3.4%, ...]
→ GT 6's speech is split nearly 50/50 between two hypothesis clusters
GT 9: dominant=speaker_15 share=57.7%
overlaps=[speaker_15:57.7%, speaker_10:21.3%, ...]
→ speaker_10 appears in both GT 6 and GT 9 — acting as a catch-all fragment cluster
GT 11: dominant=speaker_14 share=36.1% ← lowest dominant share
detected=53.8% ← 46% of speech is missed entirely
→ Only 2.38s of total speech; too short for a stable cluster to form
GT 11 illustrates the fundamental challenge with low-talktime speakers: the clustering algorithm has insufficient data to form a reliable centroid. This is why JER is so much worse than DER — these minority speakers have terrible per-speaker JER values, but they contribute almost nothing to the total-duration DER.
Step 5: Boundary Error Confirms the Fragmentation Mechanism
boundary_error: precision=0.785 recall=0.886 f1=0.833
Recall exceeds Precision, meaning the system inserts excess boundaries — it is cutting speech into more segments than exist in reality. This is the mechanical driver of fragmentation: each spurious boundary is an opportunity for the speaker assignment to flip to a different cluster ID.
Fine-grained boundary analysis reveals:
start+0.5s: coverage=76.1% confusion=18.1%
end-0.5s: coverage=67.1% confusion=25.0% ← consistently worse at utterance end
The final 500 ms of utterances shows more confusion than the opening 500 ms. Acoustic fading, trailing breath sounds, and near-silence at utterance endings cause the system's speaker embedding extraction to become unreliable, leading to boundary placement jitter and cascading label inconsistency.
Step 6: Short-Utterance Analysis Separates VAD from Assignment Errors
Utterance length conditioned errors:
0.5–1s: coverage=43.5% confusion=56.5% miss=0.0% (n=4)
1–2s: coverage=70.7% confusion=23.8% miss=5.5% (n=16)
2s+: coverage=85.9% confusion=11.6% miss=2.5% (n=78)
This is the most important cut for actionable diagnosis.
For utterances under 1 second: miss=0% — VAD detects them. But confusion=56.5% — more than half are attributed to the wrong speaker. Short segments produce lower-quality embeddings. With less acoustic evidence, the assignment is essentially a coin flip between nearby cluster centroids.
This finding decisively locates the problem in speaker assignment logic, not VAD. The intervention should be at the clustering/assignment level: for example, using temporal context (prior/subsequent segments) to anchor short-segment assignments, or applying confidence-weighted assignment that defaults to the nearest temporal neighbor for segments below a length threshold.
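One such intervention can be sketched as a post-processing pass. Everything here is hypothetical: the function name, the 1.0 s reliability threshold, and the segment data are illustrative, not from any library:

```python
MIN_RELIABLE_SEC = 1.0  # assumed threshold below which embeddings are unreliable

def smooth_short_segments(segments):
    """Reassign short segments to the label of the nearest long neighbor.
    segments: list of (start, end, label) tuples, non-overlapping."""
    segments = sorted(segments)
    long_segs = [s for s in segments if s[1] - s[0] >= MIN_RELIABLE_SEC]
    out = []
    for start, end, label in segments:
        if end - start < MIN_RELIABLE_SEC and long_segs:
            # gap along the time axis to each reliable segment (0 if adjacent)
            _, label = min(
                (max(s - end, start - e, 0.0), lbl)
                for s, e, lbl in long_segs
            )
        out.append((start, end, label))
    return out

segs = [(0.0, 5.0, "spk1"), (5.1, 5.7, "spk3"), (6.0, 12.0, "spk1")]
print(smooth_short_segments(segs))
# the 0.6 s segment is re-anchored to the adjacent long speaker "spk1"
```

A production version would weight this by embedding confidence rather than overriding unconditionally, but the principle — borrow identity evidence from temporal context when the segment itself is too short — is the same.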
Consolidated Diagnosis
Across all six metrics, a consistent story emerges:
"The system detects speech reliably and estimates speaker counts well. But it cannot maintain a consistent speaker identity over a long recording — the same person gets split into multiple clusters (fragmentation), which is most severe for short utterances and utterance endpoints."
| Hypothesis | Supporting Evidence |
|---|---|
| Primary cause: fragmentation | Purity high, Coverage low; hyp > ref; split_count large; Boundary precision < recall |
| VAD is not the primary issue | MISS/FA both < 3%; short utterance miss = 0% |
| Speaker assignment is the bottleneck | Confusion dominates DER; short utterance confusion high; Count Error excellent |
| Endpoint instability compounds the problem | Utterance end-0.5s shows higher confusion than start+0.5s |
Recommended improvement priority:
- Strengthen cluster merging strategy — directly addresses oversegmentation / fragmentation
- Use temporal context for short-segment assignment — targets the 0.5–2s confusion peak
- Improve utterance endpoint segmentation accuracy — removes a secondary driver of fragmentation
Noise removal and VAD tuning are low priority for this specific data. Applying those first would be optimizing the wrong component.
Collar=0.0 Makes All Numbers Look Worse
The figures above use strict collar=0.0. Switching to the commonly reported collar=0.25 would reduce DER to approximately 13–15% on the same output. This is an evaluation policy change, not a system improvement. When comparing to published benchmarks, ensure collar settings match exactly.
Diagnostic Decision Tree
DER is high
│
├─ MISS component is large → Increase VAD sensitivity
│
├─ FA component is large → Increase VAD threshold; add noise filtering
│
└─ CONF component is large
│
├─ Count Error also large → Fix speaker count estimation (clustering k)
│
├─ Purity is low → Clusters contain mixed speakers
│ → Make clustering finer-grained
│
├─ Coverage is low → One speaker split across many clusters
│ → Improve cluster merging strategy
│
├─ Boundary F1 is low → Speaker change point detection is inaccurate
│
└─ Short utterance recall → VAD or embedding quality for
is low brief speech segments
Metric Comparison Template
Use this table to structure multi-system comparisons in papers or reports.
| Metric | System A | System B | Baseline |
|-------------------------|----------|----------|----------|
| DER (%) | 12.3 | 10.8 | 18.5 |
| JER (%) | 15.6 | 13.2 | 22.1 |
| Purity | 0.921 | 0.943 | 0.872 |
| Coverage | 0.884 | 0.901 | 0.821 |
| Count Error (mean abs) | 0.4 | 0.3 | 1.2 |
| Boundary F1 | 0.823 | 0.871 | 0.741 |
| ULR < 1s | 0.51 | 0.68 | 0.38 |
| Collar (sec) | 0.25 | 0.25 | 0.25 |
| Overlap speech excluded | No | No | Yes |
| Dataset | AMI | AMI | AMI |
| pyannote.metrics ver. | 3.2.1 | 3.2.1 | 3.2.1 |
Critical Notes for Benchmark Comparisons
- Always report the collar value. DER differences of 3–5 percentage points between `collar=0` and `collar=0.25` are common.
- Document overlapping speech handling. Many published results exclude overlapping speech segments from evaluation. Systems optimized for non-overlapping speech will appear better than those that handle overlap natively.
- Report whether optimal mapping (Hungarian algorithm) was used. Different label-assignment strategies affect scores, especially in low-speaker-count scenarios.
- Pin your library version. `pyannote.metrics` behavior has changed across versions; results may not be reproducible across versions without pinning.
FAQ
Q1: Can DER ever reach 0%?
In theory yes, but in practice, inter-annotator agreement between two human annotators on the same recording typically yields DER of 5–7% on corpora like AMI. Boundary placement is inherently subjective — where exactly a speaker "stops" and silence or another speaker "begins" is ambiguous even for humans. This is why DER should not be treated as a loss function to minimize to zero, but as a relative comparison tool.
Q2: Why does pyannote.metrics give different results from NIST's md-eval tool?
The two tools construct the evaluation timeline differently and ship different defaults for collar handling, speaker mapping, and overlapping speech. Neither is definitively "correct" — they make different design choices. Always use whichever tool is specified by the benchmark you are submitting to, and never compare scores across tools directly.
Q3: Is JER useful when there are only 2 speakers?
With 2 speakers and reasonably balanced talk time, DER and JER will be nearly identical. JER's value grows with speaker count and talk-time imbalance. For a 2-speaker interview with 60/40 split, there is little reason to report both. For a 6-person meeting where the moderator speaks 60% of the time, JER is informative precisely because it does not let the moderator's accuracy dominate the score.
Q4: What tolerance should I use for Boundary Error?
Application context determines this:
- Real-time captioning, live meeting transcription: 100 ms or tighter
- Offline speaker labeling and analysis: 500 ms is generally acceptable
- Benchmark evaluation (DIHARD, CHiME-6 conventions): 500 ms is standard
Report the tolerance as a hyperparameter. Boundary F1 at 500 ms tolerance vs. 100 ms can differ enormously for the same system.
Q5: How do I evaluate a meeting where speakers join and leave (late arrivals, early departures)?
Standard DER implementations will count missing speakers as false alarms for the entire file duration, even during intervals when those speakers were not present. The cleanest approach is to define a UEM (Unpartitioned Evaluation Map) that restricts scoring to the intervals when each speaker is actually present, then mask out intervals where the speaker count in the reference is undefined. Alternatively, use count-based evaluation restricted to time spans with annotated speaker activity.
Q6: My Purity and Coverage are both above 0.90 but DER is still above 15%. What's wrong?
Purity and Coverage are time-weighted ratios, so large, well-clustered speakers can pull both metrics high even when smaller speakers are fragmented. DER also includes FA and MISS components that Purity/Coverage do not reflect. Decompose DER into its three components first. If CONF is dominant, look at the confusion matrix for the specific pairs causing confusion. If MISS or FA is large, focus on VAD calibration rather than clustering quality.
Q7: What is the relationship between Boundary Error and DER's collar setting?
DER's collar hides boundary errors by excluding a window around reference boundaries from scoring. Boundary Error measures those same errors directly. A system where every boundary is consistently 200 ms late will show normal DER under collar=0.25 but poor Boundary localization scores. If your application requires precise timestamps — for subtitle synchronization, for example — Boundary Error is essential even when DER looks acceptable.
Q8: How do I handle overlapping speech in DER computation?
By default, pyannote.metrics includes overlapping speech segments (where two speakers talk simultaneously) in evaluation. Setting skip_overlap=True excludes these segments. Systems that output only single-speaker labels at any moment will be unfairly penalized by overlapping-speech segments under skip_overlap=False. Most published results on challenging corpora (CHiME-6, AMI with overlap) report with skip_overlap=False, but single-speaker diarization systems are sometimes evaluated with skip_overlap=True. Document this explicitly.
Summary
| Metric | What It Measures | Primary Use Case |
|---|---|---|
| DER | Total error rate (time-weighted) | Benchmark comparison, overall system health |
| JER | Error rate with equal weight per speaker | Multi-speaker fairness, minority speaker evaluation |
| Confusion Matrix | Which speakers are confused with which | Root-cause debugging of CONF component |
| Purity / Coverage | Cluster quality — mixing vs. splitting | Diagnosing fragmentation vs. merging problems |
| Count Error | Accuracy of active speaker count estimation | Evaluating clustering strategy, k selection |
| Boundary Error | Accuracy of speaker change point detection | Segmentation quality, timestamp-sensitive applications |
| Utterance Length Recall | Short utterance recovery rate | VAD tuning, embedding quality for brief speech |
Evaluating a diarization system with DER alone is like running a health check with only a thermometer. The single number tells you something is wrong but not where. By combining DER with the six complementary metrics described here, you can precisely locate whether the problem is in VAD, speaker embedding quality, cluster assignment, merging strategy, or boundary detection — and direct your engineering effort accordingly.
The diagnostic framework is not theoretical: every number in the real-world example section came from actual system output on a 15-speaker meeting recording, illustrating that the metrics genuinely tell different parts of the story.
Start with DER for the headline. Let Purity/Coverage tell you the category of failure. Use the confusion matrix to name the specific speakers causing problems. Then use Boundary Error and Utterance Length Recall to decide whether the fix lives in your segmentation or your clustering pipeline.