Speaker Diarization Evaluation Metrics: Beyond DER — A Practical Diagnostic Guide

TL;DR — Read This First

Evaluating speaker diarization systems requires choosing the right metric for what you actually want to measure. DER is the industry standard, but it cannot diagnose why a system fails. To systematically improve a real system, combine DER with Confusion Matrix, Purity/Coverage, and Boundary Error for proper root-cause analysis. This article covers calculation formulas, working Python code, and real experimental data — not just textbook definitions.


Speaker diarization — automatically determining "who spoke when" — is a foundational task in speech processing. It powers transcription pipelines, meeting analytics, call center QA, and medical interview documentation. Yet when engineers actually sit down to evaluate a diarization system, a common experience is confusion: "What's the difference between DER and JER? Why is my DER low but the output still sounds wrong? Which metric should I even optimize?"

This guide answers those questions with precision. We cover six key evaluation metrics: their mathematical definitions, Python implementations using pyannote.metrics, and — critically — how to read the numbers together to diagnose what is actually broken in your system.

All experimental figures in this article come from first-hand benchmarking on a 15-speaker meeting recording under strict evaluation conditions (collar=0.0, skip_overlap=False).


1. DER (Diarization Error Rate) and JER (Jaccard Error Rate)

What Is DER?

DER is the de facto standard metric for speaker diarization, standardized through NIST's Rich Transcription evaluation series. It measures the total proportion of time that is incorrectly labeled.

\text{DER} = \frac{\text{FA} + \text{MISS} + \text{CONF}}{\text{Total Reference Duration}}
| Component | Definition |
|---|---|
| FA (False Alarm) | Duration where the system outputs a speaker label but no one is actually speaking |
| MISS (Missed Detection) | Duration where someone is speaking but the system outputs no label |
| CONF (Confusion) | Duration where a speaker is correctly detected but assigned the wrong speaker label |

Key Terminology

  • Speaker Diarization: The task of automatically segmenting an audio stream by speaker identity — answering the question "who spoke when."
  • Reference (Ground Truth): Human-annotated speaker labels used as the gold standard.
  • Hypothesis: The system's predicted speaker labels.
  • Collar: A tolerance window (typically 250 ms) around reference boundaries that is excluded from scoring. Errors within the collar are not penalized.
  • Optimal Mapping: Because system-assigned speaker IDs (e.g., SPEAKER_1) do not correspond to reference IDs (e.g., SPEAKER_A), DER uses the Hungarian algorithm to find the best one-to-one label mapping before scoring.
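As a sketch of what "best one-to-one mapping" means, here is a brute-force version over a tiny invented overlap matrix; the Hungarian algorithm computes the same assignment efficiently for larger speaker counts:

```python
from itertools import permutations

# Hypothetical overlap durations (seconds) between reference speakers
# and system clusters -- the numbers are made up for illustration.
refs = ["SPEAKER_A", "SPEAKER_B"]
hyps = ["SPEAKER_1", "SPEAKER_2"]
overlap = {
    ("SPEAKER_A", "SPEAKER_1"): 8.0, ("SPEAKER_A", "SPEAKER_2"): 1.0,
    ("SPEAKER_B", "SPEAKER_1"): 0.5, ("SPEAKER_B", "SPEAKER_2"): 6.0,
}

# Pick the one-to-one assignment that maximizes total overlapping time.
best = max(
    permutations(hyps),
    key=lambda perm: sum(overlap[(r, h)] for r, h in zip(refs, perm)),
)
mapping = dict(zip(refs, best))
print(mapping)  # {'SPEAKER_A': 'SPEAKER_1', 'SPEAKER_B': 'SPEAKER_2'}
```

Only after this mapping is fixed does DER count the remaining mismatched time as confusion.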

Python Implementation

python
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.core import Annotation, Segment

# Ground truth annotation
reference = Annotation()
reference[Segment(0, 5)]   = 'SPEAKER_A'
reference[Segment(5, 10)]  = 'SPEAKER_B'
reference[Segment(10, 15)] = 'SPEAKER_A'

# System output
hypothesis = Annotation()
hypothesis[Segment(0, 4)]   = 'SPEAKER_1'   # Corresponds to SPEAKER_A
hypothesis[Segment(4, 11)]  = 'SPEAKER_2'   # Partially mixed
hypothesis[Segment(11, 15)] = 'SPEAKER_1'

metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER: {der:.3f}")  # e.g., 0.133 (13.3%)

# Decompose into components
detail = metric(reference, hypothesis, detailed=True)
print(f"  Miss: {detail['missed detection']:.3f}")
print(f"  FA:   {detail['false alarm']:.3f}")
print(f"  Conf: {detail['confusion']:.3f}")

On the Collar Setting

Most published DER results use collar=0.25 (250 ms exclusion around boundaries). Some strict benchmarks such as NIST SRE and DIHARD use collar=0.0. The difference between these two settings can shift DER by 3–5 percentage points on the same system output. Always report your collar setting explicitly — it is not optional metadata.
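The effect is easy to hand-check on a single boundary. The 200 ms offset below is invented, and note that collar semantics (per-side window vs. total width) vary between scoring tools, so check your scorer's documentation:

```python
# A segment that should end at 5.0 s is cut 200 ms early by the system.
ref_end, hyp_end = 5.0, 4.8
miss_strict = round(ref_end - hyp_end, 3)   # 0.2 s missed speech at collar=0.0

# With a 250 ms exclusion window around the reference boundary, only error
# outside [ref_end - 0.25, ref_end + 0.25] is scored.
collar = 0.25
miss_collared = max(0.0, (ref_end - collar) - hyp_end)

print(miss_strict, miss_collared)  # 0.2 0.0 -- the boundary error vanishes
```

Multiply this across every boundary in a talkative meeting and the 3-5 point DER shift follows directly.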

What Is JER?

JER was introduced in the CHiME-6 Challenge to address a known weakness of DER: DER is dominated by speakers who talk the most. In a meeting where Speaker A talks for 900 seconds and Speaker B talks for 100 seconds, completely mis-labeling Speaker B barely moves DER. JER corrects for this.

\text{JER} = 1 - \frac{1}{|S|} \sum_{s \in S} \frac{|H_s \cap R_s|}{|H_s \cup R_s|}

Where:

  • S: the set of all reference speakers
  • R_s: the reference time intervals for speaker s
  • H_s: the hypothesis intervals mapped to speaker s

This is essentially the complement of the per-speaker Jaccard similarity, averaged equally across all speakers regardless of their total speaking time.

python
# DER example: SPEAKER_A = 900s, SPEAKER_B = 100s
# Completely mislabeling SPEAKER_B → DER impact is small (~10%)
# JER treats both speakers equally → JER impact is large (~50%)

from pyannote.metrics.diarization import JaccardErrorRate

jer = JaccardErrorRate()
jer_score = jer(reference, hypothesis)
print(f"JER: {jer_score:.3f}")

When to use which:

  • Use DER for standard benchmark comparisons; it is what most papers report.
  • Use JER when fairness across speakers matters — multi-speaker meetings with unequal talk time, or any scenario where minority-speaker accuracy is important.
  • In a 2-speaker scenario, JER and DER converge closely. JER becomes most informative at 4+ speakers with imbalanced talk time.
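The divergence is easy to verify by hand. Using the 900 s / 100 s scenario above, with the minority speaker fully absorbed into the majority cluster:

```python
# Reference: A speaks 900 s, B speaks 100 s.
# Hypothesis: one cluster covering all 1000 s, optimally mapped to A.
total = 1000.0
confusion = 100.0                    # all of B's time labeled as A
der = confusion / total              # time-weighted

# Per-speaker Jaccard after mapping: A -> 900/1000, B -> no matched cluster.
jaccard_a = 900.0 / 1000.0
jaccard_b = 0.0
jer = 1.0 - (jaccard_a + jaccard_b) / 2.0

print(f"DER: {der:.0%}  JER: {jer:.0%}")  # DER: 10%  JER: 55%
```

A 10% DER looks respectable; the 55% JER exposes that one of the two speakers was lost entirely.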

2. Speaker Confusion Matrix

The confusion matrix answers: "Which speaker is being confused with which?" It is the primary tool for decomposing the CONF component of DER into actionable information.

Calculation

The key implementation detail is computing overlap between each reference speaker segment and each hypothesis segment. The result is a matrix of overlapping durations, from which you derive the dominant mapping and dominant share per reference speaker.

python
def compute_speaker_confusion_matrix(
    reference: Annotation,
    hypothesis: Annotation,
) -> dict:
    ref_labels = sorted(map(str, reference.labels()))
    hyp_labels = sorted(map(str, hypothesis.labels()))

    # Compute pairwise overlap durations (seconds)
    overlap_by_pair = {
        ref_label: {hyp_label: 0.0 for hyp_label in hyp_labels}
        for ref_label in ref_labels
    }
    for ref_seg, _, ref_label in reference.itertracks(yield_label=True):
        for hyp_seg, _, hyp_label in hypothesis.itertracks(yield_label=True):
            intersection = ref_seg & hyp_seg
            if intersection:
                overlap_by_pair[str(ref_label)][str(hyp_label)] += float(
                    intersection.duration
                )

    # Summarize per reference speaker
    per_reference_speaker = {}
    for ref_label, row in overlap_by_pair.items():
        ref_duration = sum(
            float(seg.duration)
            for seg, _, lbl in reference.itertracks(yield_label=True)
            if str(lbl) == ref_label
        )
        overlaps = sorted(
            [
                {
                    "hyp_speaker": h,
                    "overlap_sec": s,
                    "share": s / ref_duration if ref_duration > 0 else 0.0
                }
                for h, s in row.items() if s > 0.0
            ],
            key=lambda x: -x["overlap_sec"],
        )
        dominant = overlaps[0] if overlaps else None
        per_reference_speaker[ref_label] = {
            "ref_duration_sec": ref_duration,
            "hyp_count": len(overlaps),
            "dominant": dominant["hyp_speaker"] if dominant else None,
            "dominant_share": dominant["share"] if dominant else 0.0,
            "overlaps": overlaps,
        }

    return {
        "seconds": overlap_by_pair,
        "per_reference_speaker": per_reference_speaker
    }

Reading the Output

text
GT 6:  hyp_count=4  dominant=speaker_9  share=49.6%
       overlaps=[speaker_9:49.6%, speaker_7:43.9%, speaker_10:3.4%, speaker_1:1.3%]

  • dominant_share below 50%: This reference speaker's time is split roughly equally between two hypothesis clusters. The system cannot consistently decide which cluster they belong to.
  • hyp_count above 3: Severe fragmentation — one person's voice has been split into 3+ clusters.
  • Shared hyp_speaker across multiple GT speakers: That hypothesis cluster is acting as a "garbage collector," absorbing fragments from multiple real speakers.

3. Fragmentation, Purity, and Coverage

While DER measures quantity of error, Purity and Coverage measure quality of clustering structure.

Definitions

  • Fragmentation: A single reference speaker's speech is spread across multiple hypothesis clusters.
  • Purity: Each hypothesis cluster contains speech from primarily one speaker (no mixing).
  • Coverage: Each reference speaker's speech is captured by primarily one hypothesis cluster (no splitting).

Formulas

Cluster Purity

\text{Purity} = \frac{1}{T} \sum_{k} \max_{s} |C_k \cap R_s|

Where T is total audio duration, C_k is the time covered by cluster k, and R_s is reference speaker s's time.

Coverage

\text{Coverage} = \frac{1}{T} \sum_{s} \max_{k} |R_s \cap C_k|

python
from pyannote.metrics.diarization import (
    DiarizationPurity,
    DiarizationCoverage,
)

# Always align collar/skip_overlap settings with your DER configuration
collar = 0.0
skip_overlap = False

purity_metric   = DiarizationPurity(collar=collar, skip_overlap=skip_overlap)
coverage_metric = DiarizationCoverage(collar=collar, skip_overlap=skip_overlap)

purity   = float(purity_metric(reference, hypothesis))
coverage = float(coverage_metric(reference, hypothesis))

print(f"Purity:   {purity:.3f}")    # 1.0 is ideal
print(f"Coverage: {coverage:.3f}")  # 1.0 is ideal

Diagnostic Interpretation

| Pattern | Interpretation |
|---|---|
| High Purity, Low Coverage | Fragmentation dominant — each cluster is pure, but one speaker is split into many clusters |
| Low Purity, High Coverage | Merging dominant — clusters contain multiple speakers; the system under-segments |
| Both high | Good structure; DER errors likely come from boundary imprecision |
| Both low | Systemic failure in both clustering and VAD |

Use Purity/Coverage First for Debugging

Before tuning hyperparameters, check whether you have a fragmentation problem (split) or a merging problem (merge). These require opposite interventions — more aggressive clustering vs. less aggressive clustering. Getting this diagnosis wrong wastes significant engineering effort.
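A toy timeline makes the split/merge asymmetry concrete. One invented speaker is split across two perfectly pure clusters:

```python
# Reference: speaker A talks continuously from 0-10 s.
# Hypothesis: cluster c1 covers 0-6 s, cluster c2 covers 6-10 s (both are A).
T = 10.0
cluster_overlaps = {"c1": {"A": 6.0}, "c2": {"A": 4.0}}

# Purity: each cluster's dominant-speaker time, summed, over total time.
purity = sum(max(spk.values()) for spk in cluster_overlaps.values()) / T

# Coverage: each speaker's single best-covering cluster over total time.
speaker_overlaps = {"A": [6.0, 4.0]}   # A's time in each cluster
coverage = sum(max(v) for v in speaker_overlaps.values()) / T

print(purity, coverage)  # 1.0 0.6 -- pure clusters, fragmented speaker
```

Perfect purity with depressed coverage is exactly the split signature; the reverse pattern would call for finer-grained clustering instead.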


4. Count Error (Speaker Count Estimation)

Count Error measures how accurately the system estimates the number of active speakers at each moment in time. This is not simply comparing the total number of unique speakers — it is a time-weighted frame-level comparison.

Formula

\text{Count Error} = \frac{\sum_t \Delta t \cdot |N_{\text{ref}}(t) - N_{\text{hyp}}(t)|}{\sum_t \Delta t}

Where N_ref(t) and N_hyp(t) are the number of simultaneously active speakers at time t according to the reference and hypothesis respectively.

python
def compute_count_error(
    reference: Annotation,
    hypothesis: Annotation,
    timeline,
) -> dict:
    """
    Compute time-weighted speaker count error.
    Uses the evaluation timeline from DER's uemify() for consistency.
    """
    total_duration   = 0.0
    weighted_abs_err = 0.0
    weighted_sgn_err = 0.0
    exact_match_dur  = 0.0

    for segment in timeline:
        duration  = float(segment.duration)
        ref_count = len(reference.get_labels(segment, unique=False))
        hyp_count = len(hypothesis.get_labels(segment, unique=False))
        diff = hyp_count - ref_count

        total_duration   += duration
        weighted_abs_err += duration * abs(diff)
        weighted_sgn_err += duration * diff
        if diff == 0:
            exact_match_dur += duration

    return {
        "mean_abs_speaker_count_error": weighted_abs_err / total_duration,
        "mean_signed_speaker_count_error": weighted_sgn_err / total_duration,
        "exact_count_match_ratio": exact_match_dur / total_duration,
    }

# Retrieve the evaluation timeline from DER computation
der_metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
ref_proj, hyp_proj, timeline = der_metric.uemify(
    reference, hypothesis,
    collar=0.0, skip_overlap=False, returns_timeline=True,
)
result = compute_count_error(ref_proj, hyp_proj, timeline)
print(f"Mean abs count error:  {result['mean_abs_speaker_count_error']:.4f}")
print(f"Exact count match:     {result['exact_count_match_ratio']:.4f}")

Low Count Error Does Not Imply Low DER

A system can correctly estimate "3 speakers are active" while assigning completely wrong labels to all three. Count Error measures the accuracy of how many, not who. Use mean_signed_speaker_count_error to detect direction: positive means over-counting (splitting speakers), negative means under-counting (merging speakers).
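One reason to keep both versions: the signed error can cancel to zero while real counting errors remain. A toy example with invented frame counts:

```python
# Active-speaker counts over three equal-length spans.
ref_counts = [1, 2, 2]
hyp_counts = [2, 2, 1]   # over-counts once, under-counts once

diffs = [h - r for h, r in zip(hyp_counts, ref_counts)]
mean_abs = sum(abs(d) for d in diffs) / len(diffs)     # 2/3: errors are real
mean_signed = sum(diffs) / len(diffs)                  # 0.0: they cancel out
```

Read the absolute error for magnitude and the signed error for direction; neither alone tells the full story.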


5. Boundary Error

Boundary Error evaluates how accurately the system detects speaker change points — the moments when one speaker stops and another begins. DER's collar mechanism intentionally hides boundary errors from the main score, making Boundary Error necessary as a complementary metric.

The Two Types of Boundary Error

text
Reference:  A----A----A | B-----B | A---A
Hypothesis: A----A------ | B----B-------- | A---
                        ↑               ↑
           Detection Error          Localization Error
           (boundary exists,        (boundary found,
            detected late)           but offset by Δt ms)

  • Detection Error: A reference boundary is missed entirely, or the system inserts a spurious boundary.
  • Localization Error: The boundary is detected but its timestamp is off by some amount.
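As a rough sketch (ignoring the one-to-one matching that the implementation below enforces), each reference boundary can be classified against its nearest hypothesis boundary; the timestamps here are invented:

```python
tolerance = 0.5  # 500 ms
ref_boundaries = [5.0, 12.0, 20.0]
hyp_boundaries = [5.2, 19.1]

report = []
for b in ref_boundaries:
    # Distance to the closest hypothesis boundary decides the error type.
    d = min(abs(h - b) for h in hyp_boundaries)
    if d <= tolerance:
        report.append((b, f"localized, offset {d * 1000:.0f} ms"))
    else:
        report.append((b, "detection error: no match within tolerance"))

print(report)
```

Here 5.0 s is found with a 200 ms offset (localization error), while 12.0 s and 20.0 s have no hypothesis boundary within 500 ms (detection errors).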

Implementation

The critical implementation detail is one-to-one boundary matching: each reference boundary is matched to at most one hypothesis boundary, and matched hypotheses are removed from the candidate pool. This prevents nearby false-positive detections from artificially inflating recall.

python
def extract_boundaries(annotation: Annotation) -> list[float]:
    """Return all segment start and end times as boundary timestamps."""
    boundaries: set[float] = set()
    for segment in annotation.itersegments():
        boundaries.add(round(float(segment.start), 9))
        boundaries.add(round(float(segment.end), 9))
    return sorted(boundaries)


def compute_boundary_error(
    reference: Annotation,
    hypothesis: Annotation,
    tolerance: float = 0.5,  # Default: 500 ms
) -> dict:
    ref_boundaries = extract_boundaries(reference)
    hyp_boundaries = list(extract_boundaries(hypothesis))  # mutable for pop()

    matched_pairs = []
    hyp_idx = 0

    for ref_b in ref_boundaries:
        # Advance scan pointer to left edge of tolerance window
        while (hyp_idx < len(hyp_boundaries)
               and hyp_boundaries[hyp_idx] < ref_b - tolerance):
            hyp_idx += 1

        # Find the single closest hypothesis boundary within tolerance
        candidates = []
        for ci in (hyp_idx - 1, hyp_idx, hyp_idx + 1):
            if 0 <= ci < len(hyp_boundaries):
                dist = abs(hyp_boundaries[ci] - ref_b)
                if dist <= tolerance:
                    candidates.append((dist, ci))

        if not candidates:
            continue

        _, best_ci = min(candidates)
        matched_pairs.append((ref_b, hyp_boundaries[best_ci]))
        hyp_boundaries.pop(best_ci)
        if best_ci < hyp_idx:
            hyp_idx -= 1

    matched   = len(matched_pairs)
    ref_total = len(ref_boundaries)
    hyp_total = matched + len(hyp_boundaries)  # Remaining = false positives

    precision = matched / hyp_total if hyp_total else 1.0
    recall    = matched / ref_total if ref_total else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    errors = [abs(r - h) for r, h in matched_pairs]

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mean_abs_error_sec": sum(errors) / matched if matched else None,
        "max_abs_error_sec": max(errors) if errors else None,
    }

result = compute_boundary_error(reference, hypothesis, tolerance=0.5)
print(f"Precision: {result['precision']:.3f}")
print(f"Recall:    {result['recall']:.3f}")
print(f"F1:        {result['f1']:.3f}")
if result["mean_abs_error_sec"] is not None:
    print(f"Mean localization error: {result['mean_abs_error_sec']*1000:.1f} ms")

Precision < Recall means the system is inserting too many boundaries — fragmenting continuous speech. Recall < Precision means the system is missing real transitions — merging consecutive segments from different speakers.

Practical tolerance guidance:

  • Real-time captioning / transcription: 100 ms
  • Offline speaker labeling: 500 ms is generally acceptable
  • Benchmark reporting (CHiME-6, DIHARD): 500 ms is common convention

6. Utterance Length Recall (ULR)

Utterance Length Recall measures how accurately the system recovers speech segments, broken down by the length of those segments. It specifically exposes the system's performance on short utterances — a dimension that DER systematically hides because short segments contribute little to total duration.

Why Short Utterances Matter

Meeting audio is full of short responses: backchannels ("mm-hm"), brief acknowledgments ("right"), interruptions, and rapid-fire exchanges. Systems that perform well on long monologues but fail on short utterances will produce transcripts that feel choppy and attribution-heavy in exactly the places where human conversation is most dynamic.

Calculation

After applying the optimal speaker mapping from DER, compute per-utterance recall as matched_duration / utterance_duration, then group by length bins.

python
UTTERANCE_LENGTH_BINS = [
    ("0_to_1s",   0.0,  1.0),
    ("1_to_2s",   1.0,  2.0),
    ("2_to_5s",   2.0,  5.0),
    ("5_to_10s",  5.0, 10.0),
    ("10s_plus", 10.0, None),
]

def compute_utterance_length_recall(
    reference: Annotation,
    mapped_hypothesis: Annotation,  # hypothesis with optimal_mapping applied
) -> dict:
    hyp_by_label: dict[str, list[tuple[float, float]]] = {}
    for seg, _, lbl in mapped_hypothesis.itertracks(yield_label=True):
        hyp_by_label.setdefault(str(lbl), []).append(
            (float(seg.start), float(seg.end))
        )

    grouped = {
        name: {"count": 0, "ref_dur": 0.0, "matched_dur": 0.0}
        for name, *_ in UTTERANCE_LENGTH_BINS
    }

    per_utterance_recalls = []

    for ref_seg, _, ref_lbl in reference.itertracks(yield_label=True):
        dur = float(ref_seg.duration)
        lbl = str(ref_lbl)
        matched = sum(
            max(0.0, min(float(ref_seg.end), h_end)
                   - max(float(ref_seg.start), h_start))
            for h_start, h_end in hyp_by_label.get(lbl, [])
        )
        per_utterance_recalls.append(matched / dur if dur > 0 else 0.0)
        bin_name = next(
            name for name, lo, hi in UTTERANCE_LENGTH_BINS
            if dur >= lo and (hi is None or dur < hi)
        )
        grouped[bin_name]["count"]       += 1
        grouped[bin_name]["ref_dur"]     += dur
        grouped[bin_name]["matched_dur"] += matched

    results = {}
    total_ref = total_matched = 0.0
    for name, stats in grouped.items():
        rd = stats["ref_dur"]
        results[name] = {
            "utterance_count": stats["count"],
            "duration_weighted_recall": stats["matched_dur"] / rd if rd > 0 else 0.0,
        }
        total_ref     += rd
        total_matched += stats["matched_dur"]

    results["overall_duration_weighted"] = (
        total_matched / total_ref if total_ref else 0.0
    )
    # Macro recall: unweighted mean over utterances; comparing it with the
    # duration-weighted figure exposes short-utterance failures
    results["overall_macro"] = (
        sum(per_utterance_recalls) / len(per_utterance_recalls)
        if per_utterance_recalls else 0.0
    )
    return results

Interpreting duration_weighted > macro:

If the duration-weighted recall (e.g., 0.847) is significantly higher than the macro recall (e.g., 0.807), it means long utterances are being recovered well but short utterances are dragging down the per-utterance average. This directly points to a segmentation or assignment problem for short speech events.
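The gap is pure arithmetic. With invented per-utterance numbers, two long utterances recovered well and three short ones recovered poorly:

```python
# (duration_sec, recall) pairs -- hypothetical values for illustration.
utterances = [(100.0, 0.95), (80.0, 0.90), (1.0, 0.30), (0.8, 0.20), (0.5, 0.40)]

macro = sum(r for _, r in utterances) / len(utterances)
weighted = (sum(d * r for d, r in utterances)
            / sum(d for d, _ in utterances))

print(f"macro={macro:.3f}  duration_weighted={weighted:.3f}")
# duration_weighted is far higher: the sub-second utterances barely
# register by duration, even though most of them failed
```

Whenever the two figures diverge this way, slice the recall by length bin before touching any hyperparameter.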


Real-World Diagnostic Walkthrough: 15-Speaker Meeting Recording

To show how these metrics work together in practice, here is a complete diagnostic walkthrough using real system output. Evaluation conditions: collar=0.0 (strict mode), skip_overlap=False.

Step 1: Read the Full Picture

text
DER:         17.69%  →  confusion 12.61% / missed 2.67% / FA 2.42%
JER:         37.36%
Purity:      97.31%
Coverage:    88.63%
ref/hyp:     15 speakers / 17 speakers
Count Error: mean_abs=0.050  exact_match=95.0%
Boundary:    precision=0.785  recall=0.886  f1=0.833
ULR:         duration_weighted=0.847  macro=0.807

Two signals jump out immediately.

Signal 1 — DER 17.7% vs. JER 37.4%. This 20-point gap is not noise. DER is time-weighted; the two longest speakers (GT 1: 144.6 s, GT 2: 106.7 s) account for 56% of total audio. If those are handled reasonably well, DER stays moderate. JER weights all 15 speakers equally, so the catastrophically poor results for minority speakers (GT 11: 2.38 s total, GT 14: 2.76 s total) hit the score directly. The gap warns us that the system's behavior is very unequal across speakers.

Signal 2 — Confusion dominates DER at ~71%. MISS (2.67%) and FA (2.42%) are both small. The system is detecting speech; it just can't consistently decide whose speech it is.

This immediately rules out VAD as the primary problem.

When MISS and FA are both under 3%, the voice activity detection is functioning. The engineering effort should target speaker assignment quality, not VAD threshold tuning.


Step 2: Purity/Coverage Asymmetry Points to Fragmentation

text
Purity:   97.31%   ← very high
Coverage: 88.63%   ← meaningfully lower

This is a textbook fragmentation signature. High Purity means each cluster contains predominantly one speaker — no aggressive merging. Low Coverage means a single reference speaker's time is scattered across multiple clusters.

Fragmentation data confirms it:

text
GT 1:  split_count=57  hyp_count=8   ← 144s speaker split into 8 clusters
GT 2:  split_count=42  hyp_count=4
GT 6:  split_count=17  hyp_count=4

GT 1 (the longest speaker at 144 seconds) has been fragmented into 8 hypothesis clusters across 57 segment boundaries. This is consistent with speaker embeddings that drift due to vocal variation — changes in pitch, emotional state, microphone distance, or room acoustics — causing the clustering algorithm to assign consecutive utterances from the same person to different centroids.


Step 3: Hypothesis Speaker Count Corroborates the Direction

text
ref_speakers: 15
hyp_speakers: 17   ← 2 extra clusters
count_error:  mean_abs=0.050  exact_match_ratio=0.950

The system creates 2 more clusters than exist. A merging-dominant system would produce the opposite (hyp < ref). The directional signal — hyp > ref — is consistent with fragmentation. Meanwhile, the excellent Count Error (95% exact frame-level match) tells us the system understands "how many speakers are active right now" almost perfectly. The problem is not the count — it is the consistency of identity assignment across time.

Good Count Error + High DER: What This Combination Means

"The system knows there are 3 speakers active. It just can't decide which of its 5 clusters they are." The bottleneck is not speaker detection or counting — it is the clustering algorithm's ability to maintain consistent speaker identity over a long recording.


Step 4: The Confusion Matrix Reveals the Mixing Structure

Examining the confusion matrix output:

text
GT 6:  dominant=speaker_9 share=49.6%
       overlaps=[speaker_9:49.6%, speaker_7:43.9%, speaker_10:3.4%, ...]
       → GT 6's speech is split nearly 50/50 between two hypothesis clusters

GT 9:  dominant=speaker_15 share=57.7%
       overlaps=[speaker_15:57.7%, speaker_10:21.3%, ...]
       → speaker_10 appears in both GT 6 and GT 9 — acting as a catch-all fragment cluster

GT 11: dominant=speaker_14 share=36.1%   ← lowest dominant share
       detected=53.8%                     ← 46% of speech is missed entirely
       → Only 2.38s of total speech; too short for a stable cluster to form

GT 11 illustrates the fundamental challenge with low-talktime speakers: the clustering algorithm has insufficient data to form a reliable centroid. This is why JER is so much worse than DER — these minority speakers have terrible per-speaker JER values, but they contribute almost nothing to the total-duration DER.


Step 5: Boundary Error Confirms the Fragmentation Mechanism

text
boundary_error: precision=0.785  recall=0.886  f1=0.833

Recall exceeds Precision, meaning the system inserts excess boundaries — it is cutting speech into more segments than exist in reality. This is the mechanical driver of fragmentation: each spurious boundary is an opportunity for the speaker assignment to flip to a different cluster ID.

Fine-grained boundary analysis reveals:

text
start+0.5s: coverage=76.1%  confusion=18.1%
end-0.5s:   coverage=67.1%  confusion=25.0%   ← consistently worse at utterance end

The final 500 ms of utterances shows more confusion than the opening 500 ms. Acoustic fading, trailing breath sounds, and near-silence at utterance endings cause the system's speaker embedding extraction to become unreliable, leading to boundary placement jitter and cascading label inconsistency.


Step 6: Short-Utterance Analysis Separates VAD from Assignment Errors

text
Utterance length conditioned errors:
  0.5–1s:   coverage=43.5%  confusion=56.5%  miss=0.0%   (n=4)
  1–2s:     coverage=70.7%  confusion=23.8%  miss=5.5%   (n=16)
  2s+:      coverage=85.9%  confusion=11.6%  miss=2.5%   (n=78)

This is the most important cut for actionable diagnosis.

For utterances under 1 second: miss=0% — VAD detects them. But confusion=56.5% — more than half are attributed to the wrong speaker. Short segments produce lower-quality embeddings. With less acoustic evidence, the assignment is essentially a coin flip between nearby cluster centroids.

This finding decisively locates the problem in speaker assignment logic, not VAD. The intervention should be at the clustering/assignment level: for example, using temporal context (prior/subsequent segments) to anchor short-segment assignments, or applying confidence-weighted assignment that defaults to the nearest temporal neighbor for segments below a length threshold.


Consolidated Diagnosis

Across all six metrics, a consistent story emerges:

"The system detects speech reliably and estimates speaker counts well. But it cannot maintain a consistent speaker identity over a long recording — the same person gets split into multiple clusters (fragmentation), which is most severe for short utterances and utterance endpoints."

| Hypothesis | Supporting Evidence |
|---|---|
| Primary cause: fragmentation | Purity high, Coverage low; hyp > ref; split_count large; Boundary precision < recall |
| VAD is not the primary issue | MISS/FA both < 3%; short-utterance miss = 0% |
| Speaker assignment is the bottleneck | Confusion dominates DER; short-utterance confusion high; Count Error excellent |
| Endpoint instability compounds the problem | Utterance end−0.5s shows higher confusion than start+0.5s |

Recommended improvement priority:

  1. Strengthen cluster merging strategy — directly addresses oversegmentation / fragmentation
  2. Use temporal context for short-segment assignment — targets the 0.5–2s confusion peak
  3. Improve utterance endpoint segmentation accuracy — removes a secondary driver of fragmentation

Noise removal and VAD tuning are low priority for this specific data. Applying those first would be optimizing the wrong component.

Collar=0.0 Makes All Numbers Look Worse

The figures above use strict collar=0.0. Switching to the commonly reported collar=0.25 would reduce DER to approximately 13–15% on the same output. This is an evaluation policy change, not a system improvement. When comparing to published benchmarks, ensure collar settings match exactly.


Diagnostic Decision Tree

text
DER is high
    │
    ├─ MISS component is large  → Increase VAD sensitivity
    │
    ├─ FA component is large    → Increase VAD threshold; add noise filtering
    │
    └─ CONF component is large
           │
           ├─ Count Error also large    → Fix speaker count estimation (clustering k)
           │
           ├─ Purity is low             → Clusters contain mixed speakers
           │                              → Make clustering finer-grained
           │
           ├─ Coverage is low           → One speaker split across many clusters
           │                              → Improve cluster merging strategy
           │
           ├─ Boundary F1 is low        → Speaker change point detection is inaccurate
           │
           └─ Short utterance recall    → VAD or embedding quality for
              is low                      brief speech segments

Metric Comparison Template

Use this table to structure multi-system comparisons in papers or reports.

markdown
| Metric                  | System A | System B | Baseline |
|-------------------------|----------|----------|----------|
| DER (%)                 | 12.3     | 10.8     | 18.5     |
| JER (%)                 | 15.6     | 13.2     | 22.1     |
| Purity                  | 0.921    | 0.943    | 0.872    |
| Coverage                | 0.884    | 0.901    | 0.821    |
| Count Error (mean abs)  | 0.4      | 0.3      | 1.2      |
| Boundary F1             | 0.823    | 0.871    | 0.741    |
| ULR < 1s                | 0.51     | 0.68     | 0.38     |
| Collar (sec)            | 0.25     | 0.25     | 0.25     |
| Overlap speech excluded | No       | No       | Yes      |
| Dataset                 | AMI      | AMI      | AMI      |
| pyannote.metrics ver.   | 3.2.1    | 3.2.1    | 3.2.1    |

Critical Notes for Benchmark Comparisons

  1. Always report the collar value. DER differences of 3–5 percentage points between collar=0 and collar=0.25 are common.
  2. Document overlapping speech handling. Many published results exclude overlapping speech segments from evaluation. Systems optimized for non-overlapping speech will appear better than those that handle overlap natively.
  3. Report whether optimal mapping (Hungarian algorithm) was used. Different label-assignment strategies affect scores, especially in low-speaker-count scenarios.
  4. Pin your library version. pyannote.metrics behavior has changed across versions; results may not be reproducible across versions without pinning.

FAQ

Q1: Can DER ever reach 0%?

In theory yes, but in practice, inter-annotator agreement between two human annotators on the same recording typically yields DER of 5–7% on corpora like AMI. Boundary placement is inherently subjective — where exactly a speaker "stops" and silence or another speaker "begins" is ambiguous even for humans. This is why DER should not be treated as a loss function to minimize to zero, but as a relative comparison tool.

Q2: Why does pyannote.metrics give different results from NIST's md-eval tool?

md-eval processes annotations at the utterance level, while pyannote.metrics operates on a continuous timeline. They also differ in default collar settings and how they handle speaker mapping and overlapping speech. Neither is definitively "correct" — they make different design choices. Always use whichever tool is specified by the benchmark you are submitting to, and never compare scores across tools directly.

Q3: Is JER useful when there are only 2 speakers?

With 2 speakers and reasonably balanced talk time, DER and JER will be nearly identical. JER's value grows with speaker count and talk-time imbalance. For a 2-speaker interview with 60/40 split, there is little reason to report both. For a 6-person meeting where the moderator speaks 60% of the time, JER is informative precisely because it does not let the moderator's accuracy dominate the score.

Q4: What tolerance should I use for Boundary Error?

Application context determines this:

  • Real-time captioning, live meeting transcription: 100 ms or tighter
  • Offline speaker labeling and analysis: 500 ms is generally acceptable
  • Benchmark evaluation (DIHARD, CHiME-6 conventions): 500 ms is standard

Report the tolerance as a hyperparameter. Boundary F1 at 500 ms tolerance vs. 100 ms can differ enormously for the same system.

Q5: How do I evaluate a meeting where speakers join and leave (late arrivals, early departures)?

Standard DER implementations will count missing speakers as false alarms for the entire file duration, even during intervals when those speakers were not present. The cleanest approach is to define a UEM (Unpartitioned Evaluation Map) that restricts scoring to the intervals when each speaker is actually present, then mask out intervals where the speaker count in the reference is undefined. Alternatively, use count-based evaluation restricted to time spans with annotated speaker activity.
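The UEM restriction amounts to clipping every labeled segment to the evaluation spans before scoring. A minimal sketch, with invented segment times and a speaker who is only scoreable in a sub-interval of the file:

```python
# Sketch: restricting scoring to UEM intervals. Segments are
# (start, end, speaker); only time inside the UEM spans is counted.

def clip_to_uem(segments, uem):
    """Keep only the parts of each labeled segment inside the UEM spans."""
    out = []
    for start, end, spk in segments:
        for u0, u1 in uem:
            s, e = max(start, u0), min(end, u1)
            if s < e:
                out.append((s, e, spk))
    return out

ref = [(0.0, 60.0, "A"), (60.0, 120.0, "B")]
# Speaker B joined late and A left early: score only t=30..100.
uem = [(30.0, 100.0)]

clipped = clip_to_uem(ref, uem)
print(clipped)  # [(30.0, 60.0, 'A'), (60.0, 100.0, 'B')]
scored = sum(e - s for s, e, _ in clipped)
print(scored)   # 70.0 seconds scored instead of 120.0
```

Everything outside the UEM simply never enters the FA/MISS/CONF accounting, so arrivals and departures stop distorting the score.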

Q6: My Purity and Coverage are both above 0.90 but DER is still above 15%. What's wrong?

Purity and Coverage are time-weighted ratios, so large, well-clustered speakers can pull both metrics high even when smaller speakers are fragmented. DER also includes FA and MISS components that Purity/Coverage do not reflect. Decompose DER into its three components first. If CONF is dominant, look at the confusion matrix for the specific pairs causing confusion. If MISS or FA is large, focus on VAD calibration rather than clustering quality.
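The "big cluster masks small speakers" effect is visible in a toy contingency matrix. The cluster names and overlap durations below are invented for illustration; the Purity/Coverage formulas match the time-weighted definitions used earlier in this article.

```python
# Sketch: time-weighted Purity and Coverage from a cluster-vs-speaker
# contingency matrix of overlap durations (toy values, in seconds).

contingency = {
    # hypothesis cluster -> {reference speaker: overlap duration}
    "c1": {"A": 300.0, "B": 10.0},   # big, clean cluster for speaker A
    "c2": {"B": 40.0},
    "c3": {"B": 30.0, "C": 20.0},    # small speakers split and mixed
}

total = sum(d for row in contingency.values() for d in row.values())

# Purity: each cluster credited for its dominant speaker.
purity = sum(max(row.values()) for row in contingency.values()) / total

# Coverage: each speaker credited for its dominant cluster.
speakers = {s for row in contingency.values() for s in row}
coverage = sum(
    max(row.get(s, 0.0) for row in contingency.values()) for s in speakers
) / total

print(f"purity={purity:.3f}, coverage={coverage:.3f}")  # 0.925, 0.900
```

Speaker B is fragmented across all three clusters and C is buried inside c3, yet both aggregate metrics stay at or above 0.90 because speaker A's 300 seconds dominate the weighting — exactly the situation described in the question.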

Q7: What is the relationship between Boundary Error and DER's collar setting?

DER's collar hides boundary errors by excluding a window around reference boundaries from scoring. Boundary Error measures those same errors directly. A system where every boundary is consistently 200 ms late will show normal DER under collar=0.25 but poor Boundary localization scores. If your application requires precise timestamps — for subtitle synchronization, for example — Boundary Error is essential even when DER looks acceptable.


Q8: How do I handle overlapping speech in DER computation?

By default, pyannote.metrics includes overlapping speech segments (where two speakers talk simultaneously) in evaluation. Setting skip_overlap=True excludes these segments. Systems that output only single-speaker labels at any moment will be unfairly penalized by overlapping-speech segments under skip_overlap=False. Most published results on challenging corpora (CHiME-6, AMI with overlap) report with skip_overlap=False, but single-speaker diarization systems are sometimes evaluated with skip_overlap=True. Document this explicitly.
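The penalty on single-speaker-output systems can be sketched with frame-level scoring. This is an illustrative miss-rate calculation, not pyannote.metrics internals; frame counts and labels are invented.

```python
# Sketch: effect of skip_overlap on a system that outputs at most one
# speaker per frame. Reference frames carry a set of active speakers.

def miss_rate(ref, hyp, skip_overlap):
    miss = total = 0
    for r, h in zip(ref, hyp):
        if skip_overlap and len(r) > 1:
            continue  # overlap frames excluded from scoring
        total += len(r)
        miss += len(r - ({h} if h else set()))  # reference speakers not output
    return miss / total if total else 0.0

# 100 frames of A, 20 frames where A and B overlap, 80 frames of B.
ref = [{"A"}] * 100 + [{"A", "B"}] * 20 + [{"B"}] * 80
hyp = ["A"] * 110 + ["B"] * 90   # one speaker at a time, boundary mid-overlap

print(miss_rate(ref, hyp, skip_overlap=False))  # ~0.091: overlap is all missed
print(miss_rate(ref, hyp, skip_overlap=True))   # 0.0: overlap excluded
```

Under skip_overlap=False the system necessarily misses one speaker on every overlap frame; under skip_overlap=True the same output scores perfectly, which is why the setting must be documented.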


Summary

| Metric | What It Measures | Primary Use Case |
| --- | --- | --- |
| DER | Total error rate (time-weighted) | Benchmark comparison, overall system health |
| JER | Error rate with equal weight per speaker | Multi-speaker fairness, minority speaker evaluation |
| Confusion Matrix | Which speakers are confused with which | Root-cause debugging of CONF component |
| Purity / Coverage | Cluster quality: mixing vs. splitting | Diagnosing fragmentation vs. merging problems |
| Count Error | Accuracy of active speaker count estimation | Evaluating clustering strategy, k selection |
| Boundary Error | Accuracy of speaker change point detection | Segmentation quality, timestamp-sensitive applications |
| Utterance Length Recall | Short utterance recovery rate | VAD tuning, embedding quality for brief speech |

Evaluating a diarization system with DER alone is like running a health check with only a thermometer. The single number tells you something is wrong but not where. By combining DER with the six complementary metrics described here, you can precisely locate whether the problem is in VAD, speaker embedding quality, cluster assignment, merging strategy, or boundary detection — and direct your engineering effort accordingly.

The diagnostic framework is not theoretical: every number in the real-world example section came from actual system output on a 15-speaker meeting recording, illustrating that the metrics genuinely tell different parts of the story.

Start with DER for the headline. Let Purity/Coverage tell you the category of failure. Use the confusion matrix to name the specific speakers causing problems. Then use Boundary Error and Utterance Length Recall to decide whether the fix lives in your segmentation or your clustering pipeline.
