Suppressing LLM Repetition Hallucinations with a Custom Logits Processor in Qwen

Summary (TL;DR)

When fine-tuning Qwen for structured outputs like JSON, token repetition hallucinations are a common and frustrating problem. Standard generation parameters like repetition_penalty and no_repeat_ngram_size can help, but their side effects are often too broad to be practical.

This article walks through a custom Logits Processor implementation for Transformers that applies no-repeat-ngram control only inside the Content field of a JSON output — leaving the rest of the generation untouched. The result: hallucination loops suppressed where they matter, with minimal collateral damage.

What you'll take away:

  • A clear understanding of why repetition hallucinations happen after fine-tuning
  • The practical limits of repetition_penalty and no_repeat_ngram_size
  • A fully reproducible LogitsProcessor implementation for field-scoped repetition control
  • Built-in debug tooling for tuning in production

1. What Is Repetition Hallucination in Fine-Tuned Qwen?

Qwen is a strong base model with competitive accuracy and inference speed. But fine-tuning disrupts the original token distribution, and one of the clearest symptoms is repetition hallucination — the model enters a loop and keeps generating the same phrase indefinitely.

What it looks like in practice

Suppose you're fine-tuning Qwen to output structured JSON like this:

json
{
  "Title": "Efficiency Proposal",
  "Content": "To improve meeting efficiency, we recommend..."
}

After fine-tuning, the model might produce output like:

text
"Content": "To improve meeting efficiency, we recommend sharing agendas in advance, sharing agendas in advance, sharing agendas in advance, sharing agendas in advance..."

Why it happens

There are a few overlapping causes:

  • Training data distribution bias: If repeated phrases appear frequently in your training set, the model learns to reproduce the pattern.
  • Interaction with temperature settings: At low temperatures (closer to greedy decoding), sampling concentrates on the highest-probability continuation, so once a repeating pattern becomes the most likely next step, the model keeps reselecting it.
  • Overfitting from small datasets: Fine-tuning on limited data reduces generalization and increases convergence toward specific patterns.

Repetition hallucination after fine-tuning is not a Qwen-specific problem. The same issue surfaces in Llama, Mistral, and other Transformer-based LLMs — the mechanism is the same.


2. The Limits of Standard Parameters

repetition_penalty

The repetition_penalty parameter in Transformers penalizes logits for tokens that have already been generated. A value of 1.0 means no penalty; around 1.3 begins to have a noticeable effect.

python
generation_config = {
    "repetition_penalty": 1.3,
}

It works — but the scope is the entire generated sequence. That means legitimate repeated tokens — particles, conjunctions, proper nouns, common phrases — also get penalized, leading to unnatural output.
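To see what the penalty actually does to the scores, here is a simplified one-dimensional sketch of the CTRL-style rule that Transformers applies: for every token already generated, a positive logit is divided by the penalty and a negative one is multiplied by it, making the token less likely either way. (This is an illustrative re-implementation, not the library internals.)

```python
import torch

def apply_repetition_penalty(
    scores: torch.Tensor, generated_ids: torch.Tensor, penalty: float = 1.3
) -> torch.Tensor:
    # Gather the logits of tokens that were already generated.
    picked = scores.gather(0, generated_ids)
    # Positive logits shrink (divide), negative logits grow more negative
    # (multiply), so previously seen tokens become less likely either way.
    picked = torch.where(picked > 0, picked / penalty, picked * penalty)
    return scores.scatter(0, generated_ids, picked)

scores = torch.tensor([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(scores, torch.tensor([0, 1]))
# Token 0 shrinks toward zero, token 1 grows more negative, token 2 is untouched.
```

This also shows why legitimate repeats (particles, conjunctions) get dragged down too: the rule has no notion of where in the output a token occurred.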

no_repeat_ngram_size

Setting no_repeat_ngram_size=3 outright bans any trigram that has appeared before in the generated sequence. It's powerful, but the side effects are equally powerful.

python
generation_config = {
    "no_repeat_ngram_size": 3,
}

Real examples of unintended side effects:

  • Bullet-pointed lists can't repeat common sentence endings (e.g., "is required", "should be noted")
  • Structured or templated outputs break because intended repetition gets banned
  • The same greeting or closing phrase can't appear more than once in a document

The root problem: you only want to constrain a specific field, but the parameter constrains everything.
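For intuition, the ban mechanism can be sketched in a few lines (a simplified re-implementation for illustration, not the actual Transformers internals): collect every (n-1)-gram prefix seen so far together with the token that followed it, then ban whatever followed the current suffix.

```python
from collections import defaultdict

def banned_next_tokens(token_ids: list[int], n: int = 3) -> list[int]:
    # Map each (n-1)-token prefix to the set of tokens that have followed it.
    followers = defaultdict(set)
    for i in range(len(token_ids) - n + 1):
        prefix = tuple(token_ids[i : i + n - 1])
        followers[prefix].add(token_ids[i + n - 1])
    # The last n-1 tokens form the current prefix; ban its known followers.
    current = tuple(token_ids[len(token_ids) - n + 1 :])
    return sorted(followers.get(current, set()))

banned_next_tokens([1, 2, 3, 4, 1, 2])  # the trigram (1, 2, 3) already exists, so 3 is banned
```

Because the lookup covers the whole generated sequence, the ban is global; that is exactly the scoping problem the next section addresses.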


3. The Solution: Field-Scoped Logits Processor

Transformers exposes a LogitsProcessor API that lets you inject custom logic at every token generation step, modifying the raw logit scores before sampling.

Key terminology

  • Logits: raw per-token scores output by the model before softmax
  • LogitsProcessor: a class that transforms logits at each generation step; setting a logit to -inf effectively bans that token
  • LogitsProcessorList: a container that chains multiple processors in sequence
  • n-gram: a contiguous sequence of n tokens; a trigram (n=3) is any sequence of 3 tokens
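For orientation, the contract a processor must satisfy is small. The skeleton below is illustrative only (a real implementation subclasses transformers' LogitsProcessor; this duck-typed class just shows the call signature):

```python
import torch

class BanTokenZeroProcessor:
    """Illustrative skeleton: unconditionally ban token id 0 at every step."""

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor
    ) -> torch.FloatTensor:
        # Called once per generation step with the running input_ids and the
        # raw next-token scores; returns the (possibly modified) scores.
        scores = scores.clone()
        scores[:, 0] = -float("inf")  # a -inf logit can never be sampled
        return scores
```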

Design philosophy

The ContentNoRepeatNGramLogitsProcessor implemented here works on a simple idea:

  1. At each generation step, decode the generated tokens so far and check whether the model is currently inside the Content field of the JSON output
  2. If yes, apply no-repeat-ngram to the content text only
  3. If no (e.g., the model is generating "Title": or closing brackets), leave logits untouched

This "where are we in the structure right now?" approach is a broadly applicable pattern for controlled structured generation.
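Step 1 is the interesting part. As a toy illustration of the position check (escape handling is deliberately omitted here; the full implementation in section 4 handles it, and the exact marker string depends on your output format):

```python
def inside_open_content(partial_json: str) -> bool:
    # Are we past a '"Content": "' opener with no closing quote yet?
    marker = '"Content": "'
    idx = partial_json.rfind(marker)
    if idx == -1:
        return False  # the field hasn't been opened yet
    rest = partial_json[idx + len(marker):]
    return '"' not in rest  # no closing quote seen: still inside the value

inside_open_content('{"Title": "A", "Content": "partial tex')  # True
inside_open_content('{"Content": "done"')                      # False
```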


4. Full Implementation

The following is a complete, self-contained implementation. It works with any Transformers-compatible model — not just Qwen.

python
from __future__ import annotations

from datetime import datetime
from typing import Optional

import torch
from transformers.generation.logits_process import (
    LogitsProcessor,
    LogitsProcessorList,
    _calc_banned_ngram_tokens,
)


class ContentNoRepeatNGramLogitsProcessor(LogitsProcessor):
    """Applies no-repeat-ngram exclusively inside the JSON Content field."""

    def __init__(
        self,
        tokenizer,
        ngram_size: int,
        decode_max_tokens: int = 2048,
        debug: bool = False,
    ):
        if not isinstance(ngram_size, int) or ngram_size <= 0:
            raise ValueError(
                f"`ngram_size` has to be a strictly positive integer, but is {ngram_size}"
            )
        if not isinstance(decode_max_tokens, int) or decode_max_tokens <= 0:
            raise ValueError(
                "`decode_max_tokens` has to be a strictly positive integer, "
                f"but is {decode_max_tokens}"
            )
        self.tokenizer = tokenizer
        self.ngram_size = ngram_size
        self.decode_max_tokens = decode_max_tokens
        self.debug = debug
        self._debug_calls = 0
        self._debug_hits = 0

    @staticmethod
    def _extract_open_content_text(decoded_text: str) -> Optional[str]:
        """
        Extracts the text inside an open (not yet closed) Content field.
        Returns None if the field is closed or not present.
        Tolerates optional whitespace between the colon and the opening
        quote, so both "Content":"..." and "Content": "..." are matched.
        """
        marker = '"Content":'
        marker_index = decoded_text.rfind(marker)
        if marker_index == -1:
            return None

        # Skip optional whitespace, then require the opening quote
        index = marker_index + len(marker)
        while index < len(decoded_text) and decoded_text[index] in " \t\r\n":
            index += 1
        if index >= len(decoded_text) or decoded_text[index] != '"':
            return None

        content_start = index + 1
        escaped = False
        for index in range(content_start, len(decoded_text)):
            char = decoded_text[index]
            if escaped:
                escaped = False
                continue
            if char == "\\":
                escaped = True
                continue
            if char == '"':
                # Closing quote found — Content field is complete
                return None
        return decoded_text[content_start:]

    @staticmethod
    def _debug_timestamp() -> str:
        return datetime.now().strftime("%H:%M:%S")

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor
    ) -> torch.FloatTensor:
        self._debug_calls += 1
        num_batch_hypotheses = scores.shape[0]
        scores_processed = scores.clone()

        for i in range(num_batch_hypotheses):
            # Limit decoding to recent tokens for performance
            decode_ids = input_ids[i, -self.decode_max_tokens :]
            decoded_text = self.tokenizer.decode(
                decode_ids,
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False,
            )

            # Check whether we're inside an open Content field
            content_text = self._extract_open_content_text(decoded_text)
            if content_text is None:
                if self.debug:
                    print(
                        f"[{self._debug_timestamp()}][ContentNoRepeat] "
                        f"Not in Content field. Tail: {decoded_text[-80:]}"
                    )
                continue

            if self.debug:
                preview = content_text[-80:].replace("\n", "\\n")
                print(
                    f"[{self._debug_timestamp()}][ContentNoRepeat] "
                    f"step={self._debug_calls} batch={i} "
                    f"content_chars={len(content_text)} tail={preview!r}"
                )

            # Tokenize the Content field text
            content_ids = self.tokenizer.encode(
                content_text,
                add_special_tokens=False,
            )
            if len(content_ids) + 1 < self.ngram_size:
                # Not enough tokens yet to form an ngram
                continue

            content_input_ids = input_ids.new_tensor(content_ids).unsqueeze(0)
            banned_batch_tokens = _calc_banned_ngram_tokens(
                self.ngram_size,
                content_input_ids,
                1,
                content_input_ids.shape[-1],
            )

            banned_tokens = banned_batch_tokens[0]
            if banned_tokens:
                self._debug_hits += 1
                if self.debug:
                    print(
                        f"[{self._debug_timestamp()}][ContentNoRepeat] "
                        f"Banned {len(banned_tokens)} token(s) "
                        f"for batch={i} at step={self._debug_calls}"
                    )
                # Set banned token logits to -inf to prevent sampling
                scores_processed[i, banned_tokens] = -float("inf")

        return scores_processed


class ContentNoRepeatGenerationMixin:
    """Mixin for integrating the processor into CLI tools or model classes."""

    @staticmethod
    def add_content_no_repeat_cli_args(parser) -> None:
        parser.add_argument(
            "--content_no_repeat_ngram_size",
            type=int,
            default=0,
            help='no-repeat-ngram size applied only to the "Content" JSON field (0 to disable)',
        )
        parser.add_argument(
            "--content_no_repeat_decode_max_tokens",
            type=int,
            default=1024,
            help="Max tokens to decode for Content field detection",
        )
        parser.add_argument(
            "--content_no_repeat_debug",
            action="store_true",
            help="Enable debug logging",
        )

    @staticmethod
    def build_content_no_repeat_logits_processor(
        tokenizer,
        content_no_repeat_ngram_size: int = 0,
        content_no_repeat_decode_max_tokens: int = 2048,
        content_no_repeat_debug: bool = False,
    ) -> Optional[LogitsProcessorList]:
        processors = LogitsProcessorList()
        if content_no_repeat_ngram_size and content_no_repeat_ngram_size > 0:
            processors.append(
                ContentNoRepeatNGramLogitsProcessor(
                    tokenizer=tokenizer,
                    ngram_size=content_no_repeat_ngram_size,
                    decode_max_tokens=content_no_repeat_decode_max_tokens,
                    debug=content_no_repeat_debug,
                )
            )
        return processors or None

Code walkthrough

_extract_open_content_text

Searches the decoded text for the "Content": key and returns everything after its opening quote, but only if the closing " hasn't appeared yet. The key subtlety is escape handling: in JSON, \" is a valid escaped quote inside a string value, so a naive search for " would falsely detect the end of the field. The method therefore walks through the text character by character, tracking backslash escapes.
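To make the escape behavior concrete, here is the same scan as a standalone function (a self-contained copy for illustration, using the no-whitespace marker form):

```python
from typing import Optional

def extract_open_content(decoded_text: str) -> Optional[str]:
    marker = '"Content":"'
    idx = decoded_text.rfind(marker)
    if idx == -1:
        return None
    start = idx + len(marker)
    escaped = False
    for ch in decoded_text[start:]:
        if escaped:
            escaped = False      # this character was escaped; skip it
        elif ch == "\\":
            escaped = True       # next character is escaped
        elif ch == '"':
            return None          # unescaped quote: the field is closed
    return decoded_text[start:]

extract_open_content('{"Content":"He said \\"hi\\" and')  # field still open
extract_open_content('{"Content":"done"}')                # None: field closed
```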

decode_max_tokens

At each generation step, decoding the full sequence from scratch gets expensive as output grows. Limiting decoding to the most recent decode_max_tokens tokens keeps inference practical. A value of 1024 is usually sufficient; note that the CLI flag below defaults to 1024 while the class constructor defaults to 2048.

_calc_banned_ngram_tokens

This is an internal Transformers utility that computes the list of banned tokens given a sequence and an ngram size. It's prefixed with _, meaning it's a private API — verify behavior when upgrading Transformers versions.


5. Reproducing the Results

Installation

bash
pip install transformers torch

Basic usage

python
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation.logits_process import LogitsProcessorList

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Build the logits processor
logits_processor = ContentNoRepeatGenerationMixin.build_content_no_repeat_logits_processor(
    tokenizer=tokenizer,
    content_no_repeat_ngram_size=4,       # Ban 4-gram repetitions
    content_no_repeat_decode_max_tokens=1024,
    content_no_repeat_debug=True,          # Set True during development
)

# Run inference
inputs = tokenizer("Generate a JSON response:", return_tensors="pt")
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "do_sample": True,
}

output_ids = model.generate(
    **inputs,
    **generation_config,
    logits_processor=logits_processor,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Choosing ngram_size

  • 2: bans any 2-token repetition. Very aggressive, often too tight
  • 3: good balance. Start here if you're unsure
  • 4: looser; only bans longer repeated phrases
  • 5+: minimal effect in practice

Start with ngram_size=3 or 4, enable debug mode, and observe which tokens are being banned. Adjust based on what you see.

Enable debug=True during development. The processor logs each step, showing whether you're inside the Content field and which tokens are being suppressed. This makes tuning ngram_size significantly faster.


6. Observed Results and Caveats

What improved

  • Repetition loops inside the Content field were eliminated. The most common pattern — the same sentence repeated 3–5 times — disappeared entirely.
  • JSON structure keys were unaffected. Since the processor only activates inside the open Content field, tokens like "Title": and "Content": are generated normally.
  • The debug output provided clear visibility into processor behavior, making it easy to confirm it was working as intended.

Caveats to keep in mind

  • Per-step decode cost is real. Every generation step decodes a window of tokens to check whether we're inside the Content field. With decode_max_tokens=1024, this is fast enough in practice, but large batch sizes amplify the cost.
  • _calc_banned_ngram_tokens is a private API. It could change between Transformers releases. Add a version pin or a smoke test to your CI if this goes to production.
  • This implementation assumes a single JSON structure with one Content field. If your output schema has multiple instances of the same key, rfind will locate the most recently opened one — which is usually the correct behavior but worth verifying.

Do not set decode_max_tokens too low. If the Content field start marker falls outside the decode window, the processor will fail to detect it and the repetition control will silently stop working.


7. FAQ

Q: Does this work with models other than Qwen?

Yes. The LogitsProcessor API is model-architecture-agnostic. Any model that supports model.generate() in Transformers — including Llama, Mistral, Phi, and others — will work with this code. Watch for tokenizer differences, particularly around special token handling.

Q: What if I want to apply this to a field other than Content?

Change the marker string in _extract_open_content_text from '"Content":"' to whatever field name you're targeting. To support multiple fields simultaneously, extend the class to accept a list of markers and activate on any open match.
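For the multi-field case, one possible extension (hypothetical names; the escape-tracking logic mirrors the single-field implementation above) activates on whichever target field was opened most recently:

```python
from typing import Optional, Sequence

def extract_open_field_text(
    decoded_text: str, markers: Sequence[str]
) -> Optional[str]:
    # Find the marker that appears last, i.e. the most recently opened field.
    best, best_marker = -1, None
    for m in markers:
        idx = decoded_text.rfind(m)
        if idx > best:
            best, best_marker = idx, m
    if best_marker is None:
        return None
    # Same escape-aware scan as the single-field version.
    start = best + len(best_marker)
    escaped = False
    for ch in decoded_text[start:]:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            return None  # this field is already closed
    return decoded_text[start:]
```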

Q: Can I use this with vLLM or TGI (Text Generation Inference)?

vLLM supports custom LogitsProcessors in some versions via a Transformers-compatible interface — worth checking for your specific version. TGI has limited support for custom processors as of this writing; verify before committing to it in a production stack.

Q: Can I combine this with repetition_penalty?

Yes, they're complementary. A mild repetition_penalty (around 1.1) applied globally combined with this processor for field-scoped ngram control can work well. Just avoid stacking both aggressively — the double constraint can over-suppress and degrade output quality. When combining, lean toward a larger ngram_size (4–5) to give the penalty room to work.
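A combined setup might look like this (a sketch only; the values are starting points rather than recommendations, and the field-scoped processor is built and passed to generate() as shown in the usage section):

```python
# Mild global penalty plus a field-scoped 5-gram ban inside Content only.
generation_kwargs = {
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,  # gentle global nudge
    # plus: logits_processor built with content_no_repeat_ngram_size=5
}
```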

Q: Wouldn't fixing the training data solve this without inference-time hacks?

Ideally, yes. In practice, data cleaning alone rarely eliminates this completely, especially with small fine-tuning datasets. Inference-time control and training data quality are complementary. Work on both in parallel rather than treating one as a substitute for the other.

Q: Won't a large ngram_size remove all side effects?

As ngram_size increases, side effects decrease — but so does the suppression effect. Above 6, the processor has almost no impact on repetitive outputs. Use debug mode to observe which specific repeated patterns are appearing, then find the minimum effective ngram_size that suppresses them.

This article documents first-hand results from applying this processor to a fine-tuned Qwen model for JSON generation. Behavior may vary depending on your base model, fine-tuning dataset, and generation parameters — use debug mode and empirically validate in your specific environment.


Conclusion

Repetition hallucinations in fine-tuned LLMs are a well-known pain point, and the standard solutions (repetition_penalty, no_repeat_ngram_size) apply blunt global constraints that often cause as many problems as they solve. The custom LogitsProcessor approach described here solves the problem at the right scope: control is applied exactly where the issue occurs, and nowhere else.

The broader pattern — decode the current position in a structured output and apply conditional constraints — is reusable. If your model generates structured data, templates, or any output with a defined schema, this architecture gives you a reliable way to add fine-grained generation control without retraining.
