Suppressing LLM Repetition Hallucinations with a Custom Logits Processor in Qwen
Summary (TL;DR)
When fine-tuning Qwen for structured outputs like JSON, token repetition hallucinations are a common and frustrating problem. Standard generation parameters like repetition_penalty and no_repeat_ngram_size can help, but their side effects are often too broad to be practical.
This article walks through a custom Logits Processor implementation for Transformers that applies no-repeat-ngram control only inside the Content field of a JSON output — leaving the rest of the generation untouched. The result: hallucination loops suppressed where they matter, with minimal collateral damage.
What you'll take away:
- A clear understanding of why repetition hallucinations happen after fine-tuning
- The practical limits of `repetition_penalty` and `no_repeat_ngram_size`
- A fully reproducible `LogitsProcessor` implementation for field-scoped repetition control
- Built-in debug tooling for tuning in production
1. What Is Repetition Hallucination in Fine-Tuned Qwen?
Qwen is a strong base model with competitive accuracy and inference speed. But fine-tuning disrupts the original token distribution, and one of the clearest symptoms is repetition hallucination — the model enters a loop and keeps generating the same phrase indefinitely.
What it looks like in practice
Suppose you're fine-tuning Qwen to output structured JSON like this:
```json
{
  "Title": "Efficiency Proposal",
  "Content": "To improve meeting efficiency, we recommend..."
}
```
After fine-tuning, the model might produce output like:
"Content": "To improve meeting efficiency, we recommend sharing agendas in advance, sharing agendas in advance, sharing agendas in advance, sharing agendas in advance..."
Why it happens
There are a few overlapping causes:
- Training data distribution bias: If repeated phrases appear frequently in your training set, the model learns to reproduce the pattern.
- Interaction with temperature settings: At low temperatures (closer to greedy decoding), once a repeated phrase becomes the highest-probability continuation, the model keeps reselecting it and locks into the loop.
- Overfitting from small datasets: Fine-tuning on limited data reduces generalization and increases convergence toward specific patterns.
Repetition hallucination after fine-tuning is not a Qwen-specific problem. The same issue surfaces in Llama, Mistral, and other Transformer-based LLMs — the mechanism is the same.
2. The Limits of Standard Parameters
repetition_penalty
The repetition_penalty parameter in Transformers penalizes logits for tokens that have already been generated. A value of 1.0 means no penalty; around 1.3 begins to have a noticeable effect.
```python
generation_config = {
    "repetition_penalty": 1.3,
}
```
It works — but the scope is the entire generated sequence. That means legitimate repeated tokens — particles, conjunctions, proper nouns, common phrases — also get penalized, leading to unnatural output.
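To make the mechanism concrete, here is a small sketch of the rescaling rule that Transformers' `RepetitionPenaltyLogitsProcessor` applies (an illustrative standalone function, not the library code): positive logits of already-generated tokens are divided by the penalty, negative ones multiplied by it.

```python
import torch

def apply_repetition_penalty(scores, generated_ids, penalty=1.3):
    """Sketch of the repetition_penalty rule: previously generated tokens
    have their logits pushed down (divide if positive, multiply if negative)."""
    scores = scores.clone()
    prev = scores[0, generated_ids]
    scores[0, generated_ids] = torch.where(prev < 0, prev * penalty, prev / penalty)
    return scores

scores = torch.tensor([[2.0, -1.0, 0.5]])
penalized = apply_repetition_penalty(scores, torch.tensor([0, 1]))
# token 0: 2.0 / 1.3 ≈ 1.54; token 1: -1.0 * 1.3 = -1.3; token 2 untouched
```

Note that the rule is applied to every token that has appeared anywhere in the sequence, which is exactly why the side effects described above are global.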
no_repeat_ngram_size
Setting no_repeat_ngram_size=3 outright bans any trigram that has appeared before in the generated sequence. It's powerful, but the side effects are equally powerful.
```python
generation_config = {
    "no_repeat_ngram_size": 3,
}
```
Real examples of unintended side effects:
- Bullet-pointed lists can't repeat common sentence endings (e.g., "is required", "should be noted")
- Structured or templated outputs break because intended repetition gets banned
- The same greeting or closing phrase can't appear more than once in a document
The root problem: you only want to constrain a specific field, but the parameter constrains everything.
3. The Solution: Field-Scoped Logits Processor
Transformers exposes a LogitsProcessor API that lets you inject custom logic at every token generation step, modifying the raw logit scores before sampling.
Key terminology
| Term | What it means |
|---|---|
| Logits | Raw per-token scores output by the model before softmax |
| LogitsProcessor | A class that transforms logits at each generation step. Setting a logit to -inf effectively bans that token |
| LogitsProcessorList | A container that chains multiple processors in sequence |
| n-gram | A contiguous sequence of n tokens. A trigram (n=3) is any sequence of 3 tokens |
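The `-inf` mechanism from the table is easy to verify directly: after softmax, a `-inf` logit carries exactly zero probability mass, so the token can never be sampled.

```python
import torch

scores = torch.tensor([[1.0, 2.0, 3.0]])
scores[0, 2] = float("-inf")  # ban token id 2
probs = torch.softmax(scores, dim=-1)
print(probs[0, 2].item())  # prints 0.0: the banned token can never be sampled
```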
Design philosophy
The ContentNoRepeatNGramLogitsProcessor implemented here works on a simple idea:
- At each generation step, decode the tokens generated so far and check whether the model is currently inside the `Content` field of the JSON output
- If yes, apply no-repeat-ngram to the content text only
- If no (e.g., the model is generating `"Title":` or closing brackets), leave the logits untouched
This "where are we in the structure right now?" approach is a broadly applicable pattern for controlled structured generation.
4. Full Implementation
The following is a complete, self-contained implementation. It works with any Transformers-compatible model — not just Qwen.
```python
from __future__ import annotations

from datetime import datetime
from typing import Optional

import torch
from transformers.generation.logits_process import (
    LogitsProcessor,
    LogitsProcessorList,
    _calc_banned_ngram_tokens,
)


class ContentNoRepeatNGramLogitsProcessor(LogitsProcessor):
    """Applies no-repeat-ngram exclusively inside the JSON Content field."""

    def __init__(
        self,
        tokenizer,
        ngram_size: int,
        decode_max_tokens: int = 2048,
        debug: bool = False,
    ):
        if not isinstance(ngram_size, int) or ngram_size <= 0:
            raise ValueError(
                f"`ngram_size` has to be a strictly positive integer, but is {ngram_size}"
            )
        if not isinstance(decode_max_tokens, int) or decode_max_tokens <= 0:
            raise ValueError(
                "`decode_max_tokens` has to be a strictly positive integer, "
                f"but is {decode_max_tokens}"
            )
        self.tokenizer = tokenizer
        self.ngram_size = ngram_size
        self.decode_max_tokens = decode_max_tokens
        self.debug = debug
        self._debug_calls = 0
        self._debug_hits = 0

    @staticmethod
    def _extract_open_content_text(decoded_text: str) -> Optional[str]:
        """
        Extracts the text inside an open (not yet closed) Content field.
        Returns None if the field is closed or not present.
        """
        # NOTE: assumes compact serialization with no space after the colon;
        # adjust the marker if your outputs use '"Content": "' instead
        marker = '"Content":"'
        marker_index = decoded_text.rfind(marker)
        if marker_index == -1:
            return None
        content_start = marker_index + len(marker)
        escaped = False
        for index in range(content_start, len(decoded_text)):
            char = decoded_text[index]
            if escaped:
                escaped = False
                continue
            if char == "\\":
                escaped = True
                continue
            if char == '"':
                # Closing quote found — Content field is complete
                return None
        return decoded_text[content_start:]

    @staticmethod
    def _debug_timestamp() -> str:
        return datetime.now().strftime("%H:%M:%S")

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor
    ) -> torch.FloatTensor:
        self._debug_calls += 1
        num_batch_hypotheses = scores.shape[0]
        scores_processed = scores.clone()
        for i in range(num_batch_hypotheses):
            # Limit decoding to recent tokens for performance
            decode_ids = input_ids[i, -self.decode_max_tokens :]
            decoded_text = self.tokenizer.decode(
                decode_ids,
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False,
            )
            # Check whether we're inside an open Content field
            content_text = self._extract_open_content_text(decoded_text)
            if content_text is None:
                if self.debug:
                    print(
                        f"[{self._debug_timestamp()}][ContentNoRepeat] "
                        f"Not in Content field. Tail: {decoded_text[-80:]}"
                    )
                continue
            if self.debug:
                preview = content_text[-80:].replace("\n", "\\n")
                print(
                    f"[{self._debug_timestamp()}][ContentNoRepeat] "
                    f"step={self._debug_calls} batch={i} "
                    f"content_chars={len(content_text)} tail={preview!r}"
                )
            # Tokenize the Content field text
            content_ids = self.tokenizer.encode(
                content_text,
                add_special_tokens=False,
            )
            if len(content_ids) + 1 < self.ngram_size:
                # Not enough tokens yet to form an ngram
                continue
            content_input_ids = input_ids.new_tensor(content_ids).unsqueeze(0)
            banned_batch_tokens = _calc_banned_ngram_tokens(
                self.ngram_size,
                content_input_ids,
                1,
                content_input_ids.shape[-1],
            )
            banned_tokens = banned_batch_tokens[0]
            if banned_tokens:
                self._debug_hits += 1
                if self.debug:
                    print(
                        f"[{self._debug_timestamp()}][ContentNoRepeat] "
                        f"Banned {len(banned_tokens)} token(s) "
                        f"for batch={i} at step={self._debug_calls}"
                    )
                # Set banned token logits to -inf to prevent sampling
                scores_processed[i, banned_tokens] = -float("inf")
        return scores_processed


class ContentNoRepeatGenerationMixin:
    """Mixin for integrating the processor into CLI tools or model classes."""

    @staticmethod
    def add_content_no_repeat_cli_args(parser) -> None:
        parser.add_argument(
            "--content_no_repeat_ngram_size",
            type=int,
            default=0,
            help='no-repeat-ngram size applied only to the "Content" JSON field (0 to disable)',
        )
        parser.add_argument(
            "--content_no_repeat_decode_max_tokens",
            type=int,
            default=1024,
            help="Max tokens to decode for Content field detection",
        )
        parser.add_argument(
            "--content_no_repeat_debug",
            action="store_true",
            help="Enable debug logging",
        )

    @staticmethod
    def build_content_no_repeat_logits_processor(
        tokenizer,
        content_no_repeat_ngram_size: int = 0,
        content_no_repeat_decode_max_tokens: int = 2048,
        content_no_repeat_debug: bool = False,
    ) -> Optional[LogitsProcessorList]:
        processors = LogitsProcessorList()
        if content_no_repeat_ngram_size and content_no_repeat_ngram_size > 0:
            processors.append(
                ContentNoRepeatNGramLogitsProcessor(
                    tokenizer=tokenizer,
                    ngram_size=content_no_repeat_ngram_size,
                    decode_max_tokens=content_no_repeat_decode_max_tokens,
                    debug=content_no_repeat_debug,
                )
            )
        return processors or None
```
Code walkthrough
_extract_open_content_text
Searches the decoded text for "Content":" as a marker and returns everything after it — but only if the closing " hasn't appeared yet. The key subtlety: escape sequences must be handled correctly. In JSON, \" is a valid escaped quote inside a string value, so a naive search for " would falsely detect the end of the field. The method walks through the text character by character, tracking backslash escapes.
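To make the escape handling concrete, here is the same scan as a standalone function (duplicated only so the probes below are self-contained), exercised against the open, escaped, and closed cases:

```python
from typing import Optional

def extract_open_content_text(decoded_text: str) -> Optional[str]:
    """Standalone copy of the marker scan, for illustration."""
    marker = '"Content":"'
    marker_index = decoded_text.rfind(marker)
    if marker_index == -1:
        return None
    content_start = marker_index + len(marker)
    escaped = False
    for index in range(content_start, len(decoded_text)):
        char = decoded_text[index]
        if escaped:
            escaped = False
            continue
        if char == "\\":
            escaped = True
            continue
        if char == '"':
            return None  # closing quote: field is complete
    return decoded_text[content_start:]

print(extract_open_content_text('{"Content":"hello'))      # 'hello' (field still open)
print(extract_open_content_text('{"Content":"say \\"hi'))  # escaped quote keeps it open
print(extract_open_content_text('{"Content":"done"}'))     # None (field closed)
```

A naive `decoded_text.find('"')` would have treated the escaped quote in the second probe as the end of the field; the explicit backslash tracking is what prevents that.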
decode_max_tokens
At each generation step, decoding the full sequence from scratch gets expensive as output grows. Limiting decoding to the most recent decode_max_tokens tokens keeps inference practical. A value of 1024 is a reasonable default for most use cases.
_calc_banned_ngram_tokens
This is an internal Transformers utility that computes the list of banned tokens given a sequence and an ngram size. It's prefixed with _, meaning it's a private API — verify behavior when upgrading Transformers versions.
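For intuition (and as a fallback sketch if the private helper ever changes), the core computation can be reproduced in a few lines of plain Python. This is an illustrative reimplementation, not the Transformers code: collect every (n-1)-token prefix seen so far with the tokens that followed it, then ban the followers of the current prefix.

```python
def banned_ngram_tokens(ngram_size: int, token_ids: list[int]) -> list[int]:
    """Illustrative sketch: tokens that would complete an already-seen ngram."""
    if len(token_ids) + 1 < ngram_size:
        return []  # not enough context to form an ngram yet
    # Map each (n-1)-token prefix to the tokens that followed it
    followers: dict[tuple, list[int]] = {}
    for j in range(len(token_ids) - ngram_size + 1):
        prefix = tuple(token_ids[j : j + ngram_size - 1])
        followers.setdefault(prefix, []).append(token_ids[j + ngram_size - 1])
    # The current prefix is the last (n-1) generated tokens
    current_prefix = tuple(token_ids[len(token_ids) - ngram_size + 1 :])
    return followers.get(current_prefix, [])

print(banned_ngram_tokens(3, [1, 2, 3, 1, 2]))  # [3]: "1 2 3" already occurred
```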
5. Reproducing the Results
Installation
```shell
pip install transformers torch
```
Basic usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Build the logits processor
logits_processor = ContentNoRepeatGenerationMixin.build_content_no_repeat_logits_processor(
    tokenizer=tokenizer,
    content_no_repeat_ngram_size=4,  # Ban 4-gram repetitions
    content_no_repeat_decode_max_tokens=1024,
    content_no_repeat_debug=True,  # Enable during development
)

# Run inference
inputs = tokenizer("Generate a JSON response:", return_tensors="pt")
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "do_sample": True,
}
output_ids = model.generate(
    **inputs,
    **generation_config,
    logits_processor=logits_processor,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
Choosing ngram_size
| ngram_size | Behavior |
|---|---|
| 2 | Bans any 2-token repetition. Very aggressive — often too tight |
| 3 | Good balance. Start here if you're unsure |
| 4 | Looser. Only bans longer repeated phrases |
| 5+ | Minimal effect in practice |
Start with ngram_size=3 or 4, enable debug mode, and observe which tokens are being banned. Adjust based on what you see.
Enable debug=True during development. The processor logs each step, showing whether you're inside the Content field and which tokens are being suppressed. This makes tuning ngram_size significantly faster.
6. Observed Results and Caveats
What improved
- Repetition loops inside the Content field were eliminated. The most common pattern — the same sentence repeated 3–5 times — disappeared entirely.
- JSON structure keys were unaffected. Since the processor only activates inside the open Content field, tokens like `"Title":` and `"Content":` are generated normally.
- The debug output provided clear visibility into processor behavior, making it easy to confirm it was working as intended.
Caveats to keep in mind
- Per-step decode cost is real. Every generation step decodes a window of tokens to check whether we're inside the Content field. With `decode_max_tokens=1024`, this is fast enough in practice, but large batch sizes amplify the cost.
- `_calc_banned_ngram_tokens` is a private API. It could change between Transformers releases. Add a version pin or a smoke test to your CI if this goes to production.
- This implementation assumes a single JSON structure with one `Content` field. If your output schema has multiple instances of the same key, `rfind` will locate the most recently opened one — which is usually the correct behavior but worth verifying.
Do not set decode_max_tokens too low. If the Content field start marker falls outside the decode window, the processor will fail to detect it and the repetition control will silently stop working.
7. FAQ
Q: Does this work with models other than Qwen?
Yes. The LogitsProcessor API is model-architecture-agnostic. Any model that supports model.generate() in Transformers — including Llama, Mistral, Phi, and others — will work with this code. Watch for tokenizer differences, particularly around special token handling.
Q: What if I want to apply this to a field other than Content?
Change the marker string in _extract_open_content_text from '"Content":"' to whatever field name you're targeting. To support multiple fields simultaneously, extend the class to accept a list of markers and activate on any open match.
Q: Can I use this with vLLM or TGI (Text Generation Inference)?
vLLM supports custom LogitsProcessors in some versions via a Transformers-compatible interface — worth checking for your specific version. TGI has limited support for custom processors as of this writing; verify before committing to it in a production stack.
Q: Can I combine this with repetition_penalty?
Yes, they're complementary. A mild repetition_penalty (around 1.1) applied globally combined with this processor for field-scoped ngram control can work well. Just avoid stacking both aggressively — the double constraint can over-suppress and degrade output quality. When combining, lean toward a larger ngram_size (4–5) to give the penalty room to work.
Q: Wouldn't fixing the training data solve this without inference-time hacks?
Ideally, yes. In practice, data cleaning alone rarely eliminates this completely, especially with small fine-tuning datasets. Inference-time control and training data quality are complementary. Work on both in parallel rather than treating one as a substitute for the other.
Q: Won't a large ngram_size remove all side effects?
As ngram_size increases, side effects decrease — but so does the suppression effect. Above 6, the processor has almost no impact on repetitive outputs. Use debug mode to observe which specific repeated patterns are appearing, then find the minimum effective ngram_size that suppresses them.
This article documents first-hand results from applying this processor to a fine-tuned Qwen model for JSON generation. Behavior may vary depending on your base model, fine-tuning dataset, and generation parameters — use debug mode and empirically validate in your specific environment.
Conclusion
Repetition hallucinations in fine-tuned LLMs are a well-known pain point, and the standard solutions (repetition_penalty, no_repeat_ngram_size) apply blunt global constraints that often cause as many problems as they solve. The custom LogitsProcessor approach described here solves the problem at the right scope: control is applied exactly where the issue occurs, and nowhere else.
The broader pattern — decode the current position in a structured output and apply conditional constraints — is reusable. If your model generates structured data, templates, or any output with a defined schema, this architecture gives you a reliable way to add fine-grained generation control without retraining.