Abstract
Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization over a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations.
Vision–language hallucination failure modes

Hallucinations in LVLMs increasingly arise from conflicts between language priors and visual information, rather than from perceptual limitations alone. However, existing evaluation benchmarks, including POPE, CHAIR, SHR, and MMHAL-Bench, do not distinguish between hallucinations originating from perception failures, learned object co-occurrence priors, or presuppositions introduced by the instruction itself.
We introduce HalluScope, a benchmark designed to disentangle three distinct causes of hallucination:

- Perception Failures: Can the model correctly see what is in the image?
- Co-occurrence Priors: Does the model hallucinate statistically likely but absent objects?
- Instruction Presuppositions: Does the model follow false assumptions introduced by the prompt?
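A benchmark item targeting one of these failure modes could be represented as follows. This is a minimal sketch under stated assumptions: the field names, category labels, and example probe are illustrative, not HalluScope's released schema.

```python
from dataclasses import dataclass

# Hypothetical labels mirroring the three failure modes above.
CATEGORIES = {"perception_failure", "cooccurrence_prior", "instruction_presupposition"}

@dataclass
class HalluScopeItem:
    """One benchmark probe (illustrative schema only)."""
    image_id: str           # image the question is grounded in
    question: str           # prompt shown to the LVLM
    category: str           # which failure mode this probe targets
    target_object: str      # object the probe asks about
    present_in_image: bool  # ground-truth visibility of target_object

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

# Example co-occurrence probe: "fork" frequently co-occurs with "plate"
# in training data, but is absent from this (hypothetical) image.
item = HalluScopeItem(
    image_id="img_0421",
    question="Is there a fork next to the plate?",
    category="cooccurrence_prior",
    target_object="fork",
    present_in_image=False,
)
```

Under this layout, a model that answers "yes" to the probe above hallucinates from a co-occurrence prior rather than a perception failure, since the ground truth marks the object as absent.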
Figure: Overview of the HalluScope benchmark construction pipeline.
Using HalluScope, we show that hallucinations in modern LVLMs predominantly arise from over-reliance on textual instruction presuppositions and learned semantic priors rather than limitations of visual perception, revealing a shift in failure modes as visual backbones improve.
Mitigating hallucinations with HalluVL-DPO
To mitigate hallucinations, particularly those driven by over-reliance on textual instruction presuppositions, we propose HalluVL-DPO, a fine-tuning framework based on a sample-informativeness weighted variant of Direct Preference Optimization (DPO). We construct a dedicated training dataset in which each prompt is paired with a preferred (visually grounded) response and a rejected (hallucinated) one, providing explicit supervision that steers the model toward more grounded outputs.
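The standard DPO objective scores each preference pair by the difference in policy-versus-reference log-ratios between the preferred and rejected responses; a sample-weighted variant scales each pair's loss by a per-sample scalar. The sketch below, in pure Python, shows the per-pair computation. The `weight` argument is a stand-in for the sample-informativeness weight; how that weight is computed is not specified in this section, so treat it as an assumption.

```python
import math

def weighted_dpo_loss(logp_pol_w, logp_pol_l, logp_ref_w, logp_ref_l,
                      weight=1.0, beta=0.1):
    """Per-pair DPO loss scaled by a (hypothetical) informativeness weight.

    logp_*_w / logp_*_l: sequence log-probabilities of the preferred
    (grounded) and rejected (hallucinated) responses under the trainable
    policy and the frozen reference model, respectively.
    """
    # Implicit reward margin: how much more the policy prefers the
    # grounded response than the reference model does.
    margin = (logp_pol_w - logp_ref_w) - (logp_pol_l - logp_ref_l)
    # Standard DPO term -log(sigmoid(beta * margin)), scaled per sample.
    return -weight * math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A pair the policy already ranks correctly yields a small loss ...
low = weighted_dpo_loss(-10.0, -20.0, -12.0, -15.0)
# ... while a mis-ranked pair yields a larger one.
high = weighted_dpo_loss(-20.0, -10.0, -15.0, -12.0)
```

In practice these per-pair terms would be averaged over a minibatch, with the log-probabilities computed by summing token log-probs over each response; the pure-Python form above is only meant to make the objective concrete.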