Online Self-Calibration Against Hallucination in Vision-Language Models
Abstract
An online self-calibration framework that combines Monte Carlo tree search with a dual-granularity reward mechanism reduces hallucination in vision-language models, improving accuracy through preference data construction and direct preference optimization.
Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs: models are more accurate at discriminative verification than at open-ended generation. Leveraging this discriminative capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refine the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
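The abstract describes OSCAR only at a high level. As a concrete anchor, the sketch below shows the standard Direct Preference Optimization loss that such a framework would apply to its self-constructed preference pairs. It is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the beta value, and the toy log-probabilities are made up for demonstration.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed token log-probabilities of the
    preferred ("chosen") and dispreferred ("rejected") responses under
    the current policy and a frozen reference model."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response by a margin scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy example with made-up log-probabilities for two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -9.8]),
    policy_rejected_logps=torch.tensor([-11.9, -10.4]),
    ref_chosen_logps=torch.tensor([-12.5, -10.0]),
    ref_rejected_logps=torch.tensor([-11.8, -10.1]),
)
print(loss.item())

In OSCAR's setting, the chosen and rejected responses would come from the model's own candidates, scored by the MCTS search and the dual-granularity reward rather than by an external annotator.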