arxiv:2601.04932

GenProve: Learning to Generate Text with Fine-Grained Provenance

Published on Apr 12

Authors:

Abstract

Large language models often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.

Community

rossssssi

3 days ago

This paper presents a timely and impactful contribution to trustworthy LLM generation by introducing Generation-time Fine-grained Provenance—a novel paradigm requiring models to simultaneously produce fluent answers and structured, sentence-level provenance triples (DocID, SentID, Relation). The authors construct ReFInE, the first expert-annotated QA dataset with dense, relation-aware supervision (Quotation/Compression/Inference), and propose GenProve, a two-stage framework combining SFT with GRPO-based reinforcement learning and a composite reward that jointly optimizes answer fidelity and provenance correctness. Extensive experiments show GenProve (based on Qwen3-8B) consistently outperforms 14 strong open- and closed-source LLMs—including GPT-5 and Gemini 2.5 Pro—across answer quality, provenance F1, and LLM-as-judge metrics. Notably, the diagnostic analysis revealing a significant reasoning gap (models excel at Quotation but struggle with Inference-based provenance) offers valuable insight for future research on verifiable reasoning. With code and data publicly released, this work sets a new standard for transparent, self-auditing text generation and provides a solid foundation for advancing accountable AI. Highly recommended for researchers working on RAG, factuality, and interpretable generation.
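
To make the (DocID, SentID, Relation) triple format and the provenance F1 metric mentioned above more concrete, here is a minimal Python sketch. It is not taken from the paper or its released code: the names `Relation`, `ProvenanceTriple`, and `provenance_f1` are hypothetical, and the scoring rule (exact-match set F1 over triples for one answer sentence) is an assumption that may differ from the authors' actual metric.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # The three relation types named in the abstract.
    QUOTATION = "Quotation"
    COMPRESSION = "Compression"
    INFERENCE = "Inference"


@dataclass(frozen=True)
class ProvenanceTriple:
    # Hypothetical container for one (DocID, SentID, Relation) triple.
    doc_id: str
    sent_id: int
    relation: Relation


def provenance_f1(predicted, gold):
    """Exact-match set F1 between predicted and gold triples.

    Assumption: a predicted triple counts as correct only if document,
    sentence, and relation all match a gold triple.
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing to attribute and nothing attributed
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the model finds both source sentences but labels
# the second one Compression where the gold label is Inference.
gold = [
    ProvenanceTriple("doc1", 3, Relation.QUOTATION),
    ProvenanceTriple("doc2", 7, Relation.INFERENCE),
]
pred = [
    ProvenanceTriple("doc1", 3, Relation.QUOTATION),
    ProvenanceTriple("doc2", 7, Relation.COMPRESSION),
]
print(provenance_f1(pred, gold))
```

Under this strict matching rule, a correct source sentence with the wrong relation label earns no credit, which is one simple way a gap between Quotation and Inference provenance could show up in a score.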