arxiv:2512.23711

CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Published on Nov 26, 2025

Authors:

Abstract

The CAT framework evaluates the relationship between accuracy and response consistency in large language models using consistency-accuracy relation curves and a global CORE metric for comprehensive model assessment.

AI-generated summary

We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stake, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also need to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2512.23711 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2512.23711 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2512.23711 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.