---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---

# RQA — Reasoning Quality Analyzer (R2)

**RQA-R2** is a **judge model** for reasoning-quality evaluation.
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.

> RQA is a judge, not a teacher and not a generator.

---

## What Is New in R2 Compared to R1

R2 is not just a retrain of R1. It is a full methodological upgrade.

### Core differences

- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
  - `has_issue`
  - `is_hidden`
  - `error_types`

### Key improvements in R2

- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts

In short:

> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.

---

## What Problem RQA-R2 Solves

Texts written by humans or LLMs can:

- sound coherent
- use correct vocabulary
- appear persuasive

...while still containing **reasoning problems** that are:

- subtle
- structural
- hidden in argumentation

RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.

---

## Model Overview

| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |

---

## What the Model Predicts

RQA-R2 predicts three connected outputs.

### 1. Logical Issue Detection

- `has_logical_issue ∈ {false, true}`
- calibrated probability available

### 2. Hidden Problem Detection

- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists

### 3. Explicit Error Type Classification

If the text is classified as `explicit`, the model may assign one or more of the following error types:

- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`

This is a **multi-label** prediction head.

---

## Ontology

R2 uses a strict three-class reasoning ontology.

### `logical`

- no reasoning issue
- no hidden problem
- no explicit errors

### `hidden`

- a reasoning problem exists
- no explicitly labeled fallacy
- the issue is structural, implicit, or argumentative

### `explicit`

- a reasoning problem exists
- at least one explicit error type is present

This ontology is enforced in both training and inference.

---

## Inference Contract

RQA-R2 uses gated inference:

- if `has_issue = false` → class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` → class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` → class is `explicit`, explicit errors may be returned

R2 also supports:

- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
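
The gating rules and the `uncertain` path can be sketched as a small post-processing step. This is an illustrative sketch, not the released inference code: the helper name `decide`, the default thresholds, and the borderline `margin` are all assumptions.

```python
def decide(p_issue, p_hidden, error_probs, error_thresholds,
           issue_threshold=0.5, hidden_threshold=0.5, margin=0.05):
    """Map calibrated head probabilities to the logical/hidden/explicit contract."""
    # A probability too close to the issue threshold is treated as borderline.
    if abs(p_issue - issue_threshold) < margin:
        return {"class": None, "status": "uncertain",
                "review_required": True, "errors": []}
    if p_issue < issue_threshold:
        return {"class": "logical", "status": "ok",
                "review_required": False, "errors": []}
    if p_hidden >= hidden_threshold:
        return {"class": "hidden", "status": "ok",
                "review_required": False, "errors": []}
    # Explicit: return only error types that clear their per-class threshold.
    errors = [{"type": t, "probability": p}
              for t, p in error_probs.items() if p >= error_thresholds[t]]
    return {"class": "explicit", "status": "ok",
            "review_required": False, "errors": errors}


# Example: a confident explicit case with one error type over its threshold.
verdict = decide(
    p_issue=0.9993, p_hidden=0.021,
    error_probs={"missing_premise": 0.923, "contradiction": 0.10},
    error_thresholds={"missing_premise": 0.54, "contradiction": 0.50},
)
print(verdict["class"], [e["type"] for e in verdict["errors"]])
# explicit ['missing_premise']
```

In practice the thresholds and margin would come from the calibration artifacts described in the Training and Calibration section.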

---

## Architecture

RQA-R2 is built on top of **XLM-RoBERTa Large** with:

- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training

Training is hierarchical:

- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
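
The mask-aware mean pooling step can be illustrated in isolation with NumPy. This is a minimal sketch of the pooling only, with assumed shapes and names; the real model applies it to XLM-RoBERTa hidden states before the three heads.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling: average token vectors, ignoring padding."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)   # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1e-9)                    # avoid div by zero
    return summed / counts

# A padded position must not shift the pooled vector:
h = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])  # (batch=1, seq=3, dim=2)
m = np.array([[1, 1, 0]])                                # last token is padding
print(mean_pool(h, m))                                   # [[2. 3.]]
```

Each of the three heads then consumes this single pooled vector through its own projection and dropout.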

---

## Training and Calibration

R2 uses an honest experimental structure:

- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation

Calibration includes:

- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
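
The temperature part of this calibration can be sketched as a one-parameter grid search per head. This is a generic illustration of temperature scaling for a binary head, not the project's actual calibration code; `binary_nll`, `fit_temperature`, and the grid range are assumptions.

```python
import math

def binary_nll(logits, labels, T):
    """Mean negative log-likelihood of sigmoid(logit / T) against 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid-search the temperature that minimizes NLL on the calib split."""
    grid = [0.5 + 0.05 * i for i in range(51)]  # T in [0.5, 3.0]
    return min(grid, key=lambda T: binary_nll(logits, labels, T))

# Overconfident logits (the last one is a confident mistake) calibrate to T > 1,
# which softens the probabilities without changing the decision boundary.
logits = [4.0, -4.0, 5.0, -6.0, 3.0, -3.0, 4.0]
labels = [1, 0, 1, 0, 1, 0, 0]
T = fit_temperature(logits, labels)
```

Per-class error thresholds can be tuned in the same spirit: sweep a grid per label on the calib split and keep the value that maximizes the chosen metric.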

---

## Held-Out Synthetic Benchmark

The following metrics were obtained on the current held-out synthetic test split used for R2:

- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`

These are strong results for the current data regime.

Important:

> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.

---

## Training Data

RQA-R2 was trained on a custom reasoning-quality dataset with:

- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases

Error-label counts:

- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`

Multi-label explicit cases: `293`

The current dataset is useful and already strong enough for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.

---

## Intended Use

### Recommended for

- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure

### Not intended for

- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication

---

## Output Example

```json
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {
      "type": "missing_premise",
      "probability": 0.923,
      "threshold": 0.54
    }
  ]
}
```
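
A downstream consumer can sanity-check such an output against the inference contract using only the standard library. This is a hedged sketch: `check_contract` is a hypothetical helper, and the field names simply follow the example above.

```python
import json

RAW = """
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {"type": "missing_premise", "probability": 0.923, "threshold": 0.54}
  ]
}
"""

def check_contract(result):
    """Return the class if the output respects the gating rules, else raise."""
    cls = result["class"]
    if cls == "logical":
        ok = not result["has_logical_issue"] and not result["errors"]
    elif cls == "hidden":
        ok = (result["has_logical_issue"] and result["is_hidden_problem"]
              and not result["errors"])
    elif cls == "explicit":
        ok = result["has_logical_issue"] and not result["is_hidden_problem"]
    else:
        ok = cls is None and result["status"] == "uncertain"
    if not ok:
        raise ValueError(f"output violates the inference contract for class {cls!r}")
    return cls

print(check_contract(json.loads(RAW)))  # explicit
```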

---

## Limitations

RQA-R2 still has important limits:

- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge

Also note:

- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, thresholds may need to be tightened further for the target deployment environment

---

## Recommended Next Step

The best next step after R2 is external validation on:

- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases

That is the correct way to turn a strong in-distribution result into a robust real-world system.

---

## Summary

RQA-R2 is a major upgrade over R1:

- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance

R1 proved the idea.
**R2 is the first version that fully validates it.**