---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---
# RQA — Reasoning Quality Analyzer (R2)
**RQA-R2** is a **judge model** for reasoning-quality evaluation.
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.
> RQA is a judge, not a teacher and not a generator.
---
## What Is New in R2 Compared to R1
R2 is not just a retrain of R1. It is a full methodological upgrade.
### Core differences
- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
- `has_issue`
- `is_hidden`
- `error_types`
### Key improvements in R2
- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts
In short:
> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.
---
## What Problem RQA-R2 Solves
Texts written by humans or LLMs can:
- sound coherent
- use correct vocabulary
- appear persuasive
...while still containing **reasoning problems** that are:
- subtle
- structural
- hidden in argumentation
RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.
---
## Model Overview
| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |
---
## What the Model Predicts
RQA-R2 predicts three connected outputs.
### 1. Logical Issue Detection
- `has_logical_issue ∈ {false, true}`
- calibrated probability available
### 2. Hidden Problem Detection
- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists
### 3. Explicit Error Type Classification
If the text is classified as `explicit`, the model may assign one or more of the following error types:
- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`
This is a **multi-label** prediction head.
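A multi-label head of this kind is typically decoded by applying a sigmoid per class and comparing each probability against its own calibrated threshold. The sketch below illustrates that decoding for the six error types above; the logits and threshold values are made up for illustration and are not the shipped calibration.

```python
# Illustrative multi-label decoding with per-class thresholds.
# Logit and threshold values here are placeholders, not real model outputs.
import math

ERROR_TYPES = [
    "false_causality", "unsupported_claim", "overgeneralization",
    "missing_premise", "contradiction", "circular_reasoning",
]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_errors(logits, thresholds):
    """Return every error type whose probability clears its own threshold."""
    out = []
    for name, logit, thr in zip(ERROR_TYPES, logits, thresholds):
        p = sigmoid(logit)
        if p >= thr:
            out.append({"type": name, "probability": round(p, 3), "threshold": thr})
    return out

# Several error types can fire on the same text (multi-label).
print(decode_errors(
    logits=[2.5, -1.0, 0.8, -3.0, -2.0, 1.2],
    thresholds=[0.5, 0.5, 0.6, 0.54, 0.5, 0.55],
))
```

Because each class has its own threshold, a rare class like `contradiction` can be tuned independently of a common one like `overgeneralization`.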
---
## Ontology
R2 uses a strict three-class reasoning ontology.
### `logical`
- no reasoning issue
- no hidden problem
- no explicit errors
### `hidden`
- reasoning problem exists
- no explicit labeled fallacy
- the issue is structural, implicit, or argumentative
### `explicit`
- reasoning problem exists
- at least one explicit error type is present
This ontology is enforced in both training and inference.
---
## Inference Contract
RQA-R2 uses gated inference:
- if `has_issue = false` -> class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` -> class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` -> class is `explicit`, explicit errors may be returned
R2 also supports:
- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
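The gating rules above can be sketched as a small decision function. This is a minimal illustration, not the shipped inference code; the thresholds and the `margin` used to trigger `uncertain` are hypothetical placeholder values.

```python
# Sketch of the gated inference contract: two calibrated head probabilities
# map to one of the three top-level classes, with an uncertainty band.
# Threshold and margin values are illustrative, not the real calibration.

def gate(p_issue, p_hidden, thr_issue=0.5, thr_hidden=0.5, margin=0.05):
    """Return (top_level_class, status) for one text.

    status is "uncertain" when a probability lands within `margin`
    of its decision threshold.
    """
    if abs(p_issue - thr_issue) < margin:
        return "uncertain", "uncertain"
    if p_issue < thr_issue:
        return "logical", "ok"       # no issue -> no hidden/error outputs
    if abs(p_hidden - thr_hidden) < margin:
        return "uncertain", "uncertain"
    if p_hidden >= thr_hidden:
        return "hidden", "ok"        # issue present, no explicit labels
    return "explicit", "ok"          # issue present, explicit errors allowed

print(gate(0.999, 0.021))  # -> ('explicit', 'ok')
print(gate(0.20, 0.90))    # -> ('logical', 'ok')
print(gate(0.52, 0.90))    # -> ('uncertain', 'uncertain')
```

Note that the hidden head is only consulted once the issue head has fired, which is exactly the hierarchical contract described above.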
---
## Architecture
RQA-R2 is built on top of **XLM-RoBERTa Large** with:
- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training
Training is hierarchical:
- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
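Mean pooling over the encoder output is the step that collapses the token sequence into the single vector the three heads consume. A minimal framework-free sketch, assuming the standard masked-mean formulation (padding tokens excluded via the attention mask):

```python
# Masked mean pooling over token embeddings, in pure Python for clarity.
# Real pipelines do this with tensor ops; shapes and values are illustrative.

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, counting only non-padding positions."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask_bit in zip(hidden_states, attention_mask):
        if mask_bit:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / max(count, 1) for t in total]

# Three real tokens and one padding token (mask bit 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(tokens, mask))  # -> [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` does not contaminate the average because its mask bit is zero; without the mask, pooled vectors would drift with sequence length.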
---
## Training and Calibration
R2 uses an honest experimental structure:
- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation
Calibration includes:
- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
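Temperature scaling, the calibration technique named above, fits a single scalar `T` per head on the `calib` split so that rescaled probabilities match observed label frequencies. The sketch below uses a simple grid search over `T` minimizing binary negative log-likelihood; the data is synthetic and the real pipeline's fitting procedure may differ.

```python
# Hedged sketch of temperature scaling on a calibration split:
# choose the T that minimizes NLL of held-out binary labels.
# Calibration data here is synthetic and purely illustrative.
import math

def nll(logits, labels, T):
    """Mean binary negative log-likelihood at temperature T."""
    loss = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss / len(logits)

def fit_temperature(logits, labels):
    """Grid-search T in (0, 5]; T > 1 softens overconfident logits."""
    grid = [0.05 * k for k in range(1, 101)]
    return min(grid, key=lambda T: nll(logits, labels, T))

# Overconfident logits with a couple of label disagreements:
# the fitted T comes out above 1, i.e. probabilities get softened.
calib_logits = [4.0, 3.5, -4.0, 3.8, -3.6, -4.2, 3.9, -3.7]
calib_labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(fit_temperature(calib_logits, calib_labels))
```

Thresholds for `has_issue`, `is_hidden`, and each error class are then tuned on the same calibration split, after the temperatures are fixed.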
---
## Held-Out Synthetic Benchmark
The following metrics were obtained on the current held-out synthetic test split used for R2:
- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`
These are strong results for the current data regime.
Important:
> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
---
## Training Data
RQA-R2 was trained on a custom reasoning-quality dataset with:
- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases
Error-label counts:
- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`
Multi-label explicit cases:
- `293`
The current dataset is sufficient for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.
---
## Intended Use
### Recommended for
- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure
### Not intended for
- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication
---
## Output Example
```json
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {
      "type": "missing_premise",
      "probability": 0.923,
      "threshold": 0.54
    }
  ]
}
```
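A downstream consumer might route texts on the `status` and `review_required` fields before trusting the class label. The routing policy below is a hypothetical example, not part of the model; only the JSON field names come from the output schema above.

```python
# Hypothetical downstream routing over the RQA-R2 output schema.
# Field names match the example output; the policy itself is made up.
import json

OUTPUT = """{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {"type": "missing_premise", "probability": 0.923, "threshold": 0.54}
  ]
}"""

def route(result: dict) -> str:
    """Decide what to do with one judged text."""
    if result["status"] == "uncertain" or result["review_required"]:
        return "human_review"          # borderline cases go to a person
    if not result["has_logical_issue"]:
        return "accept"                # class "logical": nothing to flag
    if result["is_hidden_problem"]:
        return "flag:hidden"           # issue present, no explicit labels
    return "flag:" + ",".join(e["type"] for e in result["errors"])

print(route(json.loads(OUTPUT)))  # -> flag:missing_premise
```

Keeping `uncertain` and `review_required` as first-class routing signals is what makes the judge safe to put in front of automated pipelines.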
---
## Limitations
RQA-R2 still has important limits:
- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge
Also note:
- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment
---
## Recommended Next Step
The best next step after R2 is external validation on:
- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases
That is the correct way to turn a strong in-distribution result into a robust real-world system.
---
## Summary
RQA-R2 is a major upgrade over R1:
- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance
R1 proved the idea.
**R2 is the first version that fully validates it.**