---
license: mit
base_model:
- FacebookAI/xlm-roberta-large
language:
- ru
tags:
- reasoning
- logical-analysis
- text-classification
- ai-safety
- evaluation
- judge-model
- argumentation
pipeline_tag: text-classification
---

# RQA — Reasoning Quality Analyzer (R2)

**RQA-R2** is a **judge model** for reasoning-quality evaluation.
It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.

> RQA is a judge, not a teacher and not a generator.

---

## What Is New in R2 Compared to R1

R2 is not just a retrain of R1. It is a full methodological upgrade.

### Core differences

- **R1** used a more limited 2-signal setup.
- **R2** uses a strict **3-head ontology**:
  - `has_issue`
  - `is_hidden`
  - `error_types`

### Key improvements in R2

- explicit hidden-problem modeling instead of weaker implicit logic
- strict `logical / hidden / explicit` inference contract
- honest `train / val / calib / test` split
- separate calibration split for temperatures and thresholds
- per-class thresholds for error types
- uncertainty-aware inference with `status=uncertain` and `review_required`
- duplicate and conflict-duplicate filtering in the loader
- truncation audit and richer evaluation reports
- better optimizer setup for transformer fine-tuning
- staged encoder fine-tuning with freeze/unfreeze
- stronger schema/version safety for inference artifacts

In short:

> **R1** was a strong prototype.
> **R2** is the first version that behaves like a full training + calibration + inference pipeline.

---

## What Problem RQA-R2 Solves

Texts written by humans or LLMs can:

- sound coherent
- use correct vocabulary
- appear persuasive

...while still containing **reasoning problems** that are:

- subtle
- structural
- hidden in argumentation

RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.

---

## Model Overview

| Property | Value |
|---|---|
| Model Type | Judge / Evaluator |
| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Pooling | Mean pooling |
| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
| Language | Russian |
| License | MIT |

---

## What the Model Predicts

RQA-R2 predicts three connected outputs.

### 1. Logical Issue Detection

- `has_logical_issue ∈ {false, true}`
- calibrated probability available

### 2. Hidden Problem Detection

- `is_hidden_problem ∈ {false, true}`
- evaluated only when a reasoning issue exists

### 3. Explicit Error Type Classification

If the text is classified as `explicit`, the model may assign one or more of the following error types:

- `false_causality`
- `unsupported_claim`
- `overgeneralization`
- `missing_premise`
- `contradiction`
- `circular_reasoning`

This is a **multi-label** prediction head.

---

## Ontology

R2 uses a strict three-class reasoning ontology.

### `logical`

- no reasoning issue
- no hidden problem
- no explicit errors

### `hidden`

- a reasoning problem exists
- no explicitly labeled fallacy
- the issue is structural, implicit, or argumentative

### `explicit`

- a reasoning problem exists
- at least one explicit error type is present

This ontology is enforced in both training and inference.

---

## Inference Contract

RQA-R2 uses gated inference:

- if `has_issue = false` → class is `logical`, no errors are returned
- if `has_issue = true` and `is_hidden = true` → class is `hidden`, no explicit errors are returned
- if `has_issue = true` and `is_hidden = false` → class is `explicit`, explicit errors may be returned

R2 also supports:

- calibrated thresholds
- `uncertain` mode
- `review_required` for borderline cases
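
The gating rules and the `uncertain` path can be sketched as a small post-processing step. This is an illustrative sketch, not the released inference code: the helper name `decide`, the default thresholds, and the borderline `margin` are all assumptions.

```python
def decide(p_issue, p_hidden, error_probs, error_thresholds,
           issue_threshold=0.5, hidden_threshold=0.5, margin=0.05):
    """Map calibrated head probabilities to the logical/hidden/explicit contract."""
    # A probability too close to the issue threshold is treated as borderline.
    if abs(p_issue - issue_threshold) < margin:
        return {"class": None, "status": "uncertain",
                "review_required": True, "errors": []}
    if p_issue < issue_threshold:
        return {"class": "logical", "status": "ok",
                "review_required": False, "errors": []}
    if p_hidden >= hidden_threshold:
        return {"class": "hidden", "status": "ok",
                "review_required": False, "errors": []}
    # Explicit: return only error types that clear their per-class threshold.
    errors = [{"type": t, "probability": p}
              for t, p in error_probs.items() if p >= error_thresholds[t]]
    return {"class": "explicit", "status": "ok",
            "review_required": False, "errors": errors}


# Example: a confident explicit case with one error type over its threshold.
verdict = decide(
    p_issue=0.9993, p_hidden=0.021,
    error_probs={"missing_premise": 0.923, "contradiction": 0.10},
    error_thresholds={"missing_premise": 0.54, "contradiction": 0.50},
)
print(verdict["class"], [e["type"] for e in verdict["errors"]])
# explicit ['missing_premise']
```

In practice the thresholds and margin would come from the calibration artifacts described in the Training and Calibration section.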

---

## Architecture

RQA-R2 is built on top of **XLM-RoBERTa Large** with:

- mean pooling
- separate projections per task
- separate dropout per head
- 3 task-specific heads
- uncertainty-weighted multi-task training

Training is hierarchical:

- `has_issue` is trained on all samples
- `is_hidden` is trained only on problem samples
- `error_types` are trained only on explicit samples
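
The mask-aware mean pooling step can be illustrated in isolation with NumPy. This is a minimal sketch of the pooling only, with assumed shapes and names; the real model applies it to XLM-RoBERTa hidden states before the three heads.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling: average token vectors, ignoring padding."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)   # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1e-9)                    # avoid div by zero
    return summed / counts

# A padded position must not shift the pooled vector:
h = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])  # (batch=1, seq=3, dim=2)
m = np.array([[1, 1, 0]])                                # last token is padding
print(mean_pool(h, m))                                   # [[2. 3.]]
```

Each of the three heads then consumes this single pooled vector through its own projection and dropout.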

---

## Training and Calibration

R2 uses an honest experimental structure:

- `train` for fitting
- `val` for model selection
- `calib` for temperature scaling and threshold tuning
- `test` for final held-out evaluation

Calibration includes:

- issue temperature
- hidden temperature
- per-class error temperatures
- threshold selection for `has_issue`
- threshold selection for `is_hidden`
- per-class thresholds for error types
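
The temperature part of this calibration can be sketched as a one-parameter grid search per head. This is a generic illustration of temperature scaling for a binary head, not the project's actual calibration code; `binary_nll`, `fit_temperature`, and the grid range are assumptions.

```python
import math

def binary_nll(logits, labels, T):
    """Mean negative log-likelihood of sigmoid(logit / T) against 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid-search the temperature that minimizes NLL on the calib split."""
    grid = [0.5 + 0.05 * i for i in range(51)]  # T in [0.5, 3.0]
    return min(grid, key=lambda T: binary_nll(logits, labels, T))

# Overconfident logits (the last one is a confident mistake) calibrate to T > 1,
# which softens the probabilities without changing the decision boundary.
logits = [4.0, -4.0, 5.0, -6.0, 3.0, -3.0, 4.0]
labels = [1, 0, 1, 0, 1, 0, 0]
T = fit_temperature(logits, labels)
```

Per-class error thresholds can be tuned in the same spirit: sweep a grid per label on the calib split and keep the value that maximizes the chosen metric.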

---

## Held-Out Synthetic Benchmark

The following metrics were obtained on the current held-out synthetic test split used for R2:

- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
- `Top-level class macro-F1 = 0.964`
- `Coverage = 95.6%`
- `Uncertain rate = 4.4%`

These are strong results for the current data regime.

Important:

> These metrics are measured on a held-out split from the current synthetic dataset.
> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.

---

## Training Data

RQA-R2 was trained on a custom reasoning-quality dataset with:

- `7292` total samples
- `3150` logical texts
- `4142` problematic texts
- `1242` hidden problems
- `2900` explicit cases

Error-label counts:

- `false_causality`: `518`
- `unsupported_claim`: `524`
- `overgeneralization`: `599`
- `missing_premise`: `537`
- `contradiction`: `475`
- `circular_reasoning`: `540`

Multi-label explicit cases: `293`

The current dataset is useful and already strong enough for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.

---

## Intended Use

### Recommended for

- reasoning-quality evaluation
- LLM output auditing
- AI safety pipelines
- judge/reranker pipelines
- pre-filtering for downstream review
- analytical tooling around argument structure

### Not intended for

- text generation
- explanation generation
- automatic rewriting or correction
- factual verification
- legal or scientific truth adjudication

---

## Output Example

```json
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {
      "type": "missing_premise",
      "probability": 0.923,
      "threshold": 0.54
    }
  ]
}
```
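
A downstream consumer can sanity-check such an output against the inference contract using only the standard library. This is a hedged sketch: `check_contract` is a hypothetical helper, and the field names simply follow the example above.

```python
import json

RAW = """
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {"type": "missing_premise", "probability": 0.923, "threshold": 0.54}
  ]
}
"""

def check_contract(result):
    """Return the class if the output respects the gating rules, else raise."""
    cls = result["class"]
    if cls == "logical":
        ok = not result["has_logical_issue"] and not result["errors"]
    elif cls == "hidden":
        ok = (result["has_logical_issue"] and result["is_hidden_problem"]
              and not result["errors"])
    elif cls == "explicit":
        ok = result["has_logical_issue"] and not result["is_hidden_problem"]
    else:
        ok = cls is None and result["status"] == "uncertain"
    if not ok:
        raise ValueError(f"output violates the inference contract for class {cls!r}")
    return cls

print(check_contract(json.loads(RAW)))  # explicit
```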

---

## Limitations

RQA-R2 still has important limits:

- it evaluates reasoning structure, not factual truth
- hidden problems remain partly subjective by nature
- the current benchmark is still synthetic and in-distribution
- real human-written texts and outputs from other LLMs may be harder
- the model should still be validated externally before being treated as a fully general reasoning judge

Also note:

- R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
- if strict false-positive control is critical, thresholds may need to be tightened further for the target deployment environment

---

## Recommended Next Step

The best next step after R2 is external validation on:

- human-written argumentative texts
- outputs from other LLM families
- paraphrased and adversarially reworded samples
- harder hidden-problem cases

That is the correct way to turn a strong in-distribution result into a robust real-world system.

---

## Summary

RQA-R2 is a major upgrade over R1:

- better ontology
- better training logic
- better calibration
- better inference safety
- stronger held-out synthetic performance

R1 proved the idea.
**R2 is the first version that fully validates it.**