<h1 align="center">Remedy-R: Generative Reasoning Models for MT Evaluation</h1>
<p align="center"><b>Reasoning-driven, reinforcement-trained metrics for machine translation evaluation</b></p>


---


## ✨ What is Remedy-R?


**Remedy-R** is a family of **reasoning-based MT evaluation models** trained with **reinforcement learning via verifiable rewards (RLVR)** on **pairwise human translation preferences**.


Instead of directly regressing a scalar score, Remedy-R:

- Generates **step-by-step analyses** of *accuracy*, *fluency*, and *completeness*.
- Outputs a **final numeric score in [0, 100]** that can be parsed and used like a standard metric (see the parsing sketch after this list).
- Is trained with **PPO + rule-based rewards** that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both **reference-based** and **reference-free (QE)** evaluation.
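
As a rough illustration of the second point, the final score is plain text and can be recovered with ordinary shell tools. This is only a sketch: it assumes the score is the last number in a saved analysis (`analysis.txt` is a hypothetical file), and the bundled `remedy-r-score` CLI already does this parsing for you.

```bash
# Hypothetical: pull the final [0, 100] score out of a saved Remedy-R analysis.
# The real output format is handled by the CLI; this only shows that the score
# is machine-readable plain text.
grep -oE '[0-9]+(\.[0-9]+)?' analysis.txt | tail -n 1
```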


On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:


- **Surpasses** strong LLM-as-judge methods.
- **Matches** top-performing scalar SOTA metrics.
- Remains **robust under OOD conditions** such as source copy, empty translations, wrong language, and mixed-language outputs.
- Enables **Test-Time Scaling (TTS)** via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers **Remedy-R Agent**, an evaluate-revise pipeline that improves translations for diverse base systems.


---


## 📖 Contents


- [✨ What is Remedy-R?](#-what-is-remedy-r)
- [📖 Contents](#-contents)
- [📦 Installation](#-installation)
  - [From PyPI (unavailable for now)](#from-pypi-unavailable-for-now)
  - [From source](#from-source)
- [⚙️ Requirements](#️-requirements)
- [🔧 Model Zoo](#-model-zoo)
- [🚀 Quickstart](#-quickstart)
  - [CLI: Local vLLM Inference](#cli-local-vllm-inference)
  - [Reference-Free / QE Mode](#reference-free--qe-mode)
  - [Test-Time Scaling (TTS)](#test-time-scaling-tts)
- [🌐 Optional: vLLM Online Serving](#-optional-vllm-online-serving)
- [📊 Outputs](#-outputs)
- [📝 Citation](#-citation)


---


## 📦 Installation


### From PyPI (unavailable for now)


```bash
pip install --upgrade pip
pip install remedy-r-mt-eval
```


Once published on PyPI, this will install the `remedy_r` package and the CLI entrypoint `remedy-r-score` (plus related tools). Until then, install from source as shown below.


### From source


```bash
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
```


---


## ⚙️ Requirements


Core runtime dependencies (see `pyproject.toml` for exact versions):


* Python ≥ 3.10 (tested mostly with 3.12)
* [PyTorch](https://pytorch.org/) with GPU support
* [vLLM](https://github.com/vllm-project/vllm) for efficient batched inference
* `transformers`, `numpy`, `pandas`, `tqdm`


You also need:


* At least **1 GPU (16–24 GB)** for 7B models
* More memory/GPUs for 14B/32B models or large batch sizes


---


## 🔧 Model Zoo


Remedy-R models are hosted on HuggingFace under `ShaomuTan/`:


| Model        | Size | Base model  | Mode     | Link                                                            |
| ------------ | ---- | ----------- | -------- | --------------------------------------------------------------- |
| Remedy-R-7B  | 7B   | Qwen2.5-7B  | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-7B)  |
| Remedy-R-14B | 14B  | Qwen2.5-14B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-14B) |
| Remedy-R-32B | 32B  | Qwen2.5-32B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-32B) |


You can cache them locally:


```bash
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B
```


Then point `--model` to either the **HF ID** or the **local path**.
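
For example (the trailing `...` stands for the remaining arguments shown in the Quickstart below):

```bash
# Both forms are equivalent; the local path just skips the Hub download.
remedy-r-score --model ShaomuTan/Remedy-R-14B ...
remedy-r-score --model Models/Remedy-R-14B ...
```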


---


## 🚀 Quickstart


### CLI: Local vLLM Inference


The main entrypoint is:


```bash
remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file "$MT_FILE" \
  --lp "$LP"
```


**Key arguments**


* `--model` : HF repo ID or local checkpoint
* `--src-file` : Source sentences (one per line)
* `--mt-file` : MT outputs (one per line)
* `--ref-file` : Reference translations (optional; enables ref-based mode)
* `--lp` : Language-pair code (e.g., `en-de`)
* `--output_dir` : Output folder
* `--temperature` : Generation temperature
* `--tp_size` : Tensor parallel size
* `--dp_size` : Data parallel size
* `--num-seqs` : Max parallel sequences per step
* `--max-tokens` : Maximum number of generated tokens
* `--gpu-memory-utilization` : vLLM memory ratio (e.g., 0.9)
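
The `$VARIABLES` in the command above are placeholders. For a first run you might set them to something like the following; all values are illustrative and should be tuned to your hardware and data:

```bash
MODEL_CHECKPOINT=ShaomuTan/Remedy-R-14B   # or a local path, e.g. Models/Remedy-R-14B
METRIC_NAME=Remedy-R-14B                  # name used when saving scores
DATA_DIR=./outputs                        # where results are written
MAX_TOKENS=2048                           # budget for the reasoning + final score
TP_SIZE=1                                 # tensor-parallel degree
DP_SIZE=1                                 # data-parallel degree
DEC_TEMPERATURE=0.0                       # greedy decoding for single-pass scoring
REPETITION_PENALTY=1.05
GPU_MEM_UTIL=0.9
MAX_MODEL_LEN=4096
SEED=42
SRC_FILE=./testcase/en.src
MT_FILE=./testcase/en-de.hyp
LP=en-de
```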


You can also call the CLI via Python:


```bash
python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...
```


---


### Reference-Free / QE Mode


If you don't have references, just drop `--ref-file` and add `--no-ref`:


```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models
```


The prompt automatically switches to **reference-free quality estimation** while keeping the same [0, 100] score scale.


---


### Test-Time Scaling (TTS)


Remedy-R supports **Test-Time Scaling** by averaging multiple independent evaluation passes with different seeds:


```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42
```


* `--TTS` : Enable multi-pass evaluation
* `--best-of-n` : Number of independent passes (e.g., 2–6)
* Scores are averaged; per-pass scores can optionally be logged.


TTS typically improves **segment-level pairwise accuracy** and stabilizes scores for difficult segments.


---


## 🌐 Optional: vLLM Online Serving


To avoid re-loading the model for every scoring run, you can:


1. **Start a local vLLM server** (OpenAI-compatible):


```bash
remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```


2. **Score via the server**:


```bash
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1
```


Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating `LLM()` in every process.
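
Since the server speaks the standard OpenAI-compatible API, you can also sanity-check it directly. The request below is only a placeholder: `remedy-r-score` constructs the actual evaluation prompts, and the served model name may differ depending on how `remedy-r-serve` registers it.

```bash
# List the model(s) the server is currently serving.
curl http://localhost:8000/v1/models

# Send a raw chat-completion request (placeholder prompt; the scoring CLI
# builds the real Remedy-R evaluation prompt for you).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ShaomuTan/Remedy-R-14B",
        "messages": [{"role": "user", "content": "<evaluation prompt here>"}],
        "temperature": 0.0
      }'
```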


---


## 📊 Outputs


For each language pair `SRC-TGT`, Remedy-R writes:


* `results.jsonl`
* `segment_scores.tsv`
* `system_score.txt`
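
A quick way to inspect these files from the shell (paths and the TSV column layout are assumptions; adapt them to your `--output_dir` / `--save-dir` and language pair):

```bash
OUT=./testcase/en-de   # hypothetical output folder for the en-de pair

cat "$OUT/system_score.txt"          # single system-level score
head -n 3 "$OUT/segment_scores.tsv"  # per-segment scores

# Recompute a system score by averaging a numeric column of the TSV
# (assumes the score is the last tab-separated field).
awk -F'\t' '{ sum += $NF; n++ } END { if (n) print sum / n }' "$OUT/segment_scores.tsv"
```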


---


## 📝 Citation


If you use Remedy-R or this codebase, please cite:


arXiv preprint coming soon...