<h1 align="center">Remedy-R: Generative Reasoning Models for MT Evaluation</h1>
<p align="center"><b>Reasoning-driven, reinforcement-trained metrics for machine translation evaluation</b></p>


---


## ✨ What is Remedy-R?


**Remedy-R** is a family of **reasoning-based MT evaluation models** trained with **reinforcement learning via verifiable rewards (RLVR)** on **pairwise human translation preferences**.


Instead of directly regressing a scalar score, Remedy-R:

- Generates **step-by-step analyses** of *accuracy*, *fluency*, and *completeness*.
- Outputs a **final numeric score in [0, 100]** that can be parsed and used like a standard metric (see the parsing sketch after this list).
- Is trained with **PPO + rule-based rewards** that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both **reference-based** and **reference-free (QE)** evaluation.
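
As a rough illustration of the second point, the final score is plain text and can be recovered with ordinary shell tools. This is only a sketch: it assumes the score is the last number in a saved analysis (`analysis.txt` is a hypothetical file), and the bundled `remedy-r-score` CLI already does this parsing for you.

```bash
# Hypothetical: pull the final [0, 100] score out of a saved Remedy-R analysis.
# The real output format is handled by the CLI; this only shows that the score
# is machine-readable plain text.
grep -oE '[0-9]+(\.[0-9]+)?' analysis.txt | tail -n 1
```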


On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:


- **Surpasses** strong LLM-as-judge methods.
- **Matches** top-performing scalar SOTA metrics.
- Remains **robust under OOD conditions** such as source copy, empty translations, wrong language, and mixed-language outputs.
- Enables **Test-Time Scaling (TTS)** via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers **Remedy-R Agent**, an evaluate-revise pipeline that improves translations for diverse base systems.


---


## 📖 Contents


- [✨ What is Remedy-R?](#-what-is-remedy-r)
- [📖 Contents](#-contents)
- [📦 Installation](#-installation)
  - [From PyPI (unavailable for now)](#from-pypi-unavailable-for-now)
  - [From source](#from-source)
- [⚙️ Requirements](#️-requirements)
- [🔧 Model Zoo](#-model-zoo)
- [🚀 Quickstart](#-quickstart)
  - [CLI: Local vLLM Inference](#cli-local-vllm-inference)
  - [Reference-Free / QE Mode](#reference-free--qe-mode)
  - [Test-Time Scaling (TTS)](#test-time-scaling-tts)
- [🌐 Optional: vLLM Online Serving](#-optional-vllm-online-serving)
- [📊 Outputs](#-outputs)
- [📝 Citation](#-citation)


---


## 📦 Installation


### From PyPI (unavailable for now)


```bash
pip install --upgrade pip
pip install remedy-r-mt-eval
```


Once published on PyPI, this will install the `remedy_r` package and the CLI entrypoint `remedy-r-score` (plus related tools). Until then, install from source as shown below.


### From source


```bash
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
```


---


## ⚙️ Requirements


Core runtime dependencies (see `pyproject.toml` for exact versions):


* Python ≥ 3.10 (tested mostly with 3.12)
* [PyTorch](https://pytorch.org/) with GPU support
* [vLLM](https://github.com/vllm-project/vllm) for efficient batched inference
* `transformers`, `numpy`, `pandas`, `tqdm`


You also need:


* At least **1 GPU (16–24 GB)** for 7B models
* More memory/GPUs for 14B/32B models or large batch sizes


---


## 🔧 Model Zoo


Remedy-R models are hosted on HuggingFace under `ShaomuTan/`:


| Model        | Size | Base model  | Mode     | Link                                                            |
| ------------ | ---- | ----------- | -------- | --------------------------------------------------------------- |
| Remedy-R-7B  | 7B   | Qwen2.5-7B  | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-7B)  |
| Remedy-R-14B | 14B  | Qwen2.5-14B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-14B) |
| Remedy-R-32B | 32B  | Qwen2.5-32B | Ref + QE | [🤗 HuggingFace](https://huggingface.co/ShaomuTan/Remedy-R-32B) |


You can cache them locally:


```bash
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B
```


Then point `--model` to either the **HF ID** or the **local path**.
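
For example (the trailing `...` stands for the remaining arguments shown in the Quickstart below):

```bash
# Both forms are equivalent; the local path just skips the Hub download.
remedy-r-score --model ShaomuTan/Remedy-R-14B ...
remedy-r-score --model Models/Remedy-R-14B ...
```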


---


## 🚀 Quickstart


### CLI: Local vLLM Inference


The main entrypoint is:


```bash
remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file "$MT_FILE" \
  --lp "$LP"
```


**Key arguments**


* `--model` : HF repo ID or local checkpoint
* `--src-file` : Source sentences (one per line)
* `--mt-file` : MT outputs (one per line)
* `--ref-file` : Reference translations (optional; enables ref-based mode)
* `--lp` : Language-pair code (e.g., `en-de`)
* `--output_dir` : Output folder
* `--temperature` : Generation temperature
* `--tp_size` : Tensor parallel size
* `--dp_size` : Data parallel size
* `--num-seqs` : Max parallel sequences per step
* `--max-tokens` : Maximum number of generated tokens
* `--gpu-memory-utilization` : vLLM memory ratio (e.g., 0.9)
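
The `$VARIABLES` in the command above are placeholders. For a first run you might set them to something like the following; all values are illustrative and should be tuned to your hardware and data:

```bash
MODEL_CHECKPOINT=ShaomuTan/Remedy-R-14B   # or a local path, e.g. Models/Remedy-R-14B
METRIC_NAME=Remedy-R-14B                  # name used when saving scores
DATA_DIR=./outputs                        # where results are written
MAX_TOKENS=2048                           # budget for the reasoning + final score
TP_SIZE=1                                 # tensor-parallel degree
DP_SIZE=1                                 # data-parallel degree
DEC_TEMPERATURE=0.0                       # greedy decoding for single-pass scoring
REPETITION_PENALTY=1.05
GPU_MEM_UTIL=0.9
MAX_MODEL_LEN=4096
SEED=42
SRC_FILE=./testcase/en.src
MT_FILE=./testcase/en-de.hyp
LP=en-de
```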


You can also call the CLI via Python:


```bash
python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...
```


---


### Reference-Free / QE Mode


If you don't have references, just drop `--ref-file` and add `--no-ref`:


```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models
```


The prompt automatically switches to **reference-free quality estimation** while keeping the same [0, 100] score scale.


---


### Test-Time Scaling (TTS)


Remedy-R supports **Test-Time Scaling** by averaging multiple independent evaluation passes with different seeds:


```bash
remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42
```


* `--TTS` : Enable multi-pass evaluation
* `--best-of-n` : Number of independent passes (e.g., 2–6)
* Scores are averaged; per-pass scores can optionally be logged.


TTS typically improves **segment-level pairwise accuracy** and stabilizes scores for difficult segments.


---


## 🌐 Optional: vLLM Online Serving


To avoid re-loading the model for every scoring run, you can:


1. **Start a local vLLM server** (OpenAI-compatible):


```bash
remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```


2. **Score via the server**:


```bash
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1
```


Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating `LLM()` in every process.
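
Since the server speaks the standard OpenAI-compatible API, you can also sanity-check it directly. The request below is only a placeholder: `remedy-r-score` constructs the actual evaluation prompts, and the served model name may differ depending on how `remedy-r-serve` registers it.

```bash
# List the model(s) the server is currently serving.
curl http://localhost:8000/v1/models

# Send a raw chat-completion request (placeholder prompt; the scoring CLI
# builds the real Remedy-R evaluation prompt for you).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ShaomuTan/Remedy-R-14B",
        "messages": [{"role": "user", "content": "<evaluation prompt here>"}],
        "temperature": 0.0
      }'
```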


---


## 📊 Outputs


For each language pair `SRC-TGT`, Remedy-R writes:


* `results.jsonl`
* `segment_scores.tsv`
* `system_score.txt`
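
A quick way to inspect these files from the shell (paths and the TSV column layout are assumptions; adapt them to your `--output_dir` / `--save-dir` and language pair):

```bash
OUT=./testcase/en-de   # hypothetical output folder for the en-de pair

cat "$OUT/system_score.txt"          # single system-level score
head -n 3 "$OUT/segment_scores.tsv"  # per-segment scores

# Recompute a system score by averaging a numeric column of the TSV
# (assumes the score is the last tab-separated field).
awk -F'\t' '{ sum += $NF; n++ } END { if (n) print sum / n }' "$OUT/segment_scores.tsv"
```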


---


## 📝 Citation


If you use Remedy-R or this codebase, please cite:


arXiv preprint coming soon...