---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- reward-model
- image-editing
- FIRM
- llama-factory
- generated_from_trainer
model-index:
- name: FIRM-Edit-8B
  results: []
---

# FIRM-Edit-8B

[**Project Page**](https://firm-reward.github.io/) | [**Paper**](https://arxiv.org/abs/2603.12247) | [**GitHub**](https://github.com/VisionXLab/FIRM-Reward)

**FIRM-Edit-8B** is a robust reward model (critic) for faithful image editing. It is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) on the **FIRM-Edit-370K** dataset. The model is part of the **FIRM (Faithful Image Reward Modeling)** framework, which provides accurate and reliable guidance for visual reinforcement learning pipelines.

## Model Description

Conventional reward models for image editing often hallucinate and assign noisy scores, misguiding the optimization process. FIRM-Edit-8B addresses these issues by evaluating edits along two competing objectives:
1. **Execution**: adherence to the editing instruction.
2. **Consistency**: preservation of the original content in unedited regions.

Through its "Consistency-Modulated Execution" (CME) reward strategy, the model acts as a stable critic that mitigates hallucinations and helps establish a new standard for fidelity in image editing.

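The interplay of the two objectives can be illustrated with a toy scoring function. This is only a sketch of the general principle, not the formula FIRM actually uses: the function name `cme_reward`, the gating form, and the threshold `tau` are all assumptions for illustration.

```python
def cme_reward(execution: float, consistency: float, tau: float = 0.8) -> float:
    """Toy consistency-modulated execution score (illustrative only).

    `execution` and `consistency` are assumed to be normalized to [0, 1].
    The gate suppresses the execution reward when too much of the
    unedited region has changed, so an edit cannot score highly simply
    by overwriting content it was not asked to touch.
    """
    gate = min(1.0, consistency / tau)  # full credit only above the threshold
    return execution * gate

# An instruction-following edit that also preserves the scene scores well...
print(cme_reward(execution=0.9, consistency=0.95))  # 0.9
# ...while one that rewrote unrelated regions is penalized.
print(cme_reward(execution=0.9, consistency=0.4))   # 0.45
```

The multiplicative gate captures why the two objectives are described as competing: a high execution score alone is not enough, since low consistency scales it down.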
## Intended Uses & Limitations

- **Reward modeling**: to be used as a reward signal in Reinforcement Learning (RL) pipelines for image editing.
- **Evaluation**: to serve as a metric for benchmarking the performance of image editing models.

## Training procedure

The model was fine-tuned using the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 10
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0

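For reference, these hyperparameters would map onto a LLaMA Factory training config roughly as follows. This fragment is illustrative only: the dataset name, template, and output path are placeholders, and key names should be checked against the LLaMA Factory version actually used.

```yaml
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: firm_edit_370k        # placeholder: register the dataset locally
template: qwen3_vl             # placeholder: verify the template name
output_dir: saves/firm-edit-8b # placeholder
per_device_train_batch_size: 10
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
seed: 42
```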
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.591         | 0.2182 | 500  | 0.5827          |
| 0.5605        | 0.4364 | 1000 | 0.5460          |
| 0.5252        | 0.6546 | 1500 | 0.5199          |
| 0.5075        | 0.8728 | 2000 | 0.5055          |

## Usage

To use the model as a reward server for RL training, you can use the script provided in the official repository:

```bash
# Launch the reward server
python editing/reward_server/reward_server_qwen3_vl_8b_sft.py
```

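A client for such a reward server would typically package the editing instruction together with the source and edited images. The sketch below is an assumption for illustration: the field names and base64 encoding are hypothetical, so check `reward_server_qwen3_vl_8b_sft.py` in the FIRM-Reward repository for the actual request schema and endpoint.

```python
import base64
import json

def build_reward_request(instruction: str, source_png: bytes, edited_png: bytes) -> str:
    """Package an edit triplet as a JSON payload for a reward server.

    Field names here are hypothetical -- match them to the schema that
    reward_server_qwen3_vl_8b_sft.py actually expects.
    """
    return json.dumps({
        "instruction": instruction,
        "source_image": base64.b64encode(source_png).decode("ascii"),
        "edited_image": base64.b64encode(edited_png).decode("ascii"),
    })

payload = build_reward_request("turn the sky sunset orange",
                               b"<source image bytes>",
                               b"<edited image bytes>")
print(json.loads(payload)["instruction"])  # turn the sky sunset orange
```

Base64 encoding keeps the binary image data safe inside a JSON body; the server would decode the two images, score the edit against the instruction, and return the reward.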
## Citation

If you find this work useful, please cite:

```bibtex
@article{zhao2026trust,
  title={Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation},
  author={Zhao, Xiangyu and Zhang, Peiyuan and Lin, Junming and Liang, Tianhao and Duan, Yuchen and Ding, Shengyuan and Tian, Changyao and Zang, Yuhang and Yan, Junchi and Yang, Xue},
  journal={arXiv preprint arXiv:2603.12247},
  year={2026}
}
```