arxiv:2603.00483

RAISE: Requirement-Adaptive Self-Improving Evolution for Training-Free Text-to-Image Alignment

Published on Feb 28 · Submitted by Liyao Jiang on Mar 3

Abstract

RAISE is a training-free, requirement-driven evolutionary framework that adaptively improves text-to-image generation by dynamically allocating computational resources based on prompt complexity through iterative refinement actions.

AI-generated summary

Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall on GenEval) while using 30-40% fewer generated samples and 80% fewer VLM calls than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
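The adaptive loop the abstract describes can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `verify` and `refine` stand in for the VLM-based checklist verifier and the image refinement actions (prompt rewriting, noise resampling, instructional editing), and all names are made up for illustration. The key behaviors it mirrors are early stopping once the checklist is satisfied and spending further compute only on candidates with unmet requirements.

```python
import random

def verify(candidate, checklist):
    """Toy verifier: return the requirements the candidate fails.
    In RAISE this role is played by a VLM checking the image against
    a structured checklist derived from the prompt."""
    return [req for req in checklist if req not in candidate]

def refine(candidate, unmet, rng):
    """Toy refinement action: 'fix' one randomly chosen unmet requirement.
    Stands in for prompt rewriting / noise resampling / instructional editing."""
    return candidate | {rng.choice(unmet)}

def raise_loop(prompt_requirements, pop_size=3, max_rounds=10, seed=0):
    """Evolve a population of candidates until the checklist is satisfied
    or the round budget runs out. Easy prompts terminate in few rounds,
    hard ones use more: requirement-adaptive test-time compute."""
    rng = random.Random(seed)
    checklist = set(prompt_requirements)
    # Initial population: each candidate satisfies a random subset of requirements.
    population = [
        set(rng.sample(sorted(checklist), rng.randrange(len(checklist) + 1)))
        for _ in range(pop_size)
    ]
    for rounds in range(1, max_rounds + 1):
        scored = [(cand, verify(cand, checklist)) for cand in population]
        best, unmet = min(scored, key=lambda cu: len(cu[1]))
        if not unmet:
            return best, rounds  # all requirements met: stop early
        # Allocate further computation only to unsatisfied candidates.
        population = [refine(c, u, rng) if u else c for c, u in scored]
    return best, rounds
```

Running it on a small checklist converges to a candidate satisfying every requirement, with the round count serving as a proxy for how much compute the prompt demanded.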

Community

“RAISE: Requirement-Adaptive Self-Improving Evolution for Training-Free Text-to-Image Alignment” has been accepted to CVPR 2026! 🎉

📌 Question we address: Can precise prompt-image alignment be achieved purely at test time, without training on massive, carefully curated data or scaling up the model?

🚀 Introducing RAISE: an autonomous multi-agent framework that evolves and refines text-to-image generations at test time by automatically discovering user intent:

  • ✅ Multi-agent refinement: agents collaboratively discover requirements and iteratively verify and improve generations via a checklist
  • ✅ Requirement-adaptive compute: spend more effort only when the prompt is challenging, stop when requirements are met
  • ✅ Plug-and-play: works as an add-on that plugs directly into existing pipelines
  • ✅ Training-free: no extra training or data curation
  • ✅ Model-agnostic: compatible with different diffusion models / VLM backbones

📊 Results highlight: state-of-the-art prompt-image alignment on standard benchmarks (0.94 overall on GenEval), while using far fewer samples and VLM calls than fixed-budget or sequential reflection pipelines

🔗 Links

