Flash-WAM - RoboTwin (distilled)

Single-step distilled checkpoint for Flash-WAM: Modality-Aware Distillation for World Action Models, applied to LingBot-VA and evaluated on RoboTwin 2.0. Flash-WAM distills each modality with a consistency function matched to its noise regime (linear-gradient-scaling for the action stream, variance-preserving for the video stream). This compresses inference to a single step per modality, for up to a 23× speedup while preserving teacher-level task success.
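For readers unfamiliar with the terms, the generic textbook form of a variance-preserving noise process and of a consistency function can be written as below. This is standard notation only, not necessarily the parameterization used in the Flash-WAM paper; the symbols (\alpha_t, \sigma_t, f_\theta, t_min, T) are assumptions introduced here, and the linear-gradient-scaling regime used for the action stream is not covered.

```latex
% Variance-preserving forward process: signal and noise scales keep unit total variance
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \alpha_t^2 + \sigma_t^2 = 1

% Consistency function: maps any point on a noising trajectory back to its clean endpoint
f_\theta(x_t, t) \approx x_0, \qquad f_\theta(x_{t_{\min}}, t_{\min}) = x_{t_{\min}}

% Single-step generation: one network evaluation from the noisiest level T
\hat{x}_0 = f_\theta(x_T, T)
```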

This repository contains the complete model (distilled transformer + encoders):

| Component | Description |
| --- | --- |
| transformer/ | Distilled Flash-WAM student |
| vae/ | VAE (from the LingBot-VA teacher) |
| text_encoder/ | UMT5-XXL text encoder (from the teacher) |
| tokenizer/ | T5 tokenizer |
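A quick way to confirm that a local copy of the repository is complete is to check for the four sub-directories listed above. The following is a minimal sketch using only the standard library; the helper name `missing_components` is hypothetical and not part of the Flash-WAM codebase.

```python
from pathlib import Path

# Sub-directories listed in the component table of this model card.
EXPECTED_COMPONENTS = ("transformer", "vae", "text_encoder", "tokenizer")


def missing_components(checkpoint_dir):
    """Return the expected sub-directories that are absent from checkpoint_dir.

    Hypothetical helper for illustration; an empty list means all four
    components are present as directories.
    """
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]
```

Calling `missing_components("path/to/checkpoint")` on a complete download should return an empty list.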

Links

Usage

For environment setup and evaluation, follow the Flash-WAM repository and LingBot-VA. Point the inference server at this checkpoint directory.
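To obtain a local checkpoint directory to point the server at, one option is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the function name `download_checkpoint` and the default target directory are illustrative choices, and how the path is then passed to the inference server is defined by the Flash-WAM and LingBot-VA repositories, not here.

```python
# Repository id of this model card on the Hugging Face Hub.
REPO_ID = "NU-World-Model-Embodied-AI/FlashWAM-RoboTwin"


def download_checkpoint(local_dir="checkpoints/FlashWAM-RoboTwin"):
    """Download the full repository and return the local directory path.

    The default local_dir is an arbitrary example location (assumption).
    """
    # Imported lazily so this module can be imported without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example usage (requires network access):
#   checkpoint_dir = download_checkpoint()
#   print(checkpoint_dir)
```

The returned path is the directory that contains `transformer/`, `vae/`, `text_encoder/`, and `tokenizer/`.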

Citation

@misc{akbari2026flashwammodalityawaredistillationworld,
      title={Flash-WAM: Modality-Aware Distillation for World Action Models}, 
      author={Arman Akbari and Ci Zhang and Arash Akbari and Lin Zhao and Yixiao Chen and Weiwei Chen and Xuan Zhang and Geng Yuan and Yanzhi Wang},
      year={2026},
      eprint={2606.05254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.05254}, 
}

License: Apache-2.0.
