Flash-WAM — RoboTwin (distilled)

Single-step distilled checkpoint for Flash-WAM: Modality-Aware Distillation for World Action Models, applied to LingBot-VA and evaluated on RoboTwin 2.0. Flash-WAM distills each modality with a consistency function matched to its noise regime (linear-gradient-scaling for the action stream, variance-preserving for the video stream), compressing inference to a single step per modality for up to a 23× speedup while preserving teacher-level task success.

This repository contains the complete model (distilled transformer + encoders):

| Component | Description |
| --- | --- |
| `transformer/` | Distilled Flash-WAM student |
| `vae/` | VAE (from the LingBot-VA teacher) |
| `text_encoder/` | UMT5-XXL text encoder (from the teacher) |
| `tokenizer/` | T5 tokenizer |

Usage

For environment setup and evaluation, follow the instructions in the Flash-WAM repository and in LingBot-VA, then point the inference server at this checkpoint directory.
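Before pointing the inference server at the checkpoint, it can help to fetch the repository and confirm that the component folders listed in the table are present. The following is a minimal sketch, not part of the Flash-WAM tooling: the helper names `download_checkpoint` and `missing_components` are hypothetical, the repository id is the one this model card belongs to, and the download step assumes `huggingface_hub` is installed and the network is reachable.

```python
from pathlib import Path

# Repository id of this model card.
REPO_ID = "NU-World-Model-Embodied-AI/FlashWAM-RoboTwin"

# Component subfolders listed in the table in this card.
EXPECTED_COMPONENTS = ("transformer", "vae", "text_encoder", "tokenizer")


def download_checkpoint(local_dir: str) -> str:
    """Download the full repository snapshot into local_dir (hypothetical helper)."""
    # Imported lazily so the layout check below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def missing_components(checkpoint_dir: str) -> list[str]:
    """Return the expected component subfolders absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]
```

A typical use would be to call `download_checkpoint("./FlashWAM-RoboTwin")`, pass the returned path to `missing_components`, and only start the inference server once the returned list is empty. How the server accepts the checkpoint path is defined by the Flash-WAM and LingBot-VA instructions, not by this sketch.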

Citation

@misc{akbari2026flashwammodalityawaredistillationworld,
      title={Flash-WAM: Modality-Aware Distillation for World Action Models}, 
      author={Arman Akbari and Ci Zhang and Arash Akbari and Lin Zhao and Yixiao Chen and Weiwei Chen and Xuan Zhang and Geng Yuan and Yanzhi Wang},
      year={2026},
      eprint={2606.05254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.05254}, 
}

License: Apache-2.0.

Model size: 5B params. Tensor type: BF16 (Safetensors).
