ACWM-Phys Checkpoints

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yongxin Chen
Georgia Institute of Technology

[Project Page] · [Paper] · [Dataset] · [Code]

Overview

This repository contains pretrained checkpoints for ACWM-DiT, a latent diffusion transformer trained with flow matching on the ACWM-Phys benchmark. All released checkpoints use the DiT-S configuration (~200M parameters) and were trained for 100k steps.

Released Checkpoints

Environment	Category	Action Dim	Resolution	Checkpoint
Push Cube	Rigid-Body	2	240×240	`VideoDiT_S_push_cube_240x240/latest.pt`
Stack Cube	Rigid-Body	7	240×240	`VideoDiT_S_stack_cube_240x240/latest.pt`
Push Rope	Deformable	2	240×240	`VideoDiT_S_push_rope_240x240/latest.pt`
Cloth Move	Deformable	3	240×240	`VideoDiT_S_clothmove_240x240_240x240/latest.pt`
Push Sand	Particle	7	240×400	`VideoDiT_S_push_sand_240x400/latest.pt`
Pour Water	Particle	4	240×240	`VideoDiT_S_pour_water_240x240/latest.pt`
Robot Arm	Kinematics	7	240×240	`VideoDiT_S_robot_arm_240x240/latest.pt`
Reacher	Kinematics	2	240×240	`VideoDiT_S_reacher_240x240/latest.pt`

The Wan 2.1 VAE weights (`Wan2.1_VAE.pth`, 508 MB) are also included. They are required for encoding and decoding video latents.
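For scripting, the table above can be kept as a small lookup. The following is a minimal sketch, not part of the released code: the names `CHECKPOINTS`, `checkpoint_path`, and `missing_files` are hypothetical helpers, and the sketch assumes the checkpoint folders and `Wan2.1_VAE.pth` sit directly under the download directory.

```python
from pathlib import Path

# Rows copied from the table above:
# environment -> (action dim, resolution as listed, checkpoint path).
CHECKPOINTS = {
    "Push Cube": (2, "240x240", "VideoDiT_S_push_cube_240x240/latest.pt"),
    "Stack Cube": (7, "240x240", "VideoDiT_S_stack_cube_240x240/latest.pt"),
    "Push Rope": (2, "240x240", "VideoDiT_S_push_rope_240x240/latest.pt"),
    "Cloth Move": (3, "240x240", "VideoDiT_S_clothmove_240x240_240x240/latest.pt"),
    "Push Sand": (7, "240x400", "VideoDiT_S_push_sand_240x400/latest.pt"),
    "Pour Water": (4, "240x240", "VideoDiT_S_pour_water_240x240/latest.pt"),
    "Robot Arm": (7, "240x240", "VideoDiT_S_robot_arm_240x240/latest.pt"),
    "Reacher": (2, "240x240", "VideoDiT_S_reacher_240x240/latest.pt"),
}

VAE_FILE = "Wan2.1_VAE.pth"


def checkpoint_path(name, root="./checkpoints"):
    """Return the expected local path of a checkpoint under the download directory."""
    _, _, relative = CHECKPOINTS[name]
    return Path(root) / relative


def missing_files(root="./checkpoints"):
    """List expected files (all checkpoints plus the VAE) that are absent under root."""
    expected = [relative for _, _, relative in CHECKPOINTS.values()] + [VAE_FILE]
    return [relative for relative in expected if not (Path(root) / relative).is_file()]


if __name__ == "__main__":
    print(checkpoint_path("Push Sand"))
    print("missing:", missing_files())
```

Running `missing_files()` after a download gives a quick check that every checkpoint and the VAE weights arrived.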

Download

huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints
export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth
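To fetch only one environment's checkpoint plus the VAE instead of the whole repository, the same download can be done from Python with `huggingface_hub.snapshot_download` and its `allow_patterns` filter. This is a sketch under assumptions: the helper names `allow_patterns_for` and `download_subset` are hypothetical, and the repository layout (checkpoint folders and `Wan2.1_VAE.pth` at the top level) is inferred from the paths listed above.

```python
def allow_patterns_for(checkpoint_dirs, include_vae=True):
    """Build glob patterns selecting the given checkpoint folders (and the VAE)."""
    patterns = [f"{name.rstrip('/')}/*" for name in checkpoint_dirs]
    if include_vae:
        patterns.append("Wan2.1_VAE.pth")
    return patterns


def download_subset(checkpoint_dirs, local_dir="./checkpoints"):
    """Download only the selected folders from the checkpoint repository."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="t1an/ACWM-Phys-checkpoints",
        local_dir=local_dir,
        allow_patterns=allow_patterns_for(checkpoint_dirs),
    )


if __name__ == "__main__":
    # Requires network access and `pip install huggingface_hub`.
    print(download_subset(["VideoDiT_S_push_cube_240x240"]))
```

`WAN_VAE_PATH` still needs to point at the downloaded `Wan2.1_VAE.pth`, as in the shell commands above.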

Usage

See the ACWM-Phys code repository for full evaluation and training instructions.

Quick evaluation:

python eval.py --env push_cube --steps 50 --split both --save_videos

Model Architecture

ACWM-DiT takes the first video frame and the full action sequence as input and predicts the complete future trajectory. It has three components:

Causal VAE (Wan 2.1): encodes video into 16-channel latent tokens at H/8 × W/8 spatial resolution with 4× temporal compression
DiT with flow matching: denoises the full latent trajectory
Action conditioning: injected via AdaLN (default) or cross-attention
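The compression factors listed for the VAE translate into latent shapes with a little arithmetic. The sketch below is illustrative only: `latent_shape` is a hypothetical helper, the (channels, frames, height, width) ordering is an assumption, the 17-frame clip length is made up for the example, and the temporal formula assumes the common causal-VAE convention that the first frame gets its own latent step while the remaining frames are grouped in fours.

```python
def latent_shape(num_frames, height, width, channels=16, spatial=8, temporal=4):
    """Return (channels, latent_frames, latent_height, latent_width).

    Assumption: the causal VAE encodes frame 0 on its own and compresses the
    remaining frames by `temporal`, so num_frames must be 1 + k * temporal.
    """
    if num_frames < 1 or (num_frames - 1) % temporal != 0:
        raise ValueError("num_frames must be of the form 1 + k * temporal")
    if height % spatial != 0 or width % spatial != 0:
        raise ValueError("height and width must be divisible by the spatial factor")
    latent_frames = 1 + (num_frames - 1) // temporal
    return (channels, latent_frames, height // spatial, width // spatial)


if __name__ == "__main__":
    # Hypothetical 17-frame clips at the two released resolutions.
    print(latent_shape(17, 240, 240))  # (16, 5, 30, 30)
    print(latent_shape(17, 240, 400))  # (16, 5, 30, 50)
```

Under these assumptions the 240×240 environments yield 30×30 latent grids and Push Sand yields 30×50, which is the token grid the DiT denoises.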

Citation

Coming soon.

