ACWM-Phys Checkpoints

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Haotian Xue†, Yipu Chen*, Liqian Ma*, Zelin Zhao, Lama Moukheiber, Yongxin Chen
Georgia Institute of Technology

[Project Page] · [Paper] · [Dataset] · [Code]


Overview

This repository contains pretrained checkpoints for ACWM-DiT, a latent diffusion transformer trained with flow matching on the ACWM-Phys benchmark. All released checkpoints are DiT-S (~200M parameters) trained for 100k steps.


Released Checkpoints

| Environment | Category | Action Dim | Resolution | Checkpoint |
|---|---|---|---|---|
| Push Cube | Rigid-Body | 2 | 240×240 | VideoDiT_S_push_cube_240x240/latest.pt |
| Stack Cube | Rigid-Body | 7 | 240×240 | VideoDiT_S_stack_cube_240x240/latest.pt |
| Push Rope | Deformable | 2 | 240×240 | VideoDiT_S_push_rope_240x240/latest.pt |
| Cloth Move | Deformable | 3 | 240×240 | VideoDiT_S_clothmove_240x240_240x240/latest.pt |
| Push Sand | Particle | 7 | 240×400 | VideoDiT_S_push_sand_240x400/latest.pt |
| Pour Water | Particle | 4 | 240×240 | VideoDiT_S_pour_water_240x240/latest.pt |
| Robot Arm | Kinematics | 7 | 240×240 | VideoDiT_S_robot_arm_240x240/latest.pt |
| Reacher | Kinematics | 2 | 240×240 | VideoDiT_S_reacher_240x240/latest.pt |

The Wan 2.1 VAE weights (Wan2.1_VAE.pth, 508 MB) are also included; they are required for encoding and decoding video latents.
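
For scripting against several environments, the table above can be mirrored as a small lookup. This is only a sketch: the dictionary keys (other than `push_cube`, which matches the `--env` value used in the evaluation command), the `CHECKPOINTS` name, and the `checkpoint_path` helper are illustrative assumptions and are not part of the ACWM-Phys code. The resolution pair is stored in the order it is written in the table.

```python
from pathlib import Path

# env key -> (action_dim, resolution as written in the table, relative checkpoint path)
# Keys are hypothetical names chosen for this sketch; values are copied from the table.
CHECKPOINTS = {
    "push_cube": (2, (240, 240), "VideoDiT_S_push_cube_240x240/latest.pt"),
    "stack_cube": (7, (240, 240), "VideoDiT_S_stack_cube_240x240/latest.pt"),
    "push_rope": (2, (240, 240), "VideoDiT_S_push_rope_240x240/latest.pt"),
    "cloth_move": (3, (240, 240), "VideoDiT_S_clothmove_240x240_240x240/latest.pt"),
    "push_sand": (7, (240, 400), "VideoDiT_S_push_sand_240x400/latest.pt"),
    "pour_water": (4, (240, 240), "VideoDiT_S_pour_water_240x240/latest.pt"),
    "robot_arm": (7, (240, 240), "VideoDiT_S_robot_arm_240x240/latest.pt"),
    "reacher": (2, (240, 240), "VideoDiT_S_reacher_240x240/latest.pt"),
}


def checkpoint_path(env, root="./checkpoints"):
    """Return the checkpoint location for `env` under the download directory."""
    if env not in CHECKPOINTS:
        raise KeyError(f"unknown environment: {env!r}")
    return Path(root) / CHECKPOINTS[env][2]


if __name__ == "__main__":
    for env, (action_dim, resolution, _) in CHECKPOINTS.items():
        print(env, action_dim, resolution, checkpoint_path(env))
```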


Download

huggingface-cli download t1an/ACWM-Phys-checkpoints --local-dir ./checkpoints
export WAN_VAE_PATH=./checkpoints/Wan2.1_VAE.pth
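
After downloading, it can help to confirm that the expected files are in place before running evaluation. The snippet below is a minimal sketch using only the standard library; the `missing_files` helper is a hypothetical name, and the file list shows just two of the released paths as an example.

```python
import os
from pathlib import Path


def missing_files(root, relative_paths):
    """Return the entries of `relative_paths` that are not files under `root`."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    # Derive the download directory from WAN_VAE_PATH, falling back to the
    # location used in the commands above.
    vae_path = Path(os.environ.get("WAN_VAE_PATH", "./checkpoints/Wan2.1_VAE.pth"))
    expected = [
        "Wan2.1_VAE.pth",
        "VideoDiT_S_push_cube_240x240/latest.pt",
    ]
    missing = missing_files(vae_path.parent, expected)
    print("missing:", missing if missing else "none")
```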

Usage

See the ACWM-Phys code repository for full evaluation and training instructions.

Quick evaluation:

python eval.py --env push_cube --steps 50 --split both --save_videos

Model Architecture

ACWM-DiT takes the first video frame and the full action sequence as input and predicts the complete future trajectory. It has three components:

  1. Causal VAE (Wan 2.1): encodes video into 16-channel latents at H/8 × W/8 spatial resolution with 4× temporal compression
  2. DiT with flow matching: denoises the full latent trajectory
  3. Action conditioning: injected via AdaLN (default) or cross-attention
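
The compression factors in step 1 determine the size of the latent trajectory the DiT operates on. The arithmetic can be sketched as below. Two things here are assumptions rather than statements from this repository: the `latent_shape` helper is a hypothetical name, and the frame-count rule `1 + (T - 1) // 4` reflects the usual behavior of a causal video VAE (the first frame is encoded on its own, later frames in groups of four). The 17-frame clip length is also only an example, since the clip length is not stated here.

```python
def latent_shape(num_frames, height, width, channels=16, spatial=8, temporal=4):
    """Return (channels, latent_frames, latent_height, latent_width).

    Assumes a causal VAE: the first frame maps to one latent frame and every
    following group of `temporal` frames maps to one more.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be divisible by the spatial factor")
    latent_frames = 1 + (num_frames - 1) // temporal
    return (channels, latent_frames, height // spatial, width // spatial)


if __name__ == "__main__":
    print(latent_shape(17, 240, 240))  # (16, 5, 30, 30)
    print(latent_shape(17, 240, 400))  # (16, 5, 30, 50)
```

At 240×240 each latent frame is therefore a 30×30 grid with 16 channels, which is the spatial extent the transformer has to denoise per time step.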

Citation

Coming soon.
