SCAIL-2 — MLX (work in progress)

⚠️ WIP — pre-release conversion, expect changes

These are Apple-MLX conversions of zai-org/SCAIL-2 for the xocialize/scail-2-mlx port, published from our own namespace while the port is under active development. File formats, key layouts, and dtypes may change without notice. Quantized (q8/q4) variants, golden end-to-end validation against the PyTorch reference, and an mlx-community release are planned but not done. Use for experimentation, not production.

SCAIL-2 (Zhipu AI, arXiv 2512.05905) is an end-to-end controlled character-animation model: given a reference character image and a driving video, it generates the character performing that motion. It supports cross-identity replacement, multi-character scenes, and animal driving, with no intermediate pose representations required. The backbone is a Wan2.1-I2V-14B fork with a 3-segment (reference / video / pose) RoPE design and dual mask conditioning.

Files

file	component	dtype	size
`dit.safetensors`	SCAIL2 DiT (14B, Wan2.1-I2V fork)	bf16	33 GB
`umt5.safetensors`	umT5-XXL text encoder	bf16	11 GB
`clip.safetensors`	open-clip xlm-roberta ViT-H/14 visual tower	fp16	1.2 GB
`vae.safetensors`	Wan2.1 VAE (16-ch)	fp32	0.5 GB
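
To confirm the dtypes listed above on a local download, the safetensors header can be read without loading any tensor data. The following is a minimal sketch using only the standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names and the `weights/mlx` path are our assumptions, not part of scail-2-mlx.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize_safetensors(path):
    """Count tensors per dtype string (for example "BF16", "F16", "F32")."""
    header = read_safetensors_header(path)
    return Counter(entry["dtype"] for entry in header.values())


if __name__ == "__main__":
    # Assumed location: the --local-dir passed to `hf download`.
    for name in ("dit", "umt5", "clip", "vae"):
        path = Path("weights/mlx") / f"{name}.safetensors"
        if path.exists():
            print(name, dict(summarize_safetensors(path)))
```

A file whose summary shows an unexpected dtype may come from an older or newer revision of this pre-release conversion.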

Keys follow the scail-2-mlx module tree (MLX nn.Sequential uses .layers.N; conv weights are NDHWC/NHWC). Tokenizer: use google/umt5-xxl (or the umt5-xxl/ directory bundled with the original checkpoint).
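
The two layout conventions mentioned above can be illustrated as follows. This is a minimal NumPy sketch, not the actual conversion script: `to_channels_last` and `remap_sequential_key` are hypothetical helpers, the example key is made up, and the key remapping assumes the caller already knows which prefixes are `nn.Sequential` containers.

```python
import re

import numpy as np


def to_channels_last(weight):
    """Move the input-channel axis of a PyTorch conv weight to the end.

    PyTorch stores conv weights as (out, in, *spatial); MLX conv layers
    expect (out, *spatial, in). Works for 1D, 2D and 3D kernels.
    """
    if weight.ndim < 3:
        raise ValueError("expected a conv weight with at least 3 dims")
    return np.moveaxis(weight, 1, -1)


def remap_sequential_key(key, sequential_prefixes):
    """Insert ".layers" before the index of known nn.Sequential children.

    Example (made-up key): "head.0.weight" -> "head.layers.0.weight"
    when "head" is listed in sequential_prefixes.
    """
    for prefix in sequential_prefixes:
        pattern = rf"^({re.escape(prefix)})\.(\d+)(\.|$)"
        key, n = re.subn(pattern, r"\1.layers.\2\3", key)
        if n:
            break
    return key
```

Keys that do not start with a listed prefix pass through unchanged, so the helper can be applied to every key in a checkpoint.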

Usage

git clone https://github.com/xocialize/scail-2-mlx && cd scail-2-mlx
uv venv --python 3.12 .venv
uv pip install -e refs/mlx-video -e .
hf download xocialize/SCAIL-2-bf16 --local-dir weights/mlx

.venv/bin/python scripts/generate.py \
  --weights-dir weights/mlx \
  --image ref.jpg --mask-image ref_mask.jpg \
  --pose driving.mp4 --mask-video driving_mask.mp4 \
  --prompt "the girl is dancing" \
  --target-h 480 --target-w 832 --save-file out.mp4

Requires Apple Silicon with ≥ 64 GB unified memory at bf16. At 832×480×65 frames, active memory is ~34 GB and peak memory is ~47 GB; a step takes ~3.7 min on an M5 Max (performance work is ongoing). Driving-input preprocessing (masks and pose renders) comes from the upstream SCAIL-Pose toolchain.
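
Before launching a long run, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the four filenames from the Files table and the `weights/mlx` directory used above; `check_weights_dir` is a hypothetical helper, not part of scail-2-mlx.

```python
from pathlib import Path

# Filenames taken from the Files table.
EXPECTED_FILES = (
    "dit.safetensors",
    "umt5.safetensors",
    "clip.safetensors",
    "vae.safetensors",
)


def check_weights_dir(weights_dir):
    """Return (missing_files, total_size_in_GB) for a weights directory."""
    root = Path(weights_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    total_bytes = sum(
        (root / name).stat().st_size
        for name in EXPECTED_FILES
        if name not in missing
    )
    return missing, total_bytes / 1e9


if __name__ == "__main__":
    missing, size_gb = check_weights_dir("weights/mlx")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print(f"all files present, {size_gb:.1f} GB on disk")
```

The sizes in the Files table sum to roughly 46 GB, so a much smaller total suggests an interrupted download.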

Conversion provenance & fidelity

Converted by recipes/convert_scail2.py from the original FSDP checkpoint via upstream convert.py key remapping (1307/1307 strict key match). Component-level parity vs the PyTorch reference (fp32, CPU): CLIP visual max_abs 2.7e-4 on real weights; chunked causal VAE decode < 5e-4 per frame (canonical 1+(T−1)·4 frame mapping — see Blaizzy/mlx-video#38); DiT forward parity-locked at fp32 on the CPU oracle. End-to-end golden comparison against the PyTorch pipeline is pending.
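
The frame mapping and the error metric quoted above can be made concrete with a short sketch. It assumes that T counts latent frames and that `max_abs` means the largest element-wise absolute difference between the two outputs; the function names are ours and are not taken from the parity test suite.

```python
import numpy as np


def video_frames_from_latent(t_latent):
    """Canonical causal-VAE mapping: T latent frames -> 1 + (T - 1) * 4 video frames."""
    if t_latent < 1:
        raise ValueError("need at least one latent frame")
    return 1 + (t_latent - 1) * 4


def latent_frames_from_video(n_frames):
    """Inverse mapping; only defined when n_frames has the form 4k + 1."""
    if n_frames < 1 or (n_frames - 1) % 4 != 0:
        raise ValueError("frame count must be of the form 4k + 1")
    return 1 + (n_frames - 1) // 4


def max_abs_diff(a, b):
    """Largest element-wise absolute difference, computed in float64."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))
```

For the 65-frame clip mentioned under Usage, `latent_frames_from_video(65)` gives 17 latent frames.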

License

Weights: converted from zai-org/SCAIL-2 (model card: MIT; source repository: Apache-2.0 — this card is marked Apache-2.0, the stricter of the two, pending upstream clarification). Conversion code: Apache-2.0. Derived from SCAIL-2 (Zhipu AI), Wan2.1 (Alibaba), open-clip.

Paper

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations (arXiv 2512.05905, published Dec 5, 2025)