SCAIL-2: MLX (work in progress)

⚠️ WIP: pre-release conversion, expect changes

These are Apple-MLX conversions of zai-org/SCAIL-2 for the xocialize/scail-2-mlx port, published from our own namespace while the port is under active development. File formats, key layouts, and dtypes may change without notice. Quantized (q8/q4) variants, golden end-to-end validation against the PyTorch reference, and an mlx-community release are planned but not done. Use for experimentation, not production.

SCAIL-2 (Zhipu AI, arXiv 2512.05905) is an end-to-end controlled character-animation model: a reference character image + a driving video → the character performing that motion. It supports cross-identity replacement, multi-character scenes, and animal driving, with no intermediate pose representations required. The backbone is a Wan2.1-I2V-14B fork with a 3-segment (reference / video / pose) RoPE design and dual mask conditioning.

Files

| file | component | dtype | size |
|---|---|---|---|
| dit.safetensors | SCAIL2 DiT (14B, Wan2.1-I2V fork) | bf16 | 33 GB |
| umt5.safetensors | umT5-XXL text encoder | bf16 | 11 GB |
| clip.safetensors | open-clip xlm-roberta ViT-H/14 visual tower | fp16 | 1.2 GB |
| vae.safetensors | Wan2.1 VAE (16-ch) | fp32 | 0.5 GB |

Keys follow the scail-2-mlx module tree (MLX nn.Sequential uses .layers.N; conv weights are NDHWC/NHWC). Tokenizer: use google/umt5-xxl (or the umt5-xxl/ directory bundled with the original checkpoint).
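
To make the layout note concrete, here is a minimal NumPy-only sketch of the two transforms it implies. It assumes the source checkpoint stores conv kernels in PyTorch's channels-first order, (out, in, [D,] H, W), and that the MLX side wants channels-last kernels, (out, [D,] H, W, in). The helper names and the example key are hypothetical, and this is not the code in recipes/convert_scail2.py.

```python
import numpy as np


def conv_weight_to_channels_last(w: np.ndarray) -> np.ndarray:
    """Move the input-channel axis of a conv kernel from position 1 to the end.

    Assumption: the caller passes only conv kernels (rank 4 or 5) here;
    any other rank is returned unchanged.
    """
    if w.ndim == 5:  # Conv3d kernel: (out, in, D, H, W) -> (out, D, H, W, in)
        return np.transpose(w, (0, 2, 3, 4, 1))
    if w.ndim == 4:  # Conv2d kernel: (out, in, H, W) -> (out, H, W, in)
        return np.transpose(w, (0, 2, 3, 1))
    return w


def sequential_key(key: str, seq_prefix: str) -> str:
    """Insert 'layers.' after a module path that is known to be an nn.Sequential.

    PyTorch names children of a Sequential 'prefix.N...', while MLX
    nn.Sequential stores them under 'prefix.layers.N...'.
    """
    head = seq_prefix + "."
    if key.startswith(head):
        return head + "layers." + key[len(head):]
    return key
```

For example, `sequential_key("head.0.weight", "head")` gives `"head.layers.0.weight"`. The prefix is passed explicitly because a numeric path segment alone does not say whether the parent is a Sequential or a plain list of modules.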

Usage

git clone https://github.com/xocialize/scail-2-mlx && cd scail-2-mlx
uv venv --python 3.12 .venv
uv pip install -e refs/mlx-video -e .
hf download xocialize/SCAIL-2-bf16 --local-dir weights/mlx

.venv/bin/python scripts/generate.py \
  --weights-dir weights/mlx \
  --image ref.jpg --mask-image ref_mask.jpg \
  --pose driving.mp4 --mask-video driving_mask.mp4 \
  --prompt "the girl is dancing" \
  --target-h 480 --target-w 832 --save-file out.mp4
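
Since the download is large, a quick pre-flight check before launching generate.py can save a failed run. The sketch below only confirms that the four files named in the Files table exist in the weights directory. It assumes they sit flat in that directory, which this card does not state explicitly, and the function name is hypothetical.

```python
from pathlib import Path

# File names taken from the Files table of this card.
EXPECTED_FILES = (
    "dit.safetensors",
    "umt5.safetensors",
    "clip.safetensors",
    "vae.safetensors",
)


def missing_weight_files(weights_dir):
    """Return the expected weight files that are not present in weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_weight_files("weights/mlx")` returns an empty list when all four files are in place.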

Requires Apple Silicon with ≥ 64 GB unified memory at bf16 (active ~34 GB, peak ~47 GB at 832×480×65 frames; ~3.7 min/step on an M5 Max, perf work ongoing). Driving-input preprocessing (masks / pose renders) comes from the upstream SCAIL-Pose toolchain.
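
For a sense of what the 832×480×65 setting means in latent space, the sketch below works out the VAE latent grid. The temporal rule is the 1+(T−1)·4 frame mapping quoted in the provenance section and the 16 channels come from the Files table. The spatial stride of 8 is an assumption based on the Wan2.1 VAE and is not stated in this card. Function names are hypothetical.

```python
def latent_frames(pixel_frames: int, temporal_stride: int = 4) -> int:
    """Invert pixel_frames = 1 + (T - 1) * temporal_stride to get T."""
    if pixel_frames < 1 or (pixel_frames - 1) % temporal_stride != 0:
        raise ValueError(
            f"{pixel_frames} frames does not fit 1 + (T - 1) * {temporal_stride}"
        )
    return 1 + (pixel_frames - 1) // temporal_stride


def latent_shape(width: int, height: int, frames: int,
                 spatial_stride: int = 8, channels: int = 16):
    """Channels-last latent shape (T, H, W, C).

    spatial_stride=8 is an assumed value, not taken from this card.
    """
    if width % spatial_stride or height % spatial_stride:
        raise ValueError("width and height must be multiples of the spatial stride")
    return (
        latent_frames(frames),
        height // spatial_stride,
        width // spatial_stride,
        channels,
    )
```

Under these assumptions, `latent_shape(832, 480, 65)` is `(17, 60, 104, 16)`, and a frame count such as 64 is rejected because it does not satisfy the mapping.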

Conversion provenance & fidelity

Converted by recipes/convert_scail2.py from the original FSDP checkpoint via upstream convert.py key remapping (1307/1307 strict key match). Component-level parity vs the PyTorch reference (fp32, CPU): CLIP visual max_abs 2.7e-4 on real weights; chunked causal VAE decode < 5e-4 per frame (canonical 1+(T−1)·4 frame mapping; see Blaizzy/mlx-video#38); DiT forward parity-locked at fp32 on the CPU oracle. End-to-end golden comparison against the PyTorch pipeline is pending.
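
The parity figures above read as largest elementwise absolute differences. A minimal sketch of such a check is shown below, assuming both pipelines' outputs have already been converted to NumPy arrays of the same shape. The function names and the choice of frame axis are hypothetical, and this is not the project's test harness.

```python
import numpy as np


def max_abs_diff(reference, candidate) -> float:
    """Largest elementwise absolute difference, computed in fp32."""
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    return float(np.max(np.abs(ref - cand)))


def per_frame_max_abs(reference, candidate, frame_axis: int = 0):
    """max_abs computed separately for each index along frame_axis."""
    ref = np.moveaxis(np.asarray(reference, dtype=np.float32), frame_axis, 0)
    cand = np.moveaxis(np.asarray(candidate, dtype=np.float32), frame_axis, 0)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    diff = np.abs(ref - cand).reshape(ref.shape[0], -1)
    return [float(v) for v in diff.max(axis=1)]
```

A per-frame VAE check in this style would then be something like `all(d < 5e-4 for d in per_frame_max_abs(ref_video, mlx_video))`, with the threshold taken from the figure quoted above.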

License

Weights: converted from zai-org/SCAIL-2 (model card: MIT; source repository: Apache-2.0; this card is marked Apache-2.0, the stricter of the two, pending upstream clarification). Conversion code: Apache-2.0. Derived from SCAIL-2 (Zhipu AI), Wan2.1 (Alibaba), open-clip.
