QtMeshEditor Text-to-Motion (experimental, #411)

A small, experimental text-to-motion model for QtMeshEditor, trained from scratch. Given a text prompt (an action keyword), it generates a 60-frame clip (2 s at 30 fps) for a 22-joint canonical skeleton in WORLD-frame coordinates, which QtMeshEditor retargets onto an arbitrary humanoid rig.

The model QtMeshEditor actually downloads at runtime lives in the shared fernandotonon/QtMeshEditor-models repo under motion/. This repo is the standalone model card and a mirror of those files.

Status: experimental

The shipped default in QtMeshEditor is the deterministic template-clip retarget: a curated library of 47 real CMU mocap clips across 15 actions, with per-action variety. That is the quality bar. This model is opt-in (--model / GUI checkbox / MCP model:true) and falls back to the template automatically when the model is unavailable or the prompt is out of vocabulary. It produces coherent, upright motion with per-generate variety, but is stylistically gentler and less crisp than the real-mocap templates.
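
The fallback rule described above can be sketched as a small decision function. This is only an illustration: `pick_backend` and its parameters are hypothetical names, not QtMeshEditor's API, and the vocabulary is copied from the list further down this card.

```python
from typing import Sequence

# Vocabulary as listed in this card; the host would read it from t2m-vocab.json.
VOCAB = ["walk", "run", "jump", "dance", "march", "kick", "punch",
         "wave", "climb", "sit", "throw", "boxing", "idle"]


def pick_backend(prompt: str, use_model: bool, model_available: bool,
                 vocab: Sequence[str] = VOCAB) -> str:
    """Return "model" or "template" (hypothetical labels for the two paths)."""
    if not use_model:
        return "template"      # default path: template-clip retarget
    if not model_available:
        return "template"      # e.g. the model file could not be loaded
    action = prompt.strip().lower()
    if action not in vocab:
        return "template"      # out-of-vocabulary prompt
    return "model"


print(pick_backend("walk", use_model=True, model_available=True))
print(pick_backend("swim", use_model=True, model_available=True))
```

Keeping the template as the answer for every failure case means an opt-in request never yields an empty result.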

Training data — permissive only

Trained from scratch on clean, dynamic, single-action windows mined from the CMU MoCap database (commercial-OK). AMASS / HumanML3D / KIT-ML were excluded because they are non-commercial. Each window is 2 s at 30 fps (60 frames), selected for motion energy and snapped to a calm, near-neutral start frame. The set is mirror-augmented.
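
One plausible way to pick a high-energy window is sketched below. It is not the actual prep script: the energy definition (mean squared frame-to-frame change), the stride, and the function names are all assumptions made for illustration, and the snap-to-calm-start and mirroring steps are left out.

```python
from typing import Optional, Sequence, Tuple

FPS = 30
WINDOW = 2 * FPS  # 2 s at 30 fps = 60 frames

Frame = Sequence[float]  # one flattened pose per frame (assumed layout)


def motion_energy(frames: Sequence[Frame]) -> float:
    """Mean squared frame-to-frame change (an assumed definition of energy)."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += (b - a) ** 2
            count += 1
    return total / count if count else 0.0


def best_window(frames: Sequence[Frame], window: int = WINDOW,
                stride: int = FPS // 2) -> Optional[Tuple[int, float]]:
    """Return (start_index, energy) of the most energetic window, or None
    when the clip is shorter than one window."""
    best = None
    for start in range(0, len(frames) - window + 1, stride):
        energy = motion_energy(frames[start:start + window])
        if best is None or energy > best[1]:
            best = (start, energy)
    return best


# Toy clip: 30 static frames followed by 60 frames of steady movement.
clip = [[0.0]] * 30 + [[float(i)] for i in range(60)]
print(best_window(clip))
```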

Architecture (v4)

6D-rotation representation (Zhou et al. 2019), correctly column-packed.
Cross-attention transformer decoder with an absolute per-frame pose head (self-attention models temporal coherence; no error-accumulating cumsum).
CVAE latent with z=0 supervision + aggregate-posterior matching.
Per-sample velocity/acceleration matching in both 6D and true rotation (geodesic) space; derived-local supervision (the quantity the retarget renders); 1-2-1 output smoothing baked into the ONNX graph.
~7.6M params, exports to ONNX (one forward pass).
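
Three of the ingredients above are small enough to sketch in plain Python: decoding a 6D rotation by Gram-Schmidt (Zhou et al. 2019), the geodesic angle between two rotations, and a 1-2-1 temporal filter. The function names are illustrative. The packing order (first column, then second column) and the endpoint handling in the filter are assumptions, not details confirmed by this card.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rot6d_to_matrix(d6):
    """Gram-Schmidt decode. Assumes d6 = [column 0 (3 values), column 1 (3 values)]."""
    a1, a2 = d6[:3], d6[3:6]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([a2[i] - proj * b1[i] for i in range(3)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the COLUMNS of the rotation; return row-major rows.
    return [[b1[r], b2[r], b3[r]] for r in range(3)]


def geodesic_angle(ra, rb):
    """Angle in radians of the relative rotation: acos((trace(Ra^T Rb) - 1) / 2)."""
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def smooth_121(channel):
    """[0.25, 0.5, 0.25] filter over time; endpoints are replicated (an assumption)."""
    n = len(channel)
    return [0.25 * channel[max(t - 1, 0)] + 0.5 * channel[t]
            + 0.25 * channel[min(t + 1, n - 1)] for t in range(n)]


identity = rot6d_to_matrix([1, 0, 0, 0, 1, 0])
quarter_turn_z = rot6d_to_matrix([0, 1, 0, -1, 0, 0])
print(geodesic_angle(identity, quarter_turn_z))
print(smooth_121([0.0, 4.0, 0.0]))
```

Packing order matters here: reading the six values as two rows instead of two columns decodes to the transpose, which is the inverse rotation.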

I/O contract

input  "tokens" float32 [1, V]   one-hot over the fixed action vocab (see t2m-vocab.json)
input  "seed"   float32 [1, Z]   latent noise (host samples ~N(0,0.5) and does best-of-N)
output "motion" float32 [1, T, C]  C = 22*10 per-joint [tx,ty,tz, qx,qy,qz,qw, sx,sy,sz]
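
A minimal host-side sketch of this contract, using nested lists in place of float32 tensors. Several things here are assumptions: the helper names are hypothetical, 0.5 is read as a standard deviation, and the 220 channels are taken to be joint-major (22 consecutive groups of 10).

```python
import random

J = 22            # joints per frame
PER_JOINT = 10    # tx,ty,tz, qx,qy,qz,qw, sx,sy,sz


def one_hot(action, vocab):
    """Build the "tokens" input, shape [1, V]. Raises ValueError if out of vocab."""
    row = [0.0] * len(vocab)
    row[vocab.index(action)] = 1.0
    return [row]


def sample_seed(z_dim, rng, sigma=0.5):
    """Build the "seed" input, shape [1, Z]; sigma=0.5 is an assumed reading of N(0,0.5)."""
    return [[rng.gauss(0.0, sigma) for _ in range(z_dim)]]


def unpack_frame(frame):
    """Split one frame of C = 22*10 floats into per-joint translation/quaternion/scale."""
    if len(frame) != J * PER_JOINT:
        raise ValueError(f"expected {J * PER_JOINT} channels, got {len(frame)}")
    joints = []
    for j in range(J):
        c = frame[j * PER_JOINT:(j + 1) * PER_JOINT]  # assumed joint-major layout
        joints.append({"t": c[0:3], "q": c[3:7], "s": c[7:10]})
    return joints


vocab = ["walk", "run", "jump"]                  # shortened for the example
tokens = one_hot("run", vocab)
seed = sample_seed(8, random.Random(0))          # Z=8 is a placeholder value
rest = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0] * J
print(tokens, len(seed[0]), unpack_frame(rest)[0])
```

With an ONNX runtime, the two inputs would be converted to float32 arrays and fed under the names "tokens" and "seed"; best-of-N would repeat the call with N different seeds and keep the clip the host scores highest.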

t2m-vocab.json ships the {vocab, Z, T, C, J, joints, fps, frame} fields the host needs. frame: "world" marks the WORLD-frame convention (the retarget takes a world delta), and fps is 30. Vocabulary: walk, run, jump, dance, march, kick, punch, wave, climb, sit, throw, boxing, idle.
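
A host could sanity-check that file before running the model, roughly as follows. The key names come from the list above, but the example values are placeholders: Z=32 and the joint_N names are invented for illustration, and the real file's value types may differ.

```python
import json

REQUIRED = {"vocab", "Z", "T", "C", "J", "joints", "fps", "frame"}

# Stand-in for the contents of t2m-vocab.json (Z and joint names are hypothetical).
EXAMPLE = json.dumps({
    "vocab": ["walk", "run", "jump", "dance", "march", "kick", "punch",
              "wave", "climb", "sit", "throw", "boxing", "idle"],
    "Z": 32, "T": 60, "C": 220, "J": 22,
    "joints": [f"joint_{i}" for i in range(22)],
    "fps": 30, "frame": "world",
})


def load_contract(text):
    """Parse the metadata and check the invariants stated in this card."""
    meta = json.loads(text)
    missing = REQUIRED - meta.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if meta["C"] != meta["J"] * 10:
        raise ValueError("C must equal J * 10 channels per frame")
    if len(meta["joints"]) != meta["J"]:
        raise ValueError("joints list length must equal J")
    if meta["frame"] != "world":
        raise ValueError("this sketch only handles the WORLD-frame convention")
    return meta


meta = load_contract(EXAMPLE)
print(len(meta["vocab"]), meta["T"] / meta["fps"])
```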

Reproducing

Use scripts/prep-t2m-v4.py and scripts/train-t2m-onnx-v4.py in the QtMeshEditor repo. Both are one-time, offline dev tools; the app never runs Python.
