BFM-Design · Proxy v2 (BeyondMimic-aligned): Unitree G1 terrain-aware motion-tracking teacher
Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE (BFM) DAgger distillation.
TL;DR
| | |
|---|---|
| Task | Bfm-Proxy-V2-PfnnRough-Unitree-G1 (mjlab) |
| Robot | Unitree G1, 29 actuated DoF |
| Obs | 1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
| Action | 29-dim joint position targets (PD), with 0–2 step action delay |
| Policy net | MLP [2048, 2048, 1024, 1024, 512, 512], PPO (rsl_rl) |
| Training data | motion_bag_v5: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |
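
The 0 to 2 step action delay listed in the Action row can be pictured as a small FIFO buffer between the policy and the PD controller. The sketch below is illustrative only: the class name `ActionDelayBuffer`, the per-episode resampling of the delay, and the zero-action warm-up are assumptions, not the mjlab implementation.

```python
import random
from collections import deque


class ActionDelayBuffer:
    """Hypothetical helper: returns the action issued `delay` steps ago."""

    def __init__(self, action_dim=29, max_delay=2, seed=None):
        self.action_dim = action_dim
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.reset()

    def reset(self, delay=None):
        # Assumption: one delay is drawn per episode, uniformly from 0..max_delay.
        self.delay = self.rng.randint(0, self.max_delay) if delay is None else delay
        # Assumption: the buffer is pre-filled with zero actions at episode start.
        zero = [0.0] * self.action_dim
        self.queue = deque([zero] * (self.delay + 1), maxlen=self.delay + 1)

    def step(self, action):
        # Push the newest action; the oldest entry is the one applied now.
        self.queue.append(list(action))
        return self.queue[0]


# Usage: with a fixed 2-step delay, the action from step t is applied at t + 2.
buf = ActionDelayBuffer(action_dim=29, max_delay=2, seed=0)
buf.reset(delay=2)
applied = [buf.step([float(t)] * 29)[0] for t in range(1, 5)]
```

With `delay=2`, the first two applied actions are the zero warm-up entries, after which the policy's outputs arrive two steps late.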
Why "v2": reward design
This is the v2 reward recipe. v1 used a heavy 10-penalty linear reward curriculum plus
an all-14-body 0.5 m termination, which collapsed episode length (28 → 10) as the
penalties ramped: the regularizers fought limb tracking, and bad_motion_body_pos converted
the resulting limb excursions into early death, while mean body-position error stayed at ~0.16 m
the whole time.
v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:
| | weight / setting |
|---|---|
| Tracking rewards | motion_body_pos/ori/lin_vel/ang_vel = 1.0 each; motion_global_root_pos/ori = 0.5 |
| Penalties (only 3) | action_rate_l2 −0.1, joint_limit −10, self_collisions −10 |
| Reward curriculum | none (empty) |
| Termination | 3-way z-only: anchor_pos_z 0.25 + anchor_ori 0.8 + ee_body_pos_z 0.25 (4 EE: ankles+wrists) + motion_clip_end (time-out) |
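
As a rough illustration of how small the v2 recipe is, the weights in the table can be written as a single dictionary and combined as a weighted sum. This is a sketch under stated assumptions: the fully expanded term names are inferred from the table's abbreviations, `total_reward` is a hypothetical helper, and any per-step scaling the framework applies (for example by the control timestep) is ignored.

```python
# Weights copied from the table above; the expanded term names are an
# assumption (the table abbreviates them as motion_body_pos/ori/lin_vel/ang_vel).
REWARD_WEIGHTS = {
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}


def total_reward(raw_terms):
    """Weighted sum of raw (unweighted) term values; missing terms count as 0."""
    unknown = set(raw_terms) - set(REWARD_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown reward terms: {sorted(unknown)}")
    return sum(w * raw_terms.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())


# Upper bound if every tracking term returned 1.0 and no penalty fired
# (assumes each tracking term is bounded in [0, 1], e.g. an exponential kernel).
perfect = {name: 1.0 for name, w in REWARD_WEIGHTS.items() if w > 0}
max_reward = total_reward(perfect)
```

Under the bounded-kernel assumption, the six tracking weights sum to 5.0, which gives a scale against which a mean reward can be read.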
Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real
domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI /
action-delay) are kept, since they are orthogonal to the v1 collapse. Full rationale and data:
see docs/proxy_reward_design.md in the source repo.
Intended use & role
- Primary: teacher policy whose privileged rollouts are distilled into the deployable Stage-2 CVAE BFM (masked unified control interface) via DAgger.
- Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
- Not directly deployable on hardware: observations are privileged (full sim state + heightmap), not the 25-step proprioceptive history the deployable BFM uses.
Checkpoint contents
model_13499.pt (full rsl_rl checkpoint, ~241 MB) contains the keys:
actor_state_dict, critic_state_dict, optimizer_state_dict, iter, infos.
Use actor_state_dict for inference; the rest is for resuming training.
ONNX export is not included (the run's auto-export failed to serialize; a clean export can be regenerated from the actor if a deployment graph is needed).
How to load (inference)
```python
import torch
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]  # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.
```
Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned
commit; the rasterized terrain is included there (assets/terrain/mjlab_terrain_rasterized_v3h.npz).
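
Before loading, it can help to know what layer shapes to expect. The sketch below derives them from the actor layout quoted in the loading snippet (1150 -> [2048, 2048, 1024, 1024, 512, 512] -> 29), assuming a plain fully connected MLP with one bias vector per layer. The helper names are hypothetical, and the real checkpoint may hold extra entries (for example an action-noise parameter) that are not modeled here.

```python
def mlp_layer_shapes(in_dim, hidden, out_dim):
    """Return (fan_in, fan_out) for each dense layer of a plain MLP."""
    dims = [in_dim, *hidden, out_dim]
    return list(zip(dims[:-1], dims[1:]))


def count_params(shapes):
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


ACTOR_SHAPES = mlp_layer_shapes(1150, [2048, 2048, 1024, 1024, 512, 512], 29)
ACTOR_PARAMS = count_params(ACTOR_SHAPES)

for fan_in, fan_out in ACTOR_SHAPES:
    print(f"Linear({fan_in:>4} -> {fan_out:>4})")
print(f"total parameters: {ACTOR_PARAMS:,}")
```

Under these assumptions the actor has about 10.5 M parameters. When comparing against the 2-D tensors in actor_state_dict, note that PyTorch stores a linear layer's weight as (fan_out, fan_in).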
Evaluation
Per-terrain eval, all 8 sub_types at level 4, 16 envs × 12 s (model_13499.pt):
| terrain (lvl4) | track body err (m) | track joint err (rad) |
|---|---|---|
| flat | 0.033 | 0.70 |
| pyramid_stairs | 0.051 | 0.72 |
| pyramid_stairs_inv | 0.061 | 0.99 |
| hf_pyramid_slope | 0.073 | 0.88 |
| hf_pyramid_slope_inv | 0.058 | 0.99 |
| random_rough | 0.062 | 0.99 |
| wave_terrain | 0.062 | 0.89 |
| box_line | 0.041 | 0.77 |
Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains, roughly 4–9× lower than the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).
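
The comparison against the earlier agents can be recomputed directly from the table. The snippet below is a small bookkeeping sketch: the numbers are transcribed from the table and the sentence above, while the variable and function names are illustrative.

```python
# Values transcribed from the per-terrain table above (level 4, model_13499.pt).
BODY_ERR_M = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}

best = min(BODY_ERR_M, key=BODY_ERR_M.get)
worst = max(BODY_ERR_M, key=BODY_ERR_M.get)
mean_err = sum(BODY_ERR_M.values()) / len(BODY_ERR_M)


def improvement_range(baseline_m):
    """Ratio of a baseline error to the worst and best terrain errors."""
    return baseline_m / BODY_ERR_M[worst], baseline_m / BODY_ERR_M[best]


teacher_lo, teacher_hi = improvement_range(0.286)  # Phase-B v2 teacher
student_lo, student_hi = improvement_range(0.59)   # A3' student

print(f"best={best}, worst={worst}, mean={mean_err * 100:.1f} cm")
print(f"vs teacher: {teacher_lo:.1f}x to {teacher_hi:.1f}x lower")
print(f"vs student: {student_lo:.1f}x to {student_hi:.1f}x lower")
```

The mean over the eight terrains comes out at about 5.5 cm, with flat as the easiest cell and hf_pyramid_slope as the hardest.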
Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87;
ee_body_pos terminations 159 → ~1. Visually signed off via reference-ghost rollouts
(scripts/render_v2_ghost_8terrain.py in the source repo).
Note: a naive survive_ratio over a fixed window reads 0 because motion_clip_end (a successful clip completion / time-out) is counted as "done"; actual fall terminations are ≈ 0.
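
The difference between the two metrics can be sketched as follows. This is a minimal illustration, assuming each environment reports the name of the termination term that fired during the window (or None if nothing fired); the function names and the example window are hypothetical and not taken from the source repo.

```python
# Assumption: motion_clip_end is the only termination that signals success.
TIMEOUT_TERMS = {"motion_clip_end"}


def naive_survive_ratio(per_env_done):
    """Counts any 'done' (including a clip completion) as a failure."""
    if not per_env_done:
        raise ValueError("need at least one environment")
    return sum(r is None for r in per_env_done) / len(per_env_done)


def fall_free_ratio(per_env_done):
    """Counts only non-time-out terminations as failures."""
    if not per_env_done:
        raise ValueError("need at least one environment")
    ok = sum(r is None or r in TIMEOUT_TERMS for r in per_env_done)
    return ok / len(per_env_done)


# Hypothetical 16-env window: 15 clips completed, one end-effector failure.
window = ["motion_clip_end"] * 15 + ["ee_body_pos_z"]
naive = naive_survive_ratio(window)
fall_free = fall_free_ratio(window)
```

On this made-up window the naive ratio is 0.0 while the fall-free ratio is 0.9375, which is the kind of gap the note describes.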
Limitations
- Privileged obs: teacher-only, not sim2real-ready as-is.
- ee_body_pos_z 0.25 m termination is tight for rough terrain (slow ep_len cold-start over ~iter 0–1500 before the policy learns to keep feet/wrists within tolerance).
- Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the reference distribution are out of scope (handled later by BFM residual learning).
Source, data, citation
- License: MIT (© 2026 Huiqiao Fu), consistent with Robo-PFNN.
- HF repo: tRNAoOO/<name> (public + gated, contact-info gate), matching the Robo-PFNN weights repo.
- Code (pinned): GitLab hqfu/bfm-design (internal). Reward design: docs/proxy_reward_design.md; data regen: docs/data_regeneration.md.
- Base framework: mjlab v1.3.0 + MuJoCo-Warp + rsl_rl.
- Reference motion: Robo-PFNN (kinematic generator).
- Reward recipe lineage: BeyondMimic (arXiv 2508.08241 / HybridRobotics/whole_body_tracking); PARC (arXiv 2505.04002); DeepMimic.
- Target architecture: BFM (arXiv 2509.13780), a CVAE + masked unified control interface.
Motion-bag training data (24 GB) is not distributed; regenerate deterministically per docs/data_regeneration.md.