BFM-Design · Proxy v2 (BeyondMimic-aligned): Unitree G1 terrain-aware motion-tracking teacher

Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE (BFM) DAgger distillation.

TL;DR

| | |
| --- | --- |
| Task | Bfm-Proxy-V2-PfnnRough-Unitree-G1 (mjlab) |
| Robot | Unitree G1, 29 actuated DoF |
| Obs | 1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap) |
| Action | 29-dim joint position targets (PD), with 0–2 step action delay |
| Policy net | MLP [2048, 2048, 1024, 1024, 512, 512], PPO (rsl_rl) |
| Training data | motion_bag_v5: 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels) |
| Scale | 1024 envs × 13 500 iters, single H20 (~15 h) |
| Result | ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0 |
| Role | Stage-1 teacher → Stage-2 CVAE BFM distillation |
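
A few of the figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the table; the per-cell clip count assumes clips are spread evenly over the terrain cells, which the card does not state.

```python
# Back-of-the-envelope checks on the TL;DR figures (values copied from the table).
FRAMES = 14.09e6           # motion_bag_v5 frames
CLIPS = 84.6e3             # motion_bag_v5 clips
SUB_TYPES, LEVELS = 8, 5   # terrain grid
OBS_DIM = 1150             # privileged observation size
HEIGHTMAP_SIDE = 21        # 21x21 heightmap
ITERS = 13_500
WALL_CLOCK_H = 15          # approximate ("~15 h" in the card)

terrain_cells = SUB_TYPES * LEVELS
frames_per_clip = FRAMES / CLIPS
clips_per_cell = CLIPS / terrain_cells         # ASSUMPTION: even split over cells
heightmap_dims = HEIGHTMAP_SIDE ** 2
other_obs_dims = OBS_DIM - heightmap_dims      # proprio history + body state + goal
seconds_per_iter = WALL_CLOCK_H * 3600 / ITERS

print(f"terrain cells:        {terrain_cells}")
print(f"mean frames per clip: {frames_per_clip:.1f}")
print(f"clips per cell:       {clips_per_cell:.0f} (if evenly split)")
print(f"heightmap dims:       {heightmap_dims} of {OBS_DIM}")
print(f"remaining obs dims:   {other_obs_dims}")
print(f"seconds per iter:     {seconds_per_iter:.1f}")
```

The heightmap therefore accounts for 441 of the 1150 observation dimensions, leaving 709 for the proprioceptive history, per-body state and motion goal combined; the card does not break that remainder down further.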

Why "v2": reward design

This is the v2 reward recipe. v1 used a heavy 10-penalty linear reward curriculum plus an all-14-body 0.5 m termination, which collapsed episode length (28 → 10) as the penalties ramped. The regularizers fought limb tracking, and bad_motion_body_pos converted the resulting limb excursions into early terminations, while mean body-position error stayed at ~0.16 m throughout.

v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:

| | weight / setting |
| --- | --- |
| Tracking rewards | motion_body_pos/ori/lin_vel/ang_vel = 1.0 each; motion_global_root_pos/ori = 0.5 |
| Penalties (only 3) | action_rate_l2 −0.1, joint_limit −10, self_collisions −10 |
| Reward curriculum | none (empty) |
| Termination | 3-way z-only: anchor_pos_z 0.25 + anchor_ori 0.8 + ee_body_pos_z 0.25 (4 EE: ankles+wrists) + motion_clip_end (time-out) |
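
The weights in the table can be written down as a plain weighted sum. This is a minimal sketch, not the mjlab reward manager: it assumes the slash notation expands to one term per suffix, that motion_global_root_pos and motion_global_root_ori each carry weight 0.5, and it ignores any per-step time scaling the framework may apply. The helper name `total_reward` is hypothetical.

```python
# Reward weights as listed in the table above.
# ASSUMPTION: "a/b/c = w" is read as "each of a, b, c has weight w".
REWARD_WEIGHTS = {
    # tracking terms
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    # the only three penalties
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}


def total_reward(term_values):
    """Weighted sum of raw term values; terms missing from the dict count as 0."""
    return sum(w * term_values.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())


# Perfect tracking (every tracking term at 1.0) with no penalties active.
perfect = {name: 1.0 for name, w in REWARD_WEIGHTS.items() if w > 0}
print(total_reward(perfect))

# Same tracking, but with one unit of joint-limit violation.
print(total_reward({**perfect, "joint_limit": 1.0}))
```

Under these assumptions the tracking terms can contribute at most 5.0 per step, and a single unit of joint_limit or self_collisions outweighs all of them, which is why only three penalties are enough to keep the motion physically sane.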

Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI / action-delay) are kept, since they are orthogonal to the v1 collapse. Full rationale and data: see docs/proxy_reward_design.md in the source repo.
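
One item in the randomization stack, the 0–2 step action delay, is easy to illustrate in isolation. The class below is a hypothetical stand-alone sketch, not the mjlab implementation; in particular, drawing the delay once per reset is an assumption.

```python
import random
from collections import deque


class ActionDelay:
    """FIFO buffer that returns the action issued `delay` steps earlier."""

    def __init__(self, max_delay=2, seed=None):
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.reset(default_action=None)

    def reset(self, default_action, delay=None):
        # ASSUMPTION: the delay is drawn uniformly from {0, ..., max_delay} per reset.
        self.delay = self.rng.randint(0, self.max_delay) if delay is None else delay
        # Pre-fill so the first `delay` steps replay the default action.
        self.buffer = deque([default_action] * self.delay)

    def step(self, action):
        """Queue the new action and pop the one that is applied this step."""
        self.buffer.append(action)
        return self.buffer.popleft()


delayed = ActionDelay(max_delay=2, seed=0)
delayed.reset(default_action="hold", delay=2)
print([delayed.step(a) for a in ["a1", "a2", "a3", "a4"]])
```

With a delay of 2 the policy's first action only reaches the actuators on the third step, so the policy has to stay stable while acting on stale commands.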

Intended use & role

  • Primary: teacher policy whose privileged rollouts are distilled into the deployable Stage-2 CVAE BFM (masked unified control interface) via DAgger.
  • Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
  • Not directly deployable on hardware: observations are privileged (full sim state + heightmap), not the 25-step proprioceptive history the deployable BFM uses.
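
The DAgger pattern named in the primary use can be illustrated on a toy problem. The sketch below is not the BFM-Design training code: a proportional controller stands in for the privileged teacher, a single scalar gain fit by least squares stands in for the CVAE student, and all function names are hypothetical. What it does share with DAgger is the loop structure: the student generates the states, the teacher labels them, and the student is refit on the aggregated data.

```python
# Toy DAgger loop on the 1-D system x_{t+1} = x_t + a_t.

def teacher_action(x):
    """Stand-in for the privileged teacher: a proportional controller."""
    return -0.5 * x


def fit_student_gain(dataset):
    """Least-squares fit of a = gain * x over all aggregated (x, a) pairs."""
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den if den > 0 else 0.0


def student_rollout(gain, x0, steps):
    """States visited when the student (not the teacher) drives the system."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + gain * x
    return states


def dagger(iterations=3, x0=1.0, steps=10):
    dataset, gain = [], 0.0  # the untrained student outputs zero action
    for _ in range(iterations):
        states = student_rollout(gain, x0, steps)                # 1. student acts
        dataset.extend((x, teacher_action(x)) for x in states)   # 2. teacher labels
        gain = fit_student_gain(dataset)                         # 3. refit on all data
    return gain


student_gain = dagger()
print(student_gain)
```

Because the labels come from the teacher but the states come from the student, the student is supervised on the state distribution it will actually encounter, which is the reason for preferring DAgger over plain behavior cloning on teacher rollouts.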

Checkpoint contents

model_13499.pt (full rsl_rl checkpoint, ~241 MB), with keys: actor_state_dict, critic_state_dict, optimizer_state_dict, iter, infos. Use actor_state_dict for inference; the rest is for resuming training.
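
As a sanity check on the checkpoint size, the actor's parameter count follows directly from the layer sizes quoted in this card (1150 inputs, the six hidden layers, 29 outputs). The sketch below counts only dense weights and biases; any extra tensors stored alongside them (for example an action-noise parameter or observation normalizer) are not known from the card and are ignored.

```python
# Layer widths taken from the card: obs -> hidden layers -> action.
LAYER_SIZES = [1150, 2048, 2048, 1024, 1024, 512, 512, 29]


def mlp_param_count(sizes):
    """Parameters of a plain MLP: in*out weights plus out biases per layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


actor_params = mlp_param_count(LAYER_SIZES)
actor_mib_fp32 = actor_params * 4 / 2**20  # 4 bytes per float32 parameter

print(f"actor parameters: {actor_params:,}")
print(f"actor size (fp32): {actor_mib_fp32:.1f} MiB")
```

If the critic is of similar size and the optimizer keeps two moment buffers per parameter (both assumptions, not stated in the card), the file would hold roughly six actor-sized copies of fp32 data, about 240 MiB, which is in line with the ~241 MB checkpoint.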

ONNX export is not included (the run's auto-export failed to serialize; a clean export can be regenerated from the actor if a deployment graph is needed).

How to load (inference)

```python
import torch

# Load the full rsl_rl checkpoint on CPU.
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]   # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.
```

Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned commit; the rasterized terrain is included there (assets/terrain/mjlab_terrain_rasterized_v3h.npz).

Evaluation

Per-terrain eval, all 8 sub_types at level 4, 16 envs × 12 s (model_13499.pt):

| terrain (lvl4) | track body err (m) | track joint err (rad) |
| --- | --- | --- |
| flat | 0.033 | 0.70 |
| pyramid_stairs | 0.051 | 0.72 |
| pyramid_stairs_inv | 0.061 | 0.99 |
| hf_pyramid_slope | 0.073 | 0.88 |
| hf_pyramid_slope_inv | 0.058 | 0.99 |
| random_rough | 0.062 | 0.99 |
| wave_terrain | 0.062 | 0.89 |
| box_line | 0.041 | 0.77 |

Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains, roughly 4–9× lower than the Phase-B v2 teacher (0.286 m) and roughly 8–18× lower than the A3' student (0.59 m).
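
The summary figures can be recomputed from the per-terrain table. The snippet below copies the body-error column and the two baseline values quoted above; the unweighted mean across terrains is an added convenience figure, not one reported by the authors.

```python
# Body tracking error per level-4 terrain, in metres (copied from the table).
BODY_ERR_M = {
    "flat": 0.033,
    "pyramid_stairs": 0.051,
    "pyramid_stairs_inv": 0.061,
    "hf_pyramid_slope": 0.073,
    "hf_pyramid_slope_inv": 0.058,
    "random_rough": 0.062,
    "wave_terrain": 0.062,
    "box_line": 0.041,
}
# Baseline body errors quoted in the card, in metres.
BASELINES_M = {"Phase-B v2 teacher": 0.286, "A3' student": 0.59}

best = min(BODY_ERR_M.values())
worst = max(BODY_ERR_M.values())
mean_err = sum(BODY_ERR_M.values()) / len(BODY_ERR_M)

# Improvement factor range: baseline divided by the worst and best terrain error.
ratios = {name: (base / worst, base / best) for name, base in BASELINES_M.items()}

print(f"best {best:.3f} m, worst {worst:.3f} m, mean {mean_err:.4f} m")
for name, (low, high) in ratios.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x lower error")
```

The hardest terrain by this metric is hf_pyramid_slope and the easiest is flat, so the improvement factor depends on which terrain is used for the comparison.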

Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87; ee_body_pos terminations 159 → ~1. Visually signed off via reference-ghost rollouts (scripts/render_v2_ghost_8terrain.py in the source repo).

Note: a naive survive_ratio over a fixed window reads 0 because motion_clip_end (a successful clip completion / time-out) is counted as "done"; actual fall terminations are ≈ 0.
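
The difference between the two readings can be made explicit by separating time-outs from falls. The function below is a hypothetical sketch: it assumes the evaluation script can report one termination reason per environment (or None if the environment never reported done), which is not necessarily how the source repo logs it. The reason names are the termination terms listed in this card.

```python
# Termination reasons that count as success rather than a fall.
TIME_OUT_REASONS = {"motion_clip_end"}


def survive_ratios(done_reasons):
    """Return (naive, fall_aware) survival ratios.

    done_reasons: one entry per env, None if the env never reported done.
    naive       : only envs that never reported done count as survivors.
    fall_aware  : envs that ended by time-out also count as survivors.
    """
    n = len(done_reasons)
    naive = sum(r is None for r in done_reasons) / n
    fall_aware = sum(r is None or r in TIME_OUT_REASONS for r in done_reasons) / n
    return naive, fall_aware


# 16 envs that all finish their clip: the naive ratio is 0, the fall-aware ratio is 1.
print(survive_ratios(["motion_clip_end"] * 16))

# Same window with a single end-effector termination.
print(survive_ratios(["motion_clip_end"] * 15 + ["ee_body_pos_z"]))
```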

Limitations

  • Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
  • ee_body_pos_z 0.25 m termination is tight for rough terrain (slow ep_len cold-start ~iter 0–1500 before the policy learns to keep feet/wrists within tolerance).
  • Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the reference distribution are out of scope (handled later by BFM residual learning).
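
To make the termination tolerance in the second limitation concrete, the v2 termination rule can be sketched as a threshold check. This is an illustration only: the card does not define how each error is measured, so the function takes already-computed scalar errors as inputs, treats the position thresholds as metres, and uses hypothetical end-effector names for the two ankles and two wrists.

```python
ANCHOR_POS_Z_MAX = 0.25    # anchor height error threshold (assumed metres)
ANCHOR_ORI_MAX = 0.8       # anchor orientation error threshold (metric not stated in the card)
EE_BODY_POS_Z_MAX = 0.25   # end-effector height error threshold, metres
# Hypothetical names for the 4 end effectors (ankles + wrists).
END_EFFECTORS = ("left_ankle", "right_ankle", "left_wrist", "right_wrist")


def termination_reason(anchor_z_err, anchor_ori_err, ee_z_err, clip_ended):
    """Return the first triggered termination term, or None to keep running.

    ee_z_err maps each end-effector name to its height error versus the reference.
    """
    if abs(anchor_z_err) > ANCHOR_POS_Z_MAX:
        return "anchor_pos_z"
    if anchor_ori_err > ANCHOR_ORI_MAX:
        return "anchor_ori"
    if any(abs(ee_z_err[name]) > EE_BODY_POS_Z_MAX for name in END_EFFECTORS):
        return "ee_body_pos_z"
    if clip_ended:
        return "motion_clip_end"  # time-out, not a failure
    return None


on_track = {name: 0.05 for name in END_EFFECTORS}
print(termination_reason(0.02, 0.1, on_track, clip_ended=False))
print(termination_reason(0.02, 0.1, {**on_track, "left_ankle": 0.30}, clip_ended=False))
```

On rough terrain a foot that lands one step early or late can easily be 0.25 m off the reference height, which is consistent with the slow early growth in episode length noted above.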

Source, data, citation

  • License: MIT (© 2026 Huiqiao Fu), consistent with Robo-PFNN.
  • HF repo: tRNAoOO/<name>, public + gated (contact-info gate), matching the Robo-PFNN weights repo.
  • Code (pinned): GitLab hqfu/bfm-design (internal). Reward design: docs/proxy_reward_design.md; data regen: docs/data_regeneration.md.
  • Base framework: mjlab v1.3.0 + MuJoCo-Warp + rsl_rl.
  • Reference motion: Robo-PFNN (kinematic generator).
  • Reward recipe lineage: BeyondMimic (arXiv 2508.08241 / HybridRobotics/whole_body_tracking); PARC (arXiv 2505.04002); DeepMimic.
  • Target architecture: BFM (arXiv 2509.13780), CVAE + masked unified control interface.

Motion-bag training data (24 GB) is not distributed; regenerate deterministically per docs/data_regeneration.md.
