BFM-Design · Proxy v2 (BeyondMimic-aligned) — Unitree G1 terrain-aware motion-tracking teacher

Stage-1 proxy agent of the BFM-Design pipeline: a privileged PPO motion-tracking policy for the Unitree G1 humanoid on rough terrain. Trained to imitate PFNN-generated reference motion across 40 terrain cells, it serves as the teacher for Stage-2 CVAE (BFM) DAgger distillation.

TL;DR


Task	`Bfm-Proxy-V2-PfnnRough-Unitree-G1` (mjlab)
Robot	Unitree G1, 29 actuated DoF
Obs	1150-dim privileged (proprio history + per-body state + motion goal + 21×21 heightmap)
Action	29-dim joint position targets (PD), with 0–2 step action delay
Policy net	MLP `[2048, 2048, 1024, 1024, 512, 512]`, PPO (rsl_rl)
Training data	`motion_bag_v5` — 14.09 M frames / 84.6 k clips / 40 terrain cells (8 sub_types × 5 levels)
Scale	1024 envs × 13 500 iters, single H20 (~15 h)
Result	ep_len ~70–87, reward ~+1.5, fall-terminations ≈ 0
Role	Stage-1 teacher → Stage-2 CVAE BFM distillation
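As a rough sanity check on the figures above, the actor size can be estimated from the layer widths alone. This is a minimal sketch that assumes plain fully connected layers with biases and no extra parameters (normalizers, action noise, etc.); `mlp_param_count` is an illustrative helper, not part of the source repo.

```python
def mlp_param_count(in_dim, hidden, out_dim):
    """Weights plus biases of a plain fully connected MLP."""
    dims = [in_dim, *hidden, out_dim]
    return sum(i * o + o for i, o in zip(dims[:-1], dims[1:]))


OBS_DIM = 1150
ACT_DIM = 29
HIDDEN = [2048, 2048, 1024, 1024, 512, 512]

# The 21x21 heightmap accounts for 441 of the 1150 observation dims.
HEIGHTMAP_DIMS = 21 * 21
NON_HEIGHTMAP_DIMS = OBS_DIM - HEIGHTMAP_DIMS

ACTOR_PARAMS = mlp_param_count(OBS_DIM, HIDDEN, ACT_DIM)
ACTOR_FP32_MB = ACTOR_PARAMS * 4 / 1e6  # 4 bytes per float32 parameter

print(HEIGHTMAP_DIMS, NON_HEIGHTMAP_DIMS)  # 441 709
print(ACTOR_PARAMS)                        # 10503709
print(round(ACTOR_FP32_MB, 1))             # 42.0
```

Under these assumptions the actor alone is about 10.5 M parameters (~42 MB in fp32); the checkpoint is larger because it also stores the critic and optimizer state.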

Why "v2" — reward design

This is the v2 reward recipe. v1 combined a heavy linear reward curriculum over 10 penalties with a 0.5 m termination applied to all 14 bodies. As the penalties ramped, episode length collapsed from 28 to 10: the regularizers fought limb tracking, and bad_motion_body_pos turned the resulting limb excursions into early death, even though mean body-position error stayed at ~0.16 m throughout.

v2 reverts to the validated BeyondMimic / mjlab-stock minimal-shaping recipe:

Component	Weight / setting
Tracking rewards	`motion_body_pos/ori/lin_vel/ang_vel` = 1.0 each; `motion_global_root_pos/ori` = 0.5
Penalties (only 3)	`action_rate_l2` −0.1, `joint_limit` −10, `self_collisions` −10
Reward curriculum	none (empty)
Termination	3 error checks (position on z only): `anchor_pos_z` 0.25 + `anchor_ori` 0.8 + `ee_body_pos_z` 0.25 (4 end effectors: ankles + wrists), plus `motion_clip_end` (time-out)
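The table above can be read as a small configuration. The following is a minimal sketch of that reading, not the source repo's config: the expanded term names (for example `motion_body_lin_vel`) are inferred from the slash-separated shorthand, the weighted sum ignores any per-step scaling the framework may apply, and the error inputs to `termination_reason` are hypothetical scalars in the same units as the thresholds.

```python
# Weights copied from the table; expanded key names are an assumption.
REWARD_WEIGHTS = {
    "motion_global_root_pos": 0.5,
    "motion_global_root_ori": 0.5,
    "motion_body_pos": 1.0,
    "motion_body_ori": 1.0,
    "motion_body_lin_vel": 1.0,
    "motion_body_ang_vel": 1.0,
    "action_rate_l2": -0.1,
    "joint_limit": -10.0,
    "self_collisions": -10.0,
}

# Thresholds copied from the table.
ANCHOR_POS_Z_MAX = 0.25
ANCHOR_ORI_MAX = 0.8
EE_BODY_POS_Z_MAX = 0.25


def total_reward(terms):
    """Weighted sum of raw term values (no curriculum, weights are fixed)."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in terms.items())


def termination_reason(anchor_z_err, anchor_ori_err, ee_z_errs, clip_finished):
    """Return the name of the first triggered termination, or None."""
    if abs(anchor_z_err) > ANCHOR_POS_Z_MAX:
        return "anchor_pos_z"
    if anchor_ori_err > ANCHOR_ORI_MAX:
        return "anchor_ori"
    # ee_z_errs holds one z error per end effector (two ankles, two wrists).
    if any(abs(e) > EE_BODY_POS_Z_MAX for e in ee_z_errs):
        return "ee_body_pos_z"
    if clip_finished:
        return "motion_clip_end"  # time-out, not a failure
    return None


# A wrist drifting 0.3 above its reference height ends the episode.
print(termination_reason(0.05, 0.1, [0.0, 0.0, 0.3, 0.0], False))
```

Only the z component of position error enters the checks here, which is what lets limbs deviate horizontally without ending the episode.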

Privileged proxy observations (per-body state + motion goal + heightmap) and the sim2real domain-randomization stack (link-mass / PD-gain / friction / CoM / push / torque-RFI / action-delay) are kept — they are orthogonal to the v1 collapse. Full rationale + data: see docs/proxy_reward_design.md in the source repo.
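To illustrate one item of the randomization stack, here is a minimal sketch of a 0–2 step action delay. It is not the source repo's implementation: the class name `DelayedActionBuffer` is hypothetical, and sampling the delay once per episode reset (rather than per step) is an assumption.

```python
import random
from collections import deque


class DelayedActionBuffer:
    """Replays actions with a random integer delay in [0, max_delay] steps."""

    def __init__(self, max_delay=2, seed=None):
        self.max_delay = max_delay
        self.delay = 0
        self._rng = random.Random(seed)
        self._history = deque(maxlen=max_delay + 1)

    def reset(self, initial_action):
        """Fill the history with a default action and resample the delay."""
        self._history.clear()
        for _ in range(self.max_delay + 1):
            self._history.append(list(initial_action))
        self.delay = self._rng.randint(0, self.max_delay)  # inclusive bounds

    def step(self, action):
        """Store the newest action and return the one issued `delay` steps ago."""
        self._history.append(list(action))
        return self._history[-1 - self.delay]


buf = DelayedActionBuffer(max_delay=2, seed=0)
buf.reset([0.0] * 29)          # 29 joint position targets
applied = buf.step([0.1] * 29)
print(buf.delay, applied[0])
```

With a delay of 0 the newest action is applied immediately; with a delay of 2 the policy's output reaches the PD controller two control steps late.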

Intended use & role

Primary: teacher policy whose privileged rollouts are distilled into the deployable Stage-2 CVAE BFM (masked unified control interface) via DAgger.
Secondary: a PHP-style "specialist on a single mode" baseline for ablation.
Not directly deployable on hardware: observations are privileged (full sim state + heightmap), not the 25-step proprioceptive history the deployable BFM uses.

Checkpoint contents

model_13499.pt (full rsl_rl checkpoint, ~241 MB) — keys: actor_state_dict, critic_state_dict, optimizer_state_dict, iter, infos. Use actor_state_dict for inference; the rest is for resuming training.

ONNX export is not included (the run's auto-export failed to serialize; a clean export can be regenerated from the actor if a deployment graph is needed).

How to load (inference)

import torch
ckpt = torch.load("model_13499.pt", map_location="cpu")
actor_sd = ckpt["actor_state_dict"]   # MLP 1150 -> [2048,2048,1024,1024,512,512] -> 29
# Rebuild via the source repo's runner cfg (bfm_design/tasks/proxy_v1_rl_cfg.py),
# register the task (import bfm_design.tasks) and load into the rsl_rl MLP actor.

Reproduce the exact env (obs/action layout, terrain) from the source repo at the pinned commit; the rasterized terrain is included there (assets/terrain/mjlab_terrain_rasterized_v3h.npz).
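Before rebuilding the actor, it can help to confirm that the checkpoint matches the layout stated in this card. The helper below is a sketch that reads layer shapes from any state-dict-like mapping; `mlp_layout` is an illustrative name, the key names in the stand-in are hypothetical, and the real `actor_state_dict` may contain extra tensors that the 2-D filter does not anticipate.

```python
from types import SimpleNamespace


def mlp_layout(state_dict):
    """Return [(in_features, out_features), ...] for each 2-D weight, in order."""
    layout = []
    for name, tensor in state_dict.items():
        shape = tuple(tensor.shape)
        if name.endswith("weight") and len(shape) == 2:
            out_features, in_features = shape  # Linear weights are (out, in)
            layout.append((in_features, out_features))
    return layout


# Layout implied by the card: 1150 -> [2048, 2048, 1024, 1024, 512, 512] -> 29.
DIMS = [1150, 2048, 2048, 1024, 1024, 512, 512, 29]
EXPECTED = list(zip(DIMS[:-1], DIMS[1:]))

# Stand-in for ckpt["actor_state_dict"]; only `.shape` is needed here.
stand_in = {}
for idx, (n_in, n_out) in enumerate(EXPECTED):
    stand_in[f"layer{idx}.weight"] = SimpleNamespace(shape=(n_out, n_in))
    stand_in[f"layer{idx}.bias"] = SimpleNamespace(shape=(n_out,))

print(mlp_layout(stand_in) == EXPECTED)
```

With the real checkpoint, passing `actor_sd` from the snippet above in place of `stand_in` gives the same comparison; a mismatch would suggest a different commit or task config.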

Evaluation

Per-terrain eval, all 8 sub_types at level 4, 16 envs × 12 s (model_13499.pt):

terrain (lvl4)	track body err (m)	track joint err (rad)
flat	0.033	0.70
pyramid_stairs	0.051	0.72
pyramid_stairs_inv	0.061	0.99
hf_pyramid_slope	0.073	0.88
hf_pyramid_slope_inv	0.058	0.99
random_rough	0.062	0.99
wave_terrain	0.062	0.89
box_line	0.041	0.77

Body tracking error is 3.3–7.3 cm across all 8 lvl4 terrains, roughly 4–9× lower than the Phase-B v2 teacher (0.286 m) and 8–18× lower than the A3' student (0.59 m).

Training-curve summary (tensorboard): mean reward −2.7 → +1.5; mean ep_len 6 → ~70–87; ee_body_pos terminations 159 → ~1. Visually signed off via reference-ghost rollouts (scripts/render_v2_ghost_8terrain.py in the source repo).

Note: a naive survive_ratio over a fixed window reads 0 because motion_clip_end (a successful clip completion / time-out) is counted as "done"; actual fall terminations are ≈ 0.
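The difference between the two readings can be made concrete. This is a minimal sketch, assuming each environment logs the name of the first termination it hit in the evaluation window (or `None` if it never terminated); the function name and input format are illustrative, not the repo's evaluation script.

```python
# Terminations that mark a successful clip completion rather than a fall.
TIMEOUT_TERMS = {"motion_clip_end"}


def survive_ratios(first_done_reason):
    """Return (naive, fall_aware) survival ratios over a batch of envs."""
    n = len(first_done_reason)
    # Naive: any "done" counts as not surviving.
    naive = sum(r is None for r in first_done_reason) / n
    # Fall-aware: time-outs count as survival; only real failures count against.
    fall_aware = sum(
        r is None or r in TIMEOUT_TERMS for r in first_done_reason
    ) / n
    return naive, fall_aware


# 16 envs: 15 finish their clip, 1 trips the end-effector height check.
reasons = ["motion_clip_end"] * 15 + ["ee_body_pos_z"]
naive, fall_aware = survive_ratios(reasons)
print(naive, fall_aware)  # 0.0 0.9375
```

When the window is longer than the reference clips, every environment eventually hits `motion_clip_end`, so the naive ratio is 0 even for a policy that never falls.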

Limitations

Privileged obs ⇒ teacher-only, not sim2real-ready as-is.
The 0.25 m ee_body_pos_z termination is tight for rough terrain: ep_len is slow to rise over roughly the first 1500 iterations, until the policy learns to keep feet and wrists within tolerance.
Trained on PFNN-derived reference motion (walk/jog/crouch gaits); behaviors outside the reference distribution are out of scope (handled later by BFM residual learning).

Source, data, citation

HF repo: tRNAoOO/<name> — public + gated (contact-info gate), matching the Robo-PFNN weights repo.
Code (pinned): GitLab hqfu/bfm-design (internal). Reward design: docs/proxy_reward_design.md; data regen: docs/data_regeneration.md.
Base framework: mjlab v1.3.0 + MuJoCo-Warp + rsl_rl.
Reference motion: Robo-PFNN (kinematic generator).
Reward recipe lineage: BeyondMimic (arXiv 2508.08241 / HybridRobotics/whole_body_tracking); PARC (arXiv 2505.04002); DeepMimic.
Target architecture: BFM (arXiv 2509.13780) — CVAE + masked unified control interface.

Motion-bag training data (24 GB) is not distributed; regenerate deterministically per docs/data_regeneration.md.
