---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vision-language-action
- vla
- contrastive-reinforcement-learning
- goal-conditioned-rl
- qwen3-vl
- prts
- custom_code
language:
- en
---

# PRTS-4B — Primitive Reasoning and Tasking System


**PRTS-4B** is a **Vision–Language–Action (VLA) foundation model** that, for the first time, scales **reward-label-free contrastive RL** into VLA pre-training itself. By treating language instructions as goals and supervising a contrastive value head co-trained inside the same forward pass as behavior cloning, PRTS equips a **Qwen3-VL-4B** backbone with a quantitative, language-grounded sense of *how close the current state is to satisfying the instruction*. The released checkpoint is the result of pre-training on **~167 B tokens** of action-labeled and embodied-reasoning data on 64 × H100 GPUs.

📄 Paper: [arXiv:2604.27472](https://arxiv.org/abs/2604.27472) · 💻 Code: [github.com/TeleHuman/PRTS](https://github.com/TeleHuman/PRTS) · 🌐 Project: [rhodes-team-prts.github.io](https://rhodes-team-prts.github.io/)

## Highlights

- **Goal-reachability awareness, end-to-end.** The contrastive value head is co-trained inside the policy backbone — no separate value network, no curated reward dataset, no offline-RL post-training loop.
- **Reward-label-free.** Supervision comes purely from the temporal structure of demonstrations.
- **Out-of-distribution wins grow with the shift.** On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at ¼–⅛ the post-training compute, and the gap **widens** off-distribution: novel-instruction following (`+38.8` over π0.5), long-horizon execution, and recovery under human intervention.

## Loading the checkpoint

The released model ships its own `modeling_*.py`, `configuration_*.py`, and `processing_*.py` next to the weights, so it can be loaded directly via `transformers` with `trust_remote_code=True`. **No need to clone the GitHub repo for a smoke test.**

### Recommended environment

| Component | Note |
| :--- | :--- |
| Python | 3.10+ (3.11+ recommended) |
| `transformers` | pinned to `4.57.3` |
| PyTorch | recent CUDA build from [pytorch.org](https://pytorch.org) |

```bash
pip install "transformers==4.57.3" torch safetensors huggingface_hub \
    numpy pillow sentencepiece protobuf colorama tokenizers
pip install accelerate  # recommended for device_map="auto"
```

### From the Hub

```python
import torch
from transformers import AutoConfig, AutoModel, AutoProcessor

REPO_ID = "TeleEmbodied/PRTS-4B"

config = AutoConfig.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    REPO_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(REPO_ID, trust_remote_code=True)

print(config.model_type)          # prts_qwen3_vl
print(type(model).__name__)
print(type(processor).__name__)
```

## Prompt format

PRTS expects a **single user turn** containing camera images, a discretized proprioceptive state, and a language instruction, followed by an assistant turn that emits the action chunk. The full prompt is built from these constants (declared in `prts/constants.py` of the open-source repo):

| Token | Meaning |
| :--- | :--- |
| `<\|im_start\|>` `<\|im_end\|>` | Qwen-style turn delimiters |
| `<\|vision_start\|>` `<\|image_pad\|>` `<\|vision_end\|>` | One image placeholder block per camera |
| `<\|goal_repr\|>` | CRL value-head anchor token |

### Layout of one rollout step

```text
<|im_start|>system
You are a helpful physical assistant.<|im_end|>
<|im_start|>user
{cam_1_name}: <|vision_start|><|image_pad|><|vision_end|>
{cam_2_name}: <|vision_start|><|image_pad|><|vision_end|>
...
Proprioception (normalized to 0-1000 scale): {s_1} {s_2} ... {s_D}
Instruction: {language instruction}
Predict the next action chunk in low-level robotics action format.<|im_end|>
<|im_start|>assistant
<|action_start|><|action_token_1|>...<|action_token_999|><|action_end|><|im_end|>
```

### Field-by-field spec

- **System message:** fixed to `You are a helpful physical assistant.`
- **Image block:** one `{cam_name}: <|vision_start|><|image_pad|><|vision_end|>` line per camera.
- **Proprioceptive state:** the robot state is **q01/q99-normalized to `[-1, 1]`** per dimension (using stats from `compute_stats.py`), then linearly remapped to integers in `[0, 1000]` and rendered as a space-separated list. The line is prefixed with `Proprioception (normalized to 0-1000 scale): `. Out-of-range values are clipped to the normalization bounds, so every rendered integer lies in `[0, 1000]`. Omit the line entirely if the embodiment has no proprioception channel. A minimal sketch of this discretization and of assembling the user turn follows this list.
- **Instruction:** a free-form English natural-language goal (e.g. `Left gripper sequentially grasps two shoes and places them in the shoebox. Right gripper closes the shoebox.`).
- **Suffix:** always end the user turn with `Predict the next action chunk in low-level robotics action format.` when you want PRTS to generate actions.
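As a worked illustration of the spec above, here is a minimal sketch of the proprioception discretization and the user-turn assembly. It is not the repository's implementation: the helper names (`discretize_proprio`, `build_user_turn`), the exact percentile-normalization formula, and the example camera names are assumptions made for this card; see `compute_stats.py` and `prts/constants.py` in the GitHub repo for the authoritative pipeline.

```python
import numpy as np

def discretize_proprio(state, q01, q99):
    """Sketch: map a raw robot state to the 0-1000 integer scale described above.

    q01 / q99 are per-dimension percentile statistics (as produced by the repo's
    compute_stats.py); the formula is an assumption consistent with the card.
    """
    normed = 2.0 * (np.asarray(state) - q01) / (q99 - q01 + 1e-8) - 1.0  # q01/q99-normalize to [-1, 1]
    normed = np.clip(normed, -1.0, 1.0)                                  # clip out-of-range values
    return np.round((normed + 1.0) * 500.0).astype(int)                  # remap [-1, 1] -> integers in [0, 1000]

def build_user_turn(cam_names, proprio_ints, instruction):
    """Sketch: assemble the text of one user turn per the layout above.

    Images are supplied to the processor separately, one per <|image_pad|>
    placeholder; turn delimiters and the system message come from the chat template.
    """
    lines = [f"{name}: <|vision_start|><|image_pad|><|vision_end|>" for name in cam_names]
    if proprio_ints is not None:  # omit the line for embodiments without proprioception
        lines.append("Proprioception (normalized to 0-1000 scale): "
                     + " ".join(str(v) for v in proprio_ints))
    lines.append(f"Instruction: {instruction}")
    lines.append("Predict the next action chunk in low-level robotics action format.")
    return "\n".join(lines)

# Example usage (hypothetical camera names and state dimensions):
state_ints = discretize_proprio([0.12, -0.40, 0.88],
                                q01=np.array([-1.0, -1.0, -1.0]),
                                q99=np.array([1.0, 1.0, 1.0]))
prompt = build_user_turn(["cam_high", "cam_left_wrist"], state_ints,
                         "Place the mug on the shelf.")
```

Whether the released `processing_*.py` applies the turn delimiters and system message automatically, or expects the fully formatted string, is repo-specific; check the custom processor code shipped with the weights.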
## License

This model is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Free for academic and non-commercial research; commercial use is **not** permitted under this license.

---

## Citation

If you find PRTS useful, please cite:

```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```

---

## Acknowledgements

PRTS builds on [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [FlashAttention](https://github.com/Dao-AILab/flash-attention), [LeRobot](https://github.com/huggingface/lerobot), and [OpenPI](https://github.com/openpilab/openpi). We thank the authors of [Contrastive RL](https://github.com/google-research/google-research/tree/master/contrastive_rl) for the ideas behind the contrastive value formulation.