# OpenVLA-7B + BridgeData V2 LoRA adapter
A rank-32 LoRA adapter fine-tuned on top of `openvla/openvla-7b` on the BridgeData V2 dataset (`bridge_orig`, the version from the official Bridge V2 project website), following the standard LoRA fine-tuning recipe in the OpenVLA repo.
## Files
- `adapter_model.safetensors`: LoRA weights (~463 MB)
- `adapter_config.json`: PEFT config (r=32, alpha=16, dropout=0.0)
- `dataset_statistics.json`: bridge_orig action normalization stats (needed by `predict_action(unnorm_key="bridge_orig")`)
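To sanity-check a download, PEFT can read `adapter_config.json` straight from the Hub without fetching the weights. A minimal sketch, using the repo id from the Usage section below:

```python
from peft import PeftConfig

# Reads adapter_config.json from the Hub; no weights are downloaded
cfg = PeftConfig.from_pretrained("RalphFH/openvla-7b")
print(cfg.r, cfg.lora_alpha, cfg.lora_dropout)  # expect: 32 16 0.0
print(cfg.base_model_name_or_path)              # expect: openvla/openvla-7b
```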
## Training setup
| Setting | Value |
|---|---|
| Base model | openvla/openvla-7b |
| Dataset | bridge_orig (BridgeData V2, project-website version) |
| LoRA rank | 32 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping) |
| Batch size | 16 per GPU |
| Grad accumulation | 1 |
| Effective batch | 16 × 8 GPUs = 128 |
| Learning rate | 5e-4 |
| Image augmentation | enabled (random resized crop, scale ≥ 0.9) |
| Hardware | 8× NVIDIA A100-SXM4-80GB |
| Steps | 195,000 gradient steps (≈ 2.5 × 10⁷ transitions) |
| Precision | bf16, FlashAttention-2 |
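In PEFT terms, the adapter settings in the table correspond roughly to the config below. This is a sketch only: in the OpenVLA recipe the config is built inside `vla-scripts/finetune.py`, and the exact `target_modules` value and weight init are assumptions here, with `"all-linear"` being PEFT's auto-mapping over every linear layer.

```python
from peft import LoraConfig

# Approximate adapter config implied by the table above (illustrative)
lora_cfg = LoraConfig(
    r=32,                         # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # q/k/v/o + MLP projections (+ lm_head)
    init_lora_weights="gaussian",
)
```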
Training command (script: `vla-scripts/finetune.py`):

```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <path-to-rlds-data> \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000
```
## Quick offline evaluation
On 98 frames sampled from the bridge_orig val split (3 episodes, open-loop teacher forcing; no simulator), per-dimension MAE was:
| dim | dx | dy | dz | dRoll | dPitch | dYaw | gripper |
|---|---|---|---|---|---|---|---|
| MAE | 0.004 | 0.007 | 0.007 | 0.033 | 0.041 | 0.040 | 0.053 |
For context, bridge_orig q99 action magnitudes are roughly 3e-2 for translation, 0.1–0.2 for rotation, and {0, 1} for the gripper. Note this measures single-step open-loop accuracy, not closed-loop task success.
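A minimal sketch of how this kind of per-dimension MAE can be computed. The arrays here are synthetic placeholders: in the real evaluation `pred_actions` would come from `vla.predict_action` on val-split frames and `gt_actions` from the RLDS episodes.

```python
import numpy as np

# Placeholder data: in the real eval these are (98, 7) arrays of
# unnormalized actions [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
rng = np.random.default_rng(0)
gt_actions = rng.normal(size=(98, 7))
pred_actions = gt_actions + rng.normal(scale=0.01, size=(98, 7))

dims = ["dx", "dy", "dz", "dRoll", "dPitch", "dYaw", "gripper"]
mae = np.abs(pred_actions - gt_actions).mean(axis=0)
print({d: round(float(e), 3) for d, e in zip(dims, mae)})
```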
## Usage
```python
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the base model, then attach the LoRA adapter
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Load action normalization statistics for predict_action.
# Set them on the underlying base model: predict_action is a method of the
# wrapped OpenVLA model, so it reads norm_stats from there, not from the
# PeftModel wrapper.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    base.norm_stats = json.load(f)

# Single-step prediction on one observation image
img = Image.open("some_observation.png").convert("RGB")
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"
inputs = processor(prompt, img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]
```
If you want to avoid the LoRA indirection at inference time, you can merge the adapter into the base weights first with `merge_and_unload()`, as sketched below.
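A one-line sketch (`merge_and_unload()` is PEFT API; it returns the base model with the LoRA deltas folded into its weights):

```python
# Fold the LoRA deltas into the base weights and drop the PEFT wrappers;
# subsequent predict_action calls run at plain base-model speed.
vla = vla.merge_and_unload()
```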
## License
MIT (matches OpenVLA upstream).