
OpenVLA-7B + BridgeData V2 LoRA adapter

A LoRA adapter (rank 32) fine-tuned on top of openvla/openvla-7b on the BridgeData V2 dataset (bridge_orig, the version from the official Bridge V2 project website), following the standard LoRA fine-tuning recipe in the OpenVLA repo.

Files

  • adapter_model.safetensors: LoRA weights (~463 MB)
  • adapter_config.json: PEFT config (r=32, alpha=16, dropout=0.0)
  • dataset_statistics.json: bridge_orig action normalization stats, needed by predict_action(unnorm_key="bridge_orig"); see the snippet below
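
A quick sanity check of the statistics file (a minimal sketch; the nested key layout shown in the comment is an assumption based on the usual OpenVLA convention):

import json
from huggingface_hub import hf_hub_download

# Download the normalization statistics shipped with this adapter and inspect them.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    stats = json.load(f)

# Assumed layout (usual OpenVLA convention):
# {"bridge_orig": {"action": {"mean": [...], "std": [...], "q01": [...], "q99": [...]}, ...}}
print(stats["bridge_orig"]["action"].keys())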

Training setup

Base model: openvla/openvla-7b
Dataset: bridge_orig (BridgeData V2, project-website version)
LoRA rank: 32
LoRA alpha: 16
LoRA dropout: 0.0
Target modules: all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping; see the sketch below)
Batch size: 16 per GPU
Grad accumulation: 1
Effective batch: 16 × 8 GPUs = 128
Learning rate: 5e-4
Image augmentation: enabled (random resized crop, scale ≈ 0.9)
Hardware: 8× NVIDIA A100-SXM4-80GB
Steps: 195,000 gradient steps (≈ 2.5 × 10⁷ transitions)
Precision: bf16, FlashAttention-2
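
For reference, these settings correspond roughly to a PEFT LoraConfig like the sketch below; the explicit target_modules value and the init_lora_weights choice are assumptions, since the run relied on PEFT auto-mapping rather than a hand-written module list:

from peft import LoraConfig, get_peft_model

# Rough PEFT-level equivalent of the settings above (not copied from adapter_config.json).
lora_config = LoraConfig(
    r=32,                          # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # assumption: all q/k/v/o + MLP projections (+ lm_head)
    init_lora_weights="gaussian",  # assumption: OpenVLA-style initialization
)
# vla = get_peft_model(base_vla, lora_config)  # base_vla = the loaded openvla/openvla-7b model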

Training command (script: vla-scripts/finetune.py):

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <path-to-rlds-data> \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000

Quick offline evaluation

On 98 frames sampled from the bridge_orig val split (3 episodes, open-loop teacher-forcing; no simulator), per-dimension MAE was:

dim   dx     dy     dz     dRoll  dPitch  dYaw   gripper
MAE   0.004  0.007  0.007  0.033  0.041   0.040  0.053

For context, bridge_orig action q99 magnitudes are roughly 3e-2 for translation, 0.1–0.2 for rotation, and {0, 1} for the gripper. This is single-step open-loop accuracy, not closed-loop task success.
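
For reproducibility, a rough sketch of how such an open-loop MAE can be computed; it assumes processor and vla are already loaded as in the Usage section below, and load_val_frames() is a hypothetical helper yielding (image, instruction, ground-truth action) tuples from the bridge_orig val split:

import numpy as np

# load_val_frames() is a hypothetical helper that yields
# (PIL image, language instruction, 7-D ground-truth action) tuples.
errors = []
for image, instruction, gt_action in load_val_frames():
    prompt = f"In: What action should the robot take to {instruction}?\nOut:"
    inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
    pred = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    errors.append(np.abs(np.asarray(pred) - np.asarray(gt_action)))

mae = np.mean(np.stack(errors), axis=0)  # per-dimension mean absolute error
for name, value in zip(["dx", "dy", "dz", "dRoll", "dPitch", "dYaw", "gripper"], mae):
    print(f"{name}: {value:.3f}")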

Usage

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Load action normalization statistics for predict_action
import json
from huggingface_hub import hf_hub_download
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    vla.norm_stats = json.load(f)

from PIL import Image
img = Image.open("some_observation.png").convert("RGB")
inputs = processor("In: What action should the robot take to pick up the carrot?\nOut:", img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]

If you prefer to fold the LoRA weights into the base model (e.g. for slightly faster inference or to export a single checkpoint), call vla.merge_and_unload() first.
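
A minimal merging sketch (the output directory name is arbitrary):

# Fold the LoRA deltas into the base weights; the result is a plain
# AutoModelForVision2Seq with no PEFT wrapper around it.
merged = vla.merge_and_unload()
merged.save_pretrained("openvla-7b-bridge-merged")     # arbitrary output directory
processor.save_pretrained("openvla-7b-bridge-merged")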

License

MIT (matches OpenVLA upstream).
