
OpenVLA-7B + BridgeData V2 LoRA adapter

A LoRA adapter (rank 32) fine-tuned on top of openvla/openvla-7b on the BridgeData V2 dataset (bridge_orig, the version from the official Bridge V2 project website), following the standard LoRA fine-tuning recipe in the OpenVLA repo.

Files

  • adapter_model.safetensors: LoRA weights (~463 MB)
  • adapter_config.json: PEFT config (r=32, alpha=16, dropout=0.0)
  • dataset_statistics.json: bridge_orig action normalization stats, needed by predict_action(unnorm_key="bridge_orig"); see the snippet below
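
A quick sanity check of the statistics file (a minimal sketch; the nested key layout shown in the comment is an assumption based on the usual OpenVLA convention):

import json
from huggingface_hub import hf_hub_download

# Download the normalization statistics shipped with this adapter and inspect them.
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    stats = json.load(f)

# Assumed layout (usual OpenVLA convention):
# {"bridge_orig": {"action": {"mean": [...], "std": [...], "q01": [...], "q99": [...]}, ...}}
print(stats["bridge_orig"]["action"].keys())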

Training setup

Base model: openvla/openvla-7b
Dataset: bridge_orig (BridgeData V2, project-website version)
LoRA rank: 32
LoRA alpha: 16
LoRA dropout: 0.0
Target modules: all q/k/v/o + MLP projections + lm_head (PEFT auto-mapping; see the sketch below)
Batch size: 16 per GPU
Grad accumulation: 1
Effective batch: 16 × 8 GPUs = 128
Learning rate: 5e-4
Image augmentation: enabled (random resized crop, scale ≈ 0.9)
Hardware: 8× NVIDIA A100-SXM4-80GB
Steps: 195,000 gradient steps (≈ 2.5 × 10⁷ transitions)
Precision: bf16, FlashAttention-2
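
For reference, these settings correspond roughly to a PEFT LoraConfig like the sketch below; the explicit target_modules value and the init_lora_weights choice are assumptions, since the run relied on PEFT auto-mapping rather than a hand-written module list:

from peft import LoraConfig, get_peft_model

# Rough PEFT-level equivalent of the settings above (not copied from adapter_config.json).
lora_config = LoraConfig(
    r=32,                          # LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",   # assumption: all q/k/v/o + MLP projections (+ lm_head)
    init_lora_weights="gaussian",  # assumption: OpenVLA-style initialization
)
# vla = get_peft_model(base_vla, lora_config)  # base_vla = the loaded openvla/openvla-7b model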

Training command (script: vla-scripts/finetune.py):

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py \
  --vla_path openvla/openvla-7b \
  --data_root_dir <path-to-rlds-data> \
  --dataset_name bridge_orig \
  --run_root_dir runs --adapter_tmp_dir adapter-tmp \
  --lora_rank 32 --batch_size 16 --grad_accumulation_steps 1 \
  --learning_rate 5e-4 --image_aug True \
  --save_steps 5000 --max_steps 200000

Quick offline evaluation

On 98 frames sampled from the bridge_orig val split (3 episodes, open-loop teacher-forcing; no simulator), per-dimension MAE was:

dim   dx     dy     dz     dRoll  dPitch  dYaw   gripper
MAE   0.004  0.007  0.007  0.033  0.041   0.040  0.053

For context, bridge_orig action q99 magnitudes are roughly 3e-2 for translation, 0.1–0.2 for rotation, and {0, 1} for the gripper. This is single-step open-loop accuracy, not closed-loop task success.
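
For reproducibility, a rough sketch of how such an open-loop MAE can be computed; it assumes processor and vla are already loaded as in the Usage section below, and load_val_frames() is a hypothetical helper yielding (image, instruction, ground-truth action) tuples from the bridge_orig val split:

import numpy as np

# load_val_frames() is a hypothetical helper that yields
# (PIL image, language instruction, 7-D ground-truth action) tuples.
errors = []
for image, instruction, gt_action in load_val_frames():
    prompt = f"In: What action should the robot take to {instruction}?\nOut:"
    inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
    pred = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    errors.append(np.abs(np.asarray(pred) - np.asarray(gt_action)))

mae = np.mean(np.stack(errors), axis=0)  # per-dimension mean absolute error
for name, value in zip(["dx", "dy", "dz", "dRoll", "dPitch", "dYaw", "gripper"], mae):
    print(f"{name}: {value:.3f}")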

Usage

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
base = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).to("cuda")
vla = PeftModel.from_pretrained(base, "RalphFH/openvla-7b")

# Load action normalization statistics for predict_action
import json
from huggingface_hub import hf_hub_download
stats_path = hf_hub_download("RalphFH/openvla-7b", "dataset_statistics.json")
with open(stats_path) as f:
    vla.norm_stats = json.load(f)

from PIL import Image
img = Image.open("some_observation.png").convert("RGB")
inputs = processor("In: What action should the robot take to pick up the carrot?\nOut:", img).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-D: [dx, dy, dz, dRoll, dPitch, dYaw, gripper]

If you prefer to fold the LoRA weights into the base model (e.g. for slightly faster inference or to export a single checkpoint), call vla.merge_and_unload() first.
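
A minimal merging sketch (the output directory name is arbitrary):

# Fold the LoRA deltas into the base weights; the result is a plain
# AutoModelForVision2Seq with no PEFT wrapper around it.
merged = vla.merge_and_unload()
merged.save_pretrained("openvla-7b-bridge-merged")     # arbitrary output directory
processor.save_pretrained("openvla-7b-bridge-merged")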

License

MIT (matches OpenVLA upstream).
