Robotics
LeRobot
Safetensors
smolvla
so101
imitation-learning
isaaclab
sim
multi-task
code-as-policies
CoRL2026
HyeonseokE committed · Commit 45f76f1 · verified · 1 Parent(s): 2a561dc

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ license: apache-2.0
+ library_name: lerobot
+ base_model: lerobot/smolvla_base
+ pipeline_tag: robotics
+ tags:
+ - lerobot
+ - smolvla
+ - robotics
+ - so101
+ - imitation-learning
+ - isaaclab
+ - sim
+ - multi-task
+ - code-as-policies
+ - CoRL2026
+ datasets:
+ - CoRL2026-CSI/Isaaclab-so101_11task_baseCaP_3300epi_10fps
+ ---
+
+ # smolVLA · IsaacLab SO101 Multi-Task (11 tasks, 8 epochs)
+
+ A SmolVLA policy fine-tuned from [lerobot/smolvla_base](https://huggingface.co/lerobot/smolvla_base) on the IsaacLab-simulated SO101 **11-task multi-task** dataset
+ [CoRL2026-CSI/Isaaclab-so101_11task_baseCaP_3300epi_10fps](https://huggingface.co/datasets/CoRL2026-CSI/Isaaclab-so101_11task_baseCaP_3300epi_10fps)
+ for 8 epochs.
+
+ This checkpoint is the **full model** (`model.safetensors`), not a LoRA adapter, and can be loaded and used as-is.
+
+ ## Model details
+
+ - **Base model**: `lerobot/smolvla_base` (SmolVLM2-500M-Video-Instruct VLM + action expert)
+ - **Robot**: SO101 (6-DOF, including the gripper), simulated in IsaacLab
+ - **Cameras**: `top`, `left_wrist` (480×640), renamed to the policy keys `camera1` (left_wrist) and `camera2` (top)
+ - **Inputs**: `observation.state`[6] + 2 cameras + language instruction (task)
+ - **Output**: `action`[6] (joint position)
+ - **Action chunking**: `chunk_size=50`, `n_action_steps=50`
+
+ ## Training method
+
+ **VLM frozen + action expert only**, the official standard SmolVLA training recipe ([SmolVLA paper, arXiv:2506.01844](https://arxiv.org/abs/2506.01844)).
+
+ | Component | Status |
+ |---|---|
+ | VLM backbone (SmolVLM2) | ❄️ **Fully frozen** (`freeze_vision_encoder=true`) |
+ | Action expert | 🔥 **Trained** (`train_expert_only=true`) |
+ | PEFT / LoRA | Not used |
+
+ ## Training hyperparameters
+
+ | Item | Value |
+ |---|---|
+ | Dataset | [Isaaclab-so101_11task_baseCaP_3300epi_10fps](https://huggingface.co/datasets/CoRL2026-CSI/Isaaclab-so101_11task_baseCaP_3300epi_10fps): 3,300 episodes / 1,175,352 frames / 11 tasks / 10 fps |
+ | Epochs / Steps | 8 epochs / 36,800 steps |
+ | Global batch size | 256 (micro batch 128 × 2 GPUs) |
+ | Optimizer | AdamW: lr `1e-4`, weight_decay `1e-10`, grad_clip_norm `10.0` |
+ | LR scheduler | cosine_decay_with_warmup: warmup 1,000 steps / decay 30,000 steps / peak_lr `1e-4` / decay_lr `2.5e-6` |
+ | chunk_size / n_action_steps | 50 / 50 |
+ | Seed | 1000 |
+ | Dataloader workers | 16 |
+ | Mixed precision | no (bf16 inference) |
+ | Image augmentation | ColorJitter (brightness/contrast/saturation/hue) + SharpnessJitter; **no geometric transforms (rotation/translation/flip)**, to preserve the left/right semantics the VLA relies on |
+ | Hardware | 2 × NVIDIA H100 80GB |
+ | Final loss | 0.020 |
+
+ ## Camera rename
+
+ Mapping between the LeRobot dataset camera keys and the SmolVLA policy keys:
+
+ | Dataset key | Policy key |
+ |---|---|
+ | `observation.images.left_wrist` | `observation.images.camera1` |
+ | `observation.images.top` | `observation.images.camera2` |
+
+ > The same rename must be applied at inference and evaluation time (train/inference consistency).
+
+ ## Input / Output specification
+
+ - **Input**: only `observation.state`[6] (joint position) + 2 cameras + language instruction (task)
+ - **Output**: only `action`[6] (joint position)
+ - The dataset fields `ee_pos` / `gripper_binary` / `state.radian_urdf0` / `action.radian_urdf0` are excluded from training
+ - The SmolVLA policy has a fixed set of 3 camera slots (`camera1/2/3`), so a `camera3` slot exists in the config; the dataset has only 2 cameras, so only 2 cameras actually carry data.
+
+ ## Usage
+
+ ```python
+ from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
+
+ policy = SmolVLAPolicy.from_pretrained("CoRL2026-CSI/smolVLA-IsaacLab-Multi-Task-8epoch-mod")
+ ```
+
+ ## Citation / Acknowledgement
+
+ Built on top of [LeRobot](https://github.com/huggingface/lerobot) and the
+ [SmolVLA](https://huggingface.co/lerobot/smolvla_base) base checkpoint. Project: CoRL 2026 CSI submission.
+
+ ### Framework versions
+
+ - LeRobot 0.5.2
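
The "8 epochs / 36,800 steps" figure in the README's hyperparameter table can be cross-checked against the dataset size. The sketch below uses only numbers stated in the README (1,175,352 frames, micro batch 128 on 2 GPUs, 36,800 steps). It assumes every optimizer step consumes exactly one global batch and that no frames are dropped; the helper name `epochs_covered` is hypothetical and not part of LeRobot.

```python
def epochs_covered(num_frames: int, global_batch: int, steps: int) -> float:
    """Return how many passes over the dataset `steps` optimizer steps make."""
    samples_seen = steps * global_batch
    return samples_seen / num_frames


NUM_FRAMES = 1_175_352   # frames in the dataset (README table)
GLOBAL_BATCH = 128 * 2   # micro batch 128 on each of 2 GPUs
STEPS = 36_800           # total optimizer steps (README table)

steps_per_epoch = NUM_FRAMES / GLOBAL_BATCH
epochs = epochs_covered(NUM_FRAMES, GLOBAL_BATCH, STEPS)

print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"epochs covered:  {epochs:.3f}")
```

Under these assumptions one epoch is roughly 4,591 steps, so 36,800 steps come to slightly more than 8 passes over the data, which agrees with the table.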
config.json ADDED
@@ -0,0 +1,101 @@
+ {
+     "type": "smolvla",
+     "n_obs_steps": 1,
+     "input_features": {
+         "observation.state": {
+             "type": "STATE",
+             "shape": [
+                 6
+             ]
+         },
+         "observation.images.camera1": {
+             "type": "VISUAL",
+             "shape": [
+                 3,
+                 256,
+                 256
+             ]
+         },
+         "observation.images.camera2": {
+             "type": "VISUAL",
+             "shape": [
+                 3,
+                 256,
+                 256
+             ]
+         },
+         "observation.images.camera3": {
+             "type": "VISUAL",
+             "shape": [
+                 3,
+                 256,
+                 256
+             ]
+         }
+     },
+     "output_features": {
+         "action": {
+             "type": "ACTION",
+             "shape": [
+                 6
+             ]
+         }
+     },
+     "device": "cuda",
+     "use_amp": false,
+     "use_peft": false,
+     "push_to_hub": false,
+     "repo_id": null,
+     "private": null,
+     "tags": null,
+     "license": null,
+     "pretrained_path": "lerobot/smolvla_base",
+     "chunk_size": 50,
+     "n_action_steps": 50,
+     "normalization_mapping": {
+         "VISUAL": "IDENTITY",
+         "STATE": "MEAN_STD",
+         "ACTION": "MEAN_STD"
+     },
+     "max_state_dim": 32,
+     "max_action_dim": 32,
+     "resize_imgs_with_padding": [
+         512,
+         512
+     ],
+     "empty_cameras": 0,
+     "adapt_to_pi_aloha": false,
+     "use_delta_joint_actions_aloha": false,
+     "tokenizer_max_length": 48,
+     "num_steps": 50,
+     "use_cache": true,
+     "freeze_vision_encoder": true,
+     "train_expert_only": true,
+     "train_state_proj": true,
+     "optimizer_lr": 0.0001,
+     "optimizer_betas": [
+         0.9,
+         0.95
+     ],
+     "optimizer_eps": 1e-08,
+     "optimizer_weight_decay": 1e-10,
+     "optimizer_grad_clip_norm": 10.0,
+     "scheduler_warmup_steps": 1000,
+     "scheduler_decay_steps": 30000,
+     "scheduler_decay_lr": 2.5e-06,
+     "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
+     "load_vlm_weights": true,
+     "add_image_special_tokens": false,
+     "attention_mode": "cross_attn",
+     "prefix_length": 0,
+     "pad_language_to": "max_length",
+     "num_expert_layers": 0,
+     "num_vlm_layers": 16,
+     "self_attn_every_n_layers": 2,
+     "expert_width_multiplier": 0.75,
+     "min_period": 0.004,
+     "max_period": 4.0,
+     "rtc_config": null,
+     "compile_model": false,
+     "compile_mode": "max-autotune"
+ }
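
The `optimizer_lr`, `scheduler_warmup_steps`, `scheduler_decay_steps` and `scheduler_decay_lr` entries above describe a warmup followed by a cosine decay. The sketch below shows one common form of such a schedule with those four values plugged in. It is not LeRobot's implementation: the linear warmup from zero, the cosine running from the end of warmup to `decay_steps`, and holding the floor value afterwards are all assumptions, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step: int,
          peak_lr: float = 1e-4,
          decay_lr: float = 2.5e-6,
          warmup_steps: int = 1_000,
          decay_steps: int = 30_000) -> float:
    """Learning rate at `step` for a warmup + cosine-decay schedule (sketch)."""
    if step < warmup_steps:
        # Assumed: linear ramp from 0 up to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Assumed: cosine from peak_lr (end of warmup) down to decay_lr
    # (at decay_steps), then held at decay_lr for any later step.
    clipped = min(step, decay_steps)
    progress = (clipped - warmup_steps) / (decay_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


for s in (0, 500, 1_000, 15_500, 30_000, 36_800):
    print(s, f"{lr_at(s):.3e}")
```

If the real scheduler behaves like this sketch, the run's 36,800 steps extend 6,800 steps past `scheduler_decay_steps`, so the tail of training would run at the floor learning rate of `2.5e-6`.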
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f1e72bfa62ed7e65b00ff2bcceb79857b230c25e7bfe60fad470bc5cd68017a3
+ size 906712520
policy_postprocessor.json ADDED
@@ -0,0 +1,32 @@
+ {
+     "name": "policy_postprocessor",
+     "steps": [
+         {
+             "registry_name": "unnormalizer_processor",
+             "config": {
+                 "eps": 1e-08,
+                 "features": {
+                     "action": {
+                         "type": "ACTION",
+                         "shape": [
+                             6
+                         ]
+                     }
+                 },
+                 "norm_map": {
+                     "VISUAL": "IDENTITY",
+                     "STATE": "MEAN_STD",
+                     "ACTION": "MEAN_STD"
+                 }
+             },
+             "state_file": "policy_postprocessor_step_0_unnormalizer_processor.safetensors"
+         },
+         {
+             "registry_name": "device_processor",
+             "config": {
+                 "device": "cpu",
+                 "float_dtype": null
+             }
+         }
+     ]
+ }
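
The postprocessor above maps the policy's normalized `action` output back to joint positions using `MEAN_STD` statistics. The sketch below illustrates plain mean/std normalization and its inverse on a 6-dimensional vector. The statistics are made-up placeholders (the real ones live in the `.safetensors` state file), and where `eps` enters the formula is an assumption rather than something confirmed from LeRobot's source.

```python
from typing import List, Sequence


def normalize(x: Sequence[float], mean: Sequence[float],
              std: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Map raw values to zero-mean, unit-variance space."""
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]


def unnormalize(z: Sequence[float], mean: Sequence[float],
                std: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Invert `normalize`: map normalized values back to raw units."""
    return [zi * (s + eps) + m for zi, m, s in zip(z, mean, std)]


# Placeholder statistics for 6 joints (illustrative only, not the real stats).
MEAN = [0.0, -0.5, 0.3, 0.1, 0.0, 0.2]
STD = [0.4, 0.6, 0.5, 0.3, 0.2, 0.1]

raw_action = [0.1, -0.2, 0.8, 0.0, -0.1, 0.25]
z = normalize(raw_action, MEAN, STD)
restored = unnormalize(z, MEAN, STD)
print(restored)
```

The round trip recovers the raw action up to floating-point error, which is the property the pre- and postprocessor pair relies on when they share the same statistics file.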
policy_postprocessor_step_0_unnormalizer_processor.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:357c326340d754195872d27be0384d9cafbaa1fbcfe7d9b4b38180f7f8d47a06
+ size 16796
policy_preprocessor.json ADDED
@@ -0,0 +1,90 @@
+ {
+     "name": "policy_preprocessor",
+     "steps": [
+         {
+             "registry_name": "rename_observations_processor",
+             "config": {
+                 "rename_map": {
+                     "observation.images.left_wrist": "observation.images.camera1",
+                     "observation.images.top": "observation.images.camera2"
+                 }
+             }
+         },
+         {
+             "registry_name": "to_batch_processor",
+             "config": {}
+         },
+         {
+             "registry_name": "smolvla_new_line_processor",
+             "config": {}
+         },
+         {
+             "registry_name": "tokenizer_processor",
+             "config": {
+                 "max_length": 48,
+                 "task_key": "task",
+                 "padding_side": "right",
+                 "padding": "max_length",
+                 "truncation": true,
+                 "tokenizer_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
+             }
+         },
+         {
+             "registry_name": "device_processor",
+             "config": {
+                 "device": "cuda",
+                 "float_dtype": null
+             }
+         },
+         {
+             "registry_name": "normalizer_processor",
+             "config": {
+                 "eps": 1e-08,
+                 "features": {
+                     "observation.state": {
+                         "type": "STATE",
+                         "shape": [
+                             6
+                         ]
+                     },
+                     "observation.images.camera1": {
+                         "type": "VISUAL",
+                         "shape": [
+                             3,
+                             256,
+                             256
+                         ]
+                     },
+                     "observation.images.camera2": {
+                         "type": "VISUAL",
+                         "shape": [
+                             3,
+                             256,
+                             256
+                         ]
+                     },
+                     "observation.images.camera3": {
+                         "type": "VISUAL",
+                         "shape": [
+                             3,
+                             256,
+                             256
+                         ]
+                     },
+                     "action": {
+                         "type": "ACTION",
+                         "shape": [
+                             6
+                         ]
+                     }
+                 },
+                 "norm_map": {
+                     "VISUAL": "IDENTITY",
+                     "STATE": "MEAN_STD",
+                     "ACTION": "MEAN_STD"
+                 }
+             },
+             "state_file": "policy_preprocessor_step_5_normalizer_processor.safetensors"
+         }
+     ]
+ }
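
The first preprocessor step renames dataset camera keys to the policy's `camera1` / `camera2` slots. The sketch below shows the effect of that `rename_map` on a plain observation dictionary. It is an illustration of the mapping only, not LeRobot's `rename_observations_processor`; the function `rename_observation` and the string placeholders standing in for image tensors are hypothetical.

```python
from typing import Any, Dict

# Mapping copied from the preprocessor config above.
RENAME_MAP = {
    "observation.images.left_wrist": "observation.images.camera1",
    "observation.images.top": "observation.images.camera2",
}


def rename_observation(obs: Dict[str, Any],
                       rename_map: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of `obs` with keys renamed; unmapped keys pass through."""
    return {rename_map.get(key, key): value for key, value in obs.items()}


# Placeholder observation: strings stand in for image tensors and state.
observation = {
    "observation.state": "state[6]",
    "observation.images.left_wrist": "wrist image",
    "observation.images.top": "top image",
    "task": "language instruction",
}

renamed = rename_observation(observation, RENAME_MAP)
print(sorted(renamed))
```

Keys that are not in the map, such as `observation.state` and `task`, are left unchanged, so the same function can be applied to every observation at evaluation time.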
policy_preprocessor_step_5_normalizer_processor.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:357c326340d754195872d27be0384d9cafbaa1fbcfe7d9b4b38180f7f8d47a06
+ size 16796
train_config.json ADDED
@@ -0,0 +1,230 @@
+ {
+     "dataset": {
+         "repo_id": "CoRL2026-CSI/Isaaclab-so101_11task_baseCaP_3300epi_10fps",
+         "root": null,
+         "episodes": null,
+         "image_transforms": {
+             "enable": true,
+             "max_num_transforms": 3,
+             "random_order": true,
+             "tfs": {
+                 "brightness": {
+                     "weight": 1.0,
+                     "type": "ColorJitter",
+                     "kwargs": {
+                         "brightness": [
+                             0.8,
+                             1.2
+                         ]
+                     }
+                 },
+                 "contrast": {
+                     "weight": 1.0,
+                     "type": "ColorJitter",
+                     "kwargs": {
+                         "contrast": [
+                             0.8,
+                             1.2
+                         ]
+                     }
+                 },
+                 "saturation": {
+                     "weight": 1.0,
+                     "type": "ColorJitter",
+                     "kwargs": {
+                         "saturation": [
+                             0.5,
+                             1.5
+                         ]
+                     }
+                 },
+                 "hue": {
+                     "weight": 1.0,
+                     "type": "ColorJitter",
+                     "kwargs": {
+                         "hue": [
+                             -0.05,
+                             0.05
+                         ]
+                     }
+                 },
+                 "sharpness": {
+                     "weight": 1.0,
+                     "type": "SharpnessJitter",
+                     "kwargs": {
+                         "sharpness": [
+                             0.5,
+                             1.5
+                         ]
+                     }
+                 }
+             }
+         },
+         "revision": null,
+         "use_imagenet_stats": true,
+         "video_backend": "torchcodec",
+         "streaming": false
+     },
+     "env": null,
+     "policy": {
+         "type": "smolvla",
+         "n_obs_steps": 1,
+         "input_features": {
+             "observation.state": {
+                 "type": "STATE",
+                 "shape": [
+                     6
+                 ]
+             },
+             "observation.images.camera1": {
+                 "type": "VISUAL",
+                 "shape": [
+                     3,
+                     256,
+                     256
+                 ]
+             },
+             "observation.images.camera2": {
+                 "type": "VISUAL",
+                 "shape": [
+                     3,
+                     256,
+                     256
+                 ]
+             },
+             "observation.images.camera3": {
+                 "type": "VISUAL",
+                 "shape": [
+                     3,
+                     256,
+                     256
+                 ]
+             }
+         },
+         "output_features": {
+             "action": {
+                 "type": "ACTION",
+                 "shape": [
+                     6
+                 ]
+             }
+         },
+         "device": "cuda",
+         "use_amp": false,
+         "use_peft": false,
+         "push_to_hub": false,
+         "repo_id": null,
+         "private": null,
+         "tags": null,
+         "license": null,
+         "pretrained_path": "lerobot/smolvla_base",
+         "chunk_size": 50,
+         "n_action_steps": 50,
+         "normalization_mapping": {
+             "VISUAL": "IDENTITY",
+             "STATE": "MEAN_STD",
+             "ACTION": "MEAN_STD"
+         },
+         "max_state_dim": 32,
+         "max_action_dim": 32,
+         "resize_imgs_with_padding": [
+             512,
+             512
+         ],
+         "empty_cameras": 0,
+         "adapt_to_pi_aloha": false,
+         "use_delta_joint_actions_aloha": false,
+         "tokenizer_max_length": 48,
+         "num_steps": 50,
+         "use_cache": true,
+         "freeze_vision_encoder": true,
+         "train_expert_only": true,
+         "train_state_proj": true,
+         "optimizer_lr": 0.0001,
+         "optimizer_betas": [
+             0.9,
+             0.95
+         ],
+         "optimizer_eps": 1e-08,
+         "optimizer_weight_decay": 1e-10,
+         "optimizer_grad_clip_norm": 10.0,
+         "scheduler_warmup_steps": 1000,
+         "scheduler_decay_steps": 30000,
+         "scheduler_decay_lr": 2.5e-06,
+         "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
+         "load_vlm_weights": true,
+         "add_image_special_tokens": false,
+         "attention_mode": "cross_attn",
+         "prefix_length": 0,
+         "pad_language_to": "max_length",
+         "num_expert_layers": 0,
+         "num_vlm_layers": 16,
+         "self_attn_every_n_layers": 2,
+         "expert_width_multiplier": 0.75,
+         "min_period": 0.004,
+         "max_period": 4.0,
+         "rtc_config": null,
+         "compile_model": false,
+         "compile_mode": "max-autotune"
+     },
+     "output_dir": "/home/csiwoo/CORL2026/lerobot/outputs/train/smolvla_isaaclab_so101_multitask_8ep_20260516_103222",
+     "job_name": "smolvla_isaaclab_so101_multitask_8ep_20260516_103222",
+     "resume": false,
+     "seed": 1000,
+     "cudnn_deterministic": false,
+     "num_workers": 16,
+     "persistent_workers": true,
+     "batch_size": 128,
+     "gradient_accumulation_steps": 1,
+     "steps": 36800,
+     "eval_freq": 0,
+     "log_freq": 100,
+     "tolerance_s": 0.0001,
+     "save_checkpoint": true,
+     "save_freq": 36800,
+     "use_policy_training_preset": true,
+     "optimizer": {
+         "type": "adamw",
+         "lr": 0.0001,
+         "weight_decay": 1e-10,
+         "grad_clip_norm": 10.0,
+         "betas": [
+             0.9,
+             0.95
+         ],
+         "eps": 1e-08
+     },
+     "scheduler": {
+         "type": "cosine_decay_with_warmup",
+         "num_warmup_steps": 1000,
+         "num_decay_steps": 30000,
+         "peak_lr": 0.0001,
+         "decay_lr": 2.5e-06
+     },
+     "eval": {
+         "n_episodes": 50,
+         "batch_size": 50,
+         "use_async_envs": true
+     },
+     "wandb": {
+         "enable": true,
+         "disable_artifact": false,
+         "project": "lerobot-smolvla",
+         "entity": null,
+         "notes": null,
+         "run_id": "ixd0kno4",
+         "mode": null,
+         "add_tags": true
+     },
+     "peft": null,
+     "use_rabc": false,
+     "rabc_progress_path": null,
+     "rabc_kappa": 0.01,
+     "rabc_epsilon": 1e-06,
+     "rabc_head_mode": "sparse",
+     "rename_map": {
+         "observation.images.left_wrist": "observation.images.camera1",
+         "observation.images.top": "observation.images.camera2"
+     },
+     "checkpoint_path": null
+ }
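
The `image_transforms` block above lists five photometric transforms with equal weights, `max_num_transforms: 3` and `random_order: true`. The sketch below shows one plausible reading of that configuration: for each image, draw three distinct transforms by weighted sampling without replacement and apply them in the drawn order. How LeRobot actually samples is not shown in this commit, so the procedure is an assumption, and `sample_transforms` is a hypothetical helper that returns transform names only.

```python
import random
from typing import Dict, List

# Names and weights copied from the "tfs" section of the training config.
TRANSFORM_WEIGHTS: Dict[str, float] = {
    "brightness": 1.0,
    "contrast": 1.0,
    "saturation": 1.0,
    "hue": 1.0,
    "sharpness": 1.0,
}


def sample_transforms(weights: Dict[str, float],
                      max_num_transforms: int,
                      rng: random.Random) -> List[str]:
    """Draw distinct transform names, weighted, in the order they were drawn."""
    remaining = dict(weights)
    chosen: List[str] = []
    for _ in range(min(max_num_transforms, len(remaining))):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement: never pick the same one twice
    return chosen


rng = random.Random(1000)  # seeded for reproducibility of this example
subset = sample_transforms(TRANSFORM_WEIGHTS, max_num_transforms=3, rng=rng)
print(subset)
```

Because every candidate is a colour or sharpness jitter, any sampled subset leaves image geometry untouched, consistent with the README's note that no rotation, translation or flip is applied.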