---
pipeline_tag: robotics
license: apache-2.0
---

# Large Video Planner Enables Generalizable Robot Control

This repository contains the trained Large Video Planner checkpoints (14B parameters) for the model presented in the paper [Large Video Planner Enables Generalizable Robot Control](https://huggingface.co/papers/2512.15840).

The Large Video Planner explores an alternative paradigm: using large-scale video pretraining as the primary modality for building robot foundation models. It produces zero-shot video plans for novel scenes and tasks, which are then post-processed to extract executable robot actions.

- **Project Page:** [https://www.boyuan.space/large-video-planner/](https://www.boyuan.space/large-video-planner/)
- **GitHub Repository:** [https://github.com/buoyancy99/large-video-planner](https://github.com/buoyancy99/large-video-planner)
- **Hugging Face Demo:** [https://huggingface.co/spaces/KempnerInstituteAI/LVP](https://huggingface.co/spaces/KempnerInstituteAI/LVP)

---

|
| | This folder contains the trained large video planner checkpoints (14B parameter) and all the metadata for eight dataset sources: Agibot-world, droid, bridge, language-tables, Pandas(filtered), SomethingSomethingV2, ego4d, epic_kitchens. We release a `merged_metadata.csv` and a `cleaned_metadata.csv` for each. |
| | We also released our test set in `data/ours_test/` with the images and text instructions gathered from third-parties. |

## Trained Checkpoints for our Large Video Planner
`checkpoints/lvp_14B.ckpt` contains the trained weights for the transformer backbone.

## Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.

Each dataset includes a global metadata file, typically named `metadata_merged.csv`, which contains key information for each video clip.

The file is named `metadata_merged.csv` because each video clip may have multiple recaptions. Instead of saving the captions for each video as a list within a single CSV row, we simply create another row in `metadata_merged.csv`, so the file may contain multiple rows referring to the same video with different captions. For some datasets, we also provide a `cleaned_metadata.csv`, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.

Important fields of the global metadata include:
1. `video_path`: Relative path (from the metadata file) to the video clip.
2. `trim_start` and `trim_end` (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
3. `gemini_caption`: Action-focused caption generated by Gemini 2.0 Flash.
4. `original_caption`: Original caption from the source dataset; used when no Gemini caption is available.
5. `prompt_embed_path`: Path to precomputed T5 prompt embeddings (not released due to their large size).
6. `stable_brightess` (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with `stable_brightess == 0.0`.
7. `stable_background` (optional): Either 1.0 or 0.0. We recommend removing videos with `stable_background == 0.0`; a value of 0.0 indicates a large average optical flow magnitude, which very likely corresponds to large background motion.
8. `detected_hand_in_first_frame` (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
9. Other fields, such as `n_frames`, `n_fps`, `height`, and `width`, provide additional information about each clip.
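The filtering recommendations above can be sketched in pandas. This is a minimal illustration, not part of the released code: the toy rows below are invented for demonstration, and the script simply drops rows whose quality flags are 0.0 and falls back to `original_caption` when `gemini_caption` is missing.

```python
import pandas as pd

# Hypothetical rows mimicking the metadata_merged.csv schema described above.
df = pd.DataFrame({
    "video_path": ["clips/a.mp4", "clips/b.mp4", "clips/c.mp4"],
    "gemini_caption": ["pick up the red cup", None, "open the drawer"],
    "original_caption": ["cup", "push the block", "drawer"],
    "stable_brightess": [1.0, 1.0, 0.0],
    "stable_background": [1.0, 1.0, 1.0],
    "detected_hand_in_first_frame": [1.0, 1.0, 1.0],
})

# Apply the recommended quality filters; these flags are optional,
# so treat a missing value as passing.
for flag in ["stable_brightess", "stable_background",
             "detected_hand_in_first_frame"]:
    if flag in df.columns:
        df = df[df[flag].fillna(1.0) == 1.0]

# Use the Gemini caption when present, else the original caption.
df["caption"] = df["gemini_caption"].fillna(df["original_caption"])
```

After filtering, only the clips with all quality flags at 1.0 remain, each with a single usable caption column.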

## Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, Epic-Kitchens, and Something-Something in their respective `dataset.py` files within the `datasets/` folder of the released code.

For the filtered Panda-70M subset, we provide a unique `youtube_key_segment` for each video clip, along with its `trim_start` and `trim_end`. To download this subset, first download the official metadata from [Panda-70M](https://snap-research.github.io/Panda-70M/), then use the `youtube_key_segment` to find the URL of each video clip and download it with your own online video downloader.
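A sketch of the lookup step, under stated assumptions: the rows below are invented examples, the official-metadata column names (`videoID`, `url`) are assumptions to verify against the file you actually download from the Panda-70M page, and we assume `youtube_key_segment` ends in an underscore-separated segment index after the YouTube video id.

```python
import pandas as pd

# Illustrative rows from our filtered subset (real data: metadata_merged.csv).
ours = pd.DataFrame({
    "youtube_key_segment": ["abc123_0", "abc123_3", "xyz789_1"],
    "trim_start": [0.0, 12.5, 3.0],
    "trim_end": [4.0, 18.0, 9.5],
})

# Illustrative rows standing in for the official Panda-70M metadata;
# column names here are assumptions, check the downloaded file.
official = pd.DataFrame({
    "videoID": ["abc123", "xyz789"],
    "url": ["https://www.youtube.com/watch?v=abc123",
            "https://www.youtube.com/watch?v=xyz789"],
})

# Assumption: youtube_key_segment is "<youtube_id>_<segment_index>",
# so strip the trailing segment index to recover the video id.
ours["videoID"] = ours["youtube_key_segment"].str.rsplit("_", n=1).str[0]

# Join to obtain the URL for each clip; trim_start/trim_end then tell
# your downloader which segment of the video to keep.
merged = ours.merge(official, on="videoID", how="left")
```

Each row of `merged` now carries the video URL together with the clip's trim boundaries.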

## Sample Usage (Inference)

To run inference with the Large Video Planner, first ensure your environment is set up and checkpoints are downloaded as described in the [GitHub repository's instructions](https://github.com/buoyancy99/large-video-planner#instructions-for-running-the-code).

Then, use the following command for basic inference:

```bash
mkdir -p <your-output-folder>
python -m main \
  +name=<your_exp_name> \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=ours_test \
  experiment.tasks=[validation] \
  algorithm.logging.video_type=single \
  experiment.num_nodes=1 \
  experiment.validation.limit_batch=null \
  algorithm.hist_guidance=1.5 \
  algorithm.lang_guidance=2.5
```
Replace `<your-output-folder>` and `<your_exp_name>` with your desired values. Refer to the [GitHub repository](https://github.com/buoyancy99/large-video-planner) for detailed explanations of arguments and further instructions, including how to download the checkpoints.