---
pipeline_tag: robotics
license: apache-2.0
---

# Large Video Planner Enables Generalizable Robot Control

This repository contains the trained Large Video Planner checkpoints (14B parameters) for the model presented in the paper [Large Video Planner Enables Generalizable Robot Control](https://huggingface.co/papers/2512.15840).

The Large Video Planner explores an alternative paradigm: using large-scale video pretraining as the primary modality for building robot foundation models. It produces zero-shot video plans for novel scenes and tasks, which are then post-processed to extract executable robot actions.

- **Project Page:** [https://www.boyuan.space/large-video-planner/](https://www.boyuan.space/large-video-planner/)
- **GitHub Repository:** [https://github.com/buoyancy99/large-video-planner](https://github.com/buoyancy99/large-video-planner)
- **Hugging Face Demo:** [https://huggingface.co/spaces/KempnerInstituteAI/LVP](https://huggingface.co/spaces/KempnerInstituteAI/LVP)

---

|
| | This folder contains the trained large video planner checkpoints (14B parameter) and all the metadata for eight dataset sources: Agibot-world, droid, bridge, language-tables, Pandas(filtered), SomethingSomethingV2, ego4d, epic_kitchens. We release a `merged_metadata.csv` and a `cleaned_metadata.csv` for each. |
| | We also released our test set in `data/ours_test/` with the images and text instructions gathered from third-parties. |

## Trained Checkpoints for our Large Video Planner
`checkpoints/lvp_14B.ckpt` contains the trained weights for the transformer backbone.

## Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.

Each dataset includes a global metadata file, typically named `metadata_merged.csv`, which contains key information for each video clip.

The file is named `metadata_merged.csv` because each video clip may have multiple recaptions. Instead of saving the captions for each video as a list within a single CSV row, we simply create another row in `metadata_merged.csv`, so the file may contain multiple rows referring to the same video with different captions. For some datasets, we also provide a `cleaned_metadata.csv`, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.

Important fields of the global metadata include:
1. `video_path`: Relative path (from the metadata file) to the video clip.
2. `trim_start` and `trim_end` (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
3. `gemini_caption`: Action-focused caption generated by Gemini 2.0 Flash.
4. `original_caption`: Original caption from the source dataset; used when no Gemini caption is available.
5. `prompt_embed_path`: Path to precomputed T5 prompt embeddings (not released due to their large size).
6. `stable_brightess` (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with `stable_brightess == 0.0`.
7. `stable_background` (optional): Either 1.0 or 0.0. We recommend removing videos with `stable_background == 0.0`; a value of 0.0 indicates a large average optical flow magnitude, which very likely corresponds to large background motion.
8. `detected_hand_in_first_frame` (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
9. Other fields, such as `n_frames`, `n_fps`, `height`, and `width`, provide additional information about each clip.
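The filtering recommendations above can be sketched in pandas. This is a minimal illustration, not part of the released code: the toy rows below are invented for demonstration, and the script simply drops rows whose quality flags are 0.0 and falls back to `original_caption` when `gemini_caption` is missing.

```python
import pandas as pd

# Hypothetical rows mimicking the metadata_merged.csv schema described above.
df = pd.DataFrame({
    "video_path": ["clips/a.mp4", "clips/b.mp4", "clips/c.mp4"],
    "gemini_caption": ["pick up the red cup", None, "open the drawer"],
    "original_caption": ["cup", "push the block", "drawer"],
    "stable_brightess": [1.0, 1.0, 0.0],
    "stable_background": [1.0, 1.0, 1.0],
    "detected_hand_in_first_frame": [1.0, 1.0, 1.0],
})

# Apply the recommended quality filters; these flags are optional,
# so treat a missing value as passing.
for flag in ["stable_brightess", "stable_background",
             "detected_hand_in_first_frame"]:
    if flag in df.columns:
        df = df[df[flag].fillna(1.0) == 1.0]

# Use the Gemini caption when present, else the original caption.
df["caption"] = df["gemini_caption"].fillna(df["original_caption"])
```

After filtering, only the clips with all quality flags at 1.0 remain, each with a single usable caption column.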

## Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, Epic-Kitchens, and Something-Something in their respective `dataset.py` files within the `datasets/` folder of the released code.

For the filtered Panda-70M subset, we provide a unique `youtube_key_segment` for each video clip, along with its `trim_start` and `trim_end`. To download this subset, first download the official metadata from [Panda-70M](https://snap-research.github.io/Panda-70M/), then use the `youtube_key_segment` to find the URL of each video clip and download it with your own online video downloader.
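A sketch of the lookup step, under stated assumptions: the rows below are invented examples, the official-metadata column names (`videoID`, `url`) are assumptions to verify against the file you actually download from the Panda-70M page, and we assume `youtube_key_segment` ends in an underscore-separated segment index after the YouTube video id.

```python
import pandas as pd

# Illustrative rows from our filtered subset (real data: metadata_merged.csv).
ours = pd.DataFrame({
    "youtube_key_segment": ["abc123_0", "abc123_3", "xyz789_1"],
    "trim_start": [0.0, 12.5, 3.0],
    "trim_end": [4.0, 18.0, 9.5],
})

# Illustrative rows standing in for the official Panda-70M metadata;
# column names here are assumptions, check the downloaded file.
official = pd.DataFrame({
    "videoID": ["abc123", "xyz789"],
    "url": ["https://www.youtube.com/watch?v=abc123",
            "https://www.youtube.com/watch?v=xyz789"],
})

# Assumption: youtube_key_segment is "<youtube_id>_<segment_index>",
# so strip the trailing segment index to recover the video id.
ours["videoID"] = ours["youtube_key_segment"].str.rsplit("_", n=1).str[0]

# Join to obtain the URL for each clip; trim_start/trim_end then tell
# your downloader which segment of the video to keep.
merged = ours.merge(official, on="videoID", how="left")
```

Each row of `merged` now carries the video URL together with the clip's trim boundaries.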

## Sample Usage (Inference)

To run inference with the Large Video Planner, first ensure your environment is set up and checkpoints are downloaded as described in the [GitHub repository's instructions](https://github.com/buoyancy99/large-video-planner#instructions-for-running-the-code).

Then, use the following command for basic inference:

```bash
mkdir -p <your-output-folder>
python -m main \
  +name=<your_exp_name> \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=ours_test \
  experiment.tasks=[validation] \
  algorithm.logging.video_type=single \
  experiment.num_nodes=1 \
  experiment.validation.limit_batch=null \
  algorithm.hist_guidance=1.5 \
  algorithm.lang_guidance=2.5
```
Replace `<your-output-folder>` and `<your_exp_name>` with your desired values. Refer to the [GitHub repository](https://github.com/buoyancy99/large-video-planner) for detailed explanations of arguments and further instructions, including how to download the checkpoints.