---
license: mit
library_name: transformers
tags:
- robotics
- vla
- diffusion
- multimodal
- pretraining
language:
- en
pipeline_tag: robotics
---
# CogACT-Small

CogACT is a new, advanced vision-language-action (VLA) architecture derived from a vision-language model (VLM). Unlike previous works that directly repurpose a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output. CogACT-Small employs a [DiT-S](https://github.com/facebookresearch/DiT) model as the action module.

All our [code](https://github.com/microsoft/CogACT) and [pretrained model weights](https://huggingface.co/CogACT) are licensed under the MIT license.

Please refer to our [project page](https://cogact.github.io/) and [paper](https://arxiv.org/abs/2411.19650) for more details.

## Model Summary

- **Developed by:** The CogACT team, consisting of researchers from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/).
- **Model type:** Vision-Language-Action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** MIT
- **Model components:**
  + **Vision Backbone**: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14
  + **Language Model**: Llama-2
  + **Action Model**: DiT-Small
- **Pretraining Dataset:** A subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/)
- **Repository:** [https://github.com/microsoft/CogACT](https://github.com/microsoft/CogACT)
- **Paper:** [CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation](https://arxiv.org/abs/2411.19650)
- **Project Page:** [https://cogact.github.io/](https://cogact.github.io/)

## Uses

CogACT takes a language instruction and a single-view RGB image as input and predicts the next 16 normalized robot actions, each a 7-DoF end-effector delta of the form ``x, y, z, roll, pitch, yaw, gripper``. These actions should be unnormalized and, optionally, integrated by our ``Adaptive Action Ensemble``. Both unnormalization and ensembling depend on the dataset statistics.

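
To make the unnormalization step concrete, here is a minimal sketch of the arithmetic. It assumes the convention common in OpenVLA-style codebases, where each action dimension is scaled to `[-1, 1]` using a lower and upper bound taken from dataset statistics (for example the 1st and 99th percentiles). The function name `unnormalize_action` and the statistics `Q01` and `Q99` are hypothetical and chosen only for illustration; the repository remains the reference for the actual procedure.

```python
from typing import List, Sequence


def unnormalize_action(
    normalized: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
) -> List[float]:
    """Map each dimension from [-1, 1] back to [low, high] (assumed convention)."""
    if not (len(normalized) == len(low) == len(high)):
        raise ValueError("normalized, low and high must have the same length")
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo
        for a, lo, hi in zip(normalized, low, high)
    ]


# Hypothetical per-dimension bounds for (x, y, z, roll, pitch, yaw, gripper).
Q01 = [-0.2, -0.2, -0.2, -0.5, -0.5, -0.5, 0.0]
Q99 = [0.2, 0.2, 0.2, 0.5, 0.5, 0.5, 1.0]

normalized_action = [0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 1.0]
action = unnormalize_action(normalized_action, Q01, Q99)
```

A value of `-1` maps to the lower bound, `1` to the upper bound, and `0` to their midpoint, which is why the result depends on the statistics of the dataset the robot setup comes from.
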

CogACT models can be used zero-shot to control robots for setups seen in the [Open-X](https://robotics-transformer-x.github.io/) pretraining mixture. They can also be fine-tuned for new tasks and robot setups with a very small number of demonstrations. See [our repository](https://github.com/microsoft/CogACT) for more information.

Here is a simple inference example.

```python
# Clone our repo and install its dependencies first
# (minimal dependencies: `torch`, `transformers`, `timm`, `tokenizers`, ...)

from PIL import Image
from vla import load_vla
import torch

model = load_vla(
    'CogACT/CogACT-Small',
    load_for_training=False,
    action_model_type='DiT-S',
    future_action_window_size=15,
)
# about 30G of memory in fp32

# (Optional) use "model.vlm = model.vlm.to(torch.bfloat16)" to load the VLM in bf16

model.to('cuda:0').eval()

# Replace the placeholder path with your own single-view RGB image
image: Image.Image = Image.open("<path_to_your_image>").convert("RGB")
prompt = "move sponge near apple"  # input your prompt

# Predict actions (7-DoF; un-normalized for RT-1 Google robot data, i.e. fractal20220817_data)
actions, _ = model.predict_action(
    image,
    prompt,
    unnorm_key='fractal20220817_data',  # input the unnorm_key of your dataset
    cfg_scale=1.5,                      # cfg from 1.5 to 7 also performs well
    use_ddim=True,                      # use DDIM sampling
    num_ddim_steps=10,                  # number of steps for DDIM sampling
)

# results in 7-DoF actions for 16 steps, with shape [16, 7]
```
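
Because each call predicts 16 future actions, consecutive calls produce overlapping predictions for the same timestep. The sketch below shows one way such predictions could be combined, loosely following the idea behind the ``Adaptive Action Ensemble``: older predictions for the current timestep are averaged with the newest one, weighted by their cosine similarity to it. The class name `SimilarityWeightedEnsembler` and the parameters `horizon` and `alpha` are assumptions made for illustration, not the repository's API, and the sketch works on plain nested lists rather than on the model's output type.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityWeightedEnsembler:
    """Hypothetical helper that blends overlapping action chunks."""

    def __init__(self, horizon: int = 2, alpha: float = 0.1) -> None:
        self.alpha = alpha
        # Keep the newest chunk plus up to `horizon` older chunks.
        self.history: Deque[List[List[float]]] = deque(maxlen=horizon + 1)

    def reset(self) -> None:
        """Clear the history, e.g. at the start of a new episode."""
        self.history.clear()

    def step(self, chunk: Sequence[Sequence[float]]) -> List[float]:
        """Add the newest chunk and return the blended action for this timestep.

        chunk[i] is the action predicted for i steps after the current observation.
        """
        self.history.append([list(action) for action in chunk])
        newest_first = list(reversed(self.history))
        # The chunk predicted k steps ago covers the current timestep at index k.
        candidates = [c[k] for k, c in enumerate(newest_first) if k < len(c)]
        reference = candidates[0]  # the newest prediction
        weights = [
            math.exp(self.alpha * cosine_similarity(reference, c))
            for c in candidates
        ]
        total = sum(weights)
        return [
            sum(w * c[d] for w, c in zip(weights, candidates)) / total
            for d in range(len(reference))
        ]


# Toy 2-D example with chunks of length 3 (real chunks would be [16, 7]).
ensembler = SimilarityWeightedEnsembler(horizon=2, alpha=0.1)
chunk_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
chunk_b = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
first = ensembler.step(chunk_a)   # single candidate, returned unchanged
second = ensembler.step(chunk_a)  # two identical candidates
third = ensembler.step(chunk_b)   # newest prediction disagrees with history
```

In the toy example, the third call blends the new prediction with two older ones that point in a different direction, so the result lies between them. Calling `reset()` between episodes avoids mixing predictions from unrelated rollouts.
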

## Citation

```bibtex
@article{li2024cogact,
  title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
  author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
  journal={arXiv preprint arXiv:2411.19650},
  year={2024}
}
```