---
license: mit
library_name: transformers
pipeline_tag: video-classification
tags:
- video-classification
- zero-shot
- vision
- acaua
datasets:
- kinetics-400
base_model: microsoft/xclip-base-patch32
---

# X-CLIP (base, patch 32): acaua mirror

MIT-licensed mirror hosted under `CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua) computer vision library.

This is a **safetensors-only mirror** of the upstream Microsoft weights at the pinned commit shown below. The legacy `pytorch_model.bin` (pickle format) that upstream ships alongside `model.safetensors` has been deliberately removed for security hygiene: pickle loads can execute arbitrary code. Because `transformers` auto-prefers safetensors when both formats are present, removing the pickle file has no functional impact on downstream users.

X-CLIP is a **zero-shot video classification** model: you provide a list of candidate text labels at inference time and the model ranks them by similarity to the video clip. It is not a closed-set softmax classifier, and it does not appear in `AutoModelForVideoClassification`.

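
To make the ranking step concrete, here is a minimal NumPy sketch of CLIP-style zero-shot scoring: embeddings are L2-normalized, scaled cosine similarities are computed, and a softmax over the candidate labels gives the ranking. The function name `rank_labels`, the toy 2-D embeddings, and the `logit_scale=100.0` default are illustrative assumptions, not values read from this checkpoint; the real model produces the embeddings and learns its own scale.

```python
import numpy as np


def rank_labels(video_emb, text_embs, labels, logit_scale=100.0, top_k=3):
    """Rank candidate labels by scaled cosine similarity to one video embedding.

    video_emb: shape (d,); text_embs: shape (num_labels, d).
    logit_scale is an illustrative constant (assumption), not the checkpoint's value.
    """
    video_emb = np.asarray(video_emb, dtype=np.float64)
    text_embs = np.asarray(text_embs, dtype=np.float64)

    # L2-normalize so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = logit_scale * (t @ v)

    # Numerically stable softmax over the candidate labels.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()

    order = np.argsort(-probs)[:top_k]
    return [(labels[i], float(probs[i])) for i in order]


# Toy example: the first text embedding points almost the same way as the video.
labels = ["dancing", "cooking", "running"]
video = [1.0, 0.0]
texts = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
for label, prob in rank_labels(video, texts, labels):
    print(f"{label}: {prob:.3f}")
```

Because the softmax runs over whatever labels you pass in, the scores are relative to that candidate list; adding or removing a label changes every probability.
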
## Provenance

| | |
|---|---|
| Upstream repo | [`microsoft/xclip-base-patch32`](https://huggingface.co/microsoft/xclip-base-patch32) |
| Upstream commit SHA | `a2e27a78a2b5d802e894b8a1ef14f3a8ce490963` |
| Upstream commit date | 2024-02-04 |
| Declared license | MIT |
| Paper | Ni et al., *"Expanding Language-Image Pretrained Models for General Video Recognition"*, ECCV 2022, arXiv:[2208.02816](https://arxiv.org/abs/2208.02816) |
| Official code | [`microsoft/VideoX`](https://github.com/microsoft/VideoX) (MIT) |
| Mirrored on | 2026-04-23 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |

## Usage via acaua

```python
import acaua

model = acaua.Model.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    allow_non_apache=True,  # weights are MIT, not Apache-2.0
)

result = model.predict(
    "dance.mp4",
    labels=["dancing", "cooking", "running", "sleeping", "walking"],
    top_k=3,
)

for label, score in zip(result.labels, result.scores.tolist()):
    print(f"{label}: {score:.3f}")
```

## Usage via 🤗 Transformers

This mirror is drop-in compatible with the upstream repo.

```python
from transformers import XCLIPModel, XCLIPProcessor

processor = XCLIPProcessor.from_pretrained("CondadosAI/xclip_base_patch32")
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32")
```
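
Once the processor and model are loaded, a zero-shot forward pass could look like the sketch below. It follows the standard `XCLIPModel` call pattern (text and video passed to the processor together, scores read from `logits_per_video`). The helpers `make_dummy_clip` and `classify_clip` are hypothetical names introduced here, and the random frames are only a stand-in for frames decoded from a real video.

```python
import numpy as np


def make_dummy_clip(num_frames=8, size=224, seed=0):
    """Return a list of random RGB frames, each (H, W, 3) uint8.

    Placeholder for decoded video frames; real frames come from a video reader.
    """
    rng = np.random.default_rng(seed)
    clip = rng.integers(0, 256, size=(num_frames, size, size, 3), dtype=np.uint8)
    return list(clip)


def classify_clip(frames, labels, repo_id="CondadosAI/xclip_base_patch32"):
    """Score candidate labels against one clip; returns {label: probability}."""
    # Heavy imports are deferred so make_dummy_clip works without torch installed.
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    processor = XCLIPProcessor.from_pretrained(repo_id)
    model = XCLIPModel.from_pretrained(repo_id)
    model.eval()

    # The processor tokenizes the labels and resizes/normalizes the frames.
    inputs = processor(text=labels, videos=frames, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_clips, num_labels); softmax over the labels.
    probs = outputs.logits_per_video.softmax(dim=1)[0]
    return dict(zip(labels, probs.tolist()))
```

Calling `classify_clip(make_dummy_clip(), ["dancing", "cooking", "running"])` downloads the weights on first use and returns one probability per label. With random frames the scores are meaningless; the call only exercises the pipeline end to end.
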
## Expected input

- **Frames:** 8 uniformly sampled frames per clip (`vision_config.num_frames=8`).
- **Resolution:** 224 × 224 after resize + center-crop.
- **Normalization:** ImageNet mean/std (handled by `XCLIPProcessor`).
- **Text prompts:** supplied at inference time as arbitrary natural-language strings.
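
The frame requirement above can be met with a small index-sampling helper such as the sketch below. It picks evenly spaced indices between the first and last frame, which is one common reading of "uniformly sampled"; the exact strategy used by acaua or the upstream training code is not stated here, so treat `uniform_frame_indices` as a hypothetical helper.

```python
import numpy as np


def uniform_frame_indices(total_frames, num_frames=8):
    """Return num_frames evenly spaced frame indices in [0, total_frames - 1].

    If the video has fewer frames than requested, some indices repeat.
    """
    if total_frames < 1:
        raise ValueError("total_frames must be at least 1")
    positions = np.linspace(0, total_frames - 1, num=num_frames)
    return positions.round().astype(int).tolist()


# Example: choose 8 frames from an 80-frame clip.
indices = uniform_frame_indices(80)
print(indices)
```

The selected frames can then be decoded with any video reader and passed to `XCLIPProcessor`, which handles the resize, center-crop, and normalization.
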
## License and attribution

Redistributed under MIT, consistent with the upstream declaration. See [`NOTICE`](./NOTICE) for required attribution.

## Citation

```bibtex
@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and Peng, Houwen and Chen, Minghao and Zhang, Songyang and Meng, Gaofeng and Fu, Jianlong and Xiang, Shiming and Ling, Haibin},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={1--18},
  year={2022},
  publisher={Springer}
}
```