---
license: mit
library_name: transformers
pipeline_tag: image-feature-extraction
---
# OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
OmniStream is a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), the model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache.
- **Paper:** [OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams](https://huggingface.co/papers/2603.12265)
- **Project Page:** [https://go2heart.github.io/omnistream/](https://go2heart.github.io/omnistream/)
- **Repository:** [https://github.com/Go2Heart/OmniStream](https://github.com/Go2Heart/OmniStream)
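To make the 3D-RoPE idea concrete, here is a minimal NumPy sketch (illustrative only, not the model's actual implementation): the channel dimension is split into three groups, and each group is rotated by the token's position along one axis (time, height, width). Like 1D RoPE, the rotation preserves each token's norm while encoding position into the phase.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Rotate channel pairs of x by angles derived from 1-D positions."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]       # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t_idx, h_idx, w_idx):
    """3D-RoPE sketch: one channel group per axis (time, height, width)."""
    d = x.shape[-1] // 3
    return np.concatenate([
        rope_1d(x[..., :d],     t_idx),
        rope_1d(x[..., d:2*d],  h_idx),
        rope_1d(x[..., 2*d:],   w_idx),
    ], axis=-1)

# Four patch tokens with (t, h, w) coordinates and 12-dim features
tokens = np.random.randn(4, 12)
t = np.array([0.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 0.0, 1.0])
w = np.array([0.0, 0.0, 1.0, 1.0])
rotated = rope_3d(tokens, t, h, w)
print(rotated.shape)  # (4, 12); per-token norms are unchanged by the rotation
```

The channel split and frequency base here are assumptions for illustration; the paper and repository define the exact parameterization.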
## Sample Usage
The following code snippet demonstrates how to use OmniStream for feature extraction. Note that this requires the `model.py` file from the official repository to be present in your environment.
```python
import numpy as np
import torch
from transformers import AutoImageProcessor

from model import OmnistreamMultiFrameTransformer  # from the official repository

# Load processor and model
processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
model.eval()

# Prepare a dummy input: 16 frames of 512x512 RGB images (Time, Height, Width, Channels)
fake_pixel = np.random.randn(16, 512, 512, 3)
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")

# Add the batch dimension: (Batch, Time, Channels, Height, Width)
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["pooler_output"].shape)      # CLS token
print(output["patch_start_idx"])          # index of the first patch of each frame
```
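The persistent KV-cache mentioned above rests on a property of causal attention: processing frames one at a time while appending their keys and values to a cache yields exactly the same outputs as running causally-masked attention over the full clip. The toy NumPy sketch below (a conceptual illustration, not OmniStream's API) verifies this equivalence for a single head.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and head dimension
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Offline: full causal attention over the whole clip at once.
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
offline = softmax(scores) @ V

# Online: one frame at a time, appending to a persistent KV-cache.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
online = []
for t in range(T):
    k_cache = np.vstack([k_cache, K[t:t + 1]])
    v_cache = np.vstack([v_cache, V[t:t + 1]])
    s = k_cache @ Q[t] / np.sqrt(d)   # attend only to cached (past + current) frames
    online.append(softmax(s) @ v_cache)
online = np.stack(online)

print(np.allclose(offline, online))  # True
```

Per-frame cost stays bounded by the cache length rather than re-running attention over the entire clip, which is what makes frame-by-frame online processing efficient.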
## Citation
```bibtex
@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}
```