---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
- video-understanding
- streaming
- proactive
- activation-model
- masked-diffusion
- multimodal
- plug-and-play
language:
- en
pipeline_tag: video-classification
model-index:
- name: STRIDE-2B
  results:
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: OVO-Bench
    metrics:
    - type: accuracy
      value: 59.07
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Proactive Streaming Activation
    dataset:
      type: custom
      name: StreamingBench
    metrics:
    - type: accuracy
      value: 59.29
      name: Overall (w/ Qwen3-VL-8B)
  - task:
      type: video-classification
      name: Temporal Grounding
    dataset:
      type: custom
      name: ET-Bench
    metrics:
    - type: f1
      value: 62.8
      name: TVG F1
    - type: f1
      value: 10.7
      name: EPM F1
    - type: f1
      value: 24.6
      name: TAL F1
    - type: f1
      value: 36.5
      name: DVC F1
    - type: f1
      value: 28.5
      name: SLC F1
---

# STRIDE-2B

**STRIDE** (**S**tructured **T**emporal **R**efinement with **I**terative **DE**noising) is a lightweight proactive activation model for streaming video understanding.
It decides **when** a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.

<p align="center">
<a href="https://arxiv.org/abs/2603.27593"><img src="https://img.shields.io/badge/arXiv-2603.27593-b31b1b" alt="arXiv"></a>
<a href="https://interlive-team.github.io/STRIDE"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
<a href="https://github.com/interlive-team/STRIDE"><img src="https://img.shields.io/badge/GitHub-Code-black" alt="GitHub"></a>
<a href="https://huggingface.co/interlive"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Model_Collection-yellow" alt="HF"></a>
</p>

> **Paper**: *STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding*
>
> Junho Kim\*, Hosu Lee\*, James M. Rehg, Minsu Kim, Yong Man Ro
>
> UIUC, KAIST, Google DeepMind

## What is STRIDE?

Existing streaming Video-LLMs are **reactive** — they only respond when a user explicitly asks a question. STRIDE makes them **proactive** by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating *when* to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond *now*?"), but a **span-structured** sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through **masked diffusion** over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
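
To make the span-structured formulation concrete, here is a minimal sketch of one masked-diffusion decoding loop over a binary activation window. The window length, the unmasking schedule, and the `predict_logits` scorer are illustrative assumptions, not the released implementation:

```python
import torch

# Hypothetical sketch of iterative masked-diffusion refinement over a
# binary activation window. `predict_logits` stands in for the model's
# per-frame activation head; the commit schedule below is an assumption.
def denoise_activation_window(predict_logits, frame_feats, num_steps=4):
    T = frame_feats.shape[0]          # window length (frames at 1 FPS)
    MASK = -1                         # sentinel for "still masked"
    labels = torch.full((T,), MASK)   # start from a fully masked window
    for _ in range(num_steps):
        logits = predict_logits(frame_feats, labels)  # (T, 2) per-frame logits
        conf, pred = logits.softmax(-1).max(-1)       # confidence + 0/1 guess
        conf[labels != MASK] = -1.0   # only still-masked frames are candidates
        # Commit the most confident predictions; the rest stay masked so the
        # next pass can refine them against the newly committed context.
        k = min(-(-T // num_steps), int((labels == MASK).sum()))
        idx = conf.topk(k).indices
        labels[idx] = pred[idx]
    labels[labels == MASK] = 0        # any uncommitted frame defaults to inactive
    return labels                     # coherent 0/1 span over the window
```

Because each refinement step re-predicts the whole window jointly, onset, persistence, and offset transitions are denoised together rather than decided frame by frame.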

### Two-Stage Architecture

```
Video Stream
     │
     ▼
[STRIDE Activation Model]   ← this model (2B)
     │
     │ trigger (only if active)
     ▼
[Downstream Video-LLM]      ← frozen, any off-the-shelf
     │
     ▼
  Response
```

- **Stage 1 — Activation (STRIDE):** Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- **Stage 2 — Response (Downstream LLM):** When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully **plug-and-play** — compatible with any off-the-shelf Video-LLM (see the sketch below).
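
A minimal sketch of how the two stages compose at inference time; `stride_is_active` and `video_llm_respond` are placeholders for the activation model and the frozen downstream Video-LLM, not the repository's actual API:

```python
from collections import deque

WINDOW_SIZE = 16  # illustrative activation-window length, in frames at 1 FPS

def run_streaming(frames, stride_is_active, video_llm_respond):
    """Hypothetical glue loop wiring STRIDE to a frozen Video-LLM."""
    frame_cache = []                        # Stage 2 consumes the full cache
    window = deque(maxlen=WINDOW_SIZE)      # Stage 1 sees a sliding window
    for frame in frames:                    # frames arrive at 1 FPS
        frame_cache.append(frame)
        window.append(frame)
        if stride_is_active(list(window)):  # masked-diffusion activation check
            yield video_llm_respond(frame_cache)  # trigger only when active
```

The downstream model never runs on inactive frames, which is what keeps the pipeline cheap: the 2B activation model does the per-frame work, and the larger Video-LLM is invoked only on predicted activation spans.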

## Results

### OVO-Bench (Online Video Understanding)

| | Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall | |
| |---|:---:|:---:|:---:|:---:| |
| | Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 | |
| | Dispider | 54.55 | 36.06 | 34.72 | 41.78 | |
| | TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 | |
| | QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 | |
| | StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 | |
| | **STRIDE** + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 | |
| | **STRIDE** + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 | |
| | **STRIDE** + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | **59.07** | |

### StreamingBench (Streaming Comprehension)

| | Method | Real-Time Visual | Omni-Source | Contextual | Overall | |
| |---|:---:|:---:|:---:|:---:| |
| | Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 | |
| | VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 | |
| | Dispider | 67.63 | 35.66 | 33.61 | 53.12 | |
| | StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 | |
| | **STRIDE** + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 | |
| | **STRIDE** + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 | |
| | **STRIDE** + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | **59.29** | |

### ET-Bench (Temporal Grounding, Activation-Only)

| | Model | Params | TVG | EPM | TAL | DVC | SLC | Avg | |
| |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | *Temporal-Localization Specialized* | | | | | | | | |
| | VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 | |
| | TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 | |
| | VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | **40.2** | 20.8 | 19.0 | |
| | LITA | 13B | 22.2 | 4.6 | 18.0 | <u>39.7</u> | 21.0 | 21.1 | |
| | ETChat | 5B | <u>38.6</u> | 10.2 | **30.8** | 38.4 | <u>24.4</u> | <u>28.5</u> | |
| | *Streaming Baselines* | | | | | | | | |
| | VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 | |
| | Dispider | 9B | 36.1 | **15.5** | <u>27.3</u> | 33.8 | 18.8 | 26.3 | |
| | StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – | |
| | *Ours* | | | | | | | | |
| | **STRIDE** | **2B** | **62.8** | <u>10.7</u> | 24.6 | 36.5 | **28.5** | **32.6** | |

STRIDE achieves the best overall average with only 2B parameters, outperforming 7-13B temporal-localization specialized models and streaming baselines.

## Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the [STRIDE GitHub repository](https://github.com/interlive-team/STRIDE).
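
As a starting point, the checkpoint can be loaded with `transformers`. This is a minimal loading sketch, assuming the custom `Qwen3VLForProactiveMDM` modeling code is bundled with the checkpoint; the actual entry points and the streaming loop are defined in the repository:

```python
from transformers import AutoModel, AutoProcessor

# Minimal loading sketch (assumption: custom modeling code ships with
# the checkpoint and is exposed through trust_remote_code).
model_id = "interlive/STRIDE-2B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place weights on the available GPU(s)
)
```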

## Training

- **Architecture:** `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head; a hypothetical sketch follows this list)
- **Base model:** [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Training data:** Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM)
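
One way to picture the temporal activation head on top of backbone hidden states — a hypothetical sketch, not the released module (the real design lives in the STRIDE codebase):

```python
import torch.nn as nn

class TemporalActivationHead(nn.Module):
    """Illustrative per-frame activation head; an assumption, not STRIDE's."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 2)  # inactive / active logits

    def forward(self, hidden_states):          # (batch, frames, hidden)
        return self.proj(hidden_states)        # (batch, frames, 2) logits
```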

## Model Variants

| | Model | Params | Description | |
| |---|---|---| |
| [**STRIDE-2B**](https://huggingface.co/interlive/STRIDE-2B) (this model) | 2B | Default activation model |
| | STRIDE-4B | 4B | Scaled variant with improved accuracy | |

## Citation

```bibtex
@article{kim2026stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.27593},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
|