VideoSearch-R1 ActivityNet Stage 1

This is the Stage 1 VideoSearch-R1 checkpoint trained for ActivityNet, presented in the paper VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement.

Project Page: mlvlab.github.io/VideoSearch-R1
Repository: mlvlab/VideoSearch-R1

Stage 1 is trained from the Qwen3-VL base model on constructed video retrieval and temporal grounding supervision. It is intended as the initialization for Stage 2 training, or as a standalone checkpoint for ablation studies.

Usage

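Use this checkpoint with the VideoSearch-R1 codebase. The first command downloads the pre-extracted ActivityNet data; the second runs inference with this checkpoint on GPU 0: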

bash scripts/data_construct/download_preextracted_data.bash activitynet
EVAL_GPUS=0 bash scripts/inference/inference.bash activitynet --checkpoint VideoSearchR1/activitynet-stage1
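
The two commands above can be chained so that inference only starts once the download has succeeded. The following is a minimal sketch, not part of the VideoSearch-R1 codebase: the function name `run_stage1_eval` and the `DRY_RUN` variable are hypothetical, and it assumes it is called from the repository root so that the relative script paths resolve.

```shell
# Hypothetical wrapper (not part of the VideoSearch-R1 codebase): runs the two
# documented steps in order and skips inference if the download step fails.
run_stage1_eval() {
  local dataset="${1:-activitynet}"
  local checkpoint="${2:-VideoSearchR1/activitynet-stage1}"
  local gpus="${EVAL_GPUS:-0}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the commands instead of executing them.
    echo "bash scripts/data_construct/download_preextracted_data.bash $dataset"
    echo "EVAL_GPUS=$gpus bash scripts/inference/inference.bash $dataset --checkpoint $checkpoint"
    return 0
  fi

  # Step 1: fetch the pre-extracted data. Step 2 runs only if step 1 succeeds.
  bash scripts/data_construct/download_preextracted_data.bash "$dataset" &&
    EVAL_GPUS="$gpus" bash scripts/inference/inference.bash "$dataset" --checkpoint "$checkpoint"
}
```

Calling `DRY_RUN=1 run_stage1_eval activitynet` prints the two commands without running them, which is a cheap way to check the arguments before starting a long job.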

Citation

@inproceedings{lee2026videosearchr1,
  title     = {VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement},
  author    = {Lee, Seohyun and Choi, Seoung and Ko, Dohwan and Kim, Jongha and Kim, Hyunwoo J.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Model size: 2B params (Safetensors)
Tensor types: F32, BF16

Base model: Qwen/Qwen3-VL-4B-Instruct

Paper: VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement (2607.00446)