🎬 Video Intelligence Platform

Akinator-style video search with RAG, boolean queries, and tree-based refinement.

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.

🌟 Features

  • Natural Language Search: "person wearing white clothes" → exact timestamps
  • Boolean Queries: "red car AND bicycle" → timestamps where BOTH appear together
  • Akinator Tree Refinement: Too many results? The system asks discriminative questions to narrow down (indoor/outdoor? day/night? etc.)
  • RAG Answers: Generates grounded answers citing specific timestamps
  • Multi-Channel Fusion: Combines visual similarity + caption search + object detection

๐Ÿ—๏ธ Architecture

Video → Frame Extraction (1fps)
         │
         ├─► Grounding DINO → Object detection with attributes
         │   (detects "person in white shirt", "red car", etc.)
         │   → SQLite structured DB
         │
         ├─► SigLIP2 → Frame embeddings (1152-dim)
         │   → FAISS vector index
         │
         └─► Gemini 2.0 Flash → Dense captions
             → Gemini text-embedding-004 → Caption embeddings (768-dim)
             → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
         │
         ├─► Visual search (SigLIP2 FAISS)
         ├─► Caption search (Gemini FAISS)
         └─► Detection search (SQL)
              │
              ▼
         Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
              │
              ▼
         Akinator Refinement (if too many results)
              │
              ▼
         RAG Answer Generation (Gemini)
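The fusion and boolean steps of this query path can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the channel weights and the per-channel score dictionaries are invented for the example.

```python
from collections import defaultdict

# Illustrative channel weights; the real values live in the project's config.
WEIGHTS = {"visual": 0.4, "caption": 0.4, "detection": 0.2}

def fuse_channels(channel_scores):
    """Combine {channel: {timestamp: score}} into one weighted score per timestamp."""
    fused = defaultdict(float)
    for channel, scores in channel_scores.items():
        weight = WEIGHTS.get(channel, 0.0)
        for ts, score in scores.items():
            fused[ts] += weight * score
    return dict(fused)

def boolean_and(results_a, results_b):
    """Keep only timestamps where BOTH sub-queries matched; score = min of the two."""
    common = results_a.keys() & results_b.keys()
    return {ts: min(results_a[ts], results_b[ts]) for ts in common}

def rank(fused, top_k=10):
    """Sort fused results by score, highest first."""
    return sorted(fused.items(), key=lambda kv: -kv[1])[:top_k]
```

For "red car AND bicycle", each sub-query is fused across channels independently, then `boolean_and` intersects their timestamps before ranking.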

🚀 Quick Start

1. Clone the repo

git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform

2. Install dependencies

pip install -r requirements.txt

Note: Requires transformers >= 4.49 (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.

3. Get a Gemini API key (free)

4. Launch the UI

export GEMINI_API_KEY="your-key-here"
python app.py

5. Or use the CLI

# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY

# Search
python app.py --search "red car" --api-key YOUR_KEY

📋 Models Used

Component          Model                       Size     Runs On
Frame Embeddings   SigLIP2                     ~1.5GB   CPU ✅ / GPU
Object Detection   Grounding DINO              ~657MB   CPU ✅ / GPU
Captioning         Gemini 2.0 Flash            API      Cloud
Text Embeddings    Gemini text-embedding-004   API      Cloud
Query/RAG          Gemini 2.0 Flash            API      Cloud

🔧 API Verification (Apr 2026)

All model APIs verified against transformers 5.6.2 and google-genai 1.73.1:

SigLIP2 (google/siglip2-so400m-patch14-384)

  • AutoModel / AutoProcessor → resolves to SiglipModel / SiglipProcessor
  • model.get_image_features(**inputs) returns BaseModelOutputWithPooling (.pooler_output = [B, 1152])
  • Text input must use padding="max_length" (training requirement)
  • Uses sigmoid (not softmax) for similarity scores
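A usage sketch consistent with the notes above. The model-loading section is gated behind an environment variable because it downloads ~1.5GB of weights; `frame.jpg`, the prompts, and the env-var name are placeholders.

```python
import math
import os

def sigmoid(x):
    # SigLIP2 scores each image-text pair independently with a sigmoid,
    # rather than normalizing over all texts with a softmax.
    return 1.0 / (1.0 + math.exp(-x))

if os.environ.get("RUN_SIGLIP2_DEMO"):  # gated: downloads ~1.5GB of weights
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model_id = "google/siglip2-so400m-patch14-384"
    model = AutoModel.from_pretrained(model_id)          # resolves to SiglipModel
    processor = AutoProcessor.from_pretrained(model_id)  # resolves to SiglipProcessor

    image = Image.open("frame.jpg")
    texts = ["a person wearing white clothes", "a red car"]

    # padding="max_length" is required for text inputs (training-time convention).
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One sigmoid score per (image, text) pair.
    scores = torch.sigmoid(outputs.logits_per_image)
    print(scores)
```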

Grounding DINO (IDEA-Research/grounding-dino-tiny)

  • AutoModelForZeroShotObjectDetection / AutoProcessor → resolves to GroundingDinoForObjectDetection / GroundingDinoProcessor
  • Processor accepts text as str, list[str], or list[list[str]] (auto-converted internally)
  • post_process_grounded_object_detection: threshold kwarg (not box_threshold), input_ids optional
  • Returns dict with both "text_labels" and "labels" keys
  • target_sizes expects (height, width) tuples
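A matching usage sketch, gated behind an env var to avoid the ~657MB download; `frame.jpg` and the prompt are placeholders, and `to_target_size` is just an illustrative helper encoding the (height, width) convention noted above.

```python
import os

def to_target_size(pil_size):
    # PIL reports (width, height); post-processing expects (height, width).
    width, height = pil_size
    return (height, width)

if os.environ.get("RUN_DINO_DEMO"):  # gated: downloads ~657MB of weights
    import torch
    from PIL import Image
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    model_id = "IDEA-Research/grounding-dino-tiny"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open("frame.jpg")
    # A plain string works; list[str] and list[list[str]] are accepted too.
    text = "person in white shirt. red car."

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # The kwarg is `threshold`, not `box_threshold`; input_ids may be omitted.
    results = processor.post_process_grounded_object_detection(
        outputs, threshold=0.3, target_sizes=[to_target_size(image.size)]
    )[0]
    for label, score, box in zip(results["text_labels"], results["scores"], results["boxes"]):
        print(f"{label}: {score:.2f} at {box.tolist()}")
```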

Gemini (google-genai SDK)

  • Uses google.genai (NOT the deprecated google.generativeai)
  • genai.Client(api_key=...) → client.models.generate_content(...), client.models.embed_content(...)
  • types.Part.from_bytes(data=..., mime_type=...), types.Part.from_text(text=...)
  • Embedding is text-only: it cannot embed images or video directly
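A usage sketch of the calls above, gated behind the API key. The model names follow the table earlier in this card; `mime_for` and `frame.jpg` are illustrative, not part of the SDK.

```python
import os

def mime_for(path):
    # Illustrative helper: map a frame file's extension to a MIME type.
    ext = path.rsplit(".", 1)[-1].lower()
    return {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png"}.get(
        ext, "application/octet-stream"
    )

if os.environ.get("GEMINI_API_KEY"):  # gated: requires a real API key
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    # Caption a frame: image bytes plus a text prompt in one request.
    with open("frame.jpg", "rb") as f:
        frame_bytes = f.read()
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=frame_bytes, mime_type=mime_for("frame.jpg")),
            types.Part.from_text(text="Describe everything visible in this frame."),
        ],
    )
    print(response.text)

    # Embedding is text-only: embed the caption, never the raw image.
    embedding = client.models.embed_content(
        model="text-embedding-004", contents=response.text
    )
```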

🌳 How the Akinator Tree Works

When a search returns too many results (>10), the system:

  1. Extracts attributes from all candidate frames (objects, colors, location, time, actions)
  2. Computes information gain for each attribute (same algorithm as decision trees!)
  3. Asks the most discriminative question (e.g., "Indoor or outdoor?")
  4. Splits results based on your answer
  5. Repeats until results are manageable
"Found 47 clips with people"
    โ”‚
    โ–ผ
"Indoor or outdoor?" โ†’ Outdoor (24 clips)
    โ”‚
    โ–ผ
"Daytime or nighttime?" โ†’ Daytime (15 clips)
    โ”‚
    โ–ผ
"What color clothing?" โ†’ White (6 clips) โœ… Done!

🔮 Future: TPU Training

The platform is designed for future fine-tuning on TPU:

  • VLM2Vec-V2 (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
  • TimeLens recipe (GRPO/RLVR) for temporal grounding
  • Uses accelerate + FSDPv2 with bf16 on TPU v5e

📚 Based On Research

Paper            What We Use
AVA              Event Knowledge Graphs + semantic chunking
VideoRAG         Dual-channel retrieval architecture
ForeSea          Attribute-based forensic search
TimeLens         Temporal grounding recipes
SigLIP2          Frame-text shared embeddings
Grounding DINO   Open-vocabulary attribute detection

โš ๏ธ Troubleshooting

"Could not import module 'AutoProcessor'"

This means your transformers version is too old. SigLIP2 requires >= 4.49:

pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny

Out of Memory during model loading

SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:

  • Set device="cpu" in config (default)
  • Close other memory-heavy applications
  • Consider using only one model at a time

Gemini rate limiting

The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider:

  • Increasing caption_every_n (e.g., 5 = caption every 5th frame)
  • Using a paid Gemini API tier
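The 4-second delay corresponds directly to the ~15 requests/minute cap (60 / 15 = 4). A minimal throttling sketch; the function names are illustrative, not the project's API:

```python
import time

def delay_for_rpm(requests_per_minute):
    # Minimum spacing (seconds) between calls to stay under a per-minute cap.
    return 60.0 / requests_per_minute

def run_throttled(calls, requests_per_minute=15):
    # Run zero-argument callables in order, sleeping between consecutive calls.
    delay = delay_for_rpm(requests_per_minute)
    results = []
    for i, call in enumerate(calls):
        if i:  # no sleep before the first call
            time.sleep(delay)
        results.append(call())
    return results
```

Raising `caption_every_n` reduces the number of calls directly, which matters more than the spacing for long videos.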

๐Ÿ“ Project Structure

video_intelligence/
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
requirements.txt
requirements.txt

License

MIT
