# 🎬 Video Intelligence Platform

Akinator-style video search with RAG, boolean queries, and tree-based refinement.

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
## ✨ Features

- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow things down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection
## 🏗️ Architecture

```
Video → Frame Extraction (1fps)
  │
  ├──▶ Grounding DINO → Object detection with attributes
  │      (detects "person in white shirt", "red car", etc.)
  │      → SQLite structured DB
  │
  ├──▶ SigLIP2 → Frame embeddings (1152-dim)
  │      → FAISS vector index
  │
  └──▶ Gemini 2.0 Flash → Dense captions
         Gemini text-embedding-004 → Caption embeddings (768-dim)
         → FAISS vector index
```
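Indexing starts from frames sampled at 1 fps. As a rough illustration of that first step (the actual `frame_extractor.py` may differ), here is a minimal OpenCV sketch:

```python
# Minimal 1 fps frame sampling with OpenCV; illustrative sketch, not
# necessarily what frame_extractor.py does.
import cv2

def extract_frames(video_path: str, fps: float = 1.0):
    """Yield (timestamp_seconds, frame) pairs sampled at roughly `fps`."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(int(round(native_fps / fps)), 1)     # keep every step-th frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / native_fps, frame
        idx += 1
    cap.release()

for ts, frame in extract_frames("video.mp4"):
    cv2.imwrite(f"frame_{int(ts):06d}.jpg", frame)  # one JPEG per sampled second
```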
```
Query → Gemini (decompose boolean) → Sub-queries
  │
  ├──▶ Visual search (SigLIP2 FAISS)
  ├──▶ Caption search (Gemini FAISS)
  └──▶ Detection search (SQL)
         │
         ▼
Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
         │
         ▼
Akinator Refinement (if too many results)
         │
         ▼
RAG Answer Generation (Gemini)
```
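In miniature, fusion and the boolean ops look like this. The sketch below assumes each channel returns a `{timestamp: score}` dict with scores normalized to [0, 1]; the real `query_engine.py` may weight and normalize differently:

```python
# Fusion + boolean AND sketch. Channel weights and the min-based AND are
# illustrative assumptions, not the exact query_engine.py logic.

def fuse_channels(visual: dict, caption: dict, detection: dict,
                  weights=(0.4, 0.4, 0.2)) -> dict:
    """Weighted sum of per-channel scores for every timestamp seen."""
    fused: dict = {}
    for channel, w in zip((visual, caption, detection), weights):
        for ts, score in channel.items():
            fused[ts] = fused.get(ts, 0.0) + w * score
    return fused

def boolean_and(a: dict, b: dict) -> dict:
    """AND keeps timestamps present in both result sets, scored by the
    weaker match, so every sub-query must be confident there."""
    return {ts: min(a[ts], b[ts]) for ts in a.keys() & b.keys()}

# "red car AND bicycle" -> decompose, search each sub-query, intersect
red_car = fuse_channels({12.0: 0.9}, {12.0: 0.8}, {12.0: 1.0})
bicycle = fuse_channels({12.0: 0.7, 40.0: 0.9}, {12.0: 0.6}, {})
ranked = sorted(boolean_and(red_car, bicycle).items(), key=lambda kv: -kv[1])
print(ranked)  # only 12.0 survives: the one timestamp where BOTH appear
```

OR would be the analogous union with `max` instead of `min`.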
## 🚀 Quick Start

**1. Clone the repo**

```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```

**2. Install dependencies**

```bash
pip install -r requirements.txt
```

Note: Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB); a machine with ≥8GB RAM is recommended.

**3. Get a Gemini API key (free)**

- Go to https://aistudio.google.com/apikey
- Create a free API key

**4. Launch the UI**

```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```

**5. Or use the CLI**

```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY

# Search
python app.py --search "red car" --api-key YOUR_KEY
```
## 📊 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | SigLIP2 | ~1.5GB | CPU ✓ / GPU |
| Object Detection | Grounding DINO | ~657MB | CPU ✓ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |
## 🔧 API Verification (Apr 2026)

All model APIs verified against `transformers` 5.6.2 and `google-genai` 1.73.1:

### SigLIP2 (`google/siglip2-so400m-patch14-384`)

- `AutoModel`/`AutoProcessor` resolve to `SiglipModel`/`SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input must use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores
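Putting those notes together, a minimal usage sketch (illustrative; not the exact `visual_encoders.py` code, and the frame filename is a placeholder):

```python
# SigLIP2 sketch per the notes above.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)          # -> SiglipModel
processor = AutoProcessor.from_pretrained(model_id)  # -> SiglipProcessor

image = Image.open("frame_0001.jpg")
texts = ["a person wearing white clothes", "a red car"]

# padding="max_length" is required: SigLIP2 was trained with fixed-length text
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    # Frame embedding for the FAISS index (1152-dim)
    feats = model.get_image_features(pixel_values=inputs.pixel_values)
    feats = getattr(feats, "pooler_output", feats)  # plain tensor on older versions
    # Frame-text similarity: sigmoid per text, NOT softmax across texts
    scores = torch.sigmoid(model(**inputs).logits_per_image)
```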
### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)

- `AutoModelForZeroShotObjectDetection`/`AutoProcessor` resolve to `GroundingDinoForObjectDetection`/`GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]`; it auto-converts internally
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`); `input_ids` is optional
- Returns a dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples
### Gemini (`google-genai` SDK)

- Uses `google.genai` (NOT the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is text-only; it cannot embed images or video directly
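A minimal `google-genai` sketch combining both calls (model names as used elsewhere in this README; error handling and rate limiting omitted):

```python
# google-genai sketch per the notes above (pip install google-genai).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")

# Dense caption for one frame: image bytes + an instruction
with open("frame_0001.jpg", "rb") as f:
    frame_bytes = f.read()
caption = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="Describe this frame in one dense sentence."),
    ],
)
print(caption.text)

# Caption embedding (text-only, 768-dim) for the second FAISS index
emb = client.models.embed_content(model="text-embedding-004",
                                  contents=caption.text)
vector = emb.embeddings[0].values
```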
## 🌳 How the Akinator Tree Works

When a search returns too many results (>10), the system:

- Extracts attributes from all candidate frames (objects, colors, location, time, actions)
- Computes information gain for each attribute (the same algorithm decision trees use; see the sketch after the example below)
- Asks the most discriminative question (e.g., "Indoor or outdoor?")
- Splits results based on your answer
- Repeats until the results are manageable
"Found 47 clips with people"
โ
โผ
"Indoor or outdoor?" โ Outdoor (24 clips)
โ
โผ
"Daytime or nighttime?" โ Daytime (15 clips)
โ
โผ
"What color clothing?" โ White (6 clips) โ
Done!
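The question picking is ordinary decision-tree math: with a uniform prior over candidate frames, the expected information gain from asking about an attribute equals the entropy of that attribute's value distribution over the candidates, so the most balanced split wins. A minimal sketch (the attribute schema is illustrative, not the actual `akinator.py` one):

```python
# Entropy-based question selection; toy attribute schema for illustration.
import math
from collections import Counter

def split_entropy(frames: list[dict], attribute: str) -> float:
    """Entropy of the attribute's value distribution over the candidates.
    With a uniform prior over frames, this equals the expected information
    gain from asking about the attribute."""
    values = [f[attribute] for f in frames]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def best_question(frames: list[dict], attributes: list[str]) -> str:
    return max(attributes, key=lambda a: split_entropy(frames, a))

frames = [
    {"location": "outdoor", "time": "day",   "clothing": "white"},
    {"location": "outdoor", "time": "day",   "clothing": "white"},
    {"location": "outdoor", "time": "night", "clothing": "red"},
    {"location": "indoor",  "time": "night", "clothing": "white"},
]
attr = best_question(frames, ["location", "time", "clothing"])
print("Ask about:", attr)  # "time": it splits 2/2, the most even cut
# After the user answers, keep only the matching frames and repeat
frames = [f for f in frames if f[attr] == "night"]
```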
## 🔮 Future: TPU Training

The platform is designed for future fine-tuning on TPU:

- VLM2Vec-V2 (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- TimeLens recipe (GRPO/RLVR) for temporal grounding
- `accelerate` + FSDPv2 with bf16 on TPU v5e
## 📚 Based On Research
| Paper | What We Use |
|---|---|
| AVA | Event Knowledge Graphs + semantic chunking |
| VideoRAG | Dual-channel retrieval architecture |
| ForeSea | Attribute-based forensic search |
| TimeLens | Temporal grounding recipes |
| SigLIP2 | Frame-text shared embeddings |
| Grounding DINO | Open-vocab attribute detection |
## ⚠️ Troubleshooting

### "Could not import module 'AutoProcessor'"

Your `transformers` version is too old; SigLIP2 requires >= 4.49:

```bash
pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```

### Out of memory during model loading

SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB of RAM for weights alone. If your system has < 8GB RAM:

- Set `device="cpu"` in the config (the default)
- Close other memory-heavy applications
- Consider loading only one model at a time

### Gemini rate limiting

The free tier allows ~15 requests/minute, so the pipeline adds a 4-second delay between captioning calls. For longer videos, consider:

- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier
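For reference, the knobs mentioned above live in `config.py`. A hypothetical sketch of its shape; apart from `device` and `caption_every_n`, the field names here are guesses:

```python
# Hypothetical config sketch; check config.py for the real field names.
from dataclasses import dataclass

@dataclass
class Config:
    device: str = "cpu"            # "cuda" if a GPU is available
    fps: float = 1.0               # frames extracted per second of video
    caption_every_n: int = 1       # 5 = caption every 5th extracted frame
    caption_delay_s: float = 4.0   # pause between Gemini calls (free tier)
```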
## 📁 Project Structure

```
video_intelligence/
├── __init__.py           # Package init
├── config.py             # Configuration dataclass
├── frame_extractor.py    # OpenCV frame extraction
├── gemini_client.py      # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py    # SigLIP2 + Grounding DINO
├── index_store.py        # SQLite + FAISS index
├── query_engine.py       # Multi-channel search + boolean ops + fusion
├── akinator.py           # Decision-tree refinement
├── pipeline.py           # End-to-end indexing orchestrator
└── app.py                # Gradio UI
app.py                    # Entry point (CLI + UI)
requirements.txt
```
## License
MIT