# 🎬 Video Intelligence Platform

Akinator-style video search with RAG, boolean queries, and tree-based refinement.

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
## ✨ Features

- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow things down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection
## 🏗️ Architecture

```
Video → Frame Extraction (1fps)
  │
  ├──▶ Grounding DINO → Object detection with attributes
  │      (detects "person in white shirt", "red car", etc.)
  │      → SQLite structured DB
  │
  ├──▶ SigLIP2 → Frame embeddings (1152-dim)
  │      → FAISS vector index
  │
  └──▶ Gemini 2.0 Flash → Dense captions
         Gemini text-embedding-004 → Caption embeddings (768-dim)
         → FAISS vector index
```
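Indexing starts from frames sampled at 1 fps. As a rough illustration of that first step (the actual `frame_extractor.py` may differ), here is a minimal OpenCV sketch:

```python
# Minimal 1 fps frame sampling with OpenCV; illustrative sketch, not
# necessarily what frame_extractor.py does.
import cv2

def extract_frames(video_path: str, fps: float = 1.0):
    """Yield (timestamp_seconds, frame) pairs sampled at roughly `fps`."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(int(round(native_fps / fps)), 1)     # keep every step-th frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / native_fps, frame
        idx += 1
    cap.release()

for ts, frame in extract_frames("video.mp4"):
    cv2.imwrite(f"frame_{int(ts):06d}.jpg", frame)  # one JPEG per sampled second
```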
```
Query → Gemini (decompose boolean) → Sub-queries
  │
  ├──▶ Visual search (SigLIP2 FAISS)
  ├──▶ Caption search (Gemini FAISS)
  └──▶ Detection search (SQL)
         │
         ▼
Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
         │
         ▼
Akinator Refinement (if too many results)
         │
         ▼
RAG Answer Generation (Gemini)
```
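In miniature, fusion and the boolean ops look like this. The sketch below assumes each channel returns a `{timestamp: score}` dict with scores normalized to [0, 1]; the real `query_engine.py` may weight and normalize differently:

```python
# Fusion + boolean AND sketch. Channel weights and the min-based AND are
# illustrative assumptions, not the exact query_engine.py logic.

def fuse_channels(visual: dict, caption: dict, detection: dict,
                  weights=(0.4, 0.4, 0.2)) -> dict:
    """Weighted sum of per-channel scores for every timestamp seen."""
    fused: dict = {}
    for channel, w in zip((visual, caption, detection), weights):
        for ts, score in channel.items():
            fused[ts] = fused.get(ts, 0.0) + w * score
    return fused

def boolean_and(a: dict, b: dict) -> dict:
    """AND keeps timestamps present in both result sets, scored by the
    weaker match, so every sub-query must be confident there."""
    return {ts: min(a[ts], b[ts]) for ts in a.keys() & b.keys()}

# "red car AND bicycle" -> decompose, search each sub-query, intersect
red_car = fuse_channels({12.0: 0.9}, {12.0: 0.8}, {12.0: 1.0})
bicycle = fuse_channels({12.0: 0.7, 40.0: 0.9}, {12.0: 0.6}, {})
ranked = sorted(boolean_and(red_car, bicycle).items(), key=lambda kv: -kv[1])
print(ranked)  # only 12.0 survives: the one timestamp where BOTH appear
```

OR would be the analogous union with `max` instead of `min`.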
## 🚀 Quick Start

**1. Clone the repo**

```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```

**2. Install dependencies**

```bash
pip install -r requirements.txt
```

Note: Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB); a machine with ≥8GB RAM is recommended.

**3. Get a Gemini API key (free)**

- Go to https://aistudio.google.com/apikey
- Create a free API key

**4. Launch the UI**

```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```

**5. Or use the CLI**

```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY

# Search
python app.py --search "red car" --api-key YOUR_KEY
```
## 📊 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | SigLIP2 | ~1.5GB | CPU ✓ / GPU |
| Object Detection | Grounding DINO | ~657MB | CPU ✓ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |
## 🔧 API Verification (Apr 2026)

All model APIs verified against `transformers` 5.6.2 and `google-genai` 1.73.1:

### SigLIP2 (`google/siglip2-so400m-patch14-384`)

- `AutoModel`/`AutoProcessor` resolve to `SiglipModel`/`SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input must use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores
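Putting those notes together, a minimal usage sketch (illustrative; not the exact `visual_encoders.py` code, and the frame filename is a placeholder):

```python
# SigLIP2 sketch per the notes above.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)          # -> SiglipModel
processor = AutoProcessor.from_pretrained(model_id)  # -> SiglipProcessor

image = Image.open("frame_0001.jpg")
texts = ["a person wearing white clothes", "a red car"]

# padding="max_length" is required: SigLIP2 was trained with fixed-length text
inputs = processor(text=texts, images=image,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    # Frame embedding for the FAISS index (1152-dim)
    feats = model.get_image_features(pixel_values=inputs.pixel_values)
    feats = getattr(feats, "pooler_output", feats)  # plain tensor on older versions
    # Frame-text similarity: sigmoid per text, NOT softmax across texts
    scores = torch.sigmoid(model(**inputs).logits_per_image)
```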
### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)

- `AutoModelForZeroShotObjectDetection`/`AutoProcessor` resolve to `GroundingDinoForObjectDetection`/`GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]`; it auto-converts internally
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`); `input_ids` is optional
- Returns a dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples
### Gemini (`google-genai` SDK)

- Uses `google.genai` (NOT the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is text-only; it cannot embed images or video directly
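A minimal `google-genai` sketch combining both calls (model names as used elsewhere in this README; error handling and rate limiting omitted):

```python
# google-genai sketch per the notes above (pip install google-genai).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")

# Dense caption for one frame: image bytes + an instruction
with open("frame_0001.jpg", "rb") as f:
    frame_bytes = f.read()
caption = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="Describe this frame in one dense sentence."),
    ],
)
print(caption.text)

# Caption embedding (text-only, 768-dim) for the second FAISS index
emb = client.models.embed_content(model="text-embedding-004",
                                  contents=caption.text)
vector = emb.embeddings[0].values
```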
## 🌳 How the Akinator Tree Works

When a search returns too many results (>10), the system:

- Extracts attributes from all candidate frames (objects, colors, location, time, actions)
- Computes information gain for each attribute (the same algorithm decision trees use; see the sketch after the example below)
- Asks the most discriminative question (e.g., "Indoor or outdoor?")
- Splits results based on your answer
- Repeats until the results are manageable
"Found 47 clips with people"
โ
โผ
"Indoor or outdoor?" โ Outdoor (24 clips)
โ
โผ
"Daytime or nighttime?" โ Daytime (15 clips)
โ
โผ
"What color clothing?" โ White (6 clips) โ
Done!
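The question picking is ordinary decision-tree math: with a uniform prior over candidate frames, the expected information gain from asking about an attribute equals the entropy of that attribute's value distribution over the candidates, so the most balanced split wins. A minimal sketch (the attribute schema is illustrative, not the actual `akinator.py` one):

```python
# Entropy-based question selection; toy attribute schema for illustration.
import math
from collections import Counter

def split_entropy(frames: list[dict], attribute: str) -> float:
    """Entropy of the attribute's value distribution over the candidates.
    With a uniform prior over frames, this equals the expected information
    gain from asking about the attribute."""
    values = [f[attribute] for f in frames]
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def best_question(frames: list[dict], attributes: list[str]) -> str:
    return max(attributes, key=lambda a: split_entropy(frames, a))

frames = [
    {"location": "outdoor", "time": "day",   "clothing": "white"},
    {"location": "outdoor", "time": "day",   "clothing": "white"},
    {"location": "outdoor", "time": "night", "clothing": "red"},
    {"location": "indoor",  "time": "night", "clothing": "white"},
]
attr = best_question(frames, ["location", "time", "clothing"])
print("Ask about:", attr)  # "time": it splits 2/2, the most even cut
# After the user answers, keep only the matching frames and repeat
frames = [f for f in frames if f[attr] == "night"]
```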
## 🔮 Future: TPU Training

The platform is designed for future fine-tuning on TPU:

- VLM2Vec-V2 (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- TimeLens recipe (GRPO/RLVR) for temporal grounding
- `accelerate` + FSDPv2 with bf16 on TPU v5e
## 📚 Based On Research
| Paper | What We Use |
|---|---|
| AVA | Event Knowledge Graphs + semantic chunking |
| VideoRAG | Dual-channel retrieval architecture |
| ForeSea | Attribute-based forensic search |
| TimeLens | Temporal grounding recipes |
| SigLIP2 | Frame-text shared embeddings |
| Grounding DINO | Open-vocab attribute detection |
## ⚠️ Troubleshooting

### "Could not import module 'AutoProcessor'"

Your `transformers` version is too old; SigLIP2 requires >= 4.49:

```bash
pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```

### Out of memory during model loading

SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB of RAM for weights alone. If your system has < 8GB RAM:

- Set `device="cpu"` in the config (the default)
- Close other memory-heavy applications
- Consider loading only one model at a time

### Gemini rate limiting

The free tier allows ~15 requests/minute, so the pipeline adds a 4-second delay between captioning calls. For longer videos, consider:

- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier
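For reference, the knobs mentioned above live in `config.py`. A hypothetical sketch of its shape; apart from `device` and `caption_every_n`, the field names here are guesses:

```python
# Hypothetical config sketch; check config.py for the real field names.
from dataclasses import dataclass

@dataclass
class Config:
    device: str = "cpu"            # "cuda" if a GPU is available
    fps: float = 1.0               # frames extracted per second of video
    caption_every_n: int = 1       # 5 = caption every 5th extracted frame
    caption_delay_s: float = 4.0   # pause between Gemini calls (free tier)
```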
## 📁 Project Structure

```
video_intelligence/
├── __init__.py           # Package init
├── config.py             # Configuration dataclass
├── frame_extractor.py    # OpenCV frame extraction
├── gemini_client.py      # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py    # SigLIP2 + Grounding DINO
├── index_store.py        # SQLite + FAISS index
├── query_engine.py       # Multi-channel search + boolean ops + fusion
├── akinator.py           # Decision-tree refinement
├── pipeline.py           # End-to-end indexing orchestrator
└── app.py                # Gradio UI
app.py                    # Entry point (CLI + UI)
requirements.txt
```
## License
MIT