---
title: MoodSyncAI
emoji: 🎭
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---

# 🎭 MoodSyncAI

**Multi-Modal Sentiment & Emotion Analyser** — combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.

All models are **100% free & open-source** (Hugging Face Hub).

## Components

| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | — | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |

## Run

**Prerequisite:** Python **3.10 – 3.13** (CPU is enough — no GPU required, no system ffmpeg required).

```powershell
# 1. Clone / copy this folder onto the new machine, then:
cd ""

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```

The browser opens at `http://127.0.0.1:7860`.

**To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.

**First launch only:** downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).

That's it — no system packages, no ffmpeg, no GPU, no model files to download manually.

## Tabs

1. **đŸ–ŧī¸ Image + Text** — upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
2. **📹 Webcam / Video + Text** — record a 3–10 s clip in the browser → per-frame emotion **timeline chart**, aggregated bars, fusion, summary.
3. **đŸŽ™ī¸ Audio + Image** — record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
4. **đŸŽŦ Video with Audio** — record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, fed to the text classifier; frames produce the visual timeline; fused result + summary — no typing needed.
5. **â„šī¸ About** — architecture & fusion logic.

## Fusion / mismatch rule

Each modality's emotion distribution is mapped to a **valence** in `[-1, +1]`.

- Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
- Small delta → **ALIGNED** (green đŸŸĸ)
- Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)

The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
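
As a rough illustration of the rule above, here is a minimal Python sketch. The per-label valence weights, the `aligned_delta` threshold, and the function names are assumptions made for this example — the real values and weighting logic live in `app.py`.

```python
# Illustrative sketch of the valence-aligned fusion rule.
# Valence weights and thresholds below are assumptions, not the app's actual values.

VALENCE = {
    "joy": 1.0, "surprise": 0.3, "neutral": 0.0,
    "fear": -0.6, "sadness": -0.8, "anger": -0.9, "disgust": -0.9,
}

def valence(dist: dict[str, float]) -> float:
    """Collapse an emotion probability distribution into a scalar valence in [-1, +1]."""
    return sum(VALENCE.get(label, 0.0) * p for label, p in dist.items())

def fuse(visual: dict[str, float], text: dict[str, float],
         aligned_delta: float = 0.35) -> str:
    """Apply the sign / delta rule: opposite signs -> mismatch, small gap -> aligned."""
    v_vis, v_txt = valence(visual), valence(text)
    if v_vis * v_txt < 0:                    # opposite-sign valences
        return "MISMATCH DETECTED 🟠"
    if abs(v_vis - v_txt) <= aligned_delta:  # small delta
        return "ALIGNED đŸŸĸ"
    return "PARTIALLY ALIGNED 🟡"

# Example: smiling face but negative words -> mismatch
print(fuse({"joy": 0.9, "neutral": 0.1},
           {"sadness": 0.7, "anger": 0.2, "neutral": 0.1}))
```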
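
The summary step can be sketched the same way: feed the structured signals to `google/flan-t5-base` via the `transformers` text2text pipeline. The prompt wording here is purely illustrative; the actual template is defined in `app.py`.

```python
# Sketch of the generative summary step (prompt template is an assumption).
from transformers import pipeline

summariser = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "Visual emotion: joy (0.90). Text emotion: sadness (0.70). "
    "Fusion verdict: MISMATCH DETECTED. "
    "Write 2-3 empathetic sentences describing this person's likely emotional state."
)
print(summariser(prompt, max_new_tokens=80)[0]["generated_text"])
```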