---
title: MoodSyncAI
emoji: π
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---
# π MoodSyncAI

**Multi-Modal Sentiment & Emotion Analyser**: combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.

All models are **100% free & open-source** (Hugging Face Hub).
## Components
| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | – | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
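
All of the checkpoints above live on the Hugging Face Hub, so they load through standard `transformers` pipelines. A minimal sketch under that assumption; the variable names and the example inputs (`face.jpg`, `clip.wav`) are illustrative, not taken from `app.py`:

```python
# Sketch: loading the checkpoints from the table with transformers pipelines.
# Variable names and example inputs are illustrative, not app.py's actual code.
from transformers import pipeline

# Visual emotion channel: ViT face-expression classifier
face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

# Text channel: DistilRoBERTa emotion classifier; top_k=None returns scores for all labels
text_clf = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base",
                    top_k=None)

# Audio channel: Whisper tiny speech-to-text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Generative stage: FLAN-T5 writes the plain-language summary
summariser = pipeline("text2text-generation", model="google/flan-t5-base")

face_scores = face_clf("face.jpg")                      # [{'label': ..., 'score': ...}, ...]
text_scores = text_clf("I'm absolutely fine, really.")  # scores for every emotion label
transcript  = asr("clip.wav")["text"]                   # plain-text transcription
```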
## Run
**Prerequisite:** Python **3.10 – 3.13** (CPU is enough; no GPU and no system ffmpeg required).

```powershell
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```
Browser opens at `http://127.0.0.1:7860`.

**To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.
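
`127.0.0.1:7860` is simply Gradio's default local address. A minimal sketch of the kind of launch call `app.py` presumably ends with; the `demo` name and the explicit flags are assumptions, not a quote from the app:

```python
# Sketch only: how a Gradio app typically binds to 127.0.0.1:7860.
# `demo` stands in for whatever Blocks/Interface object app.py actually builds.
import gradio as gr

with gr.Blocks(title="MoodSyncAI") as demo:
    ...  # tabs, components and callbacks go here

if __name__ == "__main__":
    # 7860 is Gradio's default port; inbrowser=True opens a browser tab automatically
    demo.launch(server_name="127.0.0.1", server_port=7860, inbrowser=True)
```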
**First launch only:** downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).

That's it: no system packages, no ffmpeg, no GPU, no model files to download manually.
## Tabs
1. **🖼️ Image + Text** → upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
2. **📹 Webcam / Video + Text** → record a 3–10 s clip in the browser → per-frame emotion **timeline chart** (sketched in code after this list), aggregated bars, fusion, summary.
3. **🎙️ Audio + Image** → record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
4. **🎬 Video with Audio** → record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, and fed to the text classifier; frames produce the visual timeline; fused result + summary, no typing needed.
5. **ℹ️ About** → architecture & fusion logic.
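
A rough illustration of how the video tabs can build that timeline: OpenCV samples every n-th frame, the ViT classifier scores each sampled frame, and the resulting (timestamp, emotion) points feed the Plotly chart. The sampling interval, function names and record layout below are assumptions, not the app's actual implementation:

```python
# Sketch: frame sampling + per-frame emotion scoring for the video timeline.
# every_n, the function names and the record layout are illustrative assumptions.
import cv2
from PIL import Image
from transformers import pipeline

face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

def sample_frames(video_path: str, every_n: int = 10):
    """Yield (timestamp_in_seconds, RGB frame) for every n-th frame of the clip."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx / fps, cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()

# One point per sampled frame: timestamp, top emotion label, its confidence.
timeline = []
for t, rgb in sample_frames("clip.mp4"):
    top = face_clf(Image.fromarray(rgb))[0]
    timeline.append({"t": t, "emotion": top["label"], "score": top["score"]})
```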
## Fusion / mismatch rule
Each modality's emotion distribution is mapped to a **valence** in `[-1, +1]` (sketched below).

- Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
- Small delta → **ALIGNED** (green 🟢)
- Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)
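
A minimal sketch of one way this rule could be implemented; the per-emotion valence weights and the `0.3` "small delta" threshold are illustrative assumptions, not the app's actual numbers:

```python
# Sketch of the valence-based fusion rule. The VALENCE weights and the 0.3 threshold
# are illustrative assumptions; both classifiers' label spellings are covered here.
VALENCE = {"joy": 1.0, "happy": 1.0, "surprise": 0.3, "neutral": 0.0,
           "sadness": -0.7, "sad": -0.7, "fear": -0.8, "disgust": -0.8,
           "anger": -1.0, "angry": -1.0}

def valence(scores: dict[str, float]) -> float:
    """Collapse an emotion probability distribution into a single value in [-1, +1]."""
    return sum(VALENCE.get(label, 0.0) * p for label, p in scores.items())

def fuse(visual_scores: dict[str, float], text_scores: dict[str, float]) -> str:
    v_vis, v_txt = valence(visual_scores), valence(text_scores)
    if v_vis * v_txt < 0:             # opposite-sign valences
        return "MISMATCH DETECTED"
    if abs(v_vis - v_txt) < 0.3:      # small delta
        return "ALIGNED"
    return "PARTIALLY ALIGNED"

# Example: a smiling face paired with a sad sentence trips the mismatch rule.
print(fuse({"happy": 0.9, "neutral": 0.1}, {"sadness": 0.8, "neutral": 0.2}))
```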
The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
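
The prompt template itself is not spelled out here, but the structured signals would be flattened into text along roughly these lines (the wording, scores and generation settings are illustrative assumptions):

```python
# Sketch: prompting FLAN-T5 with the structured fusion signals.
# The prompt wording, numbers and generation settings are illustrative assumptions.
from transformers import pipeline

summariser = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "You are an empathetic assistant. Facial emotion: happy (0.82). "
    "Text emotion: sadness (0.74). Transcript: 'I'm fine, honestly.' "
    "Fusion verdict: MISMATCH DETECTED. "
    "Write 2-3 sentences describing this person's likely emotional state."
)
summary = summariser(prompt, max_new_tokens=96, do_sample=False)[0]["generated_text"]
print(summary)
```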