---
title: MoodSyncAI
emoji: 🎭
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---

# 🎭 MoodSyncAI
Multi-Modal Sentiment & Emotion Analyser that combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a webcam / short-video timeline view.

All models are 100% free & open-source (Hugging Face Hub).
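All of them load through the `transformers` pipeline API; here is a minimal sketch (the actual loading code in `app.py` may differ):

```python
from transformers import pipeline

# Facial emotion: ViT fine-tuned on facial expressions
face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

# Text emotion: top_k=None returns the full label distribution, not just the argmax
text_clf = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base",
                    top_k=None)

# Speech-to-text: the smallest Whisper checkpoint, fine on CPU
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Generative summary: instruction-tuned seq2seq model
summariser = pipeline("text2text-generation", model="google/flan-t5-base")
```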
## Components
| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | ViT | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | Transformer | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | Whisper encoder-decoder | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | – | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
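The attention-visualisation row uses attention rollout: head-averaged attention matrices (with residual connections added back) are multiplied across layers to estimate how strongly the CLS token ultimately attends to each image patch. A minimal sketch, assuming the attentions come from a ViT forward pass with `output_attentions=True` (illustrative, not the exact code in `app.py`):

```python
import torch

def attention_rollout(attentions):
    """attentions: tuple of [batch, heads, tokens, tokens] tensors, one per layer."""
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=1)                 # average over heads
        a = a + torch.eye(a.size(-1))        # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)  # renormalise rows
        rollout = a if rollout is None else a @ rollout
    return rollout[0, 0, 1:]                 # CLS attention over patch tokens
```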
## Run

Prerequisite: Python 3.10–3.13 (CPU is enough; no GPU and no system ffmpeg required).
```
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```
Your browser opens at http://127.0.0.1:7860. To stop the app, press Ctrl+C in the terminal running `python app.py`.
First launch only: roughly 1.2 GB of models is downloaded from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).
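If you want to warm that cache ahead of the first launch, something like this works (optional and purely illustrative; `app.py` downloads on demand anyway):

```python
from huggingface_hub import snapshot_download

# Pre-fetch all four model repos into ~/.cache/huggingface/
for repo in ("trpakov/vit-face-expression",
             "j-hartmann/emotion-english-distilroberta-base",
             "openai/whisper-tiny",
             "google/flan-t5-base"):
    snapshot_download(repo)
```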
That's it: no system packages, no ffmpeg, no GPU, no model files to download manually.
## Tabs
- 🖼️ Image + Text – upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. Optional attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
- 📹 Webcam / Video + Text – record a 3–10 s clip in the browser → per-frame emotion timeline chart, aggregated bars, fusion, summary.
- 🎙️ Audio + Image – record/upload audio + a face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
- 🎬 Video with Audio – record/upload a video with sound. Audio is extracted (imageio-ffmpeg; see the sketch after this list), transcribed by Whisper, and fed to the text classifier; frames produce the visual timeline; fused result + summary, no typing needed.
- ℹ️ About – architecture & fusion logic.
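The audio extraction in the last tab needs no system ffmpeg because `imageio-ffmpeg` ships its own binary. A rough sketch of that flow (file names are placeholders; the exact code in `app.py` may differ):

```python
import subprocess
import imageio_ffmpeg
import soundfile as sf
from transformers import pipeline

# imageio-ffmpeg bundles its own ffmpeg binary, so no system install is needed
ffmpeg = imageio_ffmpeg.get_ffmpeg_exe()

# Extract a 16 kHz mono WAV track from the uploaded clip (placeholder file names)
subprocess.run([ffmpeg, "-y", "-i", "clip.mp4", "-vn", "-ac", "1",
                "-ar", "16000", "speech.wav"], check=True)

# Decode the WAV ourselves and hand raw samples to Whisper, again avoiding
# any dependency on a system ffmpeg inside the ASR pipeline
audio, sr = sf.read("speech.wav", dtype="float32")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
transcript = asr({"raw": audio, "sampling_rate": sr})["text"]  # feeds the text channel
```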
## Fusion / mismatch rule
Each modality's emotion distribution is mapped to a valence in [-1, +1].
- Opposite-sign valences → MISMATCH DETECTED (amber 🟠)
- Small delta → ALIGNED (green 🟢)
- Otherwise → PARTIALLY ALIGNED (yellow 🟡)

The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
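A minimal sketch of the rule, with an illustrative valence map and threshold rather than the exact values used in `app.py`:

```python
# Illustrative valence weights per emotion label (assumed, not app.py's exact map)
VALENCE = {"joy": 0.9, "surprise": 0.3, "neutral": 0.0,
           "sadness": -0.7, "fear": -0.8, "anger": -0.8, "disgust": -0.6}

def valence(dist):
    """Collapse an emotion distribution {label: prob} to a scalar in [-1, +1]."""
    return sum(VALENCE.get(label, 0.0) * p for label, p in dist.items())

def fusion_status(visual_dist, text_dist, aligned_delta=0.3):
    v_vis, v_txt = valence(visual_dist), valence(text_dist)
    if v_vis * v_txt < 0:                    # opposite signs -> conflict
        return "MISMATCH DETECTED"
    if abs(v_vis - v_txt) <= aligned_delta:  # close valences -> agreement
        return "ALIGNED"
    return "PARTIALLY ALIGNED"

# A sad face paired with happy words trips the mismatch branch:
print(fusion_status({"sadness": 0.8, "neutral": 0.2}, {"joy": 0.9, "neutral": 0.1}))
```

Both valences and the resulting status label are then folded into the prompt from which flan-t5 writes the summary.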