---
title: MoodSyncAI
emoji: 🎭
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---

# 🎭 MoodSyncAI

**Multi-Modal Sentiment & Emotion Analyser**: combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer with mismatch detection, and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.

All models are **100% free & open-source** (Hugging Face Hub).

## Components

| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | – | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
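The classifier stages in the table map directly onto Hugging Face `transformers` pipeline tasks. A minimal loading sketch (the model IDs come from the table above; the stage names and the lazy `load_stage` helper are this sketch's own, not the app's actual code):

```python
from transformers import pipeline

# (pipeline task, model ID) per stage; model IDs copied from the table above.
MODELS = {
    "visual":  ("image-classification", "trpakov/vit-face-expression"),
    "text":    ("text-classification", "j-hartmann/emotion-english-distilroberta-base"),
    "asr":     ("automatic-speech-recognition", "openai/whisper-tiny"),
    "summary": ("text2text-generation", "google/flan-t5-base"),
}

def load_stage(name):
    """Build the transformers pipeline for one stage.

    Weights are downloaded from the Hub on the first call and cached
    under ~/.cache/huggingface/ afterwards.
    """
    task, model_id = MODELS[name]
    return pipeline(task, model=model_id)
```

Loading lazily like this keeps startup fast: each ~100–500 MB model is only fetched when its tab is first used.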

## Run

**Prerequisite:** Python **3.10 – 3.13** (CPU is enough; no GPU required, no system ffmpeg required).

```powershell
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1        # Windows
# source .venv/bin/activate         # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```

Browser opens at `http://127.0.0.1:7860`.

**To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.

**First launch only:** downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).

That's it: no system packages, no ffmpeg, no GPU, no model files to download manually.

## Tabs

1. **🖼️ Image + Text**: upload a face photo and type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face plus per-token attention HTML when the toggle is on.
2. **📹 Webcam / Video + Text**: record a 3–10 s clip in the browser → per-frame emotion **timeline chart**, aggregated bars, fusion, summary.
3. **🎙️ Audio + Image**: record/upload audio plus a face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
4. **🎬 Video with Audio**: record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, and fed to the text classifier; frames produce the visual timeline; fused result + summary, no typing needed.
5. **ℹ️ About**: architecture & fusion logic.
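The "aggregated bars" in tabs 2 and 4 come from averaging the per-frame emotion distributions produced by the visual classifier. A stdlib-only sketch of that aggregation step (the per-frame scores here are made-up stand-ins for real ViT outputs):

```python
def aggregate(frame_dists):
    """Average a list of per-frame emotion distributions into one bar chart."""
    totals = {}
    for dist in frame_dists:
        for label, p in dist.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(frame_dists)
    return {label: s / n for label, s in totals.items()}

# Three sampled frames with illustrative scores:
frames = [{"joy": 0.7, "neutral": 0.3},
          {"joy": 0.5, "neutral": 0.5},
          {"neutral": 0.9, "sadness": 0.1}]
print(aggregate(frames))  # averaged distribution across the sampled frames
```

The same per-frame values, plotted against frame timestamps instead of averaged, give the Plotly timeline.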

## Fusion / mismatch rule

Each modality's emotion distribution is mapped to a **valence** in `[-1, +1]`.

- Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
- Small delta → **ALIGNED** (green 🟢)
- Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)
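A minimal sketch of that rule, assuming each modality's classifier output is a label → probability dict. The per-emotion valence table and the `small_delta` threshold below are illustrative values, not the app's actual numbers:

```python
# Illustrative valence per emotion label; the app's own table may differ.
VALENCE = {"joy": 1.0, "surprise": 0.3, "neutral": 0.0,
           "sadness": -0.7, "fear": -0.8, "anger": -0.9, "disgust": -0.9}

def valence(dist):
    """Probability-weighted valence in [-1, +1] for one modality."""
    return sum(p * VALENCE.get(label, 0.0) for label, p in dist.items())

def fuse(visual_dist, text_dist, small_delta=0.25):
    """Apply the three-way alignment rule from the list above."""
    v, t = valence(visual_dist), valence(text_dist)
    if v * t < 0:                  # opposite signs
        return "MISMATCH DETECTED"
    if abs(v - t) <= small_delta:  # small delta
        return "ALIGNED"
    return "PARTIALLY ALIGNED"

# Happy face + sad words -> opposite-sign valences:
print(fuse({"joy": 0.9, "neutral": 0.1},
           {"sadness": 0.8, "neutral": 0.2}))  # -> MISMATCH DETECTED
```

Weighting by probability (rather than taking only the top label) lets a hesitant 55/45 prediction pull the valence toward zero instead of committing to one pole.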

The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.