---
title: MoodSyncAI
emoji: 🎭
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---

# 🎭 MoodSyncAI

Multi-Modal Sentiment & Emotion Analyser: combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a webcam / short-video timeline view.

All models are 100% free & open-source (Hugging Face Hub).

## Components

| Stage | Model | Type | Requirement satisfied |
| --- | --- | --- | --- |
| Visual emotion | `trpakov/vit-face-expression` | ViT | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | Transformer | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | Whisper encoder-decoder | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | n/a | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
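
All four model stages are plain Hugging Face checkpoints, so they can be loaded through the standard `transformers` pipeline API. A minimal sketch (the variable names and the `top_k=None` option are illustrative, not the app's actual code):

```python
from transformers import pipeline

# Visual emotion: image classification over the face photo / video frames
face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

# Text emotion: top_k=None returns the full score distribution, not just the top label
text_clf = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base",
                    top_k=None)

# Speech-to-text for the audio channel
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Generative summary (seq2seq)
summariser = pipeline("text2text-generation", model="google/flan-t5-base")
```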

## Run

Prerequisite: Python 3.10–3.13 (CPU is enough; no GPU required, no system ffmpeg required).

```bash
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1        # Windows
# source .venv/bin/activate         # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```

The app opens in your browser at http://127.0.0.1:7860.

To stop it, press Ctrl+C in the terminal running `python app.py`.

First launch only: ~1.2 GB of models are downloaded from Hugging Face into `~/.cache/huggingface/` (cached for all future runs; fully offline afterwards).
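
If you'd rather keep the cache elsewhere (e.g. a larger drive), the standard `HF_HOME` environment variable relocates it. A sketch assuming a bash shell (PowerShell users would set `$env:HF_HOME` instead):

```bash
# optional: relocate the Hugging Face cache before the first launch
export HF_HOME=/path/to/big-drive/hf-cache
python app.py
```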

That's it: no system packages, no ffmpeg, no GPU, no model files to download manually.

## Tabs

  1. πŸ–ΌοΈ Image + Text β€” upload a face photo + type the spoken sentence β†’ visual emotion bars, text emotion bars, fusion badge, generative summary. Optional attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
  2. πŸ“Ή Webcam / Video + Text β€” record a 3–10 s clip in the browser β†’ per-frame emotion timeline chart, aggregated bars, fusion, summary.
  3. πŸŽ™οΈ Audio + Image β€” record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
  4. 🎬 Video with Audio β€” record/upload a video with sound. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, fed to the text classifier; frames produce the visual timeline; fused result + summary β€” no typing needed.
  5. ℹ️ About β€” architecture & fusion logic.
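
For tabs 2 and 4, a minimal sketch of the two media-handling steps mentioned above. The function names, sampling rate, and ffmpeg flags are illustrative assumptions, not the app's actual code:

```python
import subprocess

import cv2
import imageio_ffmpeg

def sample_frames(video_path: str, every_n: int = 10, max_frames: int = 30):
    """Grab every Nth frame (as RGB) for the per-frame emotion timeline."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes BGR
        i += 1
    cap.release()
    return frames

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    """Extract the audio track with the ffmpeg binary bundled by imageio-ffmpeg
    (16 kHz mono is what Whisper expects)."""
    ffmpeg = imageio_ffmpeg.get_ffmpeg_exe()
    subprocess.run([ffmpeg, "-y", "-i", video_path, "-ar", "16000", "-ac", "1", wav_path],
                   check=True)
    return wav_path
```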

## Fusion / mismatch rule

Each modality's emotion distribution is mapped to a valence in [-1, +1], and the two valences are compared (a code sketch follows the list):

- Opposite-sign valences → MISMATCH DETECTED (amber 🟠)
- Small delta → ALIGNED (green 🟢)
- Otherwise → PARTIALLY ALIGNED (yellow 🟡)
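
A compact sketch of this rule; the per-emotion valence table and the `delta_small` threshold are illustrative assumptions, and the app's actual constants may differ:

```python
# assumed valence per emotion label (not the app's actual mapping)
VALENCE = {"joy": 1.0, "surprise": 0.3, "neutral": 0.0,
           "fear": -0.7, "sadness": -0.8, "anger": -0.9, "disgust": -0.9}

def valence(dist: dict[str, float]) -> float:
    """Probability-weighted valence of an emotion distribution."""
    return sum(p * VALENCE.get(label, 0.0) for label, p in dist.items())

def fusion_badge(v_face: float, v_text: float, delta_small: float = 0.25) -> str:
    """Compare two valences in [-1, +1] and return the alignment badge."""
    if v_face * v_text < 0:                  # opposite signs
        return "MISMATCH DETECTED 🟠"
    if abs(v_face - v_text) <= delta_small:  # close valences
        return "ALIGNED 🟢"
    return "PARTIALLY ALIGNED 🟡"
```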

The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
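
For illustration, a hypothetical version of that prompt, reusing the `summariser` pipeline and `fusion_badge` sketches above (the real wording lives in `app.py`):

```python
def summarise(face_label: str, v_face: float, text_label: str, v_text: float) -> str:
    # Structured signals go into the prompt; flan-t5-base writes the summary.
    prompt = (
        f"Facial emotion: {face_label} (valence {v_face:+.2f}). "
        f"Spoken/text emotion: {text_label} (valence {v_text:+.2f}). "
        f"Alignment: {fusion_badge(v_face, v_text)}. "
        "Write a 2-3 sentence empathetic summary of this person's emotional state."
    )
    return summariser(prompt, max_new_tokens=80)[0]["generated_text"]
```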