---
title: MoodSyncAI
emoji: π
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Multi-modal emotion analyser (face, text, audio)
---
# π MoodSyncAI

**Multi-Modal Sentiment & Emotion Analyser**: combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.

All models are **100% free & open-source** (Hugging Face Hub).
## Components
| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | – | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
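
All of the checkpoints above live on the Hugging Face Hub, so they load through standard `transformers` pipelines. A minimal sketch under that assumption; the variable names and the example inputs (`face.jpg`, `clip.wav`) are illustrative, not taken from `app.py`:

```python
# Sketch: loading the checkpoints from the table with transformers pipelines.
# Variable names and example inputs are illustrative, not app.py's actual code.
from transformers import pipeline

# Visual emotion channel: ViT face-expression classifier
face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

# Text channel: DistilRoBERTa emotion classifier; top_k=None returns scores for all labels
text_clf = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base",
                    top_k=None)

# Audio channel: Whisper tiny speech-to-text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Generative stage: FLAN-T5 writes the plain-language summary
summariser = pipeline("text2text-generation", model="google/flan-t5-base")

face_scores = face_clf("face.jpg")                      # [{'label': ..., 'score': ...}, ...]
text_scores = text_clf("I'm absolutely fine, really.")  # scores for every emotion label
transcript  = asr("clip.wav")["text"]                   # plain-text transcription
```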
## Run
**Prerequisite:** Python **3.10 – 3.13** (CPU is enough; no GPU and no system ffmpeg required).

```powershell
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```
Browser opens at `http://127.0.0.1:7860`.

**To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.
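
`127.0.0.1:7860` is simply Gradio's default local address. A minimal sketch of the kind of launch call `app.py` presumably ends with; the `demo` name and the explicit flags are assumptions, not a quote from the app:

```python
# Sketch only: how a Gradio app typically binds to 127.0.0.1:7860.
# `demo` stands in for whatever Blocks/Interface object app.py actually builds.
import gradio as gr

with gr.Blocks(title="MoodSyncAI") as demo:
    ...  # tabs, components and callbacks go here

if __name__ == "__main__":
    # 7860 is Gradio's default port; inbrowser=True opens a browser tab automatically
    demo.launch(server_name="127.0.0.1", server_port=7860, inbrowser=True)
```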
**First launch only:** downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).

That's it: no system packages, no ffmpeg, no GPU, no model files to download manually.
## Tabs
1. **🖼️ Image + Text** → upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
2. **📹 Webcam / Video + Text** → record a 3–10 s clip in the browser → per-frame emotion **timeline chart** (sketched in code after this list), aggregated bars, fusion, summary.
3. **🎙️ Audio + Image** → record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
4. **🎬 Video with Audio** → record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, and fed to the text classifier; frames produce the visual timeline; fused result + summary, no typing needed.
5. **ℹ️ About** → architecture & fusion logic.
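
A rough illustration of how the video tabs can build that timeline: OpenCV samples every n-th frame, the ViT classifier scores each sampled frame, and the resulting (timestamp, emotion) points feed the Plotly chart. The sampling interval, function names and record layout below are assumptions, not the app's actual implementation:

```python
# Sketch: frame sampling + per-frame emotion scoring for the video timeline.
# every_n, the function names and the record layout are illustrative assumptions.
import cv2
from PIL import Image
from transformers import pipeline

face_clf = pipeline("image-classification", model="trpakov/vit-face-expression")

def sample_frames(video_path: str, every_n: int = 10):
    """Yield (timestamp_in_seconds, RGB frame) for every n-th frame of the clip."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx / fps, cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()

# One point per sampled frame: timestamp, top emotion label, its confidence.
timeline = []
for t, rgb in sample_frames("clip.mp4"):
    top = face_clf(Image.fromarray(rgb))[0]
    timeline.append({"t": t, "emotion": top["label"], "score": top["score"]})
```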
## Fusion / mismatch rule
Each modality's emotion distribution is mapped to a **valence** in `[-1, +1]` (sketched below).

- Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
- Small delta → **ALIGNED** (green 🟢)
- Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)
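
A minimal sketch of one way this rule could be implemented; the per-emotion valence weights and the `0.3` "small delta" threshold are illustrative assumptions, not the app's actual numbers:

```python
# Sketch of the valence-based fusion rule. The VALENCE weights and the 0.3 threshold
# are illustrative assumptions; both classifiers' label spellings are covered here.
VALENCE = {"joy": 1.0, "happy": 1.0, "surprise": 0.3, "neutral": 0.0,
           "sadness": -0.7, "sad": -0.7, "fear": -0.8, "disgust": -0.8,
           "anger": -1.0, "angry": -1.0}

def valence(scores: dict[str, float]) -> float:
    """Collapse an emotion probability distribution into a single value in [-1, +1]."""
    return sum(VALENCE.get(label, 0.0) * p for label, p in scores.items())

def fuse(visual_scores: dict[str, float], text_scores: dict[str, float]) -> str:
    v_vis, v_txt = valence(visual_scores), valence(text_scores)
    if v_vis * v_txt < 0:             # opposite-sign valences
        return "MISMATCH DETECTED"
    if abs(v_vis - v_txt) < 0.3:      # small delta
        return "ALIGNED"
    return "PARTIALLY ALIGNED"

# Example: a smiling face paired with a sad sentence trips the mismatch rule.
print(fuse({"happy": 0.9, "neutral": 0.1}, {"sadness": 0.8, "neutral": 0.2}))
```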
The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
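
The prompt template itself is not spelled out here, but the structured signals would be flattened into text along roughly these lines (the wording, scores and generation settings are illustrative assumptions):

```python
# Sketch: prompting FLAN-T5 with the structured fusion signals.
# The prompt wording, numbers and generation settings are illustrative assumptions.
from transformers import pipeline

summariser = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = (
    "You are an empathetic assistant. Facial emotion: happy (0.82). "
    "Text emotion: sadness (0.74). Transcript: 'I'm fine, honestly.' "
    "Fusion verdict: MISMATCH DETECTED. "
    "Write 2-3 sentences describing this person's likely emotional state."
)
summary = summariser(prompt, max_new_tokens=96, do_sample=False)[0]["generated_text"]
print(summary)
```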