X-MERLIN / README.md

ducdatit2002

	---
	language: vi
	library_name: transformers
	license: other
	pipeline_tag: text-classification
	tags:
	- speech-emotion-recognition
	- audio-classification
	- text-classification
	- multimodal
	---

	# Vietnamese Emotion Models (Text, Voice, Multimodal)

	This repository packages three Vietnamese emotion recognition models (text, voice, multimodal) for Hugging Face, each with its config, label map, metrics, and an inference snippet. Only the best checkpoint is kept for each branch.

	## Structure
	- `text-phobert-focalloss/`: PhoBERT + focal loss for text emotion classification.
	- `voice-wav2vec2-vi-emotion/`: Wav2Vec2-base-vi-250h fine-tuned for Vietnamese SER.
	- `multimodal/`: Fusion weights for audio + text (`best.pt`) with `labels.json`.

	## Setup
	```bash
	pip install transformers torch torchaudio soundfile
	```
	Voice and multimodal require 16 kHz audio; resample if your files differ.
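
Before running the voice or multimodal snippets, it can help to check whether a file already matches the expected format. The sketch below reads the WAV header with the standard-library `wave` module (PCM WAV only). The helper names `wav_info` and `needs_preprocessing` are hypothetical, and the mono requirement is an assumption based on the single-channel input used in the snippets; only the 16 kHz rate is stated by this README.

```python
import wave

TARGET_SR = 16000  # sample rate expected by the voice and multimodal models


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) from a PCM WAV header."""
    with wave.open(str(path), "rb") as f:
        sr = f.getframerate()
        channels = f.getnchannels()
        duration = f.getnframes() / sr
    return sr, channels, duration


def needs_preprocessing(path, target_sr=TARGET_SR):
    """True if the file should be resampled or downmixed before inference.

    Assumption: the models expect mono input in addition to 16 kHz audio.
    """
    sr, channels, _ = wav_info(path)
    return sr != target_sr or channels != 1
```

If `needs_preprocessing("audio.wav")` returns `True`, resample with `torchaudio.functional.resample` as in the snippets that follow.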


	## Text model (PhoBERT focal loss)
	```python
	from pathlib import Path
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	base = Path(__file__).resolve().parent # .../hf-release
	repo = base / "text-phobert-focalloss"
	tok = AutoTokenizer.from_pretrained(repo, use_fast=False)
	model = AutoModelForSequenceClassification.from_pretrained(repo)

	inputs = tok("Tôi đang rất vui và hào hứng", return_tensors="pt")
	with torch.no_grad():
	    probs = model(**inputs).logits.softmax(-1)[0]
	# transformers converts id2label keys to int when it loads the config
	pred = model.config.id2label[int(probs.argmax())]
	print(pred, float(probs.max()))
	```

	## Voice model (Wav2Vec2 SER)
	```python
	from pathlib import Path
	import torch, torchaudio
	from transformers import Wav2Vec2ForSequenceClassification, AutoProcessor

	base = Path(__file__).resolve().parent
	repo = base / "voice-wav2vec2-vi-emotion"
	processor = AutoProcessor.from_pretrained(repo)
	model = Wav2Vec2ForSequenceClassification.from_pretrained(repo)

	wav, sr = torchaudio.load("audio.wav")  # wav has shape (channels, num_samples)
	if sr != 16000:
	    wav = torchaudio.functional.resample(wav, sr, 16000)
	wav = wav.mean(dim=0)  # downmix to mono; a no-op for single-channel files
	inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
	with torch.no_grad():
	    probs = model(**inputs).logits.softmax(-1)[0]
	pred = model.config.id2label[int(probs.argmax())]
	print(pred, float(probs.max()))
	```

	## Multimodal model (audio + transcript)

	```python
	from pathlib import Path
	import sys, torch, torchaudio
	from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

	base = Path(__file__).resolve().parent # .../hf-release
	sys.path.append(str(base.parent)) # add repo root before importing from it
	from multimodal.multimodal_train_eval import FusionXMerlin

	text_repo = base / "text-phobert-focalloss"
	audio_repo = base / "voice-wav2vec2-vi-emotion"
	ckpt_path = base / "multimodal" / "best.pt"

	ckpt = torch.load(ckpt_path, map_location="cpu")
	label2id = ckpt["label2id"]
	id2label = {v: k for k, v in label2id.items()}

	tokenizer = AutoTokenizer.from_pretrained(text_repo, use_fast=False)
	processor = Wav2Vec2FeatureExtractor.from_pretrained(audio_repo)

	model = FusionXMerlin(
	    text_model_path=text_repo,
	    audio_model_path=audio_repo,
	    num_classes=len(label2id),
	    freeze_encoders=True,
	).eval()
	model.load_state_dict(ckpt["model_state"])

	transcript = "Tôi rất thất vọng về dịch vụ."
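	wav, sr = torchaudio.load("audio.wav")  # wav has shape (channels, num_samples)
	if sr != 16000:
	    wav = torchaudio.functional.resample(wav, sr, 16000)
	wav = wav.mean(dim=0)  # downmix to mono; a no-op for single-channel files

	t_inputs = tokenizer(transcript, return_tensors="pt", padding=True, truncation=True, max_length=256)
	# Request the mask explicitly: some Wav2Vec2 feature-extractor configs omit it by default.
	a_inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt", return_attention_mask=True)

	with torch.no_grad():
	    logits, _ = model(
	        t_inputs["input_ids"],
	        t_inputs["attention_mask"],
	        a_inputs["input_values"],
	        a_inputs["attention_mask"],
	    )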
	probs = torch.softmax(logits, dim=-1)[0]
	pred = id2label[int(probs.argmax())]
	print(pred, float(probs.max()))
	```

	## Extra info
	- Label set: Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise. The id-to-label mappings are stored in each model's config or `labels.json`.
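
When reading a mapping directly from JSON rather than through `model.config`, note that JSON object keys are always strings. The sketch below normalizes such a mapping to integer ids and ranks a probability vector by label. The function names `load_id2label` and `top_k` are hypothetical, and the assumption that the file holds an `id2label` or `label2id` entry is not confirmed by this README, so check the actual files before relying on it.

```python
import json


def load_id2label(json_text):
    """Build an {int: str} map from JSON text holding 'id2label' or 'label2id'.

    Assumption: the JSON uses one of these two keys, as Hugging Face configs do.
    """
    data = json.loads(json_text)
    if "id2label" in data:
        # JSON keys are strings, so convert them back to integer class ids
        return {int(k): v for k, v in data["id2label"].items()}
    if "label2id" in data:
        return {int(v): k for k, v in data["label2id"].items()}
    raise KeyError("expected an 'id2label' or 'label2id' entry")


def top_k(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:k]
    return [(id2label[i], float(p)) for i, p in ranked]
```

With any of the snippets above, `top_k(probs.tolist(), id2label)` lists the runner-up emotions alongside the top prediction.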