Text Classification
Transformers
Safetensors
Vietnamese
speech-emotion-recognition
audio-classification
multimodal
Instructions to use ducdatit2002/X-MERLIN with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ducdatit2002/X-MERLIN with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-classification", model="ducdatit2002/X-MERLIN")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("ducdatit2002/X-MERLIN", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| language: vi | |
| library_name: transformers | |
| license: other | |
| pipeline_tag: text-classification | |
| tags: | |
| - speech-emotion-recognition | |
| - audio-classification | |
| - text-classification | |
| - multimodal | |
| # Vietnamese Emotion Models (Text, Voice, Multimodal) | |
| Three Vietnamese emotion recognition models (text, voice, multimodal) packaged for Hugging Face with configs/labels/metrics and inference snippets. Only the best checkpoint is kept for each branch. | |
| ## Structure | |
| - `text-phobert-focalloss/`: PhoBERT + focal loss for text emotion classification. | |
| - `voice-wav2vec2-vi-emotion/`: Wav2Vec2-base-vi-250h fine-tuned for Vietnamese SER. | |
| - `multimodal/`: Fusion weights for audio + text (`best.pt`) with `labels.json`. | |
| ## Setup | |
| ```bash | |
| pip install transformers torch torchaudio soundfile | |
| ``` | |
| Voice and multimodal require 16 kHz audio; resample if your files differ. | |
| ## Text model (PhoBERT focal loss) | |
| ```python | |
| from pathlib import Path | |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification | |
| import torch | |
| base = Path(__file__).resolve().parent # .../hf-release | |
| repo = base / "text-phobert-focalloss" | |
| tok = AutoTokenizer.from_pretrained(repo, use_fast=False) | |
| model = AutoModelForSequenceClassification.from_pretrained(repo) | |
| inputs = tok("Tôi đang rất vui và hào hứng", return_tensors="pt") | |
| with torch.no_grad(): | |
| probs = model(**inputs).logits.softmax(-1)[0] | |
| pred = model.config.id2label[str(int(probs.argmax()))] | |
| print(pred, float(probs.max())) | |
| ``` | |
| ## Voice model (Wav2Vec2 SER) | |
| ```python | |
| from pathlib import Path | |
| import torch, torchaudio | |
| from transformers import Wav2Vec2ForSequenceClassification, AutoProcessor | |
| base = Path(__file__).resolve().parent | |
| repo = base / "voice-wav2vec2-vi-emotion" | |
| processor = AutoProcessor.from_pretrained(repo) | |
| model = Wav2Vec2ForSequenceClassification.from_pretrained(repo) | |
| wav, sr = torchaudio.load("audio.wav") | |
| if sr != 16000: | |
| wav = torchaudio.functional.resample(wav, sr, 16000) | |
| inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt") | |
| with torch.no_grad(): | |
| probs = model(**inputs).logits.softmax(-1)[0] | |
| pred = model.config.id2label[int(probs.argmax())] | |
| print(pred, float(probs.max())) | |
| ``` | |
| ## Multimodal model (audio + transcript) | |
| ```python | |
| from pathlib import Path | |
| import sys, torch, torchaudio | |
| from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor | |
| from multimodal.multimodal_train_eval import FusionXMerlin | |
| base = Path(__file__).resolve().parent # .../hf-release | |
| sys.path.append(str(base.parent)) # add repo root | |
| text_repo = base / "text-phobert-focalloss" | |
| audio_repo = base / "voice-wav2vec2-vi-emotion" | |
| ckpt_path = base / "multimodal" / "best.pt" | |
| ckpt = torch.load(ckpt_path, map_location="cpu") | |
| label2id = ckpt["label2id"] | |
| id2label = {v: k for k, v in label2id.items()} | |
| tokenizer = AutoTokenizer.from_pretrained(text_repo, use_fast=False) | |
| processor = Wav2Vec2FeatureExtractor.from_pretrained(audio_repo) | |
| model = FusionXMerlin( | |
| text_model_path=text_repo, | |
| audio_model_path=audio_repo, | |
| num_classes=len(label2id), | |
| freeze_encoders=True, | |
| ).eval() | |
| model.load_state_dict(ckpt["model_state"]) | |
| transcript = "Tôi rất thất vọng về dịch vụ." | |
| wav, sr = torchaudio.load("audio.wav") | |
| if sr != 16000: | |
| wav = torchaudio.functional.resample(wav, sr, 16000) | |
| t_inputs = tokenizer(transcript, return_tensors="pt", padding=True, truncation=True, max_length=256) | |
| a_inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt") | |
| with torch.no_grad(): | |
| logits, _ = model( | |
| t_inputs["input_ids"], | |
| t_inputs["attention_mask"], | |
| a_inputs["input_values"], | |
| a_inputs["attention_mask"], | |
| ) | |
| probs = torch.softmax(logits, dim=-1)[0] | |
| pred = id2label[int(probs.argmax())] | |
| print(pred, float(probs.max())) | |
| ``` | |
| ## Extra info | |
| - Label set: Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise (mappings inside each config/labels.json). | |