---
license: mit
datasets:
- Helsinki-NLP/open_subtitles
language:
- zh
base_model:
- hfl/chinese-macbert-base
pipeline_tag: text-classification
tags:
- agent
- nlp
- chinese
- sentiment-analysis
- emotion
- regression
- vad
- valence-arousal-dominance
- transformers
- bert
- macbert
---

<div align="center">
  <h1>vad-macbert</h1>
  <p>Chinese VAD (valence/arousal/dominance) regression on top of chinese-macbert-base.</p>
  <p>
    <a href="https://huggingface.co/Pectics/vad-macbert">
      <img alt="HF Model" src="https://img.shields.io/badge/Hugging%20Face-Model-yellow">
    </a>
    <img alt="Task" src="https://img.shields.io/badge/task-VAD%20regression-1f6feb">
    <img alt="Backbone" src="https://img.shields.io/badge/backbone-chinese--macbert--base-4b8bbe">
  </p>
</div>

The model predicts three continuous values (valence, arousal, dominance) aligned
to the VAD scale of the teacher model, `RobroKools/vad-bert`.
|
|
## Quickstart
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"  # "This movie really moved me."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
vad = outputs.logits.squeeze().tolist()  # [valence, arousal, dominance]
print("VAD:", vad)
```
|
|
## Model Details
|
|
- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3 outputs: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification` with `num_labels=3`, `problem_type="regression"`
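
The head wiring can be illustrated with a tiny, randomly initialized BERT config. The sizes below are deliberately small and are *not* the real MacBERT dimensions; the point is only that `num_labels=3` plus `problem_type="regression"` makes `transformers` emit a 3-value head and apply an MSE loss to float labels:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative config -- NOT the real MacBERT sizes; it only demonstrates
# how the 3-output regression head is configured.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
    num_labels=3, problem_type="regression",
)
model = BertForSequenceClassification(config)

input_ids = torch.randint(0, 100, (2, 8))  # 2 fake sequences of 8 token ids
labels = torch.rand(2, 3)                  # float VAD targets trigger MSELoss
out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # torch.Size([2, 3])
```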
|
|
## Data Sources & Labeling
|
|
### en-zh_cn_vad_clean.csv
- Source: OpenSubtitles EN-ZH parallel corpus.
- Labeling: English side fed into `RobroKools/vad-bert` to obtain VAD values,
  then assigned to the paired Chinese text.

### en-zh_cn_vad_long.csv
- Derived from `en-zh_cn_vad_clean.csv` by filtering for longer texts with a
  length threshold (the original threshold was not recorded).
- Inferred from the dataset statistics: the minimum text length is 32 characters,
  so the filter most likely kept samples with length >= 32 characters.
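
A minimal reconstruction of that filter might look like the following. The `>= 32` threshold is the inferred one, and the `(text, v, a, d)` row layout is an assumption for illustration:

```python
# Hypothetical reconstruction of the long-text filter. Assumes rows of
# (chinese_text, valence, arousal, dominance); the 32-character threshold
# is inferred from the dataset statistics, not taken from training code.
MIN_CHARS = 32

def filter_long(rows):
    """Keep rows whose Chinese text has at least MIN_CHARS characters."""
    return [row for row in rows if len(row[0]) >= MIN_CHARS]

rows = [
    ("短句。", 0.5, 0.4, 0.5),                         # 3 chars  -> dropped
    ("这是一条足够长的字幕文本，" * 3, 0.6, 0.5, 0.5),  # 39 chars -> kept
]
print(len(filter_long(rows)))  # 1
```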
|
|
### en-zh_cn_vad_long_clean.csv
- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise:
  - ASS/SSA tag blocks like `{\\fs..\\pos(..)}` (including broken `{` blocks)
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes like `\\N`, `\\n`, `\\h`, `\\t`
  - Extra whitespace normalization
- Non-CJK rows were dropped.
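
A sketch of what those cleaning rules could look like. The exact regexes used in training were not recorded, so the patterns below are assumptions that follow the bullet list above:

```python
import re

# Assumed cleaning patterns; the exact regexes from training are unrecorded.
ASS_TAG  = re.compile(r"\{[^}]*\}|\{[^}]*$")   # {\fs..\pos(..)} blocks, incl. broken "{" runs
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")    # <i>...</i> and similar
ESCAPES  = re.compile(r"\\[Nnht]")             # \N, \n, \h, \t
CJK      = re.compile(r"[\u4e00-\u9fff]")

def clean_subtitle(text):
    """Strip subtitle markup, normalize whitespace, drop non-CJK rows (None)."""
    text = ASS_TAG.sub("", text)
    text = HTML_TAG.sub("", text)
    text = ESCAPES.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if CJK.search(text) else None

print(clean_subtitle(r"{\fs20\pos(10,20)}<i>你好\N世界</i>"))  # 你好 世界
print(clean_subtitle("no CJK at all"))                         # None
```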
|
|
### en-zh_cn_vad_mix.csv
- Mixed dataset created for replay training:
  - 200k samples from `en-zh_cn_vad_clean.csv`
  - 200k samples from `en-zh_cn_vad_long_clean.csv`
- Shuffled after sampling
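
The mixing step amounts to sample-then-shuffle. The actual run used 200k rows per source; a tiny `n_each` is used here for illustration:

```python
import random

# Sketch of the replay-mix construction: sample n_each rows from each source,
# concatenate, and shuffle. The actual run used 200_000 per source.
random.seed(42)

def make_mix(clean_rows, long_rows, n_each):
    mix = random.sample(clean_rows, n_each) + random.sample(long_rows, n_each)
    random.shuffle(mix)
    return mix

clean_rows = [f"clean-{i}" for i in range(10)]
long_rows = [f"long-{i}" for i in range(10)]
mix = make_mix(clean_rows, long_rows, 3)
print(len(mix))  # 6
```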
|
|
## Training Summary
|
|
The final model (`vad-macbert-mix/best`) was obtained in three stages:
|
|
1. **Base training** on `en-zh_cn_vad_clean.csv`
2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resumed from stage 2)
|
|
### Final-stage Command (Replay Mix)
|
|
```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
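
The `--loss huber --huber_delta 1.0` setting corresponds to PyTorch's built-in Huber loss; a minimal sketch on `(batch, 3)` VAD predictions:

```python
import torch

# Huber loss with delta=1.0 over (batch, 3) VAD predictions, matching the
# --loss huber --huber_delta 1.0 flags above. For |error| < delta the loss
# is quadratic (0.5 * error^2), which is the regime shown here.
loss_fn = torch.nn.HuberLoss(delta=1.0)
preds = torch.tensor([[0.6, 0.5, 0.5], [0.2, 0.8, 0.4]])
targets = torch.tensor([[0.7, 0.4, 0.5], [0.1, 0.9, 0.3]])
loss = loss_fn(preds, targets)
print(round(loss.item(), 4))  # 0.0042
```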
|
|
Training environment (conda `llm`):
|
|
- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6
|
|
## Evaluation
|
|
Benchmark script: `train/vad_benchmark.py`
|
|
- Evaluation uses a fixed stride derived from `eval_ratio=0.01`
  (roughly 1 out of 100 samples).
- Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200,
  200–400, 400+
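
The sampling and bucketing can be sketched as follows. This is an assumed reconstruction for illustration; the authoritative code lives in `train/vad_benchmark.py`:

```python
# Assumed reconstruction of the eval sampling and length bucketing; the
# authoritative version lives in train/vad_benchmark.py.
EVAL_RATIO = 0.01
STRIDE = int(1 / EVAL_RATIO)  # 100: roughly 1 of every 100 samples

# (lower bound, upper bound); None = open-ended final bucket
BUCKETS = [(0, 20), (20, 40), (40, 80), (80, 120), (120, 200), (200, 400), (400, None)]

def bucket_of(text):
    """Return the label of the first bucket whose upper bound exceeds len(text)."""
    n = len(text)
    for lo, hi in BUCKETS:
        if hi is None or n < hi:
            return f"{lo}-{hi}" if hi is not None else f"{lo}+"

rows = ["字" * i for i in range(500)]
eval_rows = rows[::STRIDE]   # fixed-stride subsample
print(len(eval_rows))        # 5
print(bucket_of("字" * 30))  # 20-40
```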
|
|
### Results (vad-macbert-mix/best)
|
|
**en-zh_cn_vad_clean.csv**

- mse_mean=0.043734
- mae_mean=0.149322
- pearson_mean=0.7335

**en-zh_cn_vad_long_clean.csv**

- mse_mean=0.031895
- mae_mean=0.131320
- pearson_mean=0.7565

Notes:
- `400+` bucket Pearson is unstable due to small sample size; interpret with care.
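
The reported aggregates are per-dimension metrics averaged over the three VAD axes. A pure-Python sketch of that computation (the real benchmark script may differ in details):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vad_metrics(preds, targets):
    """preds/targets: lists of (v, a, d) tuples; metrics averaged over the 3 dims."""
    p_dims, t_dims = list(zip(*preds)), list(zip(*targets))
    mse = [sum((p - t) ** 2 for p, t in zip(ps, ts)) / len(ps)
           for ps, ts in zip(p_dims, t_dims)]
    mae = [sum(abs(p - t) for p, t in zip(ps, ts)) / len(ps)
           for ps, ts in zip(p_dims, t_dims)]
    r = [pearson(ps, ts) for ps, ts in zip(p_dims, t_dims)]
    return {"mse_mean": sum(mse) / 3,
            "mae_mean": sum(mae) / 3,
            "pearson_mean": sum(r) / 3}

preds = [(0.6, 0.5, 0.5), (0.2, 0.8, 0.4), (0.9, 0.3, 0.6)]
targets = [(0.7, 0.4, 0.5), (0.1, 0.9, 0.3), (0.8, 0.2, 0.7)]
print(vad_metrics(preds, targets))
```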

## Limitations

- Labels are derived from an English VAD teacher and transferred via parallel
  alignment, so they reflect the teacher’s bias and may not match human Chinese
  annotations.
- Subtitle corpora include translation artifacts and formatting noise; cleaned
  versions mitigate but do not fully remove this.
- Extreme-length sentences are under-represented; performance on 400+ characters
  is not reliable.

## Files in This Repo

- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`
|
|