This is the model used in the following papers:
- N. Mousavi and F. Burkhardt: The Emotional Portrayal of an Ordinary Talk, Proc. ESSV 2026
- Mousavi, Burkhardt and Schuller: Modeling Emotion in German Ordinary Speech, to be published
We used the embeddings of a transformer model that predicts emotional dimension values (audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, trained on MSP-Podcast) to train a multilayer perceptron (MLP) with layers = [1024, 64], the default learning rate (0.0001), the Adam optimizer, no dropout, and patience set to 10.
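The classifier described above can be sketched in plain PyTorch. This is a minimal illustration, not the code nkululeko runs internally: the embedding size of 1024, the ReLU activations, and the names `build_mlp`, `EMBEDDING_DIM`, and `HIDDEN_LAYERS` are assumptions made for the example, and the parameter names will not match the keys in the published .model file.

```python
import torch
from torch import nn

EMBEDDING_DIM = 1024        # assumption: size of one wav2vec2 embedding vector
HIDDEN_LAYERS = [1024, 64]  # hidden layer sizes from the description above
LABELS = ["happy", "angry", "sad", "scared", "neutral"]


def build_mlp(input_dim=EMBEDDING_DIM, hidden=HIDDEN_LAYERS, n_classes=len(LABELS)):
    """Stack a Linear+ReLU pair per hidden size, then a linear output layer."""
    layers = []
    previous = input_dim
    for size in hidden:
        layers.append(nn.Linear(previous, size))
        layers.append(nn.ReLU())  # assumption: ReLU activations, no dropout
        previous = size
    layers.append(nn.Linear(previous, n_classes))  # raw logits, one per label
    return nn.Sequential(*layers)


model = build_mlp()
# Adam with the learning rate quoted above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Forward pass on a random batch of four stand-in embeddings.
with torch.no_grad():
    logits = model(torch.randn(4, EMBEDDING_DIM))
predicted = [LABELS[i] for i in logits.argmax(dim=1).tolist()]
```

Because the weights here are randomly initialised, `predicted` is meaningless; the sketch only shows how a 1024-dimensional embedding is mapped to five class scores.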
Training was done with the nkululeko framework. The training data was the test set of Berlin Emodb and the whole of the Italian Emovo database, for classifying audio into ["happy", "angry", "sad", "scared", "neutral"]. Cross-domain evaluation on the Ravdess database, without the songs, resulted in a UAR (unweighted average recall) of .561.
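UAR is the mean of the per-class recalls, so every emotion counts equally regardless of how many samples it has. The following sketch shows the computation on made-up labels; the function name `unweighted_average_recall` and the toy data are illustrative only and are not taken from the evaluation above.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Average the recall of each class that occurs in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy data: "happy" is over-represented compared to "sad".
y_true = ["happy", "happy", "happy", "happy", "sad", "sad"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the recall is 3/4 for "happy" and 1/2 for "sad", so the UAR is 0.625 while plain accuracy is 4/6; the two diverge whenever the classes are imbalanced.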
Here is a screenshot of the cross-domain evaluation result on Ravdess:

We attach a test_model.py script to this model, so you should be able to try it yourself:
    Usage: test_model.py [OPTIONS] MODEL AUDIO

      Predict emotion from an audio file using a nkululeko MLP + audwav2vec2 model.

      MODEL  Path to the .model file (torch state dict saved by nkululeko).
      AUDIO  Path to the audio file (must be 16 kHz mono WAV).

      Example:
        uv run test_model.py my_experiment_0_011.model sample.wav
        uv run test_model.py my_experiment_0_011.model sample.wav --w2v2-root /data/audmodel/

    Options:
      --w2v2-root DIR  Directory where the w2v2 onnx model is cached or will be
                       downloaded to.  [default: ./audmodel/]
      -h, --help       Show this message and exit.
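
Since the script expects a 16 kHz mono WAV file, it can help to check a recording before passing it in. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; the helper names `is_16k_mono_wav` and `write_silence` are hypothetical and not part of test_model.py.

```python
import os
import tempfile
import wave


def is_16k_mono_wav(path):
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def write_silence(path, sample_rate, channels, seconds=1):
    """Write a silent 16-bit PCM WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * channels * seconds)


# Demo: one file in the expected format, one 44.1 kHz stereo file.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000, 1)
    write_silence(bad, 44100, 2)
    good_ok = is_16k_mono_wav(good)
    bad_ok = is_16k_mono_wav(bad)
```

A file that fails this check would need to be resampled and downmixed with an audio tool of your choice before prediction.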