arxiv:2504.16405

EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment

Published on Sep 17, 2025

Authors:

Abstract

A new benchmark named EEmo-Bench is presented for evaluating multi-modal large language models' ability to perceive and understand image-evoked emotions through diverse tasks and emotional attributes.

AI-generated summary

The furnishing of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs' empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs' ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model's proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the path for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2504.16405

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.16405 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.16405 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.16405 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.