EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation via Latent Reasoning

Yijie Zhu, Yibo Lyu, Zitong Yu*, Rui Shao*, Kaiyang Zhou, Liqiang Nie

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Current affective computing paradigms often treat emotional understanding and generation as separate tasks, yet the two inherently possess symbiotic potential for mutual enhancement. In this paper, we bridge this gap with a unified framework. The primary challenge lies in extracting precise, semantically rich representations of abstract emotions, which are crucial for both tasks. To address this, we harness Chain-of-Thought reasoning in the latent space of multimodal large language models and propose EmoSym, a unified framework built upon this foundation. Our framework proceeds in three key steps: 1) Emotional reasoning knowledge compression. To enable efficient transfer of emotional reasoning priors, we design specialized reasoning tokens that compress emotion-aware contexts from external reasoning knowledge bases into latent representations. 2) Verifiable reinforcement reasoning optimization. To ensure more reliable and consistent emotional reasoning, we develop a verifiable reinforcement learning paradigm that further refines the reasoning token with emotion-specific verifiable reward signals. After these two steps, the reasoning token simultaneously enhances emotional understanding and enriches semantic representations, benefiting the subsequent emotional generation task. 3) Reasoning-augmented generation and online feedback. We fuse the reasoning token with emotional representations and feed them into a diffusion model to generate emotion-evoking images. Additionally, to create a generation-to-understanding feedback loop, we propose an Online Emotional Memory Bank (OEMB), which progressively injects newly generated images into the training set to reinforce understanding. Extensive experiments demonstrate the superior capabilities of our framework in both emotional understanding and generation tasks.
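
As a rough illustration of the three steps described in the abstract, the sketch below shows how a learnable reasoning token might compress emotion-aware context, how an emotion-specific verifiable reward could be scored, and how an online memory bank could buffer newly generated images for the feedback loop. This is a minimal PyTorch sketch under assumed design choices; all names (ReasoningTokenCompressor, verifiable_reward, OnlineEmotionalMemoryBank) are hypothetical and not taken from the paper's implementation.

import torch
import torch.nn as nn

# Hypothetical sketch of the EmoSym pipeline outlined in the abstract.
# Module and function names are illustrative placeholders only.

class ReasoningTokenCompressor(nn.Module):
    """Step 1: compress emotion-aware reasoning context into a latent reasoning token."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # A learnable query token that absorbs emotional reasoning priors.
        self.reasoning_token = nn.Parameter(torch.randn(1, hidden_dim))
        self.compress = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, reasoning_context: torch.Tensor) -> torch.Tensor:
        # reasoning_context: (batch, seq_len, hidden_dim) tokens drawn from an
        # external emotional-reasoning knowledge base.
        query = self.reasoning_token.expand(reasoning_context.size(0), -1, -1)
        compressed, _ = self.compress(query, reasoning_context, reasoning_context)
        return compressed  # (batch, 1, hidden_dim); fused later with emotion features


def verifiable_reward(emotion_logits: torch.Tensor, gold_labels: torch.Tensor) -> torch.Tensor:
    """Step 2: an emotion-specific verifiable reward, here simple label agreement (1/0)."""
    return (emotion_logits.argmax(dim=-1) == gold_labels).float()


class OnlineEmotionalMemoryBank:
    """Step 3 feedback: buffer newly generated images to reinforce understanding."""

    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.items: list[tuple[torch.Tensor, int]] = []

    def update(self, image: torch.Tensor, emotion_label: int) -> None:
        # Progressively inject generated samples into the understanding training set.
        self.items.append((image, emotion_label))
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest sample when the bank is full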
Original language: English
Title of host publication: Proceedings of the 33rd ACM International Conference on Multimedia
Place of publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 5451–5460
Number of pages: 10
ISBN (Print): 9798400720352
DOIs
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, ACMMM25 - Dublin Royal Convention Centre, Dublin, Ireland
Duration: 27 Oct 2025 – 31 Oct 2025
https://whova.com/embedded/event/sa54pNCpHUFy1OTIEiEzceQu5kPuSm3dYlEnqAJdV4o%3D/?utc_source=ems (Conference program)
https://acmmm2025.org/ (Conference website)
https://dl.acm.org/doi/proceedings/10.1145/3746027 (Conference proceedings)

Publication series

Name: Proceedings of the ACM International Conference on Multimedia
Publisher: Association for Computing Machinery

Conference

Conference: 33rd ACM International Conference on Multimedia, ACMMM25
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 – 31/10/25

User-Defined Keywords

  • Visual emotion understanding
  • Emotional image content generation
  • Unified emotional understanding and generation framework
