Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data

James J. Deng*, Clement H.C. Leung*, Yuanxi Li

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

9 Citations (Scopus)

Abstract

Emotion recognition has been extensively studied in single modalities over the last decade. However, humans usually express their emotions through multiple modalities such as voice, facial expressions, or text. In this paper, we propose a new method to find a unified emotion representation for multimodal emotion recognition from speech audio and text. An emotion-based feature representation of speech audio is learned with an unsupervised triplet-loss objective, and a text-to-text transformer network is constructed to extract latent emotional meaning. Because training deep neural network models on huge datasets consumes resources that are often unaffordable, transfer learning provides a powerful and reusable technique for fine-tuning emotion recognition models pretrained on very large audio and text datasets, respectively. Automatic multimodal fusion of the emotion-based features from speech audio and text is performed by a new transformer. Both the accuracy and the robustness of the proposed method are evaluated, and we show that our method for multimodal fusion using transfer learning in emotion recognition achieves good results.
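
The abstract names a triplet-loss objective for the audio representation but does not give its exact form. As a minimal sketch, assuming the common hinge-style formulation with Euclidean distance, the loss for one triplet might look like the following. The function names, the margin value, and the toy 2-D embeddings are illustrative assumptions, not details taken from the paper; how anchors, positives, and negatives are sampled without labels is also not specified here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not confirmed by the source).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Hypothetical 2-D embeddings, for illustration only.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # constraint satisfied
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped, constraint violated
```

With these toy values the first triplet already satisfies the margin, so its loss is zero, while swapping the positive and negative yields a positive loss that a training step would reduce.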

Original language: English
Title of host publication: Computational Science and Its Applications – ICCSA 2021
Subtitle of host publication: 21st International Conference, Cagliari, Italy, September 13–16, 2021, Proceedings, Part III
Editors: Osvaldo Gervasi, Beniamino Murgante, Sanjay Misra, Chiara Garau, Ivan Blečić, David Taniar, Bernady O. Apduhan, Ana Maria Rocha, Eufemia Tarantino, Carmelo Maria Torre
Publisher: Springer Cham
Pages: 552-563
Number of pages: 12
Edition: 1st
ISBN (Electronic): 9783030869700
ISBN (Print): 9783030869694
DOIs
Publication status: Published - 10 Sept 2021
Event: 21st International Conference on Computational Science and Its Applications, ICCSA 2021 - Virtual, Online, Cagliari, Italy
Duration: 13 Sept 2021 – 16 Sept 2021

Publication series

Name: Lecture Notes in Computer Science
Volume: 12951
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
Name: Theoretical Computer Science and General Issues
Name: ICCSA: International Conference on Computational Science and Its Applications

Conference

Conference: 21st International Conference on Computational Science and Its Applications, ICCSA 2021
Country/Territory: Italy
City: Cagliari
Period: 13/09/21 – 16/09/21

User-Defined Keywords

  • Multimodal emotion recognition
  • Multimodal fusion
  • Transformer network
