Attention guided deep audio-face fusion for efficient speaker naming

Xin Liu*, Jiajia Geng, Haibin Ling, Yiu ming Cheung

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

18 Citations (Scopus)

Abstract

Speaker naming has recently received considerable attention for identifying the active speaking character in movie videos, and the face cue alone is generally insufficient for reliable performance due to its significant appearance variations. In this paper, we treat the speaker naming task as a group of matched audio-face pair finding problems, and present an efficient attention-guided deep audio-face fusion approach to detect the active speakers. First, we start with VGG encoding of the face images and extract Mel-Frequency Cepstral Coefficients from the audio signals. Then, two efficient audio encoding modules, namely two-layer Long Short-Term Memory (LSTM) encoding and two-dimensional convolution encoding, are presented to extract discriminative high-level audio features. Meanwhile, we train an end-to-end audio-face common attention model to derive the face attention vector, which adapts to accommodate various face variations. Further, an efficient factorized bilinear model is presented to deeply fuse the paired audio-face features, whereby a reliable joint audio-face representation can be obtained for speaker naming. Extensive experiments highlight the superiority of the proposed approach and show its highly competitive performance against state-of-the-art methods.
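As an illustrative sketch of the audio pathway described in the abstract, the snippet below computes MFCC features and encodes the frame sequence with a two-layer LSTM. It uses PyTorch and librosa; all hyperparameters (16 kHz sampling rate, 13 coefficients, hidden size 256, the file name clip.wav) are illustrative assumptions, not values taken from the paper.

```python
import librosa
import torch
import torch.nn as nn

# Load an audio clip and compute per-frame MFCCs (parameters are illustrative).
y, sr = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, T)
mfcc = torch.from_numpy(mfcc.T).float().unsqueeze(0)  # (1, T, 13)

class AudioLSTMEncoder(nn.Module):
    """Two-layer LSTM over the MFCC frame sequence; the final hidden
    state serves as the high-level audio feature."""
    def __init__(self, n_mfcc=13, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)

    def forward(self, x):            # x: (B, T, n_mfcc)
        _, (h, _) = self.lstm(x)
        return h[-1]                 # (B, hidden)

audio_feat = AudioLSTMEncoder()(mfcc)  # (1, 256)
```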
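Likewise, here is a minimal sketch of how an audio-guided face attention step and a factorized bilinear fusion could fit together, following the MFB-style factorization commonly used for bilinear models. The attention scorer, the dimensions (512-d regional VGG features, 256-d audio, k = 5 factors, 1000-d joint output), and the layer choices are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBilinearFusion(nn.Module):
    """Audio-guided attention over regional face features, followed by an
    MFB-style factorized bilinear fusion of the two modalities."""
    def __init__(self, face_dim=512, audio_dim=256, k=5, out_dim=1000):
        super().__init__()
        self.att = nn.Linear(face_dim + audio_dim, 1)  # joint attention scorer
        self.U = nn.Linear(face_dim, k * out_dim)      # face projection
        self.V = nn.Linear(audio_dim, k * out_dim)     # audio projection
        self.k, self.out_dim = k, out_dim

    def forward(self, face, audio):
        # face: (B, R, face_dim) regional VGG features; audio: (B, audio_dim)
        B, R, _ = face.shape
        a = audio.unsqueeze(1).expand(B, R, -1)
        w = F.softmax(self.att(torch.cat([face, a], -1)).squeeze(-1), dim=1)
        face_vec = (w.unsqueeze(-1) * face).sum(1)     # attended face feature

        # Factorized bilinear pooling: elementwise product of projections,
        # sum-pooled over k factors, then signed-sqrt and l2 normalization.
        z = self.U(face_vec) * self.V(audio)           # (B, k * out_dim)
        z = z.view(B, self.out_dim, self.k).sum(-1)
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
        return F.normalize(z, dim=1)                   # joint representation

fusion = AttentionBilinearFusion()
joint = fusion(torch.randn(2, 49, 512), torch.randn(2, 256))  # (2, 1000)
```

The elementwise product of the two low-rank projections, sum-pooled over k factors, approximates a full bilinear interaction at a fraction of its parameter cost, which is the general motivation for factorized bilinear fusion.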

Original language: English
Pages (from-to): 557-568
Number of pages: 12
Journal: Pattern Recognition
Volume: 88
DOIs
Publication status: Published - Apr 2019

Scopus Subject Areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

User-Defined Keywords

  • Common attention model
  • Deep audio-face fusion
  • Factorized bilinear model
  • Speaker naming
