TY - JOUR
T1 - Attention guided deep audio-face fusion for efficient speaker naming
AU - Liu, Xin
AU - Geng, Jiajia
AU - Ling, Haibin
AU - Cheung, Yiu ming
N1 - Funding Information:
This work was supported by the National Science Foundation of China (Nos. 61673185 and 61672444), National Science Foundation of Fujian Province (No. 2017J01112), Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQN-PY309), Science and Technology Project of Quanzhou (No. 2018C107R), the National Key Research and Development Plan (No. 2016YFB1001200), US National Science Foundation (Nos. 1814745, 1407156 and 1350521), SZSTI Grant (No. JCYJ20160531194006833) and the Faculty Research Grant of Hong Kong Baptist University (No. FRG2/17-18/082).
PY - 2019/4
Y1 - 2019/4
N2 - Speaker naming, i.e., identifying the active speaking character in a movie video, has recently received considerable attention, and the face cue alone is generally insufficient for reliable performance due to its significant appearance variations. In this paper, we treat the speaker naming task as a set of matched audio-face pair finding problems and present an efficient attention-guided deep audio-face fusion approach to detect the active speakers. First, we encode face images with a VGG network and extract Mel-Frequency Cepstral Coefficients from the audio signals. Then, two efficient audio encoding modules, namely a two-layer Long Short-Term Memory encoder and a two-dimensional convolution encoder, are introduced to extract discriminative high-level audio features. Meanwhile, we train an end-to-end audio-face common attention model to produce a discriminative face attention vector that adapts to various face variations. Further, an efficient factorized bilinear model is presented to deeply fuse the paired audio-face features, whereby a reliable joint audio-face representation can be obtained for speaker naming. Extensive experiments highlight the superiority of the proposed approach and show its very competitive performance against state-of-the-art methods.
AB - Speaker naming, i.e., identifying the active speaking character in a movie video, has recently received considerable attention, and the face cue alone is generally insufficient for reliable performance due to its significant appearance variations. In this paper, we treat the speaker naming task as a set of matched audio-face pair finding problems and present an efficient attention-guided deep audio-face fusion approach to detect the active speakers. First, we encode face images with a VGG network and extract Mel-Frequency Cepstral Coefficients from the audio signals. Then, two efficient audio encoding modules, namely a two-layer Long Short-Term Memory encoder and a two-dimensional convolution encoder, are introduced to extract discriminative high-level audio features. Meanwhile, we train an end-to-end audio-face common attention model to produce a discriminative face attention vector that adapts to various face variations. Further, an efficient factorized bilinear model is presented to deeply fuse the paired audio-face features, whereby a reliable joint audio-face representation can be obtained for speaker naming. Extensive experiments highlight the superiority of the proposed approach and show its very competitive performance against state-of-the-art methods.
KW - Common attention model
KW - Deep audio-face fusion
KW - Factorized bilinear model
KW - Speaker naming
UR - http://www.scopus.com/inward/record.url?scp=85058679783&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2018.12.011
DO - 10.1016/j.patcog.2018.12.011
M3 - Journal article
AN - SCOPUS:85058679783
SN - 0031-3203
VL - 88
SP - 557
EP - 568
JO - Pattern Recognition
JF - Pattern Recognition
ER -