TY - JOUR
T1 - Deep Audio-Visual Beamforming for Speaker Localization
AU - Qian, Xinyuan
AU - Zhang, Qiquan
AU - Guan, Guohui
AU - Xue, Wei
N1 - Funding information:
This work was supported by the Science and Engineering Research Council, Agency for Science, Technology and Research (A*STAR), Singapore, through the National Robotics Program under Grant 192 25 00054.
Publisher Copyright:
© 2022 IEEE.
PY - 2022/4/6
Y1 - 2022/4/6
N2 - Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades and can be extended with beamforming methods, e.g. Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of directly using GCC, SRP is derived from the DL-learnt ideal correlation functions for each pair of a microphone array. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on visual features encoded from face detections. The vision-derived auxiliary correlation function eventually contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate our superior performance over other competitive methods, especially when the speech signal is corrupted by noise.
AB - Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades and can be extended with beamforming methods, e.g. Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of directly using GCC, SRP is derived from the DL-learnt ideal correlation functions for each pair of a microphone array. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on visual features encoded from face detections. The vision-derived auxiliary correlation function eventually contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate our superior performance over other competitive methods, especially when the speech signal is corrupted by noise.
KW - Audio-visual fusion
KW - Speaker localization
KW - Variational auto-encoder
UR - http://www.scopus.com/inward/record.url?scp=85127740993&partnerID=8YFLogxK
U2 - 10.1109/LSP.2022.3165466
DO - 10.1109/LSP.2022.3165466
M3 - Journal article
SN - 1070-9908
VL - 29
SP - 1132
EP - 1136
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -