Deep Audio-Visual Beamforming for Speaker Localization

Xinyuan Qian, Qiquan Zhang*, Guohui Guan, Wei Xue

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

5 Citations (Scopus)


Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades, and it can be extended with beamforming methods, e.g., Steered Response Power (SRP), when multiple microphone pairs are available. Given the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of using GCC directly, we derive the SRP from DL-learnt ideal correlation functions for each microphone pair of the array. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on visual features encoded from face detections. The vision-derived auxiliary correlation function then contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate superior performance over other competitive methods, especially when the speech signal is corrupted by noise.
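
For readers unfamiliar with the classical baseline the abstract refers to, the following is a minimal NumPy sketch of pairwise GCC followed by SRP over an azimuth grid. It is not the authors' method: it uses plain GCC where the paper substitutes DL-learnt correlation functions, and it involves no visual input. The PHAT weighting, the far-field planar geometry, the integer-lag rounding, and the names `gcc_phat` and `srp_azimuth` are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def gcc_phat(x1, x2, max_lag):
    """Return the GCC-PHAT of x1 and x2 for lags -max_lag..max_lag.

    Index k of the result corresponds to lag (k - max_lag) in samples;
    the peak lag estimates the delay of x1 relative to x2.
    """
    n = len(x1) + len(x2)  # zero-pad so the correlation is linear, not circular
    cross = np.fft.rfft(x1, n=n) * np.conj(np.fft.rfft(x2, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of the inverse FFT output.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))


def srp_azimuth(signals, mic_xy, fs, c=343.0, n_angles=360):
    """Sum pairwise GCC values at the lags implied by each candidate azimuth.

    Assumes a far-field source in the plane of the array (illustrative only).
    Returns (azimuths in degrees, steered response power per azimuth).
    """
    mic_xy = np.asarray(mic_xy, dtype=float)
    pairs = list(combinations(range(len(signals)), 2))
    max_dist = max(np.linalg.norm(mic_xy[i] - mic_xy[j]) for i, j in pairs)
    max_lag = int(np.ceil(max_dist * fs / c))

    angles = np.deg2rad(np.arange(n_angles) * 360.0 / n_angles)
    directions = np.stack((np.cos(angles), np.sin(angles)), axis=1)

    power = np.zeros(n_angles)
    for i, j in pairs:
        cc = gcc_phat(signals[i], signals[j], max_lag)
        # A microphone closer to the source hears it earlier, hence the minus sign.
        tdoa = -(directions @ (mic_xy[i] - mic_xy[j])) * fs / c
        lags = np.rint(tdoa).astype(int)
        power += cc[lags + max_lag]
    return np.rad2deg(angles), power
```

The azimuth estimate is the grid angle with the largest summed response, e.g. `angles[np.argmax(power)]`. Rounding to integer lags limits angular resolution for small arrays; interpolating the correlation function would be one way to refine it.
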
Original language: English
Pages (from-to): 1132-1136
Number of pages: 5
Journal: IEEE Signal Processing Letters
Publication status: Published - 6 Apr 2022

Scopus Subject Areas

  • Signal Processing
  • Applied Mathematics
  • Electrical and Electronic Engineering

User-Defined Keywords

  • Audio-visual fusion
  • Speaker localization
  • Variational auto-encoder

