Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network

Kai Cheng, Xin Liu*, Yiu-ming Cheung*, Rui Wang, Xing Xu, Bineng Zhong

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

9 Citations (Scopus)


Many cognitive studies have shown that humans may 'see voices' or 'hear faces', and that such an ability could potentially be modeled by machine vision and intelligence. However, this line of research is still at an early stage. In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which learns the correspondence between voices and faces for various cross-modal matching and retrieval tasks. Within the proposed framework, we exploit a simple and efficient adversarial learning architecture to learn the cross-modal embeddings between faces and voices; it consists of two subnetworks, a generator and a discriminator. The former subnetwork is designed to adaptively discriminate the high-level semantic features of voices and faces, with the triplet loss and the multi-modal center loss used in tandem to explicitly regularize the correspondences among them. The latter subnetwork is further leveraged to bridge the semantic gap between the representations of voice and face data as far as possible, with a focus on maintaining semantic consistency. Through the joint exploitation of the two, the proposed framework pulls representations of voice-face data from the same person closer together while pushing representations of different persons apart. Extensive experiments show that the proposed approach involves fewer parameters and computations, adapts to various cross-modal matching tasks for voice-face data, and brings substantial improvements over state-of-the-art methods.
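
The two regularizers named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the hinge form of the triplet loss, the single shared center per identity, the margin value, and all function and variable names below are assumptions made for illustration only.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    The anchor (e.g. a voice embedding) should be closer to the positive
    (a face of the same person) than to the negative (a face of a
    different person) by at least `margin`.
    """
    d_pos = math.sqrt(sq_dist(anchor, positive))
    d_neg = math.sqrt(sq_dist(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def shared_center_loss(embeddings, labels, centers):
    """Center loss with one center per identity shared by both modalities
    (an assumed reading of "multi-modal center loss").

    `embeddings` may mix voice and face vectors; each is penalized by half
    its squared distance to the center of its identity label.
    """
    total = sum(0.5 * sq_dist(e, centers[lab]) for e, lab in zip(embeddings, labels))
    return total / len(embeddings)


# Hypothetical 2-D embeddings for illustration.
voice_a = [0.0, 0.0]      # voice of person "id_0"
face_same = [0.0, 1.0]    # face of person "id_0"
face_other = [3.0, 4.0]   # face of a different person
centers = {"id_0": [0.0, 0.5]}

well_separated = triplet_loss(voice_a, face_same, face_other)
violated = triplet_loss(voice_a, face_other, face_same)
center_term = shared_center_loss([voice_a, face_same], ["id_0", "id_0"], centers)
```

In this toy setup the triplet term vanishes when the matching face is already closer than the non-matching one by more than the margin, and the center term shrinks as the voice and face embeddings of one person move toward a common point. A trained system would compute both terms on network outputs and combine them with the adversarial objective, which this sketch leaves out.
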
Original language: English
Title of host publication: MM '20: Proceedings of the 28th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery (ACM)
Number of pages: 8
ISBN (Electronic): 9781450379885
Publication status: Published - 20 Oct 2020
Event: 28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States
Duration: 12 Oct 2020 → 16 Oct 2020

Publication series

Name: Proceedings of ACM International Conference on Multimedia


Conference: 28th ACM International Conference on Multimedia, MM 2020
Country/Territory: United States
City: Virtual, Online

User-Defined Keywords

  • Voice-face association
  • adversarial deep semantic matching
  • multi-modal center loss
  • cross-modal embeddings

