Detach and Enhance: Learning Disentangled Cross-modal Latent Representation for Efficient Face-Voice Association and Matching

Zhenning Yu, Xin Liu*, Yiu-ming Cheung*, Minghang Zhu, Xing Xu, Nannan Wang, Taihao Li

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding > Conference proceeding > peer-review


Many studies in cognitive science have shown that humans often rely on face-voice association in various perception tasks, and several recent data mining works have sought to emulate this ability. Nevertheless, most existing methods suffer degraded performance when semantically irrelevant interference factors are present across modalities. To address this problem, this paper presents an efficient Disentangled Cross-modal Latent Representation (DCLR) method that adaptively detaches the discriminative feature attributes and enhances face-voice association. Specifically, the proposed DCLR framework consists of a two-stage cross-modal disentangling process. The first stage employs supervised contrastive learning to pull the representations of face-voice data from the same person closer together while pushing the representations of different persons apart. The second stage freezes all parameters of the first stage and introduces a multi-layer orthogonal decoupling scheme to learn disentangled latent representations while filtering out modality-dependent irrelevant factors. In addition, a cross-modal reconstruction loss is used to narrow the semantic gap between heterogeneous feature representations. By jointly exploiting these components, the proposed framework associates face-voice data well and can benefit various cross-modal perception tasks. Extensive experiments verify the advantages of the proposed face-voice association framework and show its competitive performance.
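The two loss ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature value, the face-to-voice direction of the contrastive term, and the single-layer squared-cosine form of the orthogonality penalty are all assumptions chosen for illustration, and the paper's multi-layer scheme and reconstruction loss are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_supcon_loss(faces, voices, labels, temperature=0.1):
    """Face-to-voice supervised contrastive loss (illustrative assumption).

    faces[i] and voices[i] are embeddings of a sample with identity labels[i].
    For each face anchor, voices sharing its identity are positives and all
    other voices in the batch are negatives.
    """
    faces = [l2_normalize(f) for f in faces]
    voices = [l2_normalize(v) for v in voices]
    total = 0.0
    for i, face in enumerate(faces):
        logits = [dot(face, v) / temperature for v in voices]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        # voices[i] always shares labels[i], so this list is never empty.
        positives = [j for j, lab in enumerate(labels) if lab == labels[i]]
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(faces)


def orthogonality_penalty(shared, private):
    """Mean squared cosine between each sample's shared and private codes.

    The value is 0 when every pair is orthogonal and 1 when every pair is
    parallel. This is one common soft orthogonality constraint, used here as
    a stand-in for the decoupling scheme described in the abstract.
    """
    sims = [dot(l2_normalize(s), l2_normalize(p)) for s, p in zip(shared, private)]
    return sum(c * c for c in sims) / len(sims)
```

Under these assumptions, a batch whose face and voice embeddings agree for each identity yields a contrastive loss near zero, while swapping the voices between identities yields a large loss. The penalty vanishes when each shared code is orthogonal to its private code.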
Original language: English
Title of host publication: 2022 IEEE International Conference on Data Mining (ICDM)
Number of pages: 8
ISBN (Electronic): 9781665450997
ISBN (Print): 9781665451000
Publication status: Published - 28 Nov 2022
Event: 22nd International Conference on Data Mining, ICDM 2022 - Orlando, United States
Duration: 28 Nov 2022 to 1 Dec 2022

Publication series

Name: IEEE International Conference on Data Mining (ICDM)
ISSN (Print): 1550-4786
ISSN (Electronic): 2374-8486


Conference: 22nd International Conference on Data Mining, ICDM 2022
Country/Territory: United States

User-Defined Keywords

  • Face-voice association
  • disentangled latent representation
  • contrastive learning
  • orthogonal decoupling

