TY - GEN
T1 - Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network
T2 - 28th ACM International Conference on Multimedia, MM 2020
AU - Cheng, Kai
AU - Liu, Xin
AU - Cheung, Yiu-ming
AU - Wang, Rui
AU - Xu, Xing
AU - Zhong, Bineng
PY - 2020/10/20
Y1 - 2020/10/20
N2 - Many cognitive studies have shown that humans may 'see voices' or 'hear faces', and such an ability can potentially be replicated by machine vision and intelligence. However, this line of research is still at an early stage. In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which learns the correspondence between voices and faces for various cross-modal matching and retrieval tasks. Within the proposed framework, we exploit a simple and efficient adversarial learning architecture to learn the cross-modal embeddings between faces and voices; it consists of two subnetworks, a generator and a discriminator. The former subnetwork is designed to adaptively discriminate the high-level semantic features between voices and faces, in which triplet loss and multi-modal center loss are utilized in tandem to explicitly regularize the correspondences among them. The latter subnetwork is further leveraged to maximally bridge the semantic gap between the representations of voice and face data while maintaining semantic consistency. Through the joint exploitation of the above, the proposed framework pushes representations of voice-face data from the same person closer together while pulling representations of different persons apart. Extensive experiments show that the proposed approach involves fewer parameters and computations, adapts to various cross-modal matching tasks for voice-face data, and brings substantial improvements over state-of-the-art methods.
KW - Voice-face association
KW - adversarial deep semantic matching
KW - multi-modal center loss
KW - cross-modal embeddings
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85106936023&origin=resultslist
U2 - 10.1145/3394171.3413710
DO - 10.1145/3394171.3413710
M3 - Conference proceeding
T3 - Proceedings of ACM International Conference on Multimedia
SP - 448
EP - 455
BT - MM '20: Proceedings of the 28th ACM International Conference on Multimedia
PB - Association for Computing Machinery (ACM)
Y2 - 12 October 2020 through 16 October 2020
ER -