TY - JOUR
T1 - Adversarial Tri-Fusion Hashing Network for Imbalanced Cross-Modal Retrieval
AU - Liu, Xin
AU - Cheung, Yiu Ming
AU - Hu, Zhikai
AU - He, Yi
AU - Zhong, Bineng
N1 - ©2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. [viewed at https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9139424, LIB 2021-11-19]
Funding Information:
Manuscript received March 6, 2020; accepted June 24, 2020. Date of publication July 13, 2020; date of current version July 22, 2021. This work was supported in part by the National Science Foundation of China under Grants 61673185, 61672444 and 61972167, in part by Quanzhou City Science & Technology Program of China under Grant 2018C107R, in part by the State Key Laboratory of Integrated Services Networks of Xidian University under Grant ISN20-11, in part by Hong Kong Baptist University, Research Committee, Initiation Grant-Faculty Niche Research Areas (IG-FNRA) 2018/19 under Grant RC-FNRA-IG/18-19/SCI/03, in part by the project funded by HKBU Interdisciplinary Research Clusters Matching Scheme under Grant RC-IRCMs/18-19/SCI/01, and in part by ITF of ITC of Hong Kong SAR under Project ITS/339/18. (Corresponding author: Xin Liu.) Xin Liu is with the Department of Computer Science, Huaqiao University, Xiamen 361021, China, and with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an 710071, China, and also with the Department of Computer Science, Hong Kong Baptist University, Kowloon, Hong Kong (e-mail: [email protected]).
Publisher Copyright:
© 2020 IEEE.
PY - 2021/8
Y1 - 2021/8
N2 - Cross-modal retrieval has received increasing attention for efficient retrieval across different modalities, and hashing techniques have made significant progress recently due to their low storage cost and high query speed. However, most existing cross-modal hashing works still face the challenges of narrowing the semantic gap between different modalities and training with imbalanced multi-modal data. This article presents an efficient Adversarial Tri-Fusion Hashing Network (ATFH-N) for cross-modal retrieval, which is among the early attempts to incorporate adversarial learning for handling imbalanced multi-modal data. Specifically, a triple fusion network with a zero-padding operation is proposed to adapt to either balanced or imbalanced multi-modal training data. At the same time, an adversarial training mechanism is leveraged to maximally bridge the semantic gap between the common representations of balanced and imbalanced data. Further, a label prediction network is utilized to guide the feature learning process and promote hash code learning, while additionally embedding the manifold structure to preserve both inter-modal and intra-modal similarities. Through the joint exploitation of the above, the underlying semantic structure of multimedia data can be well preserved in Hamming space, which benefits various cross-modal retrieval tasks. Extensive experiments on three benchmark datasets show that the proposed ATFH-N method yields comparable performance in the balanced scenario and brings substantial improvements over state-of-the-art methods in imbalanced scenarios.
AB - Cross-modal retrieval has received increasing attention for efficient retrieval across different modalities, and hashing techniques have made significant progress recently due to their low storage cost and high query speed. However, most existing cross-modal hashing works still face the challenges of narrowing the semantic gap between different modalities and training with imbalanced multi-modal data. This article presents an efficient Adversarial Tri-Fusion Hashing Network (ATFH-N) for cross-modal retrieval, which is among the early attempts to incorporate adversarial learning for handling imbalanced multi-modal data. Specifically, a triple fusion network with a zero-padding operation is proposed to adapt to either balanced or imbalanced multi-modal training data. At the same time, an adversarial training mechanism is leveraged to maximally bridge the semantic gap between the common representations of balanced and imbalanced data. Further, a label prediction network is utilized to guide the feature learning process and promote hash code learning, while additionally embedding the manifold structure to preserve both inter-modal and intra-modal similarities. Through the joint exploitation of the above, the underlying semantic structure of multimedia data can be well preserved in Hamming space, which benefits various cross-modal retrieval tasks. Extensive experiments on three benchmark datasets show that the proposed ATFH-N method yields comparable performance in the balanced scenario and brings substantial improvements over state-of-the-art methods in imbalanced scenarios.
KW - Cross-modal hashing
KW - imbalanced multi-modal data
KW - adversarial tri-fusion hashing
KW - manifold structure
UR - http://www.scopus.com/inward/record.url?scp=85089295514&partnerID=8YFLogxK
U2 - 10.1109/TETCI.2020.3007143
DO - 10.1109/TETCI.2020.3007143
M3 - Journal article
AN - SCOPUS:85089295514
SN - 2471-285X
VL - 5
SP - 607
EP - 619
JO - IEEE Transactions on Emerging Topics in Computational Intelligence
JF - IEEE Transactions on Emerging Topics in Computational Intelligence
IS - 4
ER -