TY - JOUR
T1 - Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image–Text Retrieval
AU - Peng, Shu-Juan
AU - He, Yi
AU - Liu, Xin
AU - Cheung, Yiu-ming
AU - Xu, Xing
AU - Cui, Zhen
N1 - Funding information:
This work was supported in part by the Open Research Projects of Zhejiang Lab under Grant 2021KH0AB01, in part by the Fundamental Research Funds for the Central Universities of Huaqiao University under Grant ZQN-709, in part by the National Science Foundation of China under Grant 61673185 and Grant 61976049, in part by the National Science Foundation of Fujian Province under Grant 2020J01083 and Grant 2020J01084, in part by the Natural Science Foundation of Shandong Province under Grant ZR2020LZH008, in part by the National Science Foundation of China (NSFC)/Research Grants Council (RGC) Joint Research Scheme under Grant N_HKBU214/21, in part by the RGC General Research Fund under Grant 12201321, in part by the Hong Kong Baptist University under Grant RCFNRA-IG/18-19/SCI/03, and in part by the Innovation and Technology Fund of Innovation and Technology Commission of the Hong Kong Government under Grant ITS/339/18. (Corresponding author: Xin Liu.)
Publisher copyright:
© 2022 IEEE.
PY - 2024/2
Y1 - 2024/2
N2 - Fine-grained image–text retrieval has been a hot research topic for bridging vision and language, and its main challenge lies in learning the semantic correspondence across different modalities. Existing methods mainly focus on learning global semantic correspondence or intramodal relation correspondence from separate data representations, but rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model that explicitly learns the fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be well utilized to guide the feature correspondence learning process. More specifically, we first build a semantic-embedded graph for each modality to explore both the fine-grained objects and their relations, aiming not only to characterize the object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a cross-graph relation encoder is newly designed to explore the intermodal relations across different modalities, which mutually boosts the cross-modal correlations to learn more precise intermodal dependencies. Besides, a feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize the node-level semantic correspondence, whereby the relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image–text retrieval tasks with high retrieval performance. Extensive experiments on benchmark datasets quantitatively and qualitatively verify the advantages of the proposed framework for fine-grained image–text retrieval and show its competitive performance against the state of the art.
AB - Fine-grained image–text retrieval has been a hot research topic for bridging vision and language, and its main challenge lies in learning the semantic correspondence across different modalities. Existing methods mainly focus on learning global semantic correspondence or intramodal relation correspondence from separate data representations, but rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model that explicitly learns the fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be well utilized to guide the feature correspondence learning process. More specifically, we first build a semantic-embedded graph for each modality to explore both the fine-grained objects and their relations, aiming not only to characterize the object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a cross-graph relation encoder is newly designed to explore the intermodal relations across different modalities, which mutually boosts the cross-modal correlations to learn more precise intermodal dependencies. Besides, a feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize the node-level semantic correspondence, whereby the relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image–text retrieval tasks with high retrieval performance. Extensive experiments on benchmark datasets quantitatively and qualitatively verify the advantages of the proposed framework for fine-grained image–text retrieval and show its competitive performance against the state of the art.
KW - Cross-graph relation encoder
KW - fine-grained correspondence
KW - image–text retrieval
KW - intermodal relation
UR - http://www.scopus.com/inward/record.url?scp=85134327658&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2022.3188569
DO - 10.1109/TNNLS.2022.3188569
M3 - Journal article
SN - 2162-237X
VL - 35
SP - 2194
EP - 2207
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 2
ER -