TY - JOUR
T1 - Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image–Text Matching
AU - Liu, Xin
AU - He, Yi
AU - Cheung, Yiu-Ming
AU - Xu, Xing
AU - Wang, Nannan
N1 - Publisher Copyright:
© 2022 IEEE.
Funding Information:
This work was supported in part by the Open Project of Zhejiang Lab under Grant 2021KH0AB01 and Grant 2021KG0AB01; in part by the National Science Foundation of China under Grant 61673185, Grant 61672444, Grant 61976049, Grant 61922066, and Grant 61876142; in part by the NSFC/RGC Joint Research Scheme under Grant N_HKBU214/21; in part by the General Research Fund of Research Grants Council (RGC) under Grant RGC/HKBU/12201321; in part by Hong Kong Baptist University under Grant RC-FNRA-IG/18-19/SCI/03 and Grant RC-IRCMs/18-19/SCI/01; in part by the Innovation and Technology Fund of Innovation and Technology Commission of the Government of the Hong Kong SAR under Grant ITS/339/18; in part by the National Science Foundation of Fujian Province under Grant 2020J01084; and in part by the Technology Innovation Leading Program of Shaanxi under Grant 2022QFY01-15.
PY - 2023/2
Y1 - 2023/2
AB - Image–text matching of natural scenes has been a popular research topic in both the computer vision and natural language processing communities. Recently, fine-grained image–text matching has shown significant advances in inferring high-level semantic correspondence by aggregating pairwise region–word similarity, but it remains challenging, mainly due to the insufficient representation of high-order semantic concepts and their explicit connections in one modality as they are matched in another modality. To tackle this issue, we propose a relationship-enhanced semantic graph (ReSG) model, which can improve the image–text representations by learning their locally discriminative semantic concepts and then organizing their relationships in a contextual order. To be specific, two tailored graph encoders, the visual relationship-enhanced graph (VReG) and the textual relationship-enhanced graph (TReG), are respectively exploited to encode the high-level semantic concepts of corresponding instances and their semantic relationships. Meanwhile, the representation of each graph node is optimized by aggregating semantically contextual information to enhance the node-level semantic correspondence. Further, the hard-negative triplet ranking loss, center hinge loss, and positive–negative margin loss are jointly leveraged to learn the fine-grained correspondence between the ReSG representations of image and text, whereby discriminative cross-modal embeddings can be explicitly obtained to benefit various image–text matching tasks in a more interpretable way. Extensive experiments verify the advantages of the proposed fine-grained graph matching approach, which achieves state-of-the-art image–text matching results on public benchmark datasets.
KW - Contextual information
KW - high-level semantic concept
KW - image–text matching
KW - relationship-enhanced graph
UR - http://www.scopus.com/inward/record.url?scp=85133756021&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2022.3179020
DO - 10.1109/TCYB.2022.3179020
M3 - Journal article
SN - 2168-2267
VL - 54
SP - 948
EP - 961
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 2
ER -