Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image–Text Matching

Xin Liu, Yi He, Yiu-Ming Cheung*, Xing Xu, Nannan Wang

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

4 Citations (Scopus)


Image–text matching of natural scenes has been a popular research topic in both the computer vision and natural language processing communities. Recently, fine-grained image–text matching has made significant advances in inferring high-level semantic correspondence by aggregating pairwise region–word similarities, but it remains challenging, mainly due to the insufficient representation of high-order semantic concepts and their explicit connections in one modality as they are matched in the other modality. To tackle this issue, we propose a relationship-enhanced semantic graph (ReSG) model, which improves the image–text representations by learning their locally discriminative semantic concepts and then organizing the relationships among these concepts in a contextual order. Specifically, two tailored graph encoders, the visual relationship-enhanced graph (VReG) and the textual relationship-enhanced graph (TReG), are used to encode the high-level semantic concepts of the corresponding instances and their semantic relationships. Meanwhile, the representation of each graph node is optimized by aggregating semantically contextual information to enhance the node-level semantic correspondence. Further, the hard-negative triplet ranking loss, center hinge loss, and positive–negative margin loss are jointly leveraged to learn the fine-grained correspondence between the ReSG representations of image and text, whereby discriminative cross-modal embeddings can be explicitly obtained to benefit various image–text matching tasks in a more interpretable way. Extensive experiments verify the advantages of the proposed fine-grained graph matching approach, which achieves state-of-the-art image–text matching results on public benchmark datasets.
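
As an illustration of the first of the three losses named in the abstract, the following is a minimal sketch of a hard-negative triplet ranking loss in its commonly used form, where only the hardest negative in a mini-batch contributes for each matched pair. The abstract does not give the exact formulation, so the function name `hard_negative_triplet_loss`, the `margin` value, and the toy similarity matrix are all assumptions made for illustration; the center hinge loss and positive–negative margin loss are not sketched here.

```python
from typing import List


def hard_negative_triplet_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Hard-negative triplet ranking loss over a batch similarity matrix.

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the matched (positive) pairs. The margin is a hypothetical default.
    """
    n = len(sim)
    if n < 2:
        # No negatives are available in a batch of fewer than two pairs.
        return 0.0
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for image i (highest-scoring mismatched text).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (highest-scoring mismatched image).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total


# Toy 3x3 similarity matrix (made-up values, not results from the paper).
sim = [
    [0.9, 0.3, 0.1],
    [0.8, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]
loss = hard_negative_triplet_loss(sim, margin=0.2)
print(round(loss, 6))
```

In this toy batch, image 1 scores higher with text 0 (0.8) than with its own text (0.7), so that pair dominates the loss; well-separated pairs contribute nothing.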
Original language: English
Pages (from-to): 948-961
Number of pages: 14
Journal: IEEE Transactions on Cybernetics
Issue number: 2
Early online date: 20 Jun 2022
Publication status: Published - Feb 2023

Scopus Subject Areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Electrical and Electronic Engineering
  • Control and Systems Engineering
  • Computer Science Applications

User-Defined Keywords

  • Contextual information
  • high-level semantic concept
  • image–text matching
  • relationship-enhanced graph

