Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Yi He, Xin Liu*, Yiu-ming Cheung, Shu-Juan Peng, Jinhan Yi, Wentao Fan

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference contribution › peer-review

Abstract

Fine-grained image-text retrieval is a challenging but vital technology in the field of multimedia analysis. Existing methods mainly focus on learning a common embedding space for images (or patches) and sentences (or words), in which the similarity of their mapped features can be measured directly. Nevertheless, most existing image-text retrieval works rarely consider the shared semantic concepts that potentially correlate the heterogeneous modalities, which can enhance the discriminative power of the learned embedding space. To this end, we propose a Cross-Graph Attention model (CGAM) to explicitly learn the shared semantic concepts, which can be used to guide the feature learning process of each modality and promote the common embedding learning. More specifically, we build a semantic-embedded graph for each modality, and smooth the discrepancy between the two modalities via a cross-graph attention model to obtain shared semantic-enhanced features. Meanwhile, we reconstruct image and text features via the shared semantic concepts and the original embedding representations, and leverage a multi-head mechanism for similarity calculation. Accordingly, the semantic-enhanced cross-modal embedding between image and text is discriminatively obtained to benefit fine-grained retrieval. Extensive experiments on benchmark datasets show performance improvements over state-of-the-art methods.
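
The abstract describes two mechanisms that a small example may make more concrete: attention across the image and text graphs, and a multi-head similarity score. The following is a minimal sketch under stated assumptions, not the authors' CGAM implementation. It assumes scaled dot-product attention, a residual sum to form the "enhanced" features, mean pooling, and an average of per-head cosine similarities. Edges within each modality's graph are omitted, and all function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys):
    """For each query node, return an attention-weighted mix of the key nodes."""
    scale = math.sqrt(len(keys[0]))  # scaled dot-product attention (assumption)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * k[i] for w, k in zip(weights, keys))
                 for i in range(len(q))]
        attended.append(mixed)
    return attended


def enhance(nodes, other_nodes):
    """Add cross-modal context to each node (residual fusion is an assumption)."""
    context = cross_attend(nodes, other_nodes)
    return [[a + b for a, b in zip(n, c)] for n, c in zip(nodes, context)]


def mean_pool(nodes):
    """Average node features into one vector per modality."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def multi_head_similarity(u, v, heads):
    """Split both vectors into equal chunks and average per-chunk cosine."""
    size = len(u) // heads
    sims = [cosine(u[h * size:(h + 1) * size], v[h * size:(h + 1) * size])
            for h in range(heads)]
    return sum(sims) / heads


# Toy node features: 2 image regions and 3 words, each of dimension 4.
image_nodes = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.4]]
text_nodes = [[0.9, 0.1, 0.4, 0.1], [0.0, 1.0, 0.2, 0.5], [0.3, 0.3, 0.3, 0.3]]

image_vec = mean_pool(enhance(image_nodes, text_nodes))
text_vec = mean_pool(enhance(text_nodes, image_nodes))
score = multi_head_similarity(image_vec, text_vec, heads=2)
print(score)
```

Attending in both directions lets each modality borrow context from the other before the two pooled vectors are compared. Splitting the comparison into heads means different slices of the embedding contribute separate similarity terms.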
Original language: English
Title of host publication: SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery (ACM)
Pages: 1865-1869
Number of pages: 5
ISBN (Print): 9781450380379
DOIs
Publication status: Published - 11 Jul 2021
Event: 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021 - Virtual, Montreal, Canada
Duration: 11 Jul 2021 to 15 Jul 2021
https://sigir.org/sigir2021/

Conference

Conference: 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
Country/Territory: Canada
City: Montreal
Period: 11/07/21 to 15/07/21
Internet address

User-Defined Keywords

  • Image-text retrieval
  • cross-graph attention
  • shared semantic concept
  • multi-head mechanism
