LSECA: local semantic enhancement and cross aggregation for video-text retrieval

Zhiwen Wang, Donglin Zhang*, Zhikai Hu

*Corresponding author for this work

Research output: Contribution to journal, Journal article, peer-review

Abstract

Recently, video retrieval based on pre-trained models (e.g., CLIP) has achieved outstanding success. To further improve retrieval performance, most existing methods adopt a multi-grained contrastive fine-tuning scheme. For example, frame features and word features are taken as fine-grained representations, while aggregated frame features on the video side and the [CLS] token on the text side are used as global representations. However, this scheme still faces challenges. The raw output features of the pre-trained encoders contain redundant and noisy information, leading to suboptimal retrieval performance. In addition, a video usually corresponds to several text descriptions, yet the video embedding is fixed in previous works, which may also degrade retrieval performance. To address these problems, we propose a novel video-text retrieval model named Local Semantic Enhancement and Cross Aggregation (LSECA). Specifically, we design a local semantic enhancement scheme that uses the global feature for video and keyword information for text to augment the fine-grained semantic representations. Moreover, a cross aggregation module is proposed to enhance the interaction between the video and text modalities. In this way, the local semantic enhancement scheme strengthens the related representations within each modality, and the cross aggregation module makes the representations of texts and videos more uniform. Extensive experiments on three popular text-video retrieval benchmark datasets demonstrate that LSECA outperforms several state-of-the-art methods.
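
The multi-grained contrastive scheme mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the max-then-mean token matching, the equal-weight fusion of coarse and fine scores, and the temperature value are all assumptions made for illustration, and small hand-written vectors stand in for CLIP encoder outputs. The LSECA-specific modules (local semantic enhancement and cross aggregation) are not shown, since the abstract does not specify them in enough detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def mean_pool(vectors):
    """Average a list of vectors into a single global vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coarse_similarity(frames, text_cls):
    """Global level: pooled frame features vs. the text [CLS] feature."""
    return cosine(mean_pool(frames), text_cls)

def fine_similarity(frames, words):
    """Fine-grained level (assumed form): best match per token, averaged both ways."""
    w2f = sum(max(cosine(w, f) for f in frames) for w in words) / len(words)
    f2w = sum(max(cosine(f, w) for w in words) for f in frames) / len(frames)
    return 0.5 * (w2f + f2w)

def multi_grained_similarity(frames, words, text_cls):
    """Equal-weight fusion of coarse and fine scores (an assumption)."""
    return 0.5 * (coarse_similarity(frames, text_cls) + fine_similarity(frames, words))

def symmetric_contrastive_loss(sim, temperature=0.05):
    """Symmetric cross-entropy over a square matrix; sim[i][j] scores video i vs. text j,
    with matched pairs on the diagonal."""
    n = len(sim)

    def direction_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (direction_loss(sim) + direction_loss(cols))

# Toy batch: two videos (two frames each) and two captions (two words + [CLS] each).
videos = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
]
texts = [
    {"words": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]], "cls": [1.0, 0.1, 0.0]},
    {"words": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]], "cls": [0.0, 1.0, 0.1]},
]
sim_matrix = [
    [multi_grained_similarity(v, t["words"], t["cls"]) for t in texts]
    for v in videos
]
loss = symmetric_contrastive_loss(sim_matrix)
```

In a real fine-tuning setup the toy vectors would be replaced by encoder outputs and the loss would be minimised by gradient descent; the sketch only shows how coarse and fine similarities can feed a single contrastive objective.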

Original language: English
Article number: 30
Number of pages: 13
Journal: International Journal of Multimedia Information Retrieval
Volume: 13
Issue number: 3
Early online date: 22 Jul 2024
Publication status: Published - Sept 2024

Scopus Subject Areas

  • Information Systems
  • Media Technology
  • Library and Information Sciences

User-Defined Keywords

  • Cross aggregation
  • Multi-grained contrast
  • Semantic enhancement
  • Video-text retrieval
