LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval

Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma*, Qingwei Lin, Daxin Jiang*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceedingConference proceedingpeer-review

2 Citations (Scopus)

Abstract

Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from the other modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations with dual-stream encoders. However, this approach is limited by slow retrieval speeds in large-scale scenarios. To address this issue, we propose a novel sparse retrieval paradigm for ITR that exploits sparse representations in the vocabulary space for images and texts. This paradigm enables us to leverage bag-of-words models and efficient inverted indexes, significantly reducing retrieval latency. A critical gap emerges from representing continuous image data in a sparse vocabulary space. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. By using lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, we are able to construct continuous bag-of-words bottlenecks and learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two ITR benchmarks, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with 5.8× faster retrieval speed and 19.1× less index storage memory. Beyond this, LexLIP surpasses CLIP across 8 out of 10 zero-shot image classification tasks.

Original languageEnglish
Title of host publication2023 IEEE/CVF International Conference on Computer Vision (ICCV)
PublisherIEEE
Pages11172-11183
Number of pages12
ISBN (Electronic)9798350307184
ISBN (Print)9798350307191
DOIs
Publication statusPublished - Oct 2023
Event2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 20236 Oct 2023
https://iccv2023.thecvf.com/ (Conference website)
https://ieeexplore.ieee.org/xpl/conhome/10376473/proceeding (Conference proceedings)

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499
ISSN (Electronic)2380-7504

Conference

Conference2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/TerritoryFrance
CityParis
Period2/10/236/10/23
Internet address

Scopus Subject Areas

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval'. Together they form a unique fingerprint.

Cite this