Abstract
With the rapid growth of video content and the demand for similarity search, text-video retrieval has garnered significant attention. Existing methods usually extract features with unified encoders and compute pair-wise similarity. However, most methods employ a symmetric cross-entropy loss for contrastive learning, crudely reducing the distance between positives and expanding the distance between negatives. This may ignore useful information shared among negatives and lead to untrustworthy representations. Besides, by relying solely on constructed feature similarity, these methods often fail to capture the shared semantic information between negatives. To mitigate these problems, we propose a novel Hierarchical Text-to-Video Retrieval method based on Relative Similarity, named HTVR. HTVR introduces a relative similarity matrix to replace the traditional contrastive loss. Specifically, to build the relative similarity matrix, we leverage textual descriptions to construct virtual semantic label features. We then measure the relative similarity between instances based on these virtual semantic label features, providing more detailed supervision for cross-modal matching. Moreover, we perform multi-level alignment by aggregating tokens to handle fine-grained semantic concepts with diverse granularity and flexible combinations. The relative similarity is further used for intra-modal constraints, preserving shared semantic concepts. Finally, extensive experiments on three benchmark datasets (MSRVTT, MSVD, and DiDeMo) show that HTVR achieves superior performance, demonstrating the efficacy of the proposed method. The source code of this work will be available at: https://github.com/junmaZ/HTVR.
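The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea of supervising cross-modal matching with a relative similarity matrix instead of one-hot targets. Everything in it is an assumption made for illustration: the function names, the use of cosine similarity over the label features, the row-wise softmax, and both temperature values are hypothetical and are not taken from the HTVR paper or its repository.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def relative_similarity_targets(label_feats, temperature=0.1):
    """Build soft targets from text-derived label features (assumed form).

    label_feats: (B, D) array, one hypothetical "virtual semantic label"
    vector per text-video pair. Returns a (B, B) matrix whose rows sum to 1.
    """
    z = l2_normalize(label_feats)
    sim = z @ z.T / temperature  # symmetric cosine-similarity matrix
    return np.exp(log_softmax(sim, axis=1))


def soft_target_loss(text_emb, video_emb, targets, temperature=0.05):
    """Symmetric cross-entropy against soft targets.

    With targets = identity this reduces to the usual symmetric
    contrastive (one-hot) cross-entropy loss.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    logits = t @ v.T / temperature
    # Text-to-video: row i of `targets` supervises text i over all videos.
    t2v = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    # Video-to-text: the label similarity is symmetric, so the same rows
    # can supervise video i over all texts.
    v2t = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (t2v + v2t)


# Toy batch of random features, only to show the call pattern.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
video_emb = rng.normal(size=(4, 8))
label_feats = rng.normal(size=(4, 8))

soft = relative_similarity_targets(label_feats)
print("soft-target loss:", soft_target_loss(text_emb, video_emb, soft))
print("one-hot loss:    ", soft_target_loss(text_emb, video_emb, np.eye(4)))
```

In this sketch, negatives whose label features resemble the anchor receive a nonzero target weight rather than being pushed away uniformly, which is one plausible reading of the "more detailed supervision" the abstract describes.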
| Original language | English |
|---|---|
| Article number | 112145 |
| Number of pages | 9 |
| Journal | Pattern Recognition |
| Volume | 171, Part A |
| Early online date | 24 Jul 2025 |
| Publication status | Published - Mar 2026 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 9 Industry, Innovation, and Infrastructure
User-Defined Keywords
- Contrastive learning
- Cross-modal matching
- Relative similarity
- Text-video retrieval
Fingerprint
Dive into the research topics of 'HTVR: Hierarchical text-to-video retrieval based on relative similarity'. Together they form a unique fingerprint.