Abstract
Open-vocabulary object detection involves detecting and recognizing objects from an unbounded set of categories. Ideally, an open-vocabulary detector should be extendable to produce bounding boxes based on user input, whether that input is a natural language instruction or an exemplar image. This enhances flexibility and improves the user experience in human-computer interaction. In this chapter, we discuss OV-DETR, an open-vocabulary detector built upon the Detection Transformer (DETR) architecture. Once trained, OV-DETR can detect any object given its class name or an exemplar image. A primary challenge in adapting DETR for open-vocabulary detection is that the classification cost matrix cannot be computed for novel classes without access to their labeled images. We overcome this challenge by reformulating the learning objective as a binary matching task between input queries and their corresponding objects. This strategy enables the model to learn robust correspondences that generalize effectively to unseen queries at test time. During training, we condition the Transformer decoder on input embeddings derived from a pretrained vision-language model such as CLIP. This allows for matching both text and image queries. Through extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR, the first end-to-end Transformer-based open-vocabulary detector, significantly outperforms existing baselines.
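The binary matching idea in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: the function names, the cost form (negative match probability plus a weighted L1 box distance), and the weight `box_weight=5.0` are assumptions made for illustration, and an exhaustive search over assignments stands in for a proper Hungarian solver. The point it shows is that, once the decoder is conditioned on a single query embedding, the matching cost needs only a class-agnostic "matched or not" score, so no per-class classification cost is required.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l1_box_cost(pred_box, gt_box):
    # L1 distance between two boxes given as (cx, cy, w, h).
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def conditional_binary_match(match_logits, pred_boxes, gt_boxes, box_weight=5.0):
    """Assign each ground-truth box of the conditioned class to one prediction.

    match_logits[i] is a class-agnostic score: "does prediction i match the
    conditional query?". box_weight is an assumed hyperparameter.
    Returns a list of (prediction_index, ground_truth_index) pairs.
    """
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth boxes")
    # cost[i][j]: cost of assigning prediction i to ground-truth box j.
    cost = [
        [
            -sigmoid(match_logits[i]) + box_weight * l1_box_cost(pred_boxes[i], gt_boxes[j])
            for j in range(n_gt)
        ]
        for i in range(n_pred)
    ]
    # Exhaustive search over one-to-one assignments (fine for tiny examples;
    # a Hungarian solver would replace this loop in practice).
    best_pairs, best_total = [], math.inf
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best_total = total
            best_pairs = [(i, j) for j, i in enumerate(perm)]
    return best_pairs


def binary_targets(n_pred, pairs):
    # Matched predictions get label 1, all others label 0.
    targets = [0] * n_pred
    for pred_idx, _ in pairs:
        targets[pred_idx] = 1
    return targets


# Toy example: three predictions conditioned on one query, two ground-truth boxes.
match_logits = [2.0, -1.0, 0.5]
pred_boxes = [[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.1, 0.1], [0.8, 0.8, 0.3, 0.3]]
gt_boxes = [[0.8, 0.8, 0.3, 0.3], [0.5, 0.5, 0.2, 0.2]]
pairs = conditional_binary_match(match_logits, pred_boxes, gt_boxes)
targets = binary_targets(len(pred_boxes), pairs)
```

In this toy case the two predictions whose boxes coincide with the ground truth are matched and receive the binary label 1, while the remaining prediction receives 0. For realistic numbers of queries, the exhaustive loop would be replaced by a Hungarian solver such as SciPy's `linear_sum_assignment`.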
| Original language | English |
|---|---|
| Title of host publication | Large Vision-Language Models |
| Subtitle of host publication | Pre-training, Prompting, and Applications |
| Editors | Kaiyang Zhou, Ziwei Liu, Peng Gao |
| Place of Publication | Cham |
| Publisher | Springer Cham |
| Chapter | 10 |
| Pages | 229-248 |
| Number of pages | 20 |
| ISBN (Electronic) | 9783031949692 |
| ISBN (Print) | 9783031949715, 9783031949685 |
| Publication status | Published - 30 Aug 2025 |
Publication series
| Name | Advances in Computer Vision and Pattern Recognition |
|---|---|
| Volume | Part F886 |
| ISSN (Print) | 2191-6586 |
| ISSN (Electronic) | 2191-6594 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 9: Industry, Innovation, and Infrastructure
User-Defined Keywords
- Conditional matching
- Detection
- Image-text alignment
- Open-vocabulary detection
- Transformer
Fingerprint
Dive into the research topics of 'Open-Vocabulary Object Detection Based on Detection Transformers'. Together they form a unique fingerprint.