
Open-Vocabulary Object Detection Based on Detection Transformers

  • Yuhang Zang*
  • Wei Li
  • Kaiyang Zhou
  • Chen Huang
  • Chen Change Loy

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding (chapter, peer-reviewed)

Abstract

Open-vocabulary object detection involves detecting and recognizing objects from an unbounded set of categories. Ideally, an open-vocabulary detector should be extendable to produce bounding boxes based on user input, whether it is a natural language instruction or an exemplar image. This enhances flexibility and improves the user experience in human-computer interaction. In this chapter, we discuss OV-DETR, an open-vocabulary detector built upon the Detection Transformer (DETR) architecture. Once trained, OV-DETR can detect any object provided its class name or an exemplar image. A primary challenge in adapting DETR for open-vocabulary detection lies in the inability to compute the classification cost matrix for novel classes without access to their labeled images. We overcome this challenge by reformulating the learning objective as a binary matching task between input queries and their corresponding objects. This strategy enables the model to learn robust correspondences that generalize effectively to unseen queries at test time. During training, we condition the Transformer decoder on input embeddings derived from a pretrained vision-language model such as CLIP. This allows for matching both text and image queries. Through extensive experiments on LVIS and COCO datasets, we demonstrate that OV-DETR, the first end-to-end Transformer-based open-vocabulary detector, significantly outperforms existing baselines.
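The binary matching idea described in the abstract can be illustrated with a small sketch. It is not the OV-DETR implementation: the function names, the L1 box term, and the `box_weight` value are hypothetical, and a brute-force search stands in for the Hungarian algorithm normally used in DETR-style matching. The sketch assumes the decoder, already conditioned on a text or image query embedding, emits one "matches the query" logit per predicted box. Because the classification term is that single class-agnostic score, the cost matrix can be computed for a novel class without any class-specific classifier.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l1_box_distance(a, b):
    # Boxes are (cx, cy, w, h) tuples in normalized image coordinates.
    return sum(abs(p - q) for p, q in zip(a, b))


def binary_matching_cost(match_logits, pred_boxes, gt_boxes, box_weight=5.0):
    """Cost[i][j] of assigning prediction i to ground-truth box j.

    The classification term is -p(match), a class-agnostic score, so it is
    the same for every j. box_weight is a hypothetical value.
    """
    cost = []
    for logit, box in zip(match_logits, pred_boxes):
        p_match = sigmoid(logit)
        cost.append(
            [-p_match + box_weight * l1_box_distance(box, gt) for gt in gt_boxes]
        )
    return cost


def best_assignment(cost):
    """Brute-force one-to-one matching; a stand-in for the Hungarian
    algorithm that is only practical for tiny inputs."""
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    if n_pred < n_gt:
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best, best_total = (), math.inf
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    # Pairs of (prediction index, ground-truth index).
    return [(i, j) for j, i in enumerate(best)]


# Three predictions conditioned on one query (say, the text "cat") and two
# ground-truth boxes of that class. All numbers are made up for illustration.
match_logits = [2.0, -3.0, 1.5]
pred_boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
gt_boxes = [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]

cost = binary_matching_cost(match_logits, pred_boxes, gt_boxes)
assignment = best_assignment(cost)
print(assignment)
```

In this toy setup the matched predictions would be trained toward the label "match" and the unmatched one toward "no match", which is what makes the objective binary rather than tied to a fixed category list.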

Original language: English
Title of host publication: Large Vision-Language Models
Subtitle of host publication: Pre-training, Prompting, and Applications
Editors: Kaiyang Zhou, Ziwei Liu, Peng Gao
Place of Publication: Cham
Publisher: Springer Cham
Chapter: 10
Pages: 229-248
Number of pages: 20
ISBN (Electronic): 9783031949692
ISBN (Print): 9783031949715, 9783031949685
Publication status: Published - 30 Aug 2025

Publication series

Name: Advances in Computer Vision and Pattern Recognition
Volume: Part F886
ISSN (Print): 2191-6586
ISSN (Electronic): 2191-6594

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 9 - Industry, Innovation, and Infrastructure

User-Defined Keywords

  • Conditional matching
  • Detection
  • Image-text alignment
  • Open-vocabulary detection
  • Transformer
