Open-Vocabulary DETR with Conditional Matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding; Conference proceeding; peer-review

27 Citations (Scopus)

Abstract

Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector so that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR (hence the name OV-DETR) which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge in turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as binary matching between input queries (class name or exemplar image) and the corresponding objects, so that the model learns correspondences that generalize to unseen queries during testing. For training, we condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model such as CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR, the first end-to-end Transformer-based open-vocabulary detector, achieves non-trivial improvements over the current state of the art. Code is available at https://github.com/yuhangzang/OV-DETR.
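
The binary matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the box weight of 5.0, the L1-only box cost, and the toy numbers are all assumptions made for illustration, and a brute-force search stands in for the Hungarian algorithm that DETR-style detectors normally use. The scores are assumed to come from a decoder that has already been conditioned on the query embedding, so each score means "this prediction matches the query" rather than a probability over a fixed class list.

```python
from itertools import permutations


def l1_box_distance(box_a, box_b):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def binary_matching_cost(match_scores, pred_boxes, gt_boxes, box_weight=5.0):
    """Build cost[i][j] for prediction i against ground-truth box j.

    match_scores[i] is an assumed probability that prediction i matches the
    conditioning query. No per-class term appears, so the same cost applies
    to classes never seen in training. box_weight is an assumed value.
    """
    return [
        [-score + box_weight * l1_box_distance(pred, gt) for gt in gt_boxes]
        for score, pred in zip(match_scores, pred_boxes)
    ]


def min_cost_assignment(cost):
    """Brute-force one-to-one assignment of ground-truth boxes to predictions.

    A stand-in for the Hungarian algorithm; only practical for tiny inputs.
    Returns a list of (prediction_index, ground_truth_index) pairs.
    """
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best_rows, best_total = (), float("inf")
    for rows in permutations(range(n_pred), n_gt):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best_total:
            best_rows, best_total = rows, total
    return [(r, c) for c, r in enumerate(best_rows)]


# Toy data: three predictions for one query, two ground-truth boxes.
match_scores = [0.9, 0.2, 0.6]
pred_boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
gt_boxes = [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]

cost = binary_matching_cost(match_scores, pred_boxes, gt_boxes)
matches = min_cost_assignment(cost)
print(matches)  # [(2, 0), (0, 1)]
```

In this toy case predictions 2 and 0 are paired with the two ground-truth boxes, and prediction 1 is left unmatched, which would make it a negative ("not matched") target for the binary loss.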

Original language: English
Title of host publication: Computer Vision – ECCV 2022
Subtitle of host publication: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Place of Publication: Cham
Publisher: Springer
Pages: 106-122
Number of pages: 17
Edition: 1st
ISBN (Electronic): 9783031200779
ISBN (Print): 9783031200762
Publication status: Published - 5 Nov 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 – 27 Oct 2022
https://eccv2022.ecva.net/
https://link.springer.com/conference/eccv
https://link.springer.com/book/10.1007/978-3-031-19769-7

Publication series

Name: Lecture Notes in Computer Science
Volume: 13669
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
Name: ECCV: European Conference on Computer Vision

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 – 27/10/22

Scopus Subject Areas

  • Theoretical Computer Science
  • Computer Science (all)
