Foundations of Vision-Language Models: Concepts and Roadmap

  • Kaiyang Zhou*
  • Ziwei Liu
  • Peng Gao

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding > Chapter > peer-review

Abstract

Vision-language models have significantly advanced the field of artificial intelligence by bridging the gap between visual and textual understanding. These models can enable wide-ranging applications including image recognition, object detection, scene understanding, visual content generation and editing in both 2D and 3D, and visual question answering, to name a few. This chapter introduces the foundational concepts underlying these models, emphasizing their unique ability to learn multimodal representations through novel neural network architectures and large-scale data pre-training. We explore the vision-language modeling paradigm, highlight key challenges in feature alignment, scalability, and data and evaluation, and review notable progress in the field. In addition, we discuss the limitations of current approaches, from computational inefficiencies to ethical concerns, and outline potential directions for future research. This chapter serves as a roadmap for understanding the field’s core principles and its transformative potential in AI applications.

Original language: English
Title of host publication: Large Vision-Language Models
Subtitle of host publication: Pre-training, Prompting, and Applications
Editors: Kaiyang Zhou, Ziwei Liu, Peng Gao
Place of Publication: Cham
Publisher: Springer Cham
Chapter: 1
Pages: 1-19
Number of pages: 19
ISBN (Electronic): 9783031949692
ISBN (Print): 9783031949685, 9783031949715
DOIs
Publication status: Published - 30 Aug 2025

Publication series

Name: Advances in Computer Vision and Pattern Recognition
Volume: Part F886
ISSN (Print): 2191-6586
ISSN (Electronic): 2191-6594

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 9 - Industry, Innovation, and Infrastructure

User-Defined Keywords

  • Feature Alignment
  • Large-Scale Pre-training
  • Multimodal Learning
  • Vision-Language Model
