Abstract
Vision-language models have significantly advanced artificial intelligence by bridging visual and textual understanding. They enable a wide range of applications, including image recognition, object detection, scene understanding, visual content generation and editing in both 2D and 3D, and visual question answering. This chapter introduces the foundational concepts underlying these models, emphasizing their ability to learn multimodal representations through novel neural network architectures and large-scale pre-training. We explore the vision-language modeling paradigm, highlight key challenges in feature alignment, scalability, data, and evaluation, and review notable progress in the field. We also discuss the limitations of current approaches, from computational inefficiency to ethical concerns, and outline potential directions for future research. The chapter serves as a roadmap for understanding the field’s core principles and its transformative potential in AI applications.
| Original language | English |
|---|---|
| Title of host publication | Large Vision-Language Models |
| Subtitle of host publication | Pre-training, Prompting, and Applications |
| Editors | Kaiyang Zhou, Ziwei Liu, Peng Gao |
| Place of Publication | Cham |
| Publisher | Springer Cham |
| Chapter | 1 |
| Pages | 1-19 |
| Number of pages | 19 |
| ISBN (Electronic) | 9783031949692 |
| ISBN (Print) | 9783031949685, 9783031949715 |
| Publication status | Published - 30 Aug 2025 |
Publication series
| Name | Advances in Computer Vision and Pattern Recognition |
|---|---|
| Volume | Part F886 |
| ISSN (Print) | 2191-6586 |
| ISSN (Electronic) | 2191-6594 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 9 Industry, Innovation, and Infrastructure
User-Defined Keywords
- Feature Alignment
- Large-Scale Pre-training
- Multimodal Learning
- Vision-Language Model
Fingerprint
Dive into the research topics of 'Foundations of Vision-Language Models: Concepts and Roadmap'. Together they form a unique fingerprint.