A Chain-of-Thought Prompting Framework for Responsible Visual Decision-Making

Project: Research project

Project Details


As computer vision technologies become increasingly integrated into daily life, from security systems and augmented reality to autonomous vehicles and medical diagnostics, the need for transparency and explainability in machine decision-making has never been more pronounced. However, existing computer vision models often operate as black boxes and fail to produce results that humans can interpret. The ability to explain the rationale behind machine-generated decisions and to communicate seamlessly with human operators is paramount, especially in high-stakes applications such as healthcare, where diagnostic decisions affect human lives, and autonomous driving, where a small change in steering can have life-threatening consequences.

Over the last decade, various methods have been proposed to tackle explainability in AI models, with some attempting to build explainability into the model itself while others focus on post-hoc explanations. Despite these efforts, two challenges remain unaddressed. First, existing methods often fall short of delivering interpretable explanations and rely heavily on domain-specific expertise. For instance, the most widely used methods in computer vision visualise saliency maps to identify which image areas contributed the most to the decision-making process. In essence, however, saliency maps explain nothing about decision-making except where the model is looking, which does not answer why a particular image region is deemed “important” by the model. Second, most vision models make predictions in isolation from the generated explanations. This suggests that the connection between the two is too weak to justify the validity of the explanations. Ideally, an explanation should reflect a model’s reasoning process, which subsequently leads to a prediction.

To improve explainability in computer vision, this project proposes a novel chain-of-thought prompting approach, taking the first step towards using natural language as the explanatory medium and integrating natural language explanations into decision-making. Compared with visual explanations like saliency maps, natural language offers a more interpretable approach to explainability and significantly improves accessibility for non-expert users. Specifically, a new vision-language model is proposed, featuring a large language model (LLM) that can be prompted with visual data to perform chain-of-thought reasoning, with the textual output further translated into executable predictions. To this end, the project aims to develop (i) an efficient prompt learning method to turn an LLM into a multimodal agent capable of handling both image and text modalities, (ii) a comprehensive alignment strategy for aligning vision and language in the pre-trained LLM, and (iii) a robust learning policy allowing the aligned vision-language model to perform chain-of-thought reasoning and produce actionable predictions.
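
As a rough illustration of what "textual output further translated into executable predictions" could look like, the following minimal sketch parses a chain-of-thought response into a structured decision. It is not the project's method: the names `Decision`, `build_prompt`, `parse_decision`, `decide`, and `stub_model`, as well as the `Answer: <option>` output convention, are all hypothetical, and the stub simply returns fixed text in place of a real vision-language model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Decision:
    reasoning: str  # natural language explanation preceding the answer
    label: str      # prediction parsed from the final "Answer:" line


def build_prompt(question: str, labels: Sequence[str]) -> str:
    """Ask for step-by-step reasoning followed by a parseable answer line."""
    options = ", ".join(labels)
    return (
        f"Question: {question}\n"
        f"Options: {options}\n"
        "Think step by step about the image, then finish with a line "
        "of the form 'Answer: <option>'."
    )


def parse_decision(text: str, labels: Sequence[str]) -> Decision:
    """Turn free-form model output into a Decision, or raise ValueError."""
    match = re.search(
        r"^Answer:\s*(.+?)\s*$", text, flags=re.MULTILINE | re.IGNORECASE
    )
    if match is None:
        raise ValueError("no 'Answer:' line found in model output")
    answer = match.group(1).strip().lower()
    for label in labels:
        if label.lower() == answer:
            # Everything before the answer line is kept as the explanation.
            return Decision(reasoning=text[: match.start()].strip(), label=label)
    raise ValueError(f"answer {answer!r} is not one of the allowed labels")


def decide(
    image: object,
    question: str,
    labels: Sequence[str],
    model: Callable[[object, str], str],
) -> Decision:
    """Prompt the model with an image and question, then parse its output."""
    return parse_decision(model(image, build_prompt(question, labels)), labels)


def stub_model(image: object, prompt: str) -> str:
    # Hypothetical stand-in for a vision-language model; ignores its inputs.
    return (
        "The traffic light ahead is red.\n"
        "A pedestrian is on the crossing.\n"
        "Answer: brake"
    )


decision = decide(None, "What should the vehicle do?", ["brake", "accelerate"], stub_model)
```

Because the prediction is parsed from the same text that contains the reasoning, the explanation and the decision come from a single output rather than being produced separately, which is the coupling the paragraph above argues for. A real system would still need to verify that the stated reasoning actually drives the answer.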

This project will be the first in computer vision to comprehensively study the integration of LLMs into visual decision-making for better explainability. The research output will not only enhance accountability for computer vision models but also enable human users to identify and mitigate biases, errors, or unexpected behaviours arising from deploying computer vision models in real-world applications. The model weights, datasets, and source code will be publicly released to facilitate the development of explainable and responsible visual intelligence.
Status: Not started
Effective start/end date: 1/01/25 to 31/12/27

