Abstract
Building surface defect detection plays a crucial role in structural health monitoring, helping to ensure the safety and aesthetics of buildings. Recently, Visual Question Answering (VQA) has shown promise in architecture, especially for inspection automation and employee training. However, Large Multi-modal Models (LMMs) perform poorly in multi-modal building surface defect analysis because they receive insufficient pre-training on architectural knowledge and have limited defect detection accuracy. This paper therefore proposes a two-stage fine-tuning framework to improve LMMs' performance on this task. Experimental results show that the framework significantly enhances VQA performance in building surface defect analysis. Furthermore, the framework improves defect detection accuracy compared with conventional fine-tuning approaches, which leads to more accurate and reliable multi-modal analysis responses from the LMMs.
| Original language | English |
|---|---|
| Title of host publication | 2025 33rd European Signal Processing Conference (EUSIPCO) |
| Publisher | IEEE |
| Pages | 706-710 |
| Number of pages | 5 |
| ISBN (Electronic) | 9789464593624 |
| ISBN (Print) | 9798350391831 |
| Publication status | Published - Sept 2025 |
| Event | 33rd European Signal Processing Conference, EUSIPCO 2025 - Palermo, Italy; Duration: 8 Sept 2025 → 12 Sept 2025; https://ieeexplore.ieee.org/xpl/conhome/11225917/proceeding (Conference proceedings) |
Publication series
| Name | European Signal Processing Conference (EUSIPCO) |
|---|---|
Conference
| Conference | 33rd European Signal Processing Conference, EUSIPCO 2025 |
|---|---|
| Country/Territory | Italy |
| City | Palermo |
| Period | 8/09/25 → 12/09/25 |
User-Defined Keywords
- Computer Vision
- Large Multi-Modal Model
- Fine-Tuning
- Prompt Engineering
- Defect Detection