TY - JOUR
T1 - Transformer-driven feature fusion network and visual feature coding for multi-label image classification
AU - Liu, Pingzhu
AU - Qian, Wenbin
AU - Huang, Jintao
AU - Tu, Yanqiang
AU - Cheung, Yiu-Ming
N1 - This work is supported by the National Key Research and Development Program of China (No. 2024YFF1307305), National Natural Science Foundation of China (No. 62366019 and No. 61966016), and Jiangxi Provincial Natural Science Foundation, China (No. 20242BAB23014 and No. 20224BAB202020).
PY - 2025/3/17
Y1 - 2025/3/17
N2 - Multi-label image classification (MLIC) has attracted extensive research attention in recent years. Nevertheless, most existing methods struggle to fuse multi-scale features effectively and to focus on critical visual information, making it difficult to recognize objects in images. In addition, recent studies have used graph convolutional networks and attention mechanisms to model label dependencies and thereby improve performance. However, these methods often rely on manually predefined label structures, which limits their flexibility and generality, and they fail to capture intrinsic object correlations and spatial context within images. To address these challenges, we propose a novel Feature Fusion network combined with a Transformer (FFTran) to fuse different visual features. First, to address the difficulty current methods have in recognizing small objects, we propose a Multi-level Scale Information Integration Mechanism (MSIIM) that fuses feature maps from different levels of the backbone network. Second, we develop an Intra-Image Spatial-Channel Semantic Mining (ISCM) module to learn important spatial and channel information. Third, we design a Visual Feature Coding based on Transformer (VFCT) module that enhances contextual information by pooling different visual features. Compared with the baseline model, FFTran improves mean Average Precision (mAP) by 2.9% on VOC2007 and 5.1% on COCO2014, demonstrating its superior performance in multi-label image classification tasks.
KW - Attention mechanism
KW - Feature fusion network
KW - Multi-label image classification
KW - Multi-scale features
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=105000242808&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2025.111584
DO - 10.1016/j.patcog.2025.111584
M3 - Journal article
SN - 0031-3203
VL - 164
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111584
ER -