Transformer-driven feature fusion network and visual feature coding for multi-label image classification

Pingzhu Liu, Wenbin Qian*, Jintao Huang, Yanqiang Tu, Yiu-Ming Cheung

*Corresponding author for this work

Research output: Contribution to journal, Journal article, peer-review

Abstract

Multi-label image classification (MLIC) has attracted extensive research attention in recent years. Nevertheless, most existing methods struggle to fuse multi-scale features effectively and to focus on critical visual information, which makes it difficult to recognize objects in images. Recent studies have also used graph convolutional networks and attention mechanisms to model label dependencies and thereby improve model performance. However, these methods often rely on manually predefined label structures, which limits their flexibility and generality, and they fail to capture the intrinsic object correlations and spatial context within images. To address these challenges, we propose a novel Feature Fusion network combined with Transformer (FFTran) to fuse different visual features. First, to address the difficulty current methods have in recognizing small objects, we propose a Multi-level Scale Information Integration Mechanism (MSIIM) that fuses different feature maps from the backbone network. Second, we develop an Intra-Image Spatial-Channel Semantic Mining (ISCM) module to learn important spatial and channel information. Third, we design a Visual Feature Coding based on Transformer (VFCT) module that enhances contextual information by pooling different visual features. Compared with the baseline model, FFTran achieves a significant improvement in mean Average Precision (mAP) on both the VOC2007 and COCO2014 datasets, of 2.9% and 5.1% respectively, highlighting its superior performance in multi-label image classification tasks.
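For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Average Precision can be computed for multi-label predictions. It is an illustration only, not the authors' evaluation code: the function names and the toy data are hypothetical, and it uses the non-interpolated form of Average Precision, whereas benchmark protocols (for example the VOC2007 11-point interpolation) may differ in detail.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted confidence per image; labels: 1 if the class is present, else 0.
    AP is the mean of the precision values measured at the rank of each positive image.
    """
    # Rank images from highest to lowest confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive images contributes 0 here (an assumption of this sketch).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP: the per-class AP averaged over all classes.

    Both arguments are lists of rows, one row per image and one column per class.
    """
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / num_classes


# Toy example (hypothetical data): 3 images, 2 classes.
toy_scores = [[0.9, 0.1], [0.8, 0.7], [0.3, 0.6]]
toy_labels = [[1, 0], [0, 1], [1, 1]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

In the toy example, the first class has positives at ranks 1 and 3 and the second class has positives at ranks 1 and 2, so the per-class AP values are 5/6 and 1, and the mAP is their average.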
Original language: English
Article number: 111584
Number of pages: 13
Journal: Pattern Recognition
Volume: 164
Early online date: 17 Mar 2025
Publication status: E-pub ahead of print - 17 Mar 2025

User-Defined Keywords

  • Attention mechanism
  • Feature fusion network
  • Multi-label image classification
  • Multi-scale features
  • Transformer
