Spatially adaptive pyramid feature fusion for scale-aware crowd counting

Shenjian Gong, Zhaoliang Yao, Wangmeng Zuo, Jian Yang, Pong Chi Yuen, Shanshan Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Crowd counting, the task of predicting the number of people in crowded scenes, is challenging. The major difficulty comes from density and scale variation within a single image, which significantly degrades the performance of most methods. In this work, we explore how to address the scale variation problem with assistance from the segmentation task. We first simultaneously predict a coarse density map and multi-scale segmentation masks, which provide strong priors on background and on head regions of different scales. We then propose a Spatially Adaptive Pyramid (SAP) feature fusion block to fully integrate the coarse density map and the multi-scale segmentation masks. In SAP, a feature pyramid and a corresponding attention pyramid are obtained for multi-scale head regions, and cross-scale correlations are then computed to handle scale variation. These high-dimensional cross-scale correlation features are used to refine the density map, from which the final counts are obtained. Experiments show that the proposed SAP block significantly outperforms previous fusion methods and yields consistent improvements across various backbones. Compared with previous state-of-the-art methods, our method handles scale variation better and surpasses them on the challenging ShanghaiTech A, UCF-QNRF and NWPU datasets.
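
The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the SAP block from the paper: the abstract does not specify the architecture, so every function name here (`count_from_density`, `scale_attention`, `refine_density`) and the fusion rule itself are assumptions made only for illustration. The sketch relies on one standard convention in density-based crowd counting, namely that the count is the sum of the density map, and it shows one simple way that per-pixel (spatially adaptive) weights over scale-specific masks could suppress background responses in a coarse map.

```python
import math


def count_from_density(density):
    """Return the crowd count as the sum of all density-map values."""
    return sum(sum(row) for row in density)


def scale_attention(masks):
    """Per-pixel softmax across scale masks (hypothetical attention rule).

    masks[s][y][x] is the soft mask value for scale s at pixel (y, x).
    The returned weights sum to 1 over s at every pixel.
    """
    num_scales = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    attention = [[[0.0] * width for _ in range(height)] for _ in range(num_scales)]
    for y in range(height):
        for x in range(width):
            exps = [math.exp(masks[s][y][x]) for s in range(num_scales)]
            total = sum(exps)
            for s in range(num_scales):
                attention[s][y][x] = exps[s] / total
    return attention


def refine_density(coarse, masks):
    """Gate the coarse density by each scale mask, then blend per pixel.

    Assumed toy rule: refined = sum_s attention_s * mask_s * coarse.
    """
    attention = scale_attention(masks)
    height, width = len(coarse), len(coarse[0])
    refined = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            refined[y][x] = sum(
                attention[s][y][x] * masks[s][y][x] * coarse[y][x]
                for s in range(len(masks))
            )
    return refined


# Toy 2x2 example: the bottom-right pixel is background in both masks.
coarse = [[0.5, 0.5], [1.0, 0.2]]
masks = [
    [[1.0, 1.0], [1.0, 0.0]],  # hypothetical small-head mask
    [[1.0, 1.0], [1.0, 0.0]],  # hypothetical large-head mask
]
refined = refine_density(coarse, masks)
print("coarse count:", count_from_density(coarse))
print("refined count:", count_from_density(refined))
```

In this toy setting the spurious response at the background pixel is removed, so the refined count is lower than the coarse count. The method described in the abstract instead operates on learned high-dimensional features and cross-scale correlations, which this sketch does not attempt to reproduce.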

Original language: English
Article number: 111832
Number of pages: 9
Journal: Pattern Recognition
Volume: 168
Early online date: 22 May 2025
Publication status: E-pub ahead of print - 22 May 2025

User-Defined Keywords

  • Coarse-to-fine
  • Crowd counting
  • Dense scene analysis
  • Multi-scale segmentation
  • Pyramid features
