Abstract
Crowd counting is the challenging task of predicting the number of people in crowded scenes. The major challenge comes from density and scale variation within a single image, which significantly degrades the performance of most methods. In this work, we explore how to solve the scale variation problem with assistance from the segmentation task. We first simultaneously predict a coarse density map and multi-scale segmentation masks, which provide strong priors on the background and on head regions at different scales. We then propose a Spatially Adaptive Pyramid (SAP) feature fusion block to fully integrate the coarse density map and the multi-scale segmentation masks. In SAP, a feature pyramid and a corresponding attention pyramid are obtained for multi-scale head regions, and cross-scale correlations are then computed to handle the scale variation problem. These high-dimensional cross-scale correlation features are used to refine the density map, from which the final count is obtained. Experiments show that the proposed SAP block significantly outperforms previous fusion methods and yields consistent improvements across various backbones. Compared with previous state-of-the-art methods, our method handles scale variation better, surpassing them on the challenging ShanghaiTech A, UCF-QNRF and NWPU datasets.
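The coarse-to-fine idea can be illustrated with a toy sketch. It is not the SAP block: the abstract does not specify the architecture, so the function names (`count_from_density`, `refine_with_masks`) and the max-over-scales weighting are assumptions made purely for illustration. The sketch relies on the standard density-map convention that the count is the sum of the map, and shows how per-scale head masks could act as priors that suppress background density.

```python
from typing import List

Grid = List[List[float]]


def count_from_density(density: Grid) -> float:
    """Standard density-map convention: the count is the sum of the map."""
    return sum(sum(row) for row in density)


def refine_with_masks(coarse: Grid, masks: List[Grid]) -> Grid:
    """Toy stand-in for mask-guided refinement (NOT the paper's SAP block).

    Each mask holds per-pixel head-region scores in [0, 1] for one scale.
    A pixel keeps its density in proportion to the strongest mask response
    across scales, so pixels that every mask marks as background go to zero.
    """
    height, width = len(coarse), len(coarse[0])
    refined = []
    for y in range(height):
        row = []
        for x in range(width):
            attention = max(mask[y][x] for mask in masks)
            row.append(coarse[y][x] * attention)
        refined.append(row)
    return refined


# Hypothetical 2x2 inputs: one mask for small heads, one for large heads.
coarse = [[0.5, 0.5], [0.25, 0.75]]
small_heads = [[1.0, 0.0], [0.0, 0.0]]
large_heads = [[0.0, 0.0], [0.0, 1.0]]

refined = refine_with_masks(coarse, [small_heads, large_heads])
print(count_from_density(coarse))   # 2.0
print(count_from_density(refined))  # 1.25
```

In this toy case the two pixels that neither mask marks as a head region contribute nothing after refinement, so the estimated count drops from 2.0 to 1.25. A learned fusion block would replace the hard maximum with attention computed from features.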
Original language | English |
---|---|
Article number | 111832 |
Number of pages | 9 |
Journal | Pattern Recognition |
Volume | 168 |
Early online date | 22 May 2025 |
DOIs | |
Publication status | E-pub ahead of print - 22 May 2025 |
User-Defined Keywords
- Coarse-to-fine
- Crowd counting
- Dense scene analysis
- Multi-scale segmentation
- Pyramid features