Abstract
Spatial and temporal attention play an important role in video classification tasks. However, few studies examine the mechanism of spatial and temporal attention behind classification problems. The Transformer, thanks to its self-attention mechanism, scales well in training and captures long-range dependencies among sequences, and it has achieved great success in many fields, especially video classification. In this work, spatio-temporal attention is separated into a temporal attention module and a spatial attention module through Divided-Space-Time Attention, which makes it more convenient to configure the attention modules and to adjust how they interact. Single-stream and two-stream models are then designed to study how spatial attention and temporal attention exchange information, supported by a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure deserves consideration in some cases, as it can achieve better results than the popular single-stream structure.
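The following is a minimal sketch of how divided space-time attention factorizes a joint spatio-temporal attention layer into a temporal step (each spatial location attends across frames) followed by a spatial step (each frame's patches attend to one another). Names such as `DividedSpaceTimeBlock`, `num_frames`, and `num_patches` are illustrative assumptions, not the authors' implementation, and the class token and MLP sub-layer of a full Transformer block are omitted for brevity.

```python
# Hypothetical sketch of divided space-time attention (not the paper's code).
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=192, num_heads=3, num_frames=8, num_patches=196):
        super().__init__()
        self.num_frames = num_frames      # T: frames per clip
        self.num_patches = num_patches    # N: patches per frame
        self.norm_t = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T*N, dim) -- patch tokens for a whole clip.
        B, _, D = x.shape
        T, N = self.num_frames, self.num_patches

        # Temporal attention: each spatial location attends across the T frames.
        xt = x.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)

        # Spatial attention: the N patches within each frame attend to one another.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(B, T * N, D)


if __name__ == "__main__":
    block = DividedSpaceTimeBlock()
    tokens = torch.randn(2, 8 * 196, 192)   # 2 clips, 8 frames, 14x14 patches
    print(block(tokens).shape)               # torch.Size([2, 1568, 192])
```

Because the two attention modules are separate, their order, weighting, or arrangement into single-stream versus two-stream configurations can be varied independently, which is the kind of adjustment the abstract refers to.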
| Original language | English |
| --- | --- |
| Pages (from-to) | 23039-23048 |
| Number of pages | 10 |
| Journal | Applied Intelligence |
| Volume | 53 |
| Issue number | 20 |
| Early online date | 5 Jul 2023 |
| DOIs | |
| Publication status | Published - Oct 2023 |
Scopus Subject Areas
- Artificial Intelligence
User-Defined Keywords
- Spatial attention
- Temporal attention
- Transformer
- Video classification