Abstract
Multimodal intent recognition is vital for understanding human interactions across diverse real-world scenarios, leveraging multiple modalities to discern user intents. However, current approaches predominantly prioritize text and overlook the rich semantic cues in the video modality, such as body language and facial expressions, which are closely linked to intent and can be readily integrated with both textual and audio semantics. This paper introduces ViSK, a Visual Semantic Knowledge discovery model built on adaptive perturbation and gradient-based quantification. First, we employ the Video Swin Transformer as the backbone to extract feature maps from video frames in the form of spatiotemporal blocks, which we treat as the elementary units of visual semantics. Then, an adaptive perturbation module generates instance-aware noise by transforming the input features into perturbation parameters through stacked convolution layers; after scaling, the noise is applied to obtain perturbed features. Finally, on the basis of these high-quality perturbed features, we introduce a quantification mechanism that estimates gradients from the DNN's Lipschitz condition to evaluate the contribution of each spatiotemporal block to intent recognition. The original features are then weighted by the quantification scores to yield visual semantic knowledge features for video-based intent recognition. To assess how the discovered visual semantics enhance multimodal fusion, we also construct a multimodal intent recognition framework that refines and leverages the semantic features. Extensive experiments on two challenging datasets demonstrate that our approach significantly outperforms state-of-the-art methods. Visualization results provide deeper insight into the discovered semantics and serve as pioneering work for interpretability research in multimodal intent recognition.
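The quantification step described in the abstract (perturb features, estimate gradients, reweight each spatiotemporal block by its contribution) can be illustrated with a minimal sketch. The code below is not the paper's method: it substitutes a simple finite-difference gradient estimate for the Lipschitz-based estimation, and `model`, `quantify_blocks`, and the feature shapes are hypothetical placeholders.

```python
import numpy as np

def quantify_blocks(features, model, eps=1e-3):
    """Score each spatiotemporal block by a finite-difference estimate
    of the model output's sensitivity to that block (an illustrative
    stand-in for the paper's Lipschitz-based gradient quantification).

    features: array of shape (num_blocks, feat_dim)
    model:    callable mapping a feature array to a scalar score
    """
    base = model(features)
    scores = np.zeros(features.shape[0])
    for i in range(features.shape[0]):
        perturbed = features.copy()
        perturbed[i] += eps  # small perturbation of one block's features
        # |f(x + eps) - f(x)| / eps approximates the gradient magnitude
        scores[i] = abs(model(perturbed) - base) / eps
    # normalize scores into weights, then reweight the original blocks
    weights = scores / (scores.sum() + 1e-8)
    return weights[:, None] * features, weights
```

In this toy form, blocks to which the model is more sensitive receive larger weights, so the weighted features emphasize the spatiotemporal blocks that matter most for the downstream intent prediction, mirroring the reweighting described in the abstract.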
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 213151-213166 |
| Number of pages | 16 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| Publication status | Published - 10 Dec 2025 |
User-Defined Keywords
- Gradient-based Quantification
- Multimodal Intent Recognition
- Video-based Intent Recognition
- Visual Semantics Discovery
Fingerprint
Dive into the research topics of 'Visual Semantic Knowledge Discovery for Multimodal Intent Recognition'.