Exploring temporal consistency for human pose estimation in videos

  • Yang Li
  • Kan Li*
  • Xinxin Wang
  • Richard Yi Da Xu

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

15 Citations (Scopus)

Abstract

In this paper, we introduce a method that explores temporal information to estimate human poses in videos. Current state-of-the-art methods that use temporal information fall into two major categories. The first comprises model-based methods, which capture temporal information entirely through a learnable function such as an RNN or 3D convolution. However, these methods are limited in their ability to explore temporal consistency, which is essential for estimating human joint positions in videos. The second comprises posterior enhancement methods, in which an independent post-processing step (e.g., using optical flow) is applied to enhance the prediction. However, operations such as optical flow estimation are susceptible to occlusion and motion blur, which adversely affect the final performance. We propose a novel Temporal Consistency Exploration (TCE) module to address both shortcomings. Compared with previous approaches, the TCE module is more efficient because it captures temporal consistency at the feature level, without post-processing or computing extra optical flow. Further, to capture the rich spatial context in video data, we design a multi-scale TCE that explores temporal consistency at multiple spatial scales. Finally, we design a video-based pose estimation network that is based on the encoder-decoder architecture and extended with the multi-scale TCE module. We comprehensively evaluate the proposed model on two video datasets, Sub-JHMDB and Penn, and it achieves state-of-the-art performance on both.

Original language: English
Article number: 107258
Number of pages: 13
Journal: Pattern Recognition
Volume: 103
Publication status: Published - Jul 2020

User-Defined Keywords

  • Convolution neural network
  • Temporal information
  • Video-based pose estimation
