Human action recognition is an active topic in computer vision research and a key enabler of human-machine interaction, and many intelligent applications benefit from it. Traditional traffic police gesture recognition methods often ignore spatial and temporal information, which limits their timeliness in human-computer interaction. We propose a Spatio-Temporal Convolutional Neural Network (ST-CNN) that detects and recognizes traffic police gestures by exploiting the correlation between spatial and temporal information. Specifically, a convolutional neural network extracts features that account for both the spatial and the temporal characteristics of human actions. An improved LSTM network then fuses and classifies the extracted features to recognize the action. By making full use of the spatial and temporal information in the video and selecting only effective features, the method reduces the computational load of the network. Extensive experiments on the Chinese traffic police gesture dataset show that our method achieves superior performance.
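To make the data flow concrete, the pipeline can be illustrated with a minimal, untrained sketch: convolutional features are computed per frame (spatial) and per frame difference (temporal), then fused over time by a recurrent network and mapped to gesture probabilities. Everything in it is an assumption for illustration rather than the proposed architecture: the function names, the random weights, the two 3x3 kernels, the hidden size, the use of frame differences as the temporal cue, and the choice of 8 gesture classes are all hypothetical, and a standard LSTM cell stands in for the improved LSTM, whose details are not given here.

```python
import math
import random


def conv2d_valid(img, kernel):
    """Plain 2D 'valid' cross-correlation of a grayscale image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def pooled_response(img, kernel):
    """ReLU followed by global average pooling: one scalar per kernel."""
    fmap = conv2d_valid(img, kernel)
    vals = [max(0.0, v) for row in fmap for v in row]
    return sum(vals) / len(vals)


def frame_features(frames, kernels):
    """Per time step: spatial features from the frame itself, temporal
    features from its difference to the previous frame (an assumed cue)."""
    feats = []
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame
        diff = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(frame, prev)]
        spatial = [pooled_response(frame, k) for k in kernels]
        temporal = [pooled_response(diff, k) for k in kernels]
        feats.append(spatial + temporal)
    return feats


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h, c, params):
    """One step of a standard LSTM cell (not the improved variant)."""
    z = x + h  # concatenate the input with the previous hidden state
    gate = {}
    for name in ("i", "f", "o", "g"):
        W, b = params[name]
        pre = [p + q for p, q in zip(matvec(W, z), b)]
        act = math.tanh if name == "g" else sigmoid
        gate[name] = [act(p) for p in pre]
    c_new = [f * cp + i * g
             for f, cp, i, g in zip(gate["f"], c, gate["i"], gate["g"])]
    h_new = [o * math.tanh(cn) for o, cn in zip(gate["o"], c_new)]
    return h_new, c_new


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def classify_clip(frames, num_classes=8, hidden=4, seed=0):
    """Random (untrained) weights: this shows the data flow only."""
    rng = random.Random(seed)
    kernels = [rand_matrix(rng, 3, 3, 1.0) for _ in range(2)]
    feats = frame_features(frames, kernels)
    in_dim = len(feats[0])
    params = {n: (rand_matrix(rng, hidden, in_dim + hidden), [0.0] * hidden)
              for n in "ifog"}
    W_out = rand_matrix(rng, num_classes, hidden)
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in feats:  # fuse the per-frame features over time
        h, c = lstm_step(x, h, c, params)
    return softmax(matvec(W_out, h))


# Toy clip: 5 grayscale frames of 8x8 random pixels.
_rng = random.Random(1)
clip = [[[_rng.random() for _ in range(8)] for _ in range(8)] for _ in range(5)]
probs = classify_clip(clip)  # one probability per hypothetical gesture class
```

Because the weights are random, the resulting probabilities carry no meaning. The sketch only shows how pooling each feature map to a single scalar keeps the per-frame feature vector small, which is one simple way to limit the computation passed on to the recurrent stage.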