Abstract:
Common convolutional neural networks have a single spatial and temporal receptive-field scale, which makes it difficult to extract the spatio-temporal information in video. To address this, we use the (2+1)D model to partially decouple temporal and spatial information, and propose a convolutional residual neural network with (2+1)D multi-scale spatio-temporal information fusion for human behavior recognition. The model is dominated by a 3×3 spatial receptive field, supplemented by a 1×1 spatial receptive field, and combines these with three different temporal receptive fields to construct spatio-temporal receptive fields of six different scales. The proposed multi-temporal-receptive-field fusion model can simultaneously acquire spatio-temporal information at different scales and extract richer human behavior features, so it can more effectively recognize human behaviors of different durations and motion amplitudes. In addition, we propose a video temporal-extension method that augments the video data set in both spatial information and time series, enriching the training samples. On the public human behavior data sets UCF101 and HMDB51, the proposed method achieves sub-video recognition rates that exceed or approach those of the latest video behavior recognition methods.
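The six receptive-field scales and the (2+1)D factorization can be sketched in a few lines. This is an illustrative sketch only: the abstract does not state the three temporal kernel sizes, so {1, 3, 5} is assumed here, as is the intermediate channel width of the factorized block.

```python
# Sketch of the receptive-field combinations and the (2+1)D factorization
# described in the abstract. Temporal kernel sizes {1, 3, 5} and the
# intermediate width `mid` are assumptions, not values from the paper.

SPATIAL_KERNELS = [3, 1]      # dominant 3x3 plus auxiliary 1x1
TEMPORAL_KERNELS = [1, 3, 5]  # three assumed time-domain scales

def branch_scales(spatial, temporal):
    """Enumerate every (t, h, w) receptive-field scale of the fused block."""
    return [(t, s, s) for s in spatial for t in temporal]

def params_3d(cin, cout, t, d):
    """Parameter count of a full 3D convolution with a t x d x d kernel."""
    return cin * cout * t * d * d

def params_2plus1d(cin, cout, t, d, mid):
    """Parameter count after the (2+1)D factorization: a 1 x d x d
    spatial convolution into `mid` channels, followed by a t x 1 x 1
    temporal convolution into `cout` channels."""
    return cin * mid * d * d + mid * cout * t

scales = branch_scales(SPATIAL_KERNELS, TEMPORAL_KERNELS)
print(f"{len(scales)} spatio-temporal scales: {scales}")
print("full 3D  3x3x3 conv, 64->64 ch:", params_3d(64, 64, 3, 3))
print("(2+1)D factorized, mid=144:    ", params_2plus1d(64, 64, 3, 3, 144))
```

In the usual (2+1)D formulation, `mid` is chosen so that the factorized block has roughly the same parameter budget as the full 3D convolution while inserting an extra nonlinearity between the spatial and temporal stages.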