(2+1)D Multi-spatio-temporal Information Fusion Model and Its Application in Behaviour Recognition

  • Abstract: Conventional convolutional neural networks have a single spatio-temporal receptive-field scale, which makes it difficult to extract the varied spatio-temporal information in video. Exploiting the ability of the (2+1)D model to decouple temporal and spatial information to some extent, we propose a (2+1)D multi-spatio-temporal information fusion convolutional residual network and apply it to human action recognition. The model takes 3×3 spatial receptive fields as the main component, supplemented by 1×1 spatial receptive fields, and crosses them with three different temporal receptive fields to construct six spatio-temporal receptive fields of different scales. The proposed multi-receptive-field fusion model can acquire spatio-temporal information at different scales simultaneously and extract richer human action features, so it recognizes actions of different durations and motion amplitudes more effectively. In addition, we propose a video temporal-augmentation method that expands a video dataset in both the spatial and the temporal dimension, enriching the training samples. On the public human action datasets UCF101 and HMDB51, the sub-video recognition rate of the proposed method exceeds or approaches that of state-of-the-art video action recognition methods.
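The core idea of the abstract can be illustrated with a minimal sketch: each (2+1)D branch factorizes a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, and six such branches (two spatial kernel sizes crossed with three temporal kernel sizes) are fused by channel concatenation. This is not the authors' exact architecture: the PyTorch layer choices and the temporal kernel sizes {1, 3, 5} are assumptions, since the abstract does not state them.

```python
# Hedged sketch of a (2+1)D multi-scale spatio-temporal fusion block.
# Assumptions (not from the paper): PyTorch, temporal kernels {1, 3, 5},
# channel-concatenation fusion, ReLU activations.
import torch
import torch.nn as nn


class STBranch(nn.Module):
    """One (2+1)D branch: 2D spatial conv, then 1D temporal conv."""

    def __init__(self, in_ch, out_ch, s_k, t_k):
        super().__init__()
        # (1, s_k, s_k) kernel convolves only over H and W.
        self.spatial = nn.Conv3d(in_ch, out_ch,
                                 kernel_size=(1, s_k, s_k),
                                 padding=(0, s_k // 2, s_k // 2))
        # (t_k, 1, 1) kernel convolves only over the time axis.
        self.temporal = nn.Conv3d(out_ch, out_ch,
                                  kernel_size=(t_k, 1, 1),
                                  padding=(t_k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.temporal(self.relu(self.spatial(x))))


class MultiSTFusion(nn.Module):
    """Fuses 6 branches: {3x3, 1x1} spatial x {1, 3, 5} temporal kernels."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        combos = [(s, t) for s in (3, 1) for t in (1, 3, 5)]
        per_branch = out_ch // len(combos)  # split channels evenly
        self.branches = nn.ModuleList(
            STBranch(in_ch, per_branch, s, t) for s, t in combos)

    def forward(self, x):  # x: (N, C, T, H, W)
        # Concatenate branch outputs along the channel dimension.
        return torch.cat([b(x) for b in self.branches], dim=1)


x = torch.randn(2, 16, 8, 32, 32)   # two 8-frame, 32x32 clips
y = MultiSTFusion(16, 96)(x)
print(y.shape)                      # -> torch.Size([2, 96, 8, 32, 32])
```

Because every branch preserves the (T, H, W) extent via padding, the fused output differs from the input only in channel count, so such a block could slot into a residual stage; the full model would add batch normalization, residual shortcuts, and trained temporal kernel sizes, none of which the abstract specifies.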

     
