Abstract:
Human action recognition has research value across a wide range of scenarios and tasks, with promising applications in intelligent security, autonomous driving, and human-computer interaction. Although action recognition from human skeletal data has been studied extensively, a systematic review of its development trajectory and underlying logic remains lacking. We review the major milestones in human skeletal action recognition, grouping them into four key technological approaches: recurrent neural networks, convolutional neural networks, graph convolutional networks, and transformers. We outline how each of these approaches developed and analyze two key technological aspects: spatial modeling and temporal modeling. We also highlight strategies for constructing rich input representations and discuss the significance of the skeletal modality in multimodal integration for action recognition. Finally, we discuss future directions for techniques and applications in human skeletal action recognition.