IROS 2025
Recognizing Skeleton-Based Actions As Points
Abstract
Recent advances in skeleton-based action recognition have been driven primarily by Graph Convolutional Networks (GCNs) and skeleton transformers. Conventional approaches model joint co-occurrences through skeletal connections but overlook the positional information inherent in 3D coordinates. Although hypergraphs partially address the limitation of pairwise aggregation in capturing higher-order kinematic dependencies, their topological definitions remain problematic. To address these problems, this paper proposes a skeleton-to-point network (Skeleton2Point) that models joints' positional relationships directly in three-dimensional space without the constraint of a fixed topology, and is, to our knowledge, the first to treat skeleton-based recognition as a point-cloud problem. However, using only the raw 3D coordinates discards the anatomical identity of each keypoint and its temporal position in the sequence. To address this limitation, we augment the three spatial coordinates with two additional dimensions, the anatomical index of each keypoint and its frame number, via a proposed Information Transform Module (ITM), extending the representation from a three-dimensional to a five-dimensional feature space. Furthermore, we propose a Cluster-Dispatch-based Interaction (CDI) module to sharpen the discrimination between local and global information. Compared with existing methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, Skeleton2Point achieves state-of-the-art performance on both the joint modality and stream fusion. In particular, on the challenging NTU-RGB+D 120 dataset, accuracies reach 90.63% under the X-Sub setting and 91.92% under the X-Set setting.
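The core representational idea, lifting each joint from a 3D coordinate to a 5D point that also carries its anatomical index and frame number, can be sketched as follows. This is a hypothetical illustration of the transformation described in the abstract, not the paper's actual ITM implementation; the `skeleton_to_points` function name and the assumption that indices are appended raw (rather than normalized or embedded) are ours.

```python
import numpy as np

def skeleton_to_points(seq):
    """Flatten a skeleton sequence of shape (T frames, V joints, 3 coords)
    into a point set where each point is (x, y, z, joint_index, frame_index).

    Hypothetical sketch of the 5-D lifting described in the abstract; the
    paper's ITM may scale or embed the two index channels differently.
    """
    T, V, _ = seq.shape
    joint_idx = np.tile(np.arange(V), T)    # anatomical index, repeats per frame
    frame_idx = np.repeat(np.arange(T), V)  # frame number, constant within a frame
    xyz = seq.reshape(T * V, 3)             # raw 3-D coordinates as points
    return np.concatenate(
        [xyz, joint_idx[:, None], frame_idx[:, None]], axis=1
    )

# Example: an NTU-style sequence with 64 frames and 25 joints
points = skeleton_to_points(np.zeros((64, 25, 3)))  # shape (1600, 5)
```

Without the two extra channels, a point-cloud backbone would see an unordered set of coordinates and could not distinguish, say, a wrist from an ankle at the same location, which is exactly the identity loss the abstract notes.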
Context
- Venue: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)