Incorporating Transformer Networks and Joint Distance Images into Skeleton-driven Human Activity Recognition

Shabaninia, Elham; Shafizadegan, Fatemeh; Nezamabadi-pour, Hossein; Naghsh-Nilchi, Ahmad R.

doi:10.22060/miscj.2024.23094.5357

AUT Journal of Modeling and Simulation
Article 5, Volume 56, Issue 1, 2024, Pages 69-86
Article type: Research Article
Authors
Elham Shabaninia¹*; Fatemeh Shafizadegan²; Hossein Nezamabadi-pour³; Ahmad R. Naghsh-Nilchi²
¹Department of Applied Mathematics, Graduate University of Advanced Technology, Kerman, Iran
²Department of Artificial Intelligence, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran
³Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
Abstract
Skeleton-based action recognition has attracted significant attention in computer vision. In recent years, Transformer networks have improved action recognition because they can capture long-range dependencies and relationships in sequential data. In this context, a novel approach is proposed that enhances skeleton-based activity recognition by combining Transformer self-attention with Convolutional Neural Network (CNN) architectures. The proposed method computes the 3D distances between all pairs of joints and uses them to generate a Joint Distance Image (JDI) for each frame. These JDIs offer a relatively view-independent representation, allowing the model to discern intricate details of human actions. The JDIs extracted from different frames are then processed to strengthen the model's understanding of spatial features and relationships: they can be fed directly into the Transformer network, or first passed through a CNN that extracts the crucial spatial features. The resulting features, combined with positional embeddings, serve as input to a Transformer encoder, enabling the model to reconstruct the underlying structure of the action from the training data. Experimental results demonstrate the effectiveness of the proposed method, with performance comparable to other state-of-the-art Transformer-based approaches on benchmark datasets such as NTU RGB+D and NTU RGB+D 120. Incorporating Transformer networks and Joint Distance Images is a promising avenue for advancing skeleton-based human action recognition, offering robust performance and improved generalization across diverse action datasets.
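
To make the Joint Distance Image idea concrete, the following is a minimal sketch of how a per-frame matrix of pairwise 3D joint distances might be computed with NumPy. The function names `joint_distance_image` and `sequence_to_jdis` are hypothetical, and the abstract does not state how the authors normalize, resize, or color-map the resulting images, so none of those steps is attempted here.

```python
import numpy as np


def joint_distance_image(frame_joints):
    """Pairwise Euclidean distances between the joints of one frame.

    frame_joints: array-like of shape (J, 3), one 3D coordinate per joint.
    Returns a symmetric (J, J) array with zeros on the diagonal.
    Illustrative helper only; not the authors' implementation.
    """
    joints = np.asarray(frame_joints, dtype=np.float64)
    # Broadcasting yields a (J, J, 3) array of coordinate differences.
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def sequence_to_jdis(sequence):
    """Stack of distance matrices for a skeleton sequence.

    sequence: array-like of shape (T, J, 3) for T frames.
    Returns an array of shape (T, J, J), one distance matrix per frame.
    """
    seq = np.asarray(sequence, dtype=np.float64)
    diff = seq[:, :, None, :] - seq[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy input: 4 frames of a 25-joint skeleton.
    skeleton = rng.normal(size=(4, 25, 3))
    jdis = sequence_to_jdis(skeleton)
    print(jdis.shape)
```

Because Euclidean distances are unchanged by rotating or translating the whole skeleton, each matrix is identical under any rigid change of camera viewpoint. Sensor noise and occlusion still vary with viewpoint in practice, which is consistent with the abstract describing the representation as only relatively view-independent.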
Keywords
Human Activity Recognition; Joint Distance Image; Vision Transformer; Deep Learning