Incorporating Transformer Networks and Joint Distance Images into Skeleton-driven Human Activity Recognition | ||
| AUT Journal of Modeling and Simulation | ||
| Article 5, Volume 56, Issue 1, 2024, Pages 69-86 | ||
| Article type: Research Article | ||
| Digital Object Identifier (DOI): 10.22060/miscj.2024.23094.5357 | ||
| Authors | ||
| Elham Shabaninia* 1; Fatemeh Shafizadegan 2; Hossein Nezamabadi-pour 3; Ahmad R. Naghsh-Nilchi 2 | ||
| 1 Department of Applied Mathematics, Graduate University of Advanced Technology, Kerman, Iran | ||
| 2 Department of Artificial Intelligence, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran | ||
| 3 Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran | ||
| Abstract | ||
| Skeleton-based action recognition has attracted significant attention in computer vision. In recent years, Transformer networks have improved action recognition thanks to their ability to capture long-range dependencies and relationships in sequential data. In this context, a novel approach is proposed that enhances skeleton-based activity recognition by combining Transformer self-attention with Convolutional Neural Network (CNN) architectures. The proposed method uses the 3D distances between pairs of joints to generate a Joint Distance Image (JDI) for each frame. These JDIs offer a relatively view-independent representation, allowing the model to discern intricate details of human actions. The JDIs extracted from different frames are then either input directly into the Transformer network or first fed into a CNN that extracts spatial features. The resulting features, combined with positional embeddings, serve as input to a Transformer encoder, enabling the model to reconstruct the underlying structure of the action from the training data. Experimental results show performance comparable to other state-of-the-art Transformer-based approaches on benchmark datasets such as NTU RGB+D and NTU RGB+D 120. The incorporation of Transformer networks and Joint Distance Images presents a promising avenue for advancing skeleton-based human action recognition, offering robust performance and improved generalization across diverse action datasets. | ||
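As a rough illustration of the per-frame representation described in the abstract, the sketch below computes a pairwise joint-distance matrix for each frame of a skeleton sequence. It is not the authors' implementation: the function names, the `(frames, joints, 3)` array layout, and the random test data are assumptions made here, and the abstract does not say how the distances are normalized or encoded as images. The second half checks that the distances do not change when the whole skeleton is rotated or translated, which is one way to read the claim of view independence.

```python
import numpy as np


def joint_distance_images(skeleton: np.ndarray) -> np.ndarray:
    """Return one pairwise-distance matrix per frame (hypothetical helper).

    skeleton: array of shape (T, J, 3) holding T frames of J 3D joints.
    Result:   array of shape (T, J, J); entry [t, i, j] is the Euclidean
              distance between joints i and j in frame t.
    """
    if skeleton.ndim != 3 or skeleton.shape[-1] != 3:
        raise ValueError("expected an array of shape (frames, joints, 3)")
    # Broadcast to (T, J, J, 3) difference vectors, then take their norms.
    diff = skeleton[:, :, None, :] - skeleton[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)


def random_orthogonal(rng: np.random.Generator) -> np.ndarray:
    """Draw a 3x3 orthogonal matrix via QR decomposition (illustration only)."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q


rng = np.random.default_rng(0)
frames, joints = 4, 25  # assumed sizes; random data stands in for a real sequence
skeleton = rng.standard_normal((frames, joints, 3))
jdi = joint_distance_images(skeleton)

# Same motion seen from another viewpoint: transform and shift every joint.
moved = skeleton @ random_orthogonal(rng).T + np.array([0.5, -1.0, 2.0])
jdi_moved = joint_distance_images(moved)

print(jdi.shape)
print(np.allclose(jdi, jdi_moved))
```

Each `(J, J)` matrix is symmetric with a zero diagonal, so a real pipeline might keep only one triangle or rescale the values before passing them to a CNN or Transformer; those choices are left open here because the abstract does not specify them.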
| Keywords | ||
| Human Activity Recognition; Joint Distance Image; Vision Transformer; Deep Learning | ||