AAAI 2026
Multiple Human Motion Understanding
Abstract
We introduce LLaMMo (Large Language and Multi-Person Motion Assistant), the first instruction-tuned multimodal framework tailored for multi-person motion analysis. LLaMMo incorporates a novel human-centric, social-temporal learner that models and fuses both intra-person dynamics and inter-person dependencies, yielding robust, context-aware representations of complex group behaviors at low computational overhead. To support LLaMMo, we construct LLaVerse, a large-scale dataset with fine-grained manual annotations covering diverse multi-person activities, from everyday social interactions to professional team sports. Built on LLaVerse, we further propose LLaMI-Bench, a dedicated benchmark for evaluating multi-person behavior understanding across motion and video modalities. Extensive experiments show that LLaMMo consistently outperforms baselines at understanding multi-person interactions under low-latency settings, with notable gains in both social and sport-specific contexts.
Context
- Venue
- AAAI Conference on Artificial Intelligence