MoEG-HOI: Mixture of Expert Groups for One-Stage Hand-Object Interaction Motion Generation with Hand-Finger-Joint Semantic Guidance

Hang Xu; Yang Xiao; Changlong Jiang; Haohong Kuang; Kaidi Zhang; Min Du; Ran Wang

doi:10.1609/aaai.v40i13.38102

AAAI 2026

Conference Paper AAAI Technical Track on Computer Vision X Artificial Intelligence

Abstract

In this paper, MoEG-HOI is proposed as a novel method for the challenging 3D hand-object interaction (HOI) motion generation task, introducing Mixture-of-Experts (MoE) to this field for the first time. Almost all mainstream approaches to HOI motion generation rely on diffusion models for their strong generative ability. Nevertheless, because HOI is fine-grained, training a diffusion model well in a one-stage manner is not trivial. Existing state-of-the-art (SOTA) methods (e.g., Text2HOI and MF-MDM) alleviate this mainly via a coarse-to-fine, multi-stage paradigm. Although effective and practical, this paradigm prevents end-to-end training for optimal performance. In contrast, MoEG-HOI applies MoE to address this in a one-stage manner, with end-to-end training ability. This allows each expert to specialize in certain distinct HOI patterns, which reduces the training difficulty of each individual expert. However, naively applying MoE is not optimal for two reasons: (1) in terms of expert design, the original MoE cannot explicitly characterize the hand’s articulated structure at the hand, finger, and joint levels, and (2) in terms of expert routing, the characteristics of varying HOI action classes and diffusion noise levels are not taken into account. To address the first problem, MoE’s experts are organized into groups that correspond to motion generation for the hand, fingers, and joints respectively, under semantic guidance from global to local. To facilitate this, the HOI text description is correspondingly refined at the Hand-Finger-Joint levels using an LLM. To address the second problem, MoE routing considers the HOI action label and the diffusion noise level jointly when selecting experts, to better capture the inter-class variation among actions and the dynamics of diffusion generation. SOTA performance on the ARCTIC, GRAB and H2O datasets demonstrates the effectiveness of our method.
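
The routing idea in the abstract, selecting experts jointly from the action label and the diffusion noise level, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every name, size, and design choice below (sinusoidal step embedding, top-k softmax gating, linear experts, a single expert group) is a hypothetical assumption chosen only to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical; the abstract does not specify them.
NUM_ACTIONS = 4    # number of HOI action classes
NUM_EXPERTS = 6    # experts in one group (e.g. a finger-level group)
DIM = 16           # feature width
TOP_K = 2          # experts kept per sample

# Stand-ins for learned parameters, randomly initialised for the sketch.
action_emb = rng.normal(size=(NUM_ACTIONS, DIM))
w_gate = rng.normal(size=(3 * DIM, NUM_EXPERTS)) * 0.1
w_experts = rng.normal(size=(NUM_EXPERTS, DIM, DIM)) * 0.1


def timestep_embedding(t, dim=DIM):
    """Sinusoidal embedding of the diffusion step t (integer array, shape [B])."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)


def route(x, action, t, k=TOP_K):
    """Gate weights of shape [B, NUM_EXPERTS], zero outside the top-k experts."""
    # The gate sees the motion feature, the action label, and the noise level.
    cond = np.concatenate([x, action_emb[action], timestep_embedding(t)], axis=-1)
    logits = cond @ w_gate
    # Keep the k largest logits per row; mask the rest before the softmax.
    kth = np.sort(logits, axis=-1)[:, -k][:, None]
    masked = np.where(logits >= kth, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def moe_layer(x, action, t):
    """Mix expert outputs with the gate weights (dense evaluation for clarity)."""
    gates = route(x, action, t)                           # [B, E]
    expert_out = np.einsum("bd,edh->beh", x, w_experts)   # [B, E, DIM]
    return np.einsum("be,beh->bh", gates, expert_out), gates


x = rng.normal(size=(3, DIM))      # three motion feature vectors
action = np.array([0, 2, 2])       # action label per sample
t = np.array([10, 10, 900])        # diffusion step per sample
out, gates = moe_layer(x, action, t)
print(out.shape, (gates > 0).sum(axis=-1))
```

In this sketch the same feature can be sent to different experts when only the action label or only the diffusion step changes, which is the behaviour the abstract attributes to its routing mechanism. A grouped variant would presumably hold one such layer per hand, finger, and joint level, each conditioned on its own refined text description, but that structure is a guess rather than something the abstract states.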
