QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

Yuxiao Wang; Wolin Liang; Yu Lei; Weiying Xue; Nan Zhuang; Qi Liu

doi:10.1609/aaai.v40i21.38840

AAAI 2026

Conference Paper, AAAI Technical Track on Humans and AI, Artificial Intelligence

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector to initialize the object queries. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on the HICO-DET and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization.
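
To make the idea of language-guided query initialization concrete, the following is a minimal sketch of plain scaled dot-product cross-attention, in which each textual prompt embedding attends over visual region features and the attended result is used as an initial query in place of a random vector. It is not the ACTOR architecture from the paper, whose details the abstract does not give; the function names `softmax` and `init_queries`, the toy two-dimensional features, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_queries(text_embeddings, region_features):
    """Hypothetical helper: build one initial query per text prompt.

    Each prompt embedding acts as the attention query; the visual region
    features act as both keys and values (no learned projections here,
    which a real model would almost certainly have).
    """
    dim = len(region_features[0])
    scale = 1.0 / math.sqrt(dim)
    queries = []
    for prompt in text_embeddings:
        # Similarity between this prompt and every visual region.
        scores = [
            scale * sum(p * r for p, r in zip(prompt, region))
            for region in region_features
        ]
        weights = softmax(scores)
        # Attention-weighted sum of region features becomes the query.
        query = [
            sum(w * region[d] for w, region in zip(weights, region_features))
            for d in range(dim)
        ]
        queries.append(query)
    return queries


if __name__ == "__main__":
    # Toy data: two visual regions and two prompt embeddings (assumed values).
    regions = [[1.0, 0.0], [0.0, 1.0]]
    prompts = [[10.0, 0.0], [0.0, 0.0]]
    for q in init_queries(prompts, regions):
        print(q)
```

In this toy setup, a prompt that aligns strongly with the first region yields a query dominated by that region's features, while an uninformative all-zero prompt yields a uniform average of the regions. That contrast is the intuition behind replacing random queries with semantically guided ones.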
