Author name cluster

Xiaoyu Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers

1 author row

AAAI Conference 2026 Conference Paper

IMAGGarment+: Efficient Attribute-Wise Diffusion for Garment Generation

Jian Yu
Fei Shen
Cong Wang
Yanpeng Sun
Hao Tang
Qin Guo
Xiaoyu Du

Diffusion models have advanced fine-grained garment generation, yet balancing controllability, efficiency, and texture fidelity remains challenging. Adapter-based methods often yield incoherent details, while full fine-tuning is computationally expensive and prone to overwriting pretrained priors. To address these limitations, we propose IMAGGarment+, an efficient diffusion framework for controllable and high-quality garment synthesis. It comprises two key modules designed for efficient and attribute-aware conditioning. First, we introduce an attribute-wise feature extractor (AFE) that disentangles key garment attributes (silhouette, logo, position, and color) into parallel latent streams. Each stream is optimized independently via LoRA, ensuring minimal parameter overhead while retaining expressive capacity. Second, we develop an attribute-adaptive attention (AA) module to inject attribute-specific cues into the generative process through a selective, layer-wise injection strategy. Specifically, silhouette and color features are injected into early decoder layers to guide structural and appearance formation, while logo features are propagated across all layers to ensure cross-scale consistency. Extensive experiments on fine-grained garment benchmarks demonstrate that IMAGGarment+ outperforms state-of-the-art baselines with less than 20% additional parameters, validating its effectiveness and efficiency.
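The selective, layer-wise injection described in the abstract can be pictured as a per-layer schedule. The sketch below is illustrative only and is not the authors' implementation: the function name `injection_plan`, the number of decoder layers, and the `early_cutoff` boundary are hypothetical, and the position attribute is left out because the abstract does not say where it is injected.

```python
def injection_plan(num_layers, early_cutoff):
    """Return, for each decoder layer, the attribute streams injected there.

    Hypothetical schedule: silhouette and color go to layers before
    `early_cutoff`; logo goes to every layer. Both numbers are assumptions.
    """
    plan = []
    for layer in range(num_layers):
        if layer < early_cutoff:
            # Early layers: structural and appearance cues plus logo.
            plan.append(["silhouette", "color", "logo"])
        else:
            # Later layers: only the logo stream is kept.
            plan.append(["logo"])
    return plan


if __name__ == "__main__":
    for index, attributes in enumerate(injection_plan(num_layers=4, early_cutoff=2)):
        print(index, attributes)
```

With four layers and a cutoff of two, the first two layers receive all three streams and the last two receive only the logo stream.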


EAAI Journal 2026 Journal Article

Industrial Internet of Things intrusion detection based on a hybrid model of Pearson-Deep Neural Network And Transformer

Ying Du
Aosheng Ning
Pu Cheng
Roshan Kumar
Xiaoyu Du

With the rapid growth of Industrial Internet of Things (IIoT) devices and the increasing complexity of networks, the safety and security of data have become a key task. Deep learning-based Intrusion Detection Systems (IDS) have demonstrated remarkable capabilities in IIoT environments, effectively identifying network attacks. However, existing systems often fail to fully consider the unique characteristics of IIoT data. To address such characteristics, this paper proposes a hybrid model intrusion detection system specifically designed to accommodate the unique characteristics of IIoT datasets. The model, named Pearson-Deep Neural Network-Transformer (P-DNT), combines the strengths of the Pearson Correlation Coefficient (PCC), Deep Neural Network (DNN), and Transformer architectures. PCC is initially used to effectively evaluate the correlation between data generated from different devices, filtering out the features most relevant to intrusion detection. Subsequently, the P-DNT model is designed by integrating the high-dimensional information extraction capability of DNN with the attention mechanism of Transformer. This model can not only automatically extract high-dimensional features from high-dimensional data but also capture long-range dependencies in time-series data, enhancing its ability to process complex temporal information. Experimental results demonstrate that the model achieved an accuracy of 99.98% in the binary classification stage and an overall accuracy rate of 99.12% in the multi-classification stage, demonstrating competitive performance compared to existing state-of-the-art models.
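As a rough illustration of PCC-based feature filtering, the sketch below scores each feature column against the label and keeps those whose absolute correlation reaches a threshold. This is an assumption-laden example rather than the paper's procedure: the abstract does not state which pairs of variables are correlated or what cutoff is used, so `select_features`, the feature names, and `threshold=0.5` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        # A constant column carries no linear signal; treat it as uncorrelated.
        return 0.0
    return cov / (norm_x * norm_y)


def select_features(columns, labels, threshold=0.5):
    """Keep feature names whose |PCC| with the labels is at least `threshold`.

    The threshold value is a hypothetical choice for illustration.
    """
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, labels)) >= threshold
    ]


if __name__ == "__main__":
    labels = [0, 1, 0, 1]  # toy attack / benign labels
    columns = {
        "pkt_rate": [0, 1, 0, 1],   # follows the label exactly
        "constant": [1, 1, 1, 1],   # no variance
        "inverse": [1, 0, 1, 0],    # perfectly anti-correlated
        "unrelated": [0, 0, 1, 1],  # zero correlation with the label
    }
    print(select_features(columns, labels))
```

Taking the absolute value keeps strongly anti-correlated features as well, since they are just as informative to a downstream classifier.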


TIST Journal 2025 Journal Article

Enriching Responses with Crowd-Sourced Knowledge for Task-Oriented Conversational Agents

Zhaohui Wei
Lizi Liao
Xinguang Xiang
Xiaoyu Du

Task-oriented conversational agents strive to aid users across various tasks by concentrating on generating suitable responses to guarantee successful task accomplishment. Nonetheless, several factors have a substantial influence on user contentment beyond task fulfillment, requiring further investigation. Within this work, we aim to analyze diverse behavioral patterns of conversational agents with the goal of enhancing user satisfaction. Our findings lead to the exploration of three different enriched response generation schemes: EnRG-ATT, EnRG-TIP, and EnRG-SIM. Specifically, EnRG-ATT is designed to integrate the model's capabilities with a dual attention mechanism across two distinct modalities of external resources. It employs a pair of gates to regulate the utilization of such sources efficiently. More elegantly, we introduce EnRG-TIP, which simplifies response enrichment as a sequence prediction problem and exploits the pre-trained language model to capture user tips related to the conversation. Moreover, building on the efficiency of grounding on similar responses, EnRG-SIM further enhances response generation by inserting similar responses into the training sequences, to direct the pre-trained model's attention towards this additional knowledge. Our comprehensive experiments demonstrate that our three proposed methods not only achieve good task completion but also generate responses that yield higher user satisfaction.


AAAI Conference 2025 Conference Paper

IMAGDressing-v1: Customizable Virtual Dressing

Fei Shen
Xin Jiang
Xin He
Hu Ye
Cong Wang
Xiaoyu Du
Zechao Li
Jinhui Tang

Existing virtual try-on (VTON) methods provide only limited user control over garment attributes and generally overlook essential factors such as face, pose, and scene context. To address these limitations, we introduce the virtual dressing (VD) task, which aims to synthesize freely editable human images conditioned on fixed garments and optional user-defined inputs. We further propose a comprehensive affinity metric index (CAMI) to quantify the consistency between generated outputs and reference garments. We present IMAGDressing-v1, which leverages a garment-specific U-Net to integrate semantic features from CLIP and texture features from a VAE. To incorporate these garment features into a frozen denoising U-Net for flexible text-driven scene control, we employ a hybrid attention mechanism composed of frozen self-attention and trainable cross-attention layers. IMAGDressing-v1 seamlessly integrates with extension modules, such as ControlNet and IP-Adapter, enabling enhanced diversity and controllability. To alleviate data constraints, we introduce the Interactive Garment Pairing (IGPair) dataset, comprising over 300,000 garment–image pairs and a standardized data assembly pipeline. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in controlled human image synthesis. The code and model will be available at https://github.com/muzishen/IMAGDressing.


IJCAI Conference 2025 Conference Paper

TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

Rui Yan
Jin Wang
Hongyu Qu
Xiaoyu Du
Dong Zhang
Jinhui Tang
Tieniu Tan

Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities while the support-set cannot be tuned. To this end, we draw on each other's strengths and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts inquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to dig pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.


AAAI Conference 2024 Conference Paper

Delving into Multimodal Prompting for Fine-Grained Visual Classification

Xin Jiang
Hao Tang
Junyao Gao
Xiaoyu Du
Shengfeng He
Zechao Li

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.


AAAI Conference 2024 Conference Paper

MGNet: Learning Correspondences via Multiple Graphs

Dai Luanyuan
Xiaoyu Du
Hanwang Zhang
Jinhui Tang

Learning correspondences aims to find correct correspondences (inliers) from the initial correspondence set with an uneven correspondence distribution and a low inlier rate, which can be regarded as graph data. Recent advances usually use graph neural networks (GNNs) to build a single type of graph or simply stack local graphs into the global one to complete the task. But they ignore the complementary relationship between different types of graphs, which can effectively capture potential relationships among sparse correspondences. To address this problem, we propose MGNet to effectively combine multiple complementary graphs. To obtain information integrating implicit and explicit local graphs, we construct local graphs from implicit and explicit aspects and combine them effectively, which is used to build a global graph. Moreover, we propose Graph Soft Degree Attention (GSDA) to make full use of all sparse correspondence information at once in the global graph, which can capture and amplify discriminative features. Extensive experiments demonstrate that MGNet outperforms state-of-the-art methods in different visual tasks. The code is provided in https://github.com/DAILUANYUAN/MGNet-2024AAAI.


AAAI Conference 2020 Conference Paper

Learning to Match on Graph for Fashion Compatibility Modeling

Xun Yang
Xiaoyu Du
Meng Wang

Understanding the mix-and-match relationships between items receives increasing attention in the fashion industry. Existing methods have primarily learned visual compatibility from dyadic co-occurrence or co-purchase information of items to model the item-item matching interaction. Despite their effectiveness, rich extra-connectivities between compatible items, e.g., user-item interactions and item-item substitutable relationships, which characterize the structural properties of items, have been largely ignored. This paper presents a graph-based fashion matching framework named Deep Relational Embedding Propagation (DREP), aiming to inject the extra-connectivities between items into the pairwise compatibility modeling. Specifically, we first build a multi-relational item-item-user graph which encodes diverse item-item and user-item relationships. Then we compute structured representations of items by an attentive relational embedding propagation rule that performs message propagation along edges of the relational graph. This leads to expressive modeling of higher-order connectivity between items and also better representation of fashion items. Finally, we predict pairwise compatibility based on a compatibility metric learning module. Extensive experiments show that DREP can significantly improve the performance of state-of-the-art methods.


IJCAI Conference 2018 Conference Paper

Outer Product-based Neural Collaborative Filtering

Xiangnan He
Xiaoyu Du
Xiang Wang
Feng Tian
Jinhui Tang
Tat-Seng Chua

In this work, we contribute a new multi-layer neural network architecture named ONCF to perform collaborative filtering. The idea is to use an outer product to explicitly model the pairwise correlations between the dimensions of the embedding space. In contrast to existing neural recommender models that combine user embedding and item embedding via a simple concatenation or element-wise product, our proposal of using outer product above the embedding layer results in a two-dimensional interaction map that is more expressive and semantically plausible. Above the interaction map obtained by outer product, we propose to employ a convolutional neural network to learn high-order correlations among embedding dimensions. Extensive experiments on two public implicit feedback datasets demonstrate the effectiveness of our proposed ONCF framework, in particular, the positive effect of using outer product to model the correlations between embedding dimensions in the low level of a multi-layer neural recommender model.
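The interaction map mentioned in the abstract can be sketched in a few lines. The example below is a minimal illustration in plain Python, not the ONCF code: `interaction_map` and the toy embedding values are hypothetical, and the convolutional layers that the paper places on top of the map are omitted.

```python
def interaction_map(user_emb, item_emb):
    """Outer product of a user and an item embedding as a 2-D list.

    Entry [i][j] is user_emb[i] * item_emb[j], so every pair of
    embedding dimensions gets its own cell.
    """
    return [[u * v for v in item_emb] for u in user_emb]


def elementwise_product(user_emb, item_emb):
    """The simpler combination that the abstract contrasts against."""
    return [u * v for u, v in zip(user_emb, item_emb)]


if __name__ == "__main__":
    user = [1.0, 2.0, 3.0]  # toy user embedding (assumed values)
    item = [4.0, 5.0, 6.0]  # toy item embedding (assumed values)
    grid = interaction_map(user, item)
    for row in grid:
        print(row)
    # The diagonal of the map reproduces the element-wise product.
    print([grid[i][i] for i in range(len(user))] == elementwise_product(user, item))
```

When both embeddings have the same size, the element-wise product appears on the diagonal of the map, and the off-diagonal cells hold the cross-dimension correlations that it does not capture.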
