Author name cluster

Xiaoyu Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

IMAGGarment+: Efficient Attribute-Wise Diffusion for Garment Generation

  • Jian Yu
  • Fei Shen
  • Cong Wang
  • Yanpeng Sun
  • Hao Tang
  • Qin Guo
  • Xiaoyu Du

Diffusion models have advanced fine-grained garment generation, yet balancing controllability, efficiency, and texture fidelity remains challenging. Adapter-based methods often yield incoherent details, while full fine-tuning is computationally expensive and prone to overwriting pretrained priors. To address these limitations, we propose IMAGGarment+, an efficient diffusion framework for controllable and high-quality garment synthesis. It comprises two key modules designed for efficient and attribute-aware conditioning. First, we introduce an attribute-wise feature extractor (AFE) that disentangles key garment attributes, silhouette, logo, position, and color, into parallel latent streams. Each stream is optimized independently via LoRA, ensuring minimal parameter overhead while retaining expressive capacity. Second, we develop an attribute-adaptive attention (AA) module to inject attribute-specific cues into the generative process through a selective, layer-wise injection strategy. Specifically, silhouette and color features are injected into early decoder layers to guide structural and appearance formation, while logo features are propagated across all layers to ensure cross-scale consistency. Extensive experiments on fine-grained garment benchmarks demonstrate that IMAGGarment+ outperforms state-of-the-art baselines with less than 20% additional parameters, validating its effectiveness and efficiency.

EAAI Journal 2026 Journal Article

Industrial Internet of Things intrusion detection based on a hybrid model of Pearson-Deep Neural Network And Transformer

  • Ying Du
  • Aosheng Ning
  • Pu Cheng
  • Roshan Kumar
  • Xiaoyu Du

With the rapid growth of Industrial Internet of Things (IIoT) devices and the increasing complexity of networks, ensuring the safety and security of data has become a key task. Deep learning-based Intrusion Detection Systems (IDS) have demonstrated remarkable capabilities in IIoT environments, effectively identifying network attacks. However, existing systems often fail to fully consider the unique characteristics of IIoT data. To address this gap, this paper proposes a hybrid-model intrusion detection system designed specifically for the characteristics of IIoT datasets. The model, named Pearson-Deep Neural Network-Transformer (P-DNT), combines the strengths of the Pearson Correlation Coefficient (PCC), Deep Neural Network (DNN), and Transformer architectures. PCC is first used to evaluate the correlation between data generated from different devices, selecting the features most relevant to intrusion detection. Subsequently, the P-DNT model is designed by integrating the high-dimensional information extraction capability of DNN with the attention mechanism of Transformer. This model can not only automatically extract high-dimensional features from high-dimensional data but also capture long-range dependencies in time-series data, enhancing its ability to process complex temporal information. Experimental results demonstrate that the model achieved an accuracy of 99.98% in the binary classification stage and an overall accuracy of 99.12% in the multi-classification stage, demonstrating competitive performance compared to existing state-of-the-art models.
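As a rough illustration of the PCC-based feature selection step mentioned in the abstract above, the sketch below keeps the features whose absolute Pearson correlation with the label meets a threshold. This is an assumption-laden toy, not the authors' pipeline: the feature names, the 0.5 threshold, and the choice to correlate each feature with the label (rather than features with each other) are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.5):
    """Keep feature names whose |PCC| with the labels is at least `threshold`.

    The threshold value is a hypothetical choice for this sketch.
    """
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, labels)) >= threshold
    ]


# Hypothetical toy data: three features, four samples, binary attack label.
columns = {
    "packet_rate": [1.0, 2.0, 3.0, 4.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
labels = [0.0, 0.0, 1.0, 1.0]

selected = select_features(columns, labels)
print(selected)
```

In a setup like this, the selected columns would then be passed to the downstream DNN and Transformer stages.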

TIST Journal 2025 Journal Article

Enriching Responses with Crowd-Sourced Knowledge for Task-Oriented Conversational Agents

  • Zhaohui Wei
  • Lizi Liao
  • Xinguang Xiang
  • Xiaoyu Du

Task-oriented conversational agents strive to aid users across various tasks by concentrating on generating suitable responses to guarantee successful task accomplishment. Nonetheless, several factors have a substantial influence on user contentment beyond task fulfillment, requiring further investigation. Within this work, we aim to analyze diverse behavioral patterns of conversational agents with the goal of enhancing user satisfaction. Our findings lead to the exploration of three different enriched response generation schemes: EnRG-ATT, EnRG-TIP, and EnRG-SIM. Specifically, EnRG-ATT is designed to integrate the model's capabilities with a dual attention mechanism across two distinct modalities of external resources. It employs a pair of gates to regulate the utilization of such sources efficiently. More elegantly, we introduce EnRG-TIP, which simplifies response enrichment as a sequence prediction problem and exploits the pre-trained language model to capture user tips related to the conversation. Moreover, building on the efficiency of grounding on similar responses, EnRG-SIM further enhances response generation by inserting similar responses into the training sequences, to direct the pre-trained model's attention towards this additional knowledge. Our comprehensive experiments demonstrate that our three proposed methods not only achieve good task completion but also generate responses that yield higher user satisfaction.

AAAI Conference 2025 Conference Paper

IMAGDressing-v1: Customizable Virtual Dressing

  • Fei Shen
  • Xin Jiang
  • Xin He
  • Hu Ye
  • Cong Wang
  • Xiaoyu Du
  • Zechao Li
  • Jinhui Tang

Existing virtual try-on (VTON) methods provide only limited user control over garment attributes and generally overlook essential factors such as face, pose, and scene context. To address these limitations, we introduce the virtual dressing (VD) task, which aims to synthesize freely editable human images conditioned on fixed garments and optional user-defined inputs. We further propose a comprehensive affinity metric index (CAMI) to quantify the consistency between generated outputs and reference garments. We present IMAGDressing-v1, which leverages a garment-specific U-Net to integrate semantic features from CLIP and texture features from a VAE. To incorporate these garment features into a frozen denoising U-Net for flexible text-driven scene control, we employ a hybrid attention mechanism composed of frozen self-attention and trainable cross-attention layers. IMAGDressing-v1 seamlessly integrates with extension modules, such as ControlNet and IP-Adapter, enabling enhanced diversity and controllability. To alleviate data constraints, we introduce the Interactive Garment Pairing (IGPair) dataset, comprising over 300,000 garment–image pairs and a standardized data assembly pipeline. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in controlled human image synthesis. The code and model will be available at https://github.com/muzishen/IMAGDressing.

IJCAI Conference 2025 Conference Paper

TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

  • Rui Yan
  • Jin Wang
  • Hongyu Qu
  • Xiaoyu Du
  • Dong Zhang
  • Jinhui Tang
  • Tieniu Tan

Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we combine the strengths of both approaches and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts queried from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to mine pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.

AAAI Conference 2024 Conference Paper

Delving into Multimodal Prompting for Fine-Grained Visual Classification

  • Xin Jiang
  • Hao Tang
  • Junyao Gao
  • Xiaoyu Du
  • Shengfeng He
  • Zechao Li

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.

AAAI Conference 2024 Conference Paper

MGNet: Learning Correspondences via Multiple Graphs

  • Dai Luanyuan
  • Xiaoyu Du
  • Hanwang Zhang
  • Jinhui Tang

Learning correspondences aims to find correct correspondences (inliers) from the initial correspondence set with an uneven correspondence distribution and a low inlier rate, which can be regarded as graph data. Recent advances usually use graph neural networks (GNNs) to build a single type of graph or simply stack local graphs into a global one to complete the task. However, they ignore the complementary relationship between different types of graphs, which can effectively capture potential relationships among sparse correspondences. To address this problem, we propose MGNet to effectively combine multiple complementary graphs. To obtain information integrating implicit and explicit local graphs, we construct local graphs from implicit and explicit aspects and combine them effectively; the combined result is used to build a global graph. Moreover, we propose Graph Soft Degree Attention (GSDA) to make full use of all sparse correspondence information at once in the global graph, which can capture and amplify discriminative features. Extensive experiments demonstrate that MGNet outperforms state-of-the-art methods in different visual tasks. The code is provided at https://github.com/DAILUANYUAN/MGNet-2024AAAI.

AAAI Conference 2020 Conference Paper

Learning to Match on Graph for Fashion Compatibility Modeling

  • Xun Yang
  • Xiaoyu Du
  • Meng Wang

Understanding the mix-and-match relationships between items receives increasing attention in the fashion industry. Existing methods have primarily learned visual compatibility from dyadic co-occurrence or co-purchase information of items to model the item-item matching interaction. Despite their effectiveness, rich extra-connectivities between compatible items, e.g., user-item interactions and item-item substitutable relationships, which characterize the structural properties of items, have been largely ignored. This paper presents a graph-based fashion matching framework named Deep Relational Embedding Propagation (DREP), aiming to inject the extra-connectivities between items into the pairwise compatibility modeling. Specifically, we first build a multi-relational item-item-user graph which encodes diverse item-item and user-item relationships. Then we compute structured representations of items by an attentive relational embedding propagation rule that performs message propagation along edges of the relational graph. This leads to expressive modeling of higher-order connectivity between items and also better representation of fashion items. Finally, we predict pairwise compatibility based on a compatibility metric learning module. Extensive experiments show that DREP can significantly improve the performance of state-of-the-art methods.

IJCAI Conference 2018 Conference Paper

Outer Product-based Neural Collaborative Filtering

  • Xiangnan He
  • Xiaoyu Du
  • Xiang Wang
  • Feng Tian
  • Jinhui Tang
  • Tat-Seng Chua

In this work, we contribute a new multi-layer neural network architecture named ONCF to perform collaborative filtering. The idea is to use an outer product to explicitly model the pairwise correlations between the dimensions of the embedding space. In contrast to existing neural recommender models that combine user embedding and item embedding via a simple concatenation or element-wise product, our proposal of using an outer product above the embedding layer results in a two-dimensional interaction map that is more expressive and semantically plausible. Above the interaction map obtained by the outer product, we propose to employ a convolutional neural network to learn high-order correlations among embedding dimensions. Extensive experiments on two public implicit feedback datasets demonstrate the effectiveness of our proposed ONCF framework, in particular the positive effect of using an outer product to model the correlations between embedding dimensions at the low level of a multi-layer neural recommender model.
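To make the outer-product idea in the abstract above concrete, the following minimal sketch builds the two-dimensional interaction map from a user embedding and an item embedding. It also shows that the element-wise product used by earlier models is only the diagonal of that map. The embedding values and function names are hypothetical, and the convolutional layers that ONCF applies on top of the map are omitted.

```python
def outer_product(user_emb, item_emb):
    """Return the K x K interaction map E with E[i][j] = user_emb[i] * item_emb[j]."""
    return [[u * v for v in item_emb] for u in user_emb]


def elementwise_product(user_emb, item_emb):
    """Element-wise product, as used by simpler embedding-combination schemes."""
    return [u * v for u, v in zip(user_emb, item_emb)]


# Hypothetical 3-dimensional embeddings for one user and one item.
user = [1.0, 2.0, 3.0]
item = [0.5, -1.0, 2.0]

interaction_map = outer_product(user, item)

# The diagonal of the map recovers the element-wise product; the off-diagonal
# entries are the extra pairwise correlations between embedding dimensions.
diagonal = [interaction_map[i][i] for i in range(len(user))]

for row in interaction_map:
    print(row)
```

Each off-diagonal entry pairs dimension i of the user embedding with dimension j of the item embedding, which is the information a concatenation or element-wise product does not expose directly.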