Arrow Research search

Author name cluster

Xubin Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

  • Yang Chen
  • Xiaowei Xu
  • Shuai Wang
  • Chenhui Zhu
  • Ruxue Wen
  • Xubin Li
  • Tiezheng Ge
  • Limin Wang

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3x, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64 x 64 and 256 x 256.
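
The invertibility described in the abstract can be illustrated with a minimal sketch. It assumes a single affine coupling layer with linear conditioners, which is a textbook construction and not the architecture used in the paper; the names `coupling_forward`, `coupling_inverse`, `log_likelihood`, `w_s`, and `w_t` are hypothetical. The forward pass maps data to a latent and returns the log-determinant needed for density estimation, and the reverse pass recovers the input exactly.

```python
import numpy as np

def coupling_forward(x, w_s, w_t):
    """Affine coupling: keep the first half, scale and shift the second half."""
    x1, x2 = np.split(x, 2, axis=-1)
    log_s = np.tanh(x1 @ w_s)      # bounded log-scale, predicted from x1 only
    t = x1 @ w_t                   # shift, predicted from x1 only
    z2 = x2 * np.exp(log_s) + t
    log_det = log_s.sum(axis=-1)   # log |det Jacobian| of the transform
    return np.concatenate([x1, z2], axis=-1), log_det

def coupling_inverse(z, w_s, w_t):
    """Reverse (generative) pass: undo the scale and shift using z1 == x1."""
    z1, z2 = np.split(z, 2, axis=-1)
    log_s = np.tanh(z1 @ w_s)
    t = z1 @ w_t
    x2 = (z2 - t) * np.exp(-log_s)
    return np.concatenate([z1, x2], axis=-1)

def log_likelihood(x, w_s, w_t):
    """Change of variables: log p(x) = log N(z; 0, I) + log |det dz/dx|."""
    z, log_det = coupling_forward(x, w_s, w_t)
    dim = z.shape[-1]
    log_prior = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * dim * np.log(2 * np.pi)
    return log_prior + log_det

# Toy data and small random weights (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
w_s = rng.normal(size=(3, 3)) * 0.1
w_t = rng.normal(size=(3, 3)) * 0.1

z, log_det = coupling_forward(x, w_s, w_t)
x_rec = coupling_inverse(z, w_s, w_t)
```

In this sketch the intermediate activations of `coupling_inverse` are the kind of reverse-pass features the abstract proposes to align with a vision foundation model; the alignment loss itself is not shown because the abstract does not specify it.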

AAAI Conference 2026 Conference Paper

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

  • Zhaopei Huang
  • Qifeng Dai
  • Guozheng Wu
  • Xiaopeng Wu
  • Xubin Li
  • Tiezheng Ge
  • Wenxuan Wang
  • Qin Jin

With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H2Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

EAAI Journal 2026 Journal Article

Multi-modal large language model-based image captioning algorithm in information and communication technology: Bridging the gap between general and industry domain

  • Lianying Chao
  • Kai Zhang
  • Xubin Li
  • Linfeng Yin
  • Haoran Cai
  • Sijie Wu
  • DingCheng Shan

In the Information and Communications Technology (ICT) industry, training a domain-specific Large Language Model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge resides not only in text but also in images. Traditional methods can parse text from domain documents but lack image captioning ability. Multi-Modal Large Language Models (MLLMs) understand images but lack sufficient domain expertise. To address this, this paper proposes a multi-stage progressive training strategy for a Domain-specific Image Captioning Model (DICModel) in the ICT domain and constructs a standard evaluation system. Specifically, we synthesize 7233 image-text pairs via the Mermaid tool and LLMs for the first-stage Supervised Fine-Tuning (SFT) of DICModel. Then, ICT domain experts manually annotate 2274 pairs for the second-stage SFT. Finally, experts and LLMs jointly build 1573 Visual Question Answering (VQA) samples for instruction-based SFT. Experimental results indicate that our DICModel achieves state-of-the-art (SOTA) performance in both diagram parsing and VQA tasks. With only 7 billion parameters, DICModel outperforms 32 billion parameter SOTA models. In the parsing task, it increases the Bilingual Evaluation Understudy (BLEU) metric by 56.8% and 20.8% compared to SOTA models with 7 and 32 billion parameters, respectively. In the VQA task, DICModel surpasses 32 billion parameter MLLMs by 1% in accuracy on objective questions constructed by ICT experts. In summary, this work can efficiently extract logical text from images, promoting the development of multi-modal models in the ICT domain.

ICML Conference 2025 Conference Paper

Differentiable Solver Search for Fast Diffusion Sampling

  • Shuai Wang
  • Zexian Li
  • Qipeng Zhang
  • Tianhui Song
  • Xubin Li
  • Tiezheng Ge
  • Bo Zheng
  • Limin Wang

Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a better solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
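
For context, the baseline the abstract refers to can be sketched as follows: in an Adams-like multistep method, the weight on each past model evaluation is the integral of a Lagrange basis polynomial in time. This is a minimal illustration of that classical construction, not the searched solver from the paper; the function name `lagrange_multistep_coeffs` and its arguments are hypothetical.

```python
import numpy as np

def lagrange_multistep_coeffs(past_times, t_start, t_end):
    """Integrate each Lagrange basis polynomial over [t_start, t_end].

    past_times: times at which earlier model outputs f(t_j) are available.
    Returns weights b_j such that
        x(t_end) ~= x(t_start) + sum_j b_j * f(past_times[j]).
    """
    past_times = np.asarray(past_times, dtype=float)
    coeffs = []
    for j, t_j in enumerate(past_times):
        # Build the j-th Lagrange basis polynomial in t (highest power first).
        basis = np.array([1.0])
        for m, t_m in enumerate(past_times):
            if m != j:
                basis = np.polymul(basis, np.array([1.0, -t_m])) / (t_j - t_m)
        antiderivative = np.polyint(basis)
        coeffs.append(
            np.polyval(antiderivative, t_end) - np.polyval(antiderivative, t_start)
        )
    return np.array(coeffs)

# Uniform grid with two past evaluations recovers two-step Adams-Bashforth.
ab2 = lagrange_multistep_coeffs([0.0, 1.0], t_start=1.0, t_end=2.0)

# Non-uniform time steps work the same way; the weights depend only on t.
nonuniform = lagrange_multistep_coeffs([0.0, 0.3, 1.0], t_start=1.0, t_end=1.4)
```

These weights are fixed entirely by the time grid. The abstract's proposal is instead to treat both the time steps and the solver coefficients as a search space and optimize them differentiably.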

ICLR Conference 2025 Conference Paper

Minimal Impact ControlNet: Advancing Multi-ControlNet Integration

  • Shikun Sun
  • Min Zhou
  • Zixuan Wang
  • Xubin Li
  • Tiezheng Ge
  • Zijie Ye
  • Xiaoyu Qin
  • Junliang Xing

With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function’s Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.

AAAI Conference 2025 Conference Paper

RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

  • Chengrui Wang
  • Pengfei Liu
  • Min Zhou
  • Ming Zeng
  • Xubin Li
  • Tiezheng Ge
  • Bo Zheng

Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. In this paper, we introduce RHanDS, a conditional diffusion-based framework designed to refine malformed hands by utilizing decoupled structure and style guidance. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand, while the malformed hand itself provides style guidance for preserving the style of the hand. To alleviate the mutual interference between style and structure guidance, we introduce a two-stage training strategy and build a series of multi-style hand datasets. In the first stage, we use paired hand images for training to ensure stylistic consistency in hand refining. In the second stage, various hand images generated based on human meshes are used for training, enabling the model to gain control over the hand structure. Experimental results demonstrate that RHanDS can effectively refine hand structure while preserving consistency in hand style.

NeurIPS Conference 2024 Conference Paper

Exploring DCN-like architecture for fast image generation with arbitrary resolution

  • Shuai Wang
  • Zexian Li
  • Tianhui Song
  • Xubin Li
  • Tiezheng Ge
  • Bo Zheng
  • Limin Wang

Arbitrary-resolution image generation remains a challenging task in AIGC, as it requires handling varying resolutions and aspect ratios while maintaining high visual quality. Existing transformer-based diffusion methods suffer from quadratic computation cost and limited resolution extrapolation capabilities, making them less effective for this task. In this paper, we propose FlowDCN, a purely convolution-based generative model with linear time and memory complexity that can efficiently generate high-quality images at arbitrary resolutions. Equipped with a new design of learnable group-wise deformable convolution block, our FlowDCN yields higher flexibility and capability to handle different resolutions with a single model. FlowDCN achieves a state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only $\frac{1}{5}$ of the images), visual quality, parameter count ($8\%$ reduction), and FLOPs ($20\%$ reduction). We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.