Author name cluster

Xubin Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

2 author rows

AAAI Conference 2026 Conference Paper

Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Yang Chen
Xiaowei Xu
Shuai Wang
Chenhui Zhu
Ruxue Wen
Xubin Li
Tiezheng Ge
Limin Wang

Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3x, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64 x 64 and 256 x 256.


AAAI Conference 2026 Conference Paper

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Zhaopei Huang
Qifeng Dai
Guozheng Wu
Xiaopeng Wu
Xubin Li
Tiezheng Ge
Wenxuan Wang
Qin Jin

With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H2Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.


EAAI Journal 2026 Journal Article

Multi-modal large language model-based image captioning algorithm in information and communication technology: Bridging the gap between general and industry domain

Lianying Chao
Kai Zhang
Xubin Li
Linfeng Yin
Haoran Cai
Sijie Wu
DingCheng Shan

In the Information and Communications Technology (ICT) industry, training a domain-specific Large Language Model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge resides not only in text but also in images. Traditional methods can parse text from domain documents but lack image captioning ability. Multi-Modal Large Language Models (MLLMs) understand images but lack sufficient domain expertise. To address this, this paper proposes a multi-stage progressive training strategy for a Domain-specific Image Captioning Model (DICModel) in the ICT domain and constructs a standard evaluation system. Specifically, we synthesize 7233 image-text pairs via the Mermaid tool and LLMs for the first-stage Supervised Fine-Tuning (SFT) of DICModel. Then, ICT domain experts manually annotate 2274 pairs for the second-stage SFT. Finally, experts and LLMs jointly build 1573 Visual Question Answering (VQA) data for instruction-based SFT. Experimental results indicate that our DICModel achieves state-of-the-art (SOTA) performance in both diagram parsing and VQA tasks. With only 7 billion parameters, DICModel outperforms 32 billion parameter SOTA models. In the parsing task, it increases the Bilingual Evaluation Understudy (BLEU) metric by 56.8% and 20.8% compared to SOTA models with 7 and 32 billion parameters, respectively. In the VQA task, DICModel surpasses 32 billion parameter MLLMs by 1% in accuracy on objective questions constructed by ICT experts. In summary, this work can efficiently extract logical text from images, promoting the development of multi-modal models in the ICT domain.


ICML Conference 2025 Conference Paper

Differentiable Solver Search for Fast Diffusion Sampling

Shuai Wang
Zexian Li
Qipeng Zhang
Tianhui Song
Xubin Li
Tiezheng Ge
Bo Zheng 0007
Limin Wang 0002

Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion model and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, DDPM model, DiT-XL/2, reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.


ICLR Conference 2025 Conference Paper

Minimal Impact ControlNet: Advancing Multi-ControlNet Integration

Shikun Sun
Min Zhou
Zixuan Wang 0026
Xubin Li
Tiezheng Ge
Zijie Ye
Xiaoyu Qin 0001
Junliang Xing

With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function’s Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.


AAAI Conference 2025 Conference Paper

RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

Chengrui Wang
Pengfei Liu
Min Zhou
Ming Zeng
Xubin Li
Tiezheng Ge
Bo Zheng

Although diffusion models can generate high-quality human images, their applications are limited by the instability in generating hands with correct structures. In this paper, we introduce RHanDS, a conditional diffusion-based framework designed to refine malformed hands by utilizing decoupled structure and style guidance. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand, while the malformed hand itself provides style guidance for preserving the style of the hand. To alleviate the mutual interference between style and structure guidance, we introduce a two-stage training strategy and build a series of multi-style hand datasets. In the first stage, we use paired hand images for training to ensure stylistic consistency in hand refining. In the second stage, various hand images generated based on human meshes are used for training, enabling the model to gain control over the hand structure. Experimental results demonstrate that RHanDS can effectively refine hand structure while preserving consistency in hand style.


NeurIPS Conference 2024 Conference Paper

Exploring DCN-like architecture for fast image generation with arbitrary resolution

Shuai Wang
Zexian Li
Tianhui Song
Xubin Li
Tiezheng Ge
Bo Zheng
Limin Wang

Arbitrary-resolution image generation still remains a challenging task in AIGC, as it requires handling varying resolutions and aspect ratios while maintaining high visual quality. Existing transformer-based diffusion methods suffer from quadratic computation cost and limited resolution extrapolation capabilities, making them less effective for this task. In this paper, we propose FlowDCN, a purely convolution-based generative model with linear time and memory complexity, that can efficiently generate high-quality images at arbitrary resolutions. Equipped with a new design of learnable group-wise deformable convolution block, our FlowDCN yields higher flexibility and capability to handle different resolutions with a single model. FlowDCN achieves the state-of-the-art 4.30 sFID on $256\times256$ ImageNet Benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only $\frac{1}{5}$ images), visual quality, parameters ($8\%$ reduction) and FLOPs ($20\%$ reduction). We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
