Arrow Research search

Author name cluster

Hanbin Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

AAAI Conference 2026 Conference Paper

Bring Your Dreams to Life: Continual Text-to-Video Customization

  • Jiahua Dong
  • Xudong Wang
  • Wenqi Liang
  • Zongyan Han
  • Meng Cao
  • Duzhen Zhang
  • Hanbin Zhao
  • Zhi Han

Customized text-to-video generation (CTVG) has recently made great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static rather than expanding incrementally over time. They also struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve these challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks while tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy: the former captures the unique characteristics and identities of old concepts during training, while the latter combines all subject and motion adapters of old concepts according to their relevance during testing. In addition, to tackle concept neglect, we develop a controllable conditional synthesis that enhances regional features and aligns video contexts with user conditions by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that CCVD outperforms existing CTVG models.
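
The test-time "combine adapters of old concepts based on their relevance" step can be sketched as follows. This is an illustrative assumption, not the paper's implementation: `aggregate_adapters`, the softmax weighting, and the embedding-based relevance score are all hypothetical names and choices.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def aggregate_adapters(prompt_emb, concept_embs, adapter_deltas):
    """Combine per-concept adapter weight deltas, weighting each stored
    concept by its softmax-normalized relevance to the test prompt."""
    scores = [cosine(prompt_emb, c) for c in concept_embs]
    exp = [math.exp(s) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(adapter_deltas[0])
    merged = [sum(w * d[i] for w, d in zip(weights, adapter_deltas))
              for i in range(dim)]
    return merged, weights
```

A prompt that matches a stored concept's embedding pulls the merged delta toward that concept's adapter while still mixing in the others.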

AAAI Conference 2026 Conference Paper

Mass Concept Erasure in Diffusion Models with Concept Hierarchy

  • Jiahang Tu
  • Ye Li
  • Yiming Wu
  • Hanbin Zhao
  • Chao Zhang
  • Hui Qian

The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent–child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content. Comprehensive experiments demonstrate that our method achieves a superior balance between effective multi-concept erasure and the preservation of desirable generative performance.
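
The frozen-down / trainable-up structure of SuPLoRA can be sketched in a few lines. This is a minimal schematic of that parameter split only; the class and method names are hypothetical, and the actual erasure objective is not reproduced here.

```python
def matmul(X, Y):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def madd(X, Y):
    # Elementwise matrix addition.
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

class SuPLoRALayer:
    """Sketch of a supertype-preserving low-rank update: the down-projection
    A is frozen (initialized from supertype information) and only the
    up-projection B receives gradient updates during erasure."""
    def __init__(self, W, A, B):
        self.W, self.A, self.B = W, A, B  # W: d_in x d_out, A: d_in x r, B: r x d_out

    def effective_weight(self):
        # W + A @ B, the low-rank-adapted weight.
        return madd(self.W, matmul(self.A, self.B))

    def erasure_step(self, grad_B, lr=0.1):
        # Only B moves; A and the base weight W stay frozen.
        self.B = [[b - lr * g for b, g in zip(rb, rg)]
                  for rb, rg in zip(self.B, grad_B)]
```

Freezing A means the supertype direction it encodes is untouched no matter how many subtype-erasure steps are applied to B.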

IJCAI Conference 2025 Conference Paper

A Timestep-Adaptive Frequency-Enhancement Framework for Diffusion-based Image Super-Resolution

  • Yueying Li
  • Hanbin Zhao
  • Jiaqing Zhou
  • Guozhi Xu
  • Tianlei Hu
  • Gang Chen
  • Haobo Wang

Image super-resolution (ISR) is a classic and challenging problem in computer vision because of complex and unknown degradation patterns in the data collection process. Leveraging powerful generative priors, diffusion-based methods have recently established new state-of-the-art ISR performance, but their characteristics in the frequency domain are still underexplored. In this paper, we innovatively investigate their frequency-domain behaviors from a sampling timestep perspective. Experimentally, we find that current diffusion-based ISR algorithms exhibit insufficiency in different frequency components in distinct groups of timesteps during sampling. To address this, we first propose a Timestep Division Controller that adaptively divides the timesteps into groups based on the performance gradient across different components. Next, we design two dedicated modules, the Amplitude and Phase Enhancement Module (APEM) and the High- and Low-Frequency Enhancement Module (HLEM), to regulate the information flow of distinct frequency-domain features. By adaptively enhancing specific frequency components at different stages of the sampling process, the two modules effectively compensate for the insufficient frequency-domain perception of diffusion-based ISR models. Extensive experiments on three benchmark datasets verify the superior ISR performance of our method, e.g., achieving an average 5.40% improvement on CLIP-IQA compared to the best diffusion-based ISR baseline.
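
The high/low-frequency decomposition that such modules operate on can be illustrated with a standard Fourier-domain split. This shows the generic decomposition only, not the paper's APEM/HLEM; the function name and square mask are assumptions.

```python
import numpy as np

def split_frequency(img, cutoff):
    """Split a 2-D array into low- and high-frequency parts with a centered
    square mask in the shifted Fourier domain. The two parts sum back to the
    original image, so either band can be enhanced and recombined."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    mask = np.zeros(F.shape, dtype=bool)
    mask[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1] = True
    low = np.fft.ifft2(np.fft.ifftshift(np.where(mask, F, 0))).real
    high = np.fft.ifft2(np.fft.ifftshift(np.where(mask, 0, F))).real
    return low, high
```

Amplitude and phase (as in APEM) are obtained from the same transform via `np.abs(F)` and `np.angle(F)`.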

ICML Conference 2025 Conference Paper

Efficiently Access Diffusion Fisher: Within the Outer Product Span Space

  • Fangyikang Wang
  • Hubery Yin
  • Shaobin Zhuang
  • Huminhao Zhu
  • Yinan Li
  • Lei Qian
  • Chao Zhang 0029
  • Hanbin Zhao

Recent advances in diffusion models (DMs) have explored incorporating the second-order diffusion Fisher information (DF), defined as the negative Hessian of the log density, into various downstream tasks and theoretical analysis. However, current practices typically approximate the diffusion Fisher by applying auto-differentiation to the learned score network. This black-box method, though straightforward, lacks any accuracy guarantee and is time-consuming. In this paper, we show that the diffusion Fisher actually resides within a space spanned by the outer products of score and initial data. Based on the outer-product structure, we develop two efficient approximation algorithms to access the trace and matrix-vector multiplication of DF, respectively. These algorithms bypass the auto-differentiation operations with time-efficient vector-product calculations. Furthermore, we establish approximation error bounds for the proposed algorithms. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced computational cost of our proposed algorithms. Additionally, based on the novel outer-product formulation of DF, we design the first numerical verification experiment for the optimal transport property of the map induced by the general PF-ODE.
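
A toy 1-D check makes the "Fisher from moments of initial data" idea concrete. Assuming initial data uniform on {-1, +1} and unit Gaussian noise, the marginal is an equal mixture of N(-1, 1) and N(+1, 1); the diffusion Fisher -d²/dx² log p_t then equals 1 minus the posterior variance of x0, a quantity built from (outer) products of the initial data. This is a hand-derived sanity check, not the paper's high-dimensional algorithm.

```python
import math

def log_pt(x):
    # Unnormalized log of the mixture 0.5*N(-1,1) + 0.5*N(+1,1);
    # the dropped normalizing constant does not affect derivatives.
    return math.log(0.5 * math.exp(-0.5 * (x + 1) ** 2)
                    + 0.5 * math.exp(-0.5 * (x - 1) ** 2))

def fisher_from_posterior_moments(x):
    """Diffusion Fisher via posterior moments of the initial data:
    F = 1/sigma^2 - Var(x0 | x)/sigma^4, with sigma = 1 here."""
    w_pos = math.exp(-0.5 * (x - 1) ** 2)
    w_neg = math.exp(-0.5 * (x + 1) ** 2)
    mean = (w_pos - w_neg) / (w_pos + w_neg)  # E[x0 | x], x0 in {-1, +1}
    var = 1.0 - mean ** 2                      # E[x0^2 | x] - E[x0 | x]^2
    return 1.0 - var

def fisher_finite_difference(x, h=1e-3):
    # Direct -d^2/dx^2 log p_t by central second differences.
    return -(log_pt(x + h) - 2.0 * log_pt(x) + log_pt(x - h)) / h ** 2
```

For this mixture the closed form is F(x) = tanh²(x), and both routes agree to finite-difference accuracy.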

IJCAI Conference 2025 Conference Paper

Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis

  • Lina Wei
  • Yuhang Ma
  • Zhongsheng Lin
  • Fangfang Wang
  • Canghong Jin
  • Hanbin Zhao
  • Dapeng Chen

Multimodal perception, which integrates vision and touch, is increasingly demonstrating its significance in domains such as embodied intelligence and human-computer interaction. However, in open-world scenarios, multimodal data streams face significant challenges, including catastrophic forgetting and overfitting, during few-shot class incremental learning (FSCIL), leading to a severe degradation in model performance. In this work, we propose a novel approach named Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis (TIFS). Our method leverages vision imagination synthesis to enhance semantic understanding and integrates touch and vision fusion to mitigate modal imbalance. Specifically, we introduce a framework that employs touch-guided vision information for cross-modal contrastive learning to address the challenges of few-shot learning. Additionally, we incorporate multiple learning mechanisms, including regularization, memory mechanisms, and attention mechanisms, to mitigate catastrophic forgetting during multi-incremental step learning. Experimental results on the Touch and Go and VisGel datasets demonstrate that the TIFS framework exhibits robust continuous learning capabilities and strong generalization performance in touch-vision few-shot incremental learning tasks. Our code is available at https://github.com/Vision-Multimodal-Lab-HZCU/TIFS.
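
The cross-modal contrastive objective the abstract mentions is typically an InfoNCE-style loss over paired embeddings. The sketch below is the standard form of that loss, shown only to illustrate the kind of objective involved; it is not claimed to be TIFS's exact formulation.

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """InfoNCE over a batch of paired touch/vision embeddings:
    sim_matrix[i][j] is the similarity of touch item i with vision item j,
    and matched pairs lie on the diagonal. Lower loss means matched pairs
    dominate their rows."""
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the matched index
    return loss / n
```

When diagonal similarities are highest the loss is small; swapping matched and mismatched similarities raises it.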

AAAI Conference 2025 Conference Paper

MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental Learning

  • Hai-Long Sun
  • Da-Wei Zhou
  • Hanbin Zhao
  • Le Gan
  • De-Chuan Zhan
  • Han-Jia Ye

Class-Incremental Learning (CIL) requires models to continually acquire knowledge of new classes without forgetting old ones. Although Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing work seeks to adjust the PTM with lightweight components, yet forgetting still arises at both the parameter and retrieval levels: iterative updates of the model cause parameter drift, while mistakenly retrieving irrelevant modules leads to mismatches during inference. To this end, we propose MOdel Surgery (MOS) to rescue the model from forgetting previous knowledge. By training task-specific adapters, we continually adjust the PTM to downstream tasks. To mitigate parameter-level forgetting, we present an adapter merging approach that bridges the gap between different components while preserving task-specific information. To address retrieval-level forgetting, we introduce a training-free, self-refined adapter retrieval mechanism during inference, which leverages the model's inherent ability for better adapter retrieval. By jointly rectifying the model with these steps, MOS can robustly resist catastrophic forgetting in the learning process. Extensive experiments on seven benchmark datasets validate MOS's state-of-the-art performance.
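
A training-free, self-refining retrieval loop of the kind the abstract describes might look like the following. All names here (`retrieve_adapter`, `feature_fn`, per-task prototypes) are hypothetical stand-ins for the backbone-plus-adapter machinery, not MOS's actual interface.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_adapter(feature_fn, x, prototypes, max_rounds=3):
    """Self-refined retrieval sketch: embed the input with the currently
    selected adapter, pick the adapter whose task prototype is closest,
    and repeat until the choice stabilizes. No training is involved."""
    idx = 0
    for _ in range(max_rounds):
        feat = feature_fn(x, idx)
        best = max(range(len(prototypes)),
                   key=lambda i: cosine(feat, prototypes[i]))
        if best == idx:
            break  # selection is self-consistent; stop refining
        idx = best
    return idx
```

The refinement matters when the initial (wrong) adapter distorts the feature: re-embedding with the newly chosen adapter gives a cleaner retrieval signal.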

AAAI Conference 2025 Conference Paper

TextToucher: Fine-Grained Text-to-Touch Generation

  • Jiahang Tu
  • Hao Fu
  • Fengyu Yang
  • Hanbin Zhao
  • Chao Zhang
  • Hui Qian

Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at as low a cost as possible, a series of studies have attempted to generate tactile images through vision-to-touch image translation. However, compared to the text modality, visually driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail at two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the dual grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method.

NeurIPS Conference 2024 Conference Paper

BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models

  • Fangyikang Wang
  • Hubery Yin
  • Yuejiang Dong
  • Huminhao Zhu
  • Chao Zhang
  • Hanbin Zhao
  • Hui Qian
  • Chen Li

The inversion of diffusion model sampling, which aims to find the corresponding initial noise of a sample, plays a critical role in various tasks. Recently, several heuristic exact inversion samplers have been proposed to address the inexact inversion issue in a training-free manner. However, the theoretical properties of these heuristic samplers remain unknown and they often exhibit mediocre sampling quality. In this paper, we introduce a generic formulation of exact inversion samplers, Bidirectional Explicit Linear Multi-step (BELM) samplers, which includes all previously proposed heuristic exact inversion samplers as special cases. The BELM formulation is derived from the variable-stepsize-variable-formula linear multi-step method by integrating a bidirectional explicit constraint. We highlight that this bidirectional explicit constraint is the key to mathematically exact inversion. We systematically investigate the Local Truncation Error (LTE) within the BELM framework and show that the existing heuristic designs of exact inversion samplers yield sub-optimal LTE. Consequently, we propose the Optimal BELM (O-BELM) sampler through an LTE minimization approach. We conduct additional analysis to substantiate the theoretical stability and global convergence of the proposed optimal sampler. Comprehensive experiments demonstrate that our O-BELM sampler establishes the exact inversion property while achieving high-quality sampling. Additional experiments on image editing and image interpolation highlight the extensive potential of applying O-BELM across a variety of applications.
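
The "bidirectional explicit constraint" can be seen in miniature with a toy two-step rule for dx/dt = f(x). Because the derivative is evaluated only at the shared middle state, the same linear relation can be solved exactly in either direction. This is an illustrative analogue (the explicit midpoint rule), not the O-BELM sampler itself.

```python
def belm_forward(x_prev, x_curr, f, h):
    """Toy bidirectional explicit linear two-step rule:
    x_{i+1} = x_{i-1} + 2*h*f(x_i)."""
    return x_prev + 2.0 * h * f(x_curr)

def belm_inverse(x_next, x_curr, f, h):
    # Solve the identical relation for x_{i-1}: exact inversion by construction.
    return x_next - 2.0 * h * f(x_curr)
```

Running the recurrence forward and then inverting from only the final pair of states recovers the starting state to floating-point accuracy, which is exactly the property heuristic inversion samplers only approximate.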

NeurIPS Conference 2024 Conference Paper

D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models

  • Yikun Jiang
  • Huanyu Wang
  • Lei Xie
  • Hanbin Zhao
  • Chao Zhang
  • Hui Qian
  • John C. Lui

Large language models have shown an impressive societal impact owing to their excellent understanding and logical reasoning skills. However, such strong ability relies on a huge amount of computing resources, which makes it difficult to deploy LLMs on computing resource-constrained platforms. Currently, LLMs process each token equivalently, but we argue that not every word is equally important. Some words should not be allocated excessive computing resources, particularly dispensable terms in simple questions. In this paper, we propose a novel dynamic inference paradigm for LLMs, namely D-LLMs, which adaptively allocates computing resources in token processing. We design a dynamic decision module for each transformer layer that decides whether a network unit should be executed or skipped. Moreover, we tackle the issue of adapting D-LLMs to real-world applications, specifically concerning the missing KV-cache when layers are skipped. To overcome this, we propose a simple yet effective eviction policy to exclude the skipped layers from subsequent attention calculations. The eviction policy not only makes D-LLMs compatible with prevalent applications but also saves considerable storage resources. Experimentally, D-LLMs show superior performance in terms of computational cost and KV storage utilization, reducing both by up to 45% on Q&A, summarization, and math-solving tasks, and by up to 50% on commonsense reasoning tasks.
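
The control flow of per-token layer skipping with cache-aware eviction can be sketched schematically. Every name here is illustrative (`dllm_forward`, the gate callables, the dict cache); the real model operates on tensors and attention states, not scalars.

```python
def dllm_forward(token, layers, gates, kv_cache):
    """Token-adaptive layer execution: gates[i](h) decides whether layer i
    runs for this token. Skipped layers write no KV entry, so a simple
    eviction policy can exclude them from all subsequent attention."""
    executed = []
    h = token
    for i, layer in enumerate(layers):
        if gates[i](h):
            h = layer(h)
            kv_cache.setdefault(i, []).append(h)  # cache only executed layers
            executed.append(i)
        # else: layer skipped; nothing is cached, nothing to attend to later
    return h, executed
```

Because skipped layers never populate the cache, later attention steps see a consistent (smaller) KV store, which is where the storage savings come from.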

AAAI Conference 2024 Conference Paper

GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework

  • Fangyikang Wang
  • Huminhao Zhu
  • Chao Zhang
  • Hanbin Zhao
  • Hui Qian

Particle-based Variational Inference (ParVI) methods approximate the target distribution by iteratively evolving finite weighted particle systems. Recent advances in ParVI methods reveal the benefits of accelerated position update strategies and dynamic weight adjustment approaches. In this paper, we propose the first ParVI framework that possesses both accelerated position updates and dynamic weight adjustment simultaneously, named the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework. Generally, GAD-PVI simulates the semi-Hamiltonian gradient flow on a novel Information-Fisher-Rao space, which yields an additional decrease in the local functional dissipation. GAD-PVI is compatible with different dissimilarity functionals and associated smoothing approaches under three information metrics. Experiments on both synthetic and real-world data demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art.
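
For orientation, the canonical fixed-weight ParVI update is Stein Variational Gradient Descent (SVGD); frameworks like GAD-PVI generalize this family with acceleration and dynamic weights, neither of which this sketch includes. `score` is d/dx log of the target density.

```python
import math

def svgd_step(xs, score, lr=0.05, bw=1.0):
    """One 1-D SVGD update with an RBF kernel: each particle is driven by a
    kernel-weighted score (attraction to high density) plus the kernel
    gradient (repulsion that keeps particles spread out)."""
    n = len(xs)
    new = []
    for i in range(n):
        phi = 0.0
        for j in range(n):
            k = math.exp(-(xs[j] - xs[i]) ** 2 / (2.0 * bw * bw))
            phi += k * score(xs[j]) + k * (xs[i] - xs[j]) / (bw * bw)
        new.append(xs[i] + lr * phi / n)
    return new
```

Iterating this update moves a particle cloud toward the target distribution while the repulsive term prevents collapse to the mode.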

NeurIPS Conference 2024 Conference Paper

Solving Zero-Sum Markov Games with Continuous State via Spectral Dynamic Embedding

  • Chenhao Zhou
  • Zebang Shen
  • Chao Zhang
  • Hanbin Zhao
  • Hui Qian

In this paper, we propose a provably efficient natural policy gradient algorithm called Spectral Dynamic Embedding Policy Optimization (SDEPO) for two-player zero-sum stochastic Markov games with continuous state space and finite action space. In the policy evaluation procedure of our algorithm, a novel kernel embedding method is employed to construct a finite-dimensional linear approximation to the state-action value function. We explicitly analyze the approximation error in policy evaluation, and show that SDEPO achieves an $\tilde{O}(\frac{1}{(1-\gamma)^3\epsilon})$ last-iterate convergence to the $\epsilon$-optimal Nash equilibrium, which is independent of the cardinality of the state space. This complexity result matches the best-known results for global convergence of policy gradient algorithms in the single-agent setting. Moreover, we also propose a practical variant of SDEPO to deal with continuous action spaces, and empirical results demonstrate the practical superiority of the proposed method.
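
"Finite-dimensional linear approximation via kernel embedding" can be illustrated with the standard random Fourier feature construction, where an RBF kernel is approximated by an inner product of explicit finite-dimensional features. This is a textbook device shown only for intuition; it is not the SDEPO algorithm.

```python
import numpy as np

def rff_embedding(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/D) * cos(W x + b). For
    W ~ N(0, I) and b ~ U(0, 2*pi), z(x) . z(y) approximates the RBF kernel
    exp(-||x - y||^2 / 2), so kernel quantities become linear in z."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
```

A value function that lives in the kernel's RKHS can then be represented (approximately) as a plain linear function of the D-dimensional features, which is what makes the policy-evaluation step tractable and state-space-size independent.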

IJCAI Conference 2018 Conference Paper

Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

  • Hui Wang
  • Hanbin Zhao
  • Xi Li
  • Xu Tan

As an important and challenging problem in machine learning and computer vision, neural network acceleration essentially aims to enhance the computational efficiency without sacrificing the model accuracy too much. In this paper, we propose a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level. The proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation. Furthermore, we propose a structure design criterion for the student subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.
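
The progressive blockwise scheme can be sketched as a loop that distills one block at a time, feeding each student block the inputs produced by the already-distilled prefix. `fit_student_block` is a hypothetical placeholder for local regression over the student architecture, not the paper's API.

```python
def progressive_blockwise_distill(teacher_blocks, fit_student_block, data):
    """For each position k, train a student block to reproduce the teacher
    block's output given inputs from the current (partially distilled)
    network prefix, then swap it in and move to the next block."""
    blocks = list(teacher_blocks)
    for k, t_block in enumerate(teacher_blocks):
        inputs = []
        for x in data:
            h = x
            for b in blocks[:k]:  # run the already-distilled prefix
                h = b(h)
            inputs.append(h)
        targets = [t_block(h) for h in inputs]
        blocks[k] = fit_student_block(inputs, targets)
    return blocks
```

Using the distilled prefix (rather than the teacher's) to generate each block's training inputs is what makes the approximation progressive: later student blocks learn to compensate for errors introduced earlier.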