Arrow Research search

Author name cluster

Yue Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers (19)

AAAI Conference 2026 Conference Paper

Attention Retention for Continual Learning with Vision Transformers

  • Yue Lu
  • Xiangyu Zhou
  • Shizhou Zhang
  • Yinghui Xing
  • Guoqiang Liang
  • Wencong Zhang

Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient-masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
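
A minimal sketch of the gradient-masking step described in the abstract, assuming PyTorch and a precomputed binary keep-mask per parameter; the attention-rollout mask construction is abstracted away, and this is an illustration rather than the authors' released code:

```python
import torch

def mask_and_rescale_grad(grad: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Zero out gradient entries tied to previously attended regions
    (keep_mask == 0), then rescale the surviving entries so the overall
    update magnitude is preserved, in the spirit of the abstract's
    'scaling parameter updates proportionally' for optimizer compatibility.
    """
    masked = grad * keep_mask
    norm_before = grad.norm()
    norm_after = masked.norm().clamp_min(1e-12)
    return masked * (norm_before / norm_after)

# Hypothetical usage inside a training step (parameter/mask pairing assumed):
# for p, m in zip(model.parameters(), keep_masks):
#     if p.grad is not None:
#         p.grad = mask_and_rescale_grad(p.grad, m)
```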

NeurIPS Conference 2025 Conference Paper

A solvable model of learning generative diffusion: theory and insights

  • Hugo Cui
  • Cengiz Pehlevan
  • Yue Lu

In this manuscript, we analyze a solvable model of flow- or diffusion-based generative modeling. We consider the problem of learning a model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise and lead to model collapse when the generative model is retrained on generated synthetic data.

AAAI Conference 2025 Conference Paper

BiMAC: Bidirectional Multimodal Alignment in Contrastive Learning

  • Masoumeh Zareapoor
  • Pourya Shamsolmoali
  • Yue Lu

Achieving robust performance in vision-language tasks requires strong multimodal alignment, where textual and visual data interact seamlessly. Existing frameworks often combine contrastive learning with image captioning to unify visual and textual representations. However, reliance on global representations and unidirectional information flow from images to text limits their ability to reconstruct visual content accurately from textual descriptions. To address this limitation, we propose BiMAC, a novel framework that enables bidirectional interactions between images and text at both global and local levels. BiMAC employs advanced components to simultaneously reconstruct visual content from textual cues and generate textual descriptions guided by visual features. By integrating a text-region alignment mechanism, BiMAC identifies and selects relevant image patches for precise cross-modal interaction, reducing information noise and enhancing mapping accuracy. BiMAC achieves state-of-the-art performance across diverse vision-language tasks, including image-text retrieval, captioning, and classification.
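
As a toy illustration of the text-region alignment idea, the sketch below scores image patches against a pooled text embedding and keeps the most relevant ones; the top-k rule, names, and shapes are assumptions, not BiMAC's actual mechanism:

```python
import torch
import torch.nn.functional as F

def select_relevant_patches(text_emb: torch.Tensor, patch_embs: torch.Tensor, k: int = 16):
    """Score each image patch by cosine similarity to the pooled text
    embedding and keep the top-k patches for cross-modal interaction.

      text_emb:   (d,)    pooled text representation
      patch_embs: (n, d)  per-patch visual representations
    """
    scores = F.cosine_similarity(patch_embs, text_emb.unsqueeze(0), dim=-1)  # (n,)
    topk = scores.topk(k=min(k, patch_embs.size(0)))
    return patch_embs[topk.indices], topk.values
```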

AAAI Conference 2025 Conference Paper

Training Consistent Mixture-of-Experts-Based Prompt Generator for Continual Learning

  • Yue Lu
  • Shizhou Zhang
  • De Cheng
  • Guoqiang Liang
  • Yinghui Xing
  • Nannan Wang
  • Yanning Zhang

Visual prompt tuning-based continual learning (CL) methods have shown promising performance in exemplar-free scenarios, where their key component can be viewed as a prompt generator. Existing approaches generally rely on freezing old prompts, slow updating, and task discrimination for prompt generators to preserve stability and minimize forgetting. In contrast, we introduce a novel approach that trains a consistent prompt generator to ensure stability during CL. Consistency means that for any instance from an old task, its corresponding instance-aware prompt generated by the prompt generator remains consistent even as the generator continually updates in a new task. This ensures that the representation of a specific instance remains stable across tasks and thereby prevents forgetting. We employ a mixture of experts (MoE) as the prompt generator, which contains a router and multiple experts. By deriving conditions sufficient to achieve this consistency for the MoE prompt generator, we demonstrate that, during training in a new task, if the router and experts update in directions orthogonal to the subspaces spanned by old input features and gating vectors, respectively, the consistency can be theoretically guaranteed. To implement this orthogonality, we project parameter gradients onto those orthogonal directions using orthogonal projection matrices computed via the null-space method. Extensive experiments on four class-incremental learning benchmarks validate the effectiveness and superiority of our approach.
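
A simplified reconstruction of the null-space projection step, assuming PyTorch; the covariance-based basis and the threshold eps are common choices for this recipe, not necessarily the paper's exact procedure:

```python
import torch

def null_space_projector(feature_matrix: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Build a projection matrix onto the (approximate) null space of old
    input features, so that projected gradient updates are orthogonal to
    the subspace those features span.

      feature_matrix: (n_samples, d) features collected from old tasks.
    Returns P of shape (d, d); g @ P projects a gradient row g to
    directions orthogonal to the old-feature subspace.
    """
    cov = feature_matrix.t() @ feature_matrix          # uncentered covariance
    _, s, vt = torch.linalg.svd(cov)
    null_basis = vt[s < eps * s.max()]                 # (k, d) near-null directions
    return null_basis.t() @ null_basis                 # (d, d) projector

# Hypothetical usage: project the router's gradient before the optimizer step.
# P = null_space_projector(old_features)
# router.weight.grad = router.weight.grad @ P
```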

NeurIPS Conference 2025 Conference Paper

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

  • Yangfu Li
  • Hongjian Zhan
  • Tianyi Chen
  • Qi Liu
  • Yu-Jie Xiong
  • Yue Lu

Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks. The code will be made available soon.
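
A very loose sketch of bi-objective selection under a token budget: part of the budget goes to prompt alignment (top tokens by prompt relevance) and the rest to visual coverage via greedy farthest-point selection, a classic $\epsilon$-cover heuristic. MoB's actual radius-trading allocation is not reproduced here:

```python
import torch

def bi_objective_prune(tokens: torch.Tensor, prompt_scores: torch.Tensor,
                       budget: int, align_budget: int) -> torch.Tensor:
    """Keep `align_budget` prompt-relevant tokens, then fill the remaining
    budget with a greedy k-center cover of the visual token set.

      tokens:        (n, d) visual token features
      prompt_scores: (n,)   relevance of each token to the prompt
    """
    keep = set(prompt_scores.topk(align_budget).indices.tolist())
    # Distance of every token to its nearest already-kept token.
    dist = torch.cdist(tokens, tokens[list(keep)]).min(dim=1).values
    while len(keep) < budget:
        idx = int(dist.argmax())           # farthest uncovered token
        if idx in keep:                    # everything already covered
            break
        keep.add(idx)
        dist = torch.minimum(dist, torch.cdist(tokens, tokens[idx:idx + 1]).squeeze(1))
    return tokens[sorted(keep)]
```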

AAAI Conference 2024 Conference Paper

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

  • Lingjun Zhang
  • Xinyuan Chen
  • Yaohui Wang
  • Yue Lu
  • Yu Qiao

Recently, diffusion-based image generation methods have been credited with remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given text in any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thereby unlocking the latent multilingual generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we introduce a localized attention constraint in the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
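
One way to realize a localized attention constraint, sketched below under the assumption that the pre-softmax cross-attention scores can be intercepted; Diff-Text's exact formulation may differ:

```python
import torch

def localize_cross_attention(attn: torch.Tensor, region_mask: torch.Tensor,
                             strength: float = 5.0) -> torch.Tensor:
    """Bias cross-attention scores toward a target spatial region before the
    softmax, so that text-related tokens attend where the sketch prior
    places the text.

      attn:        (heads, n_pixels, n_tokens) pre-softmax attention scores
      region_mask: (n_pixels,) 1 inside the desired text region, 0 outside
    """
    bias = (region_mask.float() - 1.0) * strength   # 0 inside, -strength outside
    return attn + bias.view(1, -1, 1)               # broadcast over heads and tokens
```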

AAAI Conference 2024 Conference Paper

Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network

  • Jiajun Wei
  • Hongjian Zhan
  • Yue Lu
  • Xiao Tu
  • Bing Yin
  • Cong Liu
  • Umapada Pal

Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or on designing better language modeling. How to effectively and jointly model vision and language so as to mitigate heavy reliance on a single modality remains an open problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). Firstly, revisiting the image as a language via balanced concatenation along the length dimension alleviates over-reliance on either vision or language. Secondly, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights via masked modality modeling (MMM). Thirdly, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which feeds the vision-language prediction back as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmarks and on more challenging datasets, with both synthetic and real training data, compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.
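
A hedged sketch of one reading of "balanced concatenation along the length dimension": the visual sequence is resampled to the text length before concatenation so neither modality dominates. The shapes and the interpolation choice are our assumptions, not BUSNet's implementation:

```python
import torch
import torch.nn.functional as F

def balanced_concat(vision_tokens: torch.Tensor, language_tokens: torch.Tensor) -> torch.Tensor:
    """Resample the visual sequence to the text-sequence length, then
    concatenate the two modalities along the length dimension.

      vision_tokens:   (b, n_v, d)
      language_tokens: (b, n_l, d)
    Returns (b, 2 * n_l, d).
    """
    n_l = language_tokens.size(1)
    v = F.interpolate(vision_tokens.transpose(1, 2), size=n_l, mode="linear")
    return torch.cat([v.transpose(1, 2), language_tokens], dim=1)
```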

AAAI Conference 2024 Conference Paper

Spotting the Unseen: Reciprocal Consensus Network Guided by Visual Archetypes

  • Wenbo Hu
  • Hongjian Zhan
  • Xinchen Ma
  • Yue Lu
  • Ching Y. Suen

Humans often require only a few visual archetypes to spot novel objects. Based on this observation, we present a strategy rooted in ``spotting the unseen" by establishing dense correspondences between potential query image regions and a visual archetype, and we propose the Consensus Network (CoNet). Our method leverages relational patterns within and across images via an Auto-Correlation Representation (ACR) and a Mutual-Correlation Representation (MCR). Within each image, the ACR module encodes both local self-similarity and global context simultaneously. Between the query and support images, the MCR module computes the cross-correlation across the two image representations and introduces a reciprocal consistency constraint, which serves to exclude outliers and enhance model robustness. To overcome the challenges of low-resource training data, particularly in one-shot learning scenarios, we incorporate an adaptive margin strategy to better handle diverse instances. The experimental results indicate the effectiveness of the proposed method across diverse domains, such as object detection in natural scenes and text spotting in both historical manuscripts and natural scenes, demonstrating its strong generalization ability. Our code is available at: https://github.com/infinite-hwb/conet.
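
A toy version of the mutual-correlation step with the reciprocal consistency check; shapes are assumptions, and CoNet's MCR module is richer than this:

```python
import torch

def reciprocal_matches(query_feats: torch.Tensor, support_feats: torch.Tensor):
    """Compute the cross-correlation between query and support features and
    keep only mutual nearest neighbours, a standard way to reject outlier
    correspondences.

      query_feats:   (n_q, d)
      support_feats: (n_s, d)
    Returns a list of (query_idx, support_idx) mutually consistent pairs.
    """
    corr = query_feats @ support_feats.t()   # (n_q, n_s) cross-correlation
    q2s = corr.argmax(dim=1)                 # best support match per query
    s2q = corr.argmax(dim=0)                 # best query match per support
    return [(q, int(q2s[q])) for q in range(len(q2s)) if int(s2q[q2s[q]]) == q]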

NeurIPS Conference 2024 Conference Paper

Visual Prompt Tuning in Null Space for Continual Learning

  • Yue Lu
  • Shizhou Zhang
  • De Cheng
  • Yinghui Xing
  • Nannan Wang
  • Peng Wang
  • Yanning Zhang

Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL) by selecting and updating relevant prompts in vision-transformer models. In contrast, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to avoid interference with previously learned tasks and thereby overcome catastrophic forgetting in CL. However, unlike orthogonal projection in the traditional CNN architecture, prompt gradient orthogonal projection in the ViT architecture poses completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation, and 2) the drift of the prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we deduce two consistency conditions for achieving prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference with previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution is proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the anti-forgetting effectiveness on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves performance superior to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.
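
For concreteness, a compact sketch of how such a null-space projector might be maintained across tasks; this is our reconstruction under common conventions, and the released repository holds the actual implementation:

```python
import torch

class NullSpaceProjector:
    """Accumulate an uncentered covariance of the features that reach a
    learnable prompt, and after each task project prompt gradients onto
    the covariance's (approximate) null space.
    """
    def __init__(self, dim: int):
        self.cov = torch.zeros(dim, dim)
        self.proj = torch.eye(dim)

    def observe(self, feats: torch.Tensor):
        """feats: (n, d) features seen during the just-finished task."""
        self.cov += feats.t() @ feats

    def refresh(self, eps: float = 1e-5):
        """Recompute the projector at a task boundary."""
        _, s, vt = torch.linalg.svd(self.cov)
        basis = vt[s < eps * s.max()]        # near-null right-singular vectors
        self.proj = basis.t() @ basis

    def apply(self, prompt_grad: torch.Tensor) -> torch.Tensor:
        """prompt_grad: (n_prompts, d); returns the projected gradient."""
        return prompt_grad @ self.proj
```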

IS Journal 2023 Journal Article

Generating Emotion Descriptions for Fine Art Paintings Via Multiple Painting Representations

  • Yue Lu
  • Chao Guo
  • Xingyuan Dai
  • Fei-Yue Wang

The task of generating emotion descriptions for fine art paintings using machine learning is gaining increasing attention. However, captioning the emotions depicted in paintings is challenging due to the artistic and subtle nature of the visual cues involved. Previous studies on painting emotion captioning mainly focus on content-oriented semantic features, resulting in limited performance. Recognizing that facial expressions and body language can reflect human emotions, we propose a novel painting emotion captioning model that incorporates two additional features: a facial expression feature and a human pose feature. Our model includes a feature fusion method to combine these features with commonly used object features. Experimental results on public datasets demonstrate that our proposed model outperforms the baseline. Further experiments on paintings with abstract appearances and image corruptions show the promising performance of our proposed model.
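
A bare-bones stand-in for a fusion module of this kind; concatenation plus linear mixing is our assumption, and the paper's actual fusion method is not reproduced here:

```python
import torch
import torch.nn as nn

class SimpleFeatureFusion(nn.Module):
    """Project facial-expression, human-pose, and object features to a
    shared width and fuse them by concatenation and linear mixing.
    """
    def __init__(self, d_face: int, d_pose: int, d_obj: int, d_out: int):
        super().__init__()
        self.face = nn.Linear(d_face, d_out)
        self.pose = nn.Linear(d_pose, d_out)
        self.obj = nn.Linear(d_obj, d_out)
        self.mix = nn.Linear(3 * d_out, d_out)

    def forward(self, f_face, f_pose, f_obj):
        z = torch.cat([self.face(f_face), self.pose(f_pose), self.obj(f_obj)], dim=-1)
        return self.mix(z)
```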

NeurIPS Conference 2022 Conference Paper

Precise Learning Curves and Higher-Order Scalings for Dot-product Kernel Regression

  • Lechao Xiao
  • Hong Hu
  • Theodor Misiakiewicz
  • Yue Lu
  • Jeffrey Pennington

As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either large-sample asymptotics ($m\to\infty$) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension ($m\propto d$). There is a wide gulf between these two regimes, including all higher-order scaling relations $m\propto d^r$, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias, and variance, for data drawn uniformly from the sphere with isotropic random labels in the $r$th-order asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales. We include a colab notebook that reproduces the essential results of the paper.
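
In the abstract's notation, the scaling regime and peak locations read as follows; this is a transcription of what is stated above, with $\kappa$ our shorthand for the fixed ratio:

```latex
% Higher-order scaling regime and learning-curve peaks
m \to \infty, \quad d \to \infty, \quad \frac{m}{d^r} \to \kappa \in (0, \infty),
\qquad
\text{with peaks in the learning curve near } \; m \approx \frac{d^r}{r!}, \quad r = 1, 2, 3, \dots
```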

ECAI Conference 2020 Conference Paper

CIDetector: Semi-Supervised Method for Multi-Topic Confidential Information Detection

  • Jianguo Jiang
  • Yue Lu
  • Min Yu 0001
  • Yantao Jia
  • Jiafeng Guo
  • Chao Liu 0020
  • Weiqing Huang

Confidential information firewalling with a text classifier aims to identify text containing confidential information whose publication might be harmful to national security, business trade, or personal life. Traditional methods, e.g., listing a set of suspicious keywords together with a regular-expression-based filter, fail to handle the multi-topic phenomenon, i.e., a single text containing confidential information on different topics. In this paper, we propose a semi-supervised method, CIDetector, for multi-topic confidential information detection. We introduce coarse confidential polarity as prior knowledge into word embeddings, which regularizes the distribution of words to yield a clear task classification boundary. We then introduce a multi-attention network classifier to extract task-related features and model dependencies between features for multi-topic classification. Experiments conducted on real-world data from WikiLeaks demonstrate the superiority of our proposed method.

NeurIPS Conference 2020 Conference Paper

Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization

  • Benjamin Aubin
  • Florent Krzakala
  • Yue Lu
  • Lenka Zdeborová

We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer non-linear neural network with random i.i.d. inputs. We study the generalization performance of standard classifiers in the high-dimensional regime where $\alpha=\frac{n}{d}$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by $\ell_2$-regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Second, focusing on commonly used loss functions and optimizing the $\ell_2$ regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As $\alpha \to \infty$ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably lead to Bayes-optimal generalization error.
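
A schematic of the setup as described in the abstract; the teacher nonlinearity $\varphi$ and the $1/\sqrt{d}$ normalization are standard conventions we assume, not details given above:

```latex
% Teacher labels and l2-regularized convex ERM in the regime alpha = n/d fixed
y_\mu = \varphi\!\left(\frac{\mathbf{w}^\star \cdot \mathbf{x}_\mu}{\sqrt{d}}\right),
\qquad
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{\mu=1}^{n}
\ell\!\left(y_\mu,\, \frac{\mathbf{w} \cdot \mathbf{x}_\mu}{\sqrt{d}}\right)
+ \frac{\lambda}{2}\,\|\mathbf{w}\|_2^2,
\qquad
\alpha = \frac{n}{d} \ \text{fixed as } d \to \infty .
```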

NeurIPS Conference 2019 Conference Paper

A Solvable High-Dimensional Model of GAN

  • Chuang Wang
  • Hong Hu
  • Yue Lu

We present a theoretical analysis of the training process for a single-layer GAN fed by high-dimensional input data. The training dynamics of the proposed model at both microscopic and macroscopic scales can be exactly analyzed in the high-dimensional limit. In particular, we prove that the macroscopic quantities measuring the quality of the training process converge to a deterministic process characterized by an ordinary differential equation (ODE), whereas the microscopic states containing all the detailed weights remain stochastic, with dynamics described by a stochastic differential equation (SDE). This analysis provides a new perspective, different from recent analyses in the limit of small learning rate, where the microscopic state is always considered deterministic and the contribution of noise is ignored. From our analysis, we show that the level of the background noise is essential to the convergence of the training process: setting the noise level too high leads to failure of feature recovery, whereas setting it too low causes oscillation. Although this work focuses on a simple copy model of GAN, we believe the analysis methods and insights developed here would prove useful in the theoretical understanding of other variants of GANs with more advanced training algorithms.
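
Schematically, the two levels of description mentioned above take the form below; the concrete drift, diffusion, and order parameters are derived in the paper and are not reproduced here:

```latex
% Macroscopic order parameters: deterministic ODE limit
\frac{\mathrm{d}\mathbf{M}(t)}{\mathrm{d}t} = F\big(\mathbf{M}(t)\big),
\qquad
% Microscopic weights: stochastic dynamics described by an SDE
\mathrm{d}\mathbf{w}(t) = b\big(\mathbf{w}(t)\big)\,\mathrm{d}t
+ \sigma\big(\mathbf{w}(t)\big)\,\mathrm{d}\mathbf{B}(t) .
```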

IJCAI Conference 2017 Conference Paper

A Sequence Labeling Convolutional Network and Its Application to Handwritten String Recognition

  • Qingqing Wang
  • Yue Lu

Handwritten string recognition has long struggled with connected patterns. Segmentation-free and over-segmentation frameworks are commonly applied to deal with this issue. In recent years, RNNs combined with CTC have dominated segmentation-free handwritten string recognition, while CNNs have been employed only as single-character recognizers in the over-segmentation framework. The main challenges for a CNN to directly recognize handwritten strings are the appropriate handling of arbitrary input string length, which implies arbitrary input image size, and a reasonable design of the output layer. In this paper, we propose a sequence labeling convolutional network for the recognition of handwritten strings, in particular connected patterns. We design the structure of the network to predict how many characters are present in the input image and what exactly they are at each position. Spatial pyramid pooling (SPP) is utilized with a new implementation to handle arbitrary string lengths. Moreover, we propose a more flexible pooling strategy, called FSPP, to better adapt the network to the direct recognition of long strings. Experiments conducted on handwritten digit strings from two benchmark datasets and our own cell-phone number dataset demonstrate the superiority of the proposed network.
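
The standard SPP computation referenced above, in a minimal PyTorch form; the paper's FSPP variant is not reproduced here:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """Pool a variable-size feature map at several pyramid levels and
    concatenate, yielding a fixed-length vector regardless of input size.

      feat: (b, c, h, w) with arbitrary h, w
    Returns (b, c * sum(l * l for l in levels)).
    """
    b = feat.size(0)
    pooled = [F.adaptive_max_pool2d(feat, l).reshape(b, -1) for l in levels]
    return torch.cat(pooled, dim=1)

# Any input size maps to the same output width, e.g. with 8 channels:
# spatial_pyramid_pool(torch.randn(1, 8, 17, 53)).shape == (1, 8 * 21)
```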

NeurIPS Conference 2017 Conference Paper

The Scaling Limit of High-Dimensional Online Independent Component Analysis

  • Chuang Wang
  • Yue Lu

We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm converges weakly to a deterministic measure-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measure. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm become asymptotically "decoupled", with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.
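
A generic online update of the kind such analyses cover, sketched below as an illustration; this is a textbook projected-SGD step for a non-Gaussianity contrast, not necessarily the exact algorithm studied in the paper:

```python
import numpy as np

def online_ica_step(w: np.ndarray, y: np.ndarray, lr: float,
                    nonlin=np.tanh) -> np.ndarray:
    """One online ICA update: a stochastic-gradient step on a
    non-Gaussianity contrast (tanh is a common choice), followed by
    projection back onto the unit sphere.

      w: (d,) current estimate of a feature direction, ||w|| = 1
      y: (d,) one fresh data sample
    """
    g = nonlin(w @ y) * y            # stochastic gradient of the contrast
    w = w + lr * g
    return w / np.linalg.norm(w)     # renormalize to the unit sphere
```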