Arrow Research

Author name cluster

Yifu Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

ICML Conference 2025 Conference Paper

Attention-Only Transformers via Unrolled Subspace Denoising

  • Peng Wang 0098
  • Yifu Lu
  • Yaodong Yu
  • Druv Pai
  • Qing Qu 0001
  • Yi Ma 0001

Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
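
To make the architecture described above concrete, here is a minimal PyTorch sketch of a layer built only from multi-head self-attention and a skip connection, with no MLP sub-block. The class and parameter names are illustrative, and this is not the paper's exact subspace-denoising operator.

```python
# Minimal sketch of an attention-only layer (self-attention + skip connection).
# Illustrates the general structure only; the paper's subspace-denoising
# operator may differ. All names here are hypothetical.
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One "denoising" step: attend within the token set, then add the
        # result back to the input via a skip connection. No MLP sub-block.
        z = self.norm(x)
        out, _ = self.attn(z, z, z, need_weights=False)
        return x + out

# Stacking L such blocks corresponds to unrolling L denoising iterations.
model = nn.Sequential(*[AttentionOnlyBlock(dim=64, num_heads=4) for _ in range(6)])
tokens = torch.randn(2, 16, 64)   # (batch, tokens, dim)
denoised = model(tokens)
```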

NeurIPS Conference 2025 Conference Paper

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Avinash Reddy
  • Yifu Lu
  • Mengdi Wang
  • Dinesh Manocha
  • Furong Huang
  • Mohammad Ghavamzadeh

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: does thinking more at test time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and the evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to use the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy than extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
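
A minimal sketch of the parallel-thinking idea: sample several independent reasoning paths and keep the most consistent final answer by majority vote. The `generate_answer` callable is a hypothetical stand-in for one sampled reasoning path; the authors' exact budgeting and selection details are not reproduced here.

```python
# Sketch of parallel thinking via majority vote over N independent samples.
# `generate_answer` is a hypothetical function that runs one sampled reasoning
# path and returns a final answer string.
from collections import Counter

def parallel_thinking(prompt: str, generate_answer, n_paths: int = 8) -> str:
    answers = [generate_answer(prompt) for _ in range(n_paths)]
    # Pick the most consistent (most frequent) final answer.
    return Counter(answers).most_common(1)[0][0]
```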

ICLR Conference 2025 Conference Paper

ELFS: Label-Free Coreset Selection with Proxy Training Dynamics

  • Haizhong Zheng
  • Elisa Tsai
  • Yifu Lu
  • Jiachen Sun
  • Brian R. Bartoldson
  • Bhavya Kailkhura
  • Atul Prakash 0001

High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground-truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based difficulty scores. In this paper, we introduce ELFS (Effective Label-Free Coreset Selection), a novel label-free coreset selection method. ELFS significantly improves label-free coreset selection by addressing two challenges: 1) it uses deep clustering to estimate training-dynamics-based data difficulty scores without ground-truth labels; 2) because pseudo-labels introduce a distribution shift in the data difficulty scores, we propose a simple but effective double-end pruning method to mitigate the resulting bias in the calculated scores. We evaluate ELFS on four vision benchmarks and show that, given the same vision encoder, ELFS consistently outperforms SOTA label-free baselines. For instance, when using SwAV as the encoder, ELFS outperforms D2 by up to 10.2% in accuracy on ImageNet-1K. We make our code publicly available on GitHub.
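
A hedged sketch of the double-end pruning step: rank examples by their difficulty scores and discard both the easiest and hardest fractions before selecting the coreset. The cut fractions below are illustrative knobs, not the paper's tuned values.

```python
# Sketch of double-end pruning on pseudo-label-based difficulty scores.
# The fractions are illustrative; the paper's exact thresholds may differ.
import numpy as np

def double_end_prune(scores: np.ndarray, easy_frac: float = 0.1,
                     hard_frac: float = 0.1) -> np.ndarray:
    """Return indices kept after pruning both ends of the difficulty ranking.

    `scores` are difficulty scores (higher = harder).
    """
    order = np.argsort(scores)                     # easiest ... hardest
    lo = int(len(order) * easy_frac)
    hi = len(order) - int(len(order) * hard_frac)
    return order[lo:hi]
```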

ICLR Conference 2025 Conference Paper

Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

  • Ziyang Wu
  • Tianjiao Ding
  • Yifu Lu
  • Druv Pai
  • Jingyuan Zhang
  • Weida Wang
  • Yaodong Yu
  • Yi Ma 0001

The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work which has shown that a transformer-style architecture naturally arises by "white-box" architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction (MCR^2) objective. Specifically, we derive a novel variational form of the MCR^2 objective and show that the architecture that results from unrolled gradient descent of this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture that computes pairwise similarities between tokens. Experiments on vision, language, and long-sequence tasks show that simply swapping TSSA for standard self-attention, which we refer to as the Token Statistics Transformer (ToST), achieves competitive performance with conventional transformers while being significantly more computationally efficient and interpretable. Our results also somewhat call into question the conventional wisdom that pairwise-similarity-style attention mechanisms are critical to the success of transformer architectures.
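
As a toy illustration of the complexity contrast only (this is not the paper's TSSA operator), compare standard attention, which materializes an n x n similarity matrix, with an operator that pools per-dimension token statistics and therefore runs in time linear in the number of tokens:

```python
# Toy complexity contrast between pairwise attention and a statistics-pooling
# operator. Shapes: X is (n, d); Wq, Wk, Wv are (d, d); U is (d, k).
# This sketch is illustrative only and does not reproduce TSSA.
import torch

def pairwise_attention(X, Wq, Wk, Wv):
    # O(n^2): forms an n x n similarity matrix between all token pairs.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return A @ V

def statistics_rescaling(X, U):
    # O(n): no pairwise similarities; each token is rescaled using pooled
    # second-moment statistics of the projected token set.
    P = X @ U                          # project tokens onto a subspace
    stats = (P ** 2).mean(dim=0)       # pooled per-dimension statistics
    return (P * stats) @ U.T           # rescale and project back
```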

NeurIPS Conference 2024 Conference Paper

Exploring Low-Dimensional Subspace in Diffusion Models for Controllable Image Editing

  • Siyi Chen
  • Huijie Zhang
  • Minzhe Guo
  • Yifu Lu
  • Peng Wang
  • Qing Qu

Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: within a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness of the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with desirable properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code and the arXiv version can be found on the project website.
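
A rough sketch of the core recipe as stated above: take the top singular vectors of the Jacobian of the posterior mean predictor at a fixed noise level and move along them for a single-step, training-free edit. Here `pmp` is a hypothetical callable mapping a noisy input to the predicted clean image; the full method involves details not shown.

```python
# Sketch: semantic edit directions from the Jacobian of a posterior mean
# predictor (PMP). `pmp` is a hypothetical callable; this is not the complete
# LOCO Edit pipeline.
import torch

def edit_directions(pmp, x_t: torch.Tensor, k: int = 5) -> torch.Tensor:
    # Jacobian of the PMP at x_t, flattened into a matrix.
    J = torch.autograd.functional.jacobian(pmp, x_t)
    J = J.reshape(-1, x_t.numel())
    # Top-k right singular vectors span a low-dimensional semantic subspace.
    _, _, Vh = torch.linalg.svd(J, full_matrices=False)
    return Vh[:k]

def apply_edit(x_t: torch.Tensor, direction: torch.Tensor, strength: float = 1.0):
    # Single-step, training-free edit: move x_t along one semantic direction.
    return x_t + strength * direction.reshape(x_t.shape)
```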

ICML Conference 2024 Conference Paper

The Emergence of Reproducibility and Consistency in Diffusion Models

  • Huijie Zhang
  • Jinfan Zhou
  • Yifu Lu
  • Minzhe Guo
  • Peng Wang 0098
  • Liyue Shen
  • Qing Qu 0001

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and score function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models learn distinct distributions depending on the training data size. This is evident in two distinct training regimes: (i) the "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) the "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional generation and for solving inverse problems. Lastly, we discuss how our findings connect to existing research and highlight the practical implications of our discoveries.
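
A minimal sketch of how this phenomenon can be probed: feed the same starting noise to two independently trained models and compare their outputs under a deterministic sampler. The `ddim_sample` helper is an assumed stand-in for such a sampler, and the paper's own comparisons are far more comprehensive.

```python
# Sketch of a reproducibility check: same starting noise, deterministic
# sampler, two different diffusion models. `ddim_sample(model, noise)` is a
# hypothetical deterministic sampling routine.
import torch

def reproducibility_gap(model_a, model_b, ddim_sample, shape=(1, 3, 32, 32)):
    noise = torch.randn(shape)
    img_a = ddim_sample(model_a, noise)
    img_b = ddim_sample(model_b, noise)
    # A small distance between outputs indicates consistent model reproducibility.
    return torch.nn.functional.mse_loss(img_a, img_b).item()
```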

AAAI Conference 2022 Conference Paper

ZINB-Based Graph Embedding Autoencoder for Single-Cell RNA-Seq Interpretations

  • Zhuohan Yu
  • Yifu Lu
  • Yunhe Wang
  • Fan Tang
  • Ka-Chun Wong
  • Xiangtao Li

Single-cell RNA sequencing (scRNA-seq) provides high-throughput information about genome-wide gene expression levels at single-cell resolution, bringing a precise understanding of the transcriptome of individual cells. Unfortunately, the rapidly growing scRNA-seq data and the prevalence of dropout events pose substantial challenges for cell type annotation. Here, we propose a single-cell model-based deep graph embedding clustering (scTAG) method, which simultaneously learns cell–cell topology representations and identifies cell clusters based on a deep graph convolutional network. scTAG integrates the zero-inflated negative binomial (ZINB) model into a topology adaptive graph convolutional autoencoder to learn the low-dimensional latent representation and adopts Kullback–Leibler (KL) divergence for the clustering tasks. By simultaneously optimizing the clustering loss, the ZINB loss, and the cell graph reconstruction loss, scTAG jointly optimizes cluster label assignment and feature learning with the topological structures preserved in an end-to-end manner. Extensive experiments on 16 single-cell RNA-seq datasets from diverse yet representative single-cell sequencing platforms demonstrate the superiority of scTAG over various state-of-the-art clustering methods.
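
For reference, a generic zero-inflated negative binomial (ZINB) negative log-likelihood of the kind such reconstruction losses use can be sketched as follows; this is a standard formulation, not necessarily scTAG's exact implementation.

```python
# Generic ZINB negative log-likelihood, as commonly used for scRNA-seq
# reconstruction losses; not necessarily the exact loss implemented in scTAG.
import torch

def zinb_nll(x, mu, theta, pi, eps: float = 1e-8):
    """x: observed counts; mu: NB mean; theta: NB dispersion; pi: dropout probability."""
    # Negative binomial log-probability of the observed counts.
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
              + theta * torch.log(theta / (theta + mu) + eps)
              + x * torch.log(mu / (theta + mu) + eps))
    # Zero counts may come from dropout (pi) or from the NB distribution itself.
    log_zero = torch.log(pi + (1 - pi) * torch.exp(log_nb) + eps)
    log_nonzero = torch.log(1 - pi + eps) + log_nb
    log_lik = torch.where(x < eps, log_zero, log_nonzero)
    return -log_lik.mean()
```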