
Author name cluster

Yuhong Li

This page lists possible papers associated with this exact author name in Arrow. It groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

NeurIPS 2025 · Conference Paper

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

  • Jintao Tong
  • Wenwei Jin
  • Pengda Qin
  • Anqi Li
  • Yixiong Zou
  • Yuhong Li
  • Yuhua Li
  • Ruixuan Li

Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find that (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework that mitigates the insufficiency of the current criterion for identifying redundant tokens and better aligns with the model's inherent behaviors. Extensive experiments show FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering a 3.2$\times$ speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut.
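
To make the idea concrete, here is a minimal sketch of flow-aware token pruning in the spirit the abstract describes, accumulating CLS-to-visual-token attention across layers rather than ranking by a single layer. The function name `prune_visual_tokens`, the tensor shapes, and the aggregation rule are illustrative assumptions, not the released FlowCut implementation:

```python
import torch

def prune_visual_tokens(attn_per_layer, vis_tokens, keep_ratio=0.111):
    """attn_per_layer: list of [batch, heads, seq, seq] attention maps,
    with index 0 along seq being the CLS token and indices 1..num_vis
    the visual tokens. vis_tokens: [batch, num_vis, dim] embeddings."""
    batch, num_vis, dim = vis_tokens.shape
    # Treat CLS as an information relay: accumulate, over all layers,
    # how much each visual token feeds the CLS token.
    score = torch.zeros(batch, num_vis, device=vis_tokens.device)
    for attn in attn_per_layer:
        cls_row = attn[:, :, 0, 1:num_vis + 1]  # CLS attends to visual tokens
        score += cls_row.mean(dim=1)            # average over heads
    k = max(1, int(num_vis * keep_ratio))       # e.g. 0.111 ~ 88.9% reduction
    keep = score.topk(k, dim=-1).indices.sort(dim=-1).values
    idx = keep.unsqueeze(-1).expand(-1, -1, dim)
    return torch.gather(vis_tokens, 1, idx)     # [batch, k, dim]
```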

ICML 2024 · Conference Paper

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

  • Tianle Cai
  • Yuhong Li
  • Zhengyang Geng
  • Hongwu Peng
  • Jason D. Lee
  • Deming Chen
  • Tri Dao

Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck, as each step requires moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges of acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step, reducing the number of sequential decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1, which is fine-tuned directly on top of a frozen backbone LLM, enabling lossless inference acceleration; and Medusa-2, which is fine-tuned together with the backbone LLM, enabling better prediction accuracy of the Medusa heads and a higher speedup but requiring a special training recipe that preserves the model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation procedure to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2$\times$ speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-2.8$\times$.
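
As a rough sketch of the core mechanism, the extra heads can be viewed as small prediction branches over the backbone's final hidden state: head i proposes the token i+1 steps ahead, and the proposals are then verified together with tree attention. The `MedusaHeads` module below is a simplified illustration under assumed layer shapes, not the authors' code:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra heads over the backbone's last hidden state; head i
    proposes the token i+1 steps ahead, so one forward pass yields
    candidates for several future positions at once."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor):  # [batch, hidden_size]
        # One logits tensor per future position t+1 ... t+num_heads.
        return [head(last_hidden) for head in self.heads]
```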

NeurIPS 2024 · Conference Paper

SnapKV: LLM Knows What You are Looking for Before Generation

  • Yuhong Li
  • Yingbing Huang
  • Bowen Yang
  • Bharat Venkitesh
  • Acyr Locatelli
  • Hanchen Ye
  • Tianle Cai
  • Patrick Lewis

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation, and that this robust pattern can be obtained from an 'observation' window located at the end of the prompt. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains performance comparable to baseline models across 16 long-sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
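
A rough sketch of the selection step the abstract describes: attention from an observation window at the end of the prompt "votes" for prompt positions, pooling encourages clustered selections, and only the top-scoring KV positions are kept. The shapes, pooling kernel, and `snap_kv` helper are assumptions for illustration, not the official implementation (which, for instance, also retains the observation window itself):

```python
import torch
import torch.nn.functional as F

def snap_kv(keys, values, attn_obs, budget=1024, kernel=7):
    """keys/values: [batch, heads, prompt_len, dim] prompt KV cache.
    attn_obs: [batch, heads, obs_len, prompt_len], the attention of an
    observation window at the end of the prompt over the whole prompt.
    Assumes budget <= prompt_len."""
    # Each observation-window query votes for the positions it attends to.
    votes = attn_obs.sum(dim=2)                      # [b, h, prompt_len]
    # Max-pooling spreads votes to neighbors so kept positions form
    # clusters rather than isolated spikes.
    votes = F.max_pool1d(votes, kernel, stride=1, padding=kernel // 2)
    idx = votes.topk(budget, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
    return torch.gather(keys, 2, idx), torch.gather(values, 2, idx)
```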

JBHI 2023 · Journal Article

Geometry-Consistent Adversarial Registration Model for Unsupervised Multi-Modal Medical Image Registration

  • Yanxia Liu
  • Wenqi Wang
  • Yuhong Li
  • Haoyu Lai
  • Sijuan Huang
  • Xin Yang

Deformable multi-modal medical image registration aligns the anatomical structures of different modalities to the same coordinate system through a spatial transformation. Due to the difficulties of collecting ground-truth registration labels, existing methods often adopt the unsupervised multi-modal image registration setting. However, it is hard to design satisfactory metrics to measure the similarity of multi-modal images, which heavily limits multi-modal registration performance. Moreover, due to the contrast difference of the same organ in multi-modal images, it is difficult to extract and fuse the representations of different modal images. To address the above issues, we propose a novel unsupervised multi-modal adversarial registration framework that takes advantage of image-to-image translation to translate the medical image from one modality to another. In this way, we are able to use well-defined uni-modal metrics to better train the models. Inside our framework, we propose two improvements to promote accurate registration. First, to prevent the translation network from learning spatial deformation, we propose a geometry-consistent training scheme that encourages the translation network to learn the modality mapping only. Second, we propose a novel semi-shared multi-scale registration network that extracts features of multi-modal images effectively and predicts multi-scale registration fields in a coarse-to-fine manner to accurately register large deformation areas. Extensive experiments on brain and pelvic datasets demonstrate the superiority of the proposed method over existing methods, indicating that our framework has great potential for clinical application.
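
The geometry-consistent training scheme can be summarized as a commutativity constraint: translating a warped image should match warping the translated image, so the translation network cannot hide spatial deformation inside the modality mapping. A minimal sketch, where `translate` and `warp` are hypothetical stand-ins for the translation network and a random spatial transform:

```python
import torch

def geometry_consistency_loss(translate, warp, image):
    """translate: image-to-image translation network (hypothetical callable);
    warp: a random spatial transform applied identically on both paths;
    image: [batch, channels, H, W]."""
    # If translation and warping do not commute, the translation network
    # is learning deformation, not just the modality mapping.
    return torch.mean(torch.abs(translate(warp(image)) - warp(translate(image))))
```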

ICLR 2023 · Conference Paper

What Makes Convolutional Models Great on Long Sequence Modeling?

  • Yuhong Li
  • Tianle Cai
  • Yi Zhang
  • Deming Chen
  • Debadeepta Dey

Convolutional models have been widely used in multiple domains. However, most existing models use only local convolution, making them unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information based on pair-wise attention scores, but this makes the computational complexity quadratic in the sequence length. Recently, Gu et al. proposed a model called S4, inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieves significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom of several prior works, making it less intuitive and harder to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles, both enjoyed by S4, that are sufficient to make up an effective global convolutional model: 1) the parameterization of the convolutional kernel needs to be efficient, in the sense that the number of parameters should scale sub-linearly with sequence length; 2) the kernel needs to satisfy a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance across several tasks: 1) with faster speed, SGConv surpasses the previous SoTA on the Long Range Arena and Speech Command datasets; 2) when plugged into standard language and vision models, it shows the potential to improve both efficiency and performance.
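
The two principles lend themselves to a compact sketch: parameterize the global kernel with exponentially up-sampled blocks (so parameter count scales logarithmically with length), impose an explicit decay on more distant blocks, and convolve via FFT. The block sizes, decay rate, and the helpers `sgconv_kernel`/`global_conv` below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sgconv_kernel(blocks, seq_len, alpha=0.5):
    """blocks: list of learnable 1-D tensors. Block i is up-sampled to an
    exponentially larger span (sub-linear parameter count, principle 1)
    and weighted by a decaying factor alpha**i (principle 2)."""
    parts = []
    for i, b in enumerate(blocks):
        span = min(seq_len, b.numel() * 2 ** i)
        up = F.interpolate(b.view(1, 1, -1), size=span, mode="linear",
                           align_corners=False).view(-1)
        parts.append(alpha ** i * up)
    kernel = torch.cat(parts)[:seq_len]
    return F.pad(kernel, (0, seq_len - kernel.numel()))

def global_conv(x, kernel):
    """x: [batch, seq_len]; linear convolution via FFT in O(L log L)."""
    L = x.size(-1)
    return torch.fft.irfft(torch.fft.rfft(x, n=2 * L) *
                           torch.fft.rfft(kernel, n=2 * L), n=2 * L)[..., :L]
```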

YNICL 2022 · Journal Article

Aberrant static and dynamic functional connectivity of amygdala subregions in patients with major depressive disorder and childhood maltreatment

  • Qianyi Luo
  • Juran Chen
  • Yuhong Li
  • Zhiyao Wu
  • Xinyi Lin
  • Jiazheng Yao
  • Huiwen Yu
  • Huawang Wu

Major depressive disorder (MDD) with childhood maltreatment is a heterogeneous clinical phenotype of depression with prominent features of brain disconnectivity in areas linked to maltreatment-related emotion processing (e.g., the amygdala). However, static and dynamic alterations of functional connectivity in amygdala subregions have not been investigated in MDD with childhood maltreatment. Here, we explored whether amygdala subregions (i.e., medial amygdala [MeA] and lateral amygdala [LA]) exhibited static functional connectivity (sFC) and dynamic functional connectivity (dFC) disruption, and whether these disruptions were related to childhood maltreatment. We compared sFC and dFC patterns in MDD with childhood maltreatment (n = 48), MDD without childhood maltreatment (n = 30), healthy controls with childhood maltreatment (n = 57), and healthy controls without childhood maltreatment (n = 46). The bilateral MeA and LA were selected as the seeds in the FC analysis. The results revealed a functional connectivity disruption pattern in maltreated MDD patients, characterized by sFC and dFC abnormalities involving the MeA, LA, and theory of mind-related brain areas including the middle occipital area, middle frontal gyrus, superior medial frontal gyrus, angular gyrus, supplementary motor areas, middle temporal gyrus, middle cingulate gyrus, and calcarine gyrus. Significant correlations were detected between impaired dFC patterns and childhood maltreatment. Furthermore, the dFC disruption pattern served as a moderator in the relationship between sexual abuse and depression severity. Our findings revealed neurobiological features of childhood maltreatment, providing new evidence regarding vulnerability to psychiatric disorders.
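
For readers unfamiliar with the methodology, seed-based dynamic functional connectivity is commonly estimated with sliding-window correlations between a seed time series (here, an amygdala subregion) and a target region. A generic sketch follows; the window length and summary statistic are assumptions, not the study's exact pipeline:

```python
import numpy as np

def dynamic_fc(seed_ts, target_ts, win=30, step=1):
    """seed_ts, target_ts: 1-D BOLD time series of equal length.
    Returns windowed Pearson correlations; their variability (e.g. the
    standard deviation across windows) is a common dFC summary, while
    the correlation over the full series is the corresponding sFC value."""
    corrs = [
        np.corrcoef(seed_ts[s:s + win], target_ts[s:s + win])[0, 1]
        for s in range(0, len(seed_ts) - win + 1, step)
    ]
    return np.asarray(corrs)
```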

NeurIPS 2021 · Conference Paper

Generic Neural Architecture Search via Regression

  • Yuhong Li
  • Cong Hao
  • Pan Li
  • Jinjun Xiong
  • Deming Chen

Most existing neural architecture search (NAS) algorithms are dedicated to and evaluated by downstream tasks, e.g., image classification in computer vision. However, extensive experiments have shown that prominent neural architectures, such as ResNet in computer vision and LSTM in natural language processing, are generally good at extracting patterns from the input data and perform well on different downstream tasks. In this paper, we attempt to answer two fundamental questions related to NAS. (1) Is it necessary to use the performance of specific downstream tasks to evaluate and search for good neural architectures? (2) Can we perform NAS effectively and efficiently while being agnostic to the downstream tasks? To answer these questions, we propose a novel and generic NAS framework, termed Generic NAS (GenNAS). GenNAS does not use task-specific labels but instead adopts regression on a set of manually designed synthetic signal bases for architecture evaluation. Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform input signal patterns, and allows fuller use of training samples. Extensive experiments across 13 CNN search spaces and one NLP space demonstrate the remarkable efficiency of GenNAS using regression, in terms of both evaluating neural architectures (quantified by the ranking correlation, Spearman's rho, between the approximated performances and the downstream task performances) and the convergence speed of training (within a few seconds). For example, on NAS-Bench-101, GenNAS achieves a rho of 0.85 while existing efficient methods achieve only 0.38. We then propose an automatic task search that optimizes the combination of synthetic signals using limited downstream-task-specific labels, further improving the performance of GenNAS. We also thoroughly evaluate GenNAS's generality and end-to-end NAS performance on all search spaces, where it outperforms almost all existing works with significant speedup. For example, on NAS-Bench-201, GenNAS can find near-optimal architectures within 0.3 GPU hours.
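
The regression-as-proxy idea can be sketched in a few lines: briefly train each candidate architecture to regress a synthetic target, score it by the final fit, and measure ranking quality as Spearman's rho against true downstream accuracies. The training budget and the synthetic target here are placeholders, not the paper's designed signal bases:

```python
import torch
from scipy.stats import spearmanr

def regression_proxy_score(model, x, target, steps=100, lr=1e-3):
    """Briefly train `model` to regress a synthetic target; the final MSE
    proxies how well the architecture captures and transforms the input
    signal patterns, with no task-specific labels involved."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), target)
        loss.backward()
        opt.step()
    return -loss.item()  # higher score = better architecture under the proxy

def ranking_quality(proxy_scores, true_accuracies):
    """Spearman's rho between proxy scores and downstream accuracies
    over a pool of architectures (the paper's evaluation metric)."""
    rho, _ = spearmanr(proxy_scores, true_accuracies)
    return rho
```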

AIIM 2016 · Journal Article

Brain tumor segmentation from multimodal magnetic resonance images via sparse representation

  • Yuhong Li
  • Fucang Jia
  • Jing Qin

Objective: Accurately segmenting and quantifying brain gliomas from magnetic resonance (MR) images remains challenging because of the large spatial and structural variability among brain tumors. To develop a fully automatic and accurate brain tumor segmentation algorithm, we present a probabilistic model of multimodal MR brain tumor segmentation that combines sparse representation with a Markov random field (MRF) to address this variability.

Methods: We formulate tumor segmentation as a multi-classification task, labeling each voxel with its maximum posterior probability. We estimate the maximum a posteriori (MAP) probability by introducing sparse representation into the likelihood probability and an MRF into the prior probability. Because MAP estimation is NP-hard, we convert it into a minimum-energy optimization problem and employ graph cuts to find the solution.

Results: Our method was evaluated on the Brain Tumor Segmentation Challenge 2013 database (BRATS 2013) and obtained Dice coefficients of 0.85, 0.75, and 0.69 on the high-grade Challenge data set; 0.73, 0.56, and 0.54 on the high-grade Challenge LeaderBoard data set; and 0.84, 0.54, and 0.57 on the low-grade Challenge data set, for the complete, core, and enhancing regions, respectively.

Conclusions: The experimental results show that the proposed algorithm is valid and ranks 2nd among the state-of-the-art tumor segmentation algorithms in the MICCAI BRATS 2013 challenge.
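
The Methods passage maps MAP estimation onto energy minimization; in standard notation (a reconstruction of the usual correspondence, not necessarily the paper's exact formulation), with $x$ the voxel labeling and $y$ the multimodal intensities:

```latex
\hat{x} = \arg\max_{x}\, p(x \mid y)
        = \arg\max_{x}\, p(y \mid x)\, p(x)
        = \arg\min_{x}\, \sum_{i} \underbrace{-\log p(y_i \mid x_i)}_{\text{sparse-representation likelihood}}
          \;+\; \sum_{(i,j) \in \mathcal{N}} \underbrace{V(x_i, x_j)}_{\text{MRF pairwise prior}}
```

The final form is a pairwise energy over the voxel neighborhood system $\mathcal{N}$, which is the class of objective that graph cuts can minimize.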