
Author name cluster

Zigeng Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 matching author record

Possible papers


NeurIPS 2025 Conference Paper

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

  • Kunjun Li
  • Zigeng Chen
  • Cheng-Yen Yang
  • Jenq-Neng Hwang

Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.
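The drafter/refiner split lends itself to a simple per-scale budget-allocation step. Below is a minimal sketch, assuming per-layer attention-dispersion scores have already been measured at a given scale; the function name, the dispersion metric, and the budget fractions are illustrative assumptions, not the paper's actual procedure.

```python
import torch

def allocate_kv_budgets(dispersion: torch.Tensor, total_budget: int,
                        drafter_frac: float = 0.25,
                        drafter_share: float = 0.8) -> torch.Tensor:
    """Split layers into drafters (dispersed attention over earlier scales,
    large cache) and refiners (attention focused on the current token map,
    small cache), then divide one scale's total KV cache budget between
    the two groups."""
    num_layers = dispersion.numel()
    num_drafters = max(1, int(drafter_frac * num_layers))
    # Layers whose attention is most dispersed become drafters.
    drafter_idx = torch.topk(dispersion, num_drafters).indices
    refiner_budget = (1 - drafter_share) * total_budget / (num_layers - num_drafters)
    budgets = torch.full((num_layers,), refiner_budget)
    budgets[drafter_idx] = drafter_share * total_budget / num_drafters
    return budgets.round().long()

# Example: 16 transformer layers, 4096 cached tokens available at this scale.
print(allocate_kv_budgets(torch.rand(16), total_budget=4096))
```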

NeurIPS 2025 Conference Paper

VeriThinker: Learning to Verify Makes Reasoning Model Efficient

  • Zigeng Chen
  • Xinyin Ma
  • Gongfan Fang
  • Ruonan Yu
  • Xinchao Wang

Large Reasoning Models (LRMs) have garnered considerable attention for their ability to tackle complex tasks through the Chain-of-Thought (CoT) approach. However, their tendency toward overthinking results in unnecessarily lengthy reasoning chains, dramatically increasing inference costs. To mitigate this issue, we introduce VeriThinker, a novel approach for CoT compression. Unlike conventional methods that fine-tune LRMs directly on the original reasoning task using synthetic concise CoT data, we innovatively fine-tune the model solely through an auxiliary verification task. By training LRMs to accurately verify the correctness of CoT solutions, the LRMs inherently become more discerning about the necessity of subsequent self-reflection steps, thereby effectively suppressing overthinking. Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%), and on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%). Additionally, our experiments demonstrate that VeriThinker also generalizes zero-shot to speculative reasoning.
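The auxiliary verification objective can be framed as ordinary supervised fine-tuning on a judge prompt. A minimal sketch, assuming a Hugging Face-style causal LM and tokenizer; the prompt template and verdict tokens here are hypothetical, not the paper's exact formulation.

```python
def verification_loss(model, tokenizer, question, cot_solution, is_correct):
    """Auxiliary verification task: the reasoning model judges whether a
    chain-of-thought solution is correct; only the verdict tokens carry
    next-token supervision (label -100 is ignored by the HF loss)."""
    prompt = (f"Question: {question}\nSolution: {cot_solution}\n"
              "Is this solution correct? Answer Yes or No.\nAnswer:")
    verdict = " Yes" if is_correct else " No"
    ids = tokenizer(prompt + verdict, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt; supervise the verdict only
    return model(input_ids=ids, labels=labels).loss
```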

NeurIPS 2024 Conference Paper

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

  • Zigeng Chen
  • Xinyin Ma
  • Gongfan Fang
  • Zhenxiong Tan
  • Xinchao Wang

Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, precluding the possibility of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component can compute in parallel on a separate device. The proposed strategy significantly reduces inference latency while minimally impacting generative quality. Specifically, for Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performance.
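The asynchronous trick can be illustrated with stale activations: after a short sequential warm-up, stage k consumes the hidden state stage k-1 produced at the previous denoising step, so all stages could run concurrently. A minimal single-process sketch, assuming each component's forward takes (hidden, timestep); the warm-up length and the scheduler update are placeholders, not the paper's implementation.

```python
import torch

@torch.no_grad()
def async_denoise(components, timesteps, x, warmup=1):
    """components: a sequential split of the noise predictor, one module
    per device. After a short sequential warm-up, stage k consumes the
    hidden state stage k-1 produced at the *previous* step (hidden states
    at consecutive steps are highly similar), so the stages could run
    concurrently on separate devices."""
    hidden = [None] * len(components)          # stale per-stage activations
    for step, t in enumerate(timesteps):
        if step < warmup:                      # ordinary sequential pass
            h = x
            for k, block in enumerate(components):
                h = block(h.to(next(block.parameters()).device), t)
                hidden[k] = h
        else:
            # Each stage reads its predecessor's previous-step output; in a
            # real multi-GPU run these k calls execute in parallel.
            hidden = [block((x if k == 0 else hidden[k - 1])
                            .to(next(block.parameters()).device), t)
                      for k, block in enumerate(components)]
        x = x - 0.1 * hidden[-1]               # placeholder scheduler update
    return x
```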

IJCAI 2024 Conference Paper

MetaISP: Efficient RAW-to-sRGB Mappings with Merely 1M Parameters

  • Zigeng Chen
  • Chaowei Liu
  • Yuan Yuan
  • Michael Bi Mi
  • Xinchao Wang

State-of-the-art deep ISP models alleviate the dilemma of limited generalization capabilities across heterogeneous inputs by increasing the size and complexity of the network, which inevitably leads to considerable growth in parameter counts and FLOPs. To address this challenge, this paper presents MetaISP, a streamlined model that achieves superior reconstruction quality by adaptively modulating its parameters and architecture in response to diverse inputs. Our rationale revolves around obtaining corresponding spatial and channel-wise correction matrices for various inputs within distinct feature spaces, which assists in assigning optimal attention. This is achieved by predicting dynamic weights for each input image and combining these weights with multiple learnable basis matrices to construct the correction matrices. The proposed MetaISP achieves the best performance while remaining computationally efficient. SOTA results are achieved on two large-scale datasets, e.g., 23.80 dB PSNR on ZRR, exceeding the previous SOTA by 0.19 dB with only 9.2% of its parameter count and 10.6% of its FLOPs, and 25.06 dB PSNR on MAI21, exceeding the previous SOTA by 0.17 dB with only 0.9% of its parameter count and 2.7% of its FLOPs.
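The dynamic-weights-over-basis-matrices idea maps naturally onto a small module: predict per-image mixing coefficients, combine a bank of learnable basis matrices into a channel-wise correction matrix, and apply it to the features. A minimal sketch under those assumptions follows; the module name, basis count, softmax mixing, and residual connection are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DynamicCorrection(nn.Module):
    """Predict per-image coefficients and mix a bank of learnable basis
    matrices into a channel-wise correction matrix for the features."""
    def __init__(self, channels: int, num_bases: int = 8):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, channels, channels) * 0.01)
        self.predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, num_bases), nn.Softmax(dim=-1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        coeffs = self.predictor(feat)                       # [B, num_bases]
        # Per-image correction matrix: coefficient-weighted sum of bases.
        corr = torch.einsum("bn,ncd->bcd", coeffs, self.bases)  # [B, C, C]
        out = torch.einsum("bcd,bdhw->bchw", corr, feat)
        return out + feat  # residual connection (an assumption)

m = DynamicCorrection(channels=32)
y = m(torch.randn(2, 32, 16, 16))  # -> [2, 32, 16, 16]
```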

NeurIPS 2024 Conference Paper

SlimSAM: 0.1% Data Makes Segment Anything Slim

  • Zigeng Chen
  • Gongfan Fang
  • Xinyin Ma
  • Xinchao Wang

Current approaches for compressing the Segment Anything Model (SAM) yield commendable results, yet necessitate extensive data to train a new network from scratch. Employing conventional pruning techniques can remarkably reduce data requirements but would suffer from a degradation in performance. To address this challenging trade-off, we introduce SlimSAM, a novel data-efficient SAM compression method that achieves superior performance with far less training data. The essence of SlimSAM is an alternate slimming framework that effectively enhances knowledge inheritance under severely limited training data and an exceptional pruning ratio. Diverging from prior techniques, our framework progressively compresses the model by alternately pruning and distilling distinct, decoupled sub-structures. Disturbed Taylor pruning is also proposed to address the misalignment between the pruning objective and the training target, thereby improving distillation after pruning. SlimSAM yields significant performance improvements while demanding over 10 times less training data than any other existing compression method. Even when compared to the original SAM, SlimSAM achieves performance close to the original while reducing parameter counts to merely 1.4% (9.1M), MACs to 0.8% (23G), and requiring only 0.1% (10k) of the SAM training data.
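At a high level, the alternate slimming loop prunes one decoupled sub-structure, distills to recover it, then moves to the next. The sketch below captures that control flow plus a first-order Taylor importance score; prune_step and distill_step are hypothetical callbacks, and the "disturbed" variant is only gestured at in a comment, since the paper's exact criterion isn't reproduced here.

```python
import torch

def taylor_importance(param: torch.nn.Parameter) -> torch.Tensor:
    """First-order Taylor importance |w * dL/dw|: the estimated change in
    loss from zeroing a weight. In Disturbed Taylor pruning the gradient is
    taken against a perturbed, distillation-aligned target rather than the
    original task loss, so the pruning criterion matches what the model is
    subsequently distilled toward."""
    assert param.grad is not None, "run a backward pass first"
    return (param * param.grad).abs()

def alternate_slimming(model, substructures, prune_step, distill_step, rounds=1):
    """Alternate slimming control flow: prune one decoupled sub-structure,
    then distill to recover it, before touching the next sub-structure.
    `prune_step` and `distill_step` are user-supplied callbacks."""
    for _ in range(rounds):
        for sub in substructures:
            prune_step(model, sub)     # e.g. drop low-importance channels in `sub`
            distill_step(model, sub)   # recover `sub` against the intermediate teacher
    return model
```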