Arrow Research search

Author name cluster

Zelin Peng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

AAAI 2026 Conference Paper

CP-CLIP: Customized Parameter Generation for Open-vocabulary Semantic Segmentation

  • Zelin Peng
  • Zhengqin Xu
  • Feilong Tang
  • Wei Shen

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images based on textual descriptions, even for categories beyond predefined closed sets. While vision-language foundation models like CLIP are widely used for this task, fine-tuning them for pixel-level predictions often compromises their generalization capabilities. To address this, we propose a novel fine-tuning strategy, CP-CLIP, which generates customized parameters for CLIP without sacrificing its generalization. Our method employs a customized parameter generator that produces the newly added parameters from random noise, conditioned on local visual features from CLIP's image encoder, enabling generalization to new images from unseen scenarios. Additionally, we introduce an orthogonal adaptation technique that keeps the update direction orthogonal to the pre-trained weights, largely preserving the initial generalization ability. Extensive experiments demonstrate that CP-CLIP achieves state-of-the-art performance across multiple open-vocabulary semantic segmentation benchmarks.
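
Neither mechanism is detailed beyond the abstract here, but both admit a compact illustration. Below is a minimal PyTorch sketch, not the paper's implementation: the module and function names are hypothetical, the generator conditions random noise on pooled visual features, and the orthogonality step is one plausible reading of keeping the update direction orthogonal to the pre-trained weights.

```python
import torch
import torch.nn as nn

class CustomizedParamGenerator(nn.Module):
    """Hypothetical sketch: produce new adapter parameters from random
    noise, conditioned on local visual features from the image encoder."""
    def __init__(self, feat_dim, param_dim, noise_dim=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim + feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, param_dim),
        )

    def forward(self, visual_feats):
        # visual_feats: (B, feat_dim), e.g. pooled local features per image
        noise = torch.randn(visual_feats.size(0), self.noise_dim,
                            device=visual_feats.device)
        return self.net(torch.cat([noise, visual_feats], dim=-1))

def orthogonal_update(delta, w0):
    # Remove the component of an update parallel to the pre-trained
    # weight w0, keeping the applied update direction orthogonal to it.
    w0_flat, d_flat = w0.flatten(), delta.flatten()
    proj = (d_flat @ w0_flat) / (w0_flat @ w0_flat) * w0_flat
    return (d_flat - proj).view_as(delta)
```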

AAAI 2026 Conference Paper

Efficient Segmentation with Multimodal Large Language Model via Token Routing

  • Changsong Wen
  • Zelin Peng
  • Yu Huang
  • Wei Shen

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in addressing open-world segmentation tasks. However, the substantial computational cost of the LLM component presents a significant challenge, especially in segmentation, where efficiency has long been a central concern. Existing efficient MLLM approaches typically reduce computational cost by pruning visual tokens in the early layers, as they account for the majority of the input sequence. Although efficient, such pruning is incompatible with dense prediction tasks like segmentation, since removing visual tokens discards essential object parts and spatial details. To better understand the roles of visual tokens in segmentation, we analyze the attention weights of both image and mask tokens within the LLM. We find that image tokens are important throughout all layers, whereas mask tokens attend to image tokens only at deeper layers. Based on this observation, we build an efficient MLLM-based segmentation framework around a token routing strategy that dynamically determines when and how different tokens participate in computation. Mask tokens are inserted only at the deeper layers of the LLM, since they rarely attend to image tokens in early layers, which removes redundant computation. Among image tokens, only a small number, named proxies, are updated via the full feedforward network (FFN); the remaining tokens are guided by these proxies, i.e., updated efficiently through a lightweight projector applied to the change in the proxies. Our method achieves a 1.5× acceleration over the original LLM process by reducing its FLOPs to 56%, while maintaining the same segmentation performance.
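
The two routing rules, late insertion of mask tokens and proxy-only FFN updates for image tokens, can be illustrated in a few lines. The PyTorch sketch below is an assumption built from the abstract alone: `ProxyFFN` is a hypothetical module, and taking the first k tokens as proxies is a placeholder for whatever selection rule the paper actually uses.

```python
import torch
import torch.nn as nn

class ProxyFFN(nn.Module):
    """Hypothetical sketch: run the full FFN only on a few proxy image
    tokens; update the remaining tokens cheaply from the proxies' change."""
    def __init__(self, dim, hidden_dim, num_proxies=16):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )
        self.projector = nn.Linear(dim, dim)  # lightweight surrogate update
        self.num_proxies = num_proxies

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim); first-k selection is a placeholder
        k = min(self.num_proxies, image_tokens.size(1))
        proxies, rest = image_tokens[:, :k], image_tokens[:, k:]
        updated = proxies + self.ffn(proxies)        # full FFN on proxies only
        delta = (updated - proxies).mean(dim=1, keepdim=True)
        rest = rest + self.projector(delta)          # guided by the proxies' change
        return torch.cat([updated, rest], dim=1)

# Mask tokens would be appended to the sequence only from some deeper
# layer onward, per the attention analysis described in the abstract.
```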

JBHI 2025 Journal Article

DMformer: Difficulty-adapted Masked Transformer for Semi-Supervised Medical Image Segmentation

  • Zelin Peng
  • Guanchun Wang
  • Zhengqin Xu
  • Xiaokang Yang
  • Wei Shen

The shared anatomy among different human bodies can serve as a strong prior for effectively leveraging unlabeled data in semi-supervised medical image segmentation. Inspired by the success of masked image modeling, we notice that this prior can be explicitly realized by incorporating an auxiliary unsupervised gross anatomy reconstruction task into a teacher-student semi-supervised segmentation framework. In this auxiliary task, consistency is maintained between the student's predictions on masked images and the teacher's predictions on the original images. Despite its potential, we observe that the reconstruction difficulties of different organs/tissues can vary significantly, so reconstructing them requires tailored learning strategies. To address this issue, we introduce a difficulty-adapted mask mechanism into the teacher-student framework, wherein the reconstruction difficulty is adapted to facilitate training. Specifically, we control the reconstruction difficulty by modulating two important factors: the masked region ratio and the masked class ratio. Accordingly, we design two corresponding mask strategies. 1) Region-based masking: randomly masks a fraction of each class according to an automatically computed mask ratio. 2) Class-based masking: masks the entire regions of specific classes according to the class confidence predicted by the teacher model. During training, a conflict-aware gradient computation strategy mitigates potential optimization conflicts arising from modulating the two reconstruction factors simultaneously. Building on vision transformers, we develop a Difficulty-adapted Masked Transformer (DMformer) for semi-supervised medical image segmentation. Extensive experiments demonstrate the superiority of DMformer, which outperforms the previous SOTA by 9.53% and 4.63% DSC on the ACDC dataset with 5% labeled images and the Synapse dataset with 30% labeled images, respectively. Code is available at: https://github.com/SJTU-DeepVisionLab/DMformer.
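
The two mask strategies are concrete enough to sketch. The following is a minimal PyTorch illustration under stated assumptions, not the released DMformer code: masked pixels are zeroed, the mask ratio and confidence threshold are placeholders, and whether high- or low-confidence classes get masked is a design choice the abstract leaves open.

```python
import torch

def region_based_mask(image, teacher_pred, mask_ratio):
    # Sketch: randomly mask a fraction of the pixels of each predicted class.
    # image: (C, H, W); teacher_pred: (H, W) hard labels; mask_ratio in [0, 1].
    mask = torch.zeros_like(teacher_pred, dtype=torch.bool)
    for c in teacher_pred.unique():
        idx = (teacher_pred == c).nonzero(as_tuple=False)
        n = int(mask_ratio * idx.size(0))
        chosen = idx[torch.randperm(idx.size(0))[:n]]
        mask[chosen[:, 0], chosen[:, 1]] = True
    return image.masked_fill(mask.unsqueeze(0), 0.0)

def class_based_mask(image, teacher_pred, class_conf, conf_thresh=0.9):
    # Sketch: mask the entire region of classes selected by the teacher's
    # confidence; class_conf maps class id -> confidence score.
    mask = torch.zeros_like(teacher_pred, dtype=torch.bool)
    for c, conf in class_conf.items():
        if conf > conf_thresh:
            mask |= (teacher_pred == c)
    return image.masked_fill(mask.unsqueeze(0), 0.0)
```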

AAAI 2025 Conference Paper

FATE: Feature-Adapted Parameter Tuning for Vision-Language Models

  • Zhengqin Xu
  • Zelin Peng
  • Xiaokang Yang
  • Wei Shen

Following the recent popularity of vision-language models, several attempts, e.g., parameter-efficient fine-tuning (PEFT), have been made to extend them to different downstream tasks. Previous PEFT works motivate their methods from the view of introducing new parameters for adaptation, but still learn those weights from scratch, i.e., from random initialization. In this paper, we present a novel strategy that exploits prompts, e.g., vision features, to help the initial parameter space adapt to new scenarios. We introduce a Feature-Adapted parameTer Efficient tuning paradigm for vision-language models, dubbed FATE, which injects informative features from the vision encoder into the language encoder's parameter space. Specifically, we extract vision features from the last layer of CLIP's vision encoder and, after projection, treat them as parameters for fine-tuning each layer of CLIP's language encoder. By adjusting these feature-adapted parameters, we directly enable communication between the vision and language branches, facilitating CLIP's adaptation to different scenarios. Experimental results show that FATE exhibits superior generalization performance on 11 datasets with very few extra parameters and little extra computation.
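
The core mechanics, projecting last-layer vision features into per-layer parameters for the text encoder, can be sketched briefly. This PyTorch snippet is a hypothetical reading of the abstract, not FATE's actual code; how the adapted vectors enter each language layer (here, as an additive bias) is an assumption.

```python
import torch
import torch.nn as nn

class FeatureAdaptedParams(nn.Module):
    """Hypothetical sketch: project the vision encoder's last-layer
    features into one parameter vector per language-encoder layer."""
    def __init__(self, vision_dim, text_dim, num_text_layers):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(vision_dim, text_dim) for _ in range(num_text_layers)]
        )

    def forward(self, vision_feat):
        # vision_feat: (B, vision_dim) from the last layer of the vision encoder
        return [head(vision_feat) for head in self.heads]

# Assumed injection: add the i-th adapted vector to the hidden states of
# the i-th text layer, letting the vision branch modulate the language
# branch directly during fine-tuning.
```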

NeurIPS 2025 Conference Paper

HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models

  • Zelin Peng
  • Zhengqin Xu
  • Qingyang Liu
  • Xiaokang Yang
  • Wei Shen

Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multiple granularity levels. To address this issue, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between the visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which optimizes visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently and clearly improves both existing pre-training and fine-tuning MLLMs with less than 1% additional parameters. Code is available at https://github.com/godlin-sjtu/HyperET.
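
The abstract's key primitive, Möbius multiplication by a learnable structured matrix, has a standard closed form in the hyperbolic neural network literature. The sketch below implements that standard operation on the Poincaré ball; it is not HyperET's exact parametrization, and the diagonal example at the end merely hints at the three structured configurations the abstract lists.

```python
import torch

def mobius_matvec(M, x, c=1.0, eps=1e-6):
    # Standard Mobius matrix-vector multiplication on the Poincare ball of
    # curvature c. M: (d_out, d_in); x: (B, d_in) with ||x|| < 1/sqrt(c).
    sqrt_c = c ** 0.5
    x_norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    mx = x @ M.T
    mx_norm = mx.norm(dim=-1, keepdim=True).clamp_min(eps)
    scale = torch.tanh(mx_norm / x_norm *
                       torch.atanh((sqrt_c * x_norm).clamp(max=1 - eps)))
    return scale * mx / (mx_norm * sqrt_c)

# One of the three structured choices named in the abstract, a diagonal
# scaling matrix, would simply be M = torch.diag(learnable_vector).
```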

AAAI 2024 Conference Paper

LERE: Learning-Based Low-Rank Matrix Recovery with Rank Estimation

  • Zhengqin Xu
  • Yulun Zhang
  • Chao Ma
  • Yichao Yan
  • Zelin Peng
  • Shoulie Xie
  • Shiqian Wu
  • Xiaokang Yang

A fundamental task in computer vision, Low-Rank Matrix Recovery (LRMR) aims to precisely recover an inherent low-rank structure from incomplete data and/or corrupted measurements, given that the rank is known a priori or accurately estimated. However, it remains challenging for existing rank estimation methods to accurately estimate the rank of an ill-conditioned matrix. Moreover, existing LRMR optimization methods depend heavily on the chosen parameters and are therefore difficult to adapt to different situations. To address these issues, we propose a novel LEarning-based low-rank matrix recovery method with Rank Estimation (LERE). More specifically, by exploiting the characteristics of the Gerschgorin disk's center and radius, we significantly enhance the heuristic decision rule of the Gerschgorin Disk Theorem so that the low-rank boundary can be located exactly, which leads to a marked improvement in the accuracy of rank estimation. According to the estimated rank, we select row and column sub-matrices from the observation matrix by uniform random sampling. A 17-iteration feedforward-recurrent-mixed neural network is then adapted to learn the parameters of the sub-matrix recovery process. Finally, through the correlation of the row and column sub-matrices, LERE recovers the underlying low-rank matrix. Overall, LERE is more efficient and robust than existing LRMR methods. Experimental results demonstrate that LERE surpasses state-of-the-art (SOTA) methods. The code for this work is accessible at https://github.com/zhengqinxu/LERE.
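
The sampling-and-combination step described at the end of the abstract resembles a CUR decomposition, which can be sketched directly. The NumPy snippet below covers that generic step only; the enhanced Gerschgorin rank estimator and the 17-iteration recovery network are paper-specific and not reproduced, so `rank` is assumed to be already estimated.

```python
import numpy as np

def cur_style_recovery(X_obs, rank, oversample=2, seed=0):
    # Sketch: sample row/column sub-matrices uniformly at random and
    # combine them through their intersection block, X ~= C U^+ R.
    rng = np.random.default_rng(seed)
    m, n = X_obs.shape
    k = min(oversample * rank, m, n)
    rows = rng.choice(m, size=k, replace=False)   # row sub-matrix indices
    cols = rng.choice(n, size=k, replace=False)   # column sub-matrix indices
    C = X_obs[:, cols]
    R = X_obs[rows, :]
    U = X_obs[np.ix_(rows, cols)]                 # intersection of the two
    return C @ np.linalg.pinv(U) @ R

# For a noise-free rank-r matrix, this reconstruction is exact whenever
# the sampled sub-matrices capture the full rank.
```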

AAAI 2024 Conference Paper

SAM-PARSER: Fine-Tuning SAM Efficiently by Parameter Space Reconstruction

  • Zelin Peng
  • Zhengqin Xu
  • Zhilin Zeng
  • Xiaokang Yang
  • Wei Shen

Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition and fine-tune the coefficients to reconstruct the parameter space tailored to the new scenario through an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by a factor of approximately 290 compared with current parameter-efficient fine-tuning methods.
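
The reconstruction idea maps naturally onto an SVD: freeze the bases of the pre-trained weight and train only the combination coefficients. The sketch below is one plausible instantiation, not SAM-PARSER's released code; the abstract does not state which matrix decomposition is used, so SVD is an assumption here.

```python
import torch
import torch.nn as nn

class ParamSpaceReconstruction(nn.Module):
    """Hypothetical sketch: decompose a frozen pre-trained weight into
    bases via SVD and fine-tune only per-basis combination coefficients."""
    def __init__(self, pretrained_weight):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)    # frozen bases
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        # the only trainable parameters: one coefficient per basis
        self.coeff = nn.Parameter(torch.ones_like(S))

    def forward(self, x):
        W = self.U @ torch.diag(self.coeff * self.S) @ self.Vh
        return x @ W.T
```

Training only `coeff` is what would keep the trainable-parameter count near zero relative to adapter-style methods.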

IJCAI 2022 Conference Paper

Absolute Wrong Makes Better: Boosting Weakly Supervised Object Detection via Negative Deterministic Information

  • Guanchun Wang
  • Xiangrong Zhang
  • Zelin Peng
  • Xu Tang
  • Huiyu Zhou
  • Licheng Jiao

Weakly supervised object detection (WSOD) is a challenging task in which image-level labels (e.g., the categories of the instances in the whole image) are used to train an object detector. Many existing methods follow the standard multiple instance learning (MIL) paradigm and have achieved promising performance. However, the lack of deterministic information leads to part domination and missing instances. To address these issues, this paper focuses on identifying and fully exploiting the deterministic information in WSOD. We discover that negative instances (i.e., absolutely wrong instances), ignored in most previous studies, normally contain valuable deterministic information. Based on this observation, we propose a negative deterministic information (NDI) based method for improving WSOD, namely NDI-WSOD. Specifically, our method consists of two stages: NDI collecting and NDI exploiting. In the collecting stage, we design several processes to identify and distill the NDI from negative instances online. In the exploiting stage, we utilize the extracted NDI to construct a novel negative contrastive learning mechanism and a negative-guided instance selection strategy, which deal with part domination and missing instances, respectively. Experimental results on several public benchmarks, including VOC 2007, VOC 2012, and MS COCO, show that our method achieves satisfactory performance.
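
The negative contrastive mechanism is described only at a high level, but its spirit can be hinted at in a few lines. The PyTorch sketch below is speculative: the NDI prototype, the softplus penalty, and the temperature are all assumptions standing in for the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def negative_contrastive_loss(inst_feats, ndi_prototype, tau=0.1):
    # Sketch: push candidate instance features away from a prototype
    # distilled from negative (absolutely wrong) instances.
    # inst_feats: (N, D); ndi_prototype: (D,)
    feats = F.normalize(inst_feats, dim=-1)
    proto = F.normalize(ndi_prototype, dim=-1)
    sim = feats @ proto / tau          # similarity to the NDI prototype
    return F.softplus(sim).mean()      # penalize similarity to negatives
```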