Author name cluster

Yaoming Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

ECAI Conference 2025 Conference Paper

GLEAM: Parameter-Efficient Transfer Learning via Global Share Local Transform Mixture-of-Experts

Jiarui Zhang
Yue Xin
Yaoming Wang
Wenrui Dai
Ziyang Zheng
Chenglin Li
Junni Zou
Hongkai Xiong

Parameter-efficient transfer learning (PETL) has emerged as a promising solution to adapt large-scale pre-trained models to downstream tasks. Nevertheless, these methods have not thoroughly explored the characteristics of PETL methods to optimize the fine-tuning performance with a minimal volume of parameters. In this paper, we first reveal that, compared to pre-trained models, PETL tends to generate similar features via homogeneous feature transformations across different layers. Subsequently, we propose a Global Share Local Transform Mixture-of-Experts framework, namely GLEAM, that decomposes the adapter into a shared component and layer-specific local components to simultaneously reduce the redundancy in layer-wise parameter matrices for homogeneous feature transformations and fine-tune the locally specific parameters for minimizing performance loss. Specifically, we develop a shared mixture of convolution that introduces shared multi-scale sparse MoE to enable diverse transformations for suppressing the homogeneity issue of feature transformations in PETL. GLEAM is evaluated on more than 20 datasets for image classification and few-shot learning. Extensive experimental results demonstrate that it achieves comparable performance with existing PETL methods like LoRA with only 3% of its parameters and further yields competitive performance using only 0.07M parameters.

ICML Conference 2025 Conference Paper

Noise Conditional Variational Score Distillation

Xinyu Peng
Ziyang Zheng
Yaoming Wang
Han Li
Nuowen Kan
Wenrui Dai
Chenglin Li
Junni Zou

We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.

ICLR Conference 2024 Conference Paper

BarLeRIa: An Efficient Tuning Framework for Referring Image Segmentation

Yaoming Wang
Jin Li 0057
Xiaopeng Zhang 0008
Bowen Shi 0003
Chenglin Li
Wenrui Dai
Hongkai Xiong
Qi Tian 0001

Pre-training followed by full fine-tuning has gradually been substituted by Parameter-Efficient Tuning (PET) in the field of computer vision. PET has gained popularity, especially in the context of large-scale models, due to its ability to reduce transfer learning costs and conserve hardware resources. However, existing PET approaches primarily focus on recognition tasks and typically support uni-modal optimization, while neglecting dense prediction tasks and vision language interactions. To address this limitation, we propose a novel PET framework called **B**i-direction**a**l Inte**r**twined Vision **L**anguage Effici**e**nt Tuning for **R**eferring **I**mage Segment**a**tion (**BarLeRIa**), which leverages bi-directional intertwined vision language adapters to fully exploit the frozen pre-trained models' potential in cross-modal dense prediction tasks. In BarLeRIa, two different tuning modules are employed for efficient attention, one for global, and the other for local, along with an intertwined vision language tuning module for efficient modal fusion. Extensive experiments conducted on RIS benchmarks demonstrate the superiority of BarLeRIa over prior PET methods with a significant margin, i.e., achieving an average improvement of 5.6%. Remarkably, without requiring additional training datasets, BarLeRIa even surpasses SOTA full fine-tuning approaches. The code is available at https://github.com/NastrondAd/BarLeRIa.

ICML Conference 2024 Conference Paper

Bootstrap AutoEncoders With Contrastive Paradigm for Self-supervised Gaze Estimation

Yaoming Wang
Jin Li 0057
Wenrui Dai
Bowen Shi 0003
Xiaopeng Zhang 0008
Chenglin Li
Hongkai Xiong

Existing self-supervised methods for gaze estimation using the dominant streams of contrastive and generative approaches are restricted to eye images and could fail in general full-face settings. In this paper, we reveal that contrastive methods are ineffective in data augmentation for self-supervised full-face gaze estimation, while generative methods are prone to trivial solutions due to the absence of explicit regularization on semantic representations. To address this challenge, we propose a novel approach called Bootstrap auto-encoders with Contrastive paradigm (BeCa), which combines the strengths of both generative and contrastive methods. Specifically, we revisit the Auto-Encoder used in generative approaches and incorporate the contrastive paradigm to introduce explicit regularization on gaze representation. Furthermore, we design the InfoMSE loss as an alternative to the vanilla MSE loss for Auto-Encoder to mitigate the inconsistency between reconstruction and representation learning. Experimental results demonstrate that the proposed approaches outperform state-of-the-art unsupervised gaze approaches on extensive datasets (including wild scenes) under both within-dataset and cross-dataset protocols.

ICLR Conference 2024 Conference Paper

Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Bowen Shi 0003
Xiaopeng Zhang 0008
Yaoming Wang
Jin Li 0057
Wenrui Dai
Junni Zou
Hongkai Xiong
Qi Tian 0001

As two prominent strategies for representation learning, Contrastive Learning (CL) and Masked Image Modeling (MIM) have witnessed significant progress. Previous studies have demonstrated the advantages of each approach in specific scenarios. CL, resembling supervised pre-training, excels at capturing longer-range global patterns and enhancing feature discrimination, while MIM is adept at introducing local and diverse attention across transformer layers. Considering the respective strengths, previous studies utilize feature distillation to inherit both discrimination and diversity. In this paper, we thoroughly examine previous feature distillation methods and observe that the increase in diversity mainly stems from asymmetric designs, which may in turn compromise the discrimination ability. To strike a balance between the two properties, we propose a simple yet effective strategy termed Hybrid Distill, which leverages both the CL and MIM teachers to jointly guide the student model. Hybrid Distill emulates the token relations of the MIM teacher at intermediate layers for diversity, while simultaneously distilling the final features of the CL teacher to enhance discrimination. A progressive redundant token masking strategy is employed to reduce the expenses associated with distillation and aid in preventing the model from converging to local optima. Experimental results demonstrate that Hybrid Distill achieves superior performance on various benchmark datasets.

NeurIPS Conference 2023 Conference Paper

AiluRus: A Scalable ViT Framework for Dense Prediction

Jin Li
Yaoming Wang
Xiaopeng Zhang
Bowen Shi
Dongsheng Jiang
Chenglin Li
Wenrui Dai
Hongkai Xiong

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their complexity dramatically increases when handling long token sequences, particularly for dense prediction tasks that require high-resolution input. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we select anchors from the token sequence using the proposed spatial-aware density-based clustering algorithm. Tokens that are adjacent to anchors are merged to form low-resolution regions, while others are preserved independently as high-resolution. This strategy could significantly reduce the number of tokens, and the following layers only handle the reduced token sequence for acceleration. At the output end, the resolution of the feature map is recovered by unfolding merged tokens for task prediction. Consequently, we can considerably accelerate ViTs for dense prediction tasks. The proposed method is evaluated across three different datasets and demonstrates promising performance. For instance, "Segmenter ViT-L" can be accelerated by 48% FPS without fine-tuning, while maintaining the performance. Moreover, our method can also be applied to accelerate fine-tuning. Experiments indicate that we can save 52% training time while accelerating 2.46× FPS with only a 0.09% performance drop.

ICLR Conference 2023 Conference Paper

Progressively Compressed Auto-Encoder for Self-supervised Representation Learning

Jin Li 0057
Yaoming Wang
Xiaopeng Zhang 0008
Yabo Chen
Dongsheng Jiang
Wenrui Dai
Chenglin Li
Hongkai Xiong

As a typical self-supervised learning strategy, Masked Image Modeling (MIM) is driven by recovering all masked patches from visible ones. However, patches from the same image are highly correlated and it is redundant to reconstruct all the masked patches. We find that this redundancy is neglected by existing MIM based methods and causes non-negligible overheads in computation that do not necessarily benefit self-supervised representation. In this paper, we present a novel approach named PCAE, short for Progressively Compressed AutoEncoder, to address the redundant reconstruction issue by progressively compacting tokens and only retaining necessary information for forward propagation and reconstruction. In particular, we identify those redundant tokens in an image via a simple yet effective similarity metric between each token with the mean of the token sequence. Those redundant tokens that other ones can probably represent are progressively dropped accordingly during the forward propagation, and importantly, we only focus on reconstructing these retained tokens. As a result, we are able to achieve a better trade-off between performance and efficiency for pre-training. Besides, benefitting from the flexible strategy, PCAE can be also directly employed for downstream fine-tuning tasks and enable scalable deployment. Experiments show that PCAE achieves comparable performance to MAE with only 1/8 GPU days. The code is available at https://github.com/caddyless/PCAE/.
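The token-selection step described in this abstract can be illustrated with a rough sketch. This is not the PCAE implementation: the use of cosine similarity, the rule that tokens most similar to the sequence mean count as redundant, and the names `drop_redundant_tokens` and `keep_ratio` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant_tokens(tokens, keep_ratio=0.5):
    """Keep the tokens least similar to the mean token (hypothetical helper).

    Assumption: a token close to the sequence mean is treated as redundant,
    since the remaining tokens can roughly represent it.
    Returns (kept indices in original order, kept tokens).
    """
    n = len(tokens)
    dim = len(tokens[0])
    # Mean of the token sequence, computed per feature dimension.
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    scores = [cosine(t, mean) for t in tokens]
    n_keep = max(1, math.ceil(n * keep_ratio))
    # Lowest similarity first, then restore the original token order.
    by_score = sorted(range(n), key=lambda i: scores[i])
    keep = sorted(by_score[:n_keep])
    return keep, [tokens[i] for i in keep]


tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.05]]
keep, kept_tokens = drop_redundant_tokens(tokens, keep_ratio=0.5)
```

Calling the helper once per stage with the surviving tokens would mimic the progressive dropping the abstract mentions; the schedule of ratios across layers is left open here because the abstract does not state it.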

IJCAI Conference 2020 Conference Paper

SI-VDNAS: Semi-Implicit Variational Dropout for Hierarchical One-shot Neural Architecture Search

Yaoming Wang
Wenrui Dai
Chenglin Li
Junni Zou
Hongkai Xiong

Bayesian methods have improved the interpretability and stability of neural architecture search (NAS). In this paper, we propose a novel probabilistic approach, namely Semi-Implicit Variational Dropout one-shot Neural Architecture Search (SI-VDNAS), that leverages semi-implicit variational dropout to support architecture search with variable operations and edges. SI-VDNAS achieves stable training that would not be affected by the over-selection of skip-connect operation. Experimental results demonstrate that SI-VDNAS finds a convergent architecture with only 2.7 MB parameters within 0.8 GPU-days and can achieve 2.60% top-1 error rate on CIFAR-10. The convergent architecture can obtain a top-1 error rate of 16.20% and 25.6% when transferred to CIFAR-100 and ImageNet (mobile setting).
