Arrow Research search

Author name cluster

Shaohui Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

22

NeurIPS Conference 2025 Conference Paper

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

  • Xiaoyu Zhan
  • Wenxuan Huang
  • Hao Sun
  • Xinyu Fu
  • Changfeng Ma
  • Shaosheng Cao
  • Bohan Jia
  • Shaohui Lin

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. To address this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, yielding significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning with the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.

ICLR Conference 2025 Conference Paper

AugKD: Ingenious Augmentations Empower Knowledge Distillation for Image Super-Resolution

  • Yun Zhang
  • Wei Li 0002
  • Simiao Li
  • Hanting Chen
  • Zhijun Tu
  • Bingyi Jing
  • Shaohui Lin
  • Jie Hu 0021

Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from cumbersome pre-trained teacher models to more compact student models. However, vanilla KD for image super-resolution (SR) networks yields only limited improvements due to the inherent nature of SR tasks, where the outputs of teacher models are noisy approximations of high-quality label images. In this work, we show that the potential of vanilla KD has been underestimated and demonstrate that the ingenious application of data augmentation methods can close the gap between it and more complex, well-designed methods. Unlike conventional training processes, which typically apply image augmentations simultaneously to both low-quality inputs and high-quality labels, our proposed AugKD utilizes unpaired data augmentations to 1) generate auxiliary distillation samples and 2) impose label consistency regularization. Comprehensive experiments show that AugKD significantly outperforms existing state-of-the-art KD methods across a range of SR tasks.

AAAI Conference 2025 Conference Paper

Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration

  • Yunshuai Zhou
  • Junbo Qiao
  • Jincheng Liao
  • Wei Li
  • Simiao Li
  • Jiao Xie
  • Yunhang Shen
  • Jie Hu

Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type losses struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space using contrastive learning. We also propose a distribution mapping module to extract and align the pixel-level category distributions of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms state-of-the-art KD methods across various image restoration tasks and backbones.

ICLR Conference 2025 Conference Paper

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

  • Wenxuan Huang 0001
  • Zijie Zhai
  • Yunhang Shen
  • Shaosheng Cao
  • Fei Zhao 0012
  • Xiangfeng Xu
  • Zheyu Ye
  • Shaohui Lin

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, inference computation and memory grow progressively with the generation of output tokens during decoding, directly affecting the efficiency of MLLMs. Existing methods attempt to reduce vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill and decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by ~75% in the prefill stage. Meanwhile, throughout the entire generation process, Dynamic-LLaVA reduces computation consumption by ~50% when decoding without KV cache and saves ~50% GPU memory overhead when decoding with KV cache, thanks to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation, or even performance gains, compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava.
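The prefill-stage idea described in the abstract, dropping redundant vision tokens before decoding, can be sketched minimally as follows. This is an illustrative top-k selection in NumPy under assumed shapes, with hypothetical importance scores; the paper learns these scores with a dedicated module (see the linked repository):

```python
import numpy as np

def sparsify_vision_context(tokens, scores, keep_ratio=0.25):
    """Keep only the highest-scoring vision tokens before the LLM prefill.

    tokens: (N, D) vision token embeddings; scores: (N,) importance scores
    (hypothetical here -- Dynamic-LLaVA predicts them with a learned module).
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])  # top-k, original order kept
    return tokens[keep_idx]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))   # e.g., a 24x24 grid of vision patches
scores = rng.random(576)
pruned = sparsify_vision_context(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (144, 64): ~75% of the vision context removed
```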

ICLR Conference 2025 Conference Paper

Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-Resolution

  • Simiao Li
  • Yun Zhang
  • Wei Li 0002
  • Hanting Chen
  • Wenjia Wang
  • Bingyi Jing
  • Shaohui Lin
  • Jie Hu 0021

Knowledge distillation (KD) is a promising yet challenging model compression approach that transmits rich learning representations from robust but resource-demanding teacher models to efficient student models. Previous methods for image super-resolution (SR) are often tailored to specific teacher-student architectures, limiting their potential for improvement and hindering broader applications. This work presents a novel KD framework for SR models, the multi-granularity Mixture of Priors Knowledge Distillation (MiPKD), which can be universally applied to a wide range of architectures at both feature and block levels. The teacher’s knowledge is effectively integrated with the student's feature via the Feature Prior Mixer, and the reconstructed feature propagates dynamically in the training phase with the Block Prior Mixer. Extensive experiments illustrate the significance of the proposed MiPKD technique.

AAAI Conference 2025 Conference Paper

Probability-Density-aware Semi-supervised Learning

  • Shuyang Liu
  • Ruiqiu Zheng
  • Yunhang Shen
  • Zhou Yu
  • Ke Li
  • Xing Sun
  • Shaohui Lin

In semi-supervised learning (SSL), the cluster assumption is widely accepted: features in different high-density regions belong to different categories. However, existing algorithms rarely exploit this prior directly, and it lacks a mathematical explanation. This paper first proposes a theorem that statistically explains the cluster assumption and proves that probability density can significantly help to exploit the prior fully. Based on the theorem, a Probability-Density-Aware Measure (PM) is proposed to discern the similarity between neighboring points. The PM is deployed to improve Label Propagation, and a new pseudo-labeling algorithm, Probability-Density-Aware Label Propagation (PMLP), is proposed. We also prove that traditional first-order similarity pseudo-labeling can be viewed as a particular case of PMLP, which provides a comprehensive theoretical understanding of PMLP's superior performance. Extensive experiments demonstrate that PMLP achieves outstanding performance compared with other recent methods.
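For context, plain first-order label propagation, the special case the abstract says PMLP generalizes, can be sketched as below. This is a standard Zhou-style propagation in NumPy; PMLP's density-aware reweighting itself is not reproduced here:

```python
import numpy as np

def label_propagation(W, Y, alpha=0.99, iters=100):
    """Classic first-order label propagation on an affinity graph.

    W: (n, n) symmetric affinity matrix (zero diagonal);
    Y: (n, c) one-hot rows for labeled points, zero rows for unlabeled.
    PMLP would additionally reweight W with a probability-density-aware
    measure between neighboring points before propagating.
    """
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))            # symmetric normalization
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y  # diffuse labels, clamp seeds
    return F.argmax(axis=1)

# Two tight clusters, one labeled seed per cluster.
W = np.array([[0, 1, .01, .01], [1, 0, .01, .01],
              [.01, .01, 0, 1], [.01, .01, 1, 0]])
Y = np.zeros((4, 2)); Y[0, 0] = 1; Y[2, 1] = 1
print(label_propagation(W, Y))  # [0 0 1 1]
```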

AAAI Conference 2024 Conference Paper

AQ-DETR: Low-Bit Quantized Detection Transformer with Auxiliary Queries

  • Runqi Wang
  • Huixin Sun
  • Linlin Yang
  • Shaohui Lin
  • Chuanjian Liu
  • Yan Gao
  • Yao Hu
  • Baochang Zhang

DEtection TRansformer (DETR)-based models have achieved remarkable performance. However, they are accompanied by a large computational overhead, which significantly prevents their application on resource-limited devices. Prior arts attempt to reduce the computational burden of DETR using low-bit quantization, but these methods suffer a severe performance drop under low-bit weight-activation-attention quantization. We observe that the number of matching queries and positive samples strongly affects the representation capacity of queries in DETR, and quantizing the queries further reduces this capacity, leading to a severe performance drop. We introduce a new quantization strategy based on Auxiliary Queries for DETR (AQ-DETR), aiming to enhance the capacity of quantized queries. In addition, a layer-by-layer distillation is proposed to reduce the quantization error between the quantized attention and its full-precision counterpart. Through extensive experiments on large-scale open datasets, the performance of 4-bit quantized DETR and Deformable DETR models is comparable to their full-precision counterparts.
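As background for the quantization setting, a symmetric uniform low-bit quantizer, the basic operation such methods build on (not AQ-DETR's auxiliary-query scheme itself), looks roughly like this:

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Symmetric uniform quantization: map floats to signed `bits`-bit
    integer codes and back. Low-bit DETR methods apply variants of this
    (with learned scales) to weights, activations, and attention."""
    qmax = 2 ** (bits - 1) - 1                         # 7 for 4-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q * scale, q                                # dequantized, codes

x = np.linspace(-1.0, 1.0, 100)
deq, q = quantize_uniform(x, bits=4)
print(int(q.min()), int(q.max()))  # -7 7
```

The rounding introduces an error of at most half a quantization step, which is exactly the information loss the paper's layer-by-layer distillation tries to compensate for.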

NeurIPS Conference 2024 Conference Paper

CLIP in Mirror: Disentangling text from visual images through reflection

  • Tiancheng Wang
  • Yuguang Yang
  • Linlin Yang
  • Shaohui Lin
  • Juan Zhang
  • Guodong Guo
  • Baochang Zhang

The CLIP network excels in various tasks but struggles with text-visual images, i.e., images that contain both text and visual objects; it risks confusing textual and visual representations. To address this issue, we propose MirrorCLIP, a zero-shot framework which disentangles the image features of CLIP by exploiting the difference in the mirror effect between visual objects and text in images. Specifically, MirrorCLIP takes both original and flipped images as inputs, comparing their features dimension-wise in the latent space to generate disentangling masks. With the disentangling masks, we further design filters to separate textual and visual factors more precisely, and then obtain disentangled representations. Qualitative experiments using stable diffusion models and class activation mapping (CAM) validate the effectiveness of our disentanglement. Moreover, our proposed MirrorCLIP reduces confusion when encountering text-visual images and achieves a substantial improvement on typographic defense, further demonstrating its superior disentanglement ability. Our code is available at https://github.com/tcwangbuaa/MirrorCLIP.

AAAI Conference 2024 Conference Paper

Kumaraswamy Wavelet for Heterophilic Scene Graph Generation

  • Lianggangxu Chen
  • Youqi Song
  • Shaohui Lin
  • Changbo Wang
  • Gaoqi He

Graph neural networks (GNNs) have demonstrated their capabilities in the field of scene graph generation (SGG) by updating node representations from neighboring nodes. This updating can be viewed as a form of low-pass filtering in the spatial domain, which smooths node feature representations and retains commonalities among nodes. However, spatial GNNs do not work well in the case of heterophilic SGG, in which fine-grained predicates are always connected to a large number of coarse-grained predicates. Blind smoothing undermines the discriminative information of the fine-grained predicates, resulting in failure to predict them accurately. To address the heterophily, our key idea is to design tailored filters via wavelet transforms in the spectral domain. First, we prove rigorously that as the heterophily on the scene graph increases, the spectral energy gradually shifts towards the high-frequency part. Inspired by this observation, we subsequently propose the Kumaraswamy Wavelet Graph Neural Network (KWGNN). KWGNN leverages complementary multi-group Kumaraswamy wavelets to cover all frequency bands. Finally, KWGNN adaptively generates band-pass filters and then integrates the filtering results to better accommodate varying levels of smoothness on the graph. Comprehensive experiments on the Visual Genome and Open Images datasets show that our method achieves state-of-the-art performance.

IJCAI Conference 2024 Conference Paper

Rethinking Centered Kernel Alignment in Knowledge Distillation

  • Zikai Zhou
  • Yunhang Shen
  • Shitong Shao
  • Linrui Gong
  • Shaohui Lin

Knowledge distillation has emerged as a highly effective method for bridging the representation discrepancy between large-scale models and lightweight models. Prevalent approaches involve leveraging appropriate metrics to minimize the divergence or distance between the knowledge extracted from the teacher model and the knowledge learned by the student model. Centered Kernel Alignment (CKA) is widely used to measure representation similarity and has been applied in several knowledge distillation methods. However, these methods are complex and fail to uncover the essence of CKA, thus not answering the question of how to properly use CKA to achieve simple and effective distillation. This paper first provides a theoretical perspective to illustrate the effectiveness of CKA, which decouples CKA into the upper bound of Maximum Mean Discrepancy (MMD) and a constant term. Drawing from this, we propose a novel Relation-Centered Kernel Alignment (RCKA) framework, which practically establishes a connection between CKA and MMD. Furthermore, we dynamically customize the application of CKA based on the characteristics of each task, with less computational cost yet comparable performance to previous methods. Extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs for image classification and object detection, validating the effectiveness of our approach. Our code is available at https://github.com/Klayand/PCKA.
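The CKA similarity at the center of this paper has a compact closed form. A minimal linear-kernel version (an illustrative sketch of the metric itself, not the RCKA framework) can be written as:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices.

    X: (n, d1), Y: (n, d2) -- n examples; feature dimensions may differ,
    so a teacher and a student layer can be compared directly.
    """
    X = X - X.mean(axis=0)                      # center each feature over the batch
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(128, 512))           # batch of teacher features
print(round(linear_cka(teacher, teacher), 4))   # 1.0: identical representations
```

A useful property for distillation is that the score is invariant to orthogonal transforms and isotropic scaling of either representation, so it measures relational structure rather than raw coordinates.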

AAAI Conference 2024 Conference Paper

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

  • Yunchen Li
  • Zhou Yu
  • Gaoqi He
  • Yunhang Shen
  • Ke Li
  • Xing Sun
  • Shaohui Lin

Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on E(X|y), with y a vector and X an SPD matrix. However, these methods struggle to handle large-scale data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate E(X|y). Moreover, our model can estimate p(X) unconditionally and flexibly without a given y. On the one hand, the model conditionally learns p(X|y) and utilizes the mean of samples to obtain E(X|y) as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data p(X) and generates samples that conform to this distribution. Furthermore, we propose a new SPD net that is much deeper than previous networks and allows for the inclusion of conditional factors. Experimental results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and conditionally.

AAAI Conference 2024 Conference Paper

Weakly Supervised Open-Vocabulary Object Detection

  • Jianghang Lin
  • Yunhang Shen
  • Bingquan Wang
  • Shaohui Lin
  • Ke Li
  • Liujuan Cao

Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies: dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic Segment Anything Model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).

AAAI Conference 2023 Conference Paper

Adaptive Hierarchy-Branch Fusion for Online Knowledge Distillation

  • Linrui Gong
  • Shaohui Lin
  • Baochang Zhang
  • Yunhang Shen
  • Ke Li
  • Ruizhi Qiao
  • Bo Ren
  • Muqing Li

Online Knowledge Distillation (OKD) is designed to alleviate the dilemma that a high-capacity pre-trained teacher model is not available. However, existing methods mostly focus on improving the ensemble prediction accuracy from multiple students (a.k.a. branches) and often overlook the homogenization problem that makes the student model saturate quickly and hurts performance. We assume that the intrinsic bottleneck behind the homogenization problem comes from the identical branch architecture and the coarse ensemble strategy. We propose a novel Adaptive Hierarchy-Branch Fusion framework for Online Knowledge Distillation, termed AHBF-OKD, which designs hierarchical branches and an adaptive hierarchy-branch fusion module to boost model diversity and aggregate complementary knowledge. Specifically, we first introduce hierarchical branch architectures to construct diverse peers by monotonously increasing the depth of branches on the basis of the target branch. To effectively transfer knowledge from the most complex branch to the simplest target branch, we propose an adaptive hierarchy-branch fusion module that creates hierarchical teacher assistants recursively, regarding the target branch as the smallest teacher assistant. During training, the teacher assistant from the previous hierarchy is explicitly distilled by the teacher assistant and the branch from the current hierarchy. Thus, importance scores are effectively and adaptively allocated to different branches to reduce branch homogenization. Extensive experiments demonstrate the effectiveness of AHBF-OKD on different datasets, including CIFAR-10/100 and ImageNet 2012. For example, on ImageNet 2012, the distilled ResNet-18 achieves a Top-1 error of 29.28%, which significantly outperforms the state-of-the-art methods. The source code is available at https://github.com/linruigong965/AHBF.

AAAI Conference 2023 Conference Paper

Explicit Invariant Feature Induced Cross-Domain Crowd Counting

  • Yiqing Cai
  • Lianggangxu Chen
  • Haoyue Guan
  • Shaohui Lin
  • Changhong Lu
  • Changbo Wang
  • Gaoqi He

Cross-domain crowd counting has shown progressively improved performance. However, most methods fail to explicitly consider the transferability of different features between source and target domains. In this paper, we propose an innovative explicit Invariant Feature induced Cross-domain Knowledge Transformation framework to address the inconsistent domain-invariant features of different domains. The main idea is to explicitly extract domain-invariant features from both source and target domains, which builds a bridge to transfer richer knowledge between the two domains. The framework consists of three parts: global feature decoupling (GFD), relation exploration and alignment (REA), and graph-guided knowledge enhancement (GKE). In the GFD module, domain-invariant features are efficiently decoupled from domain-specific ones in the two domains, which allows the model to distinguish crowd features from backgrounds in complex scenes. In the REA module, both an inter-domain relation graph (Inter-RG) and an intra-domain relation graph (Intra-RG) are built. Specifically, Inter-RG aggregates multi-scale domain-invariant features between the two domains and further aligns local-level invariant features. Intra-RG preserves task-related specific information to assist the domain alignment. Furthermore, the GKE strategy models the confidence of pseudo-labels to further enhance the adaptability of the target domain. Various experiments show our method achieves state-of-the-art performance on the standard benchmarks. Code is available at https://github.com/caiyiqing/IF-CKT.

AAAI Conference 2022 Conference Paper

Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection

  • Chengwei Chen
  • Yuan Xie
  • Shaohui Lin
  • Angela Yao
  • Guannan Jiang
  • Wei Zhang
  • Yanyun Qu
  • Ruizhi Qiao

Video anomaly detection aims to automatically identify unusual objects or behaviours by learning from normal videos. Previous methods tend to use simplistic reconstruction or prediction constraints, which leads to the insufficiency of learned representations for normal data. As such, we propose a novel bi-directional architecture with three consistency constraints to comprehensively regularize the prediction task at the pixel-wise, cross-modal, and temporal-sequence levels. First, predictive consistency is proposed to consider the symmetry property of motion and appearance in forward and backward time, which ensures highly realistic appearance and motion predictions at the pixel-wise level. Second, association consistency considers the relevance between different modalities and uses one modality to regularize the prediction of another. Finally, temporal consistency utilizes the relationship of the video sequence and ensures that the predictive network generates temporally consistent frames. During inference, the pattern of abnormal frames is unpredictable and will therefore cause higher prediction errors. Experiments show that our method outperforms advanced anomaly detectors and achieves state-of-the-art results on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.

ICML Conference 2022 Conference Paper

Self-supervised Models are Good Teaching Assistants for Vision Transformers

  • Haiyan Wu
  • Yuting Gao
  • Yinqi Zhang
  • Shaohui Lin
  • Yuan Xie 0006
  • Xing Sun 0001
  • Ke Li 0015

Transformers have shown remarkable progress on computer vision tasks in the past year. Compared to their CNN counterparts, transformers usually need the help of distillation to achieve comparable results on middle- or small-sized datasets. Meanwhile, recent research discovers that when transformers are trained in supervised and self-supervised manners respectively, the captured patterns are quite different both qualitatively and quantitatively. These findings motivate us to introduce a self-supervised teaching assistant (SSTA) besides the commonly used supervised teacher to improve the performance of transformers. Specifically, we propose a head-level knowledge distillation method that selects the most important head of the supervised teacher and of the self-supervised teaching assistant, and lets the student mimic the attention distributions of these two heads, so as to make the student focus on the relationships between tokens deemed important by the teacher and the teaching assistant. Extensive experiments verify the effectiveness of SSTA and demonstrate that the proposed SSTA is a good complement to the supervised teacher. Meanwhile, some analytical experiments from multiple perspectives (e.g., prediction, shape bias, robustness, and transferability to downstream tasks) with supervised teachers, self-supervised teaching assistants, and students are instructive and may inspire future research.

IJCAI Conference 2021 Conference Paper

Learn from Concepts: Towards the Purified Memory for Few-shot Learning

  • Xuncheng Liu
  • Xudong Tian
  • Shaohui Lin
  • Yanyun Qu
  • Lizhuang Ma
  • Wang Yuan
  • Zhizhong Zhang
  • Yuan Xie

Human beings have a great generalization ability, recognizing a novel category after seeing only a few samples. This is because humans can learn from the concepts that already exist in their minds. However, many existing few-shot approaches fail to address this fundamental problem, i.e., how to utilize the knowledge learned in the past to improve prediction for the new task. In this paper, we present a novel purified memory mechanism that simulates the recognition process of human beings. This new memory updating scheme enables the model to purify the information from semantic labels and progressively learn consistent, stable, and expressive concepts as episodes are trained one by one. On this basis, a Graph Augmentation Module (GAM) is introduced to aggregate these concepts and the knowledge learned from new tasks via a graph neural network, making the prediction more accurate. Our approach is model-agnostic and computationally efficient with negligible memory cost. Extensive experiments performed on several benchmarks demonstrate that the proposed method consistently outperforms a vast number of state-of-the-art few-shot learning methods.

IJCAI Conference 2021 Conference Paper

Novelty Detection via Contrastive Learning with Negative Data Augmentation

  • Chengwei Chen
  • Yuan Xie
  • Shaohui Lin
  • Ruizhi Qiao
  • Jian Zhou
  • Xin Tan
  • Yi Zhang
  • Lizhuang Ma

Novelty detection is the process of determining whether a query example differs from the learned training distribution. Previous methods based on generative adversarial networks and self-supervised approaches suffer from unstable training, mode dropping, and low discriminative ability. We overcome such problems by introducing a novel decoder-encoder framework. Firstly, a generative network (decoder) learns the representation by mapping an initialized latent vector to an image. In particular, this vector is initialized by considering the entire distribution of the training data to avoid the mode-dropping problem. Secondly, a contrastive network (encoder) aims to ``learn to compare'' through mutual information estimation, which directly helps the generative network obtain a more discriminative representation by using a negative data augmentation strategy. Extensive experiments show that our model has significant superiority over cutting-edge novelty detectors and achieves new state-of-the-art results on various novelty detection benchmarks, e.g., CIFAR-10 and DCASE. Moreover, our model is more stable to train in a non-adversarial manner, compared to other adversarial-based novelty detection methods.

IJCAI Conference 2021 Conference Paper

Towards Compact Single Image Super-Resolution via Contrastive Self-distillation

  • Yanbo Wang
  • Shaohui Lin
  • Yanyun Qu
  • Haiyan Wu
  • Zhizhong Zhang
  • Yuan Xie
  • Angela Yao

Convolutional neural networks (CNNs) are highly successful for super-resolution (SR) but often require sophisticated architectures whose heavy memory cost and computational overhead significantly restrict their practical deployment on resource-limited devices. In this paper, we propose a novel contrastive self-distillation (CSD) framework to simultaneously compress and accelerate various off-the-shelf SR models. In particular, a channel-splitting super-resolution network is first constructed from a target teacher network as a compact student network. Then, we propose a novel contrastive loss to improve the quality of SR images and PSNR/SSIM via explicit knowledge transfer. Extensive experiments demonstrate that the proposed CSD scheme effectively compresses and accelerates several standard SR models such as EDSR, RCAN, and CARN. Code is available at https://github.com/Booooooooooo/CSD.
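The channel-splitting construction mentioned in the abstract can be sketched as simple weight slicing. This is a toy NumPy version under an assumed (out, in, k, k) weight layout; the real CSD additionally trains teacher and student jointly with the contrastive loss:

```python
import numpy as np

def channel_split_student(teacher_weights, ratio=0.5):
    """Slice a compact student out of a teacher's conv weights by keeping
    the first `ratio` fraction of channels in every layer (a simplified
    illustration of channel splitting, not the full CSD training scheme)."""
    student = []
    in_keep = teacher_weights[0].shape[1]      # keep all input image channels
    for i, w in enumerate(teacher_weights):
        out_keep = max(1, int(w.shape[0] * ratio))
        if i == len(teacher_weights) - 1:
            out_keep = w.shape[0]              # final layer keeps full output channels
        student.append(w[:out_keep, :in_keep].copy())
        in_keep = out_keep
    return student

rng = np.random.default_rng(0)
teacher = [rng.normal(size=(64, 3, 3, 3)),     # 3-layer toy SR backbone
           rng.normal(size=(64, 64, 3, 3)),
           rng.normal(size=(3, 64, 3, 3))]
student = channel_split_student(teacher, ratio=0.5)
print([w.shape for w in student])  # [(32, 3, 3, 3), (32, 32, 3, 3), (3, 32, 3, 3)]
```

Because the student's channels are a subset of the teacher's, the two networks share parameters and can run as one model at two widths.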

IJCAI Conference 2018 Conference Paper

Accelerating Convolutional Networks via Global & Dynamic Filter Pruning

  • Shaohui Lin
  • Rongrong Ji
  • Yuchao Li
  • Yongjian Wu
  • Feiyue Huang
  • Baochang Zhang

Accelerating convolutional neural networks has recently received ever-increasing research focus. Among the various approaches proposed in the literature, filter pruning has been regarded as a promising solution due to its advantage in significant speedup and memory reduction of both the network model and intermediate feature maps. Most approaches tend to prune filters in a layer-wise, fixed manner, which is incapable of dynamically recovering previously removed filters or jointly optimizing the pruned network across layers. In this paper, we propose a novel global & dynamic pruning (GDP) scheme to prune redundant filters for CNN acceleration. In particular, GDP first globally prunes unsalient filters across all layers via a global discriminative function based on prior knowledge of the filters. Second, it dynamically updates filter saliency over the pruned sparse network and recovers mistakenly pruned filters, followed by a retraining phase to improve model accuracy. Specifically, we effectively solve the corresponding non-convex optimization problem of the proposed GDP via stochastic gradient descent with greedy alternative updating. Extensive experiments show that, compared to state-of-the-art filter pruning methods, the proposed approach achieves superior performance in accelerating several cutting-edge CNNs on the ILSVRC 2012 benchmark.
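A minimal sketch of the global part of this idea, scoring all filters across layers against one shared threshold, with plain L1 saliency as a stand-in for the paper's learned global discriminative function:

```python
import numpy as np

def global_filter_mask(conv_weights, prune_ratio=0.5):
    """Build 0/1 saliency masks over all filters of all layers.

    conv_weights: list of arrays shaped (out_channels, in_channels, k, k).
    Saliency here is the filter L1 norm (an illustrative stand-in); the
    smallest filters across ALL layers are masked out, so layers compete
    for one global pruning budget instead of being pruned layer-by-layer.
    """
    saliency = np.concatenate([np.abs(w).sum(axis=(1, 2, 3)) for w in conv_weights])
    n_prune = int(len(saliency) * prune_ratio)
    threshold = np.sort(saliency)[n_prune]        # single global threshold
    masks, start = [], 0
    for w in conv_weights:
        s = saliency[start:start + w.shape[0]]
        masks.append((s >= threshold).astype(np.float32))
        start += w.shape[0]
    return masks   # masks can be recomputed later, letting pruned filters return

rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 3, 3, 3)), rng.normal(size=(32, 16, 3, 3))]
masks = global_filter_mask(layers, prune_ratio=0.5)
print(sum(int(m.sum()) for m in masks))  # 24 filters kept out of 48
```

Recomputing the masks during training is what makes the scheme "dynamic": a filter masked in one iteration can regain saliency and be restored later.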

AAAI Conference 2017 Conference Paper

ESPACE: Accelerating Convolutional Neural Networks via Eliminating Spatial and Channel Redundancy

  • Shaohui Lin
  • Rongrong Ji
  • Chao Chen
  • Feiyue Huang

Recent years have witnessed an extensive popularity of convolutional neural networks (CNNs) in various computer vision and artificial intelligence applications. However, the performance gains have come at the cost of substantially intensive computational complexity, which prohibits their usage in resource-limited applications like mobile or embedded devices. While increasing attention has been paid to accelerating the internal network structure, the redundancy of the visual input is rarely considered. In this paper, we make the first attempt to reduce spatial and channel redundancy directly from the visual input for CNN acceleration. The proposed method, termed ESPACE (Elimination of SPAtial and Channel rEdundancy), works in three steps: First, the 3D channel redundancy of convolutional layers is reduced by a set of low-rank approximations of convolutional filters. Second, a novel mask-based selective processing scheme is proposed, which further speeds up the convolution operations by skipping unsalient spatial locations of the visual input. Third, the accelerated network is fine-tuned using the training data via back-propagation. The proposed method is evaluated on ImageNet 2012 with implementations on two widely-adopted CNNs, i.e., AlexNet and GoogLeNet. In comparison to several recent methods of CNN acceleration, the proposed scheme demonstrates new state-of-the-art acceleration performance, with 5.48× and 4.12× speedup on AlexNet and GoogLeNet, respectively, with a minimal decrease in classification accuracy.

IJCAI Conference 2016 Conference Paper

Towards Convolutional Neural Networks Compression via Global Error Reconstruction

  • Shaohui Lin
  • Rongrong Ji
  • Xiaowei Guo
  • Xuelong Li

In recent years, convolutional neural networks (CNNs) have achieved remarkable success in various applications such as image classification, object detection, object parsing, and face alignment. Such CNN models are extremely powerful in dealing with massive amounts of training data by using millions and billions of parameters. However, these models are typically deficient due to the heavy cost of model storage, which prohibits their usage in resource-limited applications like mobile or embedded devices. In this paper, we target compressing CNN models to an extreme without significantly losing their discriminability. Our main idea is to explicitly model the output reconstruction error between the original and compressed CNNs, which is minimized to pursue a satisfactory rate-distortion trade-off after compression. In particular, a global error reconstruction method termed GER is presented, which first leverages an SVD-based low-rank approximation to coarsely compress the parameters in the fully connected layers in a layer-wise manner. Subsequently, such layer-wise initial compressions are jointly optimized from a global perspective via back-propagation. The proposed GER method is evaluated on the ILSVRC2012 image classification benchmark, with implementations on two widely-adopted convolutional neural networks, i.e., AlexNet and VGGNet-19. Compared to several state-of-the-art and alternative methods of CNN compression, the proposed scheme demonstrates the best rate-distortion performance on both networks.