Arrow Research search

Author name cluster

Shan You

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers

26

AAAI Conference 2026 Conference Paper

Injection Without Distortion: Geometrically Constrained Knowledge Enhancement for Vision-Language Models

  • Zhongze Wu
  • Xiu Su
  • Feng Yang
  • Shan You
  • Jun Long
  • Yueyi Luo

Vision-Language Models (VLMs) are widely used in tasks like Open-Vocabulary Object Detection and zero-shot Classification, owing to their powerful generalization. However, recent research reveals that VLMs exhibit significant performance instability when tasked with recognizing concepts at varying granularities (e.g., ``animal'' vs. ``dog''). Prevailing methods inject external knowledge from Large Language Models, but this unconstrained approach distorts the VLM's inherent hierarchical orthogonal geometry, leading to performance collapse on general concepts. To address this, we introduce GeCoin, an innovative Geometrically Constrained framework that safely enhances existing VLMs with external knowledge for improved hierarchical understanding, without additional training. By projecting knowledge into the null-space of a query concept's feature space, GeCoin mathematically guarantees the preservation of general knowledge while integrating specialized information. Extensive experiments across large-scale benchmarks, diverse VLMs, and knowledge from various LLMs (e.g., GPT-3.5, Claude-3, Gemini-Pro) show that GeCoin boosts performance by an average of 3.9% over the strongest baseline—crucially eradicating performance collapse on general concepts.

ICML Conference 2025 Conference Paper

Adversarial Robustness via Deformable Convolution with Stochasticity

  • Yanxiang Ma
  • Zixuan Huang
  • Minjing Dong
  • Shan You
  • Chang Xu 0002

Random defense represents a promising strategy to protect neural networks from adversarial attacks. Most of these methods enhance robustness by injecting randomness into the data, increasing uncertainty for attackers. However, this randomness could reduce the generalization capacity of defense, as defense performance could be sensitive to the hyperparameters of noise added to the data, making it difficult to generalize across different datasets. Additionally, the involvement of randomness always comes with a reduction of natural accuracy, which leads to a delicate trade-off between them, which is seldom studied in random defense. In this work, we propose incorporating randomness into the network structure instead of data input by designing stochastic deformable convolution, where a random mask replaces the convolutional offset. This process promotes data independence, enhancing generalization across datasets. To study the trade-off, we conduct a theoretical analysis of both robust and clean accuracy, from a perspective of gradient cosine similarity and natural inference. Based on the analysis, we reformulate the adversarial training in our random defense framework. Extensive experiments show that our method achieves SOTA adversarial robustness and clean accuracy compared with other random defense methods.

ICLR Conference 2025 Conference Paper

Learning Mask Invariant Mutual Information for Masked Image Modeling

  • Tao Huang 0020
  • Yanxiang Ma
  • Shan You
  • Chang Xu 0002

Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.

ICML Conference 2025 Conference Paper

Stable Fair Graph Representation Learning with Lipschitz Constraint

  • Qiang Chen 0016
  • Zhongze Wu
  • Xiu Su
  • Xi Lin 0003
  • Zhe Qu
  • Shan You
  • Shuo Yang 0006
  • Chang Xu 0002

Group fairness based on adversarial training has gained significant attention on graph data, which was implemented by masking sensitive attributes to generate fair feature views. However, existing models suffer from training instability due to uncertainty of the generated masks and the trade-off between fairness and utility. In this work, we propose a stable fair Graph Neural Network (SFG) to maintain training stability while preserving accuracy and fairness performance. Specifically, we first theoretically derive a tight upper Lipschitz bound to control the stability of existing adversarial-based models and employ a stochastic projected subgradient algorithm to constrain the bound, which operates in a block-coordinate manner. Additionally, we construct the uncertainty set to train the model, which can prevent unstable training by dropping some overfitting nodes caused by chasing fairness. Extensive experiments conducted on three real-world datasets demonstrate that SFG is stable and outperforms other state-of-the-art adversarial-based methods in terms of both fairness and utility performance. Codes are available at https: //github. com/sh-qiangchen/SFG.

ICML Conference 2024 Conference Paper

Detecting Any instruction-to-answer interaction relationship: Universal Instruction-to-Answer Navigator for Med-VQA

  • Zhongze Wu
  • Hongyan Xu 0002
  • Yitian Long
  • Shan You
  • Xiu Su
  • Jun Long
  • Yueyi Luo
  • Chang Xu 0002

Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions for precise diagnostics, yet faces challenges due to diverse, inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, facilitating the understanding of visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and mark the core part of instructions with "real intent" labels. The IAI-Med VQA dataset, produced using IAI, is now publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring answers are traceable and learnable. We also implement intention-guided attention to minimize non-core instruction interference, sharpening focus on ’real intent’. Extensive experiments on SLAKE datasets show Uni-Med’s superior accuracies (87. 52% closed, 86. 12% overall), outperforming MedVInT-PMC-VQA by 1. 22% and 0. 92%. Code and dataset are available at: https: //github. com/zhongzee/Uni-Med-master.

AAAI Conference 2023 Conference Paper

Boosting Semi-Supervised Semantic Segmentation with Probabilistic Representations

  • Haoyu Xie
  • Changqi Wang
  • Mingkai Zheng
  • Minjing Dong
  • Shan You
  • Chong Fu
  • Chang Xu

Recent breakthroughs in semi-supervised semantic segmentation have been developed through contrastive learning. In prevalent pixel-wise contrastive learning solutions, the model maps pixels to deterministic representations and regularizes them in the latent space. However, there exist inaccurate pseudo-labels which map the ambiguous representations of pixels to the wrong classes due to the limited cognitive ability of the model. In this paper, we define pixel-wise representations from a new perspective of probability theory and propose a Probabilistic Representation Contrastive Learning (PRCL) framework that improves representation quality by taking its probability into consideration. Through modelling the mapping from pixels to representations as the probability via multivariate Gaussian distributions, we can tune the contribution of the ambiguous representations to tolerate the risk of inaccurate pseudo-labels. Furthermore, we define prototypes in the form of distributions, which indicates the confidence of a class, while the point prototype cannot. More- over, we propose to regularize the distribution variance to enhance the reliability of representations. Taking advantage of these benefits, high-quality feature representations can be derived in the latent space, thereby the performance of se- mantic segmentation can be further improved. We conduct sufficient experiment to evaluate PRCL on Pascal VOC and CityScapes to demonstrate its superiority. The code is available at https://github.com/Haoyu-Xie/PRCL.

NeurIPS Conference 2023 Conference Paper

Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models

  • Yichao Cao
  • Qingfei Tang
  • Xiu Su
  • Song Chen
  • Shan You
  • Xiaobo Lu
  • Chang Xu

Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting triplets, and serving as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open world context. This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed as UniHOI. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize a LLM (i. e. GPT) for interaction interpretation, generating a richer linguistic understanding for complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods with a substantial margin, under both supervised and zero-shot settings. The code and pre-trained weights will be made publicly available.

NeurIPS Conference 2023 Conference Paper

Knowledge Diffusion for Distillation

  • Tao Huang
  • Yuan Zhang
  • Mingkai Zheng
  • Shan You
  • Fei Wang
  • Chen Qian
  • Chang Xu

The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noises than teacher features due to the smaller capacity of student model. To address this, we propose to denoise student features using a diffusion model trained by teacher features. This allows us to perform better distillation between the refined clean feature and teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at https: //github. com/hunto/DiffKD.

ICLR Conference 2023 Conference Paper

Masked Distillation with Receptive Tokens

  • Tao Huang 0020
  • Yuan Zhang 0020
  • Shan You
  • Fei Wang 0032
  • Chen Qian 0006
  • Jian Cao 0002
  • Chang Xu 0002

Distilling from the feature maps can be fairly effective for dense prediction tasks since both the feature discriminability and localization information can be well transferred. However, not every pixel contributes equally to the performance, and a good student should learn from what really matters to the teacher. In this paper, we introduce a learnable embedding dubbed receptive token to locate the pixels of interests (PoIs) in the feature map, with a distillation mask generated via pixel-wise attention. Then the masked distillation will be performed via the pixel-wise reconstruction. In this way, a distillation mask refers to a pattern of pixel dependencies. We thus adopt multiple receptive tokens to investigate more sophisticated and informative pixel dependencies within feature maps to enhance the distillation. To obtain a group of masks, the receptive tokens are learned via the regular task loss but with teacher fixed, and we also leverage a Dice loss to enrich the diversity of obtained masks. Our method dubbed MasKD is simple and practical, and needs no priors of ground-truth labels, which can apply to various dense prediction tasks. Experiments show that our MasKD can achieve state-of-the-art performance consistently on object detection and semantic segmentation benchmarks.

AAAI Conference 2023 Conference Paper

Neural Architecture Search for Wide Spectrum Adversarial Robustness

  • Zhi Cheng
  • Yanxi Li
  • Minjing Dong
  • Xiu Su
  • Shan You
  • Chang Xu

One major limitation of CNNs is that they are vulnerable to adversarial attacks. Currently, adversarial robustness in neural networks is commonly optimized with respect to a small pre-selected adversarial noise strength, causing them to have potentially limited performance when under attack by larger adversarial noises in real-world scenarios. In this research, we aim to find Neural Architectures that have improved robustness on a wide range of adversarial noise strengths through Neural Architecture Search. In detail, we propose a lightweight Adversarial Noise Estimator to reduce the high cost of generating adversarial noise with respect to different strengths. Besides, we construct an Efficient Wide Spectrum Searcher to reduce the cost of adjusting network architecture with the large adversarial validation set during the search. With the two components proposed, the number of adversarial noise strengths searched can be increased significantly while having a limited increase in search time. Extensive experiments on benchmark datasets such as CIFAR and ImageNet demonstrate that with a significantly richer search signal in robustness, our method can find architectures with improved overall robustness while having a limited impact on natural accuracy and around 40% reduction in search time compared with the naive approach of searching. Codes available at: https://github.com/zhicheng2T0/Wsr-NAS.git

NeurIPS Conference 2022 Conference Paper

Green Hierarchical Vision Transformer for Masked Image Modeling

  • Lang Huang
  • Shan You
  • Mingkai Zheng
  • Fei Wang
  • Chen Qian
  • Toshihiko Yamasaki

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w. r. t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i. e. , the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e. g. , Swin Transformer and Twins Transformer, about 2. 7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks.

NeurIPS Conference 2022 Conference Paper

Knowledge Distillation from A Stronger Teacher

  • Tao Huang
  • Shan You
  • Fei Wang
  • Chen Qian
  • Chang Xu

Unlike existing knowledge distillation methods focus on the baseline settings, where the teacher models and training strategies are not that strong and competing as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy of predictions between the student and a stronger teacher may tend to be fairly severer. As a result, the exact match of predictions in KL divergence would disturb the training and make existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of teacher and student would suffice, and propose a correlation-based loss to capture the intrinsic inter-class relations from the teacher explicitly. Besides, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes and training strategies, and can achieve state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at: https: //github. com/hunto/DIST_KD.

ICLR Conference 2022 Conference Paper

Relational Surrogate Loss Learning

  • Tao Huang 0020
  • Zekang Li
  • Hua Lu 0017
  • Yong Shan
  • Shusheng Yang
  • Yang Feng
  • Fei Wang 0032
  • Shan You

Evaluation metrics in machine learning are often hardly taken as loss functions, as they could be non-differentiable and non-decomposable, e.g., average precision and F1 score. This paper aims to address this problem by revisiting the surrogate loss learning, where a deep neural network is employed to approximate the evaluation metrics. Instead of pursuing an exact recovery of the evaluation metric through a deep neural network, we are reminded of the purpose of the existence of these evaluation metrics, which is to distinguish whether one model is better or worse than another. In this paper, we show that directly maintaining the relation of models between surrogate losses and metrics suffices, and propose a rank correlation-based optimization method to maximize this relation and learn surrogate losses. Compared to previous works, our method is much easier to optimize and enjoys significant efficiency and performance gains. Extensive experiments show that our method achieves improvements on various tasks including image classification and neural machine translation, and even outperforms state-of-the-art methods on human pose estimation and machine reading comprehension tasks. Code is available at: https://github.com/hunto/ReLoss.

NeurIPS Conference 2022 Conference Paper

Searching for Better Spatio-temporal Alignment in Few-Shot Action Recognition

  • Yichao Cao
  • Xiu Su
  • Qingfei Tang
  • Shan You
  • Xiaobo Lu
  • Chang Xu

Spatio-Temporal feature matching and alignment are essential for few-shot action recognition as they determine the coherence and effectiveness of the temporal patterns. Nevertheless, this process could be not reliable, especially when dealing with complex video scenarios. In this paper, we propose to improve the performance of matching and alignment from the end-to-end design of models. Our solution comes at two-folds. First, we encourage to enhance the extracted Spatio-Temporal representations from few-shot videos in the perspective of architectures. With this aim, we propose a specialized transformer search method for videos, thus the spatial and temporal attention can be well-organized and optimized for stronger feature representations. Second, we also design an efficient non-parametric spatio-temporal prototype alignment strategy to better handle the high variability of motion. In particular, a query-specific class prototype will be generated for each query sample and category, which can better match query sequences against all support sequences. By doing so, our method SST enjoys significant superiority over the benchmark UCF101 and HMDB51 datasets. For example, with no pretraining, our method achieves 17. 1\% Top-1 accuracy improvement than the baseline TRX on UCF101 5-way 1-shot setting but with only 3x fewer FLOPs.

IJCAI Conference 2022 Conference Paper

Taylor, Can You Hear Me Now? A Taylor-Unfolding Framework for Monaural Speech Enhancement

  • Andong Li
  • Shan You
  • Guochen Yu
  • Chengshi Zheng
  • Xiaodong Li

While the deep learning techniques promote the rapid development of the speech enhancement (SE) community, most schemes only pursue the performance in a black-box manner and lack adequate model interpretability. Inspired by Taylor's approximation theory, we propose an interpretable decoupling-style SE framework, which disentangles the complex spectrum recovery into two separate optimization problems i. e. , magnitude and complex residual estimation. Specifically, serving as the 0th-order term in Taylor's series, a filter network is delicately devised to suppress the noise component only in the magnitude domain and obtain a coarse spectrum. To refine the phase distribution, we estimate the sparse complex residual, which is defined as the difference between target and coarse spectra, and measures the phase gap. In this study, we formulate the residual component as the combination of various high-order Taylor terms and propose a lightweight trainable module to replace the complicated derivative operator between adjacent terms. Finally, following Taylor's formula, we can reconstruct the target spectrum by the superimposition between 0th-order and high-order terms. Experimental results on two benchmark datasets show that our framework achieves state-of-the-art performance over previous competing baselines in various evaluation metrics. The source code is available at https: //github. com/Andong-Li-speech/TaylorSENet.

ICML Conference 2021 Conference Paper

K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets

  • Xiu Su
  • Shan You
  • Mingkai Zheng
  • Fei Wang 0032
  • Chen Qian 0006
  • Changshui Zhang
  • Chang Xu 0002

In one-shot weight sharing for NAS, the weights of each operation (at each layer) are supposed to be identical for all architectures (paths) in the supernet. However, this rules out the possibility of adjusting operation weights to cater for different paths, which limits the reliability of the evaluation results. In this paper, instead of counting on a single supernet, we introduce $K$-shot supernets and take their weights for each operation as a dictionary. The operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code. This enables a matrix approximation of the stand-alone weight matrix with a higher rank ($K>1$). A \textit{simplex-net} is introduced to produce architecture-customized code for each path. As a result, all paths can adaptively learn how to share weights in the $K$-shot supernets and acquire corresponding weights for better evaluation. $K$-shot supernets and simplex-net can be iteratively trained, and we further extend the search to the channel dimension. Extensive experiments on benchmark datasets validate that K-shot NAS significantly improves the evaluation accuracy of paths and thus brings in impressive performance improvements.

ICLR Conference 2021 Conference Paper

Locally Free Weight Sharing for Network Width Search

  • Xiu Su
  • Shan You
  • Tao Huang 0020
  • Fei Wang 0032
  • Chen Qian 0006
  • Changshui Zhang
  • Chang Xu 0002

Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance \wrt~different width. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which is limited to distinguish the performance gap of different width. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet) accordingly. In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to locate freely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and get optimized within a min-min strategy. Extensive experiments on ImageNet, CIFAR-10, CelebA and MS COCO dataset have verified our superiority comparing to other state-of-the-art baselines. For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately.

NeurIPS Conference 2021 Conference Paper

ReSSL: Relational Self-Supervised Learning with Weak Augmentation

  • Mingkai Zheng
  • Shan You
  • Fei Wang
  • Chen Qian
  • Changshui Zhang
  • Xiaogang Wang
  • Chang Xu

Self-supervised Learning (SSL) including the mainstream contrastive learning has achieved great success in learning visual representations without data annotations. However, most of methods mainly focus on the instance level information (\ie, the different augmented images of the same instance should have the same feature or cluster into the same class), but there is a lack of attention on the relationships between different instances. In this paper, we introduced a novel SSL paradigm, which we term as relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs sharpened distribution of pairwise similarities among different instances as \textit{relation} metric, which is thus utilized to match the feature embeddings of different augmentations. Moreover, to boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage momentum strategy for practical efficiency. Experimental results show that our proposed ReSSL significantly outperforms the previous state-of-the-art algorithms in terms of both performance and training efficiency.

NeurIPS Conference 2020 Conference Paper

Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

  • Shangchen Du
  • Shan You
  • Xiaojie Li
  • Jianlong Wu
  • Fei Wang
  • Chen Qian
  • Changshui Zhang

Distilling knowledge from an ensemble of teacher models is expected to have a more promising performance than that from a single one. Current methods mainly adopt a vanilla average rule, i. e. , to simply take the average of all teacher losses for training the student network. However, this approach treats teachers equally and ignores the diversity among them. When conflicts or competitions exist among teachers, which is common, the inner compromise might hurt the distillation performance. In this paper, we examine the diversity of teacher models in the gradient space and regard the ensemble knowledge distillation as a multi-objective optimization problem so that we can determine a better optimization direction for the training of student network. Besides, we also introduce a tolerance parameter to accommodate disagreement among teachers. In this way, our method can be seen as a dynamic weighting method for each teacher in the ensemble. Extensive experiments validate the effectiveness of our method for both logits-based and feature-based cases.

NeurIPS Conference 2020 Conference Paper

ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding

  • Yibo Yang
  • Hongyang Li
  • Shan You
  • Fei Wang
  • Chen Qian
  • Zhouchen Lin

Neural architecture search (NAS) aims to produce the optimal sparse solution from a high-dimensional space spanned by all candidate connections. Current gradient-based NAS methods commonly ignore the constraint of sparsity in the search phase, but project the optimized solution onto a sparse one by post-processing. As a result, the dense super-net for search is inefficient to train and has a gap with the projected architecture for evaluation. In this paper, we formulate neural architecture search as a sparse coding problem. We perform the differentiable search on a compressed lower-dimensional space that has the same validation loss as the original sparse solution space, and recover an architecture by solving the sparse coding problem. The differentiable search and architecture recovery are optimized in an alternate manner. By doing so, our network for search at each update satisfies the sparsity constraint and is efficient to train. In order to also eliminate the depth and width gap between the network in search and the target-net in evaluation, we further propose a method to search and evaluate in one stage under the target-net settings. When training finishes, architecture variables are absorbed into network weights. Thus we get the searched architecture and optimized parameters in a single run. In experiments, our two-stage method on CIFAR-10 requires only 0. 05 GPU-day for search. Our one-stage method produces state-of-the-art performances on both CIFAR-10 and ImageNet at the cost of only evaluation time.

AAAI Conference 2020 Conference Paper

Learning Student Networks with Few Data

  • Shumin Kong
  • Tianyu Guo
  • Shan You
  • Chang Xu

Recently, the teacher-student learning paradigm has drawn much attention in compressing neural networks on low-end edge devices, such as mobile phones and wearable watches. Current algorithms mainly assume the complete dataset for the teacher network is also available for the training of the student network. However, for real-world scenarios, users may only have access to part of training examples due to commercial profits or data privacy, and severe over-fitting issues would happen as a result. In this paper, we tackle the challenge of learning student networks with few data by investigating the ground-truth data-generating distribution underlying these few data. Taking Wasserstein distance as the measurement, we assume this ideal data distribution lies in a neighborhood of the discrete empirical distribution induced by the training examples. Thus we propose to safely optimize the worst-case cost within this neighborhood to boost the generalization. Furthermore, with theoretical analysis, we derive a novel and easy-to-implement loss for training the student network in an end-to-end fashion. Experimental results on benchmark datasets validate the effectiveness of our proposed method.

AAAI Conference 2020 Conference Paper

Reborn Filters: Pruning Convolutional Neural Networks with Limited Data

  • Yehui Tang
  • Shan You
  • Chang Xu
  • Jin Han
  • Chen Qian
  • Boxin Shi
  • Chao Xu
  • Changshui Zhang

Channel pruning is effective in compressing the pretrained CNNs for their deployment on low-end edge devices. Most existing methods independently prune some of the original channels and need the complete original dataset to fix the performance drop after pruning. However, due to commercial protection or data privacy, users may only have access to a tiny portion of training examples, which could be insufficient for the performance recovery. In this paper, for pruning with limited data, we propose to use all original filters to directly develop new compact filters, named reborn filters, so that all useful structure priors in the original filters can be well preserved into the pruned networks, alleviating the performance drop accordingly. During training, reborn filters can be easily implemented via 1 × 1 convolutional layers and then be fused in the inference stage for acceleration. Based on reborn filters, the proposed channel pruning algorithm shows its effectiveness and superiority on extensive experiments.

AAAI Conference 2018 Conference Paper

Learning With Single-Teacher Multi-Student

  • Shan You
  • Chang Xu
  • Chao Xu
  • Dacheng Tao

In this paper we study a new learning problem defined as “Single-Teacher Multi-Student” (STMS) problem, which investigates how to learn a series of student (simple and specific) models from a single teacher (complex and universal) model. Taking the multiclass and binary classification for example, we focus on learning multiple binary classifiers from a single multiclass classifier, where each of binary classifier is responsible for a certain class. This actually derives from some realistic problems, such as identifying the suspect based on a comprehensive face recognition system. By treating the already-trained multiclass classifier as the teacher, and multiple binary classifiers as the students, we propose a gated support vector machine (gSVM) as a solution. A series of gSVMs are learned with the help of single teacher multiclass classifier. The teacher’s help is two-fold; first, the teacher’s score provides the gated values for students’ decision; second, the teacher can guide the students to accommodate training examples with different difficulty degrees. Extensive experiments on real datasets validate its effectiveness.

AAAI Conference 2017 Conference Paper

Fast Compressive Phase Retrieval under Bounded Noise

  • Hongyang Zhang
  • Shan You
  • Zhouchen Lin
  • Chao Xu

We study the problem of recovering a t-sparse vector ±x0 in Rn from m quadratic equations yi = (aT i x)2 with noisy measurements yi’s. This is known as the problem of compressive phase retrieval, and has been widely applied to Xray diffraction imaging, microscopy, quantum mechanics, etc. The challenge is to design a a) fast and b) noise-tolerant algorithms with c) near-optimal sample complexity. Prior work in this direction typically achieved one or two of these goals, but none of them enjoyed the three performance guarantees simultaneously. In this work, with a particular set of sensing vectors ai’s, we give a provable algorithm that is robust to any bounded yet unstructured deterministic noise. Our algorithm requires roughly O(t) measurements and runs in O(tn log(1/ )) time, where is the error. This result advances the state-of-the-art work, and guarantees the applicability of our method to large datasets. Experiments on synthetic and real data verify our theory.

IJCAI Conference 2017 Conference Paper

Privileged Multi-label Learning

  • Shan You
  • Chang Xu
  • Yunhe Wang
  • Chao Xu
  • Dacheng Tao

This paper presents privileged multi-label learning (PrML) to explore and exploit the relationship between labels in multi-label learning problems. We suggest that for each individual label, it cannot only be implicitly connected with other labels via the low-rank constraint over label predictors, but also its performance on examples can receive the explicit comments from other labels together acting as an Oracle teacher. We generate privileged label feature for each example and its individual label, and then integrate it into the framework of low-rank based multi-label learning. The proposed algorithm can therefore comprehensively explore and exploit label relationships by inheriting all the merits of privileged information and low-rank constraints. We show that PrML can be efficiently solved by dual coordinate descent algorithm using iterative optimization strategy with cheap updates. Experiments on benchmark datasets show that through privileged label features, the performance can be significantly improved and PrML is superior to several competing methods in most cases.

NeurIPS Conference 2016 Conference Paper

CNNpack: Packing Convolutional Neural Networks in the Frequency Domain

  • Yunhe Wang
  • Chang Xu
  • Shan You
  • Dacheng Tao
  • Chao Xu

Deep convolutional neural networks (CNNs) are successfully used in a number of applications. However, their storage and computational requirements have largely prevented their widespread use on mobile devices. Here we present an effective CNN compression approach in the frequency domain, which focuses not only on smaller weights but on all the weights and their underlying connections. By treating convolutional filters as images, we decompose their representations in the frequency domain as common parts (i. e. , cluster centers) shared by other similar filters and their individual private parts (i. e. , individual residuals). A large number of low-energy frequency coefficients in both parts can be discarded to produce high compression without significantly compromising accuracy. We relax the computational burden of convolution operations in CNNs by linearly combining the convolution responses of discrete cosine transform (DCT) bases. The compression and speed-up ratios of the proposed algorithm are thoroughly analyzed and evaluated on benchmark image datasets to demonstrate its superiority over state-of-the-art methods.