Arrow Research search

Author name cluster

Shiji Song

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

NeurIPS Conference 2025 Conference Paper

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

  • Shenzhi Wang
  • Le Yu
  • Chang Gao
  • Chujie Zheng
  • Shixuan Liu
  • Rui Lu
  • Kai Dang
  • Xiong-Hui Chen

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its underlying mechanisms remain insufficiently understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction (approximately 20\%) of tokens exhibit high entropy, and these tokens semantically act as critical forks that steer the model toward diverse reasoning pathways. We further demonstrate that moderately increasing the entropy of these high-entropy tokens via decoding temperature adjustments leads to improved performance, quantitatively confirming their role as decision points in reasoning. We ultimately refine RLVR by restricting policy gradient updates to these forking tokens. Despite utilizing only 20\% of tokens, our approach achieves comparable performance to full-gradient updates on the Qwen3-8B base model. Moreover, it demonstrates remarkable improvements on the larger Qwen3-32B base model, boosting AIME'25 scores by 11. 04 and AIME'24 scores by 7. 71. In contrast, training exclusively on the 80\% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that dictate key reasoning directions. Collectively, our results suggest promising avenues for optimizing RLVR algorithms by strategically leveraging the potential of these high-entropy minority tokens to further enhance the reasoning abilities of LLMs.

NeurIPS Conference 2025 Conference Paper

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

  • Zhiqi Chen
  • Rui Lu
  • Andrew Zhao
  • Zhaokai Wang
  • Yang Yue
  • Shiji Song
  • Gao Huang

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly in mathematics and programming tasks. It is widely believed that, similar to how traditional RL helps agents to explore and learn new strategies, RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, we take a critical look at \textit{the current state of RLVR} by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across diverse model families, RL algorithms, and math/coding/visual reasoning benchmarks, using pass@\textit{k} at large \textit{k} values as the evaluation metric. While RLVR improves sampling efficiency towards the correct path, we surprisingly find that current training does \emph{not} elicit fundamentally new reasoning patterns. We observe that while RLVR-trained models outperform their base models at smaller values of $k$ (\eg, $k$=1), base models achieve higher pass@$k$ score when $k$ is large. Moreover, we observe that the reasoning capability boundary of LLMs often narrows as RLVR training progresses. Further coverage and perplexity analysis shows that the reasoning paths generated by RLVR models are already included in the base models' sampling distribution, suggesting that their reasoning abilities originate from and are \textit{bounded} by the base model. From this perspective, treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in fully leveraging the potential of the base model. In contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model’s reasoning capabilities. Taken together, our findings suggest that current RLVR methods have not fully realized the potential of RL to elicit genuinely novel reasoning abilities in LLMs. This underscores the need for improved RL paradigms—such as continual scaling and multi-turn agent-environment interaction—to unlock this potential.

ICLR Conference 2025 Conference Paper

GridMix: Exploring Spatial Modulation for Neural Fields in PDE Modeling

  • Honghui Wang
  • Shiji Song
  • Gao Huang 0001

Significant advancements have been achieved in PDE modeling using neural fields. Despite their effectiveness, existing methods rely on global modulation, limiting their ability to reconstruct local details. While spatial modulation with vanilla grid-based representations offers a promising alternative, it struggles with inadequate global information modeling and over-fitting to the training spatial domain. To address these challenges, we propose GridMix, a novel approach that models spatial modulation as a mixture of grid-based representations. GridMix effectively explores global structures while preserving locality for fine-grained modulation. Furthermore, we introduce spatial domain augmentation to enhance the robustness of the modulated neural fields against spatial domain variations. With all these innovations, our comprehensive approach culminates in MARBLE, a framework that significantly advancing the capabilities of neural fields in PDE modeling. The effectiveness of MARBLE is extensively validated on diverse benchmarks encompassing dynamics modeling and geometric prediction.

AAAI Conference 2024 Conference Paper

A Reinforcement-Learning-Based Multiple-Column Selection Strategy for Column Generation

  • Haofeng Yuan
  • Lichang Fang
  • Shiji Song

Column generation (CG) is one of the most successful approaches for solving large-scale linear programming (LP) problems. Given an LP with a prohibitively large number of variables (i.e., columns), the idea of CG is to explicitly consider only a subset of columns and iteratively add potential columns to improve the objective value. While adding the column with the most negative reduced cost can guarantee the convergence of CG, it has been shown that adding multiple columns per iteration rather than a single column can lead to faster convergence. However, it remains a challenge to design a multiple-column selection strategy to select the most promising columns from a large number of candidate columns. In this paper, we propose a novel reinforcement-learning-based (RL) multiple-column selection strategy. To the best of our knowledge, it is the first RL-based multiple-column selection strategy for CG. The effectiveness of our approach is evaluated on two sets of problems: the cutting stock problem and the graph coloring problem. Compared to several widely used single-column and multiple-column selection strategies, our RL-based multiple-column selection strategy leads to faster convergence and achieves remarkable reductions in the number of CG iterations and runtime.

IROS Conference 2024 Conference Paper

A Unified Interaction Control Framework for Safe Robotic Ultrasound Scanning with Human-Intention-Aware Compliance

  • Xiangjie Yan
  • Shaqi Luo
  • Yongpeng Jiang
  • Mingrui Yu
  • Chen Chen 0087
  • Senqiang Zhu
  • Gao Huang 0001
  • Shiji Song

The ultrasound scanning robot operates in environments where frequent human-robot interactions occur. Most existing control methods for ultrasound scanning address only one specific interaction situation or implement hard switches between controllers for different situations, which compromises both safety and efficiency. In this paper, we propose a unified interaction control framework for ultrasound scanning robots capable of handling all common interactions, distinguishing both human-intended and unintended types, and adapting with appropriate compliance. Specifically, the robot suspends or modulates its ongoing main task if the interaction is intended, e. g. , when the doctor grasps the robot to lead the end effector actively. Furthermore, it can identify unintended interactions and avoid potential collision in the null space beforehand. Even if that collision has happened, it can become compliant with the collision in the null space and try to reduce its impact on the main task (where the scan is ongoing) kinematically and dynamically. The multiple situations are integrated into a unified controller with a smooth transition to deal with the interactions by exhibiting human-intention-aware compliance. Experimental results validate the framework’s ability to cope with all common interactions including intended intervention and unintended collision in a collaborative carotid artery ultrasound scanning task.

NeurIPS Conference 2024 Conference Paper

Bridging the Divide: Reconsidering Softmax and Linear Attention

  • Dongchen Han
  • Yifan Pu
  • Zhuofan Xia
  • Yizeng Han
  • Xuran Pan
  • Xiu Li
  • Jiwen Lu
  • Shiji Song

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between the linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, which is prone to assign identical attention weights to different query vectors, thus adding to severe semantic confusion since different queries correspond to the same outputs. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short. The aforementioned two fundamental differences significantly contribute to the disparities between these two attention paradigms, which is demonstrated by our substantial empirical validation in the paper. In addition, more experiment results indicate that linear attention, as long as endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computation complexity. Code is available at https: //github. com/LeapLabTHU/InLine.

NeurIPS Conference 2024 Conference Paper

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

  • Yang Yue
  • Yulin Wang
  • Bingyi Kang
  • Yizeng Han
  • Shenzhi Wang
  • Shiji Song
  • Jiashi Feng
  • Gao Huang

Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks, whose feasibility has been recently verified~\cite{rt-2, rt-x}. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs usually incorporates storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we seek to address this challenge by leveraging an intriguing observation: relatively easier situations make up the bulk of the procedure of controlling robots to fulfill diverse tasks, and they generally require far smaller models to obtain the correct robotic actions. Motivated by this observation, we propose a \emph{DynamicEarly-Exit for Robotic MLLM} (DeeR) framework that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to cease processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (\emph{i. e. }, power consumption), as well as peak computational consumption (\emph{i. e. }, latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. Moreover, we design a tailored training method for integrating temporal information on top of such multi-exit architectures to predict actions reasonably. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs by 5. 2-6. 5x and GPU memory by 2x without compromising performance. Code and checkpoints are available at https: //github. com/yueyang130/DeeR-VLA.

NeurIPS Conference 2024 Conference Paper

Demystify Mamba in Vision: A Linear Attention Perspective

  • Dongchen Han
  • Ziyi Wang
  • Zhuofan Xia
  • Yizeng Han
  • Yifan Pu
  • Chunjiang Ge
  • Jun Song
  • Shiji Song

Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with linear attention Transformer, which typically underperform conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba’s success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba’s success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https: //github. com/LeapLabTHU/MLLA.

IROS Conference 2024 Conference Paper

In-Hand Following of Deformable Linear Objects Using Dexterous Fingers with Tactile Sensing

  • Mingrui Yu
  • Boyuan Liang
  • Xiang Zhang 0020
  • Xinghao Zhu
  • Lingfeng Sun
  • Changhao Wang
  • Shiji Song
  • Xiang Li 0009

Most research on deformable linear object (DLO) manipulation assumes rigid grasping. However, beyond rigid grasping and re-grasping, in-hand following is also an essential skill that humans use to dexterously manipulate DLOs, which requires continuously changing the grasp point by in-hand sliding while holding the DLO to prevent it from falling. Achieving such a skill is very challenging for robots without using specially designed but not versatile end-effectors. Previous works have attempted using generic parallel grippers, but their robustness is unsatisfactory owing to the conflict between following and holding, which is hard to balance with a one-degree-of-freedom gripper. In this work, inspired by how humans use fingers to follow DLOs, we explore the usage of a generic dexterous hand with tactile sensing to imitate human skills and achieve robust in-hand DLO following. To enable the hardware system to function in the real world, we develop a framework that includes Cartesian-space arm-hand control, tactile-based in-hand 3-D DLO pose estimation, and task-specific motion design. Experimental results demonstrate the significant superiority of our method over using parallel grippers, as well as its great robustness, generalizability, and efficiency.

IROS Conference 2024 Conference Paper

Visual Attention Based Cognitive Human-Robot Collaboration for Pedicle Screw Placement in Robot-Assisted Orthopedic Surgery

  • Chen Chen 0087
  • Qikai Zou
  • Yuhang Song 0009
  • Mingrui Yu
  • Senqiang Zhu
  • Shiji Song
  • Xiang Li 0009

Current orthopedic robotic systems largely focus on navigation, aiding surgeons in positioning a guiding tube but still requiring manual drilling and screw placement. The automation of this task not only demands high precision and safety due to the intricate physical interactions between the surgical tool and bone but also poses significant risks when executed without adequate human oversight. As it involves continuous physical interaction, the robot should collaborate with the surgeon, understand the human intent, and always include the surgeon in the loop. To achieve this, this paper proposes a new cognitive human–robot collaboration framework, including the intuitive AR-haptic human–robot interface, the visual-attention-based surgeon model, and the shared interaction control scheme for the robot. User studies on a robotic platform for orthopedic surgery are presented to illustrate the performance of the proposed method. The results demonstrate that the proposed human– robot collaboration framework outperforms full robot and full human control in terms of safety and ergonomics.

ICML Conference 2023 Conference Paper

Boosting Offline Reinforcement Learning with Action Preference Query

  • Qisen Yang
  • Shenzhi Wang
  • Matthieu Lin
  • Shiji Song
  • Gao Huang 0001

Training practical agents usually involve offline and online reinforcement learning (RL) to balance the policy’s performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic for high-stake scenarios like healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful to the erroneous estimate problem. By adaptively encouraging or suppressing policy constraint according to action preferences, OAP could distinguish overestimation from beneficial policy improvement and thus attains a more accurate evaluation of unseen data. Theoretically, we prove a lower bound of the behavior policy’s performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark and state-of-the-art algorithms demonstrate that OAP yields higher (29% on average) scores, especially on challenging AntMaze tasks (98% higher).

ICLR Conference 2023 Conference Paper

Budgeted Training for Vision Transformer

  • Zhuofan Xia
  • Xuran Pan
  • Xuan Jin
  • Yuan He 0011
  • Hui Xue 0001
  • Shiji Song
  • Gao Huang 0001

The superior performances of Vision Transformers often come with higher training costs. Compared to their CNN counterpart, Transformer models are hungry for large-scale data and their training schedules are usually prolonged. This sets great restrictions on training Transformers with limited resources, where a proper trade-off between training cost and model performance is longed. In this paper, we address the problem by proposing a framework that enables the training process under \textit{any training budget} from the perspective of model structure, while achieving competitive model performances. Specifically, based on the observation that Transformer exhibits different levels of model redundancies at different training stages, we propose to dynamically control the activation rate of the model structure along the training process and meet the demand on the training budget by adjusting the duration on each level of model complexity. Extensive experiments demonstrate that our framework is applicable to various Vision Transformers, and achieves competitive performances on a wide range of training budgets.

AAAI Conference 2023 Conference Paper

Causal Intervention for Human Trajectory Prediction with Cross Attention Mechanism

  • Chunjiang Ge
  • Shiji Song
  • Gao Huang

Human trajectory Prediction (HTP) in complex social environments plays a crucial and fundamental role in artificial intelligence systems. Conventional methods make use of both history behaviors and social interactions to forecast future trajectories. However, we demonstrate that the social environment is a confounder that misleads the model to learn spurious correlations between history and future trajectories. To end this, we first formulate the social environment, history and future trajectory variables into a structural causal model to analyze the causalities among them. Based on causal intervention rather than conventional likelihood, we propose a Social Environment ADjustment (SEAD) method, to remove the confounding effect of the social environment. The core of our method is implemented by a Social Cross Attention (SCA) module, which is universal, simple and effective. Our method has consistent improvements on ETH-UCY datasets with three baseline models and achieves competitive performances with existing methods.

NeurIPS Conference 2023 Conference Paper

Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning

  • Shenzhi Wang
  • Qisen Yang
  • Jiawei Gao
  • Matthieu Lin
  • Hao Chen
  • Liwei Wu
  • Ning Jia
  • Shiji Song

Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, the incorporation of online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. They typically advocate a single balance between policy improvement and constraints across diverse data collections. This one-size-fits-all manner may not optimally leverage each collected sample due to the significant variation in data quality across different states. To this end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O utilizes a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Codes are available at https: //github. com/LeapLabTHU/FamO2O.

NeurIPS Conference 2023 Conference Paper

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

  • Yang Yue
  • Rui Lu
  • Bingyi Kang
  • Shiji Song
  • Gao Huang

The divergence of the Q-value estimation has been a prominent issue offline reinforcement learning (offline RL), where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, \emph{self-excitation}, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretic analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some most challenging settings, i. e. using only 1$\%$ transitions of the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness.

NeurIPS Conference 2022 Conference Paper

Contrastive Language-Image Pre-Training with Knowledge Graphs

  • Xuran Pan
  • Tianzhu Ye
  • Dongchen Han
  • Shiji Song
  • Gao Huang

Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performances when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. Through introducing knowledge-based objectives in the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities. Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.

NeurIPS Conference 2022 Conference Paper

Efficient Knowledge Distillation from Model Checkpoints

  • Chaofei Wang
  • Qisen Yang
  • Rui Huang
  • Shiji Song
  • Gao Huang

Knowledge distillation is an effective approach to learn compact models (students) with the supervision of large and strong models (teachers). As empirically there exists a strong correlation between the performance of teacher and student models, it is commonly believed that a high performing teacher is preferred. Consequently, practitioners tend to use a well trained network or an ensemble of them as the teacher. In this paper, we observe that an intermediate model, i. e. , a checkpoint in the middle of the training procedure, often serves as a better teacher compared to the fully converged model, although the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from a same training trajectory can outperform a strong ensemble of independently trained and fully converged models, when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information regarding the input, and thus contain more ``dark knowledge'' for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability. Our code is available at https: //github. com/LeapLabTHU/CheckpointKD.

NeurIPS Conference 2022 Conference Paper

Latency-aware Spatial-wise Dynamic Networks

  • Yizeng Han
  • Zhihang Yuan
  • Yifan Pu
  • Chenhao Xue
  • Shiji Song
  • Guangyu Sun
  • Gao Huang

Spatial-wise dynamic convolution has become a promising approach to improving the inference efficiency of deep networks. By allocating more computation to the most informative pixels, such an adaptive inference paradigm reduces the spatial redundancy in image features and saves a considerable amount of unnecessary computation. However, the theoretical efficiency achieved by previous methods can hardly translate into a realistic speedup, especially on the multi-core processors (e. g. GPUs). The key challenge is that the existing literature has only focused on designing algorithms with minimal computation, ignoring the fact that the practical latency can also be influenced by scheduling strategies and hardware properties. To bridge the gap between theoretical computation and practical efficiency, we propose a latency-aware spatial-wise dynamic network (LASNet), which performs coarse-grained spatially adaptive inference under the guidance of a novel latency prediction model. The latency prediction model can efficiently estimate the inference latency of dynamic networks by simultaneously considering algorithms, scheduling strategies, and hardware properties. We use the latency predictor to guide both the algorithm design and the scheduling optimization on various hardware platforms. Experiments on image classification, object detection and instance segmentation demonstrate that the proposed framework significantly improves the practical inference efficiency of deep networks. For example, the average latency of a ResNet-101 on the ImageNet validation set could be reduced by 36% and 46% on a server GPU (Nvidia Tesla-V100) and an edge device (Nvidia Jetson TX2 GPU) respectively without sacrificing the accuracy. Code is available at https: //github. com/LeapLabTHU/LASNet.

NeurIPS Conference 2021 Conference Paper

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

  • Yulin Wang
  • Rui Huang
  • Shiji Song
  • Zeyi Huang
  • Gao Huang

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of “easy” images which can be accurately predicted with a mere number of 4x4 tokens, while only a small fraction of “hard” ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i. e. , the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed. Code and pre-trained models (based on PyTorch and MindSpore) are available at https: //github. com/blackfeather-wang/Dynamic-Vision-Transformer and https: //github. com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore.

ICLR Conference 2021 Conference Paper

Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

  • Yulin Wang 0002
  • Zanlin Ni
  • Shiji Song
  • Le Yang 0007
  • Gao Huang 0001

Due to the need to store the intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from high GPUs memory footprint. This paper aims to address this problem by revisiting the locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible, while progressively discard task-irrelevant information. As InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% memory footprint compared to E2E training, while allowing using training data with higher-resolution or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration.

NeurIPS Conference 2020 Conference Paper

Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification

  • Yulin Wang
  • Kangchen Lv
  • Rui Huang
  • Shiji Song
  • Le Yang
  • Gao Huang

The accuracy of deep convolutional neural networks (CNNs) generally improves when fueled with high resolution images. However, this often comes at a high computational cost and high memory footprint. Inspired by the fact that not all regions in an image are task-relevant, we propose a novel framework that performs efficient image classification by processing a sequence of relatively small inputs, which are strategically selected from the original image with reinforcement learning. Such a dynamic decision process naturally facilitates adaptive inference at test time, i. e. , it can be terminated once the model is sufficiently confident about its prediction and thus avoids further redundant computation. Notably, our framework is general and flexible as it is compatible with most of the state-of-the-art light-weighted CNNs (such as MobileNets, EfficientNets and RegNets), which can be conveniently deployed as the backbone feature extractor. Experiments on ImageNet show that our method consistently improves the computational efficiency of a wide variety of deep models. For example, it further reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 20% without sacrificing accuracy. Code and pre-trained models are available at https: //github. com/blackfeather-wang/GFNet-Pytorch.

IROS Conference 2019 Conference Paper

End-to-end sensorimotor control problems of AUVs with deep reinforcement learning

  • Hui Wu 0002
  • Shiji Song
  • Yachu Hsu
  • Keyou You
  • Cheng Wu 0002

This paper studies on sensorimotor control problems of Autonomous Underwater Vehicles (AUVs) using deep reinforcement learning. We design an end-to-end learning architecture mapping original sensor input to continuous control output without referring to the dynamics of vehicles. To avoid difficult and noisy underwater localization, we implement the learning without knowing the positions of AUVs by proposing novel state encoder and reward shaping strategies. Two distinct underwater tasks, obstacle avoidance with sonar sensor and pipeline following with visual sensor, are simulated to validate the effectiveness of proposed architecture and strategies. For the latter, we test the learned policy on realistic images of underwater pipelines to check its generalization ability.

NeurIPS Conference 2019 Conference Paper

Implicit Semantic Data Augmentation for Deep Networks

  • Yulin Wang
  • Xuran Pan
  • Shiji Song
  • Hong Zhang
  • Gao Huang
  • Cheng Wu

In this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e. g. , adding sunglasses or changing backgrounds. As a consequence, translating training samples along many semantic directions in the feature space can effectively augment the dataset to improve generalization. To implement this idea effectively and efficiently, we first perform an online estimate of the covariance matrix of deep features for each class, which captures the intra-class semantic variations. Then random vectors are drawn from a zero-mean normal distribution with the estimated covariance to augment the training data in that class. Importantly, instead of augmenting the samples explicitly, we can directly minimize an upper bound of the expected cross-entropy (CE) loss on the augmented training set, leading to a highly efficient algorithm. In fact, we show that the proposed ISDA amounts to minimizing a novel robust CE loss, which adds negligible extra computational cost to a normal training procedure. Although being simple, ISDA consistently improves the generalization performance of popular deep models (ResNets and DenseNets) on a variety of datasets, e. g. , CIFAR-10, CIFAR-100 and ImageNet. Code for reproducing our results are available at https: //github. com/blackfeather-wang/ISDA-for-Deep-Networks.

NeurIPS Conference 2019 Conference Paper

Regularized Anderson Acceleration for Off-Policy Deep Reinforcement Learning

  • Wenjie Shi
  • Shiji Song
  • Hui Wu
  • Ya-Chu Hsu
  • Cheng Wu
  • Gao Huang

Model-free deep reinforcement learning (RL) algorithms have been widely used for a range of complex control tasks. However, slow convergence and sample inefficiency remain challenging problems in RL, especially when handling continuous and high-dimensional state spaces. To tackle this problem, we propose a general acceleration method for model-free, off-policy deep RL algorithms by drawing the idea underlying regularized Anderson acceleration (RAA), which is an effective approach to accelerating the solving of fixed point problems with perturbations. Specifically, we first explain how policy iteration can be applied directly with Anderson acceleration. Then we extend RAA to the case of deep RL by introducing a regularization term to control the impact of perturbation induced by function approximation errors. We further propose two strategies, i. e. , progressive update and adaptive restart, to enhance the performance. The effectiveness of our method is evaluated on a variety of benchmark tasks, including Atari 2600 and MuJoCo. Experimental results show that our approach substantially improves both the learning speed and final performance of state-of-the-art deep RL algorithms.

IJCAI Conference 2019 Conference Paper

Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

  • Wenjie Shi
  • Shiji Song
  • Cheng Wu

Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality such as 3D humanoid locomotion. Besides, the optimality of desired Boltzmann policy set for non-optimal soft value function is not persuasive enough. In this paper, we first derive soft policy gradient based on entropy regularized expected reward objective for RL with continuous actions. Then, we present an off-policy actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG) by combining soft policy gradient with soft Bellman equation. To ensure stable learning while eliminating the need of two separate critics for soft value functions, we leverage double sampling approach to making the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms in performance over off-policy prior methods.

AAAI Conference 2015 Conference Paper

A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing

  • Quan Zhou
  • Wenlin Chen
  • Shiji Song
  • Jacob Gardner
  • Kilian Weinberger
  • Yixin Chen

Algorithmic reductions are one of the corner stones of theoretical computer science. Surprisingly, to-date, they have only played a limited role in machine learning. In this paper we introduce a formal and practical reduction between two of the most widely used machine learning algorithms: from the Elastic Net (and the Lasso as a special case) to the Support Vector Machine. First, we derive the reduction and summarize it in only 11 lines of MATLABTM. Then, we demonstrate its high impact potential by translating recent advances in parallelizing SVM solvers directly to the Elastic Net. The resulting algorithm is a parallel solver for the Elastic Net (and Lasso) that naturally utilizes GPU and multi-core CPUs. We evaluate it on twelve real world data sets, and show that it yields identical results as the popular (and highly optimized) glmnet implementation but is up-to two orders of magnitude faster.

AAAI Conference 2015 Conference Paper

Maximin Separation Probability Clustering

  • Gao Huang
  • Jianwen Zhang
  • Shiji Song
  • Zheng Chen

This paper proposes a new approach for discriminative clustering. The intuition is, for a good clustering, one should be able to learn a classifier from the clustering labels with high generalization accuracy. Thus we define a novel metric to evaluate the quality of a clustering labeling, named Minimum Separation Probability (MSP), which is a lower bound of the generalization accuracy of a classifier learnt from the clustering labeling. We take MSP as the objective to maximize and propose our approach Maximin Separation Probability Clustering (MSPC), which has several attractive properties, such as invariance to anisotropic feature scaling and intuitive probabilistic explanation for clustering quality. We present three efficient optimization strategies for MSPC, and analyze their interesting connections to existing clustering approaches, such as maximum margin clustering (MMC) and discriminative k-means. Empirical results on real world data sets verify that MSP is a robust and effective clustering quality measure. It is also shown that the proposed algorithms compare favorably to state-of-the-art clustering algorithms in both accuracy and efficiency.