Arrow Research

Author name cluster

Donglin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

46 papers
2 author rows

Possible papers (46)

AAAI Conference 2026 Conference Paper

Activation-wise Propagation: A One-Timestep Strategy for Spiking Neural Networks

  • Jian Song
  • Xiangfei Yang
  • Shangke Lyu
  • Donglin Wang

Spiking neural networks (SNNs) have demonstrated significant potential in real-time multi-sensor perception tasks due to their event-driven and parameter-efficient characteristics. A key challenge is the timestep-wise iterative update of neuronal hidden states (membrane potentials), which complicates the trade-off between accuracy and latency. SNNs tend to achieve better performance with longer timesteps, inevitably resulting in higher computational overhead and latency compared to artificial neural networks (ANNs). Moreover, many recent advances in SNNs rely on architecture-specific optimizations, which, while effective with fewer timesteps, often limit generalizability and scalability across modalities and models. To address these limitations, we propose Activation-wise Membrane Potential Propagation (AMP2), a unified hidden state update mechanism for SNNs. Inspired by the spatial propagation of membrane potentials in biological neurons, AMP2 enables dynamic transmission of membrane potentials among spatially adjacent neurons, facilitating spatiotemporal integration and cooperative dynamics of hidden states, thereby improving efficiency and accuracy while reducing reliance on extended temporal updates. This simple yet effective strategy significantly enhances SNN performance across various architectures, including MLPs and CNNs for point cloud and event-based data. Furthermore, ablation studies integrating AMP2 into Transformer-based SNNs for classification tasks demonstrate its potential as a general-purpose and efficient solution for spiking neural networks.
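
A minimal sketch of what a single-timestep update with spatial membrane-potential propagation could look like, to make the abstract's mechanism concrete. The neighborhood-averaging rule, the function name `amp2_step`, and all constants are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def amp2_step(x, v, tau=0.5, v_th=1.0, k=1):
    """One-timestep LIF update with a crude spatial propagation of
    membrane potentials among adjacent neurons (a hypothetical reading
    of AMP2; the paper's exact propagation rule may differ).

    x : (N,) input current per neuron
    v : (N,) membrane potentials, shared spatially instead of being
        iterated over many timesteps
    k : neighborhood radius used for potential sharing
    """
    # Spatially propagate membrane potential: average each neuron's
    # potential with its k nearest neighbors along the feature axis.
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    v_prop = np.convolve(v, kernel, mode="same")
    # Standard leaky integrate-and-fire update, single timestep.
    v_new = tau * v_prop + x
    spikes = (v_new >= v_th).astype(x.dtype)
    v_new = v_new * (1.0 - spikes)          # hard reset after spiking
    return spikes, v_new

# toy usage
rng = np.random.default_rng(0)
spikes, v = amp2_step(rng.normal(size=8), np.zeros(8))
```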

AAAI Conference 2026 Conference Paper

Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models

  • Hongyin Zhang
  • Shiyuan Zhang
  • Junxi Jin
  • Qixin Zeng
  • Yifan Qiao
  • Hongchao Lu
  • Donglin Wang

Vision-Language-Action (VLA) models based on flow matching have shown excellent performance in general-purpose robotic manipulation tasks. However, the action accuracy of these models on complex downstream tasks is unsatisfactory. One important reason is that these models rely solely on the post-training paradigm of imitation learning, which makes it difficult to develop a deeper understanding of the distribution properties of data quality, something that Reinforcement Learning (RL) excels at. In this paper, we theoretically derive an offline RL post-training objective for VLA flow models and induce an efficient and feasible offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). By introducing an adaptively adjusted scaling factor into the VLA flow model loss, we construct a principled bias-variance trade-off objective function to optimally control the impact of the RL signal on the flow loss. ARFM adaptively balances RL advantage preservation and flow loss gradient variance control, resulting in a more stable and efficient fine-tuning process. Extensive simulation and real-world experimental results show that ARFM exhibits excellent generalization, robustness, few-shot learning, and continual learning performance.
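
To illustrate the bias-variance trade-off the abstract describes, here is a hedged sketch of an advantage-weighted flow-matching loss. The clipped exponential weighting and the knobs `beta` and `w_max` are illustrative stand-ins for the paper's derived adaptive scaling factor.

```python
import torch

def arfm_loss(v_pred, v_target, advantage, beta=1.0, w_max=5.0):
    """Sketch of an advantage-weighted flow-matching loss in the spirit
    of ARFM (the paper derives its scaling factor from a bias-variance
    objective; a clipped exponential weight approximates the idea here).

    v_pred, v_target : (B, D) predicted / target velocity fields
    advantage        : (B,) offline RL advantage estimates
    """
    per_sample = ((v_pred - v_target) ** 2).mean(dim=-1)   # flow loss
    # Exponentiated advantage preserves the RL signal; clipping caps the
    # weight so high-advantage samples cannot blow up gradient variance.
    w = torch.clamp(torch.exp(beta * advantage), max=w_max).detach()
    return (w * per_sample).mean()

# toy usage
loss = arfm_loss(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4))
```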

AAAI Conference 2026 Conference Paper

Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

  • Yuhang Han
  • Xuyang Liu
  • Zihan Zhang
  • Pengxiang Ding
  • Junjie Chen
  • Honggang Chen
  • Donglin Wang
  • Qingsen Yan

The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework to accelerate MLLMs by systematically optimizing multimodal context length during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information recycling mechanism that allows preserved tokens to selectively recycle information from correlated discarded tokens with a self-preserving compression, thereby preventing the dilution of their own core content. The framework's FiCoCo-L variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the FiCoCo series effectively accelerates a range of MLLMs, achieving up to 14.7× FLOPs reduction with 93.6% performance retention. Our methods consistently outperform state-of-the-art training-free approaches, showcasing effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining.
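
A toy filter-correlate-compress pass over visual tokens, to ground the three stages named in the abstract. The cosine-similarity redundancy proxy and the fixed self-preservation weight `alpha` are guesses at the mechanics, not the paper's integrated metric.

```python
import torch

def ficoco_reduce(tokens, keep_ratio=0.5):
    """Illustrative filter-correlate-compress pass over visual tokens
    (a sketch of FiCoCo-V-style reduction; the paper's redundancy metric
    is more elaborate than the similarity proxy used here).

    tokens : (N, D) visual token embeddings
    """
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    # Filter: score redundancy with a simple proxy (negative mean cosine
    # similarity to all other tokens: distinctive tokens score highest).
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = normed @ normed.T
    scores = -(sim.sum(dim=-1) - 1.0)        # exclude self-similarity
    keep_idx = scores.topk(n_keep).indices
    drop_idx = scores.topk(tokens.shape[0] - n_keep, largest=False).indices
    kept, dropped = tokens[keep_idx], tokens[drop_idx]
    # Correlate + compress: each kept token recycles information from the
    # dropped tokens it correlates with, while a self-preserving term
    # keeps its own content from being diluted.
    corr = torch.softmax(normed[keep_idx] @ normed[drop_idx].T, dim=-1)
    alpha = 0.8                              # self-preservation weight
    return alpha * kept + (1 - alpha) * corr @ dropped

reduced = ficoco_reduce(torch.randn(16, 32))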

AAAI Conference 2026 Conference Paper

ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

  • Wenxuan Song
  • Ziyang Zhou
  • Han Zhao
  • Jiayi Chen
  • Pengxiang Ding
  • Haodong Yan
  • Yuxin Huang
  • Feilong Tang

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, thus effectively leveraging task-specific visual information and conducting precise manipulation. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model’s generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capabilities of precise manipulation and generalization.

AAAI Conference 2026 Conference Paper

Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

  • Hangyu Liu
  • Bo Peng
  • Pengxiang Ding
  • Donglin Wang

Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. However, existing generative approaches for multi-target attacks primarily encode target labels into one-dimensional tensors, leading to a loss of fine-grained visual information and overfitting to model-specific features during noise generation. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments demonstrate that TGAF consistently surpasses state-of-the-art methods across various settings.

AAAI Conference 2026 Conference Paper

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

  • Yihao Wang
  • Pengxiang Ding
  • Lingxiao Li
  • Can Cui
  • Zirui Ge
  • Xinyang Tong
  • Wenxuan Song
  • Han Zhao

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks show that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models.

NeurIPS Conference 2025 Conference Paper

Boundary-to-Region Supervision for Offline Safe Reinforcement Learning

  • Huikang Su
  • Dengyun Peng
  • Zifeng Zhuang
  • Yuhan Liu
  • Qiguang Chen
  • Donglin Wang
  • Qinghe Liu

Offline safe reinforcement learning aims to learn policies that satisfy predefined safety constraints from static datasets. Existing sequence-model-based methods condition action generation on symmetric input tokens for return-to-go (RTG) and cost-to-go (CTG), neglecting their intrinsic asymmetry: RTG serves as a flexible performance target, while CTG should represent a rigid safety boundary. This symmetric conditioning leads to unreliable constraint satisfaction, especially when encountering out-of-distribution cost trajectories. To address this, we propose Boundary-to-Region (B2R), a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. Combined with rotary positional embeddings, it enhances exploration within the safe region. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL.
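
A small sketch of the cost-signal realignment idea: feasible trajectories are conditioned on the same fixed safety budget rather than their raw cost-to-go, turning CTG into a boundary token. The exact relabeling rule below is an illustrative assumption.

```python
import numpy as np

def realign_ctg(costs, budget):
    """Hedged sketch of Boundary-to-Region cost realignment: instead of
    conditioning on each trajectory's raw cost-to-go, every feasible
    trajectory is conditioned on the remaining safety budget, so CTG
    acts as a rigid boundary (details are illustrative, not the paper's
    exact rule).

    costs  : (T,) per-step costs of one trajectory
    budget : scalar safety budget
    """
    raw_ctg = np.cumsum(costs[::-1])[::-1]      # standard cost-to-go
    if raw_ctg[0] <= budget:                    # feasible trajectory
        # unified conditioning signal: budget minus cost incurred so far
        return budget - (raw_ctg[0] - raw_ctg)
    return raw_ctg                              # infeasible: keep raw

print(realign_ctg(np.array([1.0, 0.5, 0.0, 2.0]), budget=5.0))
```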

AAAI Conference 2025 Conference Paper

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

  • Han Zhao
  • Min Zhang
  • Wei Zhao
  • Pengxiang Ding
  • Siteng Huang
  • Donglin Wang

In recent years, applying multi-modal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs comprise the well-known Transformer network, whose quadratic computational complexity makes them less efficient. In this study, we introduce Cobra, a multi-modal large language model built upon a state-space model, which has demonstrated significant potential in efficiently handling long sequences with fast inference and linear scalability with respect to sequence length. Specifically, Cobra replaces Transformer-based backbone models (e.g., LLaMA or Phi) with pre-trained Mamba language models. We then empirically explore effective strategies for aligning visual and textual modalities and integrating various pre-trained Mamba model variants with visual encoders. Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3× to 4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2, and its performance is significantly enhanced thanks to linear sequence modeling. (ii) Cobra fine-tunes only a subset of its parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.

PRL Workshop 2025 Workshop Paper

ContextFormer: Stitching via Expert Calibration

  • Ziqi Zhang
  • Jingzehua Xu
  • Jinxin Liu
  • Zifeng Zhuang
  • Donglin Wang
  • Miao Liu
  • Shuai Zhang

Offline reinforcement learning (RL) algorithms can learn better decision-making than behavior policies by stitching suboptimal trajectories to derive more optimal ones. Meanwhile, Decision Transformer (DT) abstracts RL as sequence modeling, showcasing competitive performance on offline RL benchmarks. However, recent studies demonstrate that DT lacks stitching capability, so endowing DT with this capability is vital to further improving its performance. To do so, we abstract trajectory stitching as divergent sequential expert matching and introduce our approach, ContextFormer, which integrates contextual information-based imitation learning (IL) and sequence modeling to stitch sub-optimal trajectory fragments by emulating the representations of a limited number of expert trajectories. To validate our approach, we conduct experiments from two perspectives: 1) We conduct extensive experiments on D4RL benchmarks under IL settings, and the results demonstrate that ContextFormer achieves competitive performance in multiple IL settings. 2) More importantly, we compare ContextFormer with various competitive DT variants using identical training datasets. The experimental results unveil ContextFormer's superiority, as it outperforms all other variants.

AAAI Conference 2025 Conference Paper

CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls

  • Li Chai
  • Donglin Wang

Lyric-to-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder with fine-grained lyric and musical controls, which is able to generate full-song melodies matched with the given lyrics and user-specified musical attributes. Specifically, we first introduce REMI-Aligned, a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies, facilitating precise alignment modeling. Subsequently, sentence-level semantic lyric embeddings independently extracted from a sentence-wise Transformer encoder are combined with word-level part-of-speech embeddings and syllable-level tone embeddings as fine-grained controls to enhance the controllability of lyrics over melody generation. Then we introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE as coarse-grained, fine-grained and high-fidelity controls, respectively, to the generation process, thereby enabling user control over melody generation. Finally, an in-attention Transformer decoder technique is leveraged to exert fine-grained control over the full-song melody generation with the aforementioned lyric and musical conditions. Experimental results demonstrate that our proposed CSL-L2M outperforms the state-of-the-art models, generating melodies with higher quality, better controllability and enhanced structure.
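
To make the "in-attention" conditioning concrete, here is a minimal single-layer sketch in which a projected condition vector is summed into the hidden states entering a self-attention block, so controls steer every decoder layer rather than only the input embedding. The class name, dimensions, and single-layer structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InAttentionLayer(nn.Module):
    """Minimal sketch of in-attention conditioning of the kind CSL-L2M's
    decoder uses: the (lyric/musical) condition is injected into the
    hidden states at the layer itself (sizes are illustrative)."""

    def __init__(self, d_model=256, n_heads=4, d_cond=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_cond, d_model)   # condition -> hidden size
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h, cond):
        # Add the projected condition to every sequence position before
        # self-attention, then apply a residual connection and norm.
        h = h + self.proj(cond).unsqueeze(1)
        out, _ = self.attn(h, h, h, need_weights=False)
        return self.norm(h + out)

layer = InAttentionLayer()
y = layer(torch.randn(2, 10, 256), torch.randn(2, 64))
```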

ICLR Conference 2025 Conference Paper

GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation

  • Hongyin Zhang 0001
  • Pengxiang Ding
  • Shangke Lyu
  • Ying Peng
  • Donglin Wang

With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and, consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose GEVRM, a novel closed-loop VLA method that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can generate highly expressive future visual planning goals. Simultaneously, we evaluate perturbations by simulating responses, called internal embeddings, which are optimized through prototype contrastive learning. This allows the model to implicitly infer and distinguish perturbations from the external environment. The proposed GEVRM achieves state-of-the-art performance on both standard and perturbed CALVIN benchmarks and shows significant improvements in realistic robot tasks.

IROS Conference 2025 Conference Paper

Integrating Trajectory Optimization and Reinforcement Learning for Quadrupedal Jumping with Terrain-Adaptive Landing

  • Renjie Wang
  • Shangke Lyu
  • Xin Lang
  • Wei Xiao
  • Donglin Wang

Jumping constitutes an essential component of quadruped robots' locomotion capabilities, including dynamic take-off and adaptive landing. Existing quadrupedal jumping studies have mainly focused on the stance and flight phases by assuming a flat landing ground, which is impractical in many real-world cases. This work proposes a safe landing framework that achieves adaptive landing on rough terrains by combining Trajectory Optimization (TO) and Reinforcement Learning (RL). The RL agent learns to track the reference motion generated by TO in environments with rough terrains. To enable the learning of compliant landing skills on challenging terrains, a reward relaxation strategy is synthesized to encourage exploration during the landing recovery period. Extensive experiments validate the accurate tracking and safe landing skills achieved by the proposed method in various scenarios.

AAAI Conference 2025 Conference Paper

MGDA: Model-based Goal Data Augmentation for Offline Goal-conditioned Weighted Supervised Learning

  • Xing Lei
  • Xuetao Zhang
  • Donglin Wang

Recently, a state-of-the-art series of algorithms—Goal-Conditioned Weighted Supervised Learning (GCWSL) methods—has been introduced to address the challenges inherent in offline goal-conditioned reinforcement learning (RL). GCWSL optimizes a lower bound on the goal-conditioned RL objective and has demonstrated exceptional performance across a range of goal-reaching tasks, offering a simple, effective, and stable solution. Nonetheless, research has revealed a critical limitation in GCWSL: the absence of trajectory stitching capabilities. In response, goal data augmentation strategies have been proposed to enhance these methods. However, existing techniques often fail to effectively sample appropriate augmented goals for GCWSL. In this paper, we establish unified principles for goal data augmentation, emphasizing goal diversity, action optimality, and goal reachability. Building on these principles, we propose a Model-based Goal Data Augmentation (MGDA) approach, which leverages a dynamics model to sample more appropriate augmented goals. MGDA uniquely incorporates the local Lipschitz continuity assumption within the learned model to mitigate the effects of compounding errors. Empirical results demonstrate that MGDA significantly improves the performance of GCWSL methods on both state-based and vision-based maze datasets, outperforming previous goal data augmentation techniques in their ability to enhance stitching capabilities.
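
A rough sketch of model-based goal augmentation with a Lipschitz-style rejection test, to illustrate how a learned dynamics model might supply augmented goals while guarding against compounding error. The rollout scheme, threshold rule, and `dynamics` interface are all illustrative assumptions.

```python
import numpy as np

def augment_goals(states, actions, dynamics, horizon=3, lip_const=10.0):
    """Sketch in the spirit of MGDA: roll a learned dynamics model
    forward from dataset states and keep the reached states as augmented
    goals, rejecting rollouts whose step-to-step jumps violate a local
    Lipschitz bound (the rejection rule here is illustrative).

    dynamics : callable (s, a) -> s_next, the learned model
    """
    goals = []
    for s, a in zip(states, actions):
        ok = True
        for _ in range(horizon):                # reuse a for simplicity
            s_next = dynamics(s, a)
            if np.linalg.norm(s_next - s) > lip_const * np.linalg.norm(a):
                ok = False                      # likely compounding error
                break
            s = s_next
        if ok:
            goals.append(s)
    return np.array(goals)

# toy usage with a linear "learned" model
dyn = lambda s, a: 0.9 * s + 0.1 * a
goals = augment_goals(np.random.randn(5, 3), np.random.randn(5, 3), dyn)
```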

ICRA Conference 2025 Conference Paper

MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models

  • Han Zhao 0008
  • Wenxuan Song
  • Donglin Wang
  • Xinyang Tong
  • Pengxiang Ding
  • Xuelian Cheng
  • Zongyuan Ge

Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces mixture of robotic experts (MoRE), a novel vision-language-action (VLA) model for quadruped robots that aims to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.

IROS Conference 2025 Conference Paper

PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding

  • Wenxuan Song
  • Jiayi Chen
  • Pengxiang Ding
  • Han Zhao 0008
  • Wei Zhao
  • Zhide Zhong
  • Zongyuan Ge
  • Zhijun Li

Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The performance of VLA models can be improved by integrating action chunking, a critical technique for effective control. However, action chunking linearly scales up the action dimensions in VLA models as the chunking size increases, reducing inference efficiency. Accelerating VLAs integrated with action chunking is therefore an urgent need. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that our PD-VLA maintains competitive success rates while achieving 2.52× execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
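
The fixed-point reformulation the abstract describes is essentially Jacobi decoding: treat the action chunk y as a fixed point of y = f(y) and refine all positions at once until convergence. The toy `step_fn` interface below is an assumption, not the PD-VLA API.

```python
import numpy as np

def jacobi_decode(step_fn, prefix, chunk_len, max_iters=32):
    """Parallel (Jacobi) decoding sketch for an action chunk: refine a
    guess for every position simultaneously instead of token-by-token.
    `step_fn(prefix, y)` returns the greedy token for every position
    given the current guess y (illustrative interface).
    """
    y = np.zeros(chunk_len, dtype=int)       # arbitrary initial guess
    for _ in range(max_iters):
        y_new = step_fn(prefix, y)           # update all positions at once
        if np.array_equal(y_new, y):         # fixed point reached: matches
            break                            # the autoregressive output
        y = y_new
    return y

# toy model: each token depends on the previous position of the guess
toy = lambda p, y: (np.concatenate([[p[-1]], y[:-1]]) + 1) % 7
print(jacobi_decode(toy, np.array([3]), chunk_len=4))
```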

ICRA Conference 2025 Conference Paper

Quart-Online: Latency-Free Multimodal Large Language Model for Quadruped Robot Learning

  • Xinyang Tong
  • Pengxiang Ding
  • Yiguo Fan
  • Donglin Wang
  • Wenjie Zhang
  • Can Cui 0008
  • Mingyang Sun
  • Han Zhao 0008

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference at 50 Hz in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
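
One simple way to realize action chunk discretization is a k-means codebook over flattened action chunks, so each continuous chunk maps to a single discrete token. This is an illustrative stand-in; the paper's exact compression scheme may differ.

```python
import numpy as np

def fit_acd_codebook(chunks, n_codes=64, iters=20, seed=0):
    """Sketch of Action Chunk Discretization via plain k-means: compress
    flattened action chunks onto a small codebook of representative
    vectors (illustrative, not the paper's exact procedure).

    chunks : (N, chunk_len * action_dim) flattened action chunks
    """
    rng = np.random.default_rng(seed)
    codes = chunks[rng.choice(len(chunks), n_codes, replace=False)]
    for _ in range(iters):
        d = ((chunks[:, None, :] - codes[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_codes):
            if (assign == k).any():
                codes[k] = chunks[assign == k].mean(0)
    return codes

def encode(chunk, codes):
    """Continuous chunk -> discrete token id (nearest codebook entry)."""
    return ((codes - chunk) ** 2).sum(-1).argmin()

codes = fit_acd_codebook(np.random.randn(256, 12), n_codes=8)
token = encode(np.random.randn(12), codes)
```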

ICML Conference 2025 Conference Paper

ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

  • Hongyin Zhang 0001
  • Zifeng Zhuang
  • Han Zhao 0008
  • Pengxiang Ding
  • Hongchao Lu
  • Donglin Wang

Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. The dense return prediction capability enables the robot to generate more robust decision-making actions, oriented towards maximizing future benefits. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.

ICML Conference 2025 Conference Paper

Rethinking Latent Redundancy in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation

  • Shuanghao Bai
  • Wanqi Zhou
  • Pengxiang Ding
  • Wei Zhao
  • Donglin Wang
  • Badong Chen

Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks show consistent performance improvements with IB across various settings, underscoring the importance of reducing input data redundancy and highlighting its practical value for real-world applications.
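
Since the abstract describes applying the Information Bottleneck to behavior cloning, a minimal variational-IB head makes the objective concrete: compress the observation encoding into a stochastic latent z and penalize its KL to a standard normal prior. The sizes, Gaussian prior, and MSE action loss are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IBPolicy(nn.Module):
    """Minimal variational-IB behavior-cloning head: a stochastic latent
    z carries only the task-relevant part of the observation, with a KL
    penalty discouraging redundant information (illustrative sizes)."""

    def __init__(self, obs_dim=32, z_dim=8, act_dim=4):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * z_dim)    # -> (mu, log_var)
        self.dec = nn.Linear(z_dim, act_dim)

    def loss(self, obs, act, beta=1e-3):
        mu, log_var = self.enc(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparam.
        bc = ((self.dec(z) - act) ** 2).mean()                 # BC term
        kl = 0.5 * (mu**2 + log_var.exp() - 1 - log_var).sum(-1).mean()
        return bc + beta * kl                                  # IB objective

model = IBPolicy()
print(model.loss(torch.randn(16, 32), torch.randn(16, 4)))
```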

ICML Conference 2025 Conference Paper

Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

  • Mingyang Sun
  • Pengxiang Ding
  • Weinan Zhang 0001
  • Donglin Wang

Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR’s superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning.
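
To illustrate "Q-function as transport cost", here is a toy entropic-OT (Sinkhorn) solver that produces a coupling between states and actions with cost = -Q(s, a). Uniform marginals and the regularization strength are simplifying assumptions; OTPR's actual masked-OT machinery is more involved.

```python
import numpy as np

def sinkhorn_plan(q_values, eps=0.5, iters=200):
    """Toy Sinkhorn solver for a transport plan with cost -Q(s, a),
    loosely mirroring OTPR's use of the Q-function as a transport cost
    (uniform marginals and entropic regularization are assumptions).

    q_values : (n_states, n_actions) Q estimates
    """
    K = np.exp(q_values / eps)             # cost -Q => kernel exp(Q/eps)
    a = np.full(K.shape[0], 1.0 / K.shape[0])   # uniform state marginal
    b = np.full(K.shape[1], 1.0 / K.shape[1])   # uniform action marginal
    u = np.ones_like(a)
    for _ in range(iters):                 # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]     # transport plan

plan = sinkhorn_plan(np.random.randn(4, 3))
print(plan.sum())                          # ~1.0: a valid coupling
```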

NeurIPS Conference 2025 Conference Paper

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

  • Yang Liu
  • Ming Ma
  • Xiaomin Yu
  • Pengxiang Ding
  • Han Zhao
  • Mingyang Sun
  • Siteng Huang
  • Donglin Wang

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose a novel Spatial Sense and Reasoning method, dubbed SSR, a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Project page: https: //yliu-cs. github. io/SSR.

ICML Conference 2025 Conference Paper

Stay Hungry, Keep Learning: Sustainable Plasticity for Deep Reinforcement Learning

  • Huaicheng Zhou
  • Zifeng Zhuang
  • Donglin Wang

The integration of Deep Neural Networks in Reinforcement Learning (RL) systems has led to remarkable progress in solving complex tasks but also introduced challenges like primacy bias and dead neurons. Primacy bias skews learning towards early experiences, while dead neurons diminish the network's capacity to acquire new knowledge. Traditional reset mechanisms aimed at addressing these issues often involve maintaining large replay buffers to train new networks or selectively resetting subsets of neurons. However, these approaches either incur prohibitive computational costs or reset network parameters without ensuring stability through recovery mechanisms, ultimately impairing learning efficiency. In this work, we introduce the novel concept of neuron regeneration, which combines reset mechanisms with knowledge recovery techniques. We also propose a new framework called Sustainable Backup Propagation (SBP) that effectively maintains plasticity in neural networks through this neuron regeneration process. The SBP framework achieves whole-network neuron regeneration through two key procedures: cycle reset and inner distillation. Cycle reset involves a scheduled renewal of neurons, while inner distillation functions as a knowledge recovery mechanism at the neuron level. To validate our framework, we integrate SBP with Proximal Policy Optimization (PPO) and propose a novel distillation function for inner distillation. This integration results in Plastic PPO (P3O), a new algorithm that enables efficient cyclic regeneration of all neurons in the actor network. Extensive experiments demonstrate that the approach effectively maintains policy plasticity and improves sample efficiency in reinforcement learning.
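
A hedged sketch of the cycle-reset-plus-inner-distillation loop on a single linear layer: reinitialize a scheduled slice of units, then distill the frozen pre-reset outputs back into the layer so knowledge is recovered. The slice choice, MSE distillation target, and optimizer settings are illustrative, not the paper's distillation function.

```python
import torch
import torch.nn as nn

def regenerate_neurons(layer, data, frac=0.25, steps=50, lr=1e-2):
    """Neuron-regeneration sketch in the spirit of SBP: cycle reset of a
    subset of units followed by inner distillation from a frozen
    snapshot of the layer (all details illustrative)."""
    teacher = nn.Linear(layer.in_features, layer.out_features)
    teacher.load_state_dict(layer.state_dict())          # frozen snapshot
    n_reset = int(layer.out_features * frac)
    with torch.no_grad():                                # cycle reset
        nn.init.kaiming_uniform_(layer.weight[:n_reset])
        layer.bias[:n_reset].zero_()
    opt = torch.optim.Adam(layer.parameters(), lr=lr)
    for _ in range(steps):                               # inner distillation
        loss = ((layer(data) - teacher(data).detach()) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

lin = nn.Linear(16, 32)
regenerate_neurons(lin, torch.randn(64, 16))
```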

ICLR Conference 2025 Conference Paper

VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation

  • Wei Zhao
  • Pengxiang Ding
  • Min Zhang 0068
  • Zhefei Gong
  • Shuanghao Bai
  • Han Zhao 0008
  • Donglin Wang

Vision-language-action models (VLAs) have recently become highly prevalent in robot manipulation due to their end-to-end architecture and impressive performance. However, current VLAs are limited to processing human instructions in textual form, neglecting the more natural speech modality for human interaction. A typical approach of incorporating the speech modality into a VLA necessitates a separate speech recognition system to transcribe spoken instructions into text. Such a cascading pipeline raises two major concerns for robotic systems. First, the entire model grows in size and complexity, potentially resulting in redundant computations and increased memory consumption. Second, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which is crucial for a robot to successfully understand and complete customized tasks. To this end, we propose VLAS, the first end-to-end policy model that seamlessly integrates speech modality for robot manipulation. We present a three-stage speech instruction tuning strategy leveraging multimodal datasets, including our manually curated SQA and CSI datasets. Furthermore, to facilitate personalized operations, we develop a voice retrieval-augmented generation (RAG) approach to enhance the robot's performance in tasks requiring individual-specific knowledge. Experimental results show that the proposed VLAS, following either textual or speech instructions, can achieve performance comparable to traditional VLAs on the CALVIN benchmark. In addition, we created a benchmark consisting of customization tasks, where our VLAS demonstrates absolute superiority by fully leveraging the auxiliary information in speech.

AAAI Conference 2024 Conference Paper

Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning

  • Jinxin Liu
  • Ziqi Zhang
  • Zhenyu Wei
  • Zifeng Zhuang
  • Yachen Kang
  • Sibo Gai
  • Donglin Wang

Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed data. Although avoiding the time-consuming online interactions in RL, it poses challenges for out-of-distribution (OOD) state actions and often suffers from data inefficiency for training. Despite many efforts being devoted to addressing OOD state actions, the latter (data inefficiency) receives little attention in offline RL. To address this, this paper proposes the cross-domain offline RL, which assumes offline data incorporate additional source-domain data from varying transition dynamics (environments), and expects it to contribute to the offline data efficiency. To do so, we identify a new challenge of OOD transition dynamics, beyond the common OOD state actions issue, when utilizing cross-domain offline data. Then, we propose our method BOSA, which employs two support-constrained objectives to address the above OOD issues. Through extensive experiments in the cross-domain offline RL setting, we demonstrate BOSA can greatly improve offline data efficiency: using only 10% of the target data, BOSA could achieve 74.4% of the SOTA offline RL performance that uses 100% of the target data. Additionally, we also show BOSA can be effortlessly plugged into model-based offline RL and noising data augmentation techniques (used for generating source-domain data), which naturally avoids the potential dynamics mismatch between target-domain data and newly generated source-domain data.

ICML Conference 2024 Conference Paper

DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation

  • Jinxin Liu
  • Xinghong Guo
  • Zifeng Zhuang
  • Donglin Wang

In this paper, we propose a novel approach called DIffusion-guided DIversity (DIDI) for offline behavioral generation. The goal of DIDI is to learn a diverse set of skills from a mixture of label-free offline data. We achieve this by leveraging diffusion probabilistic models as priors to guide the learning process and regularize the policy. By optimizing a joint objective that incorporates diversity and diffusion-guided regularization, we encourage the emergence of diverse behaviors while maintaining the similarity to the offline data. Experimental results in four decision-making domains (Push, Kitchen, Humanoid, and D4RL tasks) show that DIDI is effective in discovering diverse and discriminative skills. We also introduce skill stitching and skill interpolation, which highlight the generalist nature of the learned skill space. Further, by incorporating an extrinsic reward function, DIDI enables reward-guided behavior generation, facilitating the learning of diverse and optimal behaviors from sub-optimal data.

AAAI Conference 2024 Conference Paper

Expressive Forecasting of 3D Whole-Body Human Motions

  • Pengxiang Ding
  • Qiongjie Cui
  • Haofan Wang
  • Min Zhang
  • Mengyuan Liu
  • Donglin Wang

Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on foretelling the major joints of the human body without considering the delicate movements of the human hands. In practical applications, hand gestures play an important role in human communication with the real world and express the primary intention of human beings. In this work, we are the first to formulate the whole-body human pose forecasting task, which jointly predicts future body and gesture activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at https://github.com/Dingpx/EAI.

IROS Conference 2024 Conference Paper

GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot

  • Wenxuan Song
  • Han Zhao 0008
  • Pengxiang Ding
  • Can Cui 0008
  • Shangke Lyu
  • Yaning Fan
  • Donglin Wang

Multi-task robot learning holds significant importance in tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and difficulties in collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We utilize offline reinforcement learning to optimize data utilization strategies to learn from both demonstrations and sub-optimal data, thus surpassing the limitations of human demonstrations. Thereafter, we employ a transformer-based VLA network to process multi-modal inputs and output actions. By introducing the Mixture-of-Experts structure, GeRM allows faster inference speed with higher whole-model capacity, and thus resolves the issue of limited RL parameters, enhancing model performance in multi-task learning while controlling computational costs. Through a series of experiments, we demonstrate that GeRM outperforms other methods across all tasks, while also validating its efficiency in both training and inference processes. Additionally, we uncover its potential to acquire emergent skills. We also contribute the QUARD-Auto dataset, collected automatically to support our training approach and foster advancements in multi-task quadruped robot learning. This work presents a new paradigm for reducing the cost of collecting robot data and driving progress in the multi-task learning community. You can reach our project and video through the link: https://songwxuan.github.io/GeRM/.

AAAI Conference 2024 Conference Paper

Prompt-Based Distribution Alignment for Unsupervised Domain Adaptation

  • Shuanghao Bai
  • Min Zhang
  • Wanqi Zhou
  • Siteng Huang
  • Zhirong Luan
  • Donglin Wang
  • Badong Chen

Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.

ICML Conference 2024 Conference Paper

Reinformer: Max-Return Sequence Modeling for Offline RL

  • Zifeng Zhuang
  • Dengyun Peng
  • Jinxin Liu
  • Ziqi Zhang
  • Donglin Wang

As a data-driven paradigm, offline reinforcement learning (RL) has been formulated as sequence modeling that conditions on hindsight information including returns, goals, or future trajectories. Although promising, this supervised paradigm overlooks the core objective of RL: maximizing the return. This oversight directly leads to the lack of trajectory stitching capability that affects the sequence model's ability to learn from sub-optimal data. In this work, we introduce the concept of max-return sequence modeling, which integrates the goal of maximizing returns into existing sequence models. We propose Reinforced Transformer (Reinformer), indicating that the sequence model is reinforced by the RL objective. Reinformer additionally incorporates the objective of maximizing returns in the training phase, aiming to predict the maximum future return within the distribution. During inference, this in-distribution maximum return guides the selection of optimal actions. Empirically, Reinformer is competitive with classical RL methods on the D4RL benchmark and outperforms state-of-the-art sequence models, particularly in trajectory stitching ability. Code is public at https://github.com/Dragon-Zhuang/Reinformer.
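
One plausible way to "predict the maximum future return within the distribution" is expectile regression: with an expectile level near 1, under-predictions of the return are penalized far more than over-predictions, so the model learns an upper expectile of the return distribution. The loss below is a standard expectile objective offered as an illustration; the value of `tau` is an assumption.

```python
import torch

def expectile_loss(pred_return, target_return, tau=0.99):
    """Expectile regression toward an in-distribution maximum return,
    the kind of objective max-return sequence modeling relies on
    (tau near 1 upweights under-predictions; value is illustrative)."""
    diff = target_return - pred_return
    # weight = tau when the prediction is below the target, else 1 - tau
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

print(expectile_loss(torch.randn(8), torch.randn(8)))
```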

IROS Conference 2023 Conference Paper

A Composite Control Strategy for Quadruped Robot by Integrating Reinforcement Learning and Model-Based Control

  • Shangke Lyu
  • Han Zhao 0008
  • Donglin Wang

Locomotion in the wild requires the quadruped robot to have strong capabilities in adaptation and robustness. Deep reinforcement learning (DRL) exhibits huge potential in environmental adaptability, while its stability issues remain open. On the other hand, the quadruped robot dynamic model contains a great deal of useful information that is beneficial to robust control. Combining DRL with model-based control may take advantage of both strengths and holds promise for better robustness. In this paper, DRL and the proposed model-based controller are firmly integrated in a novel manner such that the model-based controller is able to rectify the gait commands generated by DRL based on the system dynamic model, so as to enhance the robustness of the quadruped robot against external disturbances. Besides, a potential energy function is introduced to achieve compliant contact. The stability of the proposed method is ensured in terms of passivity analysis. Several physical experiments are carried out to verify the performance of the proposed method.

ICLR Conference 2023 Conference Paper

Behavior Proximal Policy Optimization

  • Zifeng Zhuang
  • Kun Lei
  • Jinxin Liu
  • Donglin Wang
  • Yilang Guo

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we reach a surprising conclusion that online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what the offline RL method needs to overcome the overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization introduced compared to PPO. Extensive experiments on the D4RL benchmark empirically show this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.
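
The core claim is that PPO's clipping is conservatism enough for offline RL. A minimal sketch of the corresponding loss: a standard clipped surrogate where the "old" policy is the behavior policy estimated from the offline dataset. The function interface is illustrative.

```python
import torch

def bppo_loss(logp_new, logp_behavior, advantage, clip=0.2):
    """Sketch of the BPPO idea: PPO's clipped surrogate computed against
    the (estimated) behavior policy, with no extra offline regularizer
    (interface is illustrative).

    logp_new, logp_behavior : (B,) log-probs of dataset actions under the
                              current and behavior policies
    advantage               : (B,) advantage estimates
    """
    ratio = torch.exp(logp_new - logp_behavior)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantage
    # Pessimistic minimum, exactly as in online PPO.
    return -torch.minimum(unclipped, clipped).mean()

loss = bppo_loss(torch.randn(32), torch.randn(32), torch.randn(32))
```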

ICML Conference 2023 Conference Paper

Beyond Reward: Offline Preference-guided Policy Optimization

  • Yachen Kang
  • Diyuan Shi
  • Jinxin Liu
  • Li He
  • Donglin Wang

This study focuses on the topic of offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories to extract the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve using preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to be an information bottleneck of the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023.

NeurIPS Conference 2023 Conference Paper

CEIL: Generalized Contextual Imitation Learning

  • Jinxin Liu
  • Li He
  • Yachen Kang
  • Zifeng Zhuang
  • Donglin Wang
  • Huazhe Xu

In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy using the hindsight embeddings. To achieve the expert matching objective for IL, we advocate for optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings including: 1) learning from observations (LfO), 2) offline IL, 3) cross-domain IL (mismatched experts), and 4) one-shot IL settings. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performances in offline tasks.

NeurIPS Conference 2023 Conference Paper

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

  • Jinxin Liu
  • Hongyin Zhang
  • Zifeng Zhuang
  • Yachen Kang
  • Donglin Wang
  • Bin Wang

In this work, we decouple the iterative bi-level offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm and avoiding the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: (q1) What information should we transfer from the inner-level to the outer-level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP (design from policies), which fully answers the above questions. Specifically, in the inner-level, DROP decomposes offline data into multiple subsets, and learns an MBO score model (a1). To keep safe exploitation to the score model in the outer-level, we explicitly learn a behavior embedding and introduce a conservative regularization (a2). During testing, we show that DROP permits deployment adaptation, enabling an adaptive inference across states (a3). Empirically, we evaluate DROP on various tasks, showing that DROP gains comparable or better performance compared to prior methods.

ECAI Conference 2023 Conference Paper

OER: Offline Experience Replay for Continual Offline Reinforcement Learning

  • Sibo Gai
  • Donglin Wang
  • Li He

The capability of continuously learning new skills via a sequence of pre-collected offline datasets is desirable for an agent. However, consecutively learning a sequence of offline tasks likely leads to catastrophic forgetting under resource-limited scenarios. In this paper, we formulate a new setting, continual offline reinforcement learning (CORL), where an agent learns a sequence of offline reinforcement learning tasks and pursues good performance on all learned tasks with a small replay buffer, without exploring any of the environments of the sequential tasks. To consistently learn all sequential tasks, an agent must acquire new knowledge while preserving old knowledge in an offline manner. To this end, we introduce continual learning algorithms and experimentally find experience replay (ER) to be the most suitable algorithm for the CORL problem. However, we observe that introducing ER into CORL encounters a new distribution shift problem: the mismatch between the experiences in the replay buffer and trajectories from the learned policy. To address this issue, we propose a new model-based experience selection (MBES) scheme to build the replay buffer, where a transition model is learned to approximate the state distribution. The model bridges the distribution bias between the replay buffer and the learned policy by selecting for storage the offline data that most closely resembles the learned model. Moreover, to enhance the ability to learn new tasks, we retrofit the experience replay method with a new dual behavior cloning (DBC) architecture to avoid the disturbance of the behavior-cloning loss on the Q-learning process. We call our overall algorithm offline experience replay (OER). Extensive experiments demonstrate that OER outperforms SOTA baselines in widely-used MuJoCo environments.

ICLR Conference 2022 Conference Paper

DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning

  • Jinxin Liu
  • Hongyin Zhang 0001
  • Donglin Wang

Offline reinforcement learning algorithms promise to be applicable in settings where a fixed dataset is available and no new experience can be acquired. However, such formulation is inevitably offline-data-hungry and, in practice, collecting a large offline dataset for one specific task over one specific environment is also costly and laborious. In this paper, we thus 1) formulate the offline dynamics adaptation by using (source) offline data collected from another dynamics to relax the requirement for the extensive (target) offline data, 2) characterize the dynamics shift problem in which prior offline methods do not scale well, and 3) derive a simple dynamics-aware reward augmentation (DARA) framework from both model-free and model-based offline settings. Specifically, DARA emphasizes learning from those source transition pairs that are adaptive for the target environment and mitigates the offline dynamics shift by characterizing state-action-next-state pairs instead of the typical state-action distribution sketched by prior offline RL methods. The experimental evaluation demonstrates that DARA, by augmenting rewards in the source offline dataset, can acquire an adaptive policy for the target environment and yet significantly reduce the requirement of target offline data. With only modest amounts of target offline data, our method consistently outperforms prior offline RL methods in both simulated and real-world tasks.
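
The reward-augmentation idea reduces to shifting each source-domain reward by how much the transition's dynamics agree with the target domain. A one-line sketch, with the sign convention, the weight `eta`, and the log-likelihood inputs as illustrative assumptions (DARA estimates the ratio with domain classifiers rather than explicit likelihoods):

```python
import numpy as np

def dara_reward(r, logp_target, logp_source, eta=1.0):
    """Sketch of dynamics-aware reward augmentation: reward a source
    transition more when the target dynamics find it likely, penalize it
    when they do not (estimator, sign convention, and eta illustrative).

    logp_target, logp_source : log p(s'|s,a) under each domain's dynamics
    """
    return r + eta * (logp_target - logp_source)

# transition less likely under target dynamics -> reward is discounted
print(dara_reward(1.0, logp_target=-2.3, logp_source=-1.1))
```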

IJCAI Conference 2022 Conference Paper

End-to-End Open-Set Semi-Supervised Node Classification with Out-of-Distribution Detection

  • Tiancheng Huang
  • Donglin Wang
  • Yuan Fang
  • Zhengyu Chen

Out-Of-Distribution (OOD) samples are prevalent in real-world applications. The OOD issue becomes even more severe on graph data, as the effect of OOD nodes can be potentially amplified by propagation through the graph topology. Recent works have considered the OOD detection problem, which is critical for reducing the uncertainty in learning and improving the robustness. However, no prior work considers OOD detection and node classification on graphs simultaneously in an end-to-end manner. In this paper, we study a novel problem of end-to-end open-set semi-supervised node classification (OSSNC) on graphs, which deals with node classification in the presence of OOD nodes. Given the lack of supervision on OOD nodes, we introduce a latent variable to indicate in-distribution or OOD nodes in a variational inference framework, and further propose a novel algorithm named Learning to Mix Neighbors (LMN), which learns to dampen the influence of OOD nodes through message passing in typical graph neural networks. Extensive experiments on various datasets show that the proposed method outperforms state-of-the-art baselines in terms of both node classification and OOD detection.

AAAI Conference 2022 Conference Paper

Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning

  • Jinxin Liu
  • Donglin Wang
  • Qiangxing Tian
  • Zhengyu Chen

It is important for an agent to autonomously explore the environment and learn a widely applicable, general-purpose goal-conditioned policy that can achieve diverse goals, including images and text descriptions. Considering such perceptually-specific goals, one natural approach is to reward the agent with a prior non-parametric distance over the embedding spaces of states and goals. However, this may be infeasible in some situations, either because it is unclear how to choose a suitable distance measure, or because embedding (heterogeneous) goals and states is non-trivial. The key insight of this work is that we introduce a latent-conditioned policy to provide goals and intrinsic rewards for learning the goal-conditioned policy. Instead of directly scoring current states with regard to goals, we obtain rewards by scoring current states against associated latent variables. We theoretically characterize the connection between our unsupervised objective and the multi-goal setting, and empirically demonstrate the effectiveness of our proposed method, which substantially outperforms prior techniques in a variety of tasks.
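The "score current states against associated latent variables" step resembles standard skill-discovery intrinsic rewards, where a discriminator q(z|s) replaces an explicit goal distance. A minimal sketch, assuming a discrete latent code and an assumed `discriminator` network (the paper's exact objective may differ):

```python
import torch.nn.functional as F

def intrinsic_reward(discriminator, state, z):
    """discriminator(state) is an assumed network returning logits over
    discrete latent codes; z is an int64 tensor of shape [B]."""
    logits = discriminator(state)                          # [B, n_latents]
    log_q = F.log_softmax(logits, dim=-1)                  # log q(z | s)
    return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)   # per-state reward
```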

AAAI Conference 2022 Short Paper

Learning to Evolve on Dynamic Graphs (Student Abstract)

  • Xintao Xiang
  • Tiancheng Huang
  • Donglin Wang

Representation learning on dynamic graphs is a challenging problem because the graph topology and node features vary over time. This requires the model to effectively capture both graph topology information and temporal information. Most existing works are built on recurrent neural networks (RNNs), which are used to extract the temporal information of dynamic graphs, and thus they inherit the drawbacks of RNNs. In this paper, we propose Learning to Evolve on Dynamic Graphs (LEDG), a novel algorithm that jointly learns graph information and time information. Specifically, our approach utilizes gradient-based meta-learning to learn updating strategies that generalize better than RNNs across snapshots. It is model-agnostic and can thus train any message-passing-based graph neural network (GNN) on dynamic graphs. To enhance the representation power, we disentangle the embeddings into time embeddings and graph intrinsic embeddings. We conduct experiments on various datasets and downstream tasks, and the experimental results validate the effectiveness of our method.
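A MAML-style reading of "learning updating strategies over snapshots" is to adapt on snapshot t and evaluate the adapted weights on snapshot t+1. The sketch below uses PyTorch's `torch.func` utilities; the inner-loop form and hyperparameters are assumptions, not the paper's exact procedure.

```python
from torch.func import functional_call, grad

def meta_objective(gnn, params, snapshots, targets, loss_fn, inner_lr=0.01):
    """params = dict(gnn.named_parameters()); snapshots/targets are lists
    of consecutive graph inputs and labels. Adapt on snapshot t, then
    evaluate the adapted weights on snapshot t + 1."""
    total = 0.0
    for t in range(len(snapshots) - 1):
        def inner_loss(p):
            return loss_fn(functional_call(gnn, p, (snapshots[t],)), targets[t])
        grads = grad(inner_loss)(params)                           # inner-loop gradients
        adapted = {k: params[k] - inner_lr * grads[k] for k in params}
        pred = functional_call(gnn, adapted, (snapshots[t + 1],))  # next snapshot
        total = total + loss_fn(pred, targets[t + 1])              # meta loss term
    return total  # backpropagate this to meta-update the initialization
```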

AAAI Conference 2021 Conference Paper

A General Offline Reinforcement Learning Framework for Interactive Recommendation

  • Teng Xiao
  • Donglin Wang

This paper studies the problem of learning interactive recommender systems from logged feedback without any exploration in online environments. We address the problem by proposing a general offline reinforcement learning framework for recommendation, which enables maximizing cumulative user rewards without online exploration. Specifically, we first introduce a probabilistic generative model for interactive recommendation, and then propose an effective inference algorithm for discrete and stochastic policy learning based on logged feedback. To perform offline learning more effectively, we propose five approaches to minimize the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation. We conduct extensive experiments on two public real-world datasets, demonstrating that the proposed methods achieve superior performance over existing supervised learning and reinforcement learning methods for recommendation.
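As a flavor of one of the five mismatch-reduction strategies, here is a hedged sketch of supervised regularization: a value-maximization term combined with a behavior-cloning term toward the logged actions. The weighting `lam` and the exact objective are assumptions.

```python
import torch.nn.functional as F

def constrained_policy_loss(logits, q_values, logged_actions, lam=0.5):
    """logits: [B, n_items] policy scores; q_values: [B, n_items]
    estimated returns; logged_actions: [B] items shown by the logging
    policy. The mixing weight is illustrative."""
    log_probs = F.log_softmax(logits, dim=-1)
    rl_term = -(log_probs.exp() * q_values).sum(dim=-1).mean()  # maximize expected Q
    bc_term = F.nll_loss(log_probs, logged_actions)             # stay near logged feedback
    return rl_term + lam * bc_term
```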

AAAI Conference 2021 Conference Paper

Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition

  • Siteng Huang
  • Min Zhang
  • Yachen Kang
  • Donglin Wang

The purpose of few-shot recognition is to recognize novel categories with a limited number of labeled examples in each class. To encourage learning from a supplementary view, recent approaches have introduced auxiliary semantic modalities into effective metric-learning frameworks that aim to learn a feature similarity between training samples (support set) and test samples (query set). However, these approaches only augment the representations of samples with available semantics while ignoring the query set, which forfeits potential improvement and may lead to a shift between the combined-modality representation and the pure-visual representation. In this paper, we devise an attributes-guided attention module (AGAM) that utilizes human-annotated attributes to learn more discriminative features. This plug-and-play module enables visual contents and corresponding attributes to collectively focus on important channels and regions for the support set. Feature selection is also achieved for the query set using only visual information, since attributes are not available there. Representations from both sets are therefore improved in a fine-grained manner. Moreover, an attention alignment mechanism is proposed to distill knowledge from the attribute guidance into the pure-visual branch for samples without attributes. Extensive experiments and analysis show that our proposed module can significantly improve simple metric-based approaches, achieving state-of-the-art performance on different datasets and settings.
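The two-branch design with an alignment term can be sketched as a channel-attention module: an attribute-guided branch for the support set, a pure-visual branch for the query set, and an MSE alignment distilling the former into the latter. All layer shapes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeGuidedChannelAttention(nn.Module):
    """Illustrative attributes-guided channel attention with an
    alignment loss; not the paper's exact architecture."""
    def __init__(self, channels, attr_dim):
        super().__init__()
        self.guided = nn.Linear(channels + attr_dim, channels)
        self.visual = nn.Linear(channels, channels)

    def forward(self, feat, attrs=None):
        # feat: [B, C, H, W]; attrs: [B, attr_dim] or None (query set)
        pooled = feat.mean(dim=(2, 3))                     # [B, C]
        att_v = torch.sigmoid(self.visual(pooled))         # pure-visual attention
        if attrs is None:
            return feat * att_v[:, :, None, None], None
        att_g = torch.sigmoid(self.guided(torch.cat([pooled, attrs], -1)))
        align = F.mse_loss(att_v, att_g.detach())          # align visual to guided branch
        return feat * att_g[:, :, None, None], align
```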

AAAI Conference 2021 Conference Paper

Deep Transfer Tensor Decomposition with Orthogonal Constraint for Recommender Systems

  • Zhengyu Chen
  • Ziqing Xu
  • Donglin Wang

Tensor decomposition is one of the most effective techniques for multi-criteria recommendation. However, it suffers from data sparsity when dealing with three-dimensional (3D) user-item-criterion ratings. To mitigate this issue, we consider effectively incorporating side information and cross-domain knowledge into tensor decomposition. A deep transfer tensor decomposition (DTTD) method is proposed by integrating a deep structure with Tucker decomposition, where an orthogonal-constrained stacked denoising autoencoder (OC-SDAE) is proposed to alleviate scale variation in learning effective latent representations, and the side information is incorporated as a compensation for tensor sparsity. Tucker decomposition generates users' and items' latent factors to connect with the OC-SDAEs and creates a common core tensor to bridge different domains. A cross-domain alignment algorithm (CDAA) is proposed to solve the rotation issue between the two core tensors in the source and target domains. Experiments show that DTTD outperforms state-of-the-art related works.
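The "orthogonal constraint" on the autoencoder can be realized, for instance, as a soft penalty added to the reconstruction loss; the standard Frobenius-norm form below is one plausible instantiation, not necessarily the paper's exact constraint.

```python
import torch

def orthogonality_penalty(weight):
    """Soft orthogonality constraint ||W^T W - I||_F^2 on an encoder
    weight matrix W of shape [out_dim, in_dim]."""
    gram = weight.t() @ weight
    eye = torch.eye(gram.shape[0], device=weight.device)
    return ((gram - eye) ** 2).sum()
```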

IROS Conference 2021 Conference Paper

Hierarchical Terrain-Aware Control for Quadrupedal Locomotion by Combining Deep Reinforcement Learning and Optimal Control

  • Qingfeng Yao
  • Jilong Wang 0008
  • Donglin Wang
  • Shuyu Yang
  • Hongyin Zhang 0001
  • Yinuo Wang 0001
  • Zhengqing Wu

Quadruped robots possess advantages over other types of mobile robots on varied terrains by virtue of their flexible choice of foothold points. It is crucial to integrate terrain perception with motion planning to exploit the potential of quadruped robots. We propose a novel hierarchical terrain-aware control (HTC) framework, which leverages deep reinforcement learning (DRL) for the high-level planner and optimal control for the low-level controller. In general, traditional control methods yield better stability by using an optimization algorithm, while DRL offers more adaptive behavior. Our approach makes full use of the advantages of both and achieves better adaptability and stability in challenging natural environments. The global height map of the terrain serves as visual input to the DRL planner, which determines the desired footholds for the robot's leg swings and its body posture. Optimal control then calculates the joint torques of the standing legs to maintain body balance. Our method is tested on various terrains in both simulated and real environments. The experimental results show that HTC can effectively enhance the adaptability of the quadruped robot by coordinating body posture.
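The hierarchy reduces to a simple control-cycle split; the sketch below is purely illustrative, with `planner` and `solver` as placeholder callables standing in for the DRL policy and the optimal-control solver.

```python
def control_step(height_map, robot_state, planner, solver):
    """One illustrative control cycle: a learned high-level planner
    picks footholds and posture from terrain; an optimization-based
    low-level controller turns them into joint torques."""
    footholds, posture = planner(height_map, robot_state)  # DRL high level
    torques = solver(robot_state, footholds, posture)      # optimal control low level
    return torques
```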

IROS Conference 2021 Conference Paper

Terrain-Aware Risk-Assessment-Network-Aided Deep Reinforcement Learning for Quadrupedal Locomotion in Tough Terrain

  • Hongyin Zhang 0001
  • Jilong Wang 0008
  • Zhengqing Wu
  • Yinuo Wang 0001
  • Donglin Wang

When it comes to the control of quadruped robots, deep reinforcement learning (DRL) is considered a promising solution. Despite years of development in this field, difficulties remain in guaranteeing the action stability of DRL-based quadruped locomotion, especially on tough terrain. In this paper, a terrain-aware teacher-student controller integrating a risk assessment network (RAN) is proposed to alleviate this problem. During the training phase, the RAN evaluates the risk level of historical observations or the current state and guides the policy update, thereby assisting the policy in selecting better actions and avoiding risky ones. Furthermore, a real-time elevation map is transmitted to the controller as visual information, so that it can perceive the terrain and produce higher-performance locomotion. With this configuration, we enable a robot to traverse various challenging terrains in simulation and to bound or trot stably in the real environment.
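One speculative way to render "the RAN guides the update of the policy" in code is to weight per-state policy losses by a learned risk score; the weighting rule below is an assumption, not the paper's exact mechanism.

```python
import torch

def risk_guided_loss(per_state_loss, risk_net, states, beta=0.1):
    """per_state_loss: [B] policy losses; risk_net(states) -> [B, 1]
    raw risk scores. Updates emphasize avoiding high-risk states."""
    risk = torch.sigmoid(risk_net(states)).squeeze(-1)  # [B] in (0, 1)
    return (per_state_loss * (1.0 + beta * risk)).mean()
```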

NeurIPS Conference 2021 Conference Paper

Unsupervised Domain Adaptation with Dynamics-Aware Rewards in Reinforcement Learning

  • Jinxin Liu
  • Hao Shen
  • Donglin Wang
  • Yachen Kang
  • Qiangxing Tian

Unsupervised reinforcement learning aims to acquire skills without prior goal representations: an agent automatically explores an open-ended environment to represent goals and learn a goal-conditioned policy. However, this procedure is often time-consuming, limiting rollouts in potentially expensive target environments. The intuitive approach of training in another interaction-rich environment disrupts the reproducibility of trained skills in the target environment due to dynamics shifts, and thus inhibits direct transfer. Assuming free access to a source environment, we propose an unsupervised domain adaptation method to identify and acquire skills across dynamics. In particular, we introduce a KL-regularized objective to encourage the emergence of skills, rewarding the agent both for discovering skills and for aligning its behaviors with respect to dynamics shifts. That is, both the source and target dynamics shape the reward to facilitate the learning of adaptive skills. We also conduct empirical experiments to demonstrate that our method can effectively learn skills that can be smoothly deployed in the target environment.
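A two-part reward captures the intuition: a skill-discriminability term plus a dynamics-alignment term. The sketch below assumes a discrete skill code z, an assumed discriminator, and a target/source log-likelihood ratio for alignment; both forms are guesses at the spirit of the objective, not its exact derivation.

```python
import torch.nn.functional as F

def adaptive_skill_reward(disc_logits, z, logp_target, logp_source, alpha=1.0):
    """disc_logits: [B, n_skills]; z: [B] int64 skill codes;
    logp_target / logp_source: [B] transition log-likelihoods under
    learned target/source dynamics models (assumed inputs)."""
    skill_term = F.log_softmax(disc_logits, -1).gather(
        -1, z.unsqueeze(-1)).squeeze(-1)        # encourage distinct skills
    align_term = logp_target - logp_source      # penalize non-transferable behavior
    return skill_term + alpha * align_term
```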

IJCAI Conference 2020 Conference Paper

Independent Skill Transfer for Deep Reinforcement Learning

  • Qiangxing Tian
  • Guanchu Wang
  • Jinxin Liu
  • Donglin Wang
  • Yachen Kang

Recently, diverse primitive skills have been learned by adopting entropy as an intrinsic reward, and it has further been shown that new practical skills can be produced by combining a variety of primitive skills. This is essentially skill transfer, very useful for learning high-level skills but quite challenging due to the low efficiency of transferring primitive skills. In this paper, we propose a novel, efficient skill transfer method in which we learn independent skills and transfer only the independent components of skills instead of the whole set. More concretely, independent components of skills are obtained through independent component analysis (ICA), and there are always fewer of them (i.e., they have lower dimension) than their mixtures. With a lower dimension, independent skill transfer (IST) is more efficient at learning a given task. Extensive experiments, including three robotic tasks, demonstrate the effectiveness and high efficiency of our proposed IST method in comparison to direct primitive-skill transfer and conventional reinforcement learning.
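The ICA step itself is standard; a minimal sketch with scikit-learn's FastICA shows the dimensionality reduction from mixed skill activations to fewer independent components (the data and component counts below are placeholders).

```python
import numpy as np
from sklearn.decomposition import FastICA

# Rows are time steps (or rollouts), columns are latent skill activations.
skill_activations = np.random.randn(1000, 8)     # placeholder data
ica = FastICA(n_components=4, random_state=0)    # fewer independent components
independent_skills = ica.fit_transform(skill_activations)  # shape (1000, 4)
# Downstream, only these 4 components would be transferred and recombined,
# rather than the full 8-dimensional skill mixture.
```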

ECAI Conference 2020 Conference Paper

Knowledge Distillation for Model-Agnostic Meta-Learning

  • Min Zhang 0068
  • Donglin Wang
  • Sibo Gai

Recently, model-agnostic meta-learning (MAML) and its variants have drawn much attention in few-shot learning. In this paper, we investigate how to improve the performance of a portable MAML network so that it can be used on handheld devices, such as small robots, mobile phones, and laptops. We propose a novel approach named portable model-agnostic meta-learning (P-MAML), where valuable knowledge is distilled from a teacher MAML network to a portable student MAML network. Moreover, data augmentation and a ResNet architecture are employed in the teacher MAML network to avoid overfitting and enhance efficiency. To the best of our knowledge, this is the first work to consider a portable meta-learning model that learns a good initialization through knowledge distillation (KD). Extensive experimental results on three real datasets show that our P-MAML algorithm greatly enhances accuracy through KD from the teacher network. As shown, P-MAML with KD improves one-shot learning performance by as much as 10% compared to the same model without KD.
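The distillation component can be sketched with the standard Hinton-style KD loss, which a teacher-to-student MAML setup could plug into its inner or outer loop; the temperature and mixing weight below are illustrative, and the exact placement within MAML is not specified here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard knowledge-distillation objective: softened teacher
    targets (scaled by T^2) blended with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```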