Arrow Research search

Author name cluster

Mengdi Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

53 papers
1 author row

Possible papers

53

TMLR Journal 2026 Journal Article

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

  • Huan-ang Gao
  • Jiayi Geng
  • Wenyue Hua
  • Mengkang Hu
  • Xinzhe Juan
  • Hongzhang Liu
  • Shilong Liu
  • Jiahao Qiu

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift --- from scaling static models to developing self-evolving agents --- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions --- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, capable, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously and perform beyond human-level intelligence across a wide array of tasks.

NeurIPS Conference 2025 Conference Paper

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

  • Yinjie Wang
  • Ling Yang
  • Ye Tian
  • Ke Shen
  • Mengdi Wang

Mathematical reasoning in large language models has been successfully incentivized through reinforcement learning with verifiable rewards, leading to improved one-shot precision. In this work, we turn our focus to the coding domain. Beyond one-shot precision, we highlight unit test generation as another key factor for enhancing coding ability, since accurate unit tests are essential for enabling self-checking and self-correction during inference. Traditional approaches for fine-tuning LLMs on unit test generation rely heavily on ground-truth code solutions in the training data. We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes—without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Through extensive evaluations, we demonstrate that our CURE models, derived from base models of varying sizes, excel in both code generation and unit test generation. They naturally extend to downstream tasks such as test-time scaling—achieving a 6.2% improvement over the base model—and agentic unit test generation, with a 25.1% improvement. Our 4B model consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that the CURE model can serve as an effective reward model for reinforcement learning on base models, even in the absence of any labeled supervision.
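
A minimal sketch of the interaction-outcome reward idea (a toy reconstruction, not the paper's exact reward design): programs are pseudo-labeled by how many sampled tests they pass, and a test is rewarded for separating likely-correct from likely-incorrect programs, so no ground-truth code is needed.

```python
import numpy as np

def interaction_rewards(pass_matrix: np.ndarray):
    """Toy coder/tester rewards from execution outcomes alone.

    pass_matrix[i, j] = 1.0 if candidate program i passes unit test j.
    No ground-truth solutions: a program's pseudo-correctness is
    estimated from its pass rate across the sampled tests.
    """
    program_score = pass_matrix.mean(axis=1)            # coder reward proxy
    correct = program_score >= np.median(program_score)
    if correct.all() or (~correct).all():               # degenerate split
        return program_score, np.zeros(pass_matrix.shape[1])
    # A test earns reward for discriminating: passing "correct" programs
    # while failing "incorrect" ones.
    test_reward = (pass_matrix[correct].mean(axis=0)
                   - pass_matrix[~correct].mean(axis=0))
    return program_score, test_reward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    outcomes = (rng.random((8, 5)) > 0.4).astype(float)  # 8 programs x 5 tests
    print(interaction_rewards(outcomes))
```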

IJCAI Conference 2025 Conference Paper

Deep Reinforcement Learning for Efficient and Fair Allocation of Healthcare Resources

  • Yikuan Li
  • Chengsheng Mao
  • Kaixuan Huang
  • Hanyin Wang
  • Zheng Yu
  • Mengdi Wang
  • Yuan Luo

The scarcity of health care resources, such as ventilators, often leads to the unavoidable consequence of rationing, particularly during public health emergencies or in resource-constrained settings like pandemics. The absence of a universally accepted standard for resource allocation protocols results in governments relying on varying criteria and heuristic-based approaches, often yielding suboptimal and inequitable outcomes. This study addresses the societal challenge of fair and effective critical care resource allocation by leveraging deep reinforcement learning to optimize policy decisions. We propose a transformer-based deep Q-network that integrates individual patient disease progression and interaction effects among patients to enhance allocation decisions. Our method aims to improve both fairness and overall patient outcomes. Experiments using metrics such as normalized survival rates and interracial allocation rate differences demonstrate that our approach significantly reduces excess deaths and achieves more equitable resource allocation compared to severity- and comorbidity-based protocols currently in use. Our findings highlight the potential of deep reinforcement learning to address critical health care challenges.

NeurIPS Conference 2025 Conference Paper

DISC: Dynamic Decomposition Improves LLM Inference Scaling

  • Jonathan Li
  • Wei Cheng
  • Benjamin Riviere
  • Yue Wu
  • Masafumi Oyamada
  • Mengdi Wang
  • Yisong Yue
  • Santiago Paternain

Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually designed based on domain knowledge. We propose dynamic decomposition, a method that adaptively and automatically partitions solution and reasoning traces into manageable steps during inference. By more effectively allocating compute -- particularly through subdividing challenging steps and prioritizing their sampling -- dynamic decomposition significantly improves inference efficiency. Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate that dynamic decomposition outperforms static approaches, including token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5%, respectively. These findings highlight the potential of dynamic decomposition to improve a wide range of inference scaling techniques.

NeurIPS Conference 2025 Conference Paper

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

  • Soumya Suvra Ghosal
  • Souradip Chakraborty
  • Avinash Reddy
  • Yifu Lu
  • Mengdi Wang
  • Dinesh Manocha
  • Furong Huang
  • Mohammad Ghavamzadeh

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like “Wait” or “Let me rethink” can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance—creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and the evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.
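
The parallel-thinking procedure reduces to a Best-of-N-style majority vote; a self-contained sketch, with a mock sampler standing in for the reasoning model:

```python
import random
from collections import Counter

def parallel_thinking(generate, prompt: str, n: int = 8):
    """Sample n independent reasoning paths and return the most
    consistent final answer plus its vote share."""
    answers = [generate(prompt) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

if __name__ == "__main__":
    random.seed(0)
    # Stand-in for an LLM: a noisy oracle that is right 60% of the time.
    mock = lambda p: "42" if random.random() < 0.6 else random.choice(["41", "43"])
    print(parallel_thinking(mock, "What is 6 * 7?"))
```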

JBHI Journal 2025 Journal Article

GIAE-DTI: Predicting Drug-Target Interactions Based on Heterogeneous Network and GIN-Based Graph Autoencoder

  • Mengdi Wang
  • Xiujuan Lei
  • Lian Liu
  • Jianrui Chen
  • Fang-Xiang Wu

Accurate prediction of drug-target interactions (DTIs) is essential for advancing drug discovery and repurposing. However, the sparsity of DTI data limits the effectiveness of existing computational methods, which primarily focus on sparse DTI networks and perform poorly at aggregating information from neighboring nodes and representing isolated nodes within the network. In this study, we propose a novel deep learning framework, named GIAE-DTI, which considers cross-modal similarity of drugs and targets and constructs a heterogeneous network for DTI prediction. Firstly, the model calculates the cross-modal similarity of drugs and proteins from the relationships among drugs, proteins, diseases, and side effects, and performs similarity integration by taking the average. Then, a drug-target heterogeneous network is constructed, including drug-drug interactions, protein-protein interactions, and drug-target interactions processed by weighted K nearest known neighbors. In the heterogeneous network, a graph autoencoder based on a graph isomorphism network is employed for feature extraction, while a dual decoder is utilized to achieve better self-supervised learning, resulting in latent feature representations for drugs and targets. Finally, a deep neural network is employed to predict DTIs. The experimental results indicate that on the benchmark dataset, GIAE-DTI achieves AUC and AUPR scores of 0.9533 and 0.9619, respectively, in DTI prediction, outperforming the current state-of-the-art methods. Additionally, case studies on four 5-hydroxytryptamine receptor-related targets and five drugs related to mental diseases show the great potential of the proposed method in practical applications.

NeurIPS Conference 2025 Conference Paper

MMaDA: Multimodal Large Diffusion Language Models

  • Ling Yang
  • Ye Tian
  • Bowen Li
  • Xinchen Zhang
  • Ke Shen
  • Yunhai Tong
  • Mengdi Wang

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

NeurIPS Conference 2025 Conference Paper

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

  • Jiaru Zou
  • Ling Yang
  • Jingwen Gu
  • Jiahao Qiu
  • Ke Shen
  • Jingrui He
  • Mengdi Wang

Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory–response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release an efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Our code and models are released at https://github.com/Gen-Verse/ReasonFlux.

NeurIPS Conference 2025 Conference Paper

Securing the Language of Life: Inheritable Watermarks from DNA Language Models to Proteins

  • Zaixi Zhang
  • Ruofan Jin
  • Le Cong
  • Mengdi Wang

DNA language models have revolutionized our ability to design and manipulate DNA sequences—the fundamental language of life—with unprecedented precision, enabling transformative applications in therapeutics, synthetic biology, and gene editing. However, this capability also poses significant dual-use risks, including the potential creation of harmful biological agents. To address these biosecurity challenges, we introduce two innovative watermarking techniques: DNAMark and CentralMark. DNAMark employs synonymous codon substitutions to embed robust watermarks in DNA sequences while preserving the function of encoded proteins. CentralMark advances this by creating inheritable watermarks that transfer from DNA to translated proteins, leveraging protein embeddings to ensure detection across the central dogma. Both methods utilize state-of-the-art embeddings to generate watermark logits, enhancing resilience against natural mutations, synthesis errors, and adversarial attacks. Evaluated on a therapeutic DNA benchmark, DNAMark and CentralMark achieve F1 detection scores above 0.85 under diverse conditions, while maintaining over 60% sequence similarity to ground truth and degeneracy scores below 15%. A case study on a CRISPR-Cas9 system underscores CentralMark’s utility in real-world synthetic biology. This work establishes a vital framework for securing DNA language models, balancing innovation with accountability to mitigate biosecurity risks.
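
The synonymous-substitution principle can be illustrated with a toy keyed watermark (a sketch of the idea only: DNAMark derives watermark logits from learned embeddings, whereas this demo just picks one position-keyed "green" codon per four-fold-degenerate family, leaving the encoded protein unchanged):

```python
import hashlib

# Five four-fold-degenerate families of the standard genetic code.
SYNONYMS = {c: fam for fam in (
    ["GCT", "GCC", "GCA", "GCG"],   # Ala
    ["GGT", "GGC", "GGA", "GGG"],   # Gly
    ["CCT", "CCC", "CCA", "CCG"],   # Pro
    ["ACT", "ACC", "ACA", "ACG"],   # Thr
    ["GTT", "GTC", "GTA", "GTG"],   # Val
) for c in fam}

def _green(fam, key, pos):
    """Position- and key-dependent choice of the watermark codon."""
    h = int(hashlib.sha256(f"{key}:{pos}".encode()).hexdigest(), 16)
    return fam[h % len(fam)]

def embed(dna, key):
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    return "".join(_green(SYNONYMS[c], key, i) if c in SYNONYMS else c
                   for i, c in enumerate(codons))

def detect(dna, key):
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    eligible = [i for i, c in enumerate(codons) if c in SYNONYMS]
    hits = sum(codons[i] == _green(SYNONYMS[codons[i]], key, i) for i in eligible)
    return hits / max(1, len(eligible))   # ~1.0 watermarked, ~0.25 natural

if __name__ == "__main__":
    seq = "GCAGGTCCAACGGTT"               # Ala-Gly-Pro-Thr-Val
    wm = embed(seq, "secret-key")
    print(wm, detect(wm, "secret-key"), detect(seq, "secret-key"))
```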

NeurIPS Conference 2025 Conference Paper

Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models

  • Yingqing Guo
  • Yukang Yang
  • Hui Yuan
  • Mengdi Wang

Training-free guidance enables controlled generation in diffusion and flow models, but most methods rely on gradients and assume differentiable objectives. This work focuses on training-free guidance addressing challenges from non-differentiable objectives and discrete data distributions. We propose TreeG: Tree Search-Based Path Steering Guidance, applicable to both continuous and discrete settings in diffusion and flow models. TreeG offers a unified framework for training-free guidance by proposing, evaluating, and selecting candidates at each step, enhanced with tree search over active paths and parallel exploration. We comprehensively investigate the design space of TreeG over the candidate proposal module and the evaluation function, instantiating TreeG into three novel algorithms. Our experiments show that TreeG consistently outperforms top guidance baselines in symbolic music generation, small molecule design, and enhancer DNA design with improvements of 29.01%, 26.38%, and 18.43%. Additionally, we identify an inference-time scaling law showing TreeG's scalability in inference-time computation.
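
A toy rendering of the propose-evaluate-select loop over a beam of active paths (assumes a user-supplied one-step sampler `step` and a possibly non-differentiable objective `value`; not the paper's actual instantiations):

```python
import numpy as np

def treeg_sketch(step, value, inits, num_steps, branch=4, beam=8, rng=None):
    """Propose-evaluate-select guidance over active sampling paths."""
    rng = rng or np.random.default_rng(0)
    paths = list(inits)
    for t in range(num_steps):
        # Propose: each active path spawns `branch` candidate next states.
        candidates = [step(x, t, rng) for x in paths for _ in range(branch)]
        # Evaluate and select: keep the top-`beam` candidates overall.
        scores = np.array([value(x) for x in candidates])
        paths = [candidates[i] for i in np.argsort(scores)[-beam:]]
    return max(paths, key=value)

if __name__ == "__main__":
    step = lambda x, t, rng: x + rng.normal(scale=0.5)   # toy stochastic step
    value = lambda x: -abs(x - 3.0)                      # non-differentiable
    print(treeg_sketch(step, value, [0.0] * 8, num_steps=20))  # ~3.0
```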

IJCAI Conference 2025 Conference Paper

WenyanGPT: A Large Language Model for Classical Chinese Tasks

  • Xinyu Yao
  • Mengdi Wang
  • Bo Chen
  • Xiaobing Zhao

Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continual pre-training and instruction fine-tuning of the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.

NeurIPS Conference 2024 Conference Paper

A Theoretical Perspective for Speculative Decoding Algorithm

  • Ming Yin
  • Minshuo Chen
  • Kaixuan Huang
  • Mengdi Wang

Transformer-based autoregressive sampling has been a major bottleneck slowing down large language model inference. One effective way to accelerate inference is Speculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate. Despite its empirical effectiveness, the theoretical understanding of Speculative Decoding lags behind. This paper tackles this gap by conceptualizing the decoding problem via a Markov chain abstraction and studying the key properties, output quality and inference acceleration, from a theoretical perspective. Our analysis covers the theoretical limits of speculative decoding, batch algorithms, and output quality-inference acceleration tradeoffs. Our results reveal the fundamental connections between different components of LLMs via total variation distances and show how they jointly affect the efficiency of decoding algorithms.
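
For readers unfamiliar with the primitive being analyzed, here is the standard accept/reject step of speculative decoding on toy distributions (the well-known construction that keeps outputs exactly distributed as the target model):

```python
import numpy as np

def speculative_step(p, q, draft_token, rng):
    """Accept the drafted token with prob min(1, p/q); on rejection,
    resample from the residual max(p - q, 0). The returned token is
    exactly distributed according to the target distribution p."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.clip(p - q, 0.0, None)
    return rng.choice(len(p), p=residual / residual.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = np.array([0.6, 0.3, 0.1])   # target (large model) distribution
    q = np.array([0.3, 0.5, 0.2])   # draft (small model) distribution
    draws = [speculative_step(p, q, rng.choice(3, p=q), rng) for _ in range(20000)]
    print(np.bincount(draws) / len(draws))   # empirically ~= p
```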

TMLR Journal 2024 Journal Article

Adversarial Attacks on Online Learning to Rank with Stochastic Click Models

  • Zichen Wang
  • Rishab Balasubramanian
  • Hui Yuan
  • Chenyu Song
  • Mengdi Wang
  • Huazheng Wang

We propose the first study of adversarial attacks on online learning to rank. The goal of the attacker is to misguide the online learning to rank algorithm into placing the target item at the top of the ranking list a number of times that is linear in the time horizon $T$, while incurring only a sublinear attack cost. We propose generalized list poisoning attacks that perturb the ranking list presented to the user. This strategy can efficiently attack any no-regret ranker in general stochastic click models. Furthermore, we propose a click poisoning-based strategy named attack-then-quit that can efficiently attack two representative OLTR algorithms for stochastic click models. We theoretically analyze the success and cost upper bound of the two proposed methods. Experimental results based on synthetic and real-world data further validate the effectiveness and cost-efficiency of the proposed attack strategies.

NeurIPS Conference 2024 Conference Paper

Fast Best-of-N Decoding via Speculative Rejection

  • Hanshi Sun
  • Momin Haider
  • Ruiqi Zhang
  • Huitao Yang
  • Jiahao Qiu
  • Ming Yin
  • Mengdi Wang
  • Peter Bartlett

The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally unviable. In this work, we introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being 16 to 32 times more computationally efficient.
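
A rough sketch of the early-rejection idea (hypothetical `generate_chunk` and `reward` stubs; the paper's rejection schedule and batching differ): start many generations, and after each chunk keep only the highest-scoring prefixes so compute concentrates on promising responses.

```python
import numpy as np

def speculative_rejection(generate_chunk, reward, prompt, n=32, rounds=4, keep=0.5):
    """Best-of-N with early rejection: prune low-reward prefixes."""
    texts = [prompt] * n
    for _ in range(rounds):
        texts = [generate_chunk(t) for t in texts]        # extend each prefix
        scores = np.array([reward(t) for t in texts])     # score the prefixes
        order = np.argsort(scores)[::-1]
        texts = [texts[i] for i in order[:max(1, int(len(texts) * keep))]]
    return texts[0]

if __name__ == "__main__":
    import random
    random.seed(0)
    gen = lambda t: t + random.choice("ab")       # stand-in for an LLM chunk
    print(speculative_rejection(gen, lambda t: t.count("a"), "", n=8))
```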

NeurIPS Conference 2024 Conference Paper

FlexSBDD: Structure-Based Drug Design with Flexible Protein Modeling

  • Zaixi Zhang
  • Mengdi Wang
  • Qi Liu

Structure-based drug design (SBDD), which aims to generate 3D ligand molecules binding to target proteins, is a fundamental task in drug discovery. Existing SBDD methods typically treat the protein as rigid and neglect protein structural change when binding with ligand molecules, leading to a large gap from real-world scenarios and inferior generation quality (e.g., many steric clashes). To bridge the gap, we propose FlexSBDD, a deep generative model capable of accurately modeling the flexible protein-ligand complex structure for ligand molecule generation. FlexSBDD adopts an efficient flow matching framework and leverages E(3)-equivariant network with scalar-vector dual representation to model dynamic structural changes. Moreover, novel data augmentation schemes based on structure relaxation/sidechain repacking are adopted to boost performance. Extensive experiments demonstrate that FlexSBDD achieves state-of-the-art performance in generating high-affinity molecules and effectively modeling the protein's conformation change to increase favorable protein-ligand interactions (e.g., hydrogen bonds) and decrease steric clashes.

NeurIPS Conference 2024 Conference Paper

Global Convergence in Training Large-Scale Transformers

  • Cheng Gao
  • Yuan Cao
  • Zihao Li
  • Yihan He
  • Mengdi Wang
  • Han Liu
  • Jason M. Klusowski
  • Jianqing Fan

Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well-understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. Our analysis is based on a series of novel mean-field techniques that adapt to Transformers. Compared with existing tools for deep networks (Lu et al., 2020) that demand homogeneity and global Lipschitz smoothness, we utilize a refined analysis assuming only $\textit{partial homogeneity}$ and $\textit{local Lipschitz smoothness}$. These new techniques may be of independent interest.

NeurIPS Conference 2024 Conference Paper

Gradient Guidance for Diffusion Models: An Optimization Perspective

  • Yingqing Guo
  • Hui Yuan
  • Yukang Yang
  • Minshuo Chen
  • Mengdi Wang

Diffusion models have demonstrated empirical successes in various applications and can be adapted to task-specific needs via guidance. This paper studies a form of gradient guidance for adapting a pre-trained diffusion model towards optimizing user-specified objectives. We establish a mathematical framework for guided diffusion to systematically study its optimization theory and algorithmic design. Our theoretical analysis reveals a strong link between guided diffusion models and optimization: gradient-guided diffusion models are essentially sampling solutions to a regularized optimization problem, where the regularization is imposed by the pre-training data. As for guidance design, directly bringing in the gradient of an external objective function as guidance would jeopardize the structure in generated samples. We investigate a modified form of gradient guidance based on a forward prediction loss, which leverages the information in pre-trained score functions and provably preserves the latent structure. We further consider an iteratively fine-tuned version of gradient-guided diffusion where guidance and score network are both updated with newly generated samples. This process mimics a first-order optimization iteration in expectation, for which we prove a $\tilde{\mathcal{O}}(1/K)$ convergence rate to the global optimum when the objective function is concave. Our code is released at https://github.com/yukang123/GGDMOptim.git.
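
The regularized-optimization viewpoint is easy to see in a Langevin toy model (a sketch, not the paper's forward-prediction-loss guidance): with a standard-Gaussian "pre-trained" score and objective f(x) = x_1, the guided chain's stationary mean shifts to roughly (eta, 0) while staying anchored to the prior.

```python
import numpy as np

def guided_langevin(score, grad_f, eta=0.5, steps=2000, dt=0.01, rng=None):
    """Langevin sampling with drift = pre-trained score + eta * grad_f.
    Stationary law ~ p(x) * exp(eta * f(x)): objective-tilted, prior-anchored."""
    rng = rng or np.random.default_rng(0)
    x, tail = rng.normal(size=2), []
    for i in range(steps):
        x = (x + dt * (score(x) + eta * grad_f(x))
             + np.sqrt(2 * dt) * rng.normal(size=2))
        if i > steps // 2:
            tail.append(x.copy())
    return np.mean(tail, axis=0)      # time-average of the guided chain

if __name__ == "__main__":
    score = lambda x: -x                       # score of N(0, I) "pre-training"
    grad_f = lambda x: np.array([1.0, 0.0])    # gradient of f(x) = x[0]
    print(guided_langevin(score, grad_f))      # ~ (0.5, 0.0)
```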

NeurIPS Conference 2024 Conference Paper

Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks

  • Zixuan Zhang
  • Kaiqi Zhang
  • Minshuo Chen
  • Yuma Takeda
  • Mengdi Wang
  • Tuo Zhao
  • Yu-Xiang Wang

Convolutional residual neural networks (ConvResNets), though overparameterized, can achieve remarkable prediction performance in practice, which cannot be well explained by conventional wisdom. To bridge this gap, we study the performance of ConvResNeXts trained with weight decay, which cover ConvResNets as a special case, from the perspective of nonparametric classification. Our analysis allows for infinitely many building blocks in ConvResNeXts, and shows that weight decay implicitly enforces sparsity on these blocks. Specifically, we consider a smooth target function supported on a low-dimensional manifold, then prove that ConvResNeXts can adapt to the function smoothness and low-dimensional structures and efficiently learn the function without suffering from the curse of dimensionality. Our findings partially justify the advantage of overparameterized ConvResNeXts over conventional machine learning models.

NeurIPS Conference 2024 Conference Paper

Offline Multitask Representation Learning for Reinforcement Learning

  • Haque Ishfaq
  • Thanh Nguyen-Tang
  • Songtao Feng
  • Raman Arora
  • Mengdi Wang
  • Ming Yin
  • Doina Precup

We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.

JMLR Journal 2024 Journal Article

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

  • Amrit Singh Bedi
  • Anjaly Parayil
  • Junyu Zhang
  • Mengdi Wang
  • Alec Koppel

Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations given by heavier-tailed distributions with tail-index parameter $\alpha$, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index $\alpha$, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, and we corroborate that it also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.

NeurIPS Conference 2024 Conference Paper

One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

  • Zihao Li
  • Yuan Cao
  • Cheng Gao
  • Yihan He
  • Han Liu
  • Jason M. Klusowski
  • Jianqing Fan
  • Mengdi Wang

Transformers have achieved great success in recent years. Interestingly, transformers have shown a particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex, when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
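
The limiting behavior is easy to check numerically (an illustration of the claim, not the paper's trained construction): softmax attention with logits -beta * ||x - x_i||^2 approaches the one-nearest-neighbor prediction as beta grows.

```python
import numpy as np

def attention_predict(X, y, x_query, beta=50.0):
    """Softmax attention over negative squared distances; for large beta
    the weights concentrate on the nearest training point."""
    logits = -beta * np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 2)), rng.choice([-1.0, 1.0], size=50)
    x = rng.normal(size=2)
    nn_label = y[np.argmin(np.sum((X - x) ** 2, axis=1))]
    print(attention_predict(X, y, x), nn_label)   # nearly identical
```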

JBHI Journal 2024 Journal Article

Redefining the Game: MVAE-DFDPnet's Low-Dimensional Embeddings for Superior Drug-Protein Interaction Predictions

  • Liang-Yong Xia
  • Yu Wu
  • Longfei Zhao
  • Leying Chen
  • Shiyi Zhang
  • Mengdi Wang
  • Jie Luo

Precisely predicting drug-protein interactions (DPIs) is pivotal for drug discovery and advancing precision medicine. A significant challenge in this domain is the high-dimensional and heterogeneous data characterizing drug and protein attributes, along with their intricate interactions. In our study, we introduce a novel deep learning architecture: the Multi-view Variational Auto-Encoder followed by a cascade Deep Forest (MVAE-DFDPnet). This framework adeptly learns ultra-low-dimensional embeddings for drugs and proteins. Notably, our t-SNE analysis reveals that two-dimensional embeddings can clearly define clusters corresponding to diverse drug classes and protein families. These ultra-low-dimensional embeddings likely contribute to the enhanced robustness and generalizability of our MVAE-DFDPnet. Our model surpasses current leading methods on benchmark datasets, functioning in significantly reduced dimensional spaces. The model's resilience is further evidenced by its sustained accuracy in predicting interactions involving novel drugs, proteins, and drug classes. Additionally, we have corroborated several newly identified DPIs with experimental evidence from the scientific literature.

JMLR Journal 2024 Journal Article

Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds

  • Zhenghao Xu
  • Xiang Ji
  • Minshuo Chen
  • Mengdi Wang
  • Tuo Zhao

Policy gradient methods equipped with deep neural networks have achieved great success in solving high-dimensional reinforcement learning (RL) problems. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNN). Motivated by the empirical observation that many high-dimensional environments have state spaces possessing low-dimensional structures, such as those taking images as states, we consider the state space to be a $d$-dimensional manifold embedded in the $D$-dimensional Euclidean space with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the previous networks can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\tilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment. Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape from the curse of dimensionality, explaining the efficacy of deep policy gradient algorithms.

NeurIPS Conference 2024 Conference Paper

Transfer Q-star : Principled Decoding for LLM Alignment

  • Souradip Chakraborty
  • Soumya Suvra Ghosal
  • Ming Yin
  • Dinesh Manocha
  • Mengdi Wang
  • Amrit Singh Bedi
  • Furong Huang

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\text{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose $\texttt{Transfer Q}^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of $\texttt{Transfer Q}^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.

AAAI Conference 2024 Conference Paper

Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization

  • Jiahao Qiu
  • Hui Yuan
  • Jinghong Zhang
  • Wentao Chen
  • Huazheng Wang
  • Mengdi Wang

While modern biotechnologies allow synthesizing new proteins and measuring their function at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experiment results demonstrate that the algorithm is sample-efficient and diversity-promoting, and is able to find top designs using reasonably small mutation counts.

AAAI Conference 2024 Conference Paper

TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients

  • Mengdi Wang
  • Anna Bodonhelyi
  • Efe Bozkir
  • Enkelejda Kasneci

Federated learning is a distributed collaborative machine learning paradigm that has gained strong momentum in recent years. In federated learning, a central server periodically coordinates models with clients and aggregates the models trained locally by clients without necessitating access to local data. Despite its potential, the implementation of federated learning continues to encounter several challenges, predominantly the slow convergence that is largely due to data heterogeneity. The slow convergence becomes particularly problematic in cross-device federated learning scenarios where clients may be strongly limited by computing power and storage space, and hence counteracting methods that induce additional computation or memory cost on the client side, such as auxiliary objective terms and larger training iterations, can be impractical. In this paper, we propose a novel federated aggregation strategy, TurboSVM-FL, that poses no additional computation burden on the client side and can significantly accelerate convergence for federated classification tasks, especially when clients are "lazy" and train their models for only a few epochs before the next global aggregation. TurboSVM-FL extensively utilizes support vector machines to conduct selective aggregation and max-margin spread-out regularization on class embeddings. We evaluate TurboSVM-FL on multiple datasets including FEMNIST, CelebA, and Shakespeare using user-independent validation with non-iid data distribution. Our results show that TurboSVM-FL can significantly outperform existing popular algorithms on convergence rate and reduce communication rounds while delivering better test metrics including accuracy, F1 score, and MCC.

AAAI Conference 2024 Conference Paper

Visual Adversarial Examples Jailbreak Aligned Large Language Models

  • Xiangyu Qi
  • Kaixuan Huang
  • Ashwinee Panda
  • Peter Henderson
  • Mengdi Wang
  • Prateek Mittal

Warning: this paper contains data, prompts, and model outputs that are offensive in nature. Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

JMLR Journal 2023 Journal Article

Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning

  • Zihao Li
  • Boyi Liu
  • Zhuoran Yang
  • Zhaoran Wang
  • Mengdi Wang

We study the Constrained Convex Markov Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure, subject to a convex constraint. Designing algorithms for a constrained convex MDP faces several challenges, including (1) handling the large state space, (2) managing the exploration/exploitation tradeoff, and (3) solving the constrained optimization where the objective and the constraint are both nonlinear functions of the visitation measure. In this work, we present a model-based algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which Lagrangian and Fenchel duality are implemented to reformulate the original constrained problem into an unconstrained primal-dual optimization. The primal variables are updated by model-based value iteration following the principle of Optimism in the Face of Uncertainty (OFU), while the dual variables are updated by gradient ascent. Moreover, by embedding the visitation measure into a finite-dimensional space, we can handle large state spaces by incorporating function approximation. Two notable examples are (1) Kernelized Nonlinear Regulators and (2) Low-rank MDPs. We prove that with an optimistic planning oracle, our algorithm achieves sublinear regret and constraint violation in both cases and can attain the globally optimal policy of the original constrained problem.

NeurIPS Conference 2023 Conference Paper

Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations

  • Minshuo Chen
  • Yu Bai
  • H. Vincent Poor
  • Mengdi Wang

In real-world reinforcement learning (RL) systems, various forms of {\it impaired observability} can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.

JMLR Journal 2023 Journal Article

Learning Good State and Action Representations for Markov Decision Process via Tensor Decomposition

  • Chengzhuo Ni
  • Yaqi Duan
  • Munther Dahleh
  • Mengdi Wang
  • Anru R. Zhang

The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP's tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.

NeurIPS Conference 2023 Conference Paper

Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation

  • Nikki Lijing Kuang
  • Ming Yin
  • Mengdi Wang
  • Yu-Xiang Wang
  • Yian Ma

Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce \textit{Delayed-PSVI}, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 \mathbb{E}[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $\mathbb{E}[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for \textit{Delayed-LPSVI}, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.

NeurIPS Conference 2023 Conference Paper

Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement

  • Hui Yuan
  • Kaixuan Huang
  • Chengzhuo Ni
  • Minshuo Chen
  • Mengdi Wang

We explore the methodology and theory of reward-directed generation via conditional diffusion models. Directed generation aims to generate samples with desired properties as measured by a reward function, which has broad applications in generative AI, reinforcement learning, and computational biology. We consider the common learning scenario where the dataset consists mostly of unlabeled data and a small set of data with noisy reward labels. Our approach leverages a learned reward function on the smaller data set as a pseudo-labeler to label the unlabeled data. After pseudo-labeling, a conditional diffusion model (CDM) is trained on the data and samples are generated by setting a target value $a$ as the condition in CDM. From a theoretical standpoint, we show that this directed generator can effectively learn and sample from the reward-conditioned data distribution: (1) our model is capable of recovering the data's latent subspace representation; (2) the model generates samples moving closer to the user-specified target. The improvement in rewards of samples is influenced by an interplay between the strength of the reward signal, the distribution shift, and the cost of off-support extrapolation. We provide empirical results to validate our theory and highlight the relationship between the strength of extrapolation and the quality of generated samples.
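
The pseudo-labeling stage is simple to sketch (ridge regression as a stand-in reward model; the conditional diffusion training itself is omitted):

```python
import numpy as np
from sklearn.linear_model import Ridge

def pseudo_label(X_labeled, y_noisy, X_unlabeled):
    """Fit a reward model on the small noisy-labeled set and pseudo-label
    the large unlabeled set; (x, pseudo-reward) pairs then train the
    reward-conditioned generative model."""
    return Ridge(alpha=1.0).fit(X_labeled, y_noisy).predict(X_unlabeled)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_lab, X_unlab = rng.normal(size=(50, 4)), rng.normal(size=(5000, 4))
    y_noisy = X_lab @ np.ones(4) + 0.5 * rng.normal(size=50)
    print(pseudo_label(X_lab, y_noisy, X_unlab)[:3])
```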

NeurIPS Conference 2023 Conference Paper

Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective

  • Zeyu Zhang
  • Yi Su
  • Hui Yuan
  • Yiran Wu
  • Rishab Balasubramanian
  • Qingyun Wu
  • Huazheng Wang
  • Mengdi Wang

Off-policy Learning to Rank (LTR) aims to optimize a ranker from data collected by a deployed logging policy. However, existing off-policy learning to rank methods often make strong assumptions about how users generate the click data, i.e., the click model, and hence need to tailor their methods specifically under different click models. In this paper, we unify the ranking process under general stochastic click models as a Markov Decision Process (MDP), and show that the optimal ranking can be learned with offline reinforcement learning (RL) directly. Building upon this, we leverage offline RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified Off-policy Learning to Rank (CUOLR) method, which could be easily applied to a wide range of click models. Through a dedicated formulation of the MDP, we show that offline RL algorithms can adapt to various click models without complex debiasing techniques and prior knowledge of the model. Results on various large-scale datasets demonstrate that CUOLR consistently outperforms the state-of-the-art off-policy learning to rank algorithms while maintaining consistency and robustness under different click models.

NeurIPS Conference 2022 Conference Paper

Bandit Theory and Thompson Sampling-Guided Directed Evolution for Sequence Optimization

  • Hui Yuan
  • Chengzhuo Ni
  • Huazheng Wang
  • Xuezhou Zhang
  • Le Cong
  • Csaba Szepesvari
  • Mengdi Wang

Directed Evolution (DE), a landmark wet-lab method that originated in the 1960s, enables discovery of novel protein designs via evolving a population of candidate sequences. Recent advances in biotechnology have made it possible to collect high-throughput data, allowing the use of machine learning to map out a protein's sequence-to-function relation. There is growing interest in machine learning-assisted DE for accelerating protein optimization. Yet the theoretical understanding of DE, as well as the use of machine learning in DE, remains limited. In this paper, we connect DE with the bandit learning theory and make a first attempt to study regret minimization in DE. We propose a Thompson Sampling-guided Directed Evolution (TS-DE) framework for sequence optimization, where the sequence-to-function mapping is unknown and querying a single value is subject to costly and noisy measurements. TS-DE updates a posterior of the function based on collected measurements. It uses a posterior-sampled function estimate to guide the crossover recombination and mutation steps in DE. In the case of a linear model, we show that TS-DE enjoys a Bayesian regret of order $\tilde O(d^{2}\sqrt{MT})$, where $d$ is the feature dimension, $M$ is the population size, and $T$ is the number of rounds. This regret bound is nearly optimal, confirming that bandit learning can provably accelerate DE. It may have implications for more general sequence optimization and evolutionary algorithms.
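
A toy loop capturing the sample-then-mutate structure in the paper's linear-model case (conjugate Bayesian linear regression; the greedy site-wise mutation with random flips below is a simplification of the paper's crossover and mutation operators):

```python
import numpy as np

def ts_de(fitness, d=20, pop=32, rounds=15, noise=0.5, rng=None):
    """Thompson Sampling-guided directed evolution on binary sequences."""
    rng = rng or np.random.default_rng(0)
    A, b = np.eye(d), np.zeros(d)        # Gaussian posterior: precision, moment
    seqs = rng.integers(0, 2, size=(pop, d)).astype(float)
    for _ in range(rounds):
        # Sample a model from the posterior, then mutate toward it.
        theta = rng.multivariate_normal(np.linalg.solve(A, b),
                                        noise ** 2 * np.linalg.inv(A))
        seqs = (theta > 0).astype(float)[None, :].repeat(pop, axis=0)
        flips = rng.random(seqs.shape) < 0.1          # exploratory mutations
        seqs[flips] = 1 - seqs[flips]
        y = np.array([fitness(s) for s in seqs]) + noise * rng.normal(size=pop)
        A += seqs.T @ seqs                            # conjugate update
        b += seqs.T @ y
    return seqs[np.argmax([fitness(s) for s in seqs])]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=20)                           # hidden linear fitness
    best = ts_de(lambda s: float(s @ w), rng=rng)
    print(best @ w, w.clip(min=0).sum())              # achieved vs. optimum
```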

NeurIPS Conference 2022 Conference Paper

Communication Efficient Distributed Learning for Kernelized Contextual Bandits

  • Chuanhao Li
  • Huazheng Wang
  • Mengdi Wang
  • Hongning Wang

We tackle the communication efficiency challenge of learning kernelized contextual bandits in a distributed setting. Despite the recent advances in communication-efficient distributed bandit learning, existing solutions are restricted to simple models like multi-armed bandits and linear bandits, which hamper their practical utility. In this paper, instead of assuming the existence of a linear reward mapping from the features to the expected rewards, we consider non-linear reward mappings, by letting agents collaboratively search in a reproducing kernel Hilbert space (RKHS). This introduces significant challenges in communication efficiency as distributed kernel learning requires the transfer of raw data, leading to a communication cost that grows linearly w.r.t. the time horizon $T$. We address this issue by equipping all agents to communicate via a common Nystr\"{o}m embedding that gets updated adaptively as more data points are collected. We rigorously prove that our algorithm attains sub-linear rates in both regret and communication cost.
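
The communication trick rests on a shared Nyström embedding; a minimal sketch (plain RBF kernel and fixed random anchors, whereas the paper updates the anchor set adaptively):

```python
import numpy as np

def nystrom_features(X, anchors, gamma=1.0):
    """phi(x) = K(x, Z) K(Z, Z)^{-1/2}: finite features whose inner
    products approximate the kernel, so agents exchange feature vectors
    and the small anchor set Z instead of raw data."""
    rbf = lambda A, B: np.exp(-gamma * ((A[:, None] - B[None]) ** 2).sum(-1))
    Kzz = rbf(anchors, anchors) + 1e-8 * np.eye(len(anchors))
    w, V = np.linalg.eigh(Kzz)                   # PSD inverse square root
    return rbf(X, anchors) @ (V @ np.diag(w ** -0.5) @ V.T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Z = rng.normal(size=(100, 3)), rng.normal(size=(20, 3))
    Phi = nystrom_features(X, Z)
    print(Phi[0] @ Phi[1], np.exp(-((X[0] - X[1]) ** 2).sum()))  # ~ equal
```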

NeurIPS Conference 2022 Conference Paper

Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks

  • Shuoguang Yang
  • Xuezhou Zhang
  • Mengdi Wang

Bilevel optimization has gained growing interest, with numerous applications found in meta learning, minimax games, reinforcement learning, and nested composition optimization. This paper studies the problem of decentralized distributed bilevel optimization over a network where agents can only communicate with neighbors, and gives examples from multi-task learning, multi-agent learning, and federated learning. In this paper, we propose a gossip-based distributed bilevel learning algorithm that allows networked agents to solve both the inner and outer optimization problems in a single timescale and share information through network propagation. We show that our algorithm enjoys the $\mathcal{O}(\frac{1}{K \epsilon^2})$ per-agent sample complexity for general nonconvex bilevel optimization and $\mathcal{O}(\frac{1}{K \epsilon})$ for Polyak-Łojasiewicz objectives, achieving a speedup that scales linearly with the network size $K$. The sample complexities are optimal in both $\epsilon$ and $K$. We test our algorithm on the examples of hyperparameter tuning and decentralized reinforcement learning. Simulated experiments confirm that our algorithm achieves state-of-the-art training efficiency and test accuracy.
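
The information-sharing primitive here is plain gossip averaging; a minimal sketch on a ring network (the bilevel inner/outer updates themselves are omitted):

```python
import numpy as np

def gossip_round(X, W):
    """One mixing round: agent i replaces its iterate with the weighted
    average of its neighbors', x_i <- sum_j W[i, j] * x_j, where W is
    doubly stochastic and supported on the network's edges."""
    return W @ X

if __name__ == "__main__":
    n = 5                                    # ring of 5 agents
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):          # self plus two neighbors
            W[i, j % n] = 1 / 3
    X = np.arange(n, dtype=float)[:, None]   # each agent holds a scalar
    for _ in range(60):
        X = gossip_round(X, W)
    print(X.ravel())                         # all ~2.0, the network average
```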

AAAI Conference 2022 Conference Paper

Multi-Agent Reinforcement Learning with General Utilities via Decentralized Shadow Reward Actor-Critic

  • Junyu Zhang
  • Amrit Singh Bedi
  • Mengdi Wang
  • Alec Koppel

We posit a new mechanism for cooperation in multi-agent reinforcement learning (MARL) based upon any nonlinear function of the team’s long-term state-action occupancy measure, i.e., a general utility. This subsumes the cumulative return but also allows one to incorporate risk-sensitivity, exploration, and priors. We derive the Decentralized Shadow Reward Actor-Critic (DSAC) in which agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). DSAC augments the classic critic step by requiring agents to (i) estimate their local occupancy measure in order to (ii) estimate the derivative of the local utility with respect to their occupancy measure, i.e., the “shadow reward”. DSAC converges to a stationary point at a sublinear rate with high probability, depending on the amount of communications. Under proper conditions, we further establish the non-existence of spurious stationary points for this problem, that is, DSAC finds the globally optimal policy. Experiments demonstrate the merits of goals beyond the cumulative return in cooperative MARL.

IJCAI Conference 2022 Conference Paper

Parameter-Efficient Sparsity for Large Language Models Fine-Tuning

  • Yuchao Li
  • Fuli Luo
  • Chuanqi Tan
  • Mengdi Wang
  • Songfang Huang
  • Shen Li
  • Junjie Bai

With the dramatically increased number of parameters in language models, sparsity methods have received ever-increasing research focus to compress and accelerate the models. While most research focuses on how to accurately retain appropriate weights while maintaining the performance of the compressed model, there are challenges in the computational overhead and memory footprint of sparse training when compressing large-scale language models. To address this problem, we propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparsity-aware training on downstream tasks. Specifically, we first combine data-free and data-driven criteria to efficiently and accurately measure the importance of weights. Then we investigate the intrinsic redundancy of data-driven weight importance and derive two salient characteristics, i.e., low-rankness and structuredness. Based on that, two groups of small matrices are introduced to compute the data-driven importance of weights, instead of using the original large importance score matrix, which makes the sparse training resource-efficient and parameter-efficient. Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) on dozens of datasets demonstrate that PST performs on par with or better than previous sparsity methods, despite only training a small number of parameters. For instance, compared with previous sparsity methods, our PST only requires 1.5% trainable parameters to achieve comparable performance on BERT.
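
A hedged sketch of the scoring idea, with our own simplified parameterization rather than the paper's exact formula: the data-driven importance is factored into a small low-rank product plus structured row/column scores, so only those small pieces need to be trained.

```python
import numpy as np

def pst_style_importance(W, A, B, row_score, col_score):
    """PST-style importance (our simplification): a data-free term
    (weight magnitude) plus a data-driven term factored as a low-rank
    product A @ B and structured row/column scores, replacing a full
    d_out x d_in trainable score matrix with small matrices."""
    data_free = np.abs(W)
    data_driven = A @ B + row_score[:, None] + col_score[None, :]
    return data_free + data_driven

def top_k_mask(importance, keep_ratio=0.5):
    """Keep the top fraction of weights by importance; zero the rest."""
    k = int(importance.size * keep_ratio)
    thresh = np.partition(importance.ravel(), -k)[-k]
    return (importance >= thresh).astype(importance.dtype)

d_out, d_in, r = 8, 16, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(d_out, d_in))
A, B = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
mask = top_k_mask(pst_style_importance(W, A, B, np.zeros(d_out), np.zeros(d_in)))
print(mask.mean())  # roughly half the weights kept
```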

NeurIPS Conference 2021 Conference Paper

On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

  • Junyu Zhang
  • Chengzhuo Ni
  • Zheng Yu
  • Csaba Szepesvari
  • Mengdi Wang

Policy gradient (PG) gives rise to a rich class of reinforcement learning (RL) methods. Recently, there has been an emerging trend to augment existing PG methods such as REINFORCE with \emph{variance reduction} techniques. However, all existing variance-reduced PG methods heavily rely on an uncheckable importance weight assumption made for every single iteration of the algorithms. In this paper, a simple gradient truncation mechanism is proposed to address this issue. Moreover, we design a Truncated Stochastic Incremental Variance-Reduced Policy Gradient (TSIVR-PG) method, which is able to maximize not only a cumulative sum of rewards but also a general utility function over a policy's long-term visiting distribution. We show an $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for TSIVR-PG to find an $\epsilon$-stationary policy. By assuming the \emph{overparameterization} of the policy and exploiting the \emph{hidden convexity} of the problem, we further show that TSIVR-PG converges to a global $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
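
The truncation mechanism can be illustrated in a few lines, under our own simplification of capping trajectory importance weights; this is a stand-in for the paper's gradient-truncation rule, not its exact form.

```python
import numpy as np

def truncated_importance_weight(logp_new, logp_old, cap=10.0):
    """Clip the trajectory importance weight at a fixed cap so the
    variance-reduced correction term stays bounded, removing the need
    for an uncheckable per-iteration importance-weight assumption."""
    w = np.exp(logp_new - logp_old)
    return np.minimum(w, cap)
```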

NeurIPS Conference 2020 Conference Paper

Generalized Leverage Score Sampling for Neural Networks

  • Jason D. Lee
  • Ruoqi Shen
  • Zhao Song
  • Mengdi Wang
  • Zheng Yu

Leverage score sampling is a powerful technique that originates from theoretical computer science and can be used to speed up a large number of fundamental problems, e.g., linear regression, linear programming, semi-definite programming, the cutting plane method, graph sparsification, maximum matching, and max-flow. Recently, it has been shown that leverage score sampling helps to accelerate kernel methods [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17]. In this work, we generalize the results in [Avron, Kapralov, Musco, Musco, Velingker and Zandieh 17] to a broader class of kernels. We further bring leverage score sampling into the field of deep learning theory. 1. We show the connection between the initialization for neural network training and approximating the neural tangent kernel with random features. 2. We prove the equivalence between regularized neural networks and neural tangent kernel ridge regression under the initialization of both classical random Gaussian and leverage score sampling.
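
For readers unfamiliar with the primitive, here is a minimal computation of ridge leverage scores and proportional row sampling in our own notation; the paper's kernel generalization is not shown.

```python
import numpy as np

def ridge_leverage_scores(X, lam=1e-3):
    """Ridge leverage score of each row x_i: x_i^T (X^T X + lam I)^{-1} x_i.
    Sampling rows with probability proportional to these scores preserves
    the regression/kernel problem with far fewer points."""
    G = X.T @ X + lam * np.eye(X.shape[1])
    return np.einsum('ij,jk,ik->i', X, np.linalg.inv(G), X)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
scores = ridge_leverage_scores(X)
probs = scores / scores.sum()
sample = rng.choice(len(X), size=50, replace=True, p=probs)
```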

NeurIPS Conference 2020 Conference Paper

High-Dimensional Sparse Linear Bandits

  • Botao Hao
  • Tor Lattimore
  • Mengdi Wang

Stochastic linear bandits with high-dimensional sparse features are a practical model for a variety of domains, such as personalized medicine and online advertising. We derive a novel $O(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits in the data-poor regime, where the horizon is smaller than the ambient dimension and where the feature vectors admit a well-conditioned exploration distribution. This is complemented by a nearly matching upper bound for an explore-then-commit algorithm, showing that $O(n^{2/3})$ is the optimal rate in the data-poor regime. The results complement existing bounds for the data-rich regime and also provide another example where carefully balancing the trade-off between information and regret is necessary. Finally, we prove a dimension-free $O(\sqrt{n})$ regret upper bound under an additional assumption on the magnitude of the signal for relevant features.
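
A rough explore-then-commit sketch under our own simplifications, with a soft-thresholded least-squares estimate as a crude stand-in for a proper sparse (Lasso-type) solver.

```python
import numpy as np

def explore_then_commit(arms, pull, horizon, n_explore, lam=0.1):
    """Sample arms uniformly at random for n_explore rounds, fit a sparse
    estimate of theta (soft-thresholded least squares, a crude Lasso
    stand-in), then commit to the greedy arm for the remaining rounds."""
    rng = np.random.default_rng(0)
    X, y = [], []
    for _ in range(n_explore):
        a = arms[rng.integers(len(arms))]
        X.append(a)
        y.append(pull(a))
    X, y = np.array(X), np.array(y)
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    theta = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)  # sparsify
    best = arms[np.argmax(arms @ theta)]
    return [pull(best) for _ in range(horizon - n_explore)]
```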

NeurIPS Conference 2020 Conference Paper

Provably Efficient Reinforcement Learning with Kernel and Neural Function Approximations

  • Zhuoran Yang
  • Chi Jin
  • Zhaoran Wang
  • Mengdi Wang
  • Michael Jordan

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with function approximation poses a fundamental challenge to developing algorithms with provable computational and statistical efficiency, due to the need to take into consideration both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate in statistical estimation. To address such a challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or an over-parametrized neural network, we propose the first provable RL algorithm with both polynomial runtime and polynomial sample complexity, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore remain valid even when the number of states diverges, which exhibits the benefit of function approximation.
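
Schematically, the optimistic modification amounts to a kernel ridge estimate plus an uncertainty bonus. The sketch below shows that shape for a single query point, assuming a normalized kernel with $k(x,x)=1$; it is our illustration, not the paper's algorithm.

```python
import numpy as np

def optimistic_q_estimate(K_xx, k_x, y, beta=1.0, lam=1.0):
    """Kernel ridge regression estimate of the action-value at a query
    point, plus an optimism bonus proportional to the remaining
    uncertainty: a schematic single-point version of optimistic
    least-squares value iteration with kernel function approximation."""
    n = len(y)
    alpha = np.linalg.solve(K_xx + lam * np.eye(n), y)
    mean = k_x @ alpha
    # k(x,x) assumed 1 for a normalized kernel; uncertainty shrinks with data.
    var = 1.0 - k_x @ np.linalg.solve(K_xx + lam * np.eye(n), k_x)
    return mean + beta * np.sqrt(max(var, 0.0))
```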

NeurIPS Conference 2020 Conference Paper

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

  • Junyu Zhang
  • Alec Koppel
  • Amrit Singh Bedi
  • Csaba Szepesvari
  • Mengdi Wang

In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation; since dynamic programming no longer applies, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. Further, we prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, and we also establish its rate of convergence, which matches or improves the convergence rate available in the case of RL with cumulative rewards.
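
A schematic restatement of the saddle-point form in our own notation, assuming $F$ is concave, writing $F^*$ for its (concave) Fenchel conjugate and $\lambda^{\pi_\theta}$ for the occupancy measure; at the dual optimizer $z^*$, the gradient reduces to an ordinary policy gradient with per-state-action "reward" $z^*$:

```latex
\[
  F(\lambda^{\pi_\theta})
    \;=\; \min_{z}\ \big\langle \lambda^{\pi_\theta}, z \big\rangle - F^*(z),
  \qquad
  \nabla_\theta F(\lambda^{\pi_\theta})
    \;=\; \nabla_\theta \big\langle \lambda^{\pi_\theta}, z^* \big\rangle,
  \quad
  z^* \in \arg\min_z\ \big\langle \lambda^{\pi_\theta}, z \big\rangle - F^*(z).
\]
```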

JMLR Journal 2019 Journal Article

Approximation Hardness for A Class of Sparse Optimization Problems

  • Yichen Chen
  • Yinyu Ye
  • Mengdi Wang

In this paper, we consider three typical optimization problems with a convex loss function and a nonconvex sparse penalty or constraint. For the sparse penalized problem, we prove that finding an $\mathcal{O}(n^{c_1}d^{c_2})$-optimal solution to an $n\times d$ problem is strongly NP-hard for any $c_1, c_2\in [0,1)$ such that $c_1+c_2<1$.

NeurIPS Conference 2019 Conference Paper

Learning low-dimensional state embeddings and metastable clusters from time series data

  • Yifan Sun
  • Yaqi Duan
  • Hao Gong
  • Mengdi Wang

This paper studies how to find compact state embeddings from high-dimensional Markov state trajectories, where the transition kernel has a small intrinsic rank. In the spirit of diffusion maps, we propose an efficient method for learning a low-dimensional state embedding and capturing the process's dynamics. This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. The state embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rates are proved. An experiment on a simulated dynamical system shows that the state clustering method indeed reveals metastable structures. We also experiment with time series generated by layers of a Deep Q-Network when playing an Atari game. The embedding method identifies game states as similar if they share similar future events, even though their raw data are far different.
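
A minimal illustration of the embedding step in our own notation: embed states by the leading scaled singular vectors of an estimated transition matrix, after which metastable sets fall out of simple clustering.

```python
import numpy as np

def state_embedding(P_hat, dim):
    """Embed states via the top singular vectors of an estimated
    transition matrix; states with similar future behavior land close
    together, and k-means on the embedding recovers metastable sets."""
    U, s, Vt = np.linalg.svd(P_hat)
    return U[:, :dim] * s[:dim]   # scaled left singular vectors

# Toy chain with two metastable blocks.
P = np.array([[0.45, 0.45, 0.05, 0.05],
              [0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45],
              [0.05, 0.05, 0.45, 0.45]])
emb = state_embedding(P, dim=2)
print(np.round(emb, 2))  # rows 0-1 and rows 2-3 cluster together
```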

JMLR Journal 2019 Journal Article

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python

  • Jason Ge
  • Xingguo Li
  • Haoming Jiang
  • Han Liu
  • Tong Zhang
  • Mengdi Wang
  • Tuo Zhao

We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex $\ell_1$, nonconvex MCP and SCAD regularizers. The library is coded in \texttt{C++} and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.
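
To show the pattern the library unifies (this is a generic NumPy sketch, not picasso's actual API): pathwise coordinate descent for the plain lasso, with warm starts along a decreasing regularization path.

```python
import numpy as np

def lasso_path_cd(X, y, lambdas, n_iter=100):
    """Pathwise coordinate descent for the lasso
    (1/2n)||y - X beta||^2 + lam*||beta||_1: sweep a decreasing lambda
    path, warm-starting each solve from the previous solution. picasso
    implements this pattern, plus active-set tricks, across many losses
    and regularizers."""
    n, d = X.shape
    col_sq = (X ** 2).sum(axis=0)
    beta = np.zeros(d)          # persists across lambdas: warm start
    path = []
    for lam in sorted(lambdas, reverse=True):
        for _ in range(n_iter):
            for j in range(d):
                r = y - X @ beta + X[:, j] * beta[j]   # partial residual
                rho = X[:, j] @ r
                # Soft-thresholding solves the 1-D subproblem exactly.
                beta[j] = np.sign(rho) * max(abs(rho) - lam * n, 0) / col_sq[j]
        path.append(beta.copy())
    return path
```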

NeurIPS Conference 2019 Conference Paper

State Aggregation Learning from Markov Transition Data

  • Yaqi Duan
  • Tracy Ke
  • Mengdi Wang

State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system’s states into a small number of meta-states. The choice of aggregation map often depends on the data analysts’ knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system’s trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: the first step is spectral decomposition of the empirical transition matrix, and the second step conducts a linear transformation of the singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit form, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with a clear interpretation.
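
A crude sketch of the two-step recipe under our own simplifications; real anchor identification uses a convex-hull search and the final solve should be sign-constrained, both replaced by cheap proxies here.

```python
import numpy as np

def soft_aggregation(P_hat, r):
    """Two-step sketch: (1) rank-r spectral decomposition of the
    empirical transition matrix, (2) pick anchor states as the most
    extreme rows in the singular-vector space (a crude convex-hull
    proxy), then express every state as a mixture over the anchors."""
    U, s, Vt = np.linalg.svd(P_hat)
    emb = U[:, :r] * s[:r]
    # Crude anchor hunting: the r rows with the largest embedding norms.
    anchors = np.argsort(-np.linalg.norm(emb, axis=1))[:r]
    # Aggregation weights: a nonnegative least-squares solve would be
    # the proper choice; plain lstsq keeps the sketch short.
    coeffs, *_ = np.linalg.lstsq(emb[anchors].T, emb.T, rcond=None)
    return anchors, coeffs.T   # each row: weights over the r anchors
```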

NeurIPS Conference 2018 Conference Paper

Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization

  • Minshuo Chen
  • Lin Yang
  • Mengdi Wang
  • Tuo Zhao

Stochastic optimization naturally arises in machine learning. Efficient algorithms with provable guarantees, however, are still largely missing when the objective function is nonconvex and the data points are dependent. This paper studies this fundamental challenge through a streaming PCA problem for stationary time series data. Specifically, our goal is to estimate the principal component of time series data with respect to the covariance matrix of the stationary distribution. Computationally, we propose a variant of Oja's algorithm combined with downsampling to control the bias of the stochastic gradient caused by the data dependency. Theoretically, we quantify the uncertainty of our proposed stochastic algorithm based on diffusion approximations. This allows us to prove the asymptotic rate of convergence and further implies near-optimal asymptotic sample complexity. Numerical experiments are provided to support our analysis.
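
A minimal version of the downsampled Oja update in our notation: skipping samples between updates weakens temporal dependence, which is the bias-control idea in the abstract.

```python
import numpy as np

def oja_downsampled(stream, dim, gap=10, lr=0.01, seed=0):
    """Oja's update on a downsampled stream: keep only every `gap`-th
    sample so consecutive inputs are approximately independent, which
    controls the bias from temporal dependence in the time series."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream):
        if t % gap:               # downsampling: skip dependent samples
            continue
        w += lr * x * (x @ w)     # Oja's stochastic gradient step
        w /= np.linalg.norm(w)    # project back to the unit sphere
    return w
```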

NeurIPS Conference 2018 Conference Paper

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

  • Aaron Sidford
  • Mengdi Wang
  • Xian Wu
  • Lin Yang
  • Yinyu Ye

In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that, given any state-action pair, samples from the transition function in $O(1)$ time. Given such a DMDP with states $\mathcal{S}$, actions $\mathcal{A}$, discount factor $\gamma\in(0, 1)$, and rewards in range $[0, 1]$, we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where {\it both} the run time spent and the number of samples taken are upper bounded by \[ O\left[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\delta \epsilon} \right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right]. \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model and provide a nearly matching sample complexity lower bound.
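
For orientation, here is plain sampled value iteration with a generative model, in our own notation; the paper's algorithm layers variance reduction on top of this template to reach the improved bound.

```python
import numpy as np

def sampled_value_iteration(sample_next, n_states, n_actions, gamma,
                            rewards, m, n_iters, seed=0):
    """Sampled value iteration with a generative model: each Bellman
    backup replaces the exact expectation over next states with an
    average over m generative samples. sample_next(s, a, rng) returns
    a sampled next state; rewards is an |S| x |A| array."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        for s in range(n_states):
            for a in range(n_actions):
                nxt = [sample_next(s, a, rng) for _ in range(m)]
                q[s, a] = rewards[s, a] + gamma * np.mean(v[nxt])
        v = q.max(axis=1)
    return v, q.argmax(axis=1)
```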

JMLR Journal 2017 Journal Article

Accelerating Stochastic Composition Optimization

  • Mengdi Wang
  • Ji Liu
  • Ethan X. Fang

We consider the stochastic nested composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method. This algorithm updates the solution based on noisy gradient queries using a two-timescale iteration. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalties. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.
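
The two-timescale structure can be sketched as follows, with our own names and without the paper's exact recursion: a fast running estimate of the inner expectation, and a slower proximal gradient step through the chain rule.

```python
import numpy as np

def compositional_prox_step(x, y, sample_g, sample_jac_g, sample_grad_f,
                            prox, alpha=0.01, beta=0.1):
    """One step in the spirit of a two-timescale compositional proximal
    gradient method for min_x f(g(x)) + penalty (illustrative only):
    track a running estimate y of the inner expectation g(x) on the
    faster timescale beta, then take a proximal gradient step using
    the chain-rule gradient jac_g(x)^T grad_f(y)."""
    y = (1 - beta) * y + beta * sample_g(x)         # fast inner tracking
    grad = sample_jac_g(x).T @ sample_grad_f(y)     # composed gradient
    return prox(x - alpha * grad, alpha), y         # slow proximal step

# For an l1 penalty lam*||x||_1, the prox is soft-thresholding:
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
prox = lambda z, step: soft(z, step * 0.1)
```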

NeurIPS Conference 2017 Conference Paper

Diffusion Approximations for Online Principal Component Estimation and Global Convergence

  • Chris Junchi Li
  • Mengdi Wang
  • Han Liu
  • Tong Zhang

In this paper, we propose to adopt diffusion approximation tools to study the dynamics of Oja's iteration, an online stochastic gradient method for principal component analysis. Oja's iteration maintains a running estimate of the true principal component from streaming data and enjoys low time and space complexity. We show that Oja's iteration for the top eigenvector generates a continuous-state discrete-time Markov chain over the unit sphere. We characterize Oja's iteration in three phases using diffusion approximation and weak convergence tools. Our three-phase analysis further provides a finite-sample error bound for the running estimate, which matches the minimax information lower bound for PCA under the additional assumption of bounded samples.

RLDM Conference 2017 Conference Abstract

Stochastic Primal-Dual Methods and Sample Complexity of Markov Decision Processes

  • Yichen Chen
  • Mengdi Wang

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as each new state transition is observed. These methods use $O(|\mathcal{S}||\mathcal{A}|)$ space and have low computational complexity per iteration. We first consider a basic version of SPD that uses Euclidean projection for both the primal and dual updates. We show that it finds an $\epsilon$-optimal policy regardless of the initial state, with high probability, using $O\left(\frac{|\mathcal{S}|^4|\mathcal{A}|^2\sigma^2}{(1-\gamma)^6\epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $O\left(\frac{|\mathcal{S}|^4|\mathcal{A}|^2 H^6\sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP. We then propose an accelerated version of SPD that uses relative entropy projection in the dual update. We show that the improved SPD method achieves the sample/running-time complexity $O\left(\frac{|\mathcal{S}|^3|\mathcal{A}|\log(|\mathcal{S}||\mathcal{A}|)\sigma^2}{(1-\gamma)^4\epsilon^2}\right)$ for general discounted-reward MDPs. For MDPs that are “sufficiently” ergodic, the improved SPD has sample/running-time complexity $O\left(\frac{|\mathcal{S}||\mathcal{A}|\log(|\mathcal{S}||\mathcal{A}|)\sigma^2}{(1-\gamma)^2\epsilon^2}\right)$.
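
A hedged sketch of one coordinate-style primal-dual update in our own notation, loosely following the minimax form of the Bellman equation; the paper's projections and step rules differ.

```python
import numpy as np

def spd_step(v, mu, s, a, s_next, reward, gamma, alpha=0.01, beta=0.01):
    """One stochastic primal-dual coordinate update for a discounted MDP
    (illustrative): the dual variable mu[s, a] ascends along the sampled
    Bellman residual, while the value estimate v descends to shrink it.
    Only a few coordinates are touched per observed transition."""
    td = reward + gamma * v[s_next] - v[s]      # sampled Bellman residual
    mu[s, a] = max(mu[s, a] + beta * td, 0.0)   # dual ascent, kept nonnegative
    v[s] += alpha * mu[s, a]                    # primal descent on v(s)...
    v[s_next] -= alpha * gamma * mu[s, a]       # ...and on v(s')
    return v, mu
```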

NeurIPS Conference 2016 Conference Paper

Accelerating Stochastic Composition Optimization

  • Mengdi Wang
  • Ji Liu
  • Ethan Fang

Consider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.