Arrow Research search

Author name cluster

Wen Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

43 papers
2 author rows

Possible papers

43

EAAI Journal 2026 Journal Article

Opposition-based learning memetic algorithm for the maximum intersection of k-subsets problem

  • Wen Sun
  • Xu Li
  • Jin-Kao Hao
  • Wenlong Li
  • Zhipeng Lü

Given m elements and n subsets of elements, the maximum intersection of k-subsets (kMIS) problem is to select k subsets of elements to maximize the number of elements simultaneously covered by all of the selected subsets. As a general model, kMIS can be used to formulate some practical problems including data privacy control, community detection, and deoxyribonucleic acid microarray technology. This paper presents an opposition-based learning memetic algorithm that integrates opposition-based learning initialization, adaptive crossover, and solution-based tabu search. Experimental results on 608 instances show that the algorithm competes favorably with the state-of-the-art methods. The importance of the algorithmic components is experimentally validated.

NeurIPS Conference 2025 Conference Paper

$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training

  • Jin Zhou
  • Kaiwen Wang
  • Jonathan Chang
  • Zhaolin Gao
  • Nathan Kallus
  • Kilian Weinberger
  • Kianté Brantley
  • Wen Sun

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at \url{https://github.com/jinpz/q_sharp}.
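
The abstract describes guiding the reference policy with the optimal regularized $Q$ function. As a hedged illustration only (a minimal sketch with hypothetical names, not the authors' released code), the closed-form optimal policy for KL-regularized RL tilts the reference distribution by $\exp(Q/\eta)$:

```python
import numpy as np

def q_sharp_reweight(ref_logprobs: np.ndarray, q_values: np.ndarray, eta: float) -> np.ndarray:
    """Tilt a reference policy by exp(Q / eta): the optimal policy for
    KL-regularized RL satisfies pi*(a) proportional to pi_ref(a) * exp(Q(a) / eta)."""
    logits = ref_logprobs + q_values / eta
    logits -= logits.max()                  # numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy usage: three candidate next tokens under a reference policy,
# with hypothetical learned Q estimates.
ref_logprobs = np.log(np.array([0.5, 0.3, 0.2]))
q_values = np.array([0.1, 1.0, -0.5])
print(q_sharp_reweight(ref_logprobs, q_values, eta=0.5))
```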

NeurIPS Conference 2025 Conference Paper

Accelerating RL for LLM Reasoning with Optimal Advantage Regression

  • Kianté Brantley
  • Mingyu Chen
  • Zhaolin Gao
  • Jason Lee
  • Wen Sun
  • Wenhao Zhan
  • Xuezhou Zhang

Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose $A^\star$-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function $V^\star$, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, $A^\star$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2$\times$ and peak memory usage by over 30\% compared to PPO, GRPO, and REBEL. Implementation of $A^\star$-PO can be found at https://github.com/ZhaolinGao/A-PO.
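
The two-stage recipe above lends itself to a compact sketch. The following is an assumption-laden illustration (hypothetical names; the value estimator and regression target are one natural choice under a KL-regularized objective, not necessarily the paper's exact ones):

```python
import numpy as np

def estimate_v_star(rewards_ref: np.ndarray, eta: float) -> float:
    """Stage 1 (sketch): estimate the KL-regularized optimal value of a prompt
    from offline reference-policy samples, V*(x) = eta * log E[exp(r / eta)]."""
    return eta * np.log(np.mean(np.exp(rewards_ref / eta)))

def a_po_loss(log_pi: float, log_pi_ref: float, reward: float, v_star: float, eta: float) -> float:
    """Stage 2 (sketch): least-squares regression of the scaled log-ratio onto
    the estimated optimal advantage, using one generation per prompt."""
    target = reward - v_star               # approximate optimal advantage
    pred = eta * (log_pi - log_pi_ref)
    return (pred - target) ** 2

# Hypothetical numbers for a single prompt/response pair.
v_hat = estimate_v_star(np.array([0.2, 0.9, 0.4]), eta=1.0)
print(a_po_loss(log_pi=-1.1, log_pi_ref=-1.3, reward=0.8, v_star=v_hat, eta=1.0))
```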

NeurIPS Conference 2025 Conference Paper

Avoiding exp(R) scaling in RLHF through Preference-based Exploration

  • Mingyu Chen
  • Yiding Chen
  • Wen Sun
  • Xuezhou Zhang

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving its sample efficiency. All existing algorithms for online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the range of the reward function. This statistical inefficiency hinders their effectiveness in scenarios with heavily skewed preferences, e.g., questions with objectively correct answers. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward range, answering an open problem raised by Xie et al. [2024]. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design.

YNICL Journal 2025 Journal Article

Mechanisms underlying the spontaneous reorganization of depression network after stroke

  • Yirong Fang
  • Xian Chao
  • Zeyu Lu
  • Hongmei Huang
  • Ran Shi
  • Dawei Yin
  • Hao Chen
  • Yanan Lu

Exploring the causal relationship between focal brain lesions and post-stroke depression (PSD) can provide therapeutic insights. However, a gap exists between causal and therapeutic information. Exploring brain repair processes post-stroke could bridge this gap. We defined a depression network using the normative connectome and investigated the predictive capacity of lesion-induced network damage on depressive symptoms in a discovery cohort of 96 patients, at baseline and six months post-stroke. Stepwise functional connectivity (SFC) was used to examine topological changes in the depression network over time to identify patterns of network reorganization. The predictive value of reorganization information was evaluated for follow-up symptoms in the discovery cohort and validation cohort 1 (22 worsening PSD patients) as well as for treatment responsiveness in validation cohort 2 (23 antidepressant-treated patients). We evaluated the consistency of significant reorganization areas with neuromodulation targets. Spatial correlations of network reorganization patterns with gene expression and neurotransmitter maps were analyzed. The predictive power of network damage for symptoms diminished at follow-up compared to baseline (Δadjusted R2 = -0.070, p < 0.001). Reorganization information effectively predicted symptoms at follow-up in the discovery cohort (adjusted R2 = 0.217, 95% CI: 0.010 to 0.431), as well as symptom exacerbation (r = 0.421, p = 0.033) and treatment responsiveness (r = 0.587, p = 0.012) in the validation cohorts. Regions undergoing significant reorganization overlapped with neuromodulatory targets known to be effective in treating depression. The reorganization of the depression network was associated with the expression of immune-inflammatory response genes and with gamma-aminobutyric acid. Our findings may yield important insights into the repair mechanisms of PSD and provide a critical context for developing post-stroke treatment strategies.

RLJ Journal 2025 Journal Article

MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight

  • Jinyan Su
  • Rohan Banerjee
  • Jiankai Sun
  • Wen Sun
  • Sarah Dean

The integration of AI into high-stakes decision-making domains demands safety and accountability. Traditional contextual bandit algorithms for online and adaptive decision-making must balance exploration and exploitation, posing significant risks when applied to critical environments where exploratory actions can lead to severe consequences. To address these challenges, we propose MixUCB, a flexible human-in-the-loop contextual bandit framework that enhances safe exploration by incorporating human expertise and oversight with machine automation. Based on the model's confidence and the associated risks, MixUCB intelligently determines when to seek human intervention. The reliance on human input gradually reduces as the system learns and gains confidence. Theoretically, we analyze the regret and query complexity in order to rigorously answer the question of when to query. Empirically, we validate the effectiveness through extensive experiments on both synthetic and real-world datasets. Our findings underscore the importance of designing decision-making frameworks that are not only theoretically and technically sound, but also align with societal expectations of accountability and safety. Our experimental code is available at: https://github.com/sdean-group/MixUCB
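
To make the confidence-gated querying concrete, here is a minimal sketch assuming a LinUCB-style confidence width as the trigger (the width-threshold rule and all names are illustrative stand-ins, not the paper's exact criterion):

```python
import numpy as np

def choose_or_query(A: np.ndarray, theta_hat: np.ndarray, contexts: np.ndarray,
                    alpha: float, width_threshold: float):
    """Sketch of a confidence-gated LinUCB step: pick the UCB-maximizing arm,
    but defer to a human when the confidence width of that arm is still large."""
    A_inv = np.linalg.inv(A)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
    ucb = contexts @ theta_hat + alpha * widths
    arm = int(np.argmax(ucb))
    ask_human = widths[arm] > width_threshold
    return arm, ask_human

# Toy usage with 3 arms in a 2-d context space.
d = 2
A = np.eye(d) * 2.0                        # regularized design matrix so far
theta_hat = np.array([0.3, -0.1])
contexts = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(choose_or_query(A, theta_hat, contexts, alpha=1.0, width_threshold=0.6))
```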

RLC Conference 2025 Conference Paper

MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight

  • Jinyan Su
  • Rohan Banerjee
  • Jiankai Sun
  • Wen Sun
  • Sarah Dean

The integration of AI into high-stakes decision-making domains demands safety and accountability. Traditional contextual bandit algorithms for online and adaptive decision-making must balance exploration and exploitation, posing significant risks when applied to critical environments where exploratory actions can lead to severe consequences. To address these challenges, we propose MixUCB, a flexible human-in-the-loop contextual bandit framework that enhances safe exploration by incorporating human expertise and oversight with machine automation. Based on the model's confidence and the associated risks, MixUCB intelligently determines when to seek human intervention. The reliance on human input gradually reduces as the system learns and gains confidence. Theoretically, we analyze the regret and query complexity in order to rigorously answer the question of when to query. Empirically, we validate the effectiveness through extensive experiments on both synthetic and real-world datasets. Our findings underscore the importance of designing decision-making frameworks that are not only theoretically and technically sound, but also align with societal expectations of accountability and safety. Our experimental code is available at: https://github.com/sdean-group/MixUCB

NeurIPS Conference 2025 Conference Paper

Scaling Offline RL via Efficient and Expressive Shortcut Models

  • Nicolas Espinosa-Dice
  • Yiyi Zhang
  • Yiding Chen
  • Bradley Guo
  • Owen Oertell
  • Gokul Swamy
  • Kianté Brantley
  • Wen Sun

Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline RL remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models – a novel class of generative models – to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL supports both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute.

NeurIPS Conference 2025 Conference Paper

Value-Guided Search for Efficient Chain-of-Thought Reasoning

  • Kaiwen Wang
  • Jin Zhou
  • Jonathan Chang
  • Zhaolin Gao
  • Nathan Kallus
  • Kianté Brantley
  • Wen Sun

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of ``step,'' which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (\texttt{VGS}) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-$n$. Moreover, \texttt{VGS} significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model and codebase are open-sourced at \codeurl.
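
A hedged sketch of what block-wise value-guided search with a weighted majority vote can look like (all interfaces here, `expand`, `value_fn`, and `answer_of`, are hypothetical stand-ins for a generator and the learned value model, not the open-sourced code):

```python
from collections import defaultdict

def value_guided_search(expand, value_fn, answer_of, beams=4, width=2, depth=3):
    """Sketch: repeatedly extend each partial trace by `width` candidate blocks,
    keep the `beams` highest-value traces, then combine final answers with a
    value-weighted majority vote."""
    traces = [""]
    for _ in range(depth):
        candidates = [t + b for t in traces for b in expand(t, width)]
        candidates.sort(key=value_fn, reverse=True)
        traces = candidates[:beams]
    votes = defaultdict(float)
    for t in traces:
        votes[answer_of(t)] += value_fn(t)   # weight each vote by the value score
    return max(votes, key=votes.get)

# Toy usage with stand-in functions.
expand = lambda t, k: [t + c for c in "ab"[:k]]        # fake "block" generator
value_fn = lambda t: t.count("a") / (len(t) + 1)       # fake value model
answer_of = lambda t: "yes" if t.count("a") >= 2 else "no"
print(value_guided_search(expand, value_fn, answer_of))
```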

NeurIPS Conference 2024 Conference Paper

Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes

  • Andrew Bennett
  • Nathan Kallus
  • Miruna Oprescu
  • Wen Sun
  • Kaiwen Wang

We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP), using transition observations from the original MDP, whether they are generated under the same or a different policy. This is an important problem when there is the possibility of a shift between historical and future environments, \emph{e.g.} due to unmeasured confounding, distributional shift, or an adversarial environment. We propose a perturbation model that allows changes in the transition kernel densities up to a given multiplicative factor or its reciprocal, extending the classic marginal sensitivity model (MSM) for single time-step decision-making to infinite-horizon RL. We characterize the sharp bounds on policy value under this model -- \emph{i.e.}, the tightest possible bounds based on transition observations from the original MDP -- and we study the estimation of these bounds from such transition observations. We develop an estimator with several important guarantees: it is semiparametrically efficient, and remains so even when certain necessary nuisance functions, such as worst-case Q-functions, are estimated at slow, nonparametric rates. Our estimator is also asymptotically normal, enabling straightforward statistical inference using Wald confidence intervals. Moreover, when certain nuisances are estimated inconsistently, the estimator still provides valid, albeit possibly not sharp, bounds on the policy value. We validate these properties in numerical simulations. The combination of accounting for environment shifts from train to test (robustness), being insensitive to nuisance-function estimation (orthogonality), and addressing the challenge of learning from finite samples (inference) together leads to credible and reliable policy evaluation.

RLC Conference 2024 Conference Paper

JoinGym: An Efficient Join Order Selection Environment

  • Junxiong Wang
  • Kaiwen Wang
  • Yueying Li
  • Nathan Kallus
  • Immanuel Trummer
  • Wen Sun

Join order selection (JOS), the ordering of join operations to minimize query execution cost, is a core NP-hard combinatorial optimization problem in database query optimization. We present \textsc{JoinGym}, a lightweight and easy-to-use reinforcement learning (RL) environment that captures both left-deep and bushy variants of the JOS problem. Compared to prior works that execute queries online, \textsc{JoinGym} has much higher throughput and efficiently simulates the cost of joins offline by looking up the intermediate table's cardinality from a pre-computed dataset. We provide such a cardinality dataset for $3300$ queries based on real IMDb workloads, which is the largest suite of its kind and may be of independent interest. We extensively benchmark several RL algorithms and find that the best policies are competitive with or better than Postgres, a strong non-learning baseline. However, the learned policies can still catastrophically fail on a small fraction of queries, which motivates future research using \textsc{JoinGym} to improve generalization and safety in long-tailed, partially observed, combinatorial optimization problems.
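
The offline cost lookup is the key design point; a toy environment in that spirit might look like the following (hypothetical API, illustrative only, not the released JoinGym code):

```python
class ToyJoinEnv:
    """Sketch of a JoinGym-style environment: an episode builds a left-deep join
    order, and the per-step cost is looked up from a precomputed
    intermediate-cardinality table instead of executing the query."""

    def __init__(self, tables, cardinality):
        self.tables = tables                  # table names for one query
        self.cardinality = cardinality        # frozenset of joined tables -> row count

    def reset(self):
        self.joined, self.remaining, self.cost = [], list(self.tables), 0
        return tuple(self.joined)

    def step(self, action):
        table = self.remaining.pop(action)
        self.joined.append(table)
        self.cost += self.cardinality[frozenset(self.joined)]
        done = not self.remaining
        return tuple(self.joined), -self.cost if done else 0, done

# Hypothetical cardinalities for a 3-table query.
card = {frozenset({"R"}): 100, frozenset({"R", "S"}): 40, frozenset({"R", "S", "T"}): 5}
env = ToyJoinEnv(["R", "S", "T"], card)
env.reset()
for a in [0, 0, 0]:
    state, reward, done = env.step(a)
print(state, reward, done)
```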

RLJ Journal 2024 Journal Article

JoinGym: An Efficient Join Order Selection Environment

  • Junxiong Wang
  • Kaiwen Wang
  • Yueying Li
  • Nathan Kallus
  • Immanuel Trummer
  • Wen Sun

Join order selection (JOS), the ordering of join operations to minimize query execution cost, is a core NP-hard combinatorial optimization problem in database query optimization. We present \textsc{JoinGym}, a lightweight and easy-to-use reinforcement learning (RL) environment that captures both left-deep and bushy variants of the JOS problem. Compared to prior works that execute queries online, \textsc{JoinGym} has much higher throughput and efficiently simulates the cost of joins offline by looking up the intermediate table's cardinality from a pre-computed dataset. We provide such a cardinality dataset for $3300$ queries based on real IMDb workloads, which is the largest suite of its kind and may be of independent interest. We extensively benchmark several RL algorithms and find that the best policies are competitive with or better than Postgres, a strong non-learning baseline. However, the learned policies can still catastrophically fail on a small fraction of queries, which motivates future research using \textsc{JoinGym} to improve generalization and safety in long-tailed, partially observed, combinatorial optimization problems.

NeurIPS Conference 2024 Conference Paper

REBEL: Reinforcement Learning via Regressing Relative Rewards

  • Zhaolin Gao
  • Jonathan D. Chang
  • Wenhao Zhan
  • Owen Oertell
  • Gokul Swamy
  • Kianté Brantley
  • Thorsten Joachims
  • J. A. Bagnell

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g., value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard. Implementation of REBEL can be found at https://github.com/ZhaolinGao/REBEL, and models trained by REBEL can be found at https://huggingface.co/Cornell-AGI.
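
The reduction described above has a particularly small core: regress the scaled change in log-probabilities between two completions onto their reward difference. A minimal sketch (placeholder names, not the released implementation):

```python
import numpy as np

def rebel_loss(logp_new, logp_old, rewards, eta: float) -> float:
    """Sketch of the REBEL regression objective for one prompt with two
    completions: regress the scaled difference of log-ratio changes onto the
    relative reward between the two completions."""
    pred = (1.0 / eta) * ((logp_new[0] - logp_old[0]) - (logp_new[1] - logp_old[1]))
    target = rewards[0] - rewards[1]
    return (pred - target) ** 2

# Hypothetical log-probs of two completions under the new and old policy.
print(rebel_loss(logp_new=np.array([-3.0, -4.2]),
                 logp_old=np.array([-3.1, -4.0]),
                 rewards=np.array([0.9, 0.2]),
                 eta=1.0))
```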

RLJ Journal 2024 Journal Article

RL for Consistency Models: Reward Guided Text-to-Image Generation with Fast Inference

  • Owen Oertell
  • Jonathan Daniel Chang
  • Yiyi Zhang
  • Kianté Brantley
  • Wen Sun

Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Compared to RL-finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.

RLC Conference 2024 Conference Paper

RL for Consistency Models: Reward Guided Text-to-Image Generation with Fast Inference

  • Owen Oertell
  • Jonathan Daniel Chang
  • Yiyi Zhang
  • Kianté Brantley
  • Wen Sun

Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Compared to RL-finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.

NeurIPS Conference 2024 Conference Paper

The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

  • Yuda Song
  • Gokul Swamy
  • Aarti Singh
  • J. A. Bagnell
  • Wen Sun

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work due to the fact that both have to start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy, but a weaker partial coverage condition suffices for online RL methods. This separation provides one explanation of why online RL methods can perform better than offline methods, especially when the offline preference data is not diverse enough. Finally, motivated by our preceding theoretical observations, we derive a hybrid preference optimization (HyPO) algorithm that uses offline data for contrastive-based preference optimization and online unlabeled data for KL regularization. Theoretically and empirically, we demonstrate that HyPO is more performant than its pure offline counterpart DPO, while still preserving its computation and memory efficiency.
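
As a hedged sketch of the hybrid idea (the exact loss below is an assumption for illustration, not necessarily the paper's formulation), training can combine an offline DPO-style contrastive term with a KL penalty estimated from online, unlabeled generations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_objective(logr_chosen, logr_rejected, online_logr, beta, lam):
    """Sketch of a hybrid preference objective: a DPO-style contrastive loss on
    offline preference pairs plus a KL penalty estimated from the policy's own
    online generations. All `logr_*` arrays hold log(pi / pi_ref)."""
    dpo = -np.mean(np.log(sigmoid(beta * (logr_chosen - logr_rejected))))
    kl = np.mean(online_logr)              # Monte-Carlo estimate of KL(pi || pi_ref)
    return dpo + lam * kl

# Hypothetical log-ratios for two offline pairs and three online samples.
print(hybrid_objective(logr_chosen=np.array([0.4, 0.1]),
                       logr_rejected=np.array([-0.2, 0.3]),
                       online_logr=np.array([0.05, 0.12, -0.01]),
                       beta=0.1, lam=0.05))
```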

YNIMG Journal 2024 Journal Article

Visual selective attention in individuals with age-related hearing loss

  • Min Zhu
  • Yufei Qiao
  • Wen Sun
  • Yang Sun
  • Yuanshun Long
  • Hua Guo
  • Chang Cai
  • Hang Shen

Evidence from epidemiological studies suggests that hearing loss is associated with an accelerated decline in cognitive function, but the underlying pathophysiological mechanism remains poorly understood. Studies using auditory tasks have suggested that degraded auditory input increases the cognitive load for auditory perceptual processing and thereby reduces the resources available for other cognitive tasks. Attention-related networks are among the systems overrecruited to support degraded auditory perception, but it is unclear how they function when no excessive recruitment of cognitive resources for auditory processing is needed. Here, we implemented an EEG study using a nonauditory visual attentional selection task in 30 individuals with age-related hearing loss (ARHLs, 60-73 years) and compared them with aged (N = 30, 60-70 years) and young (N = 35, 22-29 years) normal-hearing controls. Compared with their normal-hearing peers, ARHLs demonstrated a significant amplitude reduction for the posterior contralateral N2 component, which is a well-validated index of the allocation of selective visual attention, despite the comparable behavioral performance. Furthermore, the amplitudes were observed to correlate significantly with hearing acuities (pure tone audiometry thresholds) and higher-order hearing abilities (speech-in-noise thresholds) in aged individuals. The target-elicited alpha lateralization, another mechanism of visuospatial attention, demonstrated in control groups was not observed in ARHLs. Although behavioral performance is comparable, the significant decrease in N2pc amplitude in ARHLs provides neurophysiologic evidence that may suggest a visual attentional deficit in ARHLs even without extra-recruitment of cognitive resources by auditory processing. It supports the hypothesis that constant degraded auditory input in ARHLs has an adverse impact on the function of cognitive control systems, which is a possible mechanism mediating the relationship between hearing loss and cognitive decline.

NeurIPS Conference 2023 Conference Paper

Contextual Bandits and Imitation Learning with Preference-Based Active Queries

  • Ayush Sekhari
  • Karthik Sridharan
  • Wen Sun
  • Runzhe Wu

We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively request the expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is two-fold: to minimize regret associated with the executed actions, while simultaneously, minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions and present an algorithm that leverages an online regression oracle with respect to this function class. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$ represents the number of interactions, $d$ represents the eluder dimension of the function class, and $\Delta$ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require the knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandits setting where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the agent engages with an unknown environment in episodes of length $H$, and provide similar guarantees regarding regret and query complexity. Interestingly, with preference-based feedback, our imitation learning algorithm can learn a policy outperforming a sub-optimal expert, matching the result from interactive imitation learning algorithms [Ross and Bagnell, 2014] that require access to the expert's actions and also reward signals.

NeurIPS Conference 2023 Conference Paper

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

  • Masatoshi Uehara
  • Haruka Kiyohara
  • Andrew Bennett
  • Victor Chernozhukov
  • Nan Jiang
  • Nathan Kallus
  • Chengchun Shi
  • Wen Sun

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles as classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is close to the true policy value as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope

ICML Conference 2023 Conference Paper

Multi-task Representation Learning for Pure Exploration in Linear Bandits

  • Yihan Du
  • Longbo Huang
  • Wen Sun

Despite the recent success of representation learning in sequential decision making, the study of the pure exploration scenario (i.e., identify the best option and minimize the sample complexity) is still limited. In this paper, we study multi-task representation learning for best arm identification in linear bandit (RepBAI-LB) and best policy identification in contextual linear bandit (RepBPI-CLB), two popular pure exploration settings with wide applications, e.g., clinical trials and web content optimization. In these two problems, all tasks share a common low-dimensional linear representation, and our goal is to leverage this feature to accelerate the best arm (policy) identification process for all tasks. For these problems, we design computationally and sample efficient algorithms DouExpDes and C-DouExpDes, which perform double experimental designs to plan optimal sample allocations for learning the global representation. We show that by learning the common representation among tasks, our sample complexity is significantly better than that of the naive approach, which solves tasks independently. To the best of our knowledge, this is the first work to demonstrate the benefits of representation learning for multi-task pure exploration.

NeurIPS Conference 2023 Conference Paper

Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

  • Masatoshi Uehara
  • Nathan Kallus
  • Jason D. Lee
  • Wen Sun

We consider offline reinforcement learning (RL) where we have access only to offline data. In contrast to numerous offline RL algorithms that necessitate the uniform coverage of the offline data over state and action space, we propose value-based algorithms with PAC guarantees under partial coverage, specifically, coverage of offline data against a single policy, and realizability of the soft Q-function (a.k.a. entropy-regularized Q-function) and another function, which is defined as a solution to a saddle point of a certain minimax optimization problem. Furthermore, we show the analogous result for Q-functions instead of soft Q-functions. To attain these guarantees, we use novel algorithms with minimax loss functions to accurately estimate soft Q-functions and Q-functions, with convergence guarantees measured on the offline data. We introduce these loss functions by casting the estimation problems into nonlinear convex optimization problems and taking the Lagrange functions.

NeurIPS Conference 2023 Conference Paper

Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery

  • Katie Luo
  • Zhenzhen Liu
  • Xiangyu Chen
  • Yurong You
  • Sagie Benaim
  • Cheng Perng Phoo
  • Mark Campbell
  • Wen Sun

Recent advances in machine learning have shown that Reinforcement Learning from Human Feedback (RLHF) can improve machine learning models and align them with human preferences. Although very successful for Large Language Models (LLMs), these advancements have not had a comparable impact in research for autonomous vehicles—where alignment with human expectations can be imperative. In this paper, we propose to adapt similar RL-based methods to unsupervised object discovery, i.e., learning to detect objects from LiDAR points without any training labels. Instead of labels, we use simple heuristics to mimic human feedback. More explicitly, we combine multiple heuristics into a simple reward function that positively correlates its score with bounding box accuracy, i.e., boxes containing objects are scored higher than those without. We start from the detector’s own predictions to explore the space and reinforce boxes with high rewards through gradient updates. Empirically, we demonstrate that our approach is not only more accurate, but also orders of magnitude faster to train compared to prior works on object discovery. Code is available at https://github.com/katieluo88/DRIFT.

NeurIPS Conference 2023 Conference Paper

Selective Sampling and Imitation Learning via Online Regression

  • Ayush Sekhari
  • Karthik Sridharan
  • Wen Sun
  • Runzhe Wu

We consider the problem of Imitation Learning (IL) by actively querying a noisy expert for feedback. While imitation learning has been empirically successful, much of prior work assumes access to noiseless expert feedback, which is not practical in many applications. In fact, when one only has access to noisy expert feedback, algorithms that rely on purely offline data (non-interactive IL) can be shown to need a prohibitively large number of samples to be successful. In contrast, in this work, we provide an interactive algorithm for IL that uses selective sampling to actively query the noisy expert for feedback. Our contributions are twofold: First, we provide a new selective sampling algorithm that works with general function classes and multiple actions, and obtains the best-known bounds for the regret and the number of queries. Next, we extend this analysis to the problem of IL with noisy expert feedback and provide a new IL algorithm that makes limited queries. Our algorithm for selective sampling leverages function approximation, and relies on an online regression oracle with respect to the given model class to predict actions, and to decide whether to query the expert for its label. On the theoretical side, the regret bound of our algorithm is upper bounded by the regret of the online regression oracle, while the query complexity additionally depends on the eluder dimension of the model class. We complement this with a lower bound that demonstrates that our results are tight. We extend our selective sampling algorithm for IL with general function approximation and provide bounds on both the regret and the number of queries made to the noisy expert. A key novelty here is that our regret and query complexity bounds only depend on the number of times the optimal policy (and not the noisy expert, or the learner) goes to states that have a small margin.

NeurIPS Conference 2023 Conference Paper

The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

  • Kaiwen Wang
  • Kevin Zhou
  • Runzhe Wu
  • Nathan Kallus
  • Wen Sun

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.

NeurIPS Conference 2022 Conference Paper

Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

  • Masatoshi Uehara
  • Ayush Sekhari
  • Jason D. Lee
  • Nathan Kallus
  • Wen Sun

We study Reinforcement Learning for partially observable systems using function approximation. We propose a new PO-bilinear framework that is general enough to include models such as undercomplete tabular Partially Observable Markov Decision Processes (POMDPs), Linear Quadratic Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model, Hilbert Space Embeddings of POMDPs. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory-based policies (i.e., policies that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy among the policy class. For certain examples such as undercomplete POMDPs and LQGs, by leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon.

NeurIPS Conference 2021 Conference Paper

Mitigating Covariate Shift in Imitation Learning via Offline Data With Partial Coverage

  • Jonathan Chang
  • Masatoshi Uehara
  • Dhruv Sreenivas
  • Rahul Kidambi
  • Wen Sun

This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL methods such as behavior cloning (BC) fail completely. Source code is provided at https://github.com/jdchang1/milo.

NeurIPS Conference 2021 Conference Paper

MobILE: Model-Based Imitation Learning From Observation Alone

  • Rahul Kidambi
  • Jonathan Chang
  • Wen Sun

This paper studies Imitation Learning from Observations alone (ILFO) where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework MobILE to solve the ILFO problem. MobILE involves carefully trading off exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by reducing ILFO to a multi-armed bandit problem indicating that exploration is necessary for solving ILFO efficiently. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.

NeurIPS Conference 2020 Conference Paper

Constrained episodic reinforcement learning in concave-convex and knapsack settings

  • Kianté Brantley
  • Miro Dudik
  • Thodoris Lykouris
  • Sobhan Miryoosefi
  • Max Simchowitz
  • Aleksandrs Slivkins
  • Wen Sun

We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear constraints, and the remaining work focuses on either the feasibility question or settings with a single episode. Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments.

NeurIPS Conference 2020 Conference Paper

FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs

  • Alekh Agarwal
  • Sham Kakade
  • Akshay Krishnamurthy
  • Wen Sun

In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations, such as block MDPs, for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models. On a technical level, our analysis eliminates reachability assumptions that appear in prior results on the simpler block MDP model and may be of independent interest.

NeurIPS Conference 2020 Conference Paper

Information Theoretic Regret Bounds for Online Nonlinear Control

  • Sham Kakade
  • Akshay Krishnamurthy
  • Kendall Lowrey
  • Motoya Ohnishi
  • Wen Sun

This work studies the problem of sequential control in an unknown, nonlinear dynamical system, where we model the underlying system dynamics as an unknown function in a known Reproducing Kernel Hilbert Space. This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics. Our main result, the Lower Confidence-based Continuous Control (LC3) algorithm, enjoys a near-optimal $O(\sqrt{T})$ regret bound against the optimal controller in episodic settings, where $T$ is the number of episodes. The bound has no explicit dependence on dimension of the system dynamics, which could be infinite, but instead only depends on information theoretic quantities. We empirically show its application to a number of nonlinear control tasks and demonstrate the benefit of exploration for learning model dynamics.

NeurIPS Conference 2020 Conference Paper

Learning the Linear Quadratic Regulator from Nonlinear Observations

  • Zakaria Mhammedi
  • Dylan J. Foster
  • Max Simchowitz
  • Dipendra Misra
  • Wen Sun
  • Akshay Krishnamurthy
  • Alexander Rakhlin
  • John Langford

We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR. In our setting, the environment is summarized by a low-dimensional continuous latent state with linear dynamics and quadratic costs, but the agent operates on high-dimensional, nonlinear observations such as images from a camera. To enable sample-efficient learning, we assume that the learner has access to a class of decoder functions (e.g., neural networks) that is flexible enough to capture the mapping from observations to latent states. We introduce a new algorithm, RichID, which learns a near-optimal policy for the RichLQR with sample complexity scaling only with the dimension of the latent state space and the capacity of the decoder function class. RichID is oracle-efficient and accesses the decoder class only through calls to a least-squares regression oracle. To our knowledge, our results constitute the first provable sample complexity guarantee for continuous control with an unknown nonlinearity in the system model.

NeurIPS Conference 2020 Conference Paper

Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates

  • Wenhao Luo
  • Wen Sun
  • Ashish Kapoor

Safety in terms of collision avoidance for multi-robot systems is a difficult challenge under uncertainty, non-determinism, and lack of complete information. This paper aims to propose a collision avoidance method that accounts for both measurement uncertainty and motion uncertainty. In particular, we propose Probabilistic Safety Barrier Certificates (PrSBC) using Control Barrier Functions to define the space of admissible control actions that are probabilistically safe with formally provable theoretical guarantee. By formulating the chance constrained safety set into deterministic control constraints with PrSBC, the method entails minimally modifying an existing controller to determine an alternative safe controller via quadratic programming constrained to PrSBC constraints. The key advantage of the approach is that no assumptions about the form of uncertainty are required other than finite support, also enabling worst-case guarantees. We demonstrate effectiveness of the approach through experiments on realistic simulation environments.
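
The "minimally modify the nominal controller" step reduces, for a single linear constraint, to a projection with a closed form. A sketch under that simplifying assumption (a stand-in for one safety constraint, not the full multi-robot probabilistic QP):

```python
import numpy as np

def safety_filter(u_nom: np.ndarray, a: np.ndarray, b: float) -> np.ndarray:
    """Minimally modify a nominal control subject to one linear safety
    constraint a^T u <= b: the QP  min ||u - u_nom||^2  s.t.  a^T u <= b
    has this closed-form projection as its solution."""
    violation = a @ u_nom - b
    if violation <= 0:
        return u_nom                        # nominal control is already safe
    return u_nom - (violation / (a @ a)) * a

u_nom = np.array([1.0, 0.5])                # hypothetical desired velocity command
a, b = np.array([1.0, 1.0]), 1.0            # hypothetical safety half-space
print(safety_filter(u_nom, a, b))
```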

NeurIPS Conference 2020 Conference Paper

PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning

  • Alekh Agarwal
  • Mikael Henaff
  • Sham Kakade
  • Wen Sun

Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning can, at least in theory, directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the POLICY COVER GUIDED POLICY GRADIENT (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst-case $\ell_\infty$ assumptions; these include approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.

NeurIPS Conference 2019 Conference Paper

Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

  • Huaian Diao
  • Rajesh Jayaram
  • Zhao Song
  • Wen Sun
  • David Woodruff

We study the Kronecker product regression problem, in which the design matrix is a Kronecker product of two or more matrices. Formally, given $A_i \in \R^{n_i \times d_i}$ for $i=1, 2, \dots, q$ where $n_i \gg d_i$ for each $i$, and $b \in \R^{n_1 n_2 \cdots n_q}$, let $\mathcal{A} = A_1 \otimes A_2 \otimes \cdots \otimes A_q$. Then for $p \in [1, 2]$, the goal is to find $x \in \R^{d_1 \cdots d_q}$ that approximately minimizes $\|\mathcal{A}x - b\|_p$. Recently, Diao, Song, Sun, and Woodruff (AISTATS, 2018) gave an algorithm which is faster than forming the Kronecker product $\mathcal{A} \in \R^{n_1 \cdots n_q \times d_1 \cdots d_q}$. Specifically, for $p=2$ they achieve a running time of $O(\sum_{i=1}^q \texttt{nnz}(A_i) + \texttt{nnz}(b))$, where $\texttt{nnz}(A_i)$ is the number of non-zero entries in $A_i$. Note that $\texttt{nnz}(b)$ can be as large as $\Theta(n_1 \cdots n_q)$. For $p=1$, $q=2$ and $n_1 = n_2$, they achieve a worse bound of $O(n_1^{3/2} \text{poly}(d_1d_2) + \texttt{nnz}(b))$. In this work, we provide significantly faster algorithms. For $p=2$, our running time is $O(\sum_{i=1}^q \texttt{nnz}(A_i))$, which has no dependence on $\texttt{nnz}(b)$. For $p<2$, our running time is $O(\sum_{i=1}^q \texttt{nnz}(A_i) + \texttt{nnz}(b))$, which matches the prior best running time for $p=2$. We also consider the related all-pairs regression problem, where given $A \in \R^{n \times d}, b \in \R^n$, we want to solve $\min_{x \in \R^d} \|\bar{A}x - \bar{b}\|_p$, where $\bar{A} \in \R^{n^2 \times d}, \bar{b} \in \R^{n^2}$ consist of all pairwise differences of the rows of $A, b$. We give an $O(\texttt{nnz}(A))$ time algorithm for $p \in [1, 2]$, improving on the $\Omega(n^2)$ time required to form $\bar{A}$. Finally, we initiate the study of Kronecker product low rank and low t-rank approximation. For input $\mathcal{A}$ as above, we give $O(\sum_{i=1}^q \texttt{nnz}(A_i))$ time algorithms, which is much faster than computing $\mathcal{A}$.
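
The structural fact such algorithms exploit is that Kronecker-product matrices act on vectors without ever being materialized. A small self-contained check of that identity (illustration only, not the paper's sketching algorithm):

```python
import numpy as np

def kron_matvec(A1: np.ndarray, A2: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply (A1 kron A2) by x without forming the Kronecker product, using
    the identity (A1 kron A2) vec(X) = vec(A1 X A2^T) with row-major vec."""
    X = x.reshape(A1.shape[1], A2.shape[1])
    return (A1 @ X @ A2.T).ravel()

rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((5, 3)), rng.standard_normal((4, 2))
x = rng.standard_normal(3 * 2)
assert np.allclose(kron_matvec(A1, A2, x), np.kron(A1, A2) @ x)
print("factored matvec matches the explicit Kronecker product")
```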

NeurIPS Conference 2019 Conference Paper

Policy Poisoning in Batch Reinforcement Learning and Control

  • Yuzhe Ma
  • Xuezhou Zhang
  • Wen Sun
  • Jerry Zhu

We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: the tabular certainty equivalence learner in reinforcement learning and the linear quadratic regulator in control. We show that both instantiations result in a convex optimization problem for which global optimality is guaranteed, and provide analysis of attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.

NeurIPS Conference 2018 Conference Paper

Dual Policy Iteration

  • Wen Sun
  • Geoffrey Gordon
  • Byron Boots
  • J. Bagnell

Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

RLDM Conference 2017 Conference Abstract

Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction*

  • Wen Sun
  • Arun Venkatraman
  • Geoff Gordon
  • Byron Boots
  • J. Bagnell

Researchers have demonstrated state-of-the-art performance in sequential decision making problems (e.g., robotics control, sequential prediction) with deep neural network models. One often has access to near-optimal oracles that achieve good performance on the task during training. We demonstrate that AggreVaTeD — a policy gradient extension of the Imitation Learning (IL) approach of (Ross & Bagnell, 2014) — can leverage such an oracle to achieve faster and better solutions with less training data than a less-informed Reinforcement Learning (RL) technique. Using both feedforward and recurrent neural predictors, we present stochastic gradient procedures on a sequential prediction task, dependency-parsing from raw image data, as well as on various high dimensional robotics control problems. We also provide a comprehensive theoretical study of IL that demonstrates we can expect up to exponentially lower sample complexity for learning with AggreVaTeD than with RL algorithms, which backs our empirical findings. Our results and theory indicate that the proposed approach can achieve superior performance with respect to the oracle when the demonstrator is sub-optimal.

ICRA Conference 2017 Conference Paper

No-regret replanning under uncertainty

  • Wen Sun
  • Niteesh Sood
  • Debadeepta Dey
  • Gireeja Ranade
  • Siddharth Prakash
  • Ashish Kapoor

This paper explores the problem of path planning under uncertainty. Specifically, we consider online receding horizon based planners that need to operate in a latent environment where the latent information can be modelled via Gaussian Processes. Online path planning in latent environments is challenging since the robot needs to explore the environment to obtain a more accurate model of the latent information for better planning later, while also completing the task as quickly as possible. We propose UCB-style algorithms that are popular in the bandit setting and show how those analyses can be adapted to online robotic path planning problems. The proposed algorithm trades off exploration and exploitation in a near-optimal manner and has appealing no-regret properties. We demonstrate the efficacy of the framework on the application of aircraft flight path planning when the winds are partially observed.

NeurIPS Conference 2017 Conference Paper

Predictive-State Decoders: Encoding the Future into Recurrent Networks

  • Arun Venkatraman
  • Nicholas Rhinehart
  • Wen Sun
  • Lerrel Pinto
  • Martial Hebert
  • Byron Boots
  • Kris Kitani
  • J. Bagnell

Recurrent neural networks (RNNs) are a vital modeling technique that rely on internal states learned indirectly by optimization of a supervised, unsupervised, or reinforcement training loss. RNNs are used to model dynamic processes that are characterized by underlying latent states whose form is often unknown, precluding its analytic representation inside an RNN. In the Predictive-State Representation (PSR) literature, latent state processes are modeled by an internal state representation that directly models the distribution of future observations, and most recent work in this area has relied on explicitly representing and targeting sufficient statistics of this probability distribution. We seek to combine the advantages of RNNs and PSRs by augmenting existing state-of-the-art recurrent neural networks with Predictive-State Decoders (PSDs), which add supervision to the network's internal state representation to target predicting future observations. PSDs are simple to implement and easily incorporated into existing training pipelines via additional loss regularization. We demonstrate the effectiveness of PSDs with experimental results in three different domains: probabilistic filtering, Imitation Learning, and Reinforcement Learning. In each, our method improves statistical performance of state-of-the-art recurrent baselines and does so with fewer iterations and less data.

ICML Conference 2017 Conference Paper

Safety-Aware Algorithms for Adversarial Contextual Bandit

  • Wen Sun
  • Debadeepta Dey
  • Ashish Kapoor

In this work we study the safe sequential decision making problem under the setting of adversarial contextual bandits with sequential risk constraints. At each round, nature prepares a context, a cost for each arm, and additionally a risk for each arm. The learner leverages the context to pull an arm and receives the corresponding cost and risk associated with the pulled arm. In addition to minimizing the cumulative cost, for safety purposes, the learner needs to make safe decisions such that the average of the cumulative risk from all pulled arms should not be larger than a pre-defined threshold. To address this problem, we first study online convex programming in the full information setting where in each round the learner receives an adversarial convex loss and a convex constraint. We develop a meta algorithm leveraging online mirror descent for the full information setting and then extend it to the contextual bandit with sequential risk constraints setting using expert advice. Our algorithms can achieve near-optimal regret in terms of minimizing the total cost, while successfully maintaining sub-linear growth of the cumulative risk constraint violation. We support our theoretical results by demonstrating our algorithm on a simple simulated robotics reactive control task.
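
A hedged sketch of the full-information building block described above: one exponentiated-gradient (online mirror descent) step on a Lagrangian that mixes cost and risk, together with a dual update on the risk multiplier (the step sizes and risk budget are illustrative assumptions, not the paper's meta algorithm):

```python
import numpy as np

def exp_weights_update(weights, costs, risks, lam, lr, risk_budget=0.1):
    """One mirror-descent step over expert weights on the Lagrangian
    cost + lam * risk, followed by projected dual ascent on lam."""
    losses = costs + lam * risks
    new_w = weights * np.exp(-lr * losses)
    new_w /= new_w.sum()
    new_lam = max(0.0, lam + lr * (weights @ risks - risk_budget))
    return new_w, new_lam

w, lam = np.ones(3) / 3, 0.0
for _ in range(5):                           # adversarial costs/risks per round
    w, lam = exp_weights_update(w, np.array([0.2, 0.5, 0.1]),
                                np.array([0.3, 0.0, 0.4]), lam, lr=0.5)
print(w, lam)
```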

IJCAI Conference 2016 Conference Paper

Inference Machines for Nonparametric Filter Learning

  • Arun Venkatraman
  • Wen Sun
  • Martial Hebert
  • Byron Boots
  • J. Andrew Bagnell

Data-driven approaches for learning dynamic models for Bayesian filtering often try to maximize the data likelihood given parametric forms for the transition and observation models. However, this objective is usually nonconvex in the parametrization and can only be locally optimized. Furthermore, learning algorithms typically do not provide performance guarantees on the desired Bayesian filtering task. In this work, we propose using inference machines to directly optimize the filtering performance. Our procedure is capable of learning partially-observable systems when the state space is either unknown or known in advance. To accomplish this, we adapt PREDICTIVE STATE INFERENCE MACHINES (PSIMs) by introducing the concept of hints, which incorporate prior knowledge of the state space to accompany the predictive state representation. This allows PSIM to be applied to the larger class of filtering problems which require prediction of a specific parameter or partial component of state. Our PSIM+HINTS adaptation enjoys theoretical advantages similar to the original PSIM algorithm, and we showcase its performance on a variety of robotics filtering problems.

IJCAI Conference 2016 Conference Paper

Online Bellman Residual and Temporal Difference Algorithms with Predictive Error Guarantees

  • Wen Sun
  • J. Andrew Bagnell

We establish connections from optimizing Bellman Residual and Temporal Difference Loss to worst-case long-term predictive error. In the online learning framework, learning takes place over a sequence of trials with the goal of predicting a future discounted sum of rewards. Our first analysis shows that, together with a stability assumption, any no-regret online learning algorithm that minimizes Bellman error ensures small prediction error. Our second analysis shows that applying the family of online mirror descent algorithms to temporal difference loss also ensures small prediction error. No statistical assumptions are made on the sequence of observations, which could be non-Markovian or even adversarial. Our approach thus establishes a broad new family of provably sound algorithms and provides a generalization of previous worst-case results for minimizing predictive error. We investigate the potential advantages of some members of this family both theoretically and empirically on benchmark problems.
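
For concreteness, a minimal online temporal-difference update with linear features is sketched below (illustrative only; the paper's analysis covers general no-regret updates rather than this specific rule):

```python
import numpy as np

def td_step(w, phi_t, phi_next, reward, gamma, lr):
    """One online TD-style update with linear features: move the current
    prediction toward reward + gamma * (prediction at the next observation)."""
    td_error = reward + gamma * (w @ phi_next) - (w @ phi_t)
    return w + lr * td_error * phi_t

w = np.zeros(2)
stream = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 1.0)]
for phi_t, phi_next, r in stream * 10:       # replay a tiny observation stream
    w = td_step(w, phi_t, phi_next, r, gamma=0.9, lr=0.1)
print(w)
```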

AAAI Conference 2016 Conference Paper

Online Instrumental Variable Regression with Applications to Online Linear System Identification

  • Arun Venkatraman
  • Wen Sun
  • Martial Hebert
  • J. Bagnell
  • Byron Boots

Instrumental variable regression (IVR) is a statistical technique utilized to recover unbiased estimators when there are errors in the independent variables. Estimator bias in learned time series models can yield poor performance in applications such as long-term prediction and filtering where the recursive use of the model results in the accumulation of propagated error. However, prior work addressed the IVR objective in the batch setting, where it is necessary to store the entire dataset in memory - an infeasible requirement in large dataset scenarios. In this work, we develop Online Instrumental Variable Regression (OIVR), an algorithm that is capable of updating the learned estimator with streaming data. We show that the online adaptation of IVR enjoys a no-regret performance guarantee with respect to the original batch setting by taking advantage of any no-regret online learning algorithm inside OIVR for the underlying update steps. We experimentally demonstrate the efficacy of our algorithm in combination with popular no-regret online algorithms for the task of learning predictive dynamical system models and on a prototypical econometrics instrumental variable regression problem.