Arrow Research search

Author name cluster

Nan Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

61 papers
2 author rows

Possible papers (61)

NeurIPS Conference 2025 Conference Paper

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

  • Yuzheng Hu
  • Fan Wu
  • Haotian Ye
  • David Forsyth
  • James Zou
  • Nan Jiang
  • Jiaqi Ma
  • Han Zhao

Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume fixed datasets, which is violated in online RL where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. From standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.
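
To make the gradient-similarity measurement concrete, here is a minimal sketch; the toy linear policy, the surrogate per-record loss, and the simplified agent-action target are all illustrative stand-ins for the paper's PPO machinery:

```python
# A minimal sketch of gradient-similarity attribution over a training
# buffer. The toy linear policy, surrogate per-record loss, and the
# log-probability target below are illustrative stand-ins; the paper's
# PPO losses and target functions are richer.
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)  # toy policy: state -> action logits

def record_loss(state, action):
    """Surrogate training loss for one buffer record."""
    logits = policy(state)
    return torch.nn.functional.cross_entropy(logits.unsqueeze(0), action.unsqueeze(0))

def target_fn(state, action):
    """Agent-action target: log-probability of the chosen action."""
    return torch.log_softmax(policy(state), dim=-1)[action]

def flat_grad(scalar):
    grads = torch.autograd.grad(scalar, list(policy.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

buffer = [(torch.randn(4), torch.tensor(i % 2)) for i in range(8)]
query = (torch.randn(4), torch.tensor(1))

g_target = flat_grad(target_fn(*query))
scores = [
    torch.nn.functional.cosine_similarity(
        flat_grad(record_loss(s, a)), g_target, dim=0
    ).item()
    for s, a in buffer
]
print(sorted(range(len(buffer)), key=lambda i: -scores[i]))  # most influential first
```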

AAAI Conference 2025 Conference Paper

Active Symbolic Discovery of Ordinary Differential Equations via Phase Portrait Sketching

  • Nan Jiang
  • Md Nasim
  • Yexiang Xue

The symbolic discovery of Ordinary Differential Equations (ODEs) from trajectory data plays a pivotal role in AI-driven scientific discovery. Existing symbolic methods predominantly rely on fixed, pre-collected training datasets, which often results in suboptimal performance, as demonstrated in our case study in Figure 1. Drawing inspiration from active learning, we investigate strategies to query informative trajectory data that can enhance the evaluation of predicted ODEs. However, the butterfly effect in dynamical systems reveals that small variations in initial conditions can lead to drastically different trajectories, so conventional active learning would require storing vast quantities of trajectory data. To address this, we introduce Active Symbolic Discovery of Ordinary Differential Equations via Phase Portrait Sketching (APPS). Instead of directly selecting individual initial conditions, APPS first identifies an informative region within the phase space and then samples a batch of initial conditions from this region. Compared to traditional active learning methods, APPS avoids the need to maintain a large amount of trajectory data. Extensive experiments demonstrate that APPS consistently discovers more accurate ODE expressions than baseline methods that use passively collected datasets.
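
A schematic sketch of the region-then-batch idea follows; the disagreement-based region score is an illustrative stand-in for the paper's actual acquisition criterion:

```python
# A schematic sketch of region-then-batch querying: score candidate
# phase-space regions by disagreement among currently predicted ODEs, then
# draw a batch of initial conditions from the most informative region. The
# disagreement score is an illustrative stand-in for the paper's criterion.
import numpy as np

rng = np.random.default_rng(0)
candidate_odes = [lambda x: -x, lambda x: -0.5 * x, lambda x: -x + 0.1 * x**2]

def region_score(lo, hi, n_probe=32):
    """Disagreement of the candidates' vector fields inside [lo, hi]."""
    xs = rng.uniform(lo, hi, size=n_probe)
    return np.stack([f(xs) for f in candidate_odes]).std(axis=0).mean()

regions = [(-2.0, -1.0), (-1.0, 1.0), (1.0, 2.0)]
best = max(regions, key=lambda r: region_score(*r))
batch = rng.uniform(*best, size=8)  # batch of initial conditions to simulate
print(best, batch.round(2))
```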

ICLR Conference 2025 Conference Paper

Commit0: Library Generation from Scratch

  • Wenting Zhao
  • Nan Jiang
  • Celine Lee
  • Justin T. Chiu
  • Claire Cardie
  • Matthias Gallé
  • Alexander M. Rush

With the goal of benchmarking generative systems beyond expert software development ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library’s API as well as a suite of interactive unit tests, and are tasked with producing an implementation of this API. The implementation is validated by running these unit tests. As a benchmark, Commit0 is designed to move beyond static one-shot code generation towards agents that must process long-form natural language specifications, adapt to multi-stage feedback, and generate code with complex dependencies. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate. Our experiments demonstrate that while current agents can pass some unit tests, none can yet reproduce complete libraries. Results also show that interactive feedback is quite useful for helping models generate code that passes more unit tests, validating benchmark designs that facilitate its use. We publicly release the benchmark, the interactive environment, and the leaderboard.

ICLR Conference 2025 Conference Paper

GameArena: Evaluating LLM Reasoning through Live Computer Games

  • Lanxiang Hu
  • Qiyu Li 0001
  • Anze Xie
  • Nan Jiang
  • Ion Stoica
  • Haojian Jin
  • Hao Zhang 0025

Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may saturate over time, or on binary live human feedback that conflates reasoning with other abilities. As the most prominent dynamic benchmark, Chatbot Arena evaluates open-ended questions in real-world settings, but lacks granularity in assessing specific reasoning capabilities. We introduce GameArena, a dynamic benchmark designed to evaluate LLM reasoning capabilities through interactive gameplay with humans. GameArena consists of three games designed to test specific reasoning capabilities (e.g., deductive and inductive reasoning), while keeping participants entertained and engaged. We analyze the gaming data retrospectively to uncover the underlying reasoning processes of LLMs and measure their fine-grained reasoning capabilities. We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs. Our user study with 100 participants suggests that GameArena improves user engagement compared to Chatbot Arena. For the first time, GameArena enables the collection of step-by-step LLM reasoning data in the wild.

NeurIPS Conference 2025 Conference Paper

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent

  • Yuheng Zhang
  • Dian Yu
  • Tao Ge
  • Linfeng Song
  • Zhichen Zeng
  • Haitao Mi
  • Nan Jiang
  • Dong Yu

Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model, which assumes the existence of a ground-truth reward for each prompt-response pair. However, this assumption can be overly restrictive when modeling complex human preferences. In this paper, we drop the BT model assumption and study LLM alignment under general preferences, formulated as a two-player game. Drawing on theoretical insights from learning in games, we integrate optimistic online mirror descent into our alignment framework to approximate the Nash policy. Theoretically, we demonstrate that our approach achieves an $\mathcal{O}(T^{-1})$ bound on the duality gap, improving upon the previous $\mathcal{O}(T^{-1/2})$ result. Meanwhile, it enjoys a linear convergence rate in the last iterate, a property not achieved by previous methods. More importantly, we implement our method and show through experiments that it outperforms state-of-the-art RLHF algorithms across multiple representative benchmarks.
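
The optimism trick itself is compact. Below is a toy sketch of optimistic multiplicative weights (an instance of optimistic online mirror descent with entropy regularization) on a small cyclic preference game whose Nash policy is uniform; the paper applies the same principle to KL-regularized LLM alignment, which this example does not model:

```python
# A toy sketch of optimistic multiplicative weights (optimistic online
# mirror descent with entropy regularization) on a 3x3 cyclic preference
# game whose Nash policy is uniform. The paper applies this principle to
# KL-regularized LLM alignment, which this example does not model.
import numpy as np

# A[i, j] = P(response i is preferred over response j); cyclic, Nash = uniform.
A = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
eta = 0.1
x = np.array([0.6, 0.3, 0.1])  # max player (the policy being aligned)
y = np.array([0.2, 0.5, 0.3])  # min player (the competing policy)
gx_prev, gy_prev = A @ y, A.T @ x

for _ in range(2000):
    gx, gy = A @ y, A.T @ x
    # Optimistic step: respond to 2*g_t - g_{t-1}, anticipating the next gradient.
    x = x * np.exp(eta * (2 * gx - gx_prev))
    y = y * np.exp(-eta * (2 * gy - gy_prev))
    x, y = x / x.sum(), y / y.sum()
    gx_prev, gy_prev = gx, gy

print(x.round(3), y.round(3))  # last iterates approach the uniform Nash policy
```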

AAAI Conference 2025 Conference Paper

LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement

  • Nan Jiang
  • Shanchao Liang
  • Chengxiao Wang
  • Jiannan Wang
  • Lin Tan

Portable Document Format (PDF) files are the dominant format for storing and disseminating scientific research, legal documents, and tax information, and LaTeX is a popular application for creating PDF documents. Despite its advantages, LaTeX is not WYSIWYG (what you see is what you get): the LaTeX source and the rendered PDF image can look drastically different, especially for formulae and tables. This gap makes it hard to modify or export LaTeX sources for formulae and tables from PDF images, and existing work remains limited. First, prior work generates LaTeX sources in a single iteration and struggles with complex LaTeX formulae. Second, existing work mainly recognizes and extracts LaTeX sources for formulae, and is incapable or ineffective for tables. This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition. Specifically, we propose delta-view as feedback, which compares and pinpoints the differences between a pair of rendered images: the image rendered from the extracted LaTeX source and the expected correct image. Such delta-view feedback enables our fault localization model to localize the faulty parts of the incorrect recognition more accurately, and enables our LaTeX refinement model to repair the incorrect extraction more accurately. LATTE improves the LaTeX source extraction accuracy for both LaTeX formulae and tables, outperforming existing techniques as well as GPT-4V by at least 7.07% in exact match, with a success refinement rate of 46.08% (formula) and 25.51% (table).

NeurIPS Conference 2025 Conference Paper

Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

  • Pai Liu
  • Lingfeng Zhao
  • Shivangi Agarwal
  • Jinghan Liu
  • Audrey Huang
  • Philip Amortila
  • Nan Jiang

Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation and better control of candidate value functions in an optimization-free manner, and evaluation of model-free and model-based methods alike. We exemplify the protocol on Gym-Hopper, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.

NeurIPS Conference 2025 Conference Paper

Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

  • Jiarui Yao
  • Yifan Hao
  • Hanning Zhang
  • Hanze Dong
  • Wei Xiong
  • Nan Jiang
  • Tong Zhang

Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy.
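
A hedged sketch of the allocation idea: assuming the per-prompt contribution to gradient variance scales like $g_i^2/(n_i p_i)$ for gradient norm $g_i$, acceptance rate $p_i$, and sample count $n_i$, minimizing total variance under a fixed budget gives a Neyman-style rule $n_i \propto g_i/\sqrt{p_i}$. The paper's exact rule may differ:

```python
# A hedged sketch of variance-aware budget allocation: assuming the
# per-prompt gradient-variance contribution scales like g_i^2 / (n_i * p_i)
# (g_i: gradient norm, p_i: acceptance rate, n_i: samples), minimizing total
# variance under a fixed budget gives a Neyman-style rule n_i ~ g_i / sqrt(p_i).
# The paper's exact allocation rule may differ.
import numpy as np

def allocate(budget, grad_norms, accept_rates, n_min=1):
    g = np.asarray(grad_norms, dtype=float)
    p = np.asarray(accept_rates, dtype=float)
    raw = g / np.sqrt(p)                                   # Neyman-style weights
    n = np.maximum(n_min, np.round(budget * raw / raw.sum()))
    return n.astype(int)                                   # total ~ budget after rounding

# Hard prompts (low acceptance rate, large gradient norm) get more samples.
print(allocate(100, grad_norms=[2.0, 1.0, 0.5], accept_rates=[0.1, 0.5, 0.9]))
```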

ICLR Conference 2025 Conference Paper

Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

  • Yuheng Zhang
  • Nan Jiang

We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works (Uehara et al., 2023a; Zhang & Jiang, 2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm (whose analysis, despite the algorithm’s simplicity, has surprisingly eluded the literature), demonstrating a provable separation between model-free and model-based OPE in POMDPs.

NeurIPS Conference 2025 Conference Paper

Thinking vs. Doing: Improving Agent Reasoning by Scaling Test-Time Interaction

  • Junhong Shen
  • Hao Bai
  • Lunjun Zhang
  • Yifei Zhou
  • Amrith Setlur
  • Peter Tong
  • Diego Caples
  • Nan Jiang

Test-time scaling in agentic tasks often relies on generating long reasoning traces ("think" more) before acting, but this does not allow agents to acquire new information from the environment or adapt behavior over time. In this work, we propose scaling test-time interaction, an untapped dimension for test-time scaling that increases the agent's interaction horizon to enable rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we situate our study in the domain of web agents. We first show that even prompting-based interaction scaling can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI, a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their interaction lengths during rollout. Using a Gemma 3 12B model, TTI sets a new state-of-the-art among open-source agents trained on public data on WebVoyager and WebArena. Case studies further reveal that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-action compute, offering new avenues for training robust and adaptive agents.

RLJ Journal 2024 Journal Article

A Tighter Convergence Proof of Reverse Experience Replay

  • Nan Jiang
  • Jinzhao Li
  • Yexiang Xue

In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update the parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis only holds for a very small learning rate and short consecutive sequences, which converge more slowly than large-learning-rate algorithms without RER. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitations on the learning rate and the length of consecutive steps. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.
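
The core mechanism is easy to state in code. A minimal tabular sketch on a toy chain MDP (not the setting of the paper's analysis):

```python
# A minimal tabular sketch of Reverse Experience Replay on a toy chain MDP:
# Q-updates sweep one trajectory's consecutive (s, a, r, s') tuples in
# reverse order, so reward information propagates backward in a single pass.
import random

random.seed(0)
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.5
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Toy chain: action 1 moves right, reward only at the last state."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_states - 1 else 0.0), s2

s, traj = 0, []
for _ in range(10):  # collect one trajectory of consecutive transitions
    a = random.randrange(n_actions)
    r, s2 = step(s, a)
    traj.append((s, a, r, s2))
    s = s2

for (s, a, r, s2) in reversed(traj):  # the RER pass: replay in reverse
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

print(Q)
```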

RLC Conference 2024 Conference Paper

A Tighter Convergence Proof of Reverse Experience Replay

  • Nan Jiang
  • Jinzhao Li
  • Yexiang Xue

In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update the parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis only holds for a very small learning rate and short consecutive sequences, which converge more slowly than large-learning-rate algorithms without RER. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitations on the learning rate and the length of consecutive steps. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.

JBHI Journal 2024 Journal Article

Effective Motion Self-Learning Genre Using 360° Virtual Reality Content on Mobile Device: A Study Based on Taichi Training Platform

  • Lutong Wang
  • Wei Gai
  • Nan Jiang
  • Gongxiang Chen
  • Yulong Bian
  • Hongqiu Luan
  • Li Huang
  • Chenglei Yang

Online fitness training, with its affordability and flexibility, offers a convenient way for individuals to engage in regular workouts that promote physical and mental health. Yet, learning fitness motions in this way presents various challenges and may not always be as effective as in-person training. To address the practical demands of motion learning, we conducted a systematic survey and accordingly proposed a four-stage self-learning genre that integrates immersive virtual reality (VR) environments with motion skill learning theories, strategies, and expert experience. Herein, we merged progressive structures and multi-level visual cues to enhance instruction, and proposed a fine-grained motion analysis method to provide adaptive correction feedback during training. Utilizing a Taichi training platform with the genre embedded, we systematically validated the effectiveness of the genre, and examined the potential impact of VR content presentation form on motion learning among different age groups, as well as their preferences and focus on VR fitness training genre design. Results from the quantitative analysis, qualitative evaluation, and case study showed that the 360° video-based VR content yielded better motion learning performance and user experience than the fully-simulated VR used in many previous studies. The proposed genre demonstrated outstanding performance, experience, and usability, with each stage and design playing an effective role. Moreover, we offer several design considerations for VR fitness systems targeting diverse age groups, providing beneficial insights for VR development in the sports and health-related fields.

ICLR Conference 2024 Conference Paper

Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability

  • Ivan Lee
  • Nan Jiang
  • Taylor Berg-Kirkpatrick

What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state-space-model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with, or even better in-context learners than, transformers. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.

NeurIPS Conference 2024 Conference Paper

LeDex: Training LLMs to Better Self-Debug and Explain Code

  • Nan Jiang
  • Xiaopeng Li
  • Shiqi Wang
  • Qiang Zhou
  • Soneya B. Hossain
  • Baishakhi Ray
  • Varun Kumar
  • Xiaofei Ma

In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging for complex tasks. Prior works on self-debugging mostly focus on prompting methods by providing LLMs with few-shot examples, which work poorly on small open-sourced LLMs. In this work, we propose LeDex, a training framework that significantly improves the self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations on the wrong code followed by code refinement helps LLMs better analyze the wrong code and do refinement. We thus propose an automated pipeline to collect a high-quality dataset for code explanation and refinement by generating a number of explanations and refinement trajectories from the LLM itself or a larger teacher model and filtering via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design considering code explanation and refinement quality. SFT improves the pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings an additional improvement of up to 3.54% on pass@1 and 2.55% on pass@10. The trained LLMs show iterative refinement ability and can keep refining code continuously. Lastly, our human evaluation shows that the LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.
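
A toy sketch of the execution-verification filtering step that builds such a dataset; the trajectories and test format are illustrative stand-ins:

```python
# A toy sketch of execution-verification filtering: only explanation +
# refinement trajectories whose refined code passes the unit tests are
# kept. The trajectories and test format here are illustrative stand-ins.
def run_tests(code, tests):
    env = {}
    try:
        exec(code, env)                      # define the candidate function
        return all(eval(t, env) for t in tests)
    except Exception:
        return False

trajectories = [
    {"explanation": "off-by-one: range(n) misses n",
     "refined": "def f(n):\n    return sum(range(n + 1))"},
    {"explanation": "wrong operator",
     "refined": "def f(n):\n    return n - 1"},
]
tests = ["f(3) == 6", "f(0) == 0"]

kept = [tr for tr in trajectories if run_tests(tr["refined"], tests)]
print(len(kept))  # only execution-verified trajectories enter SFT/RL training
```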

JMLR Journal 2024 Journal Article

Model-Free Representation Learning and Exploration in Low-Rank MDPs

  • Aditya Modi
  • Jinglin Chen
  • Akshay Krishnamurthy
  • Nan Jiang
  • Alekh Agarwal

The low-rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low-rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.

RLC Conference 2024 Conference Paper

Non-adaptive Online Finetuning for Offline Reinforcement Learning

  • Audrey Huang
  • Mohammad Ghavamzadeh
  • Nan Jiang
  • Marek Petrik

Offline reinforcement learning (RL) has emerged as an important framework for applying RL to real-life applications. However, the complete lack of online interactions causes technical difficulties. The online finetuning setting, which incorporates a limited form of online interactions, often available in practice, has been developed to address these challenges. Unfortunately, existing theoretical frameworks for online finetuning either assume high online sample complexity or require deploying fully adaptive algorithms (i.e., unlimited policy changes), which restrict their application to real-world settings where online interactions and policy updates are expensive and limited. In this paper, we develop a new theoretical framework for online finetuning. Instead of competing with the optimal policy (which inherits the high sample complexity and adaptivity requirements of online RL), we aim to learn a policy that improves as much as possible over an existing reference policy using a pre-specified number of online samples and a non-adaptive data-collection strategy. Our formulation reveals surprising nuances and suggests novel principles that distinguish finetuning from purely online and offline RL.

RLJ Journal 2024 Journal Article

Non-adaptive Online Finetuning for Offline Reinforcement Learning

  • Audrey Huang
  • Mohammad Ghavamzadeh
  • Nan Jiang
  • Marek Petrik

Offline reinforcement learning (RL) has emerged as an important framework for applying RL to real-life applications. However, the complete lack of online interactions causes technical difficulties. The online finetuning setting, which incorporates a limited form of online interactions, often available in practice, has been developed to address these challenges. Unfortunately, existing theoretical frameworks for online finetuning either assume high online sample complexity or require deploying fully adaptive algorithms (i.e., unlimited policy changes), which restrict their application to real-world settings where online interactions and policy updates are expensive and limited. In this paper, we develop a new theoretical framework for online finetuning. Instead of competing with the optimal policy (which inherits the high sample complexity and adaptivity requirements of online RL), we aim to learn a policy that improves as much as possible over an existing reference policy using a pre-specified number of online samples and a non-adaptive data-collection strategy. Our formulation reveals surprising nuances and suggests novel principles that distinguish finetuning from purely online and offline RL.

NeurIPS Conference 2024 Conference Paper

Occupancy-based Policy Gradient: Estimation, Convergence, and Optimality

  • Audrey Huang
  • Nan Jiang

Occupancy functions play an instrumental role in reinforcement learning (RL) for guiding exploration, handling distribution shift, and optimizing general objectives beyond the expected return. Yet, computationally efficient policy optimization methods that use (only) occupancy functions are virtually non-existent. In this paper, we establish the theoretical foundations of model-free policy gradient (PG) methods that compute the gradient through the occupancy for both online and offline RL, without modeling value functions. Our algorithms reduce gradient estimation to squared-loss regression and are computationally oracle-efficient. We characterize the sample complexities of both local and global convergence, accounting for both finite-sample estimation error and the roles of exploration (online) and data coverage (offline). Occupancy-based PG naturally handles arbitrary offline data distributions, and, with one-line algorithmic changes, can be adapted to optimize any differentiable objective functional.

NeurIPS Conference 2024 Conference Paper

On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation

  • Yuheng Zhang
  • Nan Jiang

We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantees avoid exponential dependence on the horizon. While such estimators exist for MDPs, and POMDPs can be converted to history-based MDPs, the estimation errors depend on the state-density ratio for MDPs, which becomes a history ratio after conversion, an exponentially large object. Recently, Uehara et al. [2022a] proposed future-dependent value functions as a promising framework to address this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, it also depends on the boundedness of the future-dependent value function and other related quantities, which we show can be exponential in length, thus erasing the advantage of the method. In this paper, we discover novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage, which enable polynomial bounds on the aforementioned quantities. As a side product, our analyses also lead to the discovery of new algorithms with complementary properties.

NeurIPS Conference 2024 Conference Paper

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

  • Chenlu Ye
  • Wei Xiong
  • Yuheng Zhang
  • Hanze Dong
  • Nan Jiang
  • Tong Zhang

We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry model as most prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and propose sample-efficient algorithms both for offline learning from a pre-collected preference dataset and for online learning, where we can query the preference oracle during training. Empirical studies verify the effectiveness of the proposed framework.

NeurIPS Conference 2024 Conference Paper

PhyRecon: Physically Plausible Neural Scene Reconstruction

  • Junfeng Ni
  • Yixin Chen
  • Bohan Jing
  • Nan Jiang
  • Bin Wang
  • Bo Dai
  • Puhao Li
  • Yixin Zhu

We address the issue of physical implausibility in multi-view neural reconstruction. While implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, limiting their utility in domains requiring rigorous physical accuracy. This lack of plausibility stems from the absence of physics modeling in existing methods and their inability to recover intricate geometrical structures. In this paper, we introduce PHYRECON, the first approach to leverage both differentiable rendering and differentiable physics simulation to learn implicit surface representations. PHYRECON features a novel differentiable particle-based physical simulator built on neural implicit representations. Central to this design is an efficient transformation between SDF-based implicit representations and explicit surface points via our proposed Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Additionally, PHYRECON models both rendering and physical uncertainty to identify and compensate for inconsistent and inaccurate monocular geometric priors. The physical uncertainty further facilitates physics-guided pixel sampling to enhance the learning of slender structures. By integrating these techniques, our model supports differentiable joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PHYRECON significantly improves the reconstruction quality. Our results also exhibit superior physical stability in physical simulators, with at least a 40% improvement across all datasets, paving the way for future physics-based applications.

AAAI Conference 2024 Conference Paper

Racing Control Variable Genetic Programming for Symbolic Regression

  • Nan Jiang
  • Yexiang Xue

Symbolic regression, as one of the most crucial tasks in AI for science, discovers governing equations from experimental data. Popular approaches based on genetic programming, Monte Carlo tree search, or deep reinforcement learning learn symbolic regression from a fixed dataset. These methods require massive datasets and long training times, especially when learning complex equations involving many variables. Recently, Control Variable Genetic Programming (CVGP) has been introduced, which accelerates the regression process by discovering equations from designed control variable experiments. However, the set of experiments is fixed a priori in CVGP, and we observe that sub-optimal selection of experiment schedules delays the discovery process significantly. To overcome this limitation, we propose Racing Control Variable Genetic Programming (Racing-CVGP), which carries out multiple experiment schedules simultaneously. A selection scheme similar to that used in selecting good symbolic equations in the genetic programming process is implemented to ensure that promising experiment schedules eventually win over the average ones. The unfavorable schedules are terminated early to save time for the promising ones. We evaluate Racing-CVGP on several synthetic and real-world datasets corresponding to true physics laws. We demonstrate that Racing-CVGP outperforms CVGP and a series of symbolic regressors which discover equations from fixed datasets.

NeurIPS Conference 2024 Conference Paper

Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity

  • Philip Amortila
  • Dylan J. Foster
  • Nan Jiang
  • Akshay Krishnamurthy
  • Zakaria Mhammedi

Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations, but the underlying ("latent") dynamics are comparatively simple. However, beyond restrictive settings such as tabular latent dynamics, the fundamental statistical requirements and algorithmic principles for reinforcement learning under latent dynamics are poorly understood. This paper addresses the question of reinforcement learning under general latent dynamics from a statistical and algorithmic perspective. On the statistical side, our main negative result shows that most well-studied settings for reinforcement learning with function approximation become intractable when composed with rich observations; we complement this with a positive result, identifying latent pushforward coverability as a general condition that enables statistical tractability. Algorithmically, we develop provably efficient observable-to-latent reductions, that is, reductions that transform an arbitrary algorithm for the latent MDP into an algorithm that can operate on rich observations, in two settings: one where the agent has access to hindsight observations of the latent dynamics (Lee et al., 2023) and one where the agent can estimate self-predictive latent models (Schwarzer et al., 2020). Together, our results serve as a first step toward a unified statistical and algorithmic theory for reinforcement learning under latent dynamics.

TMLR Journal 2024 Journal Article

RLHF Workflow: From Reward Modeling to Online RLHF

  • Hanze Dong
  • Wei Xiong
  • Bo Pang
  • Haoxiang Wang
  • Han Zhao
  • Yingbo Zhou
  • Nan Jiang
  • Doyen Sahoo

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.

AAAI Conference 2024 Conference Paper

Solving Satisfiability Modulo Counting for Symbolic and Statistical AI Integration with Provable Guarantees

  • Jinzhao Li
  • Nan Jiang
  • Yexiang Xue

Satisfiability Modulo Counting (SMC) encompasses problems that require both symbolic decision-making and statistical reasoning. Its general formulation captures many real-world problems at the intersection of symbolic and statistical AI. SMC searches for policy interventions to control probabilistic outcomes. Solving SMC is challenging because of its highly intractable nature (NP^PP-complete), incorporating statistical inference and symbolic reasoning. Previous research on SMC solving lacks provable guarantees and/or suffers from suboptimal empirical performance, especially when combinatorial constraints are present. We propose XOR-SMC, a polynomial algorithm with access to NP-oracles, to solve highly intractable SMC problems with constant approximation guarantees. XOR-SMC transforms the highly intractable SMC into satisfiability problems by replacing the model counting in SMC with SAT formulae subject to randomized XOR constraints. Experiments on solving important SMC problems in AI for social good demonstrate that XOR-SMC outperforms several baselines both in solution quality and running time.
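
The randomized-XOR idea can be demonstrated by brute force: each random parity constraint halves the solution set in expectation, so the largest number of XORs a formula survives estimates $\log_2$ of its model count. The toy check below is only an illustration, not the paper's NP-oracle algorithm:

```python
# A brute-force toy of the randomized-XOR idea behind XOR-SMC: each random
# parity constraint halves a formula's solution set in expectation, so the
# number of XORs it survives estimates log2 of its model count. This demo
# is not the paper's NP-oracle algorithm.
import itertools, random

random.seed(0)
n = 10
formula = lambda x: x[0] or x[1]  # toy formula; true model count is 768

def survives(k):
    """Is (formula AND k random XOR constraints) satisfiable? (brute force)"""
    xors = [([random.random() < 0.5 for _ in range(n)], random.random() < 0.5)
            for _ in range(k)]
    for x in itertools.product([False, True], repeat=n):
        if formula(x) and all(
            sum(c and xi for c, xi in zip(coeffs, x)) % 2 == parity
            for coeffs, parity in xors
        ):
            return True
    return False

k = 0
while survives(k + 1):  # largest k survived ~ log2(768) ~ 9.6
    k += 1
print(k, 2 ** k)
```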

IJCAI Conference 2024 Conference Paper

Vertical Symbolic Regression via Deep Policy Gradient

  • Nan Jiang
  • Md Nasim
  • Yexiang Xue

Vertical Symbolic Regression (VSR) has recently been proposed to expedite the discovery of symbolic equations with many independent variables from experimental data. VSR reduces the search spaces following the vertical discovery path by building from reduced-form equations involving a subset of variables to all variables. While deep neural networks have shown promise in enhancing symbolic regression, directly integrating VSR with deep networks faces challenges such as gradient propagation and engineering complexities due to the tree representation of expressions. We propose Vertical Symbolic Regression using Deep Policy Gradient (VSR-DPG) and demonstrate that VSR-DPG can recover ground-truth equations involving multiple input variables, significantly beyond both deep reinforcement learning-based approaches and previous VSR variants. Our VSR-DPG models symbolic regression as a sequential decision-making process, in which equations are built from repeated applications of grammar rules. The integrated deep model is trained to maximize a policy gradient objective. Experimental results demonstrate that our VSR-DPG significantly outperforms popular baselines in identifying both algebraic equations and ordinary differential equations on a series of benchmarks.

NeurIPS Conference 2023 Conference Paper

Adversarial Model for Offline Reinforcement Learning

  • Mohak Bhardwaj
  • Tengyang Xie
  • Byron Boots
  • Nan Jiang
  • Ching-An Cheng

We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary reference policy regardless of data coverage. ARMOR is designed to optimize policies for the worst-case performance relative to the reference policy through adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR, with any admissible hyperparameter, would never degrade the performance of the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR, which by adversarial training, can optimize policies without using model ensembles in contrast to typical model-based methods. We show that ARMOR achieves performance competitive with state-of-the-art offline model-free and model-based RL algorithms, and can robustly improve the reference policy over various hyperparameter choices.

ICLR Conference 2023 Conference Paper

Explaining RL Decisions with Trajectories

  • Shripad Vilasrao Deshmukh
  • Arpan Dasgupta
  • Balaji Krishnamurthy
  • Nan Jiang
  • Chirag Agarwal
  • Georgios Theocharous
  • Jayakumar Subramanian

Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as grid-worlds, video games (Atari) and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how participants' understanding of the task compares with the data attributed for a trained RL policy.

NeurIPS Conference 2023 Conference Paper

Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

  • Masatoshi Uehara
  • Haruka Kiyohara
  • Andrew Bennett
  • Victor Chernozhukov
  • Nan Jiang
  • Nathan Kallus
  • Chengchun Shi
  • Wen Sun

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play similar roles to classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is close to the true policy value as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope

AAAI Conference 2023 Conference Paper

Learning Markov Random Fields for Combinatorial Structures via Sampling through Lovász Local Lemma

  • Nan Jiang
  • Yi Gu
  • Yexiang Xue

Learning to generate complex combinatorial structures satisfying constraints will have transformative impacts in many application domains. However, it is beyond the capabilities of existing approaches due to the highly intractable nature of the embedded probabilistic inference. Prior works spend most of the training time learning to separate valid from invalid structures but do not learn the inductive biases of valid structures. We develop NEural Lovasz Sampler (NELSON), which embeds the sampler through Lovasz Local Lemma (LLL) as a fully differentiable neural network layer. Our NELSON-CD embeds this sampler into the contrastive divergence learning process of Markov random fields. NELSON allows us to obtain valid samples from the current model distribution. Contrastive divergence is then applied to separate these samples from those in the training set. NELSON is implemented as a fully differentiable neural net, taking advantage of the parallelism of GPUs. Experimental results on several real-world domains reveal that NELSON learns to generate 100% valid structures, while baselines either time out or cannot ensure validity. NELSON also outperforms other approaches in running time, log-likelihood, and MAP scores.

ICML Conference 2023 Conference Paper

Offline Learning in Markov Games with General Function Approximation

  • Yuheng Zhang
  • Yu Bai
  • Nan Jiang

We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium, such as Nash equilibrium and (Coarse) Correlated Equilibrium, from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all three equilibria in a unified manner. By using Bellman-consistent pessimism, we obtain interval estimation for policies’ returns, and use both the upper and the lower bounds to obtain a relaxation on the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed “unilateral concentrability”. Our condition allows selective coverage of deviation policies that optimally trade-off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.

NeurIPS Conference 2022 Conference Paper

A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation

  • Philip Amortila
  • Nan Jiang
  • Dhruv Madeka
  • Dean P. Foster

The current paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly-realizable. It has recently been understood that, even under this seemingly strong assumption and access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries. In particular, Delphi requires $\tilde O(d)$ expert queries and a $\texttt{poly}(d, H, |A|, 1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$, whereas all prior work required at least linear factors of both in addition to the same dependence on $d$. Towards establishing the minimal amount of expert queries needed, we show that, in the same setting, any learner whose exploration budget is *polynomially bounded* (in terms of $d$, $H$, and $|A|$) will require *at least* $\tilde\Omega(\sqrt{d})$ oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear, we show that the lower bound increases to $\tilde\Omega(d)$.

NeurIPS Conference 2022 Conference Paper

Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions

  • Audrey Huang
  • Nan Jiang

Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the ground truth. We provide an exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.

JMLR Journal 2022 Journal Article

Constraint Reasoning Embedded Structured Prediction

  • Nan Jiang
  • Maosen Zhang
  • Willem-Jan van Hoeve
  • Yexiang Xue

Many real-world structured prediction problems need machine learning to capture data distribution and constraint reasoning to ensure structure validity. Nevertheless, constrained structured prediction is still limited in real-world applications because of the lack of tools to bridge constraint satisfaction and machine learning. In this paper, we propose COnstraint REasoning embedded Structured Prediction (Core-Sp), a scalable constraint reasoning and machine learning integrated approach for learning over structured domains. We propose to embed decision diagrams, a popular constraint reasoning tool, as a fully-differentiable module into deep neural networks for structured prediction. We also propose an iterative search algorithm to automate the searching process of the best Core-Sp structure. We evaluate Core-Sp on three applications: vehicle dispatching service planning, if-then program synthesis, and text2SQL generation. The proposed Core-Sp module demonstrates superior performance over state-of-the-art approaches in all three applications. The structures generated with Core-Sp satisfy 100% of the constraints when using exact decision diagrams. In addition, Core-Sp boosts learning performance by reducing the modeling space via constraint satisfaction.

NeurIPS Conference 2022 Conference Paper

Interaction-Grounded Learning with Action-Inclusive Feedback

  • Tengyang Xie
  • Akanksha Saran
  • Dylan J Foster
  • Lekan Molu
  • Ida Momennejad
  • Nan Jiang
  • Paul Mineiro
  • John Langford

Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL’s success in many potential scenarios such as brain-computer interface (BCI) or human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.

NeurIPS Conference 2022 Conference Paper

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

  • Jinglin Chen
  • Aditya Modi
  • Akshay Krishnamurthy
  • Nan Jiang
  • Alekh Agarwal

We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.

NeurIPS Conference 2022 Conference Paper

Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

  • Jiawei Huang
  • Li Zhao
  • Tao Qin
  • Wei Chen
  • Nan Jiang
  • Tie-Yan Liu

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.

NeurIPS Conference 2021 Conference Paper

Bellman-consistent Pessimism for Offline Reinforcement Learning

  • Tengyang Xie
  • Ching-An Cheng
  • Nan Jiang
  • Paul Mineiro
  • Alekh Agarwal

The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal O(d)$ in its sample complexity (when the action space is finite). Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.

NeurIPS Conference 2021 Conference Paper

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

  • Cameron Voloshin
  • Hoang Le
  • Nan Jiang
  • Yisong Yue

We offer an experimental benchmark and empirical study for off-policy policy evaluation (OPE) in reinforcement learning, which is a key problem in many safety-critical applications. Given the increasing interest in deploying learning-based methods, there has been a flurry of recent proposals for OPE methods, leading to a need for standardized empirical analyses. Our work places a strong focus on diversity of experimental design to enable stress testing of OPE methods. We provide a comprehensive benchmarking suite to study the interplay of different attributes on method performance. We distill the results into a summarized set of guidelines for OPE in practice. Our software package, the Caltech OPE Benchmarking Suite (COBS), is open-sourced, and we invite interested researchers to further contribute to the benchmark.

AAAI Conference 2021 Conference Paper

Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration

  • Priyank Agrawal
  • Jinglin Chen
  • Nan Jiang

This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipping variant of a classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). Our $\widetilde{O}(H^2S\sqrt{AT})$ high-probability worst-case regret bound improves the previous sharpest worst-case regret bounds for RLSVI and matches the existing state-of-the-art worst-case TS-based regret bounds.
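
To make the algorithmic template concrete, here is a toy tabular rendering of one randomized planning pass, including a clipping step; the noise scale `sigma`, the use of plain empirical models, and the clip range (which assumes rewards in $[0,1]$) are illustrative simplifications, not the paper's tuned construction:

```python
import numpy as np

def rlsvi_episode_plan(P_hat, R_hat, counts, H, S, A, sigma=1.0):
    """One planning pass of (clipped) RLSVI on an estimated model.

    P_hat: (H, S, A, S) estimated transitions; R_hat: (H, S, A) estimated
    rewards; counts: (H, S, A) visit counts. Returns a randomized Q-table.
    """
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        noise_std = sigma / np.sqrt(np.maximum(counts[h], 1))  # shrinks with data
        noise = np.random.randn(S, A) * noise_std              # TS-like perturbation
        V_next = Q[h + 1].max(axis=1)                          # (S,)
        Q[h] = R_hat[h] + P_hat[h] @ V_next + noise
        Q[h] = np.clip(Q[h], 0.0, H - h)                       # clipping variant
    return Q
```

An episode is then played greedily with respect to the sampled Q, after which the empirical model and counts are updated from the observed transitions.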

NeurIPS Conference 2021 Conference Paper

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

  • Tengyang Xie
  • Nan Jiang
  • Huan Wang
  • Caiming Xiong
  • Yu Bai

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of *policy finetuning*, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp *offline reduction* algorithm---which simply executes $\mu$ and runs offline policy optimization on the collected dataset---that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for *any* policy finetuning algorithm, including those that can adaptively explore the environment. This implies that---perhaps surprisingly---the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding of the benefit of a good reference policy, and take a step toward bridging offline and online RL.
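
The offline reduction is simple enough to state as code; `offline_policy_optimization` is a hypothetical stand-in for any pessimistic offline RL solver whose guarantees depend on single-policy concentrability:

```python
def offline_reduction(env, mu, offline_policy_optimization, num_episodes):
    """Policy finetuning via pure offline reduction: execute the reference
    policy mu without any exploration, then optimize offline on the log."""
    dataset = []
    for _ in range(num_episodes):
        dataset.extend(env.rollout(mu))   # every trajectory is drawn from mu
    return offline_policy_optimization(dataset)
```

The lower bound in the abstract says that, in the minimax sense, no adaptive scheme can beat the better of this reduction and a $\mu$-agnostic online algorithm.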

NeurIPS Conference 2021 Conference Paper

Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

  • Siyuan Zhang
  • Nan Jiang

How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL)---which is crucial for hyperparameter tuning---is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for $Q$-function-based OPE with theoretical guarantees as a byproduct.
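
Schematically, BVFT-style selection is a tournament over the candidate set: each $Q$-function is scored by its worst pairwise loss against the other candidates, and the minimizer is selected. In the sketch below, `pairwise_bvft_loss` is a hypothetical stand-in for the paper's discretization-based projected Bellman error:

```python
def select_q(candidates, dataset, pairwise_bvft_loss):
    """Tournament selection over candidate Q-functions; no hyperparameters
    beyond what pairwise_bvft_loss itself needs."""
    def worst_case_loss(Q_i):
        return max(pairwise_bvft_loss(Q_i, Q_j, dataset)
                   for Q_j in candidates if Q_j is not Q_i)
    return min(candidates, key=worst_case_loss)
```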

NeurIPS Conference 2020 Conference Paper

Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

  • Nan Jiang
  • Jiawei Huang

We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold promise for overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? (2) They are split into two styles (“weight-learning” vs “value-learning”). Can we unify them? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid, and its length quantifies the misspecification of the other class. Our interval also provides a unified view of, and new insights into, some recent methods, and we further explore the implications of our results for exploration and exploitation in off-policy policy optimization with insufficient data coverage.
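
For intuition, the interval is driven by a telescoping identity that holds for any candidate value function $q$, writing $q(s,\pi) := \mathbb{E}_{a\sim\pi(\cdot\mid s)}[q(s,a)]$ and letting $d^\pi$ denote the normalized discounted occupancy of $\pi$:

$$J(\pi) \;=\; \mathbb{E}_{s_0\sim d_0}\big[q(s_0,\pi)\big] \;+\; \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a,r,s')\sim d^\pi}\big[r + \gamma\, q(s',\pi) - q(s,a)\big].$$

Replacing the unknown $d^\pi$ with $w\cdot\mu$ for importance weights $w$ in a class $\mathcal{W}$, and taking the worst case of the correction term over one class while optimizing over the other, yields the two endpoints of the interval; this is a schematic reading of the construction, not the paper's exact statement.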

AAAI Conference 2020 Conference Paper

RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning

  • Nan Jiang
  • Sheng Jin
  • Zhiyao Duan
  • Changshui Zhang

This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in a sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state). The key to this algorithm is a well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that this algorithm is able to respond to the human part and generate a melodic, harmonic, and diverse machine part. Subjective evaluations of preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.
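
A minimal sketch of the online generation loop, with `policy` and `reward_model` as hypothetical stand-ins for the learned components:

```python
def duet_step(policy, reward_model, human_context, machine_context, human_note):
    """One step of online accompaniment: hear the latest human note, then
    emit a machine note scored against both contexts by the reward model."""
    human_context.append(human_note)                 # state covers both parts
    state = (tuple(human_context), tuple(machine_context))
    machine_note = policy.sample(state)              # action = next machine note
    reward = reward_model.score(machine_note, human_context, machine_context)
    machine_context.append(machine_note)
    return machine_note, reward
```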

NeurIPS Conference 2020 Conference Paper

When Counterpoint Meets Chinese Folk Melodies

  • Nan Jiang
  • Sheng Jin
  • Zhiyao Duan
  • Changshui Zhang

Counterpoint is an important concept in Western music theory. In the past century, there has been significant interest in incorporating counterpoint into Chinese folk music composition. In this paper, we propose a reinforcement learning-based system, named FolkDuet, towards online countermelody generation for Chinese folk melodies. With no existing data of Chinese folk duets, FolkDuet employs two reward models based on out-of-domain data, i.e., Bach chorales and monophonic Chinese folk melodies. An interaction reward model is trained on duets formed from the outer parts of Bach chorales to model counterpoint interaction, while a style reward model is trained on monophonic melodies of Chinese folk songs to model melodic patterns. With both rewards, the generator of FolkDuet is trained to generate countermelodies while maintaining the Chinese folk style. The entire generation process is performed in an online fashion, allowing real-time interactive human-machine duet improvisation. Experiments show that the proposed algorithm achieves better subjective and objective results than the baselines.

NeurIPS Conference 2019 Conference Paper

Provably Efficient Q-Learning with Low Switching Cost

  • Yu Bai
  • Tengyang Xie
  • Nan Jiang
  • Yu-Xiang Wang

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of \emph{local switching cost}. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for $H$-step episodic MDPs that achieves sublinear regret and whose local switching cost over $K$ episodes is $O(H^3SA\log K)$; we also provide a lower bound of $\Omega(HSA)$ on the local switching cost of any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting \citep{guo2015concurrent}, which yields nontrivial results that improve upon prior work in certain aspects.
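
The switching-cost mechanism can be illustrated with a count-doubling trigger: the executed policy is frozen and only refreshed when some state-action visit count doubles, giving logarithmically many switches per $(h, s, a)$ triple. This is a schematic of the idea rather than the paper's exact UCB2 schedule:

```python
import numpy as np

class LowSwitchingGreedy:
    """Executes a frozen greedy policy and refreshes it from the learner's
    live Q estimate only when some visit count doubles, so the number of
    policy switches grows logarithmically in the number of episodes."""

    def __init__(self, H, S, A):
        self.frozen_Q = np.zeros((H, S, A))
        self.counts = np.zeros((H, S, A), dtype=int)
        self.count_at_switch = np.ones((H, S, A), dtype=int)

    def act(self, h, s):
        return int(self.frozen_Q[h, s].argmax())

    def observe(self, h, s, a, live_Q):
        self.counts[h, s, a] += 1
        if self.counts[h, s, a] >= 2 * self.count_at_switch[h, s, a]:
            self.count_at_switch[h, s, a] = self.counts[h, s, a]
            self.frozen_Q = live_Q.copy()   # one local policy switch
```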

NeurIPS Conference 2018 Conference Paper

Completing State Representations using Spectral Learning

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh

A central problem in dynamical system modeling is state discovery—that is, finding a compact summary of the past that captures the information needed to predict the future. Predictive State Representations (PSRs) enable clever spectral methods for state discovery; however, while consistent in the limit of infinite data, these methods often suffer from poor performance in the low data regime. In this paper we develop a novel algorithm for incorporating domain knowledge, in the form of an imperfect state representation, as side information to speed spectral learning for PSRs. We prove theoretical results characterizing the relevance of a user-provided state representation, and design spectral algorithms that can take advantage of a relevant representation. Our algorithm utilizes principal angles to extract the relevant components of the representation, and is robust to misspecification. Empirical evaluation on synthetic HMMs, an aircraft identification domain, and a gene splice dataset shows that, even with weak domain knowledge, the algorithm can significantly outperform standard PSR learning.
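
The principal-angle computation at the heart of the method is a small amount of linear algebra; the sketch below shows one plausible way to keep only the components of a user-provided representation that align with the spectral estimate (the threshold and the extraction rule are illustrative, not the paper's exact algorithm):

```python
import numpy as np
from scipy.linalg import orth

def relevant_components(U_spectral, U_side, angle_tol=0.2):
    """Split the side-information subspace: keep directions whose principal
    angle to the spectrally estimated subspace is below angle_tol."""
    A, B = orth(U_spectral), orth(U_side)    # orthonormal bases
    _, s, Vt = np.linalg.svd(A.T @ B)        # s = cosines of principal angles
    k = int(np.sum(s > np.cos(angle_tol)))   # count well-aligned directions
    return B @ Vt[:k].T                      # basis for the relevant part
```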

NeurIPS Conference 2018 Conference Paper

On Oracle-Efficient PAC RL with Rich Observations

  • Christoph Dann
  • Nan Jiang
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford
  • Robert Schapire

We study the computational tractability of PAC reinforcement learning with rich observations. We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. With stochastic hidden state dynamics, we prove that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.

AAAI Conference 2018 Conference Paper

PAC Reinforcement Learning With an Imperfect Model

  • Nan Jiang

Reinforcement learning (RL) methods have proved to be successful in many simulated environments. The common approaches, however, are often too sample intensive to be applied directly in the real world. A promising approach to addressing this issue is to train an RL agent in a simulator and transfer the solution to the real environment. When a high-fidelity simulator is available we would expect significant reduction in the amount of real trajectories needed for learning. In this work we aim at better understanding the theoretical nature of this approach. We start with a perhaps surprising result that, even if the approximate model (e.g., a simulator) only differs from the real environment in a single state-action pair (but which one is unknown), such a model could be information-theoretically useless and the sample complexity (in terms of real trajectories) still scales with the total number of states in the worst case. We investigate the hard instances and come up with natural conditions that avoid the pathological situations. We then propose two conceptually simple algorithms that enjoy polynomial sample complexity guarantees with no dependence on the size of the state-action space, and prove some foundational results to provide insights into this important problem.

RLDM Conference 2017 Conference Abstract

Contextual Decision Processes with low Bellman rank are PAC-Learnable

  • Nan Jiang
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford
  • Robert Schapire

This paper studies systematic exploration for reinforcement learning (RL) with rich observations and function approximation. We introduce contextual decision processes (CDPs), which unify and generalize most prior RL settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally small for many well-studied RL settings. Our second contribution is a new RL algorithm that engages in systematic exploration to learn near-optimal behavior in CDPs with low Bellman rank. The algorithm requires a number of samples that is polynomial in all relevant parameters but independent of the number of unique contexts. Our approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for RL with function approximation.
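
In rough notation, the measure is the rank of an average Bellman-error matrix indexed by pairs of candidate value functions $f, g \in \mathcal{F}$ (a paraphrase of the definition, with $\pi_f$ the greedy policy of $f$):

$$\mathcal{E}_h(f,g) \;=\; \mathbb{E}\Big[\,g(x_h,a_h) - r_h - g\big(x_{h+1},\pi_g(x_{h+1})\big)\;\Big|\; x_h \sim \pi_f,\; a_h \sim \pi_g\,\Big],$$

and the Bellman rank is the largest rank, over levels $h$, of the matrix $[\mathcal{E}_h(f,g)]_{f,g\in\mathcal{F}}$. Sample complexity then scales polynomially with this rank rather than with the number of unique contexts.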

IJCAI Conference 2017 Conference Paper

Exploration of Tree-based Hierarchical Softmax for Recurrent Language Models

  • Nan Jiang
  • Wenge Rong
  • Min Gao
  • Yikang Shen
  • Zhang Xiong

Recently, variants of neural networks for computational linguistics have been proposed and successfully applied to neural language modeling and neural machine translation. These neural models can leverage knowledge from massive corpora, but they are extremely slow, as they predict candidate words from a large vocabulary during training and inference. As an alternative to gradient approximation and softmax with class decomposition, we explore the tree-based hierarchical softmax method and reform its architecture, making it compatible with modern GPUs and introducing a compact tree-based loss function. When combined with several word hierarchical clustering algorithms, improved performance is achieved on language modeling tasks under intrinsic evaluation criteria on the PTB, WikiText-2, and WikiText-103 datasets.
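
The core computation in any tree-based hierarchical softmax is shared: a word's probability is a product of binary decisions along its root-to-leaf path, cutting the per-token cost from $O(|V|)$ to $O(\log |V|)$. A minimal NumPy illustration (the tree layout and parameterization here are generic, not the paper's reformed architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hsm_log_prob(h, path_nodes, path_signs, node_vecs):
    """log p(word | h) over a binary tree.

    h:          hidden state from the RNN, shape (d,)
    path_nodes: indices of internal nodes on the word's root-to-leaf path
    path_signs: +1 / -1 for going left / right at each node
    node_vecs:  (num_internal_nodes, d) parameter matrix
    """
    scores = node_vecs[path_nodes] @ h           # one dot product per tree level
    return float(np.sum(np.log(sigmoid(path_signs * scores))))
```

The training loss for a token is the negative of this log-probability, so only the $O(\log |V|)$ nodes on its path receive gradient.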

NeurIPS Conference 2017 Conference Paper

Repeated Inverse Reinforcement Learning

  • Kareem Amin
  • Nan Jiang
  • Satinder Singh

We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.

RLDM Conference 2017 Conference Abstract

Repeated Inverse Reinforcement Learning for AI Safety

  • Kareem Amin
  • Nan Jiang
  • Satinder Singh

How detailed should we make the goals we prescribe to AI agents acting on our behalf in complex environments? Detailed & low-level specification of goals can be tedious and expensive to create, while abstract & high-level goals could lead to negative surprises, as the agent may find behaviors that we would not want it to perform, i.e., lead to unsafe AI. One approach to addressing this dilemma is for the agent to infer human goals by observing human behavior. This is the Inverse Reinforcement Learning (IRL) problem. However, IRL is generally ill-posed, as there are typically many reward functions for which the observed behavior is optimal. While the use of heuristics to select from among the set of feasible reward functions has led to successful applications of IRL to learning from demonstration, such heuristics do not address AI safety. In this paper we introduce a novel repeated IRL problem that captures an aspect of AI safety as follows. The agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.

AAAI Conference 2017 Conference Paper

Word Embedding Based Correlation Model for Question/Answer Matching

  • Yikang Shen
  • Wenge Rong
  • Nan Jiang
  • Baolin Peng
  • Jie Tang
  • Zhang Xiong

The large-scale Q&A archives accumulated in community-based question answering (CQA) services are an important information and knowledge resource on the web. The question-answer matching task has attracted much attention for its ability to reuse the knowledge stored in these systems: it can be useful in enhancing user experience with recurrent questions. In this paper, a Word Embedding based Correlation (WEC) model is proposed by integrating the advantages of both the translation model and word embeddings. Given a random pair of words, WEC can score their co-occurrence probability in Q&A pairs, while it can also leverage the continuity and smoothness of continuous-space word representations to deal with new pairs of words that are rare in the training parallel text. An experimental study on the Yahoo! Answers and Baidu Zhidao datasets shows this new method's promising potential.
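
As a toy illustration of the word-pair scoring interface (the bilinear form here is a generic stand-in, not the paper's exact parameterization):

```python
import numpy as np

def wec_pair_score(e_q, e_a, M):
    """Toy co-occurrence score for a (question word, answer word) pair:
    a bilinear form over the two word embeddings, squashed to (0, 1).
    Matching a full question/answer then aggregates such pair scores."""
    return 1.0 / (1.0 + np.exp(-(e_q @ M @ e_a)))
```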

ICML Conference 2016 Conference Paper

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

  • Nan Jiang
  • Lihong Li 0001

We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators. We demonstrate the estimator’s accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. We also provide theoretical results on the inherent hardness of the problem, and show that our estimator can match the lower bound in certain scenarios.
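
The sequential extension can be written as a backward recursion along a logged trajectory $(s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$, with $\hat{Q}$ and $\hat{V}$ approximate value functions of the target policy $\pi$ and $\mu$ the behavior policy (the standard form of the trajectory-wise doubly robust estimator):

$$\hat{V}_{\mathrm{DR}}^{(t)} \;=\; \hat{V}(s_t) \;+\; \rho_t\Big(r_t + \gamma\,\hat{V}_{\mathrm{DR}}^{(t+1)} - \hat{Q}(s_t,a_t)\Big), \qquad \rho_t = \frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)},$$

with $\hat{V}_{\mathrm{DR}}^{(H+1)} = 0$; the value estimate is $\hat{V}_{\mathrm{DR}}^{(1)}$, which remains unbiased whenever the importance ratios are exact, regardless of the accuracy of $\hat{Q}$, while an accurate $\hat{Q}$ drives the variance down.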

AAAI Conference 2016 Conference Paper

Improving Predictive State Representations via Gradient Descent

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh

Predictive state representations (PSRs) model dynamical systems using appropriately chosen predictions about future observations as a representation of the current state. In contrast to the hidden states posited by HMMs or RNNs, PSR states are directly observable in the training data; this gives rise to a moment-matching spectral algorithm for learning PSRs that is computationally efficient and statistically consistent when the model complexity matches that of the true system generating the data. In practice, however, model mismatch is inevitable and while spectral learning remains appealingly fast and simple it may fail to find optimal models. To address this problem, we investigate the use of gradient methods for improving spectrally-learned PSRs. We show that only a small amount of additional gradient optimization can lead to significant performance gains, and moreover that initializing gradient methods with the spectral learning solution yields better models in significantly less time than starting from scratch.

IJCAI Conference 2016 Conference Paper

On Structural Properties of MDPs that Bound Loss Due to Shallow Planning

  • Nan Jiang
  • Satinder Singh
  • Ambuj Tewari

Planning in MDPs often uses a smaller planning horizon than specified in the problem to save computational expense at the risk of a loss due to suboptimal plans. Jiang et al. [2015b] recently showed that smaller than specified planning horizons can in fact be beneficial in cases where the MDP model is learned from data and therefore not accurate. In this paper, we consider planning with accurate models and investigate structural properties of MDPs that bound the loss incurred by using smaller than specified planning horizons. We identify a number of structural parameters some of which depend on the reward function alone, some on the transition dynamics alone, and some that depend on the interaction between rewards and transition dynamics. We provide planning loss bounds in terms of these structural parameters and, in some cases, also show tightness of the upper bounds. Empirical results with randomly generated MDPs are used to validate qualitative properties of our theoretical bounds for shallow planning.

IJCAI Conference 2016 Conference Paper

The Dependence of Effective Planning Horizon on Model Accuracy

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh
  • Richard Lewis

Because planning with a long horizon (i.e., looking far into the future) is computationally expensive, it is common in practice to save time by using reduced horizons. This is usually understood to come at the expense of computing suboptimal plans, which is the case when the planning model is exact. However, when the planning model is estimated from data, as is frequently true in the real world, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies available to the planning algorithm, having an intuitive, monotonic relationship with a simple measure of complexity. We prove a planning loss bound predicting that shorter planning horizons can reduce overfitting and improve test performance, and we confirm these predictions empirically.
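
Schematically, the bound has the bias-variance shape below, for a guidance discount $\gamma' \le \gamma$ and a model estimated from $n$ samples per state-action pair (constants and the precise complexity measure are suppressed; this is a reading of the result's structure, not its exact statement):

$$\mathrm{PlanningLoss}(\gamma') \;\lesssim\; \underbrace{\frac{\gamma-\gamma'}{(1-\gamma)(1-\gamma')}\,R_{\max}}_{\text{bias of the shorter horizon}} \;+\; \underbrace{\frac{R_{\max}}{1-\gamma'}\,\sqrt{\frac{\mathrm{complexity}(\gamma')}{n}}}_{\text{overfitting to model error}},$$

where $\mathrm{complexity}(\gamma')$ grows with $\gamma'$, so the optimal guidance discount moves toward the true $\gamma$ as $n$ increases.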

AAAI Conference 2015 Conference Paper

Spectral Learning of Predictive State Representations with Insufficient Statistics

  • Alex Kulesza
  • Nan Jiang
  • Satinder Singh

Predictive state representations (PSRs) are models of dynamical systems that represent state as a vector of predictions about future observable events (tests) conditioned on past observed events (histories). If a practitioner selects finite sets of tests and histories that are known to be sufficient to completely capture the system, an exact PSR can be learned in polynomial time using spectral methods. However, most real-world systems are complex, and in practice computational constraints limit us to small sets of tests and histories which are therefore never truly sufficient. How, then, should we choose these sets? Existing theory offers little guidance here, and yet we show that the choice is highly consequential: tests and histories selected at random or by a naïve rule significantly underperform the best sets. In this paper we approach the problem both theoretically and empirically. While any fixed system can be represented by an infinite number of equivalent but distinct PSRs, we show that in the computationally unconstrained setting, where existing theory guarantees accurate predictions, the PSRs learned by spectral methods always satisfy a particular spectral bound. Adapting this idea, we propose a simple algorithmic technique to search for sets of tests and histories that approximately satisfy the bound while respecting computational limits. Empirically, our method significantly reduces prediction errors compared to standard spectral learning approaches.
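
For context, the textbook spectral-learning step that the proposed search wraps around: estimate the test/history probability matrices, take a rank-$k$ SVD, and form the PSR parameters. Regularization and the bound-guided search over tests and histories are omitted; this is a generic sketch, not the paper's algorithm:

```python
import numpy as np

def learn_psr(P_T, P_H, P_TH, P_Tao_H, k):
    """Textbook spectral PSR learning from estimated probability matrices.

    P_T:     (|T|,) probabilities of tests from the initial state
    P_H:     (|H|,) probabilities of histories
    P_TH:    (|T|, |H|) joint test/history probability estimates
    P_Tao_H: dict mapping each (action, observation) pair to its
             (|T|, |H|) matrix of P(test, a, o, history) estimates
    """
    U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
    U = U[:, :k]                               # top-k test directions
    pinv = np.linalg.pinv(U.T @ P_TH)          # shared pseudo-inverse
    B = {ao: (U.T @ M) @ pinv for ao, M in P_Tao_H.items()}  # update operators
    b1 = U.T @ P_T                             # initial state
    binf = pinv.T @ P_H                        # normalization vector
    # Filtering after observing (a, o): b <- B[(a, o)] @ b / (binf @ B[(a, o)] @ b)
    return b1, binf, B
```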