Arrow Research search

Author name cluster

Wee Sun Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

53 papers
2 author rows

Possible papers

53

NeurIPS Conference 2025 Conference Paper

Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions

  • Guoji Fu
  • Wee Sun Lee

This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution $P_0$ from $n$ i. i. d. observations in $d$ dimensions. Assuming merely that $P_0$ is $\alpha$-sub-Gaussian, we prove that for any time step $t \in [t_0, n^{\mathcal{O}(1)}]$, where $t_0 > \mathcal{O}(\alpha^2n^{-2/d}\log n)$, there exists a deep ReLU neural network with width $\leq \mathcal{O}(n^{\frac{3}{d}}\log_2n)$ and depth $\leq \mathcal{O}(\log^2n)$ that can approximate the scores with $\tilde{\mathcal{O}}(n^{-1})$ mean square error and achieve a nearly optimal rate of $\tilde{\mathcal{O}}(n^{-1}t_0^{-d/2})$ for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function $p_0$ lies in Sobolev or Besov classes, with an appropriately early stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.

ICML Conference 2025 Conference Paper

Continual Reinforcement Learning by Planning with Online World Models

  • Zichen Liu
  • Guoji Fu
  • Chao Du
  • Wee Sun Lee
  • Min Lin

Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.

ICLR Conference 2025 Conference Paper

Learning to Search from Demonstration Sequences

  • Dixant Mittal
  • Liwei Kang
  • Wee Sun Lee

Search and planning are essential for solving many real-world problems. However, in numerous learning scenarios, only action-observation sequences, such as demonstrations or instruction sequences, are available for learning. Relying solely on supervised learning with these sequences can lead to sub-optimal performance due to the vast, unseen search space encountered during training. In this paper, we introduce Differentiable Tree Search Network (D-TSN), a novel neural network architecture that learns to construct search trees from just sequences of demonstrations by performing gradient descent on a best-first search tree construction algorithm. D-TSN enables the joint learning of submodules, including an encoder, value function, and world model, which are essential for planning. To construct the search tree, we employ a stochastic tree expansion policy and formulate it as another decision-making task. Then, we optimize the tree expansion policy via REINFORCE with an effective variance reduction technique for the gradient computation. D-TSN can be applied to problems with a known world model or to scenarios where it needs to jointly learn a world model with a latent state space. We study problems from these two scenarios, including Game of 24, 2D grid navigation, and Procgen games, to understand when D-TSN is more helpful. Through our experiments, we show that D-TSN is effective, especially when the world model with a latent state space is jointly learned. The code is available at https://github.com/dixantmittal/differentiable-tree-search-network.

NeurIPS Conference 2025 Conference Paper

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

  • Penghui Qi
  • Zichen Liu
  • Tianyu Pang
  • Chao Du
  • Wee Sun Lee
  • Min Lin

Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present AnytimeReasoner, a novel framework for optimizing reasoning performance under varying thinking budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.

ICML Conference 2025 Conference Paper

SHIELD: Multi-task Multi-distribution Vehicle Routing Solver with Sparsity and Hierarchy

  • Yong Liang Goh
  • Zhiguang Cao
  • Yining Ma 0001
  • Jianan Zhou 0002
  • Mohammed Haroon Dupty
  • Wee Sun Lee

Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically select nodes to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.

NeurIPS Conference 2025 Conference Paper

Solving the Asymmetric Traveling Salesman Problem via Trace-Guided Cost Augmentation

  • Zhen Zhang
  • Prof Javen Qinfeng Shi
  • Wee Sun Lee

The Asymmetric Traveling Salesman Problem (ATSP) ranks among the most fundamental and notoriously difficult problems in combinatorial optimization. We propose a novel continuous relaxation framework for the Asymmetric Traveling Salesman Problem (ATSP) by leveraging differentiable constraints that encourage acyclic structures and valid permutations. Our approach integrates a differentiable trace-based Directed Acyclic Graph (DAG) constraint with a doubly stochastic matrix relaxation of the assignment problem, enabling gradient-based optimization over soft permutations. We develop a projected exponentiated gradient method with adaptive step size to minimize tour cost while satisfying the relaxed constraints. To recover high-quality discrete tours, we introduce a greedy post-processing procedure that iteratively corrects subtours using cost-aware cycle merging. Our method achieves state-of-the-art performance on standard asymmetric TSP benchmarks and demonstrates competitive scalability and accuracy, particularly on large or asymmetric instances where heuristic solvers such as LKH-3 struggle.

ICLR Conference 2024 Conference Paper

Locality Sensitive Sparse Encoding for Learning World Models Online

  • Zichen Liu
  • Chao Du
  • Wee Sun Lee
  • Min Lin

Acquiring an accurate world model $\textit{online}$ for model-based reinforcement learning (MBRL) is challenging due to data nonstationarity, which typically causes catastrophic forgetting for neural networks (NNs). From the online learning perspective, a Follow-The-Leader (FTL) world model is desirable, which optimally fits all previous experiences at each round. Unfortunately, NN-based models need re-training on all accumulated data at every interaction step to achieve FTL, which is computationally expensive for lifelong agents. In this paper, we revisit models that can achieve FTL with incremental updates. Specifically, our world model is a linear regression model supported by nonlinear random features. The linear part ensures efficient FTL update while the nonlinear random feature empowers the fitting of complex environments. To best trade off model capacity and computation efficiency, we introduce a locality sensitive sparse encoding, which allows us to conduct efficient sparse updates even with very high dimensional nonlinear features. We validate the representation power of our encoding and verify that it allows efficient online learning under data covariate shift. We also show, in the Dyna MBRL setting, that our world models learned online using a $\textit{single pass}$ of trajectory data either surpass or match the performance of deep world models trained with replay and other continual learning methods.

ICRA Conference 2023 Conference Paper

Differentiable Parsing and Visual Grounding of Natural Language Instructions for Object Placement

  • Zirui Zhao
  • Wee Sun Lee
  • David Hsu

We present a new method, PARsing And visual GrOuNding (PARAGON), for grounding natural language in object placement tasks. Natural language generally describes objects and spatial relations with compositionality and ambiguity, two major obstacles to effective language grounding. For compositionality, Paragon parses a language instruction into an object-centric graph representation to ground objects individually. For ambiguity, Paragon uses a novel particle-based graph neural network to reason about object placements with uncertainty. Essentially, Paragon integrates a parsing algorithm into a probabilistic, data-driven learning framework. It is fully differentiable and trained end-to-end from data for robustness against complex, ambiguous language input.

ICLR Conference 2023 Conference Paper

Efficient Offline Policy Optimization with a Learned Model

  • Zichen Liu
  • Siyi Li
  • Wee Sun Lee
  • Shuicheng Yan
  • Zhongwen Xu

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, thus costing huge computing time. This paper investigates a few hypotheses where MuZero Unplugged may not work well under the offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; 4) with a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage with regularization of the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks. Our implementation is open-sourced at https://github.com/sail-sg/rosmo.

AAMAS Conference 2023 Conference Paper

ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search

  • Dixant Mittal
  • Siddharth Aravindan
  • Wee Sun Lee

Online tree-based search algorithms iteratively simulate trajectories and update action-values for a set of states stored in a tree structure. It works reasonably well in practice but fails to e�ectively utilise the information gathered from similar states. Depending upon the smoothness of the action-value function, one approach to overcoming this issue is through online learning, where information is interpolated among similar states; Policy Gradient Search provides a practical algorithm to achieve this. However, Policy Gradient Search lacks an explicit exploration mechanism, which is a key feature of tree-based online search algorithms. In this paper, we propose an e�cient and e�ective online search algorithm called Exploratory Policy Gradient Search (ExPoSe), which leverages information sharing among states by updating the search policy parameters directly, while incorporating a well-de�ned exploration mechanism during the online search process. We evaluate ExPoSe on a range of decision-making problems, including Atari games, Sokoban, and Hamiltonian cycle search in sparse graphs. The results demonstrate that ExPoSe consistently outperforms other popular online search algorithms across all domains. The ExPoSe source code is available at https: //github. com/dixantmittal/ExPoSe.

JMLR Journal 2023 Journal Article

Factor Graph Neural Networks

  • Zhen Zhang
  • Mohammed Haroon Dupty
  • Fan Wu
  • Javen Qinfeng Shi
  • Wee Sun Lee

In recent years, we have witnessed a surge of Graph Neural Networks (GNNs), most of which can learn powerful representations in an end-to-end fashion with great success in many real-world applications. They have resemblance to Probabilistic Graphical Models (PGMs), but break free from some limitations of PGMs. By aiming to provide expressive methods for representation learning instead of computing marginals or most likely configurations, GNNs provide flexibility in the choice of information flowing rules while maintaining good performance. Despite their success and inspirations, they lack efficient ways to represent and learn higher-order relations among variables/nodes. More expressive higher-order GNNs which operate on k-tuples of nodes need increased computational resources in order to process higher-order tensors. We propose Factor Graph Neural Networks (FGNNs) to effectively capture higher-order relations for inference and learning. To do so, we first derive an efficient approximate Sum-Product loopy belief propagation inference algorithm for discrete higher-order PGMs. We then neuralize the novel message passing scheme into a Factor Graph Neural Network (FGNN) module by allowing richer representations of the message update rules; this facilitates both efficient inference and powerful end-to-end learning. We further show that with a suitable choice of message aggregation operators, our FGNN is also able to represent Max-Product belief propagation, providing a single family of architecture that can represent both Max and Sum-Product loopy belief propagation. Our extensive experimental evaluation on synthetic as well as real datasets demonstrates the potential of the proposed model. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2023. ( edit, beta )

NeurIPS Conference 2023 Conference Paper

Large Language Models as Commonsense Knowledge for Large-Scale Task Planning

  • Zirui Zhao
  • Wee Sun Lee
  • David Hsu

Large-scale task planning is a major challenge. Recent work exploits large language models (LLMs) directly as a policy and shows surprisingly interesting results. This paper shows that LLMs provide a commonsense model of the world in addition to a policy that acts on it. The world model and the policy can be combined in a search algorithm, such as Monte Carlo Tree Search (MCTS), to scale up task planning. In our new LLM-MCTS algorithm, the LLM-induced world model provides a commonsense prior belief for MCTS to achieve effective reasoning; the LLM-induced policy acts as a heuristic to guide the search, vastly improving search efficiency. Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3. 5) by a wide margin, for complex, novel tasks. Further experiments and analyses on multiple tasks -- multiplication, travel planning, object rearrangement -- suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy.

ICRA Conference 2022 Conference Paper

Learning Latent Graph Dynamics for Visual Manipulation of Deformable Objects

  • Xiao Ma 0006
  • David Hsu
  • Wee Sun Lee

Manipulating deformable objects, such as ropes and clothing, is a long-standing challenge in robotics, because of their large degrees of freedom, complex non-linear dynamics, and self-occlusion in visual perception. The key difficulty is a suitable representation, rich enough to capture the object shape, dynamics for manipulation and yet simple enough to be estimated reliably from visual observations. This work aims to learn latent Graph dynamics for DefOrmable Object Manipulation (G-DOOM). G-DOOM approximates a deformable object as a sparse set of interacting keypoints, which are extracted automatically from images via unsupervised learning. It learns a graph neural network that captures abstractly the geometry and the interaction dynamics of the keypoints. To handle object self-occlusion, G-DOOM uses a recurrent neural network to track the keypoints over time and condition their interactions on the history. We then train the resulting recurrent graph dynamics model through contrastive learning in a high-fidelity simulator. For manipulation planning, G-DOOM reasons explicitly about the learned dynamics model through model-predictive control applied at each keypoint. Preliminary experiments of G-DOOM on a set of challenging rope and cloth manipulation tasks indicate strong performance, compared with state-of-the-art methods. Although trained in a simulator, G-DOOM transfers directly to a real robot for both rope and cloth manipulation 1 1 Demo video available online at https://youtu.be/oCfbNMx2sQI.

IJCAI Conference 2022 Conference Paper

None Class Ranking Loss for Document-Level Relation Extraction

  • Yang Zhou
  • Wee Sun Lee

Document-level relation extraction (RE) aims at extracting relations among entities expressed across multiple sentences, which can be viewed as a multi-label classification problem. In a typical document, most entity pairs do not express any pre-defined relation and are labeled as "none" or "no relation". For good document-level RE performance, it is crucial to distinguish such none class instances (entity pairs) from those of pre-defined classes (relations). However, most existing methods only estimate the probability of pre-defined relations independently without considering the probability of "no relation". This ignores the context of entity pairs and the label correlations between the none class and pre-defined classes, leading to sub-optimal predictions. To address this problem, we propose a new multi-label loss that encourages large margins of label confidence scores between each pre-defined class and the none class, which enables captured label correlations and context-dependent thresholding for label prediction. To gain further robustness against positive-negative imbalance and mislabeled data that could appear in real-world RE datasets, we propose a margin regularization and a margin shifting technique. Experimental results demonstrate that our method significantly outperforms existing multi-label losses for document-level RE and works well in other multi-label tasks such as emotion classification when none class instances are available for training.

ICLR Conference 2022 Conference Paper

PF-GNN: Differentiable particle filtering based approximation of universal graph representations

  • Mohammed Haroon Dupty
  • Yanfei Dong
  • Wee Sun Lee

Message passing Graph Neural Networks (GNNs) are known to be limited in expressive power by the 1-WL color-refinement test for graph isomorphism. Other more expressive models either are computationally expensive or need preprocessing to extract structural features from the graph. In this work, we propose to make GNNs universal by guiding the learning process with exact isomorphism solver techniques which operate on the paradigm of $\textit{Individualization and refinement}$ (IR), a method to artificially introduce asymmetry and further refine the coloring when 1-WL stops. Isomorphism solvers generate a search-tree of colorings whose leaves uniquely identify the graph. However, the tree grows exponentially large and needs hand-crafted pruning techniques which are not desirable from a learning perspective. We take a probabilistic view and approximate the search tree of colorings ( i.e. embeddings) by sampling multiple paths from root to leaves of the search-tree. To learn more discriminative representations, we guide the sampling process with $\textit{particle filter}$ updates, a principled approach for sequential state estimation. Our algorithm is end-to-end differentiable, can be applied with any GNN as backbone and learns richer graph representations with only linear increase in runtime. Experimental evaluation shows that our approach consistently outperforms leading GNN models on both synthetic benchmarks for isomorphism detection as well as real-world datasets.

AAMAS Conference 2021 Conference Paper

State-Aware Variational Thompson Sampling for Deep Q-Networks

  • Siddharth Aravindan
  • Wee Sun Lee

Thompson sampling is a well-known approach for balancing exploration and exploitation in reinforcement learning. It requires the posterior distribution of value-action functions to be maintained; this is generally intractable for tasks that have a high dimensional state-action space. We derive a variational Thompson sampling approximation for DQNs which uses a deep network whose parameters are perturbed by a learned variational noise distribution. We interpret the successful NoisyNets method [11] as an approximation to the variational Thompson sampling method that we derive. Further, we propose State Aware Noisy Exploration (SANE) which seeks to improve on NoisyNets by allowing a non-uniform perturbation, where the amount of parameter perturbation is conditioned on the state of the agent. This is done with the help of an auxiliary perturbation module, whose output is state dependent and is learnt end to end with gradient descent. We hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain high risk states may result in the agent failing badly. We demonstrate the effectiveness of the state-aware exploration method in the off-policy setting by augmenting DQNs with the auxiliary perturbation module.

ICLR Conference 2020 Conference Paper

Discriminative Particle Filter Reinforcement Learning for Complex Partial observations

  • Xiao Ma 0006
  • Péter Karkus
  • David Hsu
  • Wee Sun Lee
  • Nan Ye

Deep reinforcement learning is successful in decision making for sophisticated games, such as Atari, Go, etc. However, real-world decision making often requires reasoning with partial information extracted from complex visual observations. This paper presents Discriminative Particle Filter Reinforcement Learning (DPFRL), a new reinforcement learning framework for complex partial observations. DPFRL encodes a differentiable particle filter in the neural network policy for explicit reasoning with partial observations over time. The particle filter maintains a belief using learned discriminative update, which is trained end-to-end for decision making. We show that using the discriminative update instead of standard generative models results in significantly improved performance, especially for tasks with complex visual observations, because they circumvent the difficulty of modeling complex observations that are irrelevant to decision making. In addition, to extract features from the particle belief, we propose a new type of belief feature based on the moment generating function. DPFRL outperforms state-of-the-art POMDP RL models in Flickering Atari Games, an existing POMDP RL benchmark, and in Natural Flickering Atari Games, a new, more challenging POMDP RL benchmark introduced in this paper. Further, DPFRL performs well for visual navigation with real-world data in the Habitat environment.

NeurIPS Conference 2020 Conference Paper

Factor Graph Neural Networks

  • Zhen Zhang
  • Fan Wu
  • Wee Sun Lee

Most of the successful deep neural network architectures are structured, often consisting of elements like convolutional neural networks and gated recurrent neural networks. Recently, graph neural networks (GNNs) have been successfully applied to graph-structured data such as point cloud and molecular data. These networks often only consider pairwise dependencies, as they operate on a graph structure. We generalize the GNN into a factor graph neural network (FGNN) providing a simple way to incorporate dependencies among multiple variables. We show that FGNN is able to represent Max-Product belief propagation, an approximate inference method on probabilistic graphical models, providing a theoretical understanding on the capabilities of FGNN and related GNNs. Experiments on synthetic and real datasets demonstrate the potential of the proposed architecture.

AAAI Conference 2020 Conference Paper

Particle Filter Recurrent Neural Networks

  • Xiao Ma
  • Peter Karkus
  • David Hsu
  • Wee Sun Lee

Recurrent neural networks (RNNs) have been extraordinarily successful for prediction with sequential data. To tackle highly variable and multi-modal real-world data, we introduce Particle Filter Recurrent Neural Networks (PF-RNNs), a new RNN family that explicitly models uncertainty in its internal structure: while an RNN relies on a long, deterministic latent state vector, a PF-RNN maintains a latent state distribution, approximated as a set of particles. For effective learning, we provide a fully differentiable particle filter algorithm that updates the PF-RNN latent state distribution according to the Bayes rule. Experiments demonstrate that the proposed PF- RNNs outperform the corresponding standard gated RNNs on a synthetic robot localization dataset and 10 real-world sequence prediction datasets for text classification, stock price prediction, etc.

AAAI Conference 2020 Conference Paper

Visual Relationship Detection with Low Rank Non-Negative Tensor Decomposition

  • Mohammed Haroon Dupty
  • Zhen Zhang
  • Wee Sun Lee

We address the problem of Visual Relationship Detection (VRD) which aims to describe the relationships between pairs of objects in the form of triplets of (subject, predicate, object). We observe that given a pair of bounding box proposals, objects often participate in multiple relations implying the distribution of triplets is multimodal. We leverage the strong correlations within triplets to learn the joint distribution of triplet variables conditioned on the image and the bounding box proposals, doing away with the hitherto used independent distribution of triplets. To make learning the triplet joint distribution feasible, we introduce a novel technique of learning conditional triplet distributions in the form of their normalized low rank non-negative tensor decompositions. Normalized tensor decompositions take form of mixture distributions of discrete variables and thus are able to capture multimodality. This allows us to efficiently learn higher order discrete multimodal distributions and at the same time keep the parameter size manageable. We further model the probability of selecting an object proposal pair and include a relation triplet prior in our model. We show that each part of the model improves performance and the combination outperforms stateof-the-art score on the Visual Genome (VG) and Visual Relationship Detection (VRD) datasets.

ICRA Conference 2019 Conference Paper

Factored Contextual Policy Search with Bayesian optimization

  • Robert Pinsler
  • Péter Karkus
  • Andras G. Kupcsik
  • David Hsu
  • Wee Sun Lee

Scarce data is a major challenge to scaling robot learning to truly complex tasks, as we need to generalize locally learned policies over different task contexts. Contextual policy search offers data-efficient learning and generalization by explicitly conditioning the policy on a parametric context space. In this paper, we further structure the contextual policy representation. We propose to factor contexts into two components: target contexts that describe the task objectives, e. g. target position for throwing a ball; and environment contexts that characterize the environment, e. g. initial position or mass of the ball. Our key observation is that experience can be directly generalized over target contexts. We show that this can be easily exploited in contextual policy search algorithms. In particular, we apply factorization to a Bayesian optimization approach to contextual policy search both in sampling-based and active learning settings. Our simulation results show faster learning and better generalization in various robotic domains. See our supplementary video: https://youtu.be/IIJTbBAOufDY.

ICRA Conference 2019 Conference Paper

Learning To Grasp Under Uncertainty Using POMDPs

  • Neha P. Garg
  • David Hsu
  • Wee Sun Lee

Robust object grasping under uncertainty is an essential capability of service robots. Many existing approaches rely on far-field sensors, such as cameras, to compute a grasp pose and perform open-loop grasp after placing gripper under the pose. This often fails as a result of sensing or environment uncertainty. This paper presents a principled, general and efficient approach to adaptive grasping, using both tactile and visual sensing as feedback. We first model adaptive grasping as a partially observable Markov decision process (POMDP), which handles uncertainty naturally. We solve the POMDP for sampled objects from a set, in order to generate data for learning. Finally, we train a grasp policy, represented as a deep recurrent neural network (RNN), in simulation through imitation learning. By combining model-based POMDP planning and imitation learning, the proposed approach achieves robustness under uncertainty, generalization over many objects, and fast execution. In particular, we show that modeling only a small sample of objects enables us to learn a robust strategy to grasp previously unseen objects of varying shapes and recover from failure over multiple steps. Experiments on the G3DB object dataset in simulation and a smaller object set with a real robot indicate promising results.

JAIR Journal 2017 Journal Article

DESPOT: Online POMDP Planning with Regularization

  • Nan Ye
  • Adhiraj Somani
  • David Hsu
  • Wee Sun Lee

The partially observable Markov decision process (POMDP) provides a principled general framework for planning under uncertainty, but solving POMDPs optimally is computationally intractable, due to the "curse of dimensionality" and the "curse of history". To overcome these challenges, we introduce the Determinized Sparse Partially Observable Tree (DESPOT), a sparse approximation of the standard belief tree, for online planning under uncertainty. A DESPOT focuses online planning on a set of randomly sampled scenarios and compactly captures the "execution" of all policies under these scenarios. We show that the best policy obtained from a DESPOT is near-optimal, with a regret bound that depends on the representation size of the optimal policy. Leveraging this result, we give an anytime online planning algorithm, which searches a DESPOT for a policy that optimizes a regularized objective function. Regularization balances the estimated value of a policy under the sampled scenarios and the policy size, thus avoiding overfitting. The algorithm demonstrates strong experimental results, compared with some of the best online POMDP algorithms available. It has also been incorporated into an autonomous driving system for real-time vehicle control. The source code for the algorithm is available online.

NeurIPS Conference 2017 Conference Paper

QMDP-Net: Deep Learning for Planning under Partial Observability

  • Peter Karkus
  • David Hsu
  • Wee Sun Lee

This paper introduces the QMDP-net, a neural network architecture for planning under partial observability. The QMDP-net combines the strengths of model-free learning and model-based planning. It is a recurrent policy network, but it represents a policy for a parameterized set of tasks by connecting a model with a planning algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture. The QMDP-net is fully differentiable and allows for end-to-end training. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set and “transfer” to other similar tasks beyond the set. In preliminary experiments, QMDP-net showed strong performance on several robotic tasks in simulation. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.

UAI Conference 2017 Conference Paper

Shortest Path under Uncertainty: Exploration versus Exploitation

  • Zhan Wei Lim
  • David Hsu
  • Wee Sun Lee

In the Canadian Traveler Problem (CTP), a traveler seeks a shortest path to a destination through a road network, but unknown to the traveler, some roads may be blocked. This paper studies the Bayesian CTP (BCTP), in which road states are correlated with known prior probabilities and the traveler can infer the states of an unseen road from past observations of other correlated roads. As generalized shortest-path problems, CTP and BCTP have important practical applications. We show that BCTP is NP-complete and give a polynomial-time approximation algorithm, Hedged Shortest Path under Determinization (HSPD), which approximates an optimal solution with a polylogarithmic factor. Preliminary experiments show promising results. HSPD outperforms a widely used greedy algorithm and a state-of-the-art UCT-based search algorithm for CTP, especially when significant exploration is required.

ICML Conference 2017 Conference Paper

Tensor Belief Propagation

  • Andrew Wrigley
  • Wee Sun Lee
  • Nan Ye

We propose a new approximate inference algorithm for graphical models, tensor belief propagation, based on approximating the messages passed in the junction tree algorithm. Our algorithm represents the potential functions of the graphical model and all messages on the junction tree compactly as mixtures of rank-1 tensors. Using this representation, we show how to perform the operations required for inference on the junction tree efficiently: marginalisation can be computed quickly due to the factored form of rank-1 tensors while multiplication can be approximated using sampling. Our analysis gives sufficient conditions for the algorithm to perform well, including for the case of high-treewidth graphs, for which exact inference is intractable. We compare our algorithm experimentally with several approximate inference algorithms and show that it performs well.

IROS Conference 2016 Conference Paper

Act to See and See to Act: POMDP planning for objects search in clutter

  • Jue Kun Li
  • David Hsu
  • Wee Sun Lee

We study the problem of objects search in clutter. In cluttered environments, partial occlusion among objects prevents vision systems from correctly recognizing objects. Hence, the agent needs to move objects around to gather information, which helps reduce uncertainty in perception. At the same time, the agent needs to minimize the efforts of moving objects to reduce the time required to complete the task. We model the problem as a Partially Observable Markov Decision Process (POMDP), formulating it as a problem of optimal decision making under uncertainty. By exploiting spatial constraints, we are able to adapt online POMDP planners to handle objects search problems with large state space and action space. Experiments show that the POMDP solution outperforms greedy approaches, especially in cases where multi-step manipulation is required.

ICRA Conference 2016 Conference Paper

POMDP-lite for robust robot planning under uncertainty

  • Min Chen 0018
  • Emilio Frazzoli
  • David Hsu
  • Wee Sun Lee

The partially observable Markov decision process (POMDP) provides a principled general model for planning under uncertainty. However, solving a general POMDP is computationally intractable in the worst case. This paper introduces POMDP-lite, a subclass of POMDPs in which the hidden state variables are constant or only change deterministically. We show that a POMDP-lite is equivalent to a set of fully observable Markov decision processes indexed by a hidden parameter and is useful for modeling a variety of interesting robotic tasks. We develop a simple model-based Bayesian reinforcement learning algorithm to solve POMDP-lite models. The algorithm performs well on large-scale POMDP-lite models with up to 1020 states and outperforms the state-of-the-art general-purpose POMDP algorithms. We further show that the algorithm is near-Bayesian-optimal under suitable conditions.

NeurIPS Conference 2015 Conference Paper

Adaptive Stochastic Optimization: From Sets to Paths

  • Zhan Wei Lim
  • David Hsu
  • Wee Sun Lee

Adaptive stochastic optimization optimizes an objective function adaptively under uncertainty. Adaptive stochastic optimization plays a crucial role in planning and learning under uncertainty, but is, unfortunately, computationally intractable in general. This paper introduces two conditions on the objective function, the marginal likelihood rate bound and the marginal likelihood bound, which enable efficient approximate solution of adaptive stochastic optimization. Several interesting classes of functions satisfy these conditions naturally, e. g. , the version space reduction function for hypothesis learning. We describe Recursive Adaptive Coverage (RAC), a new adaptive stochastic optimization algorithm that exploits these conditions, and apply it to two planning tasks under uncertainty. In constrast to the earlier submodular optimization approach, our algorithm applies to adaptive stochastic optimization algorithm over both sets and paths.

ICRA Conference 2015 Conference Paper

Intention-aware online POMDP planning for autonomous driving in a crowd

  • Haoyu Bai
  • Shaojun Cai
  • Nan Ye
  • David Hsu
  • Wee Sun Lee

This paper presents an intention-aware online planning approach for autonomous driving amid many pedestrians. To drive near pedestrians safely, efficiently, and smoothly, autonomous vehicles must estimate unknown pedestrian intentions and hedge against the uncertainty in intention estimates in order to choose actions that are effective and robust. A key feature of our approach is to use the partially observable Markov decision process (POMDP) for systematic, robust decision making under uncertainty. Although there are concerns about the potentially high computational complexity of POMDP planning, experiments show that our POMDP-based planner runs in near real time, at 3 Hz, on a robot golf cart in a complex, dynamic environment. This indicates that POMDP planning is improving fast in computational efficiency and becoming increasingly practical as a tool for robot planning under uncertainty.

ICAPS Conference 2015 Conference Paper

PLEASE: Palm Leaf Search for POMDPs with Large Observation Spaces

  • Zongzhang Zhang
  • David Hsu
  • Wee Sun Lee
  • Zhan Wei Lim
  • Aijun Bai

Trial-based asynchronous value iteration algorithms for large Partially Observable Markov Decision Processes (POMDPs), such as HSVI2, FSVI and SARSOP, have made impressive progress in the past decade. In the forward exploration phase of these algorithms, only the outcome that has the highest potential impact is searched. This paper provides a novel approach, called Palm LEAf SEarch (PLEASE), which allows the selection of more than one outcome when their potential impacts are close to the highest one. Compared with existing trial-based algorithms, PLEASE can save considerable time to propagate the bound improvements of beliefs in deep levels of the search tree to the root belief because of fewer point-based value backups. Experiments show that PLEASE scales up SARSOP, one of the fastest algorithms, by orders of magnitude on some POMDP tasks with large observation spaces.

SoCS Conference 2015 Conference Paper

PLEASE: Palm Leaf Search for POMDPs with Large Observation Spaces

  • Zongzhang Zhang
  • David Hsu
  • Wee Sun Lee
  • Zhan Wei Lim
  • Aijun Bai

This paper provides a novel POMDP planning method, called Palm LEAf SEarch (PLEASE), which allows the selection of more than one outcome when their potential impacts are close to the highest one during its forward exploration. Compared with existing trial-based algorithms, PLEASE can save considerable time to propagate the bound improvements of beliefs in deep levels of the search tree to the root belief because of fewer backup operations. Experiments showed that PLEASE scales up SARSOP, one of the fastest algorithms, by orders of magnitude on some POMDP tasks with large observation spaces.

IROS Conference 2015 Conference Paper

POMDP to the Rescue: Boosting Performance for Robocup Rescue

  • Kegui Wu
  • Wee Sun Lee
  • David Hsu

Disaster response is one of the most critical social issues and introduces quite a few research themes for the AI planning area. Robocup Rescue provides a platform to simulate the rescue process in a city when an earthquake happens. Existing methods consist of multi-agent methods that use greedy heuristics. These methods scale to large maps but suffer from volatile performance under different scenarios. In this work, we propose a planning framework to boost the performance on Robocup Rescue given several policies from the competition to be used as components. More specifically, we use an online POMDP algorithm with macro-actions and restrict it to plan within the space of tasks performed by the agents in the component policies at each time instance. Since the action space contains macro-actions of the component policies, the method is guaranteed to perform at least as well as the best component policy, and possibly better, if sufficient computation is provided. On the other hand, the restriction of the tasks to those suggested by component policies reduces the computational complexity of planning and allows the planning method to be practically applied. Experiment results show that our planner generates better performance than the best component policy for some scenarios and gives performance comparable to the best component policy for the rest.

ICAPS Conference 2014 Conference Paper

Bootstrapping Simulation-Based Algorithms with a Suboptimal Policy

  • Truong-Huy Dinh Nguyen
  • Tomi Silander
  • Wee Sun Lee
  • Tze-Yun Leong

Finding optimal policies for Markov Decision Processes with large state spaces is in general intractable. Nonetheless, simulation-based algorithms inspired by Sparse Sampling (SS) such as Upper Confidence Bound applied in Trees (UCT) and Forward Search Sparse Sampling (FSSS) have been shown to perform reasonably well in both theory and practice, despite the high computational demand. To improve the efficiency of these algorithms, we adopt a simple enhancement technique with a heuristic policy to speed up the selection of optimal actions. The general method, called Aux, augments the look-ahead tree with auxiliary arms that are evaluated by the heuristic policy. In this paper, we provide theoretical justification for the method and demonstrate its effectiveness in two experimental benchmarks that showcase the faster convergence to a near optimal policy for both SS and FSSS. Moreover, to further speed up the convergence of these algorithms at the early stage, we present a novel mechanism to combine them with UCT so that the resulting hybrid algorithm is superior to both of its components.

JMLR Journal 2014 Journal Article

Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation

  • Nguyen Viet Cuong
  • Nan Ye
  • Wee Sun Lee
  • Hai Leong Chieu

Dependencies among neighboring labels in a sequence are important sources of information for sequence labeling and segmentation. However, only first-order dependencies, which are dependencies between adjacent labels or segments, are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we give efficient inference algorithms to handle high-order dependencies between labels or segments in conditional random fields, under the assumption that the number of distinct label patterns used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting high-order dependencies can lead to substantial performance improvements for some problems, and we discuss conditions under which high-order features can be effective. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2014. ( edit, beta )

ICML Conference 2014 Conference Paper

Covering Number for Efficient Heuristic-based POMDP Planning

  • Zongzhang Zhang
  • David Hsu
  • Wee Sun Lee

The difficulty of POMDP planning depends on the size of the search space involved. Heuristics are often used to reduce the search space size and improve computational efficiency; however, there are few theoretical bounds on their effectiveness. In this paper, we use the covering number to characterize the size of the search space reachable under heuristics and connect the complexity of POMDP planning to the effectiveness of heuristics. With insights from the theoretical analysis, we have developed a practical POMDP algorithm, Packing-Guided Value Iteration (PGVI). Empirically, PGVI is competitive with the state-of-the-art point-based POMDP algorithms on 65 small benchmark problems and outperforms them on 4 larger problems.

UAI Conference 2014 Conference Paper

Near-optimal Adaptive Pool-based Active Learning with General Loss

  • Cuong V. Nguyen
  • Wee Sun Lee
  • Nan Ye

We consider adaptive pool-based active learning in a Bayesian setting. We first analyze two commonly used greedy active learning criteria: the maximum entropy criterion, which selects the example with the highest entropy, and the least confidence criterion, which selects the example whose most probable label has the least probability value. We show that unlike the non-adaptive case, the maximum entropy criterion is not able to achieve an approximation that is within a constant factor of optimal policy entropy. For the least confidence criterion, we show that it is able to achieve a constant factor approximation to the optimal version space reduction in a worst-case setting, where the probability of labelings that have not been eliminated is considered as the version space. We consider a third greedy active learning criterion, the Gibbs error criterion, and generalize it to handle arbitrary loss functions between labelings. We analyze the properties of the generalization and its variants, and show that they perform well in practice.

NeurIPS Conference 2013 Conference Paper

Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

  • Nguyen Viet Cuong
  • Wee Sun Lee
  • Nan Ye
  • Kian Ming Chai
  • Hai Leong Chieu

We introduce a new objective function for pool-based Bayesian active learning with probabilistic hypotheses. This objective function, called the policy Gibbs error, is the expected error rate of a random classifier drawn from the prior distribution on the examples adaptively selected by the active learning policy. Exact maximization of the policy Gibbs error is hard, so we propose a greedy strategy that maximizes the Gibbs error at each iteration, where the Gibbs error on an instance is the expected error of a random classifier selected from the posterior label distribution on that instance. We apply this maximum Gibbs error criterion to three active learning scenarios: non-adaptive, adaptive, and batch active learning. In each scenario, we prove that the criterion achieves near-maximal policy Gibbs error when constrained to a fixed budget. For practical implementations, we provide approximations to the maximum Gibbs error criterion for Bayesian conditional random fields and transductive Naive Bayes. Our experimental results on a named entity recognition task and a text classification task show that the maximum Gibbs error criterion is an effective active learning criterion for noisy models.

NeurIPS Conference 2013 Conference Paper

DESPOT: Online POMDP Planning with Regularization

  • Adhiraj Somani
  • Nan Ye
  • David Hsu
  • Wee Sun Lee

POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online lookahead search algorithm that alleviates these difficulties by limiting the search to a set of sampled scenarios. The execution of all policies on the sampled scenarios is summarized using a Determinized Sparse Partially Observable Tree (DESPOT), which is a sparsely sampled belief tree. Our algorithm, named Regularized DESPOT (R-DESPOT), searches the DESPOT for a policy that optimally balances the size of the policy and the accuracy on its value estimate obtained through sampling. We give an output-sensitive performance bound for all policies derived from the DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime approximation to R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms.

NeurIPS Conference 2013 Conference Paper

Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

  • Xinhua Zhang
  • Wee Sun Lee
  • Yee Whye Teh

Incorporating invariance information is important for many learning problems. To exploit invariances, most existing methods resort to approximations that either lead to expensive optimization problems such as semi-definite programming, or rely on separation oracles to retain tractability. Some methods further limit the space of functions and settle for non-convex models. In this paper, we propose a framework for learning in reproducing kernel Hilbert spaces (RKHS) using local invariances that explicitly characterize the behavior of the target function around data instances. These invariances are \emph{compactly} encoded as linear functionals whose value are penalized by some loss function. Based on a representer theorem that we establish, our formulation can be efficiently optimized via a convex program. For the representer theorem to hold, the linear functionals are required to be bounded in the RKHS, and we show that this is true for a variety of commonly used RKHS and invariances. Experiments on learning with unlabeled data and transform invariances show that the proposed method yields better or similar results compared with the state of the art.

ICRA Conference 2013 Conference Paper

Planning how to learn

  • Haoyu Bai
  • David Hsu
  • Wee Sun Lee

When a robot uses an imperfect system model to plan its actions, a key challenge is the exploration-exploitation trade-off between two sometimes conflicting objectives: (i) learning and improving the model, and (ii) immediate progress towards the goal, according to the current model. To address model uncertainty systematically, we propose to use Bayesian reinforcement learning and cast it as a partially observable Markov decision process (POMDP). We present a simple algorithm for offline POMDP planning in the continuous state space. Offline planning produces a POMDP policy, which can be executed efficiently online as a finite-state controller. This approach seamlessly integrates planning and learning: it incorporates learning objectives in the computed plan, which then enables the robot to learn nearly optimally online and reach the goal. We evaluated the approach in simulations on two distinct tasks, acrobot swing-up and autonomous vehicle navigation amidst pedestrians, and obtained interesting preliminary results.

AAAI Conference 2010 Conference Paper

Structured Parameter Elicitation

  • Li Ling Ko
  • David Hsu
  • Wee Sun Lee
  • Sylvie Ong

The behavior of a complex system often depends on parameters whose values are unknown in advance. To operate effectively, an autonomous agent must actively gather information on the parameter values while progressing towards its goal. We call this problem parameter elicitation. Partially observable Markov decision processes (POMDPs) provide a principled framework for such uncertainty planning tasks, but they suffer from high computational complexity. However, POMDPs for parameter elicitation often possess special structural properties, specifically, factorization and symmetry. This work identifies these properties and exploits them for efficient solution through a factored belief representation. The experimental results show that our new POMDP solvers outperform SARSOP and MOMDP, two of the fastest general-purpose POMDP solvers available, and can handle significantly larger problems.

ICRA Conference 2008 Conference Paper

A point-based POMDP planner for target tracking

  • David Hsu
  • Wee Sun Lee
  • Nan Rong

Target tracking has two variants that are often studied independently with different approaches: target searching requires a robot to find a target initially not visible, and target following requires a robot to maintain visibility on a target initially visible. In this work, we use a partially observable Markov decision process (POMDP) to build a single model that unifies target searching and target following. The POMDP solution exhibits interesting tracking behaviors, such as anticipatory moves that exploit target dynamics, informationgathering moves that reduce target position uncertainty, and energy-conserving actions that allow the target to get out of sight, but do not compromise long-term tracking performance. To overcome the high computational complexity of solving POMDPs, we have developed SARSOP, a new point-based POMDP algorithm based on successively approximating the space reachable under optimal policies. Experimental results show that SARSOP is competitive with the fastest existing pointbased algorithm on many standard test problems and faster by many times on some.

IJCAI Conference 2007 Conference Paper

  • Upali S Kohomban
  • Wee Sun Lee

Learning word sense classes has been shown to be useful in fine-grained word sense disambiguation. However, the common choice for sense classes, WordNet lexicographer files, are not designed for machine learning based word sense disambiguation. In this work, we explore the use of clustering techniques in an effort to construct sense classes that are more suitable for word sense disambiguation end-task. Our results show that these classes can significantly improve classifier performance over the state of the art results of unrestricted word sense disambiguation.

AAAI Conference 2004 Conference Paper

Text Classification by Labeling Words

  • Bing Liu
  • Wee Sun Lee

Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class. It then uses these words to extract a set of documents for each class from a set of unlabeled documents to form the initial training set. The EM algorithm is then applied to build the classifier. The key issue of the approach is how to obtain a set of representative words for each class. One way is to ask the user to provide them, which is difficult because the user usually can only give a few words (which are insufficient for accurate learning). We propose a method to solve the problem. It combines clustering and feature selection. The technique can effectively rank the words in the unlabeled set according to their importance. The user then selects/labels some words from the ranked list for each class. This process requires less effort than providing words with no help or manual labeling of documents. Our results show that the new method is highly effective and promising.

NeurIPS Conference 1997 Conference Paper

Generalization in Decision Trees and DNF: Does Size Matter?

  • Mostefa Golea
  • Peter Bartlett
  • Wee Sun Lee
  • Llew Mason

Recent theoretical results for pattern classification with thresh(cid: 173) olded real-valued functions (such as support vector machines, sig(cid: 173) moid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probabil(cid: 173) ity any decision tree of depth no more than d that is consistent with m training examples has misclassification probability no more than o ( (~ (Neff VCdim(U) log2 m log d)) 1/2), where U is the class of node decision functions, and Neff: :; N can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform). This bound is qualitatively different from the VC bound and can be considerably smaller. We use the same technique to give similar results for DNF formulae. • Author to whom correspondence should be addressed 260 M. Golea, P Bartlett, W. S. Lee and L Mason