Arrow Research

Author name cluster

Yuu Jinnai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

TMLR Journal 2025 Journal Article

Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

  • Yuki Ichihara
  • Yuu Jinnai
  • Tetsuro Morimura
  • Kenshi Abe
  • Kaito Ariu
  • Mitsuki Sakamoto
  • Eiji Uchibe

Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at decoding time. BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, excessive optimization of its value can compromise performance on the true objective. Previous work proposes Regularized BoN sampling (RBoN), a variant of BoN sampling with a regularization term added to the objective, and shows empirically that it mitigates reward hacking (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN heuristically, without analyzing why such a regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of regularization strategies on BoN sampling. Using these regularization strategies corresponds to robust optimization, which maximizes the worst-case performance over a set of possible perturbations of the proxy reward. Although the theoretical guarantees do not directly apply to RBoN, RBoN corresponds to a practical implementation. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case optimization in the proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic's hh-rlhf datasets to examine which factors of the regularization strategies contribute to improvement in the true reward. In addition, we propose another simple RBoN method, Sentence Length Regularized BoN, which performs better in our experiments than the previous methods.
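
The selection rules themselves are compact enough to sketch. Below is a minimal, illustrative Python sketch of plain BoN and a proximity-regularized variant in the spirit of RBoN; `generate`, `proxy_reward`, and `proximity` are placeholder callables of my own, not the paper's API.

```python
def best_of_n(prompt, generate, proxy_reward, n=16):
    """Plain Best-of-N: sample n candidates and return the one the proxy
    reward model scores highest. Vulnerable to reward hacking when the
    proxy diverges from the true objective."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_reward(prompt, c))

def regularized_bon(prompt, generate, proxy_reward, proximity, beta=0.1, n=16):
    """Proximity-regularized BoN in the spirit of RBoN: trade the proxy
    reward off against a proximity term (e.g. log-likelihood under the
    reference policy) so the pick cannot drift into reward-hacked text."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates,
               key=lambda c: proxy_reward(prompt, c) + beta * proximity(prompt, c))
```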

ICML Conference 2024 Conference Paper

Model-Based Minimum Bayes Risk Decoding for Text Generation

  • Yuu Jinnai
  • Tetsuro Morimura
  • Ukyo Honda
  • Kaito Ariu
  • Kenshi Abe

Minimum Bayes Risk (MBR) decoding has been shown to be a powerful alternative to beam search decoding in a variety of text generation tasks. Given a utility function, MBR decoding selects from a pool of hypotheses the one with the least expected risk under a probability model. Since it is impractical to compute the expected risk exactly over all possible hypotheses, two approximations are commonly used in MBR. First, it integrates over a sampled set of hypotheses rather than over all possible hypotheses. Second, it estimates the probability of each hypothesis using a Monte Carlo estimator. While the first approximation is necessary to make it computationally feasible, the second is not essential since we typically have access to the model probability at inference time. We propose model-based MBR (MBMBR), a variant of MBR that uses the model probability itself as the estimate of the probability distribution instead of the Monte Carlo estimate. We show analytically and empirically that the model-based estimate is more promising than the Monte Carlo estimate in text generation tasks. Our experiments show that MBMBR outperforms MBR in several text generation tasks, both with encoder-decoder models and with language models.
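
A minimal sketch of the selection rule, covering both probability estimates the abstract contrasts; `utility` is a placeholder for any pairwise utility such as BLEU or BERTScore, and `probs` stands in for the model probabilities that MBMBR uses in place of the Monte Carlo estimate.

```python
import numpy as np

def mbr_decode(hypotheses, utility, probs=None):
    """Pick the hypothesis with the highest expected utility (least
    expected risk) against the sampled hypothesis pool.

    probs=None reproduces the Monte Carlo estimate (uniform weight per
    sample); passing model probabilities instead gives the model-based
    estimate, which is the change MBMBR makes."""
    if probs is None:
        weights = np.full(len(hypotheses), 1.0 / len(hypotheses))
    else:
        weights = np.asarray(probs, dtype=float)
        weights = weights / weights.sum()       # renormalize over the pool
    expected = [sum(w * utility(h, ref) for w, ref in zip(weights, hypotheses))
                for h in hypotheses]
    return hypotheses[int(np.argmax(expected))]
```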

AAAI Conference 2021 Conference Paper

Lipschitz Lifelong Reinforcement Learning

  • Erwan Lecarpentier
  • David Abel
  • Kavosh Asadi
  • Yuu Jinnai
  • Emmanuel Rachelson
  • Michael L. Littman

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with an improved convergence rate. Further, we show that, with high probability, the method experiences no negative transfer. We illustrate the benefits of the method in Lifelong RL experiments.
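
The continuity claim has roughly this shape (the notation is mine, not the paper's exact statement):

```latex
\[
  \bigl|\, V^{*}_{M}(s) - V^{*}_{\bar{M}}(s) \,\bigr|
  \;\le\; L \cdot d\bigl(M, \bar{M}\bigr)
  \qquad \text{for all states } s
\]
```

So the optimal values of a new task are pinned inside a band around those of any previously solved nearby task, which is what makes transferred values safe to use as initial bounds.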

ICLR Conference 2020 Conference Paper

Exploration in Reinforcement Learning with Deep Covering Options

  • Yuu Jinnai
  • Jee Won Park
  • Marlos C. Machado
  • George Konidaris 0001

While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, the covering options method was proposed to discover a set of options that provably reduce the upper bound on the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward.
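
In large or continuous state spaces the Laplacian eigenvectors cannot be computed exactly, so the usual recipe, in the spirit of this paper, is to approximate a low-frequency eigenfunction with a network trained on sampled transitions. A hedged PyTorch sketch of that spectral-approximation idea, my own simplification rather than the paper's training objective:

```python
import torch.nn as nn

obs_dim = 8  # placeholder observation size

# f: S -> R approximates a low-frequency Laplacian eigenfunction
f = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def spectral_loss(s, s_next, beta=1.0):
    """Train f to vary slowly along sampled transitions (smoothness),
    while a soft constraint keeps it away from the trivial constant
    solution (zero mean, unit variance over the batch)."""
    fs, fs_next = f(s), f(s_next)
    smoothness = ((fs - fs_next) ** 2).mean()
    constraint = fs.mean() ** 2 + (fs.var() - 1.0) ** 2
    return smoothness + beta * constraint
```

Options are then defined to push the agent toward the extremes of f, mirroring the tabular construction.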

AAAI Conference 2020 Conference Paper

Neural Architecture Search Using Deep Neural Networks and Monte Carlo Tree Search

  • Linnan Wang
  • Yiyang Zhao
  • Yuu Jinnai
  • Yuandong Tian
  • Rodrigo Fonseca

Neural Architecture Search (NAS) has shown great success in automating the design of neural networks, but the prohibitive amount of computation behind current NAS methods requires further work on improving sample efficiency and network evaluation cost to get better results in a shorter time. In this paper, we present a novel scalable Monte Carlo Tree Search (MCTS) based NAS agent, named AlphaX, to tackle these two aspects. AlphaX improves search efficiency by adaptively balancing exploration and exploitation at the state level, and by using a Meta-Deep Neural Network (DNN) to predict network accuracies and bias the search toward promising regions. To amortize the network evaluation cost, AlphaX accelerates MCTS rollouts with a distributed design and reduces the number of epochs in evaluating a network by transfer learning, guided by the tree structure in MCTS. In 12 GPU days and 1000 samples, AlphaX found an architecture that reaches 97.84% top-1 accuracy on CIFAR-10 and 75.5% top-1 accuracy on ImageNet, exceeding SOTA NAS methods in both accuracy and sample efficiency. We also evaluate AlphaX on NASBench-101, a large-scale NAS dataset; AlphaX is 3x and 2.8x more sample efficient than Random Search and Regularized Evolution, respectively, in finding the global optimum. Finally, we show that the searched architectures improve a variety of vision applications, from Neural Style Transfer to Image Captioning and Object Detection.
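
The exploration/exploitation balance at each tree node is the standard UCB-style rule. A minimal sketch; the `Node` fields are assumptions of this sketch, not AlphaX's actual data structures, and in AlphaX the value estimate would come from the meta-DNN's predicted accuracy rather than full training.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal MCTS node; AlphaX's real nodes carry network architectures."""
    visits: int = 0
    value_sum: float = 0.0        # sum of (predicted) accuracies seen below
    children: list = field(default_factory=list)

def uct_select(node, c=1.4):
    """UCB1-style selection: exploit the mean value of a child, explore
    rarely visited children."""
    def score(child):
        if child.visits == 0:
            return float("inf")   # expand unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```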

ICML Conference 2019 Conference Paper

Discovering Options for Exploration by Minimizing Cover Time

  • Yuu Jinnai
  • Jee Won Park
  • David Abel
  • George Konidaris 0001

One of the main challenges in reinforcement learning is solving tasks with sparse reward. We show that the difficulty of discovering a distant rewarding state in an MDP is bounded by the expected cover time of a random walk over the graph induced by the MDP’s transition dynamics. We therefore propose to accelerate exploration by constructing options that minimize cover time. We introduce a new option discovery algorithm that diminishes the expected cover time by connecting the most distant states in the state-space graph with options. We show empirically that the proposed algorithm improves learning in several domains with sparse rewards.
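
For the tabular case the construction is compact enough to sketch: take the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian) and connect the states at its two extremes, which are roughly the states a random walk takes longest to travel between. A minimal numpy sketch under those assumptions, not the paper's exact algorithm:

```python
import numpy as np

def fiedler_option(adjacency):
    """Connect the two most distant states of a state-space graph, as
    identified by the extremes of the Laplacian's Fiedler vector."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A     # combinatorial graph Laplacian
    _, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]            # second-smallest eigenvalue's vector
    s_lo, s_hi = int(np.argmin(fiedler)), int(np.argmax(fiedler))
    # one option in each direction between the two extremes
    return (s_lo, s_hi), (s_hi, s_lo)
```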

ICML Conference 2019 Conference Paper

Finding Options that Minimize Planning Time

  • Yuu Jinnai
  • David Abel
  • D. Ellis Hershkowitz
  • Michael L. Littman
  • George Konidaris 0001

We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum number of value-iteration passes. We first show that the problem is NP-hard, even if the task is constrained to be deterministic; this is the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains.
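
In notation of my own choosing, the underlying optimization problem is:

```latex
\[
  \min_{O \subseteq \mathcal{O}} \; |O|
  \quad \text{subject to} \quad
  \mathrm{iters}_{\mathrm{VI}}(A \cup O) \;\le\; k
\]
```

That is, find the fewest options whose addition to the primitive actions A lets value iteration converge within k passes; the NP-hardness result says even this is intractable for deterministic tasks, which motivates the boundedly suboptimal approximation.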

AAAI Conference 2019 Conference Paper

State Abstraction as Compression in Apprenticeship Learning

  • David Abel
  • Dilip Arumugam
  • Kavosh Asadi
  • Yuu Jinnai
  • Michael L. Littman
  • Lawson L.S. Wong

State abstraction can give rise to models of environments that are both compressed and useful, thereby enabling efficient sequential decision making. In this work, we offer the first formalism and analysis of the trade-off between compression and performance made in the context of state abstraction for Apprenticeship Learning. We build on Rate-Distortion theory, the classic Blahut-Arimoto algorithm, and the Information Bottleneck method to develop an algorithm for computing state abstractions that approximate the optimal tradeoff between compression and performance. We illustrate the power of this algorithmic structure to offer insights into effective abstraction, compression, and reinforcement learning through a mixture of analysis, visuals, and experimentation.
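
The trade-off the abstract describes has the familiar rate-distortion shape; in my notation, not necessarily the paper's exact objective:

```latex
\[
  \min_{\phi(\tilde{s} \mid s)} \;
  I(S;\tilde{S}) \;+\; \beta \,
  \mathbb{E}\bigl[\mathrm{dist}(S,\tilde{S})\bigr]
\]
```

Here the mutual information term measures how compressed the stochastic abstraction is, the distortion term stands in for lost performance, and β sets the trade-off; Blahut-Arimoto-style alternating updates then yield a solution.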

ICML Conference 2018 Conference Paper

Policy and Value Transfer in Lifelong Reinforcement Learning

  • David Abel
  • Yuu Jinnai
  • Yue Guo 0003
  • George Konidaris 0001
  • Michael L. Littman

We consider the problem of how best to use prior experience to bootstrap lifelong learning, where an agent faces a series of task instances drawn from some task distribution. First, we identify the initial policy that optimizes expected performance over the distribution of tasks for increasingly complex classes of policy and task distributions. We empirically demonstrate the relative performance of each policy class’ optimal element in a variety of simple task distributions. We then consider value-function initialization methods that preserve PAC guarantees while simultaneously minimizing the learning required in two learning algorithms, yielding MaxQInit, a practical new method for value-function-based transfer. We show that MaxQInit performs well in simple lifelong RL experiments.
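
As I read the abstract, the MaxQInit initializer has this form (notation mine):

```latex
\[
  Q_{\mathrm{init}}(s,a) \;=\; \max_{m \in \mathcal{M}_{\mathrm{solved}}} Q^{*}_{m}(s,a)
\]
```

Taking the maximum over previously solved tasks keeps the initialization optimistic, with high probability, for a new task drawn from the same distribution, which is what preserves the PAC guarantees while still cutting down the learning required.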

AAAI Conference 2017 Short Paper

Learning to Avoid Dominated Action Sequences in Planning for Black-Box Domains

  • Yuu Jinnai
  • Alex Fukunaga

Black-box domains, in which the successor states generated by applying an action are produced by a completely opaque simulator, pose a challenge for domain-independent planning. The main computational bottleneck in search-based planning for such domains is the number of calls to the black-box simulator. We propose a method for significantly reducing the number of simulator calls made by the search algorithm, by detecting and pruning sequences of actions which are dominated by others. We apply our pruning method to Iterated Width and breadth-first search in domain-independent black-box planning for Atari 2600 games, and show that adding our pruning method significantly improves upon the baseline algorithms.

AAAI Conference 2017 Conference Paper

Learning to Prune Dominated Action Sequences in Online Black-Box Planning

  • Yuu Jinnai
  • Alex Fukunaga

Black-box domains, in which the successor states generated by applying an action are produced by a completely opaque simulator, pose a challenge for domain-independent planning. The main computational bottleneck in search-based planning for such domains is the number of calls to the black-box simulator. We propose a method for significantly reducing the number of simulator calls made by the search algorithm, by detecting and pruning sequences of actions which are dominated by others. We apply our pruning method to Iterated Width and breadth-first search in domain-independent black-box planning for Atari 2600 games in the Arcade Learning Environment (ALE), and show that adding our pruning method significantly improves upon the baseline algorithms.
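
The duplicate-detection core of the idea can be sketched in a few lines; this is my own simplification (the paper learns which action *sequences* to skip across episodes), and `simulate` is a placeholder for the black-box successor generator:

```python
def expand(state, actions, simulate, seen_successors):
    """One node expansion with pruning: call the black-box simulator once
    per action and keep only actions whose successor has not already been
    generated (a duplicating action is dominated by the action that
    produced the state first). Assumes successor states are hashable."""
    kept = []
    for action in actions:
        successor = simulate(state, action)   # the expensive black-box call
        if successor not in seen_successors:
            seen_successors.add(successor)
            kept.append((action, successor))
    return kept
```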

JAIR Journal 2017 Journal Article

On Hash-Based Work Distribution Methods for Parallel Best-First Search

  • Yuu Jinnai
  • Alex Fukunaga

Parallel best-first search algorithms such as Hash Distributed A* (HDA*) distribute work among the processes using a global hash function. We analyze the search and communication overheads of state-of-the-art hash-based parallel best-first search algorithms, and show that although Zobrist hashing, the standard hash function used by HDA*, achieves good load balance for many domains, it incurs significant communication overhead since almost all generated nodes are transferred to a different processor than their parents. We propose Abstract Zobrist hashing, a new work distribution method for parallel search which, instead of computing a hash value based on the raw features of a state, uses a feature projection function to generate a set of abstract features; hashing these abstract features yields higher locality and thus reduced communication overhead. We show that Abstract Zobrist hashing outperforms previous methods on search domains using hand-coded, domain-specific feature projection functions. We then propose GRAZHDA*, a graph-partitioning based approach to automatically generating feature projection functions. GRAZHDA* seeks to approximate the partitioning of the actual search space graph by partitioning the domain transition graph, an abstraction of the state space graph. We show that GRAZHDA* outperforms previous methods on domain-independent planning.
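
The contrast between the two hash functions is easy to sketch. Plain Zobrist hashing XORs one random word per state feature, so a single changed feature sends a child to an arbitrary processor; Abstract Zobrist hashing first projects features through an abstraction, so children that share abstract features stay with their parent. A minimal Python sketch, where `project` is a placeholder for a feature projection function:

```python
import random

def make_table(num_features, num_values, seed=0):
    """One random 64-bit word per (feature index, feature value) pair."""
    rng = random.Random(seed)
    return [[rng.getrandbits(64) for _ in range(num_values)]
            for _ in range(num_features)]

def zobrist_hash(state, table):
    """Plain Zobrist: XOR one word per feature value. Changing any single
    feature scrambles the hash, so a child rarely maps to its parent's
    processor."""
    h = 0
    for i, value in enumerate(state):
        h ^= table[i][value]
    return h

def abstract_zobrist_hash(state, table, project):
    """Abstract Zobrist: hash the projected (abstract) features instead.
    project(i, v) must map each feature value into the table's value
    range; children sharing abstract features hash identically to their
    parent, so no node transfer is needed for them."""
    return zobrist_hash([project(i, v) for i, v in enumerate(state)], table)
```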

AAAI Conference 2016 Conference Paper

Abstract Zobrist Hashing: An Efficient Work Distribution Method for Parallel Best-First Search

  • Yuu Jinnai
  • Alex Fukunaga

Hash Distributed A* (HDA*) is an efficient parallel best-first search algorithm that asynchronously distributes work among the processes using a global hash function. Although Zobrist hashing, the standard hash function used by HDA*, achieves good load balance for many domains, it incurs significant communication overhead since it requires many node transfers among threads. We propose Abstract Zobrist hashing, a new work distribution method for parallel search which reduces node transfers and mitigates communication overhead by using feature projection functions. We evaluate Abstract Zobrist hashing for multicore HDA*, and show that it significantly outperforms previous work distribution methods.

ICAPS Conference 2016 Conference Paper

Automated Creation of Efficient Work Distribution Functions for Parallel Best-First Search

  • Yuu Jinnai
  • Alex S. Fukunaga

Hash Distributed A* (HDA*) is an efficient parallel best-first search algorithm that asynchronously distributes work among the processes using a global hash function. We investigate domain-independent methods for automatically creating effective work distribution functions for HDA*. First, we propose a new method for generating abstract features for the recently proposed Abstract Zobrist hashing method. Second, we propose a method for modifying Zobrist hashing such that selected operators are guaranteed to generate children that are mapped to the same process as the parent, ensuring that no communication overhead is incurred for such operators. Finally, we present an improvement to state-abstraction based HDA* which dynamically selects the abstraction graph size based on problem features. We evaluate these new work distribution methods for a domain-independent planner on a cluster with 48 cores and show that these methods result in significantly higher speedups than previous methods.