Author name cluster

Shinji Ito

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Shinji Ito
Kevin Jamieson
Haipeng Luo
Arnab Maiti
Taira Tsuchiya

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging \textit{aggregate bandit feedback} model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of \textit{best-of-both-worlds} (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

TMLR Journal 2025 Journal Article

Influential Bandits: Pulling an Arm May Change the Environment

Ryoma Sato
Shinji Ito

While classical formulations of multi-armed bandit problems assume that each arm's reward is independent and stationary, real-world applications often involve non-stationary environments and interdependencies between arms. In particular, selecting one arm may influence the future rewards of other arms, a scenario not adequately captured by existing models such as rotting bandits or restless bandits. To address this limitation, we propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix that governs the dynamics of arm losses. We formally define this problem and establish two regret lower bounds, including a superlinear $\Omega(T^2 / \log^2 T)$ bound for the standard LCB algorithm (loss minimization version of UCB) and an algorithm-independent $\Omega(T)$ bound, which highlight the inherent difficulty of the setting. We then introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics. Under mild assumptions, our algorithm achieves a regret of $O(KT \log T)$, which is nearly optimal in terms of its dependence on the time horizon. The algorithm is simple to implement and computationally efficient. Empirical evaluations on both synthetic and real-world datasets demonstrate the presence of inter-arm influence and confirm the superior performance of our method compared to conventional bandit algorithms.

NeurIPS Conference 2025 Conference Paper

Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning

Baiyuan Chen
Shinji Ito
Masaaki Imaizumi

Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap by showing that transformers can achieve nearly optimal dynamic regret bounds in non-stationary settings. We prove that transformers are capable of approximating strategies used to handle non-stationary environment, and can learn the approximator in the in-context learning setup. Our experiments further show that transformers can match or even outperform existing expert algorithms in such environments.

NeurIPS Conference 2025 Conference Paper

Optimal Regret of Bandits under Differential Privacy

Achraf Azize
Yulian Wu
Junya Honda
Francesco Orabona
Shinji Ito
Debabrota Basu

As sequential learning algorithms are increasingly applied to real life, ensuring data privacy while maintaining their utilities emerges as a timely question. In this context, regret minimisation in stochastic bandits under $\epsilon$-global Differential Privacy (DP) has been widely studied. The present literature poses a significant gap between the best-known regret lower and upper bound in this setting, though they ``match in order''. Thus, we revisit the regret lower and upper bounds of $\epsilon$-global DP bandits and improve both. First, we prove a tighter regret lower bound involving a novel information-theoretic quantity characterising the hardness of $\epsilon$-global DP in stochastic bandits. This quantity smoothly interpolates between Kullback–Leibler divergence and Total Variation distance, depending on the privacy budget $\epsilon$. Then, we choose two asymptotically optimal bandit algorithms, i. e. , KL-UCB and IMED, and propose their DP versions using a unified blueprint, i. e. , (a) running in arm-dependent phases, and (b) adding Laplace noise to achieve privacy. For Bernoulli bandits, we analyse the regrets of these algorithms and show that their regrets asymptotically match our lower bound up to a constant arbitrary close to 1. At the core of our algorithms lies a new concentration inequality for sums of Bernoulli variables under Laplace mechanism, which is a new DP version of the Chernoff bound. Finally, our numerical experiments validate that DP-KLUCB and DP-IMED achieve lower regret than the existing $\epsilon$-global DP bandit algorithms.

NeurIPS Conference 2025 Conference Paper

Revisiting Follow-the-Perturbed-Leader with Unbounded Perturbations in Bandit Problems

Jongyeong Lee
Junya Honda
Shinji Ito
Min-hwan Oh

Follow-the-Regularized-Leader (FTRL) policies have achieved Best-of-Both-Worlds (BOBW) results in various settings through hybrid regularizers, whereas analogous results for Follow-the-Perturbed-Leader (FTPL) remain limited due to inherent analytical challenges. To advance the analytical foundations of FTPL, we revisit classical FTRL-FTPL duality for unbounded perturbations and establish BOBW results for FTPL under a broad family of asymmetric unbounded Fréchet-type perturbations, including hybrid perturbations combining Gumbel-type and Fréchet-type tails. These results not only extend the BOBW results of FTPL but also offer new insights into designing alternative FTPL policies competitive with hybrid regularization approaches. Motivated by earlier observations in two-armed bandits, we further investigate the connection between the $1/2$-Tsallis entropy and a Fréchet-type perturbation. Our numerical observations suggest that it corresponds to a symmetric Fréchet-type perturbation, and based on this, we establish the first BOBW guarantee for symmetric unbounded perturbations in the two-armed setting. In contrast, in general multi-armed bandits, we find an instance in which symmetric Fréchet-type perturbations violate the standard condition for BOBW analysis, which is a problem not observed with asymmetric or nonnegative Fréchet-type perturbations. Although this example does not rule out alternative analyses achieving BOBW results, it suggests the limitations of directly applying the relationship observed in two-armed cases to the general case and thus emphasizes the need for further investigation to fully understand the behavior of FTPL in broader settings.

NeurIPS Conference 2024 Conference Paper

A Simple and Adaptive Learning Rate for FTRL in Online Learning with Minimax Regret of $\Theta(T^{2/3})$ and its Application to Best-of-Both-Worlds

Taira Tsuchiya
Shinji Ito

Follow-the-Regularized-Leader (FTRL) is a powerful framework for various online learning problems. By designing its regularizer and learning rate to be adaptive to past observations, FTRL is known to work adaptively to various properties of an underlying environment. However, most existing adaptive learning rates are for online learning problems with a minimax regret of $\Theta(\sqrt{T})$ for the number of rounds $T$, and there are only a few studies on adaptive learning rates for problems with a minimax regret of $\Theta(T^{2/3})$, which include several important problems dealing with indirect feedback. To address this limitation, we establish a new adaptive learning rate framework for problems with a minimax regret of $\Theta(T^{2/3})$. Our learning rate is designed by matching the stability, penalty, and bias terms that naturally appear in regret upper bounds for problems with a minimax regret of $\Theta(T^{2/3})$. As applications of this framework, we consider three major problems with a minimax regret of $\Theta(T^{2/3})$: partial monitoring, graph bandits, and multi-armed bandits with paid observations. We show that FTRL with our learning rate and the Tsallis entropy regularizer improves existing Best-of-Both-Worlds (BOBW) regret upper bounds, which achieve simultaneous optimality in the stochastic and adversarial regimes. The resulting learning rate is surprisingly simple compared to the existing learning rates for BOBW algorithms for problems with a minimax regret of $\Theta(T^{2/3})$.

PDF Details DOI

TMLR Journal 2024 Journal Article

Best-of-Both-Worlds Linear Contextual Bandits

Masahiro Kato
Shinji Ito

This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under an adversarial corruption. At each round, a decision-maker observes an independent and identically distributed context and then selects an arm based on the context and past observations. After selecting an arm, the decision-maker incurs a loss corresponding to the selected arm. The decision-maker aims to minimize the cumulative loss over the trial. The goal of this study is to develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees. We first formulate the problem by introducing a novel setting of bandits with adversarial corruption, referred to as the contextual adversarial regime with a self-bounding constraint. We assume linear models for the relationship between the loss and the context. Then, we propose a strategy that extends the {\tt RealLinExp3} by \citet{Neu2020} and the Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}} + \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \ \sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and suboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption parameter. This regret upper bound implies $O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and by $O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer to our strategy as the {\tt Best-of-Both-Worlds (BoBW) RealFTRL}, due to its theoretical guarantees in both stochastic and adversarial regimes.

TMLR Journal 2024 Journal Article

Contaminated Online Convex Optimization

Tomoya Kamijima
Shinji Ito

In online convex optimization, some efficient algorithms have been designed for each of the individual classes of objective functions, e.g., convex, strongly convex, and exp-concave. However, existing regret analyses, including those of universal algorithms, are limited to cases in which the objective functions in all rounds belong to the same class and cannot be applied to cases in which the property of objective functions may change in each time step. This paper introduces a novel approach to address such cases, proposing a new regime we term as \textit{contaminated} online convex optimization. For the contaminated case, we demonstrate that the regret is lower bounded by $\Omega(\log T + \sqrt{k})$. Here, $k$ signifies the level of contamination in the objective functions. We also demonstrate that the regret is bounded by $O(\log T+\sqrt{k\log T})$ when universal algorithms are used. When our proposed algorithms with additional information are employed, the regret is bounded by $O(\log T+\sqrt{k})$, which matches the lower bound. These are intermediate bounds between a convex case and a strongly convex or exp-concave case.

ICML Conference 2024 Conference Paper

Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Taira Tsuchiya
Shinji Ito
Junya Honda

Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, exploration by optimization (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $\Delta_a$ is the suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.

NeurIPS Conference 2024 Conference Paper

Fast Rates in Stochastic Online Convex Optimization by Exploiting the Curvature of Feasible Sets

Taira Tsuchiya
Shinji Ito

In this work, we explore online convex optimization (OCO) and introduce a new condition and analysis that provides fast rates by exploiting the curvature of feasible sets. In online linear optimization, it is known that if the average gradient of loss functions exceeds a certain threshold, the curvature of feasible sets can be exploited by the follow-the-leader (FTL) algorithm to achieve a logarithmic regret. This study reveals that algorithms adaptive to the curvature of loss functions can also leverage the curvature of feasible sets. In particular, we first prove that if an optimal decision is on the boundary of a feasible set and the gradient of an underlying loss function is non-zero, then the algorithm achieves a regret bound of $O(\rho \log T)$ in stochastic environments. Here, $\rho > 0$ is the radius of the smallest sphere that includes the optimal decision and encloses the feasible set. Our approach, unlike existing ones, can work directly with convex loss functions, exploiting the curvature of loss functions simultaneously, and can achieve the logarithmic regret only with a local property of feasible sets. Additionally, the algorithm achieves an $O(\sqrt{T})$ regret even in adversarial environments, in which FTL suffers an $\Omega(T)$ regret, and achieves an $O(\rho \log T + \sqrt{C \rho \log T})$ regret in corrupted stochastic environments with corruption level $C$. Furthermore, by extending our analysis, we establish a matching regret upper bound of $O\Big(T^{\frac{q-2}{2(q-1)}} (\log T)^{\frac{q}{2(q-1)}}\Big)$ for $q$-uniformly convex feasible sets, where uniformly convex sets include strongly convex sets and $\ell_p$-balls for $p \in [2, \infty)$. This bound bridges the gap between the $O(\log T)$ bound for strongly convex sets~($q=2$) and the $O(\sqrt{T})$ bound for non-curved sets~($q\to\infty$).

PDF Details DOI

IJCAI Conference 2024 Conference Paper

Learning with Posterior Sampling for Revenue Management under Time-varying Demand

Kazuma Shimizu
Junya Honda
Shinji Ito
Shinji Nakadai

This paper discusses the revenue management (RM) problem to maximize revenue by pricing items or services. One challenge in this problem is that the demand distribution is unknown and varies over time in real applications such as airline and retail industries. In particular, the time-varying demand has not been well studied under scenarios of unknown demand due to the difficulty of jointly managing the remaining inventory and estimating the demand. To tackle this challenge, we first introduce an episodic generalization of the RM problem motivated by typical application scenarios. We then propose a computationally efficient algorithm based on posterior sampling, which effectively optimizes prices by solving linear programming. We derive a Bayesian regret upper bound of this algorithm for general models where demand parameters can be correlated between time periods, while also deriving a regret lower bound for generic algorithms. Our empirical study shows that the proposed algorithm performs better than other benchmark algorithms and comparably to the optimal policy in hindsight. We also propose a heuristic modification of the proposed algorithm, which further efficiently learns the pricing policy in the experiments. An extended version of this paper with appendixes is available at: http: //arxiv. org/abs/2405. 04910.

PDF Details DOI

AAAI Conference 2024 Conference Paper

New Classes of the Greedy-Applicable Arm Feature Distributions in the Sparse Linear Bandit Problem

Koji Ichikawa
Shinji Ito
Daisuke Hatano
Hanna Sumita
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

We consider the sparse contextual bandit problem where arm feature affects reward through the inner product of sparse parameters. Recent studies have developed sparsity-agnostic algorithms based on the greedy arm selection policy. However, the analysis of these algorithms requires strong assumptions on the arm feature distribution to ensure that the greedily selected samples are sufficiently diverse; One of the most common assumptions, relaxed symmetry, imposes approximate origin-symmetry on the distribution, which cannot allow distributions that has origin-asymmetric support. In this paper, we show that the greedy algorithm is applicable to a wider range of the arm feature distributions from two aspects. Firstly, we show that a mixture distribution that has a greedy-applicable component is also greedy-applicable. Second, we propose new distribution classes, related to Gaussian mixture, discrete, and radial distribution, for which the sample diversity is guaranteed. The proposed classes can describe distributions with origin-asymmetric support and, in conjunction with the first claim, provide theoretical guarantees of the greedy policy for a very wide range of the arm feature distributions.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

On the Minimax Regret for Contextual Linear Bandits and Multi-Armed Bandits with Expert Advice

Shinji Ito

This paper examines two extensions of multi-armed bandit problems: multi-armed bandits with expert advice and contextual linear bandits. For the former problem, multi-armed bandits with expert advice, the previously known best upper and lower bounds have been $O(\sqrt{KT \log \frac{N}{K} })$ and $\Omega( \sqrt{KT \frac{ \log N }{\log K }} )$, respectively. Here, $K$, $N$, and $T$ represent the numbers of arms, experts, and rounds, respectively. We provide a lower bound of $\Omega( \sqrt{KT \log \frac{N}{K}} )$ for the setup in which the player chooses an expert before observing the advices in each round. For the latter problem, contextual linear bandits, we provide an algorithm that achieves $O ( \sqrt{d T \log ( K \min\{ 1, \frac{S}{d} \} )} )$ together with a matching lower bound, where $d$ and $S$ represent the dimensionality of feature vectors and the size of the context space, respectively.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits

Shinji Ito
Kei Takemura

In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020] works well. Specifically, an algorithm constructed using this approach achieves $O(d \sqrt{ T \log{T}})$-regret in adversarial environments and $O(\frac{d^2 \log T}{\Delta_{\min}} )$-regret in stochastic environments. Symbols $d$, $T$ and $\Delta_{\min}$ here represent the dimensionality of the action set, the time horizon, and the minimum sub-optimality gap, respectively. We also show that this algorithm has even better theoretical guarantees for important special cases including the multi-armed bandit problem and multitask bandits.

NeurIPS Conference 2023 Conference Paper

Bandit Task Assignment with Unknown Processing Time

Shinji Ito
Daisuke Hatano
Hanna Sumita
Kei Takemura
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

This study considers a novel problem setting, referred to as \textit{bandit task assignment}, that incorporates the processing time of each task in the bandit setting. In this problem setting, a player sequentially chooses a set of tasks to start so that the set of processing tasks satisfies a given combinatorial constraint. The reward and processing time for each task follow unknown distributions, values of which are revealed only after the task has been completed. The problem generalizes the stochastic combinatorial semi-bandit problem and the budget-constrained bandit problem. For this problem setting, we propose an algorithm based on upper confidence bounds~(UCB) combined with a phased-update approach. The proposed algorithm admits a gap-dependent regret upper bound of $O(MN(1/\Delta){\log T})$ and a gap-free regret upper bound of $\tilde{O}( \sqrt{MNT} )$, where $N$ is the number of the tasks, $M$ is the maximum number of tasks run at the same time, $T$ is the time horizon, and $\Delta$ is the gap between expected per-round rewards of the optimal and best suboptimal sets of tasks. These regret bounds nearly match lower bounds.

NeurIPS Conference 2023 Conference Paper

Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds

Taira Tsuchiya
Shinji Ito
Junya Honda

Adaptivity to the difficulties of a problem is a key property in sequential decision-making problems to broaden the applicability of algorithms. Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems. Aiming to further generalize this adaptivity, we develop a generic adaptive learning rate, called stability-penalty-adaptive (SPA) learning rate for FTRL. This learning rate yields a regret bound jointly depending on stability and penalty of the algorithm, into which the regret of FTRL is typically decomposed. With this result, we establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW). Despite the fact that sparsity appears frequently in real problems, existing sparse multi-armed bandit algorithms with $k$-arms assume that the sparsity level $s \leq k$ is known in advance, which is often not the case in real-world scenarios. To address this issue, we first establish $s$-agnostic algorithms with regret bounds of $\tilde{O}(\sqrt{sT})$ in the adversarial regime for $T$ rounds, which matches the existing lower bound up to a logarithmic factor. Meanwhile, BOBW algorithms aim to achieve a near-optimal regret in both the stochastic and adversarial regimes. Leveraging the SPA learning rate and the technique for $s$-agnostic algorithms combined with a new analysis to bound the variation in FTRL output in response to changes in a regularizer, we establish the first BOBW algorithm with a sparsity-dependent bound. Additionally, we explore partial monitoring and demonstrate that the proposed SPA learning rate framework allows us to achieve a game-dependent bound and the BOBW simultaneously.

NeurIPS Conference 2022 Conference Paper

Average Sensitivity of Euclidean k-Clustering

Yuichi Yoshida
Shinji Ito

Given a set of $n$ points in $\mathbb{R}^d$, the goal of Euclidean $(k, \ell)$-clustering is to find $k$ centers that minimize the sum of the $\ell$-th powers of the Euclidean distance of each point to the closest center. In practical situations, the clustering result must be stable against points missing in the input data so that we can make trustworthy and consistent decisions. To address this issue, we consider the average sensitivity of Euclidean $(k, \ell)$-clustering, which measures the stability of the output in total variation distance against deleting a random point from the input data. We first show that a popular algorithm \textsc{$k$-means++} and its variant called \textsc{$D^\ell$-sampling} have low average sensitivity. Next, we show that any approximation algorithm for Euclidean $(k, \ell)$-clustering can be transformed to an algorithm with low average sensitivity while almost preserving the approximation guarantee. As byproducts of our results, we provide several algorithms for consistent $(k, \ell)$-clustering and dynamic $(k, \ell)$-clustering in the random-order model, where the input points are randomly permuted and given in an online manner. The goal of the consistent setting is to maintain a good solution while minimizing the number of changes to the solution during the process, and that of the dynamic setting is to maintain a good solution while minimizing the (amortized) update time.

NeurIPS Conference 2022 Conference Paper

Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs

Shinji Ito
Taira Tsuchiya
Junya Honda

This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: strongly observable graphs yield minimax regret of $\tilde{\Theta}( \alpha^{1/2} T^{1/2} )$, while weakly observable graphs induce minimax regret of $\tilde{\Theta}( \delta^{1/3} T^{2/3} )$, where $\alpha$ and $\delta$, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of $\tilde{O}( \alpha^{1/2} T^{1/2} )$ for adversarial environments, as well as of $ {O} ( \frac{\alpha (\ln T)^3 }{\Delta_{\min}} ) $ for stochastic environments, where $\Delta_{\min}$ expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of $\tilde{O}( \delta^{1/3}T^{2/3} )$ for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-regularized-leader approach combined with newly designed update rules for learning rates.

AAAI Conference 2022 Conference Paper

Online Task Assignment Problems with Reusable Resources

Hanna Sumita
Shinji Ito
Kei Takemura
Daisuke Hatano
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

We study online task assignment problem with reusable resources, motivated by practical applications such as ridesharing, crowdsourcing and job hiring. In the problem, we are given a set of offline vertices (agents), and, at each time, an online vertex (task) arrives randomly according to a known time-dependent distribution. Upon arrival, we assign the task to agents immediately and irrevocably. The goal of the problem is to maximize the expected total profit produced by completed tasks. The key features of our problem are (1) an agent is reusable, i. e. , an agent comes back to the market after completing the assigned task, (2) an agent may reject the assigned task to stay the market, and (3) a task may accommodate multiple agents. The setting generalizes that of existing work in which an online task is assigned to one agent under (1). In this paper, we propose an online algorithm that is 1/2competitive for the above setting, which is tight. Moreover, when each agent can reject assigned tasks at most ∆ times, the algorithm is shown to have the competitive ratio ∆/(3∆ − 1) ≥ 1/3. We also evaluate our proposed algorithm with numerical experiments.

ICML Conference 2022 Conference Paper

Revisiting Online Submodular Minimization: Gap-Dependent Regret Bounds, Best of Both Worlds and Adversarial Robustness

Shinji Ito

In this paper, we consider online decision problems with submodular loss functions. For such problems, existing studies have only dealt with worst-case analysis. This study goes beyond worst-case analysis to show instance-dependent regret bounds. More precisely, for each of the full-information and bandit-feedback settings, we propose an algorithm that achieves a gap-dependent O(log T)-regret bound in the stochastic environment and is comparable to the best existing algorithm in the adversarial environment. The proposed algorithms also work well in the stochastic environment with adversarial corruptions, which is an intermediate setting between the stochastic and adversarial environments.

NeurIPS Conference 2022 Conference Paper

Single Loop Gaussian Homotopy Method for Non-convex Optimization

Hidenori Iwakiri
Yuhang Wang
Shinji Ito
Akiko Takeda

The Gaussian homotopy (GH) method is a popular approach to finding better stationary points for non-convex optimization problems by gradually reducing a parameter value $t$, which changes the problem to be solved from an almost convex one to the original target one. Existing GH-based methods repeatedly call an iterative optimization solver to find a stationary point every time $t$ is updated, which incurs high computational costs. We propose a novel single loop framework for GH methods (SLGH) that updates the parameter $t$ and the optimization decision variables at the same. Computational complexity analysis is performed on the SLGH algorithm under various situations: either a gradient or gradient-free oracle of a GH function can be obtained for both deterministic and stochastic settings. The convergence rate of SLGH with a tuned hyperparameter becomes consistent with the convergence rate of gradient descent, even though the problem to be solved is gradually changed due to $t$. In numerical experiments, our SLGH algorithms show faster convergence than an existing double loop GH method while outperforming gradient descent-based methods in terms of finding a better solution.

NeurIPS Conference 2021 Conference Paper

Hybrid Regret Bounds for Combinatorial Semi-Bandits and Adversarial Linear Bandits

Shinji Ito

This study aims to develop bandit algorithms that automatically exploit tendencies of certain environments to improve performance, without any prior knowledge regarding the environments. We first propose an algorithm for combinatorial semi-bandits with a hybrid regret bound that includes two main features: a best-of-three-worlds guarantee and multiple data-dependent regret bounds. The former means that the algorithm will work nearly optimally in all environments in an adversarial setting, a stochastic setting, or a stochastic setting with adversarial corruptions. The latter implies that, even if the environment is far from exhibiting stochastic behavior, the algorithm will perform better as long as the environment is "easy" in terms of certain metrics. The metrics w. r. t. the easiness referred to in this paper include cumulative loss for optimal actions, total quadratic variation of losses, and path-length of a loss sequence. We also show hybrid data-dependent regret bounds for adversarial linear bandits, which include a first path-length regret bound that is tight up to logarithmic factors.

AAAI Conference 2021 Conference Paper

Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions

Kei Takemura
Shinji Ito
Daisuke Hatano
Hanna Sumita
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses a set of arms with the feature vectors in each round under given constraints so as to maximize the sum of rewards of arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds T. However, there is a gap of Õ(max( √ d, √ k)) between the current best upper and lower bounds, where d is the dimension of the feature vectors, k is the number of the chosen arms in a round, and Õ(·) ignores the logarithmic factors. The dependence of k and d is of practical importance because k may be larger than T in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C2 UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound Õ(d √ kT + dk) for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C2 UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take into account other objectives simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.

NeurIPS Conference 2021 Conference Paper

On Optimal Robustness to Adversarial Corruption in Online Decision Problems

Shinji Ito

This paper considers two fundamental sequential decision-making problems: the problem of prediction with expert advice and the multi-armed bandit problem. We focus on stochastic regimes in which an adversary may corrupt losses, and we investigate what level of robustness can be achieved against adversarial corruption. The main contribution of this paper is to show that optimal robustness can be expressed by a square-root dependency on the amount of corruption. More precisely, we show that two classes of algorithms, anytime Hedge with decreasing learning rate and algorithms with second-order regret bounds, achieve $O( \frac{\log N}{\Delta} + \sqrt{ \frac{C \log N }{\Delta} } )$-regret, where $N, \Delta$, and $C$ represent the number of experts, the gap parameter, and the corruption level, respectively. We further provide a matching lower bound, which means that this regret bound is tight up to a constant factor. For the multi-armed bandit problem, we also provide a nearly-tight lower bound up to a logarithmic factor.

NeurIPS Conference 2020 Conference Paper

A Tight Lower Bound and Efficient Reduction for Swap Regret

Shinji Ito

Swap regret, a generic performance measure of online decision-making algorithms, plays an important role in the theory of repeated games, along with a close connection to correlated equilibria in strategic games. This paper shows an $\Omega( \sqrt{T N\log{N}} )$-lower bound for swap regret, where $T$ and $N$ denote the numbers of time steps and available actions, respectively. Our lower bound is tight up to a constant, and resolves an open problem mentioned, e. g. , in the book by Nisan et al. (2007). Besides, we present a computationally efficient reduction method that converts no-external-regret algorithms to no-swap-regret algorithms. This method can be applied not only to the full-information setting but also to the bandit setting and provides a better regret bound than previous results.

NeurIPS Conference 2020 Conference Paper

Delay and Cooperation in Nonstochastic Linear Bandits

Shinji Ito
Daisuke Hatano
Hanna Sumita
Kei Takemura
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action, and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed $d$ rounds after the choice of an action, and propose an algorithm for which the expected regret is $\tilde{O}( \sqrt{m (m + d) T} )$, ignoring logarithmic factors in $m$ and $T$, where $m$ and $T$ denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer the regret of $\Omega(\sqrt{m (m+d) T})$ in the worst case. To develop the algorithm, we introduce a technique we refer to as \textit{distribution truncation}, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [17] and Bar-On and Mansour [12], and extend their results to the linear bandits setting.

NeurIPS Conference 2020 Conference Paper

Tight First- and Second-Order Regret Bounds for Adversarial Linear Bandits

Shinji Ito
Shuichi Hirahara
Tasuku Soma
Yuichi Yoshida

We propose novel algorithms with first- and second-order regret bounds for adversarial linear bandits. These regret bounds imply that our algorithms perform well when there is an action achieving a small cumulative loss or the loss has a small variance. In addition, we need only assumptions weaker than those of existing algorithms; our algorithms work on discrete action sets as well as continuous ones without a priori knowledge about losses, and they run efficiently if a linear optimization oracle for the action set is available. These results are obtained by combining optimistic online optimization, continuous multiplicative weight update methods, and a novel technique that we refer to as distribution truncation. We also show that the regret bounds of our algorithms are tight up to polylogarithmic factors.

NeurIPS Conference 2019 Conference Paper

Improved Regret Bounds for Bandit Combinatorial Optimization

Shinji Ito
Daisuke Hatano
Hanna Sumita
Kei Takemura
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

\textit{Bandit combinatorial optimization} is a bandit framework in which a player chooses an action within a given finite set $\mathcal{A} \subseteq \{ 0, 1 \}^d$ and incurs a loss that is the inner product of the chosen action and an unobservable loss vector in $\mathbb{R} ^ d$ in each round. In this paper, we aim to reveal the property, which makes the bandit combinatorial optimization hard. Recently, Cohen et al. ~\citep{cohen2017tight} obtained a lower bound $\Omega(\sqrt{d k^3 T / \log T})$ of the regret, where $k$ is the maximum $\ell_1$-norm of action vectors, and $T$ is the number of rounds. This lower bound was achieved by considering a continuous strongly-correlated distribution of losses. Our main contribution is that we managed to improve this bound by $\Omega( \sqrt{d k ^3 T} )$ through applying a factor of $\sqrt{\log T}$, which can be done by means of strongly-correlated losses with \textit{binary} values. The bound derives better regret bounds for three specific examples of the bandit combinatorial optimization: the multitask bandit, the bandit ranking and the multiple-play bandit. In particular, the bound obtained for the bandit ranking in the present study addresses an open problem raised in \citep{cohen2017tight}. In addition, we demonstrate that the problem becomes easier without considering correlations among entries of loss vectors. In fact, if each entry of loss vectors is an independent random variable, then, one can achieve a regret of $\tilde{O}(\sqrt{d k^2 T})$, which is $\sqrt{k}$ times smaller than the lower bound shown above. The observed results indicated that correlation among losses is the reason for observing a large regret.

NeurIPS Conference 2019 Conference Paper

Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback

Shinji Ito
Daisuke Hatano
Hanna Sumita
Kei Takemura
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

We propose computationally efficient algorithms for \textit{online linear optimization with bandit feedback}, in which a player chooses an \textit{action vector} from a given (possibly infinite) set $\mathcal{A} \subseteq \mathbb{R}^d$, and then suffers a loss that can be expressed as a linear function in action vectors. Although existing algorithms achieve an optimal regret bound of $\tilde{O}(\sqrt{T})$ for $T$ rounds (ignoring factors of $\mathrm{poly} (d, \log T)$), computationally efficient ways of implementing them have not yet been specified, in particular when $|\mathcal{A}|$ is not bounded by a polynomial size in $d$. A standard way to pursue computational efficiency is to assume that we have an efficient algorithm referred to as \textit{oracle} that solves (offline) linear optimization problems over $\mathcal{A}$. Under this assumption, the computational efficiency of a bandit algorithm can then be measured in terms of \textit{oracle complexity}, i. e. , the number of oracle calls. Our contribution is to propose algorithms that offer optimal regret bounds of $\tilde{O}(\sqrt{T})$ as well as low oracle complexity for both \textit{non-stochastic settings} and \textit{stochastic settings}. Our algorithm for non-stochastic settings has an oracle complexity of $\tilde{O}( T )$ and is the first algorithm that achieves both a regret bound of $\tilde{O}( \sqrt{T} )$ and an oracle complexity of $\tilde{O} ( \mathrm{poly} ( T ) )$, given only linear optimization oracles. Our algorithm for stochastic settings calls the oracle only $O( \mathrm{poly} (d, \log T))$ times, which is smaller than the current best oracle complexity of $O( T )$ if $T$ is sufficiently large.

NeurIPS Conference 2019 Conference Paper

Submodular Function Minimization with Noisy Evaluation Oracle

Shinji Ito

This paper considers submodular function minimization with \textit{noisy evaluation oracles} that return the function value of a submodular objective with zero-mean additive noise. For this problem, we provide an algorithm that returns an $O(n^{3/2}/\sqrt{T})$-additive approximate solution in expectation, where $n$ and $T$ stand for the size of the problem and the number of oracle calls, respectively. There is no room for reducing this error bound by a factor smaller than $O(1/\sqrt{n})$. Indeed, we show that any algorithm will suffer additive errors of $\Omega(n/\sqrt{T})$ in the worst case. Further, we consider an extended problem setting with \textit{multiple-point feedback} in which we can get the feedback of $k$ function values with each oracle call. Under the additional assumption that each noisy oracle is submodular and that $2 \leq k = O(1)$, we provide an algorithm with an $O(n/\sqrt{T})$-additive error bound as well as a worst-case analysis including a lower bound of $\Omega(n/\sqrt{T})$, which together imply that the algorithm achieves an optimal error bound up to a constant.

ICML Conference 2018 Conference Paper

Causal Bandits with Propagating Inference

Akihiro Yabe
Daisuke Hatano
Hanna Sumita
Shinji Ito
Naonori Kakimura
Takuro Fukunaga
Ken-ichi Kawarabayashi

Bandit is a framework for designing sequential experiments, where a learner selects an arm $A \in \mathcal{A}$ and obtains an observation corresponding to $A$ in each experiment. Theoretically, the tight regret lower-bound for the general bandit is polynomial with respect to the number of arms $|\mathcal{A}|$, and thus, to overcome this bound, the bandit problem with side-information is often considered. Recently, a bandit framework over a causal graph was introduced, where the structure of the causal graph is available as side-information and the arms are identified with interventions on the causal graph. Existing algorithms for causal bandit overcame the $\Omega(\sqrt{|\mathcal{A}|/T})$ simple-regret lower-bound; however, their algorithms work only when the interventions $\mathcal{A}$ are localized around a single node (i. e. , an intervention propagates only to its neighbors). We then propose a novel causal bandit algorithm for an arbitrary set of interventions, which can propagate throughout the causal graph. We also show that it achieves $O(\sqrt{ \gamma^*\log(|\mathcal{A}|T) / T})$ regret bound, where $\gamma^*$ is determined by using a causal graph structure. In particular, if the maximum in-degree of the causal graph is a constant, then $\gamma^* = O(N^2)$, where $N$ is the number of nodes.

NeurIPS Conference 2018 Conference Paper

Regret Bounds for Online Portfolio Selection with a Cardinality Constraint

Shinji Ito
Daisuke Hatano
Hanna Sumita
Akihiro Yabe
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

Online portfolio selection is a sequential decision-making problem in which a learner repetitively selects a portfolio over a set of assets, aiming to maximize long-term return. In this paper, we study the problem with the cardinality constraint that the number of assets in a portfolio is restricted to be at most k, and consider two scenarios: (i) in the full-feedback setting, the learner can observe price relatives (rates of return to cost) for all assets, and (ii) in the bandit-feedback setting, the learner can observe price relatives only for invested assets. We propose efficient algorithms for these scenarios that achieve sublinear regrets. We also provide regret (statistical) lower bounds for both scenarios which nearly match the upper bounds when k is a constant. In addition, we give a computational lower bound which implies that no algorithm maintains both computational efficiency, as well as a small regret upper bound.

ICML Conference 2018 Conference Paper

Unbiased Objective Estimation in Predictive Optimization

Shinji Ito
Akihiro Yabe
Ryohei Fujimaki

For data-driven decision-making, one promising approach, called predictive optimization, is to solve maximization problems i n which the objective function to be maximized is estimated from data. Predictive optimization, however, suffers from the problem of a calculated optimal solution’s being evaluated too optimistically, i. e. , the value of the objective function is overestimated. This paper investigates such optimistic bias and presents two methods for correcting it. The first, which is analogous to cross-validation, successfully corrects the optimistic bias but results in underestimation of the true value. Our second method employs resampling techniques to avoid both overestimation and underestimation. We show that the second method, referred to as the parameter perturbation method, achieves asymptotically unbiased estimation. Empirical results for both artificial and real-world datasets demonstrate that our proposed approach successfully corrects the optimistic bias.

NeurIPS Conference 2017 Conference Paper

Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited Observation

Shinji Ito
Daisuke Hatano
Hanna Sumita
Akihiro Yabe
Takuro Fukunaga
Naonori Kakimura
Ken-ichi Kawarabayashi

Online sparse linear regression is the task of applying linear regression analysis to examples arriving sequentially subject to a resource constraint that a limited number of features of examples can be observed. Despite its importance in many practical applications, it has been recently shown that there is no polynomial-time sublinear-regret algorithm unless NP$\subseteq$BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem. Under these assumptions, we present polynomial-time sublinear-regret algorithms for the online sparse linear regression. In addition, thorough experiments with publicly available data demonstrate that our algorithms outperform other known algorithms.

IJCAI Conference 2017 Conference Paper

Robust Quadratic Programming for Price Optimization

Akihiro Yabe
Shinji Ito
Ryohei Fujimaki

The goal of price optimization is to maximize total revenue by adjusting the prices of products, on the basis of predicted sales numbers that are functions of pricing strategies. Recent advances in demand modeling using machine learning raise a new challenge in price optimization, i. e. , how to manage statistical errors in estimation. In this paper, we show that uncertainty in recently-proposed prescriptive price optimization frameworks can be represented by a matrix normal distribution. For this particular uncertainty, we propose novel robust quadratic programming algorithms for conservative lower-bound maximization. We offer an asymptotic probabilistic guarantee of conservativeness of our formulation. Our experiments on both artificial and actual price data show that our robust price optimization allows users to determine best risk-return trade-offs and to explore safe, profitable price strategies.

NeurIPS Conference 2016 Conference Paper

Large-Scale Price Optimization via Network Flow

Shinji Ito
Ryohei Fujimaki

This paper deals with price optimization, which is to find the best pricing strategy that maximizes revenue or profit, on the basis of demand forecasting models. Though recent advances in regression technologies have made it possible to reveal price-demand relationship of a number of multiple products, most existing price optimization methods, such as mixed integer programming formulation, cannot handle tens or hundreds of products because of their high computational costs. To cope with this problem, this paper proposes a novel approach based on network flow algorithms. We reveal a connection between supermodularity of the revenue and cross elasticity of demand. On the basis of this connection, we propose an efficient algorithm that employs network flow algorithms. The proposed algorithm can handle hundreds or thousands of products, and returns an exact optimal solution under an assumption regarding cross elasticity of demand. Even in case in which the assumption does not hold, the proposed algorithm can efficiently find approximate solutions as good as can other state-of-the-art methods, as empirical results show.