Author name cluster

Nikos Vlassis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

35 papers

2 author rows

RLC Conference 2025 Conference Paper

Adaptive Submodular Policy Optimization

Branislav Kveton
Anup Rao
Viet Dac Lai
Nikos Vlassis
David Arbour

We propose KL-regularized policy optimization for adaptive submodular maximization, which is a framework for decision making under uncertainty with submodular rewards. Policy optimization of adaptive submodular functions justifies a surprisingly simple and efficient policy gradient update, where the optimized action only affects its immediate reward but not the future ones. It also allows us to learn adaptive submodular policies with large action spaces, such as those represented by large language models (LLMs). We prove that our policies monotonically improve as the regularization diminishes and converge to the optimal greedy policy. Our experiments show major gains in statistical efficiency, in both synthetic problems and LLMs.

PDF Details

RLJ Journal 2025 Journal Article

Adaptive Submodular Policy Optimization

Branislav Kveton
Anup Rao
Viet Dac Lai
Nikos Vlassis
David Arbour

PDF Details

AAAI Conference 2024 Conference Paper

Distributional Off-Policy Evaluation for Slate Recommendations

Shreyas Chaudhari
David Arbour
Georgios Theocharous
Nikos Vlassis

Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.

PDF Details DOI

NeurIPS Conference 2021 Conference Paper

Control Variates for Slate Off-Policy Evaluation

Nikos Vlassis
Ashok Chandrashekar
Fernando Amat
Nathan Kallus

We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions, often termed slates. The problem is common to recommender systems and user-interface optimization, and it is particularly challenging because of the combinatorially-sized action space. Swaminathan et al. (2017) have proposed the pseudoinverse (PI) estimator under the assumption that the conditional mean rewards are additive in actions. Using control variates, we consider a large class of unbiased estimators that includes as specific cases the PI estimator and (asymptotically) its self-normalized variant. By optimizing over this class, we obtain new estimators with risk improvement guarantees over both the PI and the self-normalized PI estimators. Experiments with real-world recommender data as well as synthetic data validate these improvements in practice.

PDF Details

IJCAI Conference 2019 Conference Paper

Marginal Posterior Sampling for Slate Bandits

Maria Dimakopoulou
Nikos Vlassis
Tony Jebara

We introduce a new Thompson sampling-based algorithm, called marginal posterior sampling, for online slate bandits, that is characterized by three key ideas. First, it postulates that the slate-level reward is a monotone function of the marginal unobserved rewards of the base actions selected in the slates's slots, but it does not attempt to estimate this function. Second, instead of maintaining a slate-level reward posterior, the algorithm maintains posterior distributions for the marginal reward of each slot's base actions and uses the samples from these marginal posteriors to select the next slate. Third, marginal posterior sampling optimizes at the slot-level rather than the slate-level, which makes the approach computationally efficient. Simulation results establish substantial advantages of marginal posterior sampling over alternative Thompson sampling-based approaches that are widely used in the domain of web services.

PDF Details

ICML Conference 2019 Conference Paper

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

Aurélien Bibaut
Ivana Malenica
Nikos Vlassis
Mark J. van der Laan

We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their $O_P(1/\sqrt{n})$ rate of convergence and characterizing their asymptotic distribution.