Arrow Research

Author name cluster

Guy Tennenholtz

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

ICLR 2025 · Conference Paper

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

  • Yinlam Chow
  • Guy Tennenholtz
  • Izzeddin Gur
  • Vincent Zhuang
  • Bo Dai 0001
  • Aviral Kumar
  • Rishabh Agarwal
  • Sridhar Thiagarajan

Recent studies indicate that effectively utilizing inference-time compute is crucial for attaining good performance from large language models (LLMs). Specifically, the Best-of-N (BoN) inference strategy, where an LLM generates multiple responses and a verifier selects the best, has shown strong empirical performance. Motivated by this, we develop a novel inference-aware fine-tuning paradigm, which encompasses the BoN-aware inference framework as a special case. We devise the first imitation learning and reinforcement learning (RL) methods for fine-tuning LLMs using BoN, overcoming the challenging, non-differentiable argmax operator in BoN. We empirically demonstrate that our BoN-aware models implicitly learn a per-example "meta-strategy", which interleaves best responses with more diverse responses that might be better suited to a test-time input—a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the BoN performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and Pass@K from 60% to 67%.
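
Since everything in this abstract hinges on the BoN strategy and its non-differentiable argmax, a minimal sketch of plain BoN inference may help; `generate` and `verifier` below are hypothetical stand-ins, not the paper's models:

```python
import random

def best_of_n(prompt, generate, verifier, n=8):
    """Best-of-N (BoN) inference: draw n candidate responses and keep
    the one the verifier scores highest. The argmax over candidates is
    the non-differentiable step the paper's fine-tuning methods must
    work around."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# Toy usage with stand-in sampler and verifier.
generate = lambda p: f"{p} -> guess {random.randint(0, 9)}"
verifier = lambda p, c: -abs(int(c.split()[-1]) - 7)  # prefers answers near 7
print(best_of_n("2+5=?", generate, verifier))
```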

ICML 2025 · Conference Paper

Preference Adaptive and Sequential Text-to-Image Generation

  • Ofir Nabati
  • Guy Tennenholtz
  • Chih-Wei Hsu
  • Moonkyung Ryu
  • Deepak Ramachandran
  • Yinlam Chow
  • Xiang Li
  • Craig Boutilier

We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.
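
A toy illustration of the EM step for identifying user preference types, using a Bernoulli mixture over observed choices; this is a generic sketch in the spirit of the abstract, not PASTA's actual user-choice model:

```python
import numpy as np

def em_user_types(choices, n_types, n_iter=100, seed=0):
    """choices: (n_users, n_items) binary matrix of observed selections.
    Each latent user type k has per-item selection probabilities
    theta[k]; EM alternates soft type assignment and re-estimation."""
    rng = np.random.default_rng(seed)
    n_users, n_items = choices.shape
    theta = rng.uniform(0.3, 0.7, size=(n_types, n_items))
    mix = np.full(n_types, 1.0 / n_types)
    for _ in range(n_iter):
        # E-step: responsibility of each type for each user.
        log_lik = (choices @ np.log(theta).T
                   + (1 - choices) @ np.log(1 - theta).T) + np.log(mix)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and type parameters.
        mix = resp.mean(axis=0)
        theta = ((resp.T @ choices) / resp.sum(axis=0)[:, None]).clip(1e-3, 1 - 1e-3)
    return mix, theta, resp
```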

ICML 2024 · Conference Paper

Bayesian Regret Minimization in Offline Bandits

  • Marek Petrik
  • Guy Tennenholtz
  • Mohammad Ghavamzadeh

We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that reliance on LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower bound, we show that our upper bounds are tight, and that by minimizing them we are guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to maximizing LCB.
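
For reference, the LCB rule the paper argues against looks like this in an offline linear bandit; this is the generic baseline, not the authors' conic-optimization algorithm:

```python
import numpy as np

def lcb_action(features, theta_hat, Sigma, beta=1.0):
    """Pick the action maximizing the lower confidence bound
    <phi_a, theta_hat> - beta * ||phi_a||_{Sigma^{-1}}.
    features: (num_actions, d); theta_hat: (d,) offline estimate;
    Sigma: (d, d) regularized design matrix of the offline data."""
    Sigma_inv = np.linalg.inv(Sigma)
    widths = np.sqrt(np.einsum("ad,dk,ak->a", features, Sigma_inv, features))
    return int(np.argmax(features @ theta_hat - beta * widths))
```

The paper's claim is that maximizing this quantity is the wrong objective for Bayesian regret, and that directly minimizing a regret upper bound dominates it.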

ICLR 2024 · Conference Paper

Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding

  • Alizée Pace
  • Hugo Yèche
  • Bernhard Schölkopf
  • Gunnar Rätsch
  • Guy Tennenholtz

A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.
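
A rough reading of delphic uncertainty is the disagreement across an ensemble of world models that all fit the observed data equally well; the sketch below assumes such an ensemble is given and is illustrative, not the paper's estimator:

```python
import numpy as np

def delphic_uncertainty(world_models, state, action):
    """Spread of predictions across data-compatible world models.
    Variation across models (rather than noise within one model) is
    what the paper attributes to hidden confounding."""
    preds = np.array([m(state, action) for m in world_models])
    return float(preds.std())

# Toy usage: three models that agree on the data but not on this query.
models = [lambda s, a, b=b: s + b * a for b in (0.9, 1.0, 2.1)]
print(delphic_uncertainty(models, state=1.0, action=1.0))
```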

ICLR 2024 · Conference Paper

Demystifying Embedding Spaces using Large Language Models

  • Guy Tennenholtz
  • Yinlam Chow
  • Chih-Wei Hsu
  • Jihwan Jeong
  • Lior Shani
  • Azamat Tulepbergenov
  • Deepak Ramachandran
  • Martin Mladenov

Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful by employing large language models (LLMs) to directly interact with embeddings, transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of tasks, including enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs.
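
One standard way to inject an external embedding into an LLM is to map it through a small adapter into the model's token-embedding space and prepend it as soft tokens; the sketch below follows that reading, with all names and shapes ours rather than the paper's:

```python
import torch
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Maps a domain embedding (e.g., an item vector or a CAV) into k
    soft tokens in the LLM's input-embedding space, so the vector can
    be prepended to a text prompt and queried in natural language."""
    def __init__(self, emb_dim, llm_dim, k_tokens=4):
        super().__init__()
        self.k = k_tokens
        self.proj = nn.Sequential(nn.Linear(emb_dim, llm_dim * k_tokens),
                                  nn.Tanh())

    def forward(self, domain_emb):            # (batch, emb_dim)
        out = self.proj(domain_emb)           # (batch, llm_dim * k)
        return out.view(-1, self.k, out.shape[-1] // self.k)

adapter = EmbeddingAdapter(emb_dim=64, llm_dim=768)
print(adapter(torch.randn(2, 64)).shape)      # (2, 4, 768) soft tokens
```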

NeurIPS 2024 · Conference Paper

DynaMITE-RL: A Dynamic Model for Improved Temporal Meta-Reinforcement Learning

  • Anthony Liang
  • Guy Tennenholtz
  • Chih-Wei Hsu
  • Yinlam Chow
  • Erdem Biyik
  • Craig Boutilier

We introduce DynaMITE-RL, a meta-reinforcement learning (meta-RL) approach to approximate inference in environments where the latent state evolves at varying rates. We model episode sessions (parts of the episode where the latent state is fixed) and propose three key modifications to existing meta-RL methods: (i) consistency of latent information within sessions, (ii) session masking, and (iii) prior latent conditioning. We demonstrate the importance of these modifications in various domains, ranging from discrete Gridworld environments to continuous-control and simulated robot assistive tasks, illustrating the efficacy of DynaMITE-RL over state-of-the-art baselines in both online and offline RL settings.
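
Modification (i), consistency of latent information within sessions, has a simple form: penalize per-step posterior latents for drifting from their session mean. A hedged PyTorch sketch (the paper's exact loss may differ):

```python
import torch

def session_consistency_loss(latents, session_ids):
    """latents: (T, d) per-step posterior latents over an episode;
    session_ids: (T,) integer session id per step. Within a session
    the latent state is assumed fixed, so deviation from the session
    mean is penalized."""
    sessions = session_ids.unique()
    loss = 0.0
    for sid in sessions:
        z = latents[session_ids == sid]                 # (steps, d)
        loss = loss + ((z - z.mean(dim=0)) ** 2).sum(dim=-1).mean()
    return loss / len(sessions)

# Toy usage: five steps split into two sessions.
print(session_consistency_loss(torch.randn(5, 3),
                               torch.tensor([0, 0, 0, 1, 1])))
```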

NeurIPS 2024 · Conference Paper

Embedding-Aligned Language Models

  • Guy Tennenholtz
  • Yinlam Chow
  • Chih-Wei Hsu
  • Lior Shani
  • Ethan Liang
  • Craig Boutilier

We propose a novel approach for training large language models (LLMs) to adhere to objectives defined within a latent embedding space. Our method leverages reinforcement learning (RL), treating a pre-trained LLM as an environment. Our embedding-aligned guided language (EAGLE) agent is trained to iteratively steer the LLM's generation towards optimal regions of the latent embedding space, w.r.t. some predefined criterion. We demonstrate the effectiveness of the EAGLE agent using the MovieLens 25M and Amazon Review datasets to surface content gaps that satisfy latent user demand. We also demonstrate the benefit of using an optimal design of a state-dependent action set to improve EAGLE's efficiency. Our work paves the way for controlled and grounded text generation using LLMs, ensuring consistency with domain-specific knowledge and data representations.

AAAI 2024 · Conference Paper

Recommender Ecosystems: A Mechanism Design Perspective on Holistic Modeling and Optimization

  • Craig Boutilier
  • Martin Mladenov
  • Guy Tennenholtz

Modern recommender systems lie at the heart of complex recommender ecosystems that couple the behavior of users, content providers, vendors, advertisers, and other actors. Despite this, the focus of much recommender systems research and deployment is on the local, myopic optimization of the recommendations made to individual users. This comes at a significant cost to the long-term utility that recommender systems generate for their users. We argue that modeling the incentives and behaviors of these actors, and the interactions among them induced by the recommender system, is needed to maximize value and improve overall ecosystem health. Moreover, we propose the use of economic mechanism design, an area largely overlooked in recommender systems research, as a framework for developing such models. That said, one cannot apply “vanilla” mechanism design to recommender ecosystem modeling and optimization out of the box: the use of mechanism design raises a number of subtle and interesting research challenges. We outline a number of these in this talk (and paper), emphasizing the need to develop nonstandard approaches to mechanism design that intersect with numerous areas of research, including preference modeling, reinforcement learning and exploration, behavioral economics, and generative AI, among others.

EWRL 2023 · Workshop Paper

Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding

  • Alizée Pace
  • Hugo Yèche
  • Bernhard Schölkopf
  • Gunnar Rätsch
  • Guy Tennenholtz

A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.

ICML 2023 · Conference Paper

Reinforcement Learning with History Dependent Dynamic Contexts

  • Guy Tennenholtz
  • Nadav Merlis
  • Lior Shani
  • Martin Mladenov
  • Craig Boutilier

We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.
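
The key structural trick, determining context transitions through an aggregation of per-step history features rather than the raw history, can be sketched as follows (the sum aggregator and softmax link are our illustrative choices):

```python
import numpy as np

def context_transition_probs(step_features, W):
    """step_features: (t, d) features of the history so far;
    W: (num_contexts, d). The next-context distribution depends on
    history only through an aggregate of its features, which is what
    breaks the exponential dependence on history length."""
    agg = step_features.sum(axis=0)        # aggregation function
    logits = W @ agg                       # logistic-style link
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(context_transition_probs(np.ones((4, 3)), np.eye(2, 3)))
```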

EWRL 2023 · Workshop Paper

Representation-Driven Reinforcement Learning

  • Ofir Nabati
  • Guy Tennenholtz
  • Shie Mannor

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.

ICML 2023 · Conference Paper

Representation-Driven Reinforcement Learning

  • Ofir Nabati
  • Guy Tennenholtz
  • Shie Mannor

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.
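
The contextual-bandit view can be made concrete: treat candidate policies as arms described by their embeddings, keep a linear model of expected value, and select the next policy optimistically. A LinUCB-style sketch under that reading (not the paper's exact estimator):

```python
import numpy as np

def select_policy_ucb(policy_embs, A, b, alpha=1.0):
    """policy_embs: (num_policies, d) embeddings of candidate policies;
    A: (d, d) design matrix and b: (d,) reward-weighted features from
    past evaluations. Each policy is a bandit arm; the one with the
    highest upper confidence bound on value is selected."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    widths = np.sqrt(np.einsum("pd,dk,pk->p", policy_embs, A_inv, policy_embs))
    return int(np.argmax(policy_embs @ theta + alpha * widths))

# Toy usage: three candidate policies in a 2-d embedding space.
print(select_policy_ucb(np.eye(3, 2), np.eye(2), np.array([0.5, 0.1])))
```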

AAAI 2022 · Conference Paper

Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning

  • Roy Zohar
  • Shie Mannor
  • Guy Tennenholtz

Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents. As environments grow in size, effective credit assignment becomes increasingly harder and often results in infeasible learning times. Still, in many real-world settings, there exist simplified underlying dynamics that can be leveraged for more scalable solutions. In this work, we exploit such locality structures effectively whilst maintaining global cooperation. We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm. Additionally, we provide a direct reward decomposition method for finding these local rewards when only a global signal is provided. We test our method empirically, showing it scales well compared to other methods, significantly improving performance and convergence speed.
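
The locality idea amounts to representing the global value as a sum of utilities that each see only a local neighborhood of agents. A minimal PyTorch sketch with illustrative names and shapes (not LOMAQ's exact architecture):

```python
import torch
import torch.nn as nn

class LocalValueDecomposition(nn.Module):
    """Global Q as a sum of per-neighborhood utilities, each
    conditioned only on the observations and actions of agents in its
    local neighborhood."""
    def __init__(self, neighborhoods, obs_dim, act_dim):
        super().__init__()
        self.neighborhoods = neighborhoods    # list of agent-index lists
        self.local_qs = nn.ModuleList(
            nn.Sequential(nn.Linear(len(n) * (obs_dim + act_dim), 64),
                          nn.ReLU(), nn.Linear(64, 1))
            for n in neighborhoods)

    def forward(self, obs, acts):   # (batch, n_agents, obs/act_dim)
        total = 0.0
        for idx, qnet in zip(self.neighborhoods, self.local_qs):
            local = torch.cat([obs[:, idx].flatten(1),
                               acts[:, idx].flatten(1)], dim=-1)
            total = total + qnet(local)
        return total                # (batch, 1): Q_tot

qdec = LocalValueDecomposition([[0, 1], [1, 2]], obs_dim=4, act_dim=2)
print(qdec(torch.randn(8, 3, 4), torch.randn(8, 3, 2)).shape)  # (8, 1)
```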

ICLR 2022 · Conference Paper

On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning

  • Guy Tennenholtz
  • Assaf Hallak
  • Gal Dalal
  • Shie Mannor
  • Gal Chechik
  • Uri Shalit

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.

NeurIPS 2022 · Conference Paper

Reinforcement Learning with a Terminator

  • Guy Tennenholtz
  • Nadav Merlis
  • Lior Shani
  • Shie Mannor
  • Uri Shalit
  • Gal Chechik
  • Assaf Hallak
  • Gal Dalal

We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.
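
The "dynamic discount factor, incorporating the termination probability" admits a simple reading: rewards behind likely termination points count for less because the episode may be cut off first. A sketch under that assumption, not necessarily the paper's exact rule:

```python
def terminated_return(rewards, term_probs, gamma=0.99):
    """Discounted return where each step's weight is further scaled by
    the estimated probability that the external observer has not yet
    terminated the episode."""
    g, survive, total = 1.0, 1.0, 0.0
    for r, p_term in zip(rewards, term_probs):
        total += g * survive * r
        survive *= 1.0 - p_term    # chance the episode is still running
        g *= gamma
    return total

print(terminated_return([1.0, 1.0, 1.0], [0.0, 0.5, 0.9]))
```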

NeurIPS 2022 · Conference Paper

Uncertainty Estimation Using Riemannian Model Dynamics for Offline Reinforcement Learning

  • Guy Tennenholtz
  • Shie Mannor

Model-based offline reinforcement learning approaches generally rely on bounds of model error. Estimating these bounds is usually achieved through uncertainty estimation methods. In this work, we combine parametric and nonparametric methods for uncertainty estimation through a novel latent-space-based metric. In particular, we build upon recent advances in the Riemannian geometry of generative models to construct a pullback metric of an encoder-decoder based forward model. Our proposed metric measures both the quality of out-of-distribution samples and the discrepancy of examples in the data. We leverage our combined method for uncertainty estimation in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline approaches on continuous control and autonomous driving benchmarks.
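
The pullback-metric construction is concrete: for a decoder g, the metric at a latent point z is G(z) = J(z)^T J(z), with J the Jacobian of g. A minimal PyTorch sketch with a stand-in decoder:

```python
import torch

def pullback_metric(decoder, z):
    """Pullback metric G(z) = J(z)^T J(z) of the decoder at z. Lengths
    measured under G reflect distances in observation space, which is
    what lets the metric score out-of-distribution latent samples."""
    J = torch.autograd.functional.jacobian(decoder, z)  # (out_dim, d)
    return J.T @ J

decoder = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.Tanh(),
                              torch.nn.Linear(8, 5))
print(pullback_metric(decoder, torch.zeros(2)).shape)   # (2, 2)
```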

UAI 2021 · Conference Paper

Action redundancy in reinforcement learning

  • Nir Baram
  • Guy Tennenholtz
  • Shie Mannor

Maximum Entropy (MaxEnt) reinforcement learning is a powerful learning paradigm which seeks to maximize return under entropy regularization. However, action entropy does not necessarily coincide with state entropy, e.g., when multiple actions produce the same transition. Instead, we propose to maximize the transition entropy, i.e., the entropy of next states. We show that transition entropy can be described by two terms: model-dependent transition entropy and action redundancy. In particular, we explore the latter in both deterministic and stochastic settings and develop tractable approximation methods in a near model-free setup. We construct algorithms to minimize action redundancy and demonstrate their effectiveness on a synthetic environment with multiple redundant actions as well as contemporary benchmarks in Atari and Mujoco. Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
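
The gap between action entropy and transition entropy is easiest to see on a toy example with a redundant action; the numbers below illustrate the definitions rather than the paper's estimators:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Rows give P(next_state | action) from a fixed state; actions 0 and 1
# are redundant because both lead deterministically to state 0.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
pi = np.array([1/3, 1/3, 1/3])        # uniform MaxEnt-style policy

print(entropy(pi))                    # action entropy: log 3 ~ 1.10
print(entropy(pi @ P))                # transition entropy: ~0.64
```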

UAI 2021 · Conference Paper

Bandits with partially observable confounded data

  • Guy Tennenholtz
  • Uri Shalit
  • Shie Mannor
  • Yonathan Efroni

We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. Particularly, we prove regret bounds that improve current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations.

AAAI 2020 · Conference Paper

Off-Policy Evaluation in Partially Observable Environments

  • Guy Tennenholtz
  • Uri Shalit
  • Shie Mannor

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
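
The Importance Sampling baseline the paper compares against is short to state; the sketch below is the textbook trajectory-wise estimator, whose bias under partial observability is exactly the paper's concern:

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, pi_b):
    """Trajectory-wise IS estimate of pi_e's value from data logged
    under pi_b. Each trajectory is a list of (state, action, reward);
    pi_e and pi_b map (state, action) to a probability. If pi_b in
    fact conditioned on unobserved state, this estimator can be
    arbitrarily biased, which is the paper's motivation."""
    estimates = []
    for traj in trajectories:
        ratio = np.prod([pi_e(s, a) / pi_b(s, a) for s, a, _ in traj])
        estimates.append(ratio * sum(r for _, _, r in traj))
    return float(np.mean(estimates))
```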

NeurIPS 2019 · Conference Paper

Distributional Policy Optimization: An Alternative Approach for Continuous Control

  • Chen Tessler
  • Guy Tennenholtz
  • Shie Mannor

We identify a fundamental problem in policy gradient-based methods in continuous control. As policy gradient methods require the agent's underlying probability distribution, they limit policy representation to parametric distribution classes. We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions. We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable to, and often surpasses, current state-of-the-art baselines in continuous domains.

RLDM 2019 · Conference Abstract

The Natural Language of Actions

  • Guy Tennenholtz
  • Shie Mannor

We introduce Act2Vec, a general framework for learning context-based action representations for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings on a high-dimensional navigation task and the large action space domain of StarCraft II.

ICML 2019 · Conference Paper

The Natural Language of Actions

  • Guy Tennenholtz
  • Shie Mannor

We introduce Act2Vec, a general framework for learning context-based action representations for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings in three domains: a drawing task, a high-dimensional navigation task, and the large action space domain of StarCraft II.
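
Since Act2Vec treats action sequences like text, a skip-gram model over demonstrated trajectories is a natural way to reproduce the idea; the use of gensim here is our assumption, not the paper's tooling:

```python
from gensim.models import Word2Vec

# Demonstration trajectories as "sentences" of action tokens; actions
# appearing in similar contexts get nearby vectors, which can then
# augment state representations or Q-value function approximation.
trajectories = [
    ["move_left", "move_left", "jump", "move_right"],
    ["move_right", "jump", "move_right", "shoot"],
    ["move_left", "jump", "move_left", "shoot"],
]
act2vec = Word2Vec(sentences=trajectories, vector_size=16, window=2,
                   min_count=1, sg=1, epochs=50, seed=0)
print(act2vec.wv.most_similar("jump", topn=2))
```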