Arrow Research search

Author name cluster

Ashutosh Trivedi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
1 author row

Possible papers (17)

JAIR Journal 2026 Journal Article

Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives

  • Milad Kazemi
  • Mateo Perez
  • Fabio Somenzi
  • Sadegh Soudjani
  • Ashutosh Trivedi
  • Alvaro Velasquez

Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually crafting such functions is often tedious and error-prone. A more principled alternative is to specify behavioral requirements using a formal, unambiguous language that can be automatically translated into a reward function. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing approaches using omega-regular specifications typically rely on discounted reward RL in an episodic setting, where the environment is periodically reset to an initial state during learning. This setup is misaligned with the semantics of omega-regular specifications, which describe properties over infinite behavior traces. In such cases, the average reward criterion and the continuing setting—where the agent interacts with the environment over a single, uninterrupted lifetime—are more appropriate. To address the challenges of infinite-horizon, continuing tasks, we restrict our focus to the subclass of omega-regular languages known as absolute liveness specifications. These specifications cannot be violated by any finite prefix of the agent’s behavior, aligning naturally with the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. In contrast to prior work, our approach enables learning in communicating Markov Decision Processes without episodic resetting. We further introduce a reward structure for lexicographic multi-objective optimization, where the goal is to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given absolute liveness omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results across various benchmarks demonstrate that our average-reward approach in the continuing setting is more effective than competing methods based on discounting.
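
To make the average-reward, continuing-setting objective concrete, here is a minimal sketch of differential Q-learning, a standard model-free average-reward method, run over a single uninterrupted lifetime on a toy communicating MDP. The MDP, the placement of reward 1 on "accepting" transitions, and all constants are illustrative assumptions; the paper's actual construction derives the reward from an automaton for the absolute liveness specification.

```python
# Minimal sketch (not the paper's exact algorithm): differential Q-learning
# on a toy communicating MDP, with no episodic resets.
import random

# P[state][action] -> list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 0, 0.0), (0.5, 2, 1.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
}

def step(s, a):
    u, acc = random.random(), 0.0
    for p, s2, rew in P[s][a]:
        acc += p
        if u <= acc:
            return s2, rew
    return P[s][a][-1][1], P[s][a][-1][2]

Q = {s: {a: 0.0 for a in P[s]} for s in P}
rho = 0.0                      # running estimate of the optimal average reward
alpha, eta, eps = 0.1, 0.01, 0.1

s = 0
for _ in range(200_000):       # single continuing lifetime: no resets
    a = random.choice(list(P[s])) if random.random() < eps else max(Q[s], key=Q[s].get)
    s2, r = step(s, a)
    delta = r - rho + max(Q[s2].values()) - Q[s][a]
    Q[s][a] += alpha * delta   # differential value update
    rho += eta * delta         # average-reward estimate tracks the TD error
    s = s2

print("estimated average reward:", round(rho, 3))
```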

IJCAI Conference 2025 Conference Paper

Continuous-Time Reward Machines

  • Amin Falah
  • Shibashis Guha
  • Ashutosh Trivedi

Reinforcement Learning (RL) is a sampling-based method for sequential decision-making, in which a learning agent iteratively converges toward an optimal policy by leveraging feedback from the environment in the form of scalar reward signals. While timing information is often abstracted in discrete-time domains, time-critical learning applications—such as queuing systems, population processes, and manufacturing systems—are naturally modeled as Continuous-Time Markov Decision Processes (CTMDPs). Since the seminal work of Bradtke and Duff, model-free RL for CTMDPs has become well-understood. However, in many practical applications, practitioners possess high-quality information about system rates derived from traditional queuing theory, which learning agents could potentially exploit to accelerate convergence. Despite this, classical RL algorithms for CTMDPs typically re-learn these parameters through sampling. In this work, we propose continuous-time reward machines (CTRMs), a novel framework that embeds reward functions and real-time state-action dynamics into a unified structure. CTRMs enable RL agents to effectively navigate dense-time environments while leveraging reward shaping and counterfactual experiences for accelerated learning. Our empirical results demonstrate CTRMs' ability to improve learning efficiency in time-critical environments.
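
For context, the following is a minimal sketch of the classical Bradtke–Duff-style Q-learning update for a CTMDP (not the CTRM construction itself): sojourn times are exponentially distributed, and reward accrues at a state-action reward rate under continuous discounting. The exit rates, reward rates, and transition table are illustrative assumptions.

```python
# Minimal sketch of SMDP/CTMDP Q-learning with continuous-time discounting.
import math, random

rate  = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}   # exit rates
rrate = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 0.0, (1, 1): 2.0}   # reward rates
nxt   = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 0}           # deterministic jumps
beta  = 0.5                                                    # continuous discount rate

Q = {sa: 0.0 for sa in rate}
alpha, eps = 0.05, 0.1
s = 0
for _ in range(100_000):
    acts = [a for (st, a) in Q if st == s]
    a = random.choice(acts) if random.random() < eps else max(acts, key=lambda a: Q[(s, a)])
    tau = random.expovariate(rate[(s, a)])          # sampled sojourn time
    s2 = nxt[(s, a)]
    # discounted reward accrued over the sojourn, plus discounted continuation
    g = (1.0 - math.exp(-beta * tau)) / beta * rrate[(s, a)]
    target = g + math.exp(-beta * tau) * max(Q[(s2, b)] for (st, b) in Q if st == s2)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```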

AILAW Journal 2025 Journal Article

Performance of LLMs on VITA test: potential for AI-assisted tax returns for low income taxpayers

  • Sina Gogani-Khiabani
  • Ashutosh Trivedi
  • ShinPing Chyi
  • Saeid Tizpaz-Niari

This paper investigates the performance of a diverse set of large language models (LLMs), including leading closed-source (GPT-4, GPT-4o mini, Claude 3.5 Haiku) and open-source (Llama 3.1 70B, Llama 3.1 8B) models alongside the earlier GPT-3.5, in the context of U.S. tax resolutions. AI-driven solutions like these have made substantial inroads into legal-critical systems with significant socio-economic implications. However, their accuracy and reliability have not been assessed in some legal domains, such as tax. Using the Volunteer Income Tax Assistance (VITA) certification tests—endorsed by the US Internal Revenue Service (IRS) for tax volunteering—this study compares these LLMs to evaluate their potential utility in assisting both tax volunteers and taxpayers, particularly those with low and moderate incomes. Since the answers to these questions are not publicly available, we first analyze 130 questions with tax domain experts and develop ground truths for each question. We then benchmark these diverse LLMs against the ground truths using both the original VITA questions and syntactically perturbed versions (390 questions in total) to distinguish genuine understanding from memorization and hallucination. Our comparative analysis reveals distinct performance differences: the closed-source models (GPT-4, Claude 3.5 Haiku, GPT-4o mini) generally demonstrated higher accuracy and robustness than GPT-3.5 and the open-source Llama models. For instance, on basic multiple-choice questions, top models like GPT-4 and Claude 3.5 Haiku achieved 83.33% accuracy, surpassing GPT-3.5 (54.17%) and the open-source Llama 3.1 8B (50.00%). These findings generally hold across both original and perturbed questions. However, these developments are initial indicators, and further research is necessary to fully understand the implications of deploying LLMs in this domain. A critical limitation observed across all evaluated models was significant difficulty with open-ended questions, which require accurate numerical calculation and application of tax rules. We hope that this paper provides a means and a standard to evaluate the efficacy of current and future LLMs in the tax domain.
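
The following is a minimal sketch of the kind of evaluation harness such a study implies: score each model's answers against expert ground truths on both the original and perturbed question variants. The questions, answers, and `query_model` stub are hypothetical placeholders, not the paper's data or code.

```python
# Minimal evaluation-harness sketch; query_model is a hypothetical stub.
from collections import defaultdict

questions = [
    {"id": 1, "variant": "original",  "text": "Q1 ...",  "truth": "B"},
    {"id": 1, "variant": "perturbed", "text": "Q1' ...", "truth": "B"},
    {"id": 2, "variant": "original",  "text": "Q2 ...",  "truth": "D"},
]

def query_model(model: str, text: str) -> str:
    # Placeholder: a real harness would call the model's API here.
    return "B"

def evaluate(models):
    acc = defaultdict(lambda: defaultdict(list))
    for m in models:
        for q in questions:
            correct = query_model(m, q["text"]).strip().upper() == q["truth"]
            acc[m][q["variant"]].append(correct)
    for m in models:
        for variant, hits in acc[m].items():
            print(f"{m:12s} {variant:9s} accuracy = {100 * sum(hits) / len(hits):.2f}%")

evaluate(["gpt-4", "llama-3.1-8b"])
```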

AAAI Conference 2024 Conference Paper

A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs

  • Mateo Perez
  • Fabio Somenzi
  • Ashutosh Trivedi

Linear temporal logic (LTL) and omega-regular objectives, a superset of LTL, have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.
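
As a rough illustration of the model-based PAC flavor (not the paper's bound), the sketch below estimates transition probabilities from samples, with a Hoeffding-style count of how many samples per state-action pair suffice. The paper's actual analysis also involves the epsilon-recurrence time, which is not modeled here.

```python
# Minimal sketch: Hoeffding-style sample counts for PAC transition estimation.
import math
import random
from collections import Counter

def samples_needed(epsilon: float, delta: float) -> int:
    # Hoeffding: n >= ln(2/delta) / (2 * epsilon^2) per estimated probability
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def estimate_transition(sample_next_state, n: int):
    counts = Counter(sample_next_state() for _ in range(n))
    return {s: c / n for s, c in counts.items()}

n = samples_needed(epsilon=0.05, delta=0.01)
print(f"{n} samples per state-action pair")
# Toy sampler: next state is 0 with prob 2/3, 1 with prob 1/3.
print(estimate_transition(lambda: random.choice([0, 0, 1]), n))
```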

AAAI Conference 2024 Conference Paper

Assume-Guarantee Reinforcement Learning

  • Milad Kazemi
  • Mateo Perez
  • Fabio Somenzi
  • Sadegh Soudjani
  • Ashutosh Trivedi
  • Alvaro Velasquez

We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components, and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.
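
To give a feel for how per-component guarantees can be combined into a system-level bound, the sketch below shows two standard combinations with made-up local probabilities. Whether the paper's lower bound takes the product form or a union-bound form is not restated here; both appear purely as illustrations.

```python
# Minimal sketch: combining local satisfaction probabilities (illustrative only).
from math import prod

local = {"component_A": 0.95, "component_B": 0.90, "component_C": 0.99}

# Product form: valid when local satisfaction events are independent.
product_bound = prod(local.values())

# Union bound: needs no independence (may be vacuous for weak local guarantees).
union_bound = max(0.0, 1.0 - sum(1.0 - p for p in local.values()))

print(f"product bound: {product_bound:.3f}, union bound: {union_bound:.3f}")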

AAAI Conference 2024 Conference Paper

Neural Closure Certificates

  • Alireza Nadali
  • Vishnu Murali
  • Ashutosh Trivedi
  • Majid Zamani

Notions of transition invariants and closure certificates have seen recent use in the formal verification of controlled dynamical systems against ω-regular properties. Unfortunately, existing approaches face limitations in two directions. First, they require a closed-form mathematical expression representing the model of the system. Such an expression may be difficult to find, too complex to be of any use, or unavailable due to security or privacy constraints. Second, finding such invariants typically relies on optimization techniques such as sum-of-squares (SOS) or satisfiability modulo theory (SMT) solvers. This restricts the classes of systems that can be formally verified. To address these drawbacks, we introduce a notion of neural closure certificates. We present a data-driven algorithm that trains a neural network to represent a closure certificate. Our approach is formally correct under some mild assumptions, i.e., one is able to formally show that the unknown system satisfies the ω-regular property of interest if a neural closure certificate can be computed. Finally, we demonstrate the efficacy of our approach with relevant case studies.
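
The sketch below illustrates the general data-driven recipe only: train a small network so that a certificate-style decrease condition holds on sampled transitions, penalizing violations with a hinge loss. The dynamics, the specific condition V(x') <= V(x) - margin, and all constants are illustrative assumptions, not the paper's closure-certificate conditions.

```python
# Minimal sketch: training a neural certificate candidate from sampled data.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def dynamics(x):                       # "unknown" system: here, a stable linear map
    A = torch.tensor([[0.9, 0.1], [-0.1, 0.8]])
    return x @ A.T

for step in range(2000):
    x = 4 * torch.rand(256, 2) - 2     # sample states from the region of interest
    x_next = dynamics(x)
    margin = 0.01
    # hinge loss: positive only where the decrease condition is violated
    loss = torch.relu(net(x_next) - net(x) + margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final violation loss:", float(loss))
# A sound procedure would follow training with a formal check (e.g., an SMT
# query or a Lipschitz argument) that the condition holds everywhere.
```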

AAAI Conference 2024 Conference Paper

Omega-Regular Decision Processes

  • Ernst Moritz Hahn
  • Mateo Perez
  • Sven Schewe
  • Fabio Somenzi
  • Ashutosh Trivedi
  • Dominik Wojtczak

Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions is extended to an omega-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that, if the promised lookaheads are not met, then the payoff to the decision maker is falsum (least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.
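
The lexicographic ordering the reduction targets can be shown in a few lines: action values are compared as tuples, with the promise-satisfaction component strictly dominating collected reward. The values below are made up; in the actual reduction both components are learned over a finite product MDP.

```python
# Minimal sketch of lexicographic action selection.
Q = {  # action -> (promise-satisfaction value, discounted-reward value)
    "a1": (1.0, 3.2),
    "a2": (0.8, 9.9),   # higher reward, but the promise component loses
    "a3": (1.0, 4.1),
}

best = max(Q, key=lambda a: Q[a])   # tuple comparison is lexicographic in Python
print("lexicographically optimal action:", best)   # a3
```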

AAAI Conference 2023 Conference Paper

Correct-by-Construction Reinforcement Learning of Cardiac Pacemakers from Duration Calculus Requirements

  • Kalyani Dole
  • Ashutosh Gupta
  • John Komp
  • Shankaranarayanan Krishna
  • Ashutosh Trivedi

As the complexity of pacemaker devices continues to grow, the importance of formally capturing their functional correctness requirements cannot be overstated. The pacemaker system specification document by Boston Scientific provides a widely accepted set of specifications for pacemakers. As these specifications are written in a natural language, they are not amenable to automated verification, synthesis, or reinforcement learning of pacemaker systems. This paper presents a formalization of these requirements for a dual-chamber pacemaker in duration calculus (DC), a highly expressive real-time specification language. The proposed formalization allows us to automatically translate pacemaker requirements into executable specifications as stopwatch automata, which can be used to enable simulation, monitoring, validation, verification, and automatic synthesis of pacemaker systems. The cyclic nature of the pacemaker-heart closed-loop system results in DC requirements that compile to a decidable subclass of stopwatch automata. We present shield reinforcement learning (shield RL), a shield-synthesis-based reinforcement learning algorithm that automatically constructs safety envelopes from DC specifications.
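
The shield idea itself is compact, as the sketch below shows: a safety filter (here a hand-written rule standing in for the envelope compiled from DC requirements) vets the agent's proposed actions before they reach the plant. The pacing domain, the 10-step rule, and the environment are illustrative assumptions, not the paper's model.

```python
# Minimal sketch of shield RL: unsafe proposed actions are overridden.
import random

ACTIONS = ["pace", "wait"]

def shield(state, proposed):
    # Safety envelope: never allow more than 10 steps without a beat.
    allowed = [a for a in ACTIONS if not (state["since_last_beat"] > 10 and a == "wait")]
    return proposed if proposed in allowed else allowed[0]

def env_step(state, action):
    beat = action == "pace" or random.random() < 0.3        # intrinsic beats occur too
    since = 0 if beat else state["since_last_beat"] + 1
    return {"since_last_beat": since}, (1.0 if beat else 0.0)

state, total = {"since_last_beat": 0}, 0.0
for _ in range(100):
    proposed = random.choice(ACTIONS)       # stand-in for the learner's policy
    action = shield(state, proposed)        # the shield overrides unsafe choices
    state, reward = env_step(state, action)
    total += reward                         # reward stream seen by the RL agent
print("beats observed:", total)
```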

AAMAS Conference 2023 Conference Paper

LTL-Based Non-Markovian Inverse Reinforcement Learning

  • Mohammad Afzal
  • Sankalp Gambhir
  • Ashutosh Gupta
  • Krishna S
  • Ashutosh Trivedi
  • Alvaro Velasquez

The successes of reinforcement learning in recent years are underpinned by the characterization of suitable reward functions. However, in settings where such rewards are non-intuitive, difficult to define, or otherwise error-prone in their definition, it is useful to instead learn the reward signal from expert demonstrations. This is the crux of inverse reinforcement learning (IRL). While eliciting learning requirements in the form of scalar reward signals has been shown to be effective, such representations lack explainability and lead to opaque learning. We aim to mitigate this situation by presenting a novel IRL method for eliciting declarative learning requirements in the form of a popular formal logic—Linear Temporal Logic (LTL)—from a set of traces given by the expert policy.
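
One ingredient of such a method can be sketched directly: scoring candidate requirements by how many expert traces they accept. The candidate "formulas" below are plain Python predicates over finite traces; a real implementation would evaluate LTL via automata over the demonstrations. Everything here is an illustrative assumption.

```python
# Minimal sketch: ranking candidate LTL-like requirements against expert traces.
traces = [
    ["start", "move", "goal"],
    ["start", "move", "move", "goal"],
    ["start", "move", "trap"],
]

candidates = {
    "F goal":       lambda t: "goal" in t,                      # eventually goal
    "G !trap":      lambda t: "trap" not in t,                  # never trap
    "!trap U goal": lambda t: "goal" in t and "trap" not in t[: t.index("goal")],
}

for name, holds in candidates.items():
    score = sum(holds(t) for t in traces) / len(traces)
    print(f"{name:12s} consistent with {score:.0%} of expert traces")
```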

AAMAS Conference 2023 Conference Paper

Reinforcement Learning with Depreciating Assets

  • Taylor Dohmen
  • Ashutosh Trivedi

A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate an equational characterization of optimality in this context, establish that optimal values and policies can be computed efficiently, and develop a model-free reinforcement learning approach to obtain optimal policies.
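
The depreciation semantics described above admits a one-line recurrence: each step, the accumulated asset is scaled by the discount factor before the new reward is added, so A(n+1) = lam * A(n) + r(n+1). The reward streams below are made-up illustrations; the paper's contribution is optimizing this quantity, which this snippet does not attempt.

```python
# Minimal sketch of the asset-depreciation recurrence.
def asset_trajectory(rewards, lam=0.9):
    assets, a = [], 0.0
    for r in rewards:
        a = lam * a + r      # previously held assets depreciate, new reward arrives
        assets.append(a)
    return assets

early = [5, 5, 0, 0, 0, 0, 0, 0]     # front-loaded rewards decay away
late  = [0, 0, 0, 0, 0, 0, 5, 5]     # recent rewards retain most of their value
print([round(x, 2) for x in asset_trajectory(early)])
print([round(x, 2) for x in asset_trajectory(late)])
```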

AAMAS Conference 2022 Conference Paper

Controller Synthesis for Omega-Regular and Steady-State Specifications

  • Alvaro Velasquez
  • Ismail Alkhouri
  • Andre Beckus
  • Ashutosh Trivedi
  • George Atia

Given a Markov decision process (MDP) and a linear-time (ω-regular or Linear Temporal Logic) specification which reasons about the infinite-trace behavior of a system, the controller synthesis problem aims to compute the optimal policy that satisfies said specification. Recently, problems that reason over the complementary infinite-frequency behavior of systems have been proposed through the lens of steady-state planning or steady-state policy synthesis. This entails finding a control policy for an MDP such that the Markov chain induced by the solution policy satisfies a given set of constraints on its steady-state distribution. This paper studies a generalization of the controller synthesis problem for a linear-time specification under steady-state constraints on the asymptotic behavior of the agent. We present an algorithm to find a deterministic policy satisfying ω-regular and steady-state constraints by characterizing the solutions as an integer linear program, and experimentally evaluate our approach.
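
For intuition, the sketch below shows the linear-programming view such encodings build on: optimize over a stationary occupation measure subject to flow conservation and a steady-state constraint. The paper's encoding is an integer linear program that additionally extracts a deterministic policy and handles the ω-regular part; the toy MDP, reward, and 30% constraint below are illustrative assumptions.

```python
# Minimal sketch: LP over a stationary occupation measure x(s, a).
import numpy as np
from scipy.optimize import linprog

P = np.array([               # P[s, a, s']: 2 states, 2 actions
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # reward per (s, a)
nS, nA = 2, 2

# Flow conservation: for each s', sum_a x(s',a) = sum_{s,a} P[s,a,s'] x(s,a)
A_eq = np.zeros((nS + 1, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (1.0 if s == sp else 0.0) - P[s, a, sp]
A_eq[nS, :] = 1.0                           # occupation measure sums to 1
b_eq = np.zeros(nS + 1)
b_eq[nS] = 1.0

# Steady-state constraint (illustrative): spend at least 30% of time in state 0
A_ub = np.zeros((1, nS * nA))
A_ub[0, 0:nA] = -1.0
b_ub = np.array([-0.3])

res = linprog(-r.flatten(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
assert res.success, res.message
print("steady-state occupation:\n", np.round(res.x.reshape(nS, nA), 3))
```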

NeurIPS Conference 2022 Conference Paper

Recursive Reinforcement Learning

  • Ernst Moritz Hahn
  • Mateo Perez
  • Sven Schewe
  • Fabio Somenzi
  • Ashutosh Trivedi
  • Dominik Wojtczak

Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning, a model-free RL algorithm for RMDPs, and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
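
The single-exit case can be sketched compactly: a simulator maintains the call stack of a toy recursive MDP while tabular Q-learning runs over (component, node) pairs with an undiscounted total-reward objective. In the single-exit case a node's value does not depend on the stack contents, which is what makes this tabular view meaningful; the RMDP below is an illustrative assumption, not the paper's algorithm or examples.

```python
# Minimal sketch of Q-learning on a single-exit recursive MDP.
import random

# Each action maps to [(prob, outcome)]; outcomes are ("move", next_node, reward),
# ("call", callee, return_node), or ("exit", reward).
RMDP = {
    ("main", "n0"): {"work": [(1.0, ("exit", 1.0))],
                     "call": [(1.0, ("call", "sub", "n1"))]},
    ("main", "n1"): {"go":   [(1.0, ("exit", 0.0))]},
    ("sub",  "m0"): {"do":   [(0.7, ("exit", 2.0)),
                              (0.3, ("call", "sub", "m1"))]},
    ("sub",  "m1"): {"go":   [(1.0, ("exit", 0.0))]},
}
ENTRY = {"main": "n0", "sub": "m0"}

def sample(dist):
    u, acc = random.random(), 0.0
    for p, out in dist:
        acc += p
        if u <= acc:
            return out
    return dist[-1][1]

Q = {sn: {a: 0.0 for a in acts} for sn, acts in RMDP.items()}
alpha, eps = 0.1, 0.2

for _ in range(20_000):
    comp, node, stack = "main", ENTRY["main"], []
    while True:
        acts = Q[(comp, node)]
        a = random.choice(list(acts)) if random.random() < eps else max(acts, key=acts.get)
        out = sample(RMDP[(comp, node)][a])
        if out[0] == "move":
            _, nxt, r = out
            target, comp2, node2 = r + max(Q[(comp, nxt)].values()), comp, nxt
        elif out[0] == "call":
            _, callee, ret = out
            # single-exit decomposition: call value = callee value + continuation value
            target = max(Q[(callee, ENTRY[callee])].values()) + max(Q[(comp, ret)].values())
            stack.append((comp, ret))
            comp2, node2 = callee, ENTRY[callee]
        else:                                            # ("exit", reward)
            target = out[1]
            if not stack:
                Q[(comp, node)][a] += alpha * (target - acts[a])
                break
            comp2, node2 = stack.pop()
        Q[(comp, node)][a] += alpha * (target - acts[a])
        comp, node = comp2, node2

print({k: {a: round(v, 2) for a, v in d.items()} for k, d in Q.items()})
```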

AAMAS Conference 2022 Conference Paper

Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning

  • Milad Kazemi
  • Mateo Perez
  • Fabio Somenzi
  • Sadegh Soudjani
  • Ashutosh Trivedi
  • Alvaro Velasquez

Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logic requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case due to the infinite-horizon objective that poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages which correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average reward objective for RL. Our reduction can be done on-the-fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms. Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically with various benchmarks that our proposed method of using average reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
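
The on-the-fly product the reduction relies on is easy to sketch: the learner's state is the environment observation paired with the current automaton state, and reward is emitted on accepting transitions. The toy automaton below (for the absolute liveness property "eventually goal") and the labeling are illustrative assumptions; the paper's construction uses automata derived from the given specification.

```python
# Minimal sketch of the on-the-fly product construction.
import random

# Two-state automaton for "eventually goal": q0 --goal--> q1,
# where q1 is an accepting sink (all its transitions accept).
def automaton_step(q, label):
    q2 = 1 if (q == 1 or label == "goal") else 0
    return q2, (q2 == 1)                         # next state, accepting flag

class ToyEnv:                                    # illustrative random-walk env
    def __init__(self):
        self.pos = 0
    def step(self, action):
        self.pos = max(0, min(3, self.pos + (1 if action == "right" else -1)))
        return "goal" if self.pos == 3 else "other"

env, q = ToyEnv(), 0
for _ in range(10):
    action = random.choice(["left", "right"])    # stand-in for the learner's policy
    label = env.step(action)
    q, accepting = automaton_step(q, label)
    reward = 1.0 if accepting else 0.0           # fed to an average-reward learner
    state = (env.pos, q)                         # product state seen by the learner
```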

AAAI Conference 2018 Conference Paper

Differential Performance Debugging With Discriminant Regression Trees

  • Saeid Tizpaz-Niari
  • Pavol Cerny
  • Bor-Yuh Evan Chang
  • Ashutosh Trivedi

Differential performance debugging is a technique to find performance problems. It applies in situations where the performance of a program is (unexpectedly) different for varying classes of inputs. The task is to explain the differences in asymptotic performance among various input classes in terms of program internals. We propose a data-driven technique based on discriminant regression tree (DRT) learning, where the goal is to discriminate among different classes of inputs. We propose a new algorithm for DRT learning that first clusters the data into functional clusters, capturing different asymptotic performance classes, and then invokes off-the-shelf decision tree learning algorithms to explain these clusters. We focus on linear functional clusters and adapt classical clustering algorithms (K-means and spectral) to produce them. For the K-means algorithm, we generalize the notion of the cluster centroid from a point to a linear function. We adapt spectral clustering by defining a novel kernel function to capture the notion of “linear” similarity between two data points. We evaluate our approach on benchmarks consisting of Java programs where we are interested in debugging performance. We show that our algorithm outperforms other well-known regression tree learning algorithms in terms of running time and accuracy of classification.
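
The K-means generalization described above fits in a short sketch: each cluster's "centroid" is a linear function fitted by least squares, and points are reassigned to the line with the smallest residual. The synthetic data mixes two performance classes; all constants are illustrative assumptions.

```python
# Minimal sketch of K-means with linear-function centroids.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = np.where(rng.random(n) < 0.5, 2 * x + 1, 7 * x - 3) + rng.normal(0, 0.5, n)

K = 2
coef = rng.normal(size=(K, 2))                        # each row: (slope, intercept)
X = np.column_stack([x, np.ones(n)])

for _ in range(20):
    residuals = np.abs(y[:, None] - X @ coef.T)       # distance to each line
    assign = residuals.argmin(axis=1)                 # assignment step
    for k in range(K):                                # update step: refit each line
        mask = assign == k
        if mask.any():
            coef[k], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)

print(np.round(coef, 2))   # recovered (slope, intercept) pairs
```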

GandALF Workshop 2014 Workshop Paper

Improved Undecidability Results for Reachability Games on Recursive Timed Automata

  • Shankara Narayanan Krishna
  • Lakshmi Manasa
  • Ashutosh Trivedi

We study reachability games on recursive timed automata (RTA), which generalize Alur-Dill timed automata with a recursive procedure invocation mechanism similar to that of recursive state machines. It is known that deciding the winner in reachability games on RTA is undecidable for automata with two or more clocks, while the problem is decidable for automata with only one clock. Ouaknine and Worrell recently proposed a time-bounded theory of real-time verification, claiming that the restriction to bounded time recovers decidability for several key decision problems related to real-time verification. We revisit games on recursive timed automata under this time-bounded restriction in the hope of recovering decidability. However, we find that the problem remains undecidable for recursive timed automata with three or more clocks. Using similar proof techniques, we characterize a decidability frontier for a generalization of RTA to recursive stopwatch automata.

Highlights Conference 2013 Conference Abstract

On the decidability of priced timed games

  • Thomas Brihaye
  • Gilles Geeraerts
  • S. Krishna
  • Lakshmi Manasa
  • Ashutosh Trivedi

Two-player zero-sum games on finite automata are a classical metaphor for controller synthesis. An extension of this model to time-critical systems is two-player games on priced timed automata. For those games, the optimal reachability synthesis problem asks to compute a strategy (for one of the players) that guarantees to reach a given target with a total cost satisfying some optimality constraint. While some versions of this problem have already been shown undecidable, we propose to complete the global picture by studying other variants (such as the time-bounded version).