Arrow Research search

Author name cluster

Kavitha Srinivas

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

AAAI Conference 2026 System Paper

QueryGym: Step-by-Step Interaction with Relational Databases

  • Haritha Ananthakrishnan
  • Harsha Kokel
  • Kelsey Sikes
  • Debarun Bhattacharjya
  • Michael Katz
  • Shirin Sohrabi
  • Kavitha Srinivas

We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations (including schema details, intermediate results, and execution feedback) and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation.
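As a rough illustration of what an explicit sequence of relational algebra operations looks like, the sketch below applies filter, project, and join to small in-memory tables. It is not QueryGym's interface: the function names, the table contents, and the `run_plan` helper are all hypothetical and chosen only for this example.

```python
from typing import Any, Callable

# Hypothetical in-memory representation: a table is a list of row dicts.
Row = dict[str, Any]
Table = list[Row]


def filter_rows(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Relational selection: keep only rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table: Table, columns: list[str]) -> Table:
    """Relational projection: keep only the named columns."""
    return [{col: row[col] for col in columns} for row in table]


def join(left: Table, right: Table, left_key: str, right_key: str) -> Table:
    """Equi-join via nested loops: merge rows whose key values match."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if l_row[left_key] == r_row[right_key]
    ]


# Made-up example data, not taken from any benchmark.
customers: Table = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Lin"},
]
orders: Table = [
    {"order_id": 10, "cust": 1, "total": 250},
    {"order_id": 11, "cust": 2, "total": 40},
    {"order_id": 12, "cust": 1, "total": 90},
]


def run_plan() -> Table:
    """An explicit three-step plan; each intermediate result is inspectable."""
    step1 = filter_rows(orders, lambda r: r["total"] > 50)
    step2 = join(step1, customers, "cust", "customer_id")
    return project(step2, ["name", "total"])


if __name__ == "__main__":
    print(run_plan())
```

Because every step returns a table, an agent (or a person debugging one) can examine the intermediate results before choosing the next operation, which is the kind of transparency the abstract describes.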

AAAI Conference 2025 Conference Paper

ACPBench: Reasoning About Action, Change, and Planning

  • Harsha Kokel
  • Michael Katz
  • Kavitha Srinivas
  • Shirin Sohrabi

There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multistep reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 21 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions.

AAAI Conference 2025 Short Paper

Automating Thought of Search: A Journey Towards Soundness and Completeness (Student Abstract)

  • Daniel Cao
  • Michael Katz
  • Harsha Kokel
  • Kavitha Srinivas
  • Shirin Sohrabi

Large language models (LLMs) now turn their attention to search. Recently, Thought of Search (ToS) proposed defining the search space with code, having an LLM produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test, achieving impressive 100% accuracy on all the tested datasets. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.

TMLR Journal 2025 Journal Article

Language Models Are Good Tabular Learners

  • Zhenhan Huang
  • Kavitha Srinivas
  • Horst Samulowitz
  • Niharika S. D'Souza
  • Charu C. Aggarwal
  • Pin-Yu Chen
  • Jianxi Gao

Transformer-based language models have become the de facto standard in natural language processing. However, they underperform in the tabular data domain compared to traditional tree-based methods. We posit that current models fail to achieve the full potential of language models due to (i) heterogeneity of tabular data; and (ii) challenges faced by the model in interpreting numerical values. Based on this hypothesis, we propose the Tabular Domain Transformer (TDTransformer) framework. TDTransformer has distinct embedding processes for different types of columns. The alignment layers for different column types transform these embeddings to a common space. In addition, TDTransformer adapts piece-wise linear encoding for numerical values for better performance. We test the proposed method on 76 real-world tabular classification datasets from the OpenML benchmark. Extensive experiments indicate that TDTransformer significantly improves upon state-of-the-art methods.
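The abstract names piece-wise linear encoding without giving a formula. The sketch below shows one common variant, in which a number is described by how far it fills each of several bins, with every component clipped to [0, 1]. The bin edges, the clipping choice, and the function name are assumptions for illustration and may differ from what TDTransformer actually uses.

```python
def piecewise_linear_encode(x: float, edges: list[float]) -> list[float]:
    """Encode a scalar as one value per bin (assumed, clipped variant).

    For a bin [lo, hi): 0.0 if x is below the bin, 1.0 if x is above it,
    and the fractional position inside the bin otherwise.
    `edges` must be strictly increasing.
    """
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        fraction = (x - lo) / (hi - lo)
        encoded.append(min(1.0, max(0.0, fraction)))
    return encoded


if __name__ == "__main__":
    # Hypothetical bin edges; in practice these might come from quantiles.
    edges = [0.0, 10.0, 20.0, 30.0]
    print(piecewise_linear_encode(15.0, edges))
```

With the edges above, the value 15.0 fills the first bin completely, the second bin halfway, and the third not at all, so nearby numbers receive nearby encodings instead of unrelated tokens.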

AAAI Conference 2024 Conference Paper

Generalized Planning in PDDL Domains with Pretrained Large Language Models

  • Tom Silver
  • Soham Dan
  • Kavitha Srinivas
  • Joshua B. Tenenbaum
  • Leslie Kaelbling
  • Michael Katz

Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization.

ICAPS Conference 2024 Conference Paper

Large Language Models as Planning Domain Generators

  • James T. Oswald
  • Kavitha Srinivas
  • Harsha Kokel
  • Junkyu Lee
  • Michael Katz
  • Shirin Sohrabi

Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate if large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains, and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at https://github.com/IBM/NL2PDDL.

AAAI Conference 2024 Short Paper

Large Language Models as Planning Domain Generators (Student Abstract)

  • James Oswald
  • Kavitha Srinivas
  • Harsha Kokel
  • Junkyu Lee
  • Michael Katz
  • Shirin Sohrabi

The creation of planning models, and in particular domain models, is among the last bastions of tasks that require extensive manual labor in AI planning; it is desirable to simplify this process for the sake of making planning more accessible. To this end, we investigate whether large language models (LLMs) can be used to generate planning domain models from textual descriptions. We propose a novel task for this as well as a means of automated evaluation for generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains. Our results show that LLMs, particularly larger ones, exhibit some level of proficiency in generating correct planning domains from natural language descriptions.

PRL Workshop 2024 Workshop Paper

Planning with Language Models Through The Lens of Efficiency

  • Michael Katz
  • Harsha Kokel
  • Kavitha Srinivas
  • Shirin Sohrabi

We analyse the cost of using LLMs for planning and highlight that recent trends are profoundly uneconomical. We propose a significantly more efficient approach and argue for a responsible use of compute resources, urging the research community to investigate LLM-based approaches that uphold efficiency.

NeurIPS Conference 2024 Conference Paper

Thought of Search: Planning with Language Models Through The Lens of Efficiency

  • Michael Katz
  • Harsha Kokel
  • Kavitha Srinivas
  • Shirin Sohrabi

Among the most important properties of algorithms investigated in computer science are soundness, completeness, and complexity. These properties, however, are rarely analyzed for the vast collection of recently proposed methods for planning with large language models. In this work, we alleviate this gap. We analyse these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness for the sake of inefficiency. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We exemplify on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve these problems. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. In contrast, the compared approaches require hundreds of thousands of calls and achieve significantly lower accuracy. We argue for a responsible use of compute resources, urging the research community to investigate sound and complete LLM-based approaches that uphold efficiency.
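To make "code for the search components" concrete, here is a minimal sketch of a successor function and a goal test plugged into an ordinary breadth-first search. The toy puzzle (reach a target integer using "add 3" and "double"), the `limit` bound, and all function names are invented for this example; they are not among the problems or code evaluated in the paper.

```python
from collections import deque
from typing import Optional


def successors(state: int, limit: int = 100) -> list[tuple[str, int]]:
    """Successor function: every legal move from `state` (toy puzzle)."""
    candidates = [("add 3", state + 3), ("double", state * 2)]
    # The assumed bound keeps the state space finite.
    return [(name, nxt) for name, nxt in candidates if nxt <= limit]


def is_goal(state: int, target: int) -> bool:
    """Goal test: has the target value been reached?"""
    return state == target


def bfs(start: int, target: int) -> Optional[list[str]]:
    """Breadth-first search over the space defined by the two components.

    Returns a shortest list of move names, or None if the target is
    unreachable within the bounded state space.
    """
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state, target):
            return path
        for name, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [name]))
    return None


if __name__ == "__main__":
    print(bfs(1, 11))
```

In this arrangement the search algorithm itself is fixed and exhaustive, so the quality of the answers depends on whether the two small functions are correct: the successor function must generate only legal moves and must not omit any. Once those functions exist, solving further instances needs no additional model calls.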

IJCAI Conference 2023 Conference Paper

Action Space Reduction for Planning Domains

  • Harsha Kokel
  • Junkyu Lee
  • Michael Katz
  • Kavitha Srinivas
  • Shirin Sohrabi

Planning tasks succinctly represent labeled transition systems, with each ground action corresponding to a label. This granularity, however, is not necessary for solving planning tasks and can be harmful, especially for model-free methods. In order to apply such methods, the label sets are often manually reduced. In this work, we propose automating this manual process. We characterize a valid label reduction for classical planning tasks and propose an automated way of obtaining such valid reductions by leveraging lifted mutex groups. Our experiments show a significant reduction in the action label space size across a wide collection of planning domains. We demonstrate the benefit of our automated label reduction in two separate use cases: improved sample complexity of model-free reinforcement learning algorithms and speeding up successor generation in lifted planning. The code and supplementary material are available at https://github.com/IBM/Parameter-Seed-Set.

AAAI Conference 2023 System Paper

CodeStylist: A System for Performing Code Style Transfer Using Neural Networks

  • Chih-Kai Ting
  • Karl Munson
  • Serenity Wade
  • Anish Savla
  • Kiran Kate
  • Kavitha Srinivas

Code style refers to attributes of computer programs that affect their readability, maintainability, and performance. Enterprises consider code style as important and enforce style requirements during code commits. Tools that assist in coding style compliance and transformations are highly valuable. However, many key aspects of programming style transfer are difficult to automate, as it can be challenging to specify the patterns required to perform the transfer algorithmically. In this paper, we describe a system called CodeStylist which uses neural methods to perform style transfer on code.

PRL Workshop 2023 Workshop Paper

Generalized Planning in PDDL Domains with Pretrained Large Language Models

  • Tom Silver
  • Soham Dan
  • Kavitha Srinivas
  • Joshua B. Tenenbaum
  • Leslie Pack Kaelbling
  • Michael Katz

Recent work has considered whether large language models (LLMs) can function as planners: given a task, generate a plan. We investigate whether LLMs can serve as generalized planners: given a domain and training tasks, generate a program that efficiently produces plans for other tasks in the domain. In particular, we consider PDDL domains and use GPT-4 to synthesize Python programs. We also consider (1) Chain-of-Thought (CoT) summarization, where the LLM is prompted to summarize the domain and propose a strategy in words before synthesizing the program; and (2) automated debugging, where the program is validated with respect to the training tasks, and in case of errors, the LLM is re-prompted with four types of feedback. We evaluate this approach in seven PDDL domains and compare it to four ablations and four baselines. Overall, we find that GPT-4 is a surprisingly powerful generalized planner. We also conclude that automated debugging is very important, that CoT summarization has non-uniform impact, that GPT-4 is far superior to GPT-3.5, and that just two training tasks are often sufficient for strong generalization.

IJCAI Conference 2023 Conference Paper

SemFORMS: Automatic Generation of Semantic Transforms By Mining Data Science Code

  • Ibrahim Abdelaziz
  • Julian Dolby
  • Udayan Khurana
  • Horst Samulowitz
  • Kavitha Srinivas

Careful choice of feature transformations in a dataset can help predictive model performance, data understanding and data exploration. However, finding useful features is a challenge, and while recent Automated Machine Learning (AutoML) systems provide some limited automation for feature engineering or data exploration, it is still mostly done by humans. We demonstrate a system called SemFORMS (Semantic Transforms), which attempts to mine useful expressions for a dataset from access to a repository of code that may target the same or a similar dataset. In many enterprises, numerous data scientists often work on the same or similar datasets, but are largely unaware of each other's work. SemFORMS finds appropriate code from such a repository, and normalizes the code to be an actionable transform that can be prepended to any AutoML pipeline. We demonstrate SemFORMS operating over example datasets from the OpenML benchmarks where it sometimes leads to significant improvements in AutoML performance.

AAAI Conference 2022 Conference Paper

Can Machines Read Coding Manuals Yet? – A Benchmark for Building Better Language Models for Code Understanding

  • Ibrahim Abdelaziz
  • Julian Dolby
  • Jamie McCusker
  • Kavitha Srinivas

Code understanding is an increasingly important application of Artificial Intelligence. A fundamental aspect of understanding code is understanding text about code, e.g., documentation and forum discussions. Pre-trained language models (e.g., BERT) are a popular approach for various NLP tasks, and there are now a variety of benchmarks, such as GLUE, to help improve the development of such models for natural language understanding. However, little is known about how well such models work on textual artifacts about code, and we are unaware of any systematic set of downstream tasks for such an evaluation. In this paper, we derive a set of benchmarks (BLANCA - Benchmarks for LANguage models on Coding Artifacts) that assess code understanding based on tasks such as predicting the best answer to a question in a forum post, finding related forum posts, or predicting classes related in a hierarchy from class documentation. We evaluate performance of current state-of-the-art language models on these tasks and show that there is significant improvement on each task from fine tuning. We also show that multi-task training over BLANCA tasks helps build better language models for code understanding.

AAAI Conference 2022 Short Paper

How to Reduce Action Space for Planning Domains? (Student Abstract)

  • Harsha Kokel
  • Junkyu Lee
  • Michael Katz
  • Shirin Sohrabi
  • Kavitha Srinivas

While AI planning and Reinforcement Learning (RL) solve sequential decision-making problems, they are based on different formalisms, which leads to a significant difference in their action spaces. When solving planning problems using RL algorithms, we have observed that a naive translation of the planning action space incurs severe degradation in sample complexity. In practice, those action spaces are often engineered manually in a domain-specific manner. In this abstract, we present a method that reduces the parameters of operators in AI planning domains by introducing a parameter seed set problem and casting it as a classical planning task. Our experiment shows that our proposed method significantly reduces the number of actions in the RL environments originating from AI planning domains.

IJCAI Conference 2022 Conference Paper

Knowledge-Based News Event Analysis and Forecasting Toolkit

  • Oktie Hassanzadeh
  • Parul Awasthy
  • Ken Barker
  • Onkar Bhardwaj
  • Debarun Bhattacharjya
  • Mark Feblowitz
  • Lee Martie
  • Jian Ni

We present a toolkit for knowledge-based news event analysis and forecasting. The toolkit is powered by a Knowledge Graph (KG) of events curated from structured and unstructured sources of event-related knowledge. The toolkit provides functions for 1) mapping ongoing news headlines to concepts in the KG, 2) retrieval, reasoning, and visualization for causal analysis and forecasting, and 3) extraction of causal knowledge from text documents to augment the KG with additional domain knowledge. Each function has a number of implementations using a wide range of state-of-the-art neuro-symbolic techniques. We show how the toolkit enables building a human-in-the-loop explainable solution for event analysis and forecasting.

AAAI Conference 2022 System Paper

Semantic Feature Discovery with Code Mining and Semantic Type Detection

  • Kavitha Srinivas
  • Takaaki Tateishi
  • Daniel Karl I. Weidele
  • Udayan Khurana
  • Horst Samulowitz
  • Toshihiro Takahashi
  • Dakuo Wang
  • Lisa Amini

In recent years, the automation of machine learning and data science (AutoML) has attracted significant attention. One under-explored dimension of AutoML is being able to automatically utilize domain knowledge (such as semantic concepts and relationships) located in historical code or literature from the problem’s domain. In this paper, we demonstrate Semantic Feature Discovery, which enables users to interactively explore features semantically discovered from existing data science code and external knowledge. It does so by detecting semantic concepts for a given dataset, and then using these concepts to determine relevant feature engineering operations from historical code and knowledge.

AAAI Conference 2021 Conference Paper

A Deep Reinforcement Learning Approach to First-Order Logic Theorem Proving

  • Maxwell Crouse
  • Ibrahim Abdelaziz
  • Bassem Makni
  • Spencer Whitehead
  • Cristina Cornelio
  • Pavan Kapanipathi
  • Kavitha Srinivas
  • Veronika Thost

Automated theorem provers have traditionally relied on manually tuned heuristics to guide how they perform proof search. Deep reinforcement learning has been proposed as a way to obviate the need for such heuristics; however, its deployment in automated theorem proving remains a challenge. In this paper we introduce TRAIL, a system that applies deep reinforcement learning to saturation-based theorem proving. TRAIL leverages (a) a novel neural representation of the state of a theorem prover and (b) a novel characterization of the inference selection process in terms of an attention-based action policy. We show through systematic analysis that these mechanisms allow TRAIL to significantly outperform previous reinforcement learning-based theorem provers on two benchmark datasets for first-order logic automated theorem proving (proving around 15% more theorems).

AAAI Conference 2021 System Paper

IBM Scenario Planning Advisor: A Neuro-Symbolic ERM Solution

  • Mark Feblowitz
  • Oktie Hassanzadeh
  • Michael Katz
  • Shirin Sohrabi
  • Kavitha Srinivas
  • Octavian Udrea

Scenario Planning is a commonly used Enterprise Risk Management (ERM) technique to help decision makers with long-term plans by considering multiple alternative futures. It is typically a manual, highly labor intensive process involving dozens of experts and hundreds to thousands of person-hours. We previously introduced a Scenario Planning Advisor prototype (Sohrabi et al. 2018a, b) that focuses on generating scenarios quickly based on expert-developed models. We present the evolution of that prototype into a full-scale, cloud-deployed ERM solution that: (i) can automatically (through NLP) create models from authoritative documents such as books, reports and articles, such that what typically took hundreds to thousands of person-hours can now be achieved in minutes to hours; (ii) can gather news and other feeds relevant to forces in the risk models and group them into storylines without any other user input; (iii) can generate scenarios at scale, starting with dozens of forces of interest from models with thousands of forces in seconds; (iv) provides interactive visualizations of scenario and force model graphs, including a full model editor in the browser. The SPA solution is deployed under a non-commercial use license at https://spa-service.draco.res.ibm.com and includes a user guide to help new users get started. A video demonstration is available at https://www.youtube.com/watch?v=Gd4CMKclkBY.

AAAI Conference 2021 Short Paper

Unsupervised Causal Knowledge Extraction from Text using Natural Language Inference (Student Abstract)

  • Manik Bhandari
  • Mark Feblowitz
  • Oktie Hassanzadeh
  • Kavitha Srinivas
  • Shirin Sohrabi

In this paper, we address the problem of extracting causal knowledge from text documents in a weakly supervised manner. We target use cases in decision support and risk management, where causes and effects are general phrases without any constraints. We present a method called CaKNowLI which only takes as input the text corpus and extracts a high-quality collection of cause-effect pairs in an automated way. We approach this problem using state-of-the-art natural language understanding techniques based on pre-trained neural models for Natural Language Inference (NLI). Finally, we evaluate the proposed method on existing and new benchmark data sets.

AAAI Conference 2020 System Paper

Causal Knowledge Extraction through Large-Scale Text Mining

  • Oktie Hassanzadeh
  • Debarun Bhattacharjya
  • Mark Feblowitz
  • Kavitha Srinivas
  • Michael Perrone
  • Shirin Sohrabi
  • Michael Katz

In this demonstration, we present a system for mining causal knowledge from large corpuses of text documents, such as millions of news articles. Our system provides a collection of APIs for causal analysis and retrieval. These APIs enable searching for the effects of a given cause and the causes of a given effect, as well as the analysis of existence of causal relation given a pair of phrases. The analysis includes a score that indicates the likelihood of the existence of a causal relation. It also provides evidence from an input corpus supporting the existence of a causal relation between input phrases. Our system uses generic unsupervised and weakly supervised methods of causal relation extraction that do not impose semantic constraints on causes and effects. We show example use cases developed for a commercial application in enterprise risk management.

IJCAI Conference 2019 Conference Paper

Answering Binary Causal Questions Through Large-Scale Text Mining: An Evaluation Using Cause-Effect Pairs from Human Experts

  • Oktie Hassanzadeh
  • Debarun Bhattacharjya
  • Mark Feblowitz
  • Kavitha Srinivas
  • Michael Perrone
  • Shirin Sohrabi
  • Michael Katz

In this paper, we study the problem of answering questions of type "Could X cause Y?" where X and Y are general phrases without any constraints. Answering such questions will assist with various decision analysis tasks such as verifying and extending presumed causal associations used for decision making. Our goal is to analyze the ability of an AI agent built using state-of-the-art unsupervised methods in answering causal questions derived from collections of cause-effect pairs from human experts. We focus only on unsupervised and weakly supervised methods due to the difficulty of creating a large enough training set with a reasonable quality and coverage. The methods we examine rely on a large corpus of text derived from news articles, and include methods ranging from large-scale application of classic NLP techniques and statistical analysis to the use of neural network based phrase embeddings and state-of-the-art neural language models.