Author name cluster

Amit Sharma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Debargha Ganguly
Vikash Singh
Sreehari Sankar
Biyao Zhang
Xuecen Zhang
Srinivasan Iyengar
Xiaotian Han
Amit Sharma

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34. 8\% on logical tasks to -44. 5\% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e. g. , grammar entropy for logic, AUROC>0. 93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100\%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

PDF Details

ICLR Conference 2025 Conference Paper

Robust Root Cause Diagnosis using In-Distribution Interventions

Lokesh Nagalapatti
Ashutosh Srivastava
Sunita Sarawagi
Amit Sharma

Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria: 1) Anomaly: root cause nodes should take on anomalous values; 2) Fix: had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. We present a theoretical analysis comparing and bounding the errors in assessing the fix condition using interventional and counterfactual estimates. We then conduct experiments by systematically varying the SCM’s complexity to demonstrate the cases where IDI’s interventional approach outperforms the counterfactual approach and vice versa. Experiments on both synthetic and PetShop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. Code will be released at https://github.com/nlokeshiisc/IDI_release.

Details

TMLR Journal 2024 Journal Article

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Emre Kiciman
Robert Ness
Amit Sharma
Chenhao Tan

The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a "behavorial" study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and 4 outperform existing algorithms on a pairwise causal discovery task (97%, 13 points gain), counterfactual reasoning task (92%, 20 points gain) and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone, especially since LLMs generalize to novel datasets that were created after the training cutoff date. That said, LLMs exhibit unpredictable failure modes and we discuss the kinds of errors that may be improved and what are the fundamental limits of LLM-based answers. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques. Code and datasets are available at https://github.com/py-why/pywhy-llm.

PDF Details

TMLR Journal 2024 Journal Article

Deep End-to-end Causal Inference

Tomas Geffner
Javier Antoran
Adam Foster
Wenbo Gong
Chao Ma
Emre Kiciman
Amit Sharma
Angus Lamb

Causal inference is essential for data-driven decision-making across domains such as business engagement, medical treatment, and policy making. However, in practice, causal inference suffers from many limitations including unknown causal graphs, missing data problems, and mixed data types. To tackle those challenges, we develop Deep End-to-end Causal Inference (DECI) framework, a flow based non-linear additive noise model combined with variational inference, which can perform both Bayesian causal discovery and inference. Theoretically, we show that DECI unifies many existing structural equation model (SEM) based causal inference techniques and can recover the ground truth mechanism under standard assumptions. Motivated by the challenges in the real world, we further extend DECI to heterogeneous, mixed-type data with missing values, allowing for both continuous and discrete treatment decisions. Empirically, we conduct extensive experiments (over a thousand) to show the competitive performance of DECI when compared to relevant baselines for both causal discovery and inference with both synthetic and causal machine learning benchmarks across data types and levels of missingness.

PDF Details

ICLR Conference 2024 Conference Paper

Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals

Yair Ori Gat
Nitay Calderon
Amir Feder
Alexander Chapanin
Amit Sharma
Roi Reichart

Causal explanations of the predictions of NLP systems are essential to ensure safety and establish trust. Yet, existing methods often fall short of explaining model predictions effectively or efficiently and are often model-specific. In this paper, we address model-agnostic explanations, proposing two approaches for counterfactual (CF) approximation. The first approach is CF generation, where a large language model (LLM) is prompted to change a specific text concept while keeping confounding concepts unchanged. While this approach is demonstrated to be very effective, applying LLM at inference-time is costly. We hence present a second approach based on matching, and propose a method that is guided by an LLM at training-time and learns a dedicated embedding space. This space is faithful to a given causal graph and effectively serves to identify matches that approximate CFs. After showing theoretically that approximating CFs is required in order to construct faithful explanations, we benchmark our approaches and explain several models, including LLMs with billions of parameters. Our empirical results demonstrate the excellent performance of CF generation models as model-agnostic explainers. Moreover, our matching approach, which requires far less test-time resources, also provides effective explanations, surpassing many baselines. We also find that Top-K techniques universally improve every tested method. Finally, we showcase the potential of LLMs in constructing new benchmarks for model explanation and subsequently validate our conclusions. Our work illuminates new pathways for efficient and accurate approaches to interpreting NLP systems.

Details

NeurIPS Conference 2023 Conference Paper

Causal Effect Regularization: Automated Detection and Removal of Spurious Correlations

Abhinav Kumar
Amit Deshpande
Amit Sharma

In many classification datasets, the task labels are spuriously correlated with some input attributes. Classifiers trained on such datasets often rely on these attributes for prediction, especially when the spurious correlation is high, and thus fail togeneralize whenever there is a shift in the attributes’ correlation at deployment. If we assume that the spurious attributes are known a priori, several methods have been proposed to learn a classifier that is invariant to the specified attributes. However, in real-world data, information about spurious attributes is typically unavailable. Therefore, we propose a method that automatically identifies spurious attributes by estimating their causal effect on the label and then uses a regularization objective to mitigate the classifier’s reliance on them. Although causal effect of an attribute on the label is not always identified, we present two commonly occurring data-generating processes where the effect can be identified. Compared to recent work for identifying spurious attributes, we find that our method, AutoACER, ismore accurate in removing the attribute from the learned model, especially when spurious correlation is high. Specifically, across synthetic, semi-synthetic, and real-world datasets, AutoACER shows significant improvement in a metric used to quantify the dependence of a classifier on spurious attributes ($\Delta$Prob), while obtaining better or similar accuracy. Empirically we find that AutoACER mitigatesthe reliance on spurious attributes even under noisy estimation of causal effects or when the causal effect is not identified. To explain the empirical robustness of our method, we create a simple linear classification task with two sets of attributes: causal and spurious. Under this setting, we prove that AutoACER only requires the ranking of estimated causal effects to be correct across attributes to select thecorrect classifier.

PDF Details

TMLR Journal 2023 Journal Article

Machine Explanations and Human Understanding

Chacha Chen
Shi Feng
Amit Sharma
Chenhao Tan

Explanations are hypothesized to improve human understanding of machine learning models and achieve a variety of desirable outcomes, ranging from model debugging to enhancing human decision making. However, empirical studies have found mixed and even negative results. An open question, therefore, is under what conditions explanations can improve human understanding and in what way. To address this question, we first identify three core concepts that cover most existing quantitative measures of understanding: task decision boundary, model decision boundary, and model error. Using adapted causal diagrams, we provide a formal characterization of the relationship between these concepts and human approximations (i.e., understanding) of them. The relationship varies by the level of human intuition in different task types, such as emulation and discovery, which are often ignored when building or evaluating explanation methods. Our key result is that human intuitions are necessary for generating and evaluating machine explanations in human-AI decision making: without assumptions about human intuitions, explanations may improve human understanding of model decision boundary, but cannot improve human understanding of task decision boundary or model error. To validate our theoretical claims, we conduct human subject studies to show the importance of human intuitions. Together with our theoretical contributions, we provide a new paradigm for designing behavioral studies towards a rigorous view of the role of machine explanations across different tasks of human-AI decision making.

PDF Details

NeurIPS Conference 2022 Conference Paper

Probing Classifiers are Unreliable for Concept Removal and Detection

Abhinav Kumar
Chenhao Tan
Amit Sharma

Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation. Removing such concepts is non-trivial because of a complex relationship between the concept, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier when a concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.

PDF Details

AAAI Conference 2021 Conference Paper

The Importance of Modeling Data Missingness in Algorithmic Fairness: A Causal Perspective

Naman Goel
Alfonso Amayuelas
Amit Deshpande
Amit Sharma

Training datasets for machine learning often have some form of missingness. For example, to learn a model for deciding whom to give a loan, the available training data includes individuals who were given a loan in the past, but not those who were not. This missingness, if ignored, nullifies any fairness guarantee of the training procedure when the model is deployed. Using causal graphs, we characterize the missingness mechanisms in different real-world scenarios. We show conditions under which various distributions, used in popular fairness algorithms, can or can not be recovered from the training data. Our theoretical results imply that many of these algorithms can not guarantee fairness in practice. Modeling missingness also helps to identify correct design principles for fair algorithms. For example, in multi-stage settings where decisions are made in multiple screening rounds, we use our framework to derive the minimal distributions required to design a fair algorithm. Our proposed algorithm decentralizes the decision-making process and still achieves similar performance to the optimal algorithm that requires centralization and non-recoverable distributions.

PDF Details

IJCAI Conference 2013 Conference Paper

Algorithms for Generating Ordered Solutions for Explicit AND/OR Structures: Extended Abstract

Priyankar Ghosh
Amit Sharma
P. P. Chakrabarti
Pallab Dasgupta

We present algorithms for generating alternative solutions for explicit acyclic AND/OR structures in non-decreasing order of cost. Our algorithms use a best first search technique and report the solutions using an implicit representation ordered by cost. Experiments on randomly constructed AND/OR DAGs and problem domains including matrix chain multiplication, finding the secondary structure of RNA, etc, show that the proposed algorithms perform favorably to the existing approach in terms of time and space.

PDF Details DOI