Author name cluster

Thomas Icard

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

JMLR Journal 2025 Journal Article

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

  • Atticus Geiger
  • Duligur Ibeling
  • Amir Zur
  • Maheep Chaudhary
  • Sonakshi Chauhan
  • Jing Huang
  • Aryaman Arora
  • Zhengxuan Wu

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.

ICML Conference 2025 Conference Paper

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

  • Jing Huang 0014
  • Junyi Tao
  • Thomas Icard
  • Diyi Yang
  • Christopher Potts

Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

NeurIPS Conference 2023 Conference Paper

Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions

  • Duligur Ibeling
  • Thomas Icard

The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM, including those that violate algebraic principles implied by the SCM framework, emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this ameliorative perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.

TMLR Journal 2023 Journal Article

Holistic Evaluation of Language Models

  • Percy Liang
  • Rishi Bommasani
  • Tony Lee
  • Dimitris Tsipras
  • Dilara Soylu
  • Michihiro Yasunaga
  • Yian Zhang
  • Deepak Narayanan

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what’s missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios to the extent possible (87.5% of the time), ensuring that metrics beyond accuracy don’t fall to the wayside, and that trade-offs across models and metrics are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to more deeply analyze specific aspects (e.g. knowledge, reasoning, memorization/copyright, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on a set of core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings concerning the interplay between different scenarios, metrics, and models. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit for easily adding new scenarios, models, metrics, and prompting strategies. 
We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

NeurIPS Conference 2023 Conference Paper

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

  • Zhengxuan Wu
  • Atticus Geiger
  • Thomas Icard
  • Christopher Potts
  • Noah Goodman

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters, an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.

ICML Conference 2022 Conference Paper

Inducing Causal Structure for Interpretable Neural Networks

  • Atticus Geiger
  • Zhengxuan Wu
  • Hanson Lu
  • Josh Rozner
  • Elisa Kreiss
  • Thomas Icard
  • Noah D. Goodman
  • Christopher Potts

In many areas, we have well-founded insights about causal structure that would be useful to bring into our trained models while still allowing them to learn in a data-driven fashion. To achieve this, we present the new method of interchange intervention training (IIT). In IIT, we (1) align variables in a causal model (e.g., a deterministic program or Bayesian network) with representations in a neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input when aligned representations in both models are set to be the value they would be for a source input. IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model when its loss is zero. We evaluate IIT on a structural vision task (MNIST-PVR), a navigational language task (ReaSCAN), and a natural language inference task (MQNLI). We compare IIT against multi-task training objectives and data augmentation. In all our experiments, IIT achieves the best results and produces neural models that are more interpretable in the sense that they more successfully realize the target causal model.

NeurIPS Conference 2021 Conference Paper

A Topological Perspective on Causal Inference

  • Duligur Ibeling
  • Thomas Icard

This paper presents a topological learning-theoretic perspective on causal inference by introducing a series of topologies defined on general spaces of structural causal models (SCMs). As an illustration of the framework we prove a topological causal hierarchy theorem, showing that substantive assumption-free causal inference is possible only in a meager set of SCMs. Thanks to a known correspondence between open sets in the weak topology and statistically verifiable hypotheses, our results show that inductive assumptions sufficient to license valid causal inferences are statistically unverifiable in principle. Similar to no-free-lunch theorems for statistical inference, the present results clarify the inevitability of substantial assumptions for causal inference. An additional benefit of our topological approach is that it easily accommodates SCMs with infinitely many variables. We finally suggest that our framework may be helpful for the positive project of exploring and assessing alternative causal-inductive assumptions.

NeurIPS Conference 2021 Conference Paper

Causal Abstractions of Neural Networks

  • Atticus Geiger
  • Hanson Lu
  • Thomas Icard
  • Christopher Potts

Structural analysis methods (e.g., probing and feature attribution) are increasingly important tools for neural network analysis. We propose a new structural analysis method grounded in a formal theory of causal abstraction that provides rich characterizations of model-internal representations and their roles in input/output behavior. In this method, neural representations are aligned with variables in interpretable causal models, and then interchange interventions are used to experimentally verify that the neural representations have the causal properties of their aligned variables. We apply this method in a case study to analyze neural models trained on the Multiply Quantified Natural Language Inference (MQNLI) corpus, a highly complex NLI dataset that was constructed with a tree-structured natural logic causal model. We discover that a BERT-based model with state-of-the-art performance successfully realizes parts of the natural logic model’s causal structure, whereas a simpler baseline model fails to show any such structure, demonstrating that neural representations encode the compositional structure of MQNLI examples.
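The align-then-intervene procedure described above can be illustrated on a toy example. The sketch below is hypothetical and is not the authors' code or their MQNLI setup: the high-level causal model computes S = A + B and then S * C, and the "network" is a hand-written function whose hidden vector is assumed to hold A + B in unit 0 and C in unit 1. An interchange intervention runs the model on a base input while overwriting the aligned unit with the value it takes on a source input; the alignment is accepted here only if the low-level and high-level outputs agree for every base/source pair.

```python
from itertools import product

# Hypothetical toy setup, not taken from the paper.
# High-level causal model: S := A + B, OUT := S * C.


def high_level(a, b, c, s_value=None):
    s = a + b if s_value is None else s_value  # do(S = s_value) if given
    return s * c


def high_level_s(a, b, c):
    return a + b  # value the intermediate variable S takes on an input


# Low-level "network": hidden vector h, assumed to store [A + B, C].
def hidden(a, b, c):
    return [a + b, c]


def low_level(a, b, c, unit=None, value=None):
    h = hidden(a, b, c)
    if unit is not None:
        h[unit] = value  # overwrite one hidden unit
    return h[0] * h[1]


def interchange_agrees(base, source, unit):
    # Patch `unit` with its value on `source`; do the same for S upstairs.
    low = low_level(*base, unit=unit, value=hidden(*source)[unit])
    high = high_level(*base, s_value=high_level_s(*source))
    return low == high


INPUTS = list(product(range(3), repeat=3))


def alignment_holds(unit):
    # Exhaustive check over all base/source pairs of the small input space.
    return all(interchange_agrees(b, s, unit) for b in INPUTS for s in INPUTS)


print(alignment_holds(0))  # True: unit 0 plays the causal role of S
print(alignment_holds(1))  # False: unit 1 does not
```

Aligning S with unit 1 fails, for example, on base (1, 0, 1) and source (2, 0, 1): the patched network outputs 1 while the intervened causal model outputs 2.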

AIJ Journal 2020 Journal Article

Intention as commitment toward time

  • Marc van Zee
  • Dragan Doder
  • Leendert van der Torre
  • Mehdi Dastani
  • Thomas Icard
  • Eric Pacuit

In this paper we address the interplay among intention, time, and belief in dynamic environments. The first contribution is a logic for reasoning about intention, time and belief, in which assumptions of intentions are represented by preconditions of intended actions. Intentions and beliefs are coherent as long as these assumptions are not violated, i.e., as long as intended actions can be performed such that their preconditions hold as well. The second contribution is the formalization of what-if scenarios: what happens with intentions and beliefs if a new (possibly conflicting) intention is adopted, or a new fact is learned? An agent is committed to its intended actions as long as its belief-intention database is coherent. We conceptualize intention as commitment toward time and we develop AGM-based postulates for the iterated revision of belief-intention databases, and we prove a Katsuno-Mendelzon-style representation theorem.

AAAI Conference 2020 Conference Paper

Probabilistic Reasoning Across the Causal Hierarchy

  • Duligur Ibeling
  • Thomas Icard

We propose a formalization of the three-tier causal hierarchy of association, intervention, and counterfactuals as a series of probabilistic logical languages. Our languages are of strictly increasing expressivity, the first capable of expressing quantitative probabilistic reasoning (including conditional independence and Bayesian inference), the second encoding do-calculus reasoning for causal effects, and the third capturing a fully expressive do-calculus for arbitrary counterfactual queries. We give a corresponding series of finitary axiomatizations complete over both structural causal models and probabilistic programs, and show that satisfiability and validity for each language are decidable in polynomial space.
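Why the first two tiers come apart can be seen in a minimal sketch, assuming a toy structural causal model that is not taken from the paper: a fair-coin exogenous variable U and equations X := U, Y := U. Observing X = 1 makes Y = 1 certain, while intervening with do(X = 1) replaces the equation for X and leaves Y at its prior, so the conditional and interventional probabilities differ.

```python
from fractions import Fraction

# Exogenous distribution: U is a fair coin (assumed for illustration).
P_U = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def solve(u, do_x=None):
    # Structural equations X := U, Y := U; do(X = x) replaces X's equation.
    x = u if do_x is None else do_x
    y = u  # Y depends only on U, never on X
    return x, y


def p_y1_given_x1():
    # Tier 1 (association): condition on observing X = 1.
    joint = sum(p for u, p in P_U.items() if solve(u) == (1, 1))
    marginal = sum(p for u, p in P_U.items() if solve(u)[0] == 1)
    return joint / marginal


def p_y1_do_x1():
    # Tier 2 (intervention): set X = 1 for every value of U, then read off Y.
    return sum(p for u, p in P_U.items() if solve(u, do_x=1)[1] == 1)


print(p_y1_given_x1())  # 1
print(p_y1_do_x1())     # 1/2
```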

UAI Conference 2019 Conference Paper

On Open-Universe Causal Reasoning

  • Duligur Ibeling
  • Thomas Icard

We extend two kinds of causal models, structural equation models and simulation models, to infinite variable spaces. This enables a semantics of counterfactuals, calculus of intervention, and axiomatization of causal reasoning for rich, expressive generative models—including those in which a causal representation exists only implicitly—in an open-universe setting. Further, we show that under suitable restrictions the two kinds of models are equivalent, perhaps surprisingly since their conditional logics differ substantially in the general case. We give a series of complete axiomatizations in which the open-universe nature of the setting is seen to be essential.

IJCAI Conference 2018 Conference Paper

On the Conditional Logic of Simulation Models

  • Duligur Ibeling
  • Thomas Icard

We propose analyzing conditional reasoning by appeal to a notion of intervention on a simulation program, formalizing and subsuming a number of approaches to conditional thinking in the recent AI literature. Our main results include a series of axiomatizations, allowing comparison between this framework and existing frameworks (normality-ordering models, causal structural equation models), and a complexity result establishing NP-completeness of the satisfiability problem. Perhaps surprisingly, some of the basic logical principles common to all existing approaches are invalidated in our causal simulation approach. We suggest that this additional flexibility is important in modeling some intuitive examples.

TARK Conference 2017 Conference Paper

Indicative Conditionals and Dynamic Epistemic Logic

  • Wesley H. Holliday
  • Thomas Icard

Recent ideas about epistemic modals and indicative conditionals in formal semantics have significant overlap with ideas in modal logic and dynamic epistemic logic. The purpose of this paper is to show how greater interaction between formal semantics and dynamic epistemic logic in this area can be of mutual benefit. In one direction, we show how concepts and tools from modal logic and dynamic epistemic logic can be used to give a simple, complete axiomatization of Yalcin's [16] semantic consequence relation for a language with epistemic modals and indicative conditionals. In the other direction, the formal semantics for indicative conditionals due to Kolodny and MacFarlane [9] gives rise to a new dynamic operator that is very natural from the point of view of dynamic epistemic logic, allowing succinct expression of dependence (as in dependence logic) or supervenience statements. We prove decidability for the logic with epistemic modals and Kolodny and MacFarlane's indicative conditional via a full and faithful computable translation from their logic to the modal logic K45.

AAAI Conference 2017 Conference Paper

Preferential Structures for Comparative Probabilistic Reasoning

  • Matthew Harrison-Trainor
  • Wesley Holliday
  • Thomas Icard III

Qualitative and quantitative approaches to reasoning about uncertainty can lead to different logical systems for formalizing such reasoning, even when the language for expressing uncertainty is the same. In the case of reasoning about relative likelihood, with statements of the form ϕ ≿ ψ expressing that ϕ is at least as likely as ψ, a standard qualitative approach using preordered preferential structures yields a dramatically different logical system than a quantitative approach using probability measures. In fact, the standard preferential approach validates principles of reasoning that are incorrect from a probabilistic point of view. However, in this paper we show that a natural modification of the preferential approach yields exactly the same logical system as a probabilistic approach, not using single probability measures, but rather sets of probability measures. Thus, the same preferential structures used in the study of non-monotonic logics and belief revision may be used in the study of comparative probabilistic reasoning based on imprecise probabilities.

LORI Conference 2011 Conference Paper

Schematic Validity in Dynamic Epistemic Logic: Decidability

  • Wesley H. Holliday
  • Tomohiro Hoshi
  • Thomas Icard

Unlike standard modal logics, many dynamic epistemic logics are not closed under uniform substitution. The classic example is Public Announcement Logic (PAL), an extension of epistemic logic based on the idea of information acquisition as elimination of possibilities. In this paper, we address the open question of whether the set of schematic validities of PAL, the set of formulas all of whose substitution instances are valid, is decidable. We obtain positive answers for multi-agent PAL, as well as its extension with relativized common knowledge, PAL-RC. The conceptual significance of substitution failure is also discussed.
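The failure of uniform substitution can be seen on a small model. The sketch below is a hypothetical single-agent encoding, assuming an agent who cannot distinguish any of the worlds that remain in the model; announcing f deletes every world where f is false. The formula [p]p holds because announcements never change the valuation of atoms, but substituting the Moore sentence p ∧ ¬Kp for p yields [p ∧ ¬Kp](p ∧ ¬Kp), which is false at w1: once the announcement is made, the agent knows p.

```python
# Toy public-announcement semantics (hypothetical encoding, single agent).
# Formulas are tuples: ("atom", name), ("not", f), ("and", f, g),
# ("K", f) for "the agent knows f", and ("ann", f, g) for [f]g.


def holds(model, w, f):
    worlds, val = model
    op = f[0]
    if op == "atom":
        return f[1] in val[w]
    if op == "not":
        return not holds(model, w, f[1])
    if op == "and":
        return holds(model, w, f[1]) and holds(model, w, f[2])
    if op == "K":
        # One agent who cannot distinguish any remaining worlds:
        # it knows f iff f holds at every world still in the model.
        return all(holds(model, v, f[1]) for v in worlds)
    if op == "ann":
        # [f]g: if f is true here, g holds after eliminating all non-f worlds.
        if not holds(model, w, f[1]):
            return True
        kept = frozenset(v for v in worlds if holds(model, v, f[1]))
        return holds((kept, val), w, f[2])
    raise ValueError(f"unknown operator: {op}")


P = ("atom", "p")
MOORE = ("and", P, ("not", ("K", P)))  # "p, but the agent does not know p"

# Two worlds the agent cannot tell apart; p is true only at w1.
MODEL = (frozenset({"w1", "w2"}), {"w1": {"p"}, "w2": set()})

print(holds(MODEL, "w1", ("ann", P, P)))          # True
print(holds(MODEL, "w1", ("ann", MOORE, MOORE)))  # False
```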

KR Conference 2010 Conference Paper

Joint revision of belief and intention

  • Thomas Icard
  • Eric Pacuit
  • Yoav Shoham

We present a formal semantical model to capture action, belief and intention, based on the “database perspective” (Shoham 2009). We then provide postulates for belief and intention revision, and state a representation theorem relating our postulates to the formal model. Our belief postulates are in the spirit of the AGM theory; the intention postulates stand in rough correspondence with the belief postulates.