Arrow Research search

Author name cluster

Tal Haklay

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers

ICML Conference 2025 Conference Paper

MIB: A Mechanistic Interpretability Benchmark

  • Aaron Mueller
  • Atticus Geiger
  • Sarah Wiegreffe
  • Dana Arad
  • Iván Arcuschin
  • Adam Belfki
  • Yik Siu Chan
  • Jaden Fried Fiotto-Kaufman

How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components (and the connections between them) most important for performing a task, e.g., attribution patching or information flow routes. The causal variable track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons and increase our confidence that there has been real progress in the field.
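The abstract names attribution patching as one circuit localization method. As a rough illustration only, the sketch below shows the general idea usually associated with that term: estimating the effect of patching one component's activation with a first-order (gradient times activation difference) approximation, and comparing it to the exact patching effect. The `metric` function, its gradient, and the activation values are hypothetical toy stand-ins, not anything taken from MIB or its code.

```python
# Toy sketch of attribution patching as a first-order estimate of
# activation patching. Everything here is an illustrative assumption.

def metric(a):
    """Hypothetical task metric as a function of three component activations."""
    return 2.0 * a[0] + a[1] * a[1] - 0.5 * a[2]

def metric_grad(a):
    """Analytic gradient of the toy metric with respect to each activation."""
    return [2.0, 2.0 * a[1], -0.5]

# Activations recorded on a "clean" input and on a "corrupted" input.
clean = [1.0, 0.5, 2.0]
corrupt = [0.0, 0.4, 2.0]

def exact_patch_effect(i):
    """Change in the metric when component i alone is set to its clean value."""
    patched = list(corrupt)
    patched[i] = clean[i]
    return metric(patched) - metric(corrupt)

def attribution_estimate(i):
    """Linear estimate: (clean - corrupt) times the gradient at the corrupted run."""
    return (clean[i] - corrupt[i]) * metric_grad(corrupt)[i]

for i in range(3):
    print(i, exact_patch_effect(i), attribution_estimate(i))
```

In this toy case the estimate is exact for the components that enter the metric linearly and slightly off for the quadratic one, which is the trade-off such approximations make: one gradient evaluation in place of one patched run per component.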

ICLR Conference 2024 Conference Paper

Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

  • Nikhil Prakash
  • Tamar Rott Shaham
  • Tal Haklay
  • Yonatan Belinkov
  • David Bau

Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify a mechanism that enables entity tracking and show that (i) both the original model and its fine-tuned version implement entity tracking with the same circuit. In fact, the entity tracking circuit of the fine-tuned version performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality, that is, entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned version. (iii) The performance boost in the fine-tuned model is primarily attributed to its improved ability to handle positional information. To uncover these findings, we employ two methods: DCM, which automatically detects model components responsible for specific semantics, and CMAP, a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.
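The abstract describes CMAP as patching activations across models. The sketch below is not the paper's implementation; it only illustrates the general idea of cross-model patching on a hypothetical two-stage toy model, where the hidden activation computed by a "fine-tuned" model is written into a run of the "base" model. The `run` helper, the model dictionaries, and all numbers are assumptions made for illustration.

```python
# Toy sketch of patching an activation from one model into another.
# The two-stage "models" below are hypothetical stand-ins.

def run(model, x, patch=None):
    """Run a two-stage toy model; optionally overwrite its hidden activation."""
    hidden = model["stage1"](x)
    if patch is not None:
        hidden = patch  # replace the activation with one taken from another run
    return model["stage2"](hidden), hidden

# Both models share the second stage and differ only in the first.
base = {"stage1": lambda x: x * 0.5, "stage2": lambda h: h + 1.0}
tuned = {"stage1": lambda x: x * 0.9, "stage2": lambda h: h + 1.0}

x = 2.0
base_out, _ = run(base, x)
tuned_out, tuned_hidden = run(tuned, x)

# Patch the fine-tuned model's hidden activation into the base model.
patched_out, _ = run(base, x, patch=tuned_hidden)

print(base_out, tuned_out, patched_out)
```

Because the two toy models differ only in the first stage, the patched base model reproduces the fine-tuned output. In a real setting, the share of the performance gap recovered by such a patch would hint at which components carry the improvement.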

ICLR Conference 2024 Conference Paper

Linearity of Relation Decoding in Transformer Language Models

  • Evan Hernandez
  • Arnab Sen Sharma
  • Tal Haklay
  • Kevin Meng
  • Martin Wattenberg
  • Jacob Andreas
  • Yonatan Belinkov
  • David Bau

Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.
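The abstract states that linear relation representations may be obtained from a first-order approximation to the LM. As a minimal sketch of what a first-order approximation means in general, the code below linearizes a small nonlinear function around one input, giving an affine map `W s + b` built from the Jacobian at that point. The `toy_model` function is a hypothetical stand-in for the computation from subject representation to output, and the finite-difference Jacobian is chosen only to keep the example dependency-free; none of this is the authors' code.

```python
import math

def toy_model(s):
    """Hypothetical nonlinear map from a 2-d 'subject state' to a 2-d output."""
    x, y = s
    return [math.tanh(x + 0.5 * y), 0.3 * x - y + 0.1 * x * y]

def jacobian(f, s, eps=1e-6):
    """Central finite-difference Jacobian of f at s (rows: outputs, cols: inputs)."""
    out_dim = len(f(s))
    J = [[0.0] * len(s) for _ in range(out_dim)]
    for j in range(len(s)):
        plus, minus = list(s), list(s)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(out_dim):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def linearize(f, s0):
    """Return (W, b) such that f(s) is approximately W s + b near s0."""
    W = jacobian(f, s0)
    f0 = f(s0)
    b = [f0[i] - sum(W[i][j] * s0[j] for j in range(len(s0)))
         for i in range(len(f0))]
    return W, b

def apply_linear(W, b, s):
    """Evaluate the affine map W s + b."""
    return [sum(W[i][j] * s[j] for j in range(len(s))) + b[i]
            for i in range(len(b))]

s0 = [0.2, -0.1]
W, b = linearize(toy_model, s0)
nearby = [0.25, -0.05]
print(toy_model(nearby), apply_linear(W, b, nearby))
```

The affine map matches the toy function exactly at `s0` and closely at nearby points. How far such an approximation carries over to other inputs is the kind of question the abstract reports as holding for only a subset of relations.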