Author name cluster

Dan Klein

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri
Melissa Z Pan
Shuyi Yang
Lakshya A Agrawal
Bhavya Chopra
Rishabh Tiwari
Kurt Keutzer
Aditya Parameswaran

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (κ = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.

NeurIPS Conference 2024 Conference Paper

Explaining Datasets in Words: Statistical Models with Natural Language Parameters

Ruiqi Zhong
Heng Wang
Dan Klein
Jacob Steinhardt

To make sense of massive data, we often first fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional and hard to interpret. To make model parameters directly interpretable, we introduce a family of statistical models (including clustering, time series, and classification models) parameterized by natural language predicates. For example, a cluster of text about COVID could be parameterized by the predicate "discusses COVID". To learn these statistical models effectively, we develop a model-agnostic algorithm that optimizes continuous relaxations of predicate parameters with gradient descent and discretizes them by prompting language models (LMs). Finally, we apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other, clustering math problems based on subareas, and explaining visual features in memorable images. Our framework is highly versatile, applicable to both textual and visual domains, can be easily steered to focus on specific properties (e.g., subareas), and explains sophisticated concepts that classical methods (e.g., n-gram analysis) struggle to produce.

IJCAI Conference 2024 Conference Paper

Inferring Ontological Categories of OWL Classes Using Foundational Rules (Extended Abstract)

Pedro Paulo F. Barcelos
Tiago Prince Sales
Elena Romanenko
João Paulo A. Almeida
Gal Engelberg
Dan Klein
Giancarlo Guizzardi

Several efforts that leverage the tools of formal ontology have demonstrated the fruitfulness of considering key metaproperties of classes in ontology engineering. Despite that, it is still a common practice to apply representation schemes and approaches, such as OWL, that do not benefit from identifying ontological categories, and simply treat all classes in the same manner. In the original study, we proposed an approach to support the automated classification of classes into the ontological categories underlying the (g)UFO foundational ontology. We proposed a set of inference rules derived from (g)UFO's axiomatization that, given an initial classification of the classes in an OWL ontology, supports the inference of the classification for the remaining classes in the ontology. We formalized these rules, implemented them in a tool, and assessed them against a catalog of ontologies.

NeurIPS Conference 2023 Conference Paper

Goal Driven Discovery of Distributional Differences via Language Descriptions

Ruiqi Zhong
Peter Zhang
Steve Li
Jinwoo Ahn
Dan Klein
Jacob Steinhardt

Exploring large corpora can generate useful discoveries but is time-consuming for humans. We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way. The task input is a problem comprising a user-specified research goal (“comparing the side effects of drug A and drug B”) and a corpus pair (two large collections of patients' self-reported reactions after taking each drug). The output is a goal-related description (discovery) of how these corpora differ (patients taking drug A “mention feelings of paranoia” more often). We build a D5 system, and to quantitatively evaluate its performance, we 1) build a diagnostic benchmark, SynD5, to test whether it can recover known differences between two synthetic corpora, and 2) contribute a meta-dataset, OpenD5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health. With both synthetic and real datasets, we confirm that language models can leverage the user-specified goals to propose more relevant candidate discoveries, and they sometimes produce discoveries previously unknown to the authors, including demographic differences in discussion topics, political stances in speech, insights in commercial reviews, and error patterns in NLP models. Finally, we discuss the limitations of the current D5 system, which discovers correlation rather than causation and has the potential to reinforce societal biases when misused; therefore, practitioners should treat the outputs of our system with caution.

NeurIPS Conference 2021 Conference Paper

Learning Space Partitions for Path Planning

Kevin Yang
Tianjun Zhang
Chris Cummins
Brandon Cui
Benoit Steiner
Linnan Wang
Joseph E. Gonzalez
Dan Klein

Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies independent of the reward function to be optimized. Recently, LaMCTS empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis for when and why such an adaptive region partitioning scheme works. We also propose a new path planning method LaP3 which improves the function value estimation within each sub-region, and uses a latent representation of the search space. Empirically, LaP3 outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into the planning components of model-based RL such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 39% on average across 9 tasks, and in molecular design by up to 0.4 on properties on a 0-1 scale. Code is available at https://github.com/yangkevin2/neurips2021-lap3.

NeurIPS Conference 2018 Conference Paper

Speaker-Follower Models for Vision-and-Language Navigation

Daniel Fried
Ronghang Hu
Volkan Cirik
Anna Rohrbach
Jacob Andreas
Louis-Philippe Morency
Taylor Berg-Kirkpatrick
Kate Saenko

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

NeurIPS Conference 2015 Conference Paper

On the Accuracy of Self-Normalized Log-Linear Models

Jacob Andreas
Maxim Rabinovich
Michael Jordan
Dan Klein

Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions on both real and synthetic datasets.
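
As a rough illustration of the self-normalization idea described in this abstract, the sketch below adds a penalty on the log normalizer to an ordinary negative log-likelihood. The squared form of the penalty, the weight alpha, and the function names are assumptions made for illustration; the abstract only states that the regularizer penalizes log normalizers for deviating from zero.

```python
import math

def log_normalizer(scores):
    """Return log(sum_y exp(score_y)), computed stably (log-sum-exp)."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def self_normalized_loss(scores, gold, alpha=0.1):
    """Negative log-likelihood of the gold label plus alpha * (log Z)^2.

    The squared penalty and alpha are illustrative assumptions.
    """
    log_z = log_normalizer(scores)
    nll = log_z - scores[gold]
    return nll + alpha * log_z ** 2

# Scores that already sum to 1 after exponentiation have log Z = 0,
# so the penalty vanishes and the raw score acts as a log-probability.
scores = [math.log(0.7), math.log(0.2), math.log(0.1)]
loss = self_normalized_loss(scores, gold=0)
```

When the penalty drives log Z close to zero, exp(scores[y]) can be read as an approximate probability without summing over the whole output space at prediction time.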

NeurIPS Conference 2014 Conference Paper

Unsupervised Transcription of Piano Music

Taylor Berg-Kirkpatrick
Jacob Andreas
Dan Klein

We present a new probabilistic model for transcribing piano music from audio to a symbolic form. Our model reflects the process by which discrete musical events give rise to acoustic signals that are then superimposed to produce the observed data. As a result, the inference procedure for our model naturally resolves the source separation problem introduced by the piano's polyphony. In order to adapt to the properties of a new instrument or acoustic environment being transcribed, we learn recording-specific spectral profiles and temporal envelopes in an unsupervised fashion. Our system outperforms the best published approaches on a standard piano transcription task, achieving a 10.6% relative gain in note onset F1 on real piano audio.

AAAI Conference 2011 Conference Paper

Optimal Graph Search with Iterated Graph Cuts

David Burkett
David Hall
Dan Klein

AAAI Conference 2010 Conference Paper

Model AI Assignments

Todd Neller
John DeNero
Dan Klein
Sven Koenig
William Yeoh
Xiaoming Zheng
Kenny Daniel
Alex Nash

AAAI Conference 2010 Conference Paper

Teaching Introductory Artificial Intelligence with Pac-Man

John DeNero
Dan Klein

The projects that we have developed for UC Berkeley’s introductory artificial intelligence (AI) course teach foundational concepts using the classic video game Pac-Man. There are four project topics: state-space search, multi-agent search, probabilistic inference, and reinforcement learning. Each project requires students to implement general-purpose AI algorithms and then to inject domain knowledge about the Pac-Man environment using search heuristics, evaluation functions, and feature functions. We have found that the Pac-Man theme adds consistency to the course, as well as tapping into students’ excitement about video games.

NeurIPS Conference 2009 Conference Paper

Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

Alexandre Bouchard-Côté
Slav Petrov
Dan Klein

Pruning can massively accelerate the computation of feature expectations in large models. However, any single pruning mask will introduce bias. We present a novel approach which employs a randomized sequence of pruning masks. Formally, we apply auxiliary variable MCMC sampling to generate this sequence of masks, thereby gaining theoretical guarantees about convergence. Because each mask is generally able to skip large portions of an underlying dynamic program, our approach is particularly compelling for high-degree algorithms. Empirically, we demonstrate our method on bilingual parsing, showing decreasing bias as more masks are incorporated, and outperforming fixed tic-tac-toe pruning.

NeurIPS Conference 2008 Conference Paper

Efficient Inference in Phylogenetic InDel Trees

Alexandre Bouchard-Côté
Dan Klein
Michael Jordan

Accurate and efficient inference in evolutionary trees is a central problem in computational biology. Realistic models require tracking insertions and deletions along the phylogenetic tree, making inference challenging. We propose new sampling techniques that speed up inference and improve the quality of the samples. We compare our method to previous approaches and show performance improvement on metrics evaluating multiple sequence alignment and reconstruction of ancestral sequences.

NeurIPS Conference 2007 Conference Paper

A Probabilistic Approach to Language Change

Alexandre Bouchard-Côté
Percy Liang
Dan Klein
Thomas Griffiths

We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. Our framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes using the reconstruction of ancient word forms in Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies.

NeurIPS Conference 2007 Conference Paper

Agreement-Based Learning

Percy Liang
Dan Klein
Michael Jordan

The learning of probabilistic models with many hidden variables and non-decomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks.

NeurIPS Conference 2007 Conference Paper

Discriminative Log-Linear Grammars with Latent Variables

Slav Petrov
Dan Klein

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure. We compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge, and yielding sparser solutions. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models as well as the discriminative non-latent baselines.

IJCAI Conference 2003 Conference Paper

Factored A* Search for Models over Sequences and Trees

Dan Klein
Christopher D. Manning

We investigate the calculation of A* bounds for sequence and tree models which are the explicit intersection of a set of simpler models or can be bounded by such an intersection. We provide a natural viewpoint which unifies various instances of factored A* models for trees and sequences, some previously known and others novel, including multiple sequence alignment, weighted finite-state transducer composition, and lexicalized statistical parsing. The specific case of parsing with a product of syntactic (PCFG) and semantic (lexical dependency) components is then considered in detail. We show that this factorization gives a modular lexicalized parser which is simpler than comparably accurate non-factored models, and which allows efficient exact inference with large treebank grammars.

IJCAI Conference 2003 Conference Paper

Spectral Learning

Sepandar D. Kamvar
Dan Klein
Christopher D. Manning

NeurIPS Conference 2002 Conference Paper

Fast Exact Inference with a Factored Model for Natural Language Parsing

Dan Klein
Christopher Manning

We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.

NeurIPS Conference 2001 Conference Paper

Natural Language Grammar Induction Using a Constituent-Context Model

Dan Klein
Christopher Manning

This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset.

1 Overview

To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T, but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts. It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser's supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4].