Author name cluster

Nick Haber

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Fantastic Bugs and Where to Find Them in AI Benchmarks

Sang Truong
Yuheng Tu
Michael Hardy
Anka Reuel-Lamparth
Zeyu Tang
Jirayu Burapacheep
Jonathan Perera
Chibuike Uwakwe

Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
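
As a rough illustration of flagging items from response patterns, the sketch below computes an item-rest correlation for each question and flags items whose correlation is negative. The choice of statistic, the zero threshold, and the names `pearson` and `flag_items` are assumptions made for illustration; the abstract does not say which statistics or ranges the paper uses.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def flag_items(responses, threshold=0.0):
    """Flag items whose item-rest correlation falls below `threshold`.

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    n_items = len(responses[0])
    flagged = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        # Score on all other items, so the item is not correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        if pearson(item, rest) < threshold:
            flagged.append(i)
    return flagged


# Toy data: five models (rows) by four items (columns).
# Item 3 is answered correctly only by the weakest models.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
flagged = flag_items(responses)
```

In this toy data only the last item runs against the overall ranking of models, so it is the one handed to a reviewer.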

NeurIPS Conference 2025 Conference Paper

From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Yaohui Zhang
Haijing ZHANG
Wenlong Ji
Tianyu Hua
Nick Haber
Hancheng Cao
Weixin Liang

The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
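
One common way to aggregate pairwise outcomes into a ranking is the Bradley-Terry model. The sketch below fits it with the standard iterative (minorize-maximize) update. Using Bradley-Terry here is an assumption, since the abstract does not name the aggregation method, and the function name `bradley_terry` and the toy comparisons are hypothetical.

```python
def bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Assumes every item wins at least once; otherwise its strength
    collapses to zero and a pseudo-count would be needed.
    """
    wins = [0] * n_items
    pair_counts = {}
    for w, l in comparisons:
        wins[w] += 1
        key = (min(w, l), max(w, l))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = [1.0] * n_items
    for _ in range(iters):
        new = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n in pair_counts.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += n / (strength[i] + strength[j])
            new.append(wins[i] / denom if denom > 0 else strength[i])
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        total = sum(new)
        strength = [s * n_items / total for s in new]
    return strength


# Toy data: manuscript 0 usually beats 1 and 2, and 1 usually beats 2.
comparisons = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 3 + [(2, 0)]
)
scores = bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

The fitted strengths give a relative ordering only; their absolute scale carries no meaning.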

ICLR Conference 2025 Conference Paper

Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models

Logan Matthew Cross
Violet Xiang
Agam Bhatia
Daniel L. K. Yamins
Nick Haber

Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems and fail to adaptively learn online when tested with novel agents. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction. We introduce the Theory of Mind module that scaffolds the high-level planning process by generating hypotheses about other agents' strategies in natural language. It then evaluates and iteratively refines these hypotheses by reinforcing hypotheses that make correct predictions about the other agents' behavior. Hypothetical Minds significantly improves performance over previous LLM-agent and RL baselines on a range of competitive, mixed motive, and collaborative domains in the Melting Pot benchmark, including both dyadic and population-based environments. Additionally, comparisons against LLM-agent baselines and ablations reveal the importance of hypothesis evaluation and refinement for succeeding on complex scenarios.

NeurIPS Conference 2025 Conference Paper

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Tianyu Hua
Harper Hua
Violet Xiang
Benjamin Klieger
Sang Truong
Weixin Liang
Fan-Yun Sun
Nick Haber

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers—ideas unseen during pretraining—remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.

NeurIPS Conference 2025 Conference Paper

When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration

Quan Shi
Carlos Jimenez
Shunyu Yao
Nick Haber
Diyi Yang
Karthik Narasimhan

As large language models (LLMs) increasingly serve as close collaborators for humans, it is crucial that they express their reasoning in ways that humans can understand and learn from. However, this capability remains relatively less understood and under-evaluated. To address this, we introduce a conceptual framework for such Human-AI knowledge transfer capabilities and conduct the first large-scale user study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an LLM on problem-solving strategies, then independently implement solutions, isolating the influence of model reasoning on human understanding. Our findings reveal that while model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent with significant outliers, highlighting that knowledge transfer is a distinct capability requiring dedicated optimization. Our analysis uncovers behavioral and strategic factors that mediate successful knowledge transfer, and we release our code, dataset, and evaluation framework to support future work on communicatively aligned models.

ICLR Conference 2024 Conference Paper

ContextRef: Evaluating Referenceless Metrics for Image Description Generation

Elisa Kreiss
Eric Zelikman
Christopher Potts
Nick Haber

Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.

NeurIPS Conference 2024 Conference Paper

FactorSim: Generative Simulation via Factorized Representation

Fan-Yun Sun
S. I. Harini
Angela Yi
Yihan Zhou
Alex Zook
Jonathan Tremblay
Logan Cross
Jiajun Wu

Generating simulations to train intelligent agents in game-playing and robotics from natural language input, user input, or task documentation remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM, which generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code’s accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (i.e., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks.

ICLR Conference 2024 Conference Paper

Hypothesis Search: Inductive Reasoning with Language Models

Ruocheng Wang
Eric Zelikman
Gabriel Poesia
Yewen Pu
Nick Haber
Noah D. Goodman

Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them, yielding "in-context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. To reduce the hypothesis search space, we explore steps to filter the set of hypotheses to be implemented as programs: we either ask the LLM to summarize them into a smaller set of hypotheses, or ask human annotators to select a subset. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and the string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.
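
The verification step described in the abstract above, running candidate programs on the observed examples and keeping one that reproduces them, can be sketched as follows. The candidates here are hand-written stand-ins for LLM-generated programs, and `select_program` is a hypothetical helper, not the paper's code.

```python
def select_program(candidates, examples):
    """Return the first (description, program) consistent with all examples.

    candidates: list of (natural-language hypothesis, callable) pairs.
    examples: list of (input, expected_output) pairs.
    """
    for description, program in candidates:
        try:
            if all(program(x) == y for x, y in examples):
                return description, program
        except Exception:
            # A crashing candidate is treated as a failed hypothesis.
            continue
    return None


# Hand-written stand-ins for hypotheses an LLM might propose.
candidates = [
    ("reverse the list", lambda xs: xs[::-1]),
    ("sort ascending", lambda xs: sorted(xs)),
    ("double each element", lambda xs: [2 * x for x in xs]),
]
examples = [([3, 1, 2], [1, 2, 3]), ([2, 1], [1, 2])]

description, program = select_program(candidates, examples)
prediction = program([9, 7, 8])  # apply the verified program to a new input
```

Because each hypothesis is an executable program, checking it against the examples is exact rather than a matter of judgment.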

NeurIPS Conference 2024 Conference Paper

Learning Formal Mathematics From Intrinsic Motivation

Gabriel Poesia
David Broman
Nick Haber
Noah D. Goodman

How did humanity coax mathematics from the aether? We explore the Platonic view that mathematics can be discovered from its axioms: a game of conjecture and proof. We describe an agent that jointly learns to pose challenging problems for itself (conjecturing) and solve them (theorem proving). Given a mathematical domain axiomatized in dependent type theory, we first combine methods for constrained decoding and type-directed synthesis to sample valid conjectures from a language model. Our method guarantees well-formed conjectures by construction, even as we start with a randomly initialized model. We use the same model to represent a policy and value function for guiding proof search. Our agent targets generating hard but provable conjectures, a moving target, since its own theorem proving ability also improves as it trains. We propose novel methods for hindsight relabeling on proof search trees to significantly improve the agent's sample efficiency in both tasks. Experiments on 3 axiomatic domains (propositional logic, arithmetic and group theory) demonstrate that our agent can bootstrap from only the axioms, self-improving in generating true and challenging conjectures and in finding proofs.

NeurIPS Conference 2024 Conference Paper

Policy-shaped prediction: avoiding distractions in model-based reinforcement learning

Miles Hutson
Isaac Kauvar
Nick Haber

Model-based reinforcement learning (MBRL) is a promising route to sample-efficient policy optimization. However, a known vulnerability of reconstruction-based MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods, including DreamerV3 and DreamerPro, with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through a synergy of a pretrained segmentation model, a task-aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model-based reinforcement learning.

ICML Conference 2023 Conference Paper

Curious Replay for Model-based Adaptation

Isaac Kauvar
Chris Doyle
Linqi Zhou
Nick Haber

Agents must be able to adapt quickly as an environment changes. We find that existing model-based reinforcement learning agents are unable to do this well, in part because of how they use past experiences to train their world model. Here, we present Curious Replay, a form of prioritized experience replay tailored to model-based agents through use of a curiosity-based priority signal. Agents using Curious Replay exhibit improved performance in an exploration paradigm inspired by animal behavior and on the Crafter benchmark. DreamerV3 with Curious Replay surpasses state-of-the-art performance on Crafter, achieving a mean score of 19.4 that substantially improves on the previous high score of 14.5 by DreamerV3 with uniform replay, while also maintaining similar performance on the Deepmind Control Suite. Code for Curious Replay is available at github.com/AutonomousAgentsLab/curiousreplay.
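
The general idea of prioritized replay with a model-derived priority can be sketched as a buffer that samples transitions in proportion to a stored priority and refreshes that priority from the world model's latest error. This is a generic illustration, not the Curious Replay implementation: the class name, the `abs(error) + eps` priority rule, and the default priority for new items are all assumptions.

```python
import random


class PrioritizedReplay:
    """Toy replay buffer that samples in proportion to per-item priority."""

    def __init__(self, seed=0):
        self.items = []
        self.priorities = []
        self.rng = random.Random(seed)

    def add(self, transition, priority=1.0):
        # New experience gets a caller-chosen starting priority.
        self.items.append(transition)
        self.priorities.append(priority)

    def sample(self, k):
        # Draw k indices with replacement, weighted by priority.
        return self.rng.choices(
            range(len(self.items)), weights=self.priorities, k=k
        )

    def update(self, index, model_error, eps=1e-3):
        # Hypothetical rule: higher world-model error means higher priority.
        self.priorities[index] = abs(model_error) + eps


buffer = PrioritizedReplay(seed=0)
buffer.add("well-modeled transition")
buffer.add("surprising transition")
buffer.update(0, model_error=1.0)
buffer.update(1, model_error=9.0)

indices = buffer.sample(1000)
surprising_share = indices.count(1) / len(indices)
```

With these priorities the surprising transition is drawn roughly nine times as often as the well-modeled one, which concentrates world-model training on what the model currently predicts poorly.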

NeurIPS Conference 2023 Conference Paper

Parsel🐍: Algorithmic Reasoning with Language Models by Composing Decompositions

Eric Zelikman
Qian Huang
Gabriel Poesia
Noah Goodman
Nick Haber

Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs. With Parsel, we automatically decompose algorithmic tasks into hierarchical natural language function descriptions and then search over combinations of possible function implementations using tests. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis and robotic planning. We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67% to 85%. We also find that LLM-generated robotic plans using Parsel are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers. We release our code at https://github.com/ezelikman/parsel.
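
Searching over combinations of candidate implementations with tests can be sketched as below. This is a minimal illustration, not Parsel's actual code or API: the candidate functions are hand-written stand-ins for LLM samples, and `search`, `compose`, and the toy task (sum of squares of the even numbers) are hypothetical.

```python
from itertools import product


def search(candidates, compose, tests):
    """Try every combination of candidate implementations.

    candidates: dict mapping a sub-function name to a list of implementations.
    compose: builds the full program from one implementation per name.
    tests: list of (input, expected_output) pairs for the full program.
    Returns the first combination that passes all tests, or None.
    """
    names = list(candidates)
    for combo in product(*(candidates[name] for name in names)):
        impl = dict(zip(names, combo))
        program = compose(impl)
        try:
            if all(program(x) == y for x, y in tests):
                return impl
        except Exception:
            continue
    return None


# Stand-ins for sampled implementations; the first of each pair is wrong.
def keep_odds(xs):
    return [x for x in xs if x % 2 == 1]


def keep_evens(xs):
    return [x for x in xs if x % 2 == 0]


def square_of_sum(xs):
    return sum(xs) ** 2


def sum_of_squares(xs):
    return sum(x * x for x in xs)


candidates = {
    "filter": [keep_odds, keep_evens],
    "reduce": [square_of_sum, sum_of_squares],
}


def compose(impl):
    return lambda xs: impl["reduce"](impl["filter"](xs))


tests = [([1, 2, 3, 4], 20), ([2, 2], 8)]
solution = search(candidates, compose, tests)
```

The search space grows multiplicatively with the number of sub-functions, which is why tests on the composed program are needed to prune it.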

NeurIPS Conference 2022 Conference Paper

Interaction Modeling with Multiplex Attention

Fan-Yun Sun
Isaac Kauvar
Ruohan Zhang
Jiachen Li
Mykel J Kochenderfer
Jiajun Wu
Nick Haber

Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.

NeurIPS Conference 2021 Conference Paper

ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation

Chuang Gan
Jeremy Schwartz
Seth Alter
Damian Mrowca
Martin Schrimpf
James Traer
Julian De Freitas
Jonas Kubilius

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. TDW enables the simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. Unique properties include real-time near-photo-realistic image rendering; a library of objects and environments, and routines for their customization; generative procedures for efficiently building classes of new environments; high-fidelity audio rendering; realistic physical interactions for a variety of material types, including cloths, liquid, and deformable objects; customizable “avatars” that embody AI agents; and support for human interactions with VR devices. TDW’s API enables multiple agents to interact within a simulation and returns a range of sensor and physics data representing the state of the world. We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, physical dynamics predictions, multi-agent interactions, models that ‘learn like a child’, and attention studies in humans and neural networks.

ICML Conference 2020 Conference Paper

Active World Model Learning with Progress Curiosity

Kuno Kim
Megumi Sano
Julian De Freitas
Nick Haber
Daniel L. K. Yamins

World models are self-supervised predictive models of how the world evolves. Humans learn world models by curiously exploring their environment, in the process acquiring compact abstractions of high bandwidth sensory inputs, the ability to plan across long temporal horizons, and an understanding of the behavioral patterns of other agents. In this work, we study how to design such a curiosity-driven Active World Model Learning (AWML) system. To do so, we construct a curious agent building world models while visually exploring a 3D physical environment rich with distillations of representative real-world agents. We propose an AWML system driven by $\gamma$-Progress, a scalable and effective learning progress-based curiosity signal, and show that $\gamma$-Progress naturally gives rise to an exploration policy that directs attention to complex but learnable dynamics in a balanced manner, thereby overcoming the “white noise problem”. As a result, our $\gamma$-Progress-driven controller achieves significantly higher AWML performance than baseline controllers equipped with state-of-the-art exploration strategies such as Random Network Distillation and Model Disagreement.

NeurIPS Conference 2018 Conference Paper

Flexible neural representation for physics prediction

Damian Mrowca
Chengxu Zhuang
Elias Wang
Nick Haber
Li Fei-Fei
Josh Tenenbaum
Daniel Yamins

Humans have a remarkable capacity to understand the physical dynamics of objects in their environment, flexibly capturing complex structures and interactions at multiple levels of detail. Inspired by this ability, we propose a hierarchical particle-based object representation that covers a wide variety of types of three-dimensional objects, including both arbitrary rigid geometrical shapes and deformable materials. We then describe the Hierarchical Relation Network (HRN), an end-to-end differentiable neural network based on hierarchical graph convolution, that learns to predict physical dynamics in this representation. Compared to other neural network baselines, the HRN accurately handles complex collisions and nonrigid deformations, generating plausible dynamics predictions at long time scales in novel settings, and scaling to large scene configurations. These results demonstrate an architecture with the potential to form the basis of next-generation physics predictors for use in computer vision, robotics, and quantitative cognitive science.

NeurIPS Conference 2018 Conference Paper

Learning to Play With Intrinsically-Motivated, Self-Aware Agents

Nick Haber
Damian Mrowca
Stephanie Wang
Li Fei-Fei
Daniel Yamins

Infants are experts at playing, with an amazing ability to generate novel structured behaviors in unstructured environments that lack clear extrinsic reward signals. We seek to mathematically formalize these abilities using a neural network that implements curiosity-driven intrinsic motivation. Using a simple but ecologically naturalistic simulated environment in which an agent can move and interact with objects it sees, we propose a "world-model" network that learns to predict the dynamic consequences of the agent's actions. Simultaneously, we train a separate explicit "self-model" that allows the agent to track the error map of its world-model. It then uses the self-model to adversarially challenge the developing world-model. We demonstrate that this policy causes the agent to explore novel and informative interactions with its environment, leading to the generation of a spectrum of complex behaviors, including ego-motion prediction, object attention, and object gathering. Moreover, the world-model that the agent learns supports improved performance on object dynamics prediction, detection, localization and recognition tasks. Taken together, our results are initial steps toward creating flexible autonomous agents that self-supervise in realistic physical environments.
