Author name cluster

Violet Xiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models

  • Logan Matthew Cross
  • Violet Xiang
  • Agam Bhatia
  • Daniel L. K. Yamins
  • Nick Haber

Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems and fail to adaptively learn online when tested with novel agents. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction. We introduce the Theory of Mind module that scaffolds the high-level planning process by generating hypotheses about other agents' strategies in natural language. It then evaluates and iteratively refines these hypotheses by reinforcing hypotheses that make correct predictions about the other agents' behavior. Hypothetical Minds significantly improves performance over previous LLM-agent and RL baselines on a range of competitive, mixed-motive, and collaborative domains in the Melting Pot benchmark, including both dyadic and population-based environments. Additionally, comparisons against LLM-agent baselines and ablations reveal the importance of hypothesis evaluation and refinement for succeeding on complex scenarios.
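
The hypothesis evaluation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` class, the `reinforce` function, the fixed `predicted` field, the learning rate, and the running-average update rule are all assumptions chosen for illustration. In the described system the hypotheses and their predictions would come from an LLM rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str           # natural-language guess about another agent's strategy
    predicted: str      # action this hypothesis predicts (hypothetical, fixed here)
    value: float = 0.0  # running score; higher means more correct predictions


def reinforce(hypotheses, observed_action, lr=0.3):
    """Move each score toward 1 on a correct prediction and toward 0 otherwise.

    The running-average rule and lr=0.3 are illustrative assumptions.
    Returns the hypothesis with the highest score after the update.
    """
    for h in hypotheses:
        reward = 1.0 if h.predicted == observed_action else 0.0
        h.value += lr * (reward - h.value)
    return max(hypotheses, key=lambda h: h.value)


# Toy example: two candidate explanations of an opponent's behavior.
hyps = [
    Hypothesis("always plays rock", predicted="rock"),
    Hypothesis("always plays paper", predicted="paper"),
]
for observed in ["rock", "rock", "paper"]:
    best = reinforce(hyps, observed)

print(best.text, round(best.value, 3))
```

In this toy run the first hypothesis predicts two of the three observed actions correctly, so it ends with the higher score and would be the one passed on to high-level planning.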

NeurIPS Conference 2025 Conference Paper

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

  • Tianyu Hua
  • Harper Hua
  • Violet Xiang
  • Benjamin Klieger
  • Sang Truong
  • Weixin Liang
  • Fan-Yun Sun
  • Nick Haber

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers—ideas unseen during pretraining—remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.