Arrow Research search

Author name cluster

Peter Clark

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers


NeurIPS Conference 2025 Conference Paper

AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise

  • Dhruv Agarwal
  • Bodhisattwa Prasad Majumder
  • Reece Adamson
  • Megha Chakravorty
  • Satvika Reddy Gavireddy
  • Aditya Parashar
  • Harshit Surana
  • Bhavana Dalvi Mishra

The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery—a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.
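
The search strategy described above can be pictured with a small sketch. The snippet below is an illustrative toy, not the paper's code: `propose_hypothesis`, `run_experiment`, and `bayesian_surprise` are hypothetical stand-ins for the LLM-driven steps, while the loop shows MCTS with progressive widening using a surprisal score as the reward.

```python
import math
import random

class Node:
    def __init__(self, hypothesis=None, parent=None):
        self.hypothesis = hypothesis
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def propose_hypothesis(parent_hypothesis):
    """Hypothetical: ask an LLM for a follow-up hypothesis nested under the parent."""
    return (parent_hypothesis or "root") + f"/h{random.randint(0, 999)}"

def run_experiment(hypothesis):
    """Hypothetical: run an analysis on the dataset and return its results."""
    return {"effect": random.random()}

def bayesian_surprise(hypothesis, result):
    """Hypothetical: shift from the LLM's prior to posterior belief about the hypothesis."""
    return result["effect"]

def uct(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.total_reward / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(budget=50, k=1.0, alpha=0.5):
    root = Node()
    for _ in range(budget):
        node = root
        # Progressive widening: descend via UCT while the node already has
        # "enough" children relative to its visit count; otherwise expand here.
        while node.children and len(node.children) >= k * (node.visits ** alpha):
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        child = Node(propose_hypothesis(node.hypothesis), parent=node)
        node.children.append(child)
        # "Simulation": run the experiment and score the epistemic shift.
        reward = bayesian_surprise(child.hypothesis, run_experiment(child.hypothesis))
        while child is not None:          # backpropagate
            child.visits += 1
            child.total_reward += reward
            child = child.parent
    return root

if __name__ == "__main__":
    tree = mcts()
    best = max(tree.children, key=lambda ch: ch.total_reward / max(ch.visits, 1))
    print("most surprising top-level hypothesis:", best.hypothesis)
```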

ICLR Conference 2025 Conference Paper

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

  • Bodhisattwa Prasad Majumder
  • Harshit Surana
  • Dhruv Agarwal 0003
  • Bhavana Dalvi Mishra
  • Abhijeetsingh Meena
  • Aryan Prakhar
  • Tirth Vora
  • Tushar Khot

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations on data-driven workflows that are not covered in the manually collected split. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.

ICLR Conference 2025 Conference Paper

From Models to Microtheories: Distilling a Model's Topical Knowledge for Grounded Question-Answering

  • Nathaniel Weir
  • Bhavana Dalvi Mishra
  • Orion Weller
  • Oyvind Tafjord
  • Sam Hornstein
  • Alexander Sabol
  • Peter A. Jansen
  • Benjamin Van Durme

Recent reasoning methods (e.g., chain-of-thought) help users understand how language models (LMs) answer a single question, but they do little to reveal the LM’s overall understanding, or “theory,” about the question’s topic, making it still hard to trust the model. Our goal is to materialize such theories - here called microtheories (a linguistic analog of logical microtheories) - as a set of sentences encapsulating an LM’s core knowledge about a topic. These statements systematically work together to entail answers to a set of questions to both engender trust and improve performance. Our approach is to first populate a knowledge store with (model-generated) sentences that entail answers to training questions, and then distill those down to a core microtheory which is concise, general, and non-redundant. We show that, when added to a general corpus (e.g., Wikipedia), microtheories can supply critical information not necessarily present in the corpus, improving both a model’s ability to ground its answers to verifiable knowledge (i.e., show how answers are systematically entailed by documents in the corpus, grounding up to +8% more answers), and the accuracy of those grounded answers (up to +8% absolute). We also show that, in a human evaluation in the medical domain, our distilled microtheories contain a significantly higher concentration of topically critical facts than the non-distilled knowledge store. Finally, we show we can quantify the coverage of a microtheory for a topic (characterized by a dataset) using a notion of p-relevance. Together, these suggest that microtheories are an efficient distillation of an LM’s topic-relevant knowledge, that they can usefully augment existing corpora, and can provide both performance gains and an interpretable, verifiable window into the model’s knowledge of a topic.
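
As a rough illustration of the distillation step, one can view it as greedy set cover over the training questions each candidate sentence helps entail; the sketch below is my own simplification, and `entails_answer` is a hypothetical stand-in for the model-based entailment check.

```python
def entails_answer(sentence: str, question: str) -> bool:
    """Hypothetical stand-in: does this sentence help entail the answer to the question?"""
    return any(word in sentence.lower() for word in question.lower().split())

def distill_microtheory(knowledge_store, train_questions, max_size=10):
    """Greedy set cover: repeatedly add the sentence that newly covers the most questions."""
    covered, core = set(), []
    while len(core) < max_size:
        best, best_gain = None, 0
        for sent in knowledge_store:
            gain = sum(1 for q in train_questions
                       if q not in covered and entails_answer(sent, q))
            if gain > best_gain:
                best, best_gain = sent, gain
        if best is None:          # nothing adds coverage: the core stays non-redundant
            break
        core.append(best)
        covered |= {q for q in train_questions if entails_answer(best, q)}
    return core

if __name__ == "__main__":
    store = ["Plants make food via photosynthesis.",
             "Photosynthesis requires sunlight.",
             "Rocks are not alive."]
    questions = ["How do plants make food?", "What does photosynthesis require?"]
    print(distill_microtheory(store, questions))
```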

NeurIPS Conference 2025 Conference Paper

Language Modeling by Language Models

  • Junyan Cheng
  • Peter Clark
  • Kyle Richardson

*Can we leverage LLMs to model the process of discovering novel language model (LM) architectures?* Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system *Genesys* employs a *Ladder of Scales* approach; new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M to 350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., a ~86 percentage point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified) and find the best designs to be competitive with known architectures (e.g., they outperform GPT2, Mamba2, etc., on 6/9 common benchmarks). We couple these results with comprehensive system-level ablations and formal results, which give broader insights into the design of effective autonomous discovery systems.
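
An illustrative sketch of the "Ladder of Scales" schedule (not Genesys itself): candidate designs are trained and scored at increasing parameter scales while the budget narrows, so only the best designs reach the largest scale. `train_and_score` is a hypothetical stand-in for pretraining plus downstream evaluation.

```python
import random

def train_and_score(design, scale):
    """Hypothetical: pretrain `design` at `scale` parameters and return an eval score."""
    return random.random()

def ladder_of_scales(designs, scales=(14e6, 70e6, 350e6), budgets=(32, 8, 2)):
    survivors = list(designs)
    for scale, budget in zip(scales, budgets):
        # Only `budget` designs can be trained at this rung; score and re-rank them.
        scored = sorted(((train_and_score(d, scale), d) for d in survivors[:budget]),
                        reverse=True)
        survivors = [d for _, d in scored]   # best-first; the next, smaller budget keeps the top
    return survivors

if __name__ == "__main__":
    finalists = ladder_of_scales([f"design-{i:02d}" for i in range(64)])
    print("designs verified at the largest scale:", finalists)
```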

ICML Conference 2025 Conference Paper

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

  • Bill Yuchen Lin
  • Ronan LeBras
  • Kyle Richardson 0001
  • Ashish Sabharwal
  • Radha Poovendran
  • Peter Clark
  • Yejin Choi 0001

We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even with scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
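
To make the notion of search space complexity concrete, here is a quick back-of-the-envelope computation (my own illustration, not the benchmark's code): a logic grid puzzle with n houses and m attribute categories admits (n!)^m candidate grids before any clues are applied.

```python
from math import factorial

def grid_search_space(n_houses: int, n_attributes: int) -> int:
    # Each attribute category is a bijection from values to houses: n! choices,
    # independently for each of the m categories.
    return factorial(n_houses) ** n_attributes

if __name__ == "__main__":
    for n, m in [(3, 3), (4, 4), (5, 5)]:
        print(f"{n} houses x {m} attributes: {grid_search_space(n, m):,} candidate grids")
```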

ICLR Conference 2024 Conference Paper

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

  • Shashank Gupta
  • Vaishnavi Shrivastava
  • Ameet Deshpande
  • Ashwin Kalyan
  • Peter Clark
  • Ashish Sabharwal
  • Tushar Khot

Recent works have showcased the ability of large-scale language models (LLMs) to embody diverse personas in their responses, exemplified by prompts like ‘_You are Yoda. Explain the Theory of Relativity._’ While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs’ capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform _basic reasoning tasks_. Our study covers 24 reasoning datasets (spanning mathematics, law, medicine, morals, and more), 4 LLMs (2 versions of ChatGPT-3.5, GPT-4-Turbo, and Llama-2-70b-chat), and 19 diverse personas (e.g., ‘an Asian person’) spanning 5 socio-demographic groups: race, gender, religion, disability, and political affiliation. Our experiments unveil that LLMs harbor deep-rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked (‘_Are Black people less skilled at mathematics?_’), they manifest stereotypical and often erroneous presumptions when prompted to answer questions while adopting a persona. These can be observed as abstentions in the model’s response, e.g., ‘_As a Black person, I am unable to answer this question as it requires math knowledge_’, and generally result in a substantial drop in performance on reasoning tasks. Our experiments with ChatGPT-3.5 show that this bias is _ubiquitous_—80% of our personas demonstrate bias; it is _significant_—some datasets show performance drops of 70%+; and can be especially _harmful for certain groups_—some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all four LLMs exhibit persona-induced bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard-to-discern as they do not always manifest as explicit abstentions, and can also be hard-to-avoid—we find de-biasing prompts to have minimal to no effect. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs—a trend on the rise—can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.

NeurIPS Conference 2024 Conference Paper

DiscoveryWorld: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

  • Peter Jansen
  • Marc-Alexandre Côté
  • Tushar Khot
  • Erin Bransom
  • Bhavana Dalvi Mishra
  • Bodhisattwa Prasad Majumder
  • Oyvind Tafjord
  • Peter Clark

Automated scientific discovery promises to accelerate progress across scientific domains, but evaluating an agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DiscoveryWorld, a virtual environment that enables benchmarking an agent's ability to perform complete cycles of novel scientific discovery in an inexpensive, simulated, multi-modal, long-horizon, and fictional setting. DiscoveryWorld consists of 24 scientific tasks across three levels of difficulty, each with parametric variations that provide new discoveries for agents to make across runs. Tasks require an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. Task difficulties are normed to range from straightforward to challenging for human scientists with advanced degrees. DiscoveryWorld further provides three automatic metrics for evaluating performance, including: (1) binary task completion, (2) fine-grained report cards detailing procedural scoring of task-relevant actions, and (3) the accuracy of discovered explanatory knowledge. While simulated environments such as DiscoveryWorld are low-fidelity compared to the real world, we find that strong baseline agents struggle on most DiscoveryWorld tasks, highlighting the utility of using simulated environments as proxy tasks for near-term development of scientific discovery competency in agents.

NeurIPS Conference 2024 Conference Paper

Learning to Reason via Program Generation, Emulation, and Search

  • Nathaniel Weir
  • Muhammad Khalifa
  • Linlu Qiu
  • Orion Weller
  • Peter Clark

Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g., word concatenation). However, not all reasoning tasks are easily expressible as code, e.g., tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM’s program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose Code Generation and Emulated EXecution (COGEX). COGEX works by (1) training LMs to generate pseudo-programs, (2) teaching them to emulate their generated program’s execution, including those leaf functions, allowing the LM’s knowledge to fill in the execution gaps, and (3) using them to search over many programs to find an optimal one. To adapt the COGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered.
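
A minimal sketch of the pseudo-program idea: the program structure is ordinary Python, but a leaf call is left undefined and an emulator fills in its result. `emulate` below is a hypothetical stand-in for the code-tuned LM that COGEX trains for this role.

```python
def emulate(function_name: str, *args):
    """Hypothetical stand-in: ask the LM what this undefined leaf call would return."""
    canned = {"is_sarcastic": lambda text: "yeah, right" in text.lower()}
    return canned[function_name](*args)

def answer(text: str) -> str:
    # A pseudo-program: real Python control flow, but the leaf predicate has no
    # implementation of its own and is filled in by the emulator.
    if emulate("is_sarcastic", text):
        return "sarcastic"
    return "sincere"

if __name__ == "__main__":
    print(answer("Oh yeah, right, that went perfectly."))
    print(answer("Thanks, that genuinely helped."))
```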

IJCAI Conference 2024 Conference Paper

NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning

  • Nathaniel Weir
  • Peter Clark
  • Benjamin Van Durme

Our goal is to develop a modern approach to answering questions via systematic reasoning where answers are supported by human-interpretable proof trees grounded in an NL corpus of facts. Such a system would help alleviate the challenges of interpretability and hallucination with modern LMs, and the lack of grounding of current explanation methods (e.g., Chain-of-Thought). This paper proposes a new take on Prolog-based inference engines, where we replace handcrafted rules with a combination of neural language modeling, guided generation, and semiparametric dense retrieval. Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA as entailment tree proof search, going beyond earlier work explaining known-to-be-true facts from text. In experiments, NELLIE outperforms a similar-sized state-of-the-art reasoner while producing knowledge-grounded explanations. We also find NELLIE can exploit both semi-structured and NL text corpora to guide reasoning. Together these suggest a new way to jointly reap the benefits of both modern neural methods and traditional symbolic reasoning.
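
A toy sketch of entailment-tree proof search by backward chaining (my own simplification, not NELLIE's implementation): a goal is either grounded directly in the corpus or decomposed into simpler premises, with `generate_decompositions` standing in for the neural rule-generation and retrieval steps.

```python
CORPUS = {"an animal needs food to survive", "a squirrel is an animal"}

def generate_decompositions(goal):
    """Hypothetical stand-in for guided generation of candidate premise pairs."""
    table = {"a squirrel needs food to survive":
                 [["a squirrel is an animal", "an animal needs food to survive"]]}
    return table.get(goal, [])

def prove(goal, depth=3):
    """Return an entailment tree (goal, children) if the goal can be grounded in the corpus."""
    if goal in CORPUS:
        return (goal, [])
    if depth == 0:
        return None
    for premises in generate_decompositions(goal):
        subtrees = [prove(p, depth - 1) for p in premises]
        if all(subtrees):               # every premise proved -> the goal is proved
            return (goal, subtrees)
    return None

if __name__ == "__main__":
    print(prove("a squirrel needs food to survive"))
```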

ICML Conference 2024 Conference Paper

Position: Data-driven Discovery with Large Generative Models

  • Bodhisattwa Prasad Majumder
  • Harshit Surana
  • Dhruv Agarwal 0003
  • Sanchaita Hazra
  • Ashish Sabharwal
  • Peter Clark

With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery—a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DataVoyager, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata—a feat previously unattainable—while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.

ICML Conference 2024 Conference Paper

Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills

  • Kolby Nottingham
  • Bodhisattwa Prasad Majumder
  • Bhavana Dalvi Mishra
  • Sameer Singh 0001
  • Peter Clark
  • Roy Fox

Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO’s ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.
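
An illustrative sketch of the skill-set idea (not the paper's exact procedure): recurring high-reward subtrajectories are extracted as skills, and skills whose rewards in recent use have fallen are pruned.

```python
from collections import defaultdict

def extract_skills(trajectories, min_count=2, min_reward=1.0, length=2):
    """trajectories: list of (actions, reward). Return recurring high-reward action n-grams."""
    rewards_by_subseq = defaultdict(list)
    for actions, reward in trajectories:
        for i in range(len(actions) - length + 1):
            rewards_by_subseq[tuple(actions[i:i + length])].append(reward)
    return {seq: sum(rs) / len(rs) for seq, rs in rewards_by_subseq.items()
            if len(rs) >= min_count and sum(rs) / len(rs) >= min_reward}

def prune_skills(skills, recent_rewards, keep_threshold=1.0):
    """Drop skills whose rewards in recent use have fallen below the threshold."""
    kept = {}
    for skill, avg in skills.items():
        recent = recent_rewards.get(skill, [avg])
        if sum(recent) / len(recent) >= keep_threshold:
            kept[skill] = avg
    return kept

if __name__ == "__main__":
    trajs = [(["open door", "take key", "unlock chest"], 2.0),
             (["take key", "unlock chest", "read scroll"], 1.5),
             (["wander", "wander", "wander"], 0.0)]
    skills = extract_skills(trajs)
    print(skills)                                                   # the key/chest skill survives
    print(prune_skills(skills, {("take key", "unlock chest"): [0.2, 0.1]}))  # pruned once it stops paying off
```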

ICLR Conference 2023 Conference Paper

Complexity-Based Prompting for Multi-step Reasoning

  • Yao Fu
  • Hao Peng 0018
  • Ashish Sabharwal
  • Peter Clark
  • Tushar Khot

We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on math word reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3, our approach substantially improves multi-step reasoning accuracy, with an 8.6% absolute improvement on GSM8K, and 6.4% on MathQA. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.
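
The two ideas in the abstract can be sketched in a few lines (my own simplification; counting newline-separated steps is a stand-in for real reasoning-step counts): pick the exemplars with the most reasoning steps, and at decoding time vote only over answers from the most complex sampled chains.

```python
from collections import Counter

def select_complex_exemplars(candidates, k=3):
    """candidates: list of (question, chain_of_thought, answer); keep the longest chains."""
    return sorted(candidates, key=lambda c: len(c[1].splitlines()), reverse=True)[:k]

def complexity_based_vote(samples, top_k=3):
    """samples: (chain_of_thought, answer) pairs; vote only over the most complex chains."""
    most_complex = sorted(samples, key=lambda s: len(s[0].splitlines()), reverse=True)[:top_k]
    return Counter(answer for _, answer in most_complex).most_common(1)[0][0]

if __name__ == "__main__":
    samples = [("step 1\nstep 2\nstep 3", "42"),
               ("step 1\nstep 2\nstep 3\nstep 4", "42"),
               ("step 1", "41"),
               ("step 1\nstep 2", "41")]
    print(complexity_based_vote(samples, top_k=2))   # votes over the two longest chains -> "42"
```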

ICLR Conference 2023 Conference Paper

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

  • Tushar Khot
  • Harsh Trivedi
  • Matthew Finlayson
  • Yao Fu
  • Kyle Richardson 0001
  • Peter Clark
  • Ashish Sabharwal

Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT-3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on a long-context multi-hop QA task, we can more effectively teach the sub-tasks via separate sub-task prompts; and on open-domain multi-hop QA, we can incorporate symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks. Datasets, code, and prompts are available at https://github.com/allenai/DecomP.
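
A toy sketch of the decomposition idea on a first-letter-concatenation task: a controller splits the task into sub-tasks, each handled by its own handler. The stub functions below stand in for per-sub-task prompts, and a handler could itself decompose further or be swapped for a symbolic function.

```python
def split_words(text):          # sub-task handler 1: split the input into units
    return text.split()

def first_letter(word):         # sub-task handler 2: take the first letter of one unit
    return word[0]

def concatenate(letters):       # sub-task handler 3: merge the pieces
    return "".join(letters)

def controller(task_input):
    """Decompose 'concatenate the first letters of the words' into three sub-tasks."""
    words = split_words(task_input)
    letters = [first_letter(w) for w in words]   # the same sub-task mapped over smaller inputs
    return concatenate(letters)

if __name__ == "__main__":
    print(controller("Decomposed Prompting Works"))   # -> "DPW"
```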

ICLR Conference 2023 Conference Paper

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

  • Pan Lu
  • Liang Qiu 0001
  • Kai-Wei Chang 0001
  • Ying Nian Wu
  • Song-Chun Zhu
  • Tanmay Rajpurohit
  • Peter Clark
  • Ashwin Kalyan

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples. The data and code are available at https://promptpg.github.io.
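
A rough sketch of learning example selection with a policy gradient (REINFORCE), as the abstract describes at a high level. `solve_with_examples` is a hypothetical stand-in for prompting GPT-3 with the chosen example and checking the prediction; the real method additionally conditions the policy on the test problem.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def solve_with_examples(example_id, problem):
    """Hypothetical: 1.0 if the model answers `problem` correctly given this in-context example."""
    return 1.0 if example_id % 2 == 0 else 0.0   # pretend even-numbered examples help

def train_policy(num_candidates=6, steps=300, lr=0.5):
    logits = [0.0] * num_candidates              # one score per candidate example
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        choice = random.choices(range(num_candidates), weights=probs)[0]
        reward = solve_with_examples(choice, problem="dummy")
        advantage = reward - baseline
        baseline += (reward - baseline) / t      # running-mean baseline to reduce variance
        # REINFORCE: d log pi(choice) / d logit_i = 1[i == choice] - probs[i]
        for i in range(num_candidates):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

if __name__ == "__main__":
    print([round(p, 2) for p in train_policy()])   # probability mass shifts toward helpful examples
```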

NeurIPS Conference 2023 Conference Paper

Self-Refine: Iterative Refinement with Self-Feedback

  • Aman Madaan
  • Niket Tandon
  • Prakhar Gupta
  • Skyler Hallinan
  • Luyu Gao
  • Sarah Wiegreffe
  • Uri Alon
  • Nouha Dziri

Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides *feedback* for its output and uses it to *refine* itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
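
The loop itself is simple enough to sketch. Below, `llm` is a hypothetical stand-in for a single model call (canned outputs keep the sketch runnable), with the same function playing generator, feedback provider, and refiner.

```python
def llm(prompt: str) -> str:
    """Hypothetical single-model call; canned outputs keep the sketch runnable."""
    if "Give feedback" in prompt:
        return "Looks good." if "deadline" in prompt else "Too terse; mention the deadline."
    if "Rewrite" in prompt:
        return "Could you review the draft by Friday? The deadline is Monday."
    return "Please review the draft."

def self_refine(task: str, max_iters: int = 3) -> str:
    output = llm(task)                                    # initial generation
    for _ in range(max_iters):
        feedback = llm(f"Give feedback on: {output}")     # the same model critiques its own output
        if "Looks good" in feedback:
            break
        output = llm(f"Rewrite: {output}\nFeedback: {feedback}")   # refine using the feedback
    return output

if __name__ == "__main__":
    print(self_refine("Write a short review-request email."))
```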

NeurIPS Conference 2022 Conference Paper

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

  • Pan Lu
  • Swaroop Mishra
  • Tanglin Xia
  • Liang Qiu
  • Kai-Wei Chang
  • Song-Chun Zhu
  • Oyvind Tafjord
  • Peter Clark

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.

NeurIPS Conference 2020 Conference Paper

Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

  • Alon Talmor
  • Oyvind Tafjord
  • Peter Clark
  • Yoav Goldberg
  • Jonathan Berant

To what extent can a neural network systematically reason over symbolic facts? Evidence suggests that large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. Recently, it has been shown that Transformer-based models succeed in consistent reasoning over explicit symbolic facts, under a "closed-world" assumption. However, in an open-domain setup, it is desirable to tap into the vast reservoir of implicit knowledge already encoded in the parameters of pre-trained LMs. In this work, we provide a first demonstration that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements. To do this, we describe a procedure for automatically generating datasets that teach a model new reasoning skills, and demonstrate that models learn to effectively perform inference which involves implicit taxonomic and world knowledge, chaining and counting. Finally, we show that "teaching" models to reason generalizes beyond the training distribution: they successfully compose the usage of multiple reasoning skills in single examples. Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.

AAAI Conference 2020 Conference Paper

QASC: A Dataset for Question Answering via Sentence Composition

  • Tushar Khot
  • Peter Clark
  • Michal Guerquin
  • Peter Jansen
  • Ashish Sabharwal

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using commonsense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags 20% behind human performance.
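
A toy sketch of the two-step retrieval idea (my own simplification, with plain word overlap standing in for a real retriever): retrieve a first fact for the question, then use the new terms it introduces to retrieve a bridging second fact.

```python
import re

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    return len(words(a) & words(b))

def two_step_retrieve(question, corpus):
    f1 = max(corpus, key=lambda s: overlap(s, question))
    # The terms f1 introduces beyond the question are the bridge to the second fact.
    bridge = " ".join(words(f1) - words(question))
    f2 = max((s for s in corpus if s != f1), key=lambda s: overlap(s, bridge))
    return f1, f2

if __name__ == "__main__":
    corpus = ["Differential heating of air produces wind.",
              "Wind is used for producing electricity by turbines.",
              "Rocks erode over time."]
    question = "What can differential heating of air be harnessed for?"
    print(two_step_retrieve(question, corpus))
```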

IJCAI Conference 2020 Conference Paper

Transformers as Soft Reasoners over Language

  • Peter Clark
  • Oyvind Tafjord
  • Kyle Richardson

Beginning with McCarthy's Advice Taker (1959), AI has pursued the goal of providing a system with explicit, general knowledge and having the system reason over that knowledge. However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research. This paper investigates a modern approach to this problem where the facts and rules are provided as natural language sentences, thus bypassing a formal representation. We train transformers to reason (or emulate reasoning) over these sentences using synthetically generated data. Our models, which we call RuleTakers, provide the first empirical demonstration that this kind of soft reasoning over language is learnable, can achieve high (99%) accuracy, and generalizes to test data requiring substantially deeper chaining than seen during training (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as they suggest a new role for transformers, namely as limited "soft theorem provers" operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering.
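
As an illustration of how training examples for this setup can be constructed (my own sketch, not the paper's generator), facts and rules are stated in simple language and a forward-chaining loop derives the labels the transformer is then trained to reproduce.

```python
def forward_chain(facts, rules, max_steps=10):
    """rules: list of (premises, conclusion). Return all statements derivable from the facts."""
    known = set(facts)
    for _ in range(max_steps):
        added = False
        for premises, conclusion in rules:
            if all(p in known for p in premises) and conclusion not in known:
                known.add(conclusion)
                added = True
        if not added:
            break
    return known

if __name__ == "__main__":
    facts = ["the cat is kind", "the cat is big"]
    rules = [(["the cat is kind"], "the cat is nice"),
             (["the cat is nice", "the cat is big"], "the cat is friendly")]
    closure = forward_chain(facts, rules)
    question = "the cat is friendly"
    # One synthetic (context, question, label) training example:
    print({"facts": facts, "rules": rules, "question": question, "label": question in closure})
```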

AAAI Conference 2019 Conference Paper

Declarative Question Answering over Knowledge Bases Containing Natural Language Text with Answer Set Programming

  • Arindam Mitra
  • Peter Clark
  • Oyvind Tafjord
  • Chitta Baral

While in recent years machine learning (ML) based approaches have been the popular choice for developing end-to-end question answering systems, such systems often struggle when additional knowledge is needed to correctly answer the questions. Proposed alternatives involve translating the question and the natural language text to a logical representation and then using logical reasoning. However, this alternative falters as the size of the text grows. To address this, we propose an approach that does logical reasoning over premises written in natural language text. The proposed method uses recent features of Answer Set Programming (ASP) to call external NLP modules (which may be based on ML) that perform simple textual entailment. To test our approach, we developed a corpus based on life cycle questions and showed that our system achieves up to an 18% performance gain when compared to standard MCQ solvers.

AAAI Conference 2019 Conference Paper

QUAREL: A Dataset and Models for Answering Questions about Qualitative Relationships

  • Oyvind Tafjord
  • Peter Clark
  • Matt Gardner
  • Wen-tau Yih
  • Ashish Sabharwal

Many natural language questions require recognizing and reasoning with qualitative relationships (e.g., in science, economics, and medicine), but are challenging to answer with corpus-based methods. Qualitative modeling provides tools that support such reasoning, but the semantic parsing task of mapping questions into those models has formidable challenges. We present QUAREL, a dataset of diverse story questions involving qualitative relationships that characterize these challenges, and techniques that begin to address them. The dataset has 2771 questions relating 19 different types of quantities. For example, “Jenny observes that the robot vacuum cleaner moves slower on the living room carpet than on the bedroom carpet. Which carpet has more friction?” We contribute (1) a simple and flexible conceptual framework for representing these kinds of questions; (2) the QUAREL dataset, including logical forms, exemplifying the parsing challenges; and (3) two novel models for this task, built as extensions of type-constrained semantic parsing. The first of these models (called QUASP+) significantly outperforms off-the-shelf tools on QUAREL. The second (QUASP+ZERO) demonstrates zero-shot capability, i.e., the ability to handle new qualitative relationships without requiring additional training data, something not possible with previous models. This work thus makes inroads into answering complex, qualitative questions that require reasoning, and scaling to new relationships at low cost. The dataset and models are available at http://data.allenai.org/quarel.

KR Conference 2018 Short Paper

Knowledge Representation and Reasoning in Answering Science Questions: A Case Study for Food Web Questions

  • Arindam Mitra
  • Chitta Baral
  • Peter Clark

A group of researchers from the Allen Institute for Artificial Intelligence has proposed the Aristo challenge, which requires answering science questions. The goal of the challenge is to aid in the development of machines that can understand natural language, use knowledge, and reason. In this work, we take a subset of those questions, namely the questions from the chapters on food webs. We model a consequence operator for food webs that, given a food web and a perturbation to some of the populations, aims to compute possible effects on the other populations in the food web. We then use this operator to answer questions of the kind ‘Explain why the population of rabbits might decrease if the population of mice decreased.’ or ‘Explain why the population of rabbits might change if the population of mice decreased.’ Unlike previous works, which deal only with direct predator-prey situations, here we aim to characterize the effect(s) even when the two populations in the question are indirectly related.
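
An illustrative sketch of a consequence operator over a food web (my own simplification, not the paper's ASP formulation): edges point from prey to predator, and a change in one population is propagated to its direct and indirect consumers; competition and prey-side effects are ignored here.

```python
def propagate(food_web, perturbed, change, effects=None):
    """food_web: dict prey -> list of predators. Return population -> predicted change."""
    effects = effects if effects is not None else {perturbed: change}
    for predator in food_web.get(perturbed, []):
        if predator not in effects:
            # Less (more) prey means the predator population may decrease (increase).
            effects[predator] = change
            propagate(food_web, predator, change, effects)
    return effects

if __name__ == "__main__":
    web = {"grass": ["mice", "rabbits"], "mice": ["owls"], "rabbits": ["foxes"], "owls": []}
    # e.g. what might happen to the owls if the grass decreased (an indirect relationship)?
    print(propagate(web, "grass", "decrease"))
```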

AAAI Conference 2018 Conference Paper

SciTaiL: A Textual Entailment Dataset from Science Question Answering

  • Tushar Khot
  • Ashish Sabharwal
  • Peter Clark

We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SCITAIL is the first entailment set that is created solely from natural sentences that already exist independently “in the wild” rather than sentences authored specifically for the entailment task. Different from existing entailment datasets, we create hypotheses from science questions and the corresponding answer candidates, and premises from relevant web sentences retrieved from a large corpus. These sentences are often linguistically challenging. This, combined with the high lexical similarity of premise and hypothesis for both entailed and non-entailed pairs, makes this new entailment task particularly difficult. The resulting challenge is evidenced by state-of-the-art textual entailment systems achieving mediocre performance on SCITAIL, especially in comparison to a simple majority class baseline. As a step forward, we demonstrate that one can improve accuracy on SCITAIL by 5% using a new neural model that exploits linguistic structure.

AAAI Conference 2016 Conference Paper

Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions

  • Peter Clark
  • Oren Etzioni
  • Tushar Khot
  • Ashish Sabharwal
  • Oyvind Tafjord
  • Peter Turney
  • Daniel Khashabi

What capabilities are required for an AI system to pass standard 4th Grade Science Tests? Previous work has examined the use of Markov Logic Networks (MLNs) to represent the requisite background knowledge and interpret test questions, but did not improve upon an information retrieval (IR) baseline. In this paper, we describe an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results. We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system’s score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. We conclude with a detailed analysis, illustrating the complementary strengths of each method in the ensemble. Our datasets are being released to enable further research.

IJCAI Conference 2016 Conference Paper

Question Answering via Integer Programming over Semi-Structured Knowledge

  • Daniel Khashabi
  • Tushar Khot
  • Ashish Sabharwal
  • Peter Clark
  • Oren Etzioni
  • Dan Roth

Answering science questions posed in natural language is an important AI challenge. Answering such questions often requires non-trivial inference and knowledge that goes beyond factoid retrieval. Yet, most systems for this task are based on relatively shallow Information Retrieval (IR) and statistical correlation techniques operating on large unstructured corpora. We propose a structured inference system for this task, formulated as an Integer Linear Program (ILP), that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts. On a dataset of real, unseen science questions, our system significantly outperforms (+14%) the best previous attempt at structured reasoning for this task, which used Markov Logic Networks (MLNs). It also improves upon a previous ILP formulation by 17.7%. When combined with unstructured inference methods, the ILP system significantly boosts overall performance (+10%). Finally, we show our approach is substantially more robust to a simple answer perturbation compared to statistical correlation methods.

KR Conference 2004 Conference Paper

A Question-Answering System for AP Chemistry: Assessing KR&R Technologies

  • Ken Barker
  • Vinay Chaudhri
  • Jason Chaw
  • Peter Clark
  • James Fan
  • David Israel
  • Sunil Mishra
  • Bruce Porter

Basic research in knowledge representation and reasoning (KR&R) has steadily advanced over the years, but it has been difficult to assess the capability of fielded systems derived from this research. In this paper, we present a knowledge-based question-answering system that we developed as part of a broader effort by Vulcan Inc. to assess KR&R technologies, and the result of its assessment. The challenge problem presented significant new challenges for knowledge representation, compared with earlier such assessments, due to the wide variability of question types that the system was expected to answer. Our solution integrated several modern KR&R technologies, in particular semantically well-defined frame systems, automatic classification methods, reusable ontologies, a methodology for knowledge base construction, and a novel extension of methods for explanation generation. The resulting system exhibited high performance, achieving scores for both accuracy and explanation which were comparable to human performance on similar tests. While there are qualifications to this result, it is a significant achievement and an informative data point about the state of the art in KR&R, and reflects significant progress by the field.

KR Conference 2004 Conference Paper

Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems

  • Noah S. Friedland
  • Paul G. Allen
  • Michael Witbrock
  • Gavin Matthews
  • Nancy Salay
  • Pierluigi Miraglia
  • Jurgen Angele
  • Steffen Staab

The Halo Pilot, a six-month effort to evaluate the state-of-the-art in applied Knowledge Representation and Reasoning (KRR) systems, collaboratively developed a taxonomy of failures with the goal of creating a common framework of metrics against which we could measure inter- and intra-system failure characteristics of each of the three Halo knowledge applications. This platform-independent taxonomy was designed with the intent of maximizing its coverage of potential failure types; providing the necessary granularity and precision to enable clear categorization of failure types; and providing a productive framework for short- and longer-term corrective action. Examining the failure analysis and initial empirical use of the taxonomy provides quantitative insights into the strengths and weaknesses of individual systems and raises some issues shared by all three. These results are particularly interesting when considered against the long history of assumed reasons for knowledge system failure. Our study has also uncovered some shortcomings in the taxonomy itself, implying the need to improve both its granularity and precision. It is the hope of Project Halo to eventually produce a failure taxonomy and associated methodology that will be of general use in the fine-grained analysis of knowledge systems.

AAAI Conference 1997 Conference Paper

Building Concept Representations from Reusable Components

  • Peter Clark

Our goal is to build knowledge-based systems capable of answering a wide variety of questions, including questions that are unanticipated when the knowledge base is built. For systems to achieve this level of competence and generality, they require the ability to dynamically construct new concept representations, and to do so in response to the questions and tasks posed to them. Our approach to meeting this requirement is to build knowledge bases of generalized, representational components, and to develop methods for automatically composing components on demand. This work extends the normal inheritance approach used in frame-based systems, and imports ideas from several different areas of AI, in particular compositional modeling, terminological reasoning, and ontological engineering. The contribution of this work is a novel integration of these methods that improves the efficiency of building knowledge bases and the robustness of using them.