Arrow Research search

Author name cluster

Martha Lewis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

TMLR Journal 2025 Journal Article

Evaluating the Robustness of Analogical Reasoning in Large Language Models

  • Martha Lewis
  • Melanie Mitchell

Large language models (LLMs) have performed well on several reasoning benchmarks, including ones that test analogical reasoning abilities. However, there is debate on the extent to which they are performing general abstract reasoning versus employing shortcuts or other non-robust processes, such as ones that overly rely on similarity to what has been seen in their training data. Here we investigate the robustness of analogy-making abilities previously claimed for LLMs on three of four domains studied by Webb et al. (2023): letter-string analogies, digit matrices, and story analogies. For each of these domains we test humans and GPT models on robustness to variants of the original analogy problems—versions that test the same abstract reasoning abilities but that are likely dissimilar from tasks in the pre-training data. The performance of a system that uses robust abstract reasoning should not decline substantially on these variants. On simple letter-string analogies, we find that while the performance of humans remains high for two types of variants we tested, the GPT models' performance declines sharply. This pattern is less pronounced as the complexity of these analogy problems is increased, as both humans and GPT models perform poorly on both the original and variant problems requiring more complex analogies. On digit-matrix problems, we find a similar pattern but only on one of the two types of variants we tested. Lastly, we assess the robustness of humans and GPT models on story-based analogy problems, finding that, unlike humans, the performance of GPT models is susceptible to answer-order effects, and that GPT models may also be more sensitive than humans to paraphrasing. This work provides evidence that, despite previously reported successes of LLMs on zero-shot analogical reasoning, these models often lack the robustness of zero-shot human analogy-making, exhibiting brittleness on most of the variations we tested. More generally, this work points to the importance of carefully evaluating AI systems not only for accuracy but also for robustness when testing their cognitive capabilities. Code, data, and results for all experiments are available at https://github.com/marthaflinderslewis/robust-analogy.
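As a hypothetical illustration of the kind of variant the abstract describes (the shuffled-alphabet construction below is my assumption; the paper's actual problem sets are in the linked repository), a simple successor-rule letter-string analogy can be re-posed over a permuted alphabet, so that the abstract rule is unchanged while surface similarity to pre-training text is reduced:

```python
import random

def successor_analogy(alphabet, start, length):
    """Build a letter-string analogy 'source -> target' where the rule is:
    replace the final letter with its successor in the given alphabet."""
    src = [alphabet[(start + i) % len(alphabet)] for i in range(length)]
    tgt = src[:-1] + [alphabet[(start + length) % len(alphabet)]]
    return " ".join(src), " ".join(tgt)

standard = list("abcdefghijklmnopqrstuvwxyz")
src, tgt = successor_analogy(standard, 0, 4)   # "a b c d" -> "a b c e"

# Variant: the same abstract rule, but over a shuffled ("counterfactual")
# alphabet, so memorized alphabetical order no longer helps.
rng = random.Random(0)
shuffled = standard[:]
rng.shuffle(shuffled)
v_src, v_tgt = successor_analogy(shuffled, 0, 4)
```

A system relying on robust abstract reasoning should solve both forms; one relying on familiarity with the standard alphabet may fail on the variant.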

TMLR Journal 2023 Journal Article

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

  • Aarohi Srivastava
  • Abhinav Rastogi
  • Abhishek Rao
  • Abu Awal Md Shoeb
  • Abubakar Abid
  • Adam Fisch
  • Adam R. Brown
  • Adam Santoro

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

FLAP Journal 2020 Journal Article

Semantic Spaces at the Intersection of NLP, Physics, and Cognitive Science.

  • Martha Lewis
  • Dan Marsden
  • Mehrnoosh Sadrzadeh

The ability to compose parts to form a more complex whole, and to analyze a whole as a combination of elements, is desirable across disciplines. Semantic Spaces at the Intersection of Natural Language Processing (NLP), Physics, and Cognitive Science brought together researchers applying similar compositional approaches within the three disciplines. The categorical model of [6], inspired by quantum protocols, has provided a convincing account of compositionality in vector space models of NLP. Similar category-theoretic approaches have been applied in cognitive science, in the context of conceptual spaces. The interplay between the three disciplines fostered theoretically motivated approaches to understanding how meanings of words interact in sentences and discourse, and how concepts develop in a cognitive space. In this volume, commonalities between the compositional mechanisms employed are extracted, and applications and phenomena traditionally thought of as ‘non-compositional’ are shown to be compositional.

FLAP Journal 2019 Journal Article

Compositionality for Recursive Neural Networks.

  • Martha Lewis

Modelling compositionality has been a longstanding area of research in the field of vector space semantics. The categorical approach to compositionality maps grammar onto vector spaces in a principled way, but comes under fire for requiring the formation of very high-dimensional matrices and tensors, and therefore being computationally infeasible. In this paper I show how a linear simplification of recursive neural tensor network models can be mapped directly onto the categorical approach, giving a way of computing the required matrices and tensors. This mapping suggests a number of lines of research for both categorical compositional vector space models of meaning and for recursive neural network models of compositionality.
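A minimal sketch (my own toy example, not code from the paper) of the categorical picture this abstract builds on: nouns are vectors, an adjective is a matrix acting on a noun, and a transitive verb is an order-3 tensor contracted with subject and object vectors. For d-dimensional spaces that verb tensor has d³ entries, which is the infeasibility the linear simplification addresses:

```python
# Toy vector-space semantics: nouns are vectors, an adjective is a matrix
# (noun -> noun), and a transitive verb is an order-3 tensor that consumes
# a subject and an object vector to yield a sentence vector.

def mat_vec(m, v):
    """Apply an adjective matrix to a noun vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def verb_apply(t, subj, obj):
    """Contract tensor t[s][k][o] with subject (index s) and object
    (index o), leaving the sentence-space index k."""
    return [
        sum(t[s][k][o] * subj[s] * obj[o]
            for s in range(len(subj)) for o in range(len(obj)))
        for k in range(len(t[0]))
    ]

dog = [1.0, 0.5]
cats = [0.0, 1.0]
big = [[2.0, 0.0], [0.0, 1.0]]          # adjective as a 2x2 matrix
chase = [[[1.0, 0.0], [0.0, 0.0]],      # verb as a 2x2x2 tensor
         [[0.0, 1.0], [1.0, 0.0]]]

big_dog = mat_vec(big, dog)             # adjective applied to noun
sentence = verb_apply(chase, big_dog, cats)
```

At realistic dimensions the verb tensor becomes enormous (e.g. 300³ entries for 300-dimensional spaces), which is why a principled way to compute these tensors, such as the mapping from recursive neural tensor networks described here, matters.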

IJCAI Conference 2013 Conference Paper

Concept Generation in Language Evolution

  • Martha Lewis
  • Jonathan Lawry

This thesis investigates the generation of new concepts from combinations of existing concepts as a language evolves. We give a method for combining concepts, and investigate the utility of composite concepts in language evolution and thence the utility of concept generation itself.