
FOCS 2025

Density Measures for Language Generation

Conference Paper · Accepted Paper · Algorithms and Complexity · Theoretical Computer Science

Abstract

The recent successes of large language models (LLMs) have led to a surge of theoretical research into the properties of language generation. A recent line of work has proposed an abstract view of the question — called language generation in the limit — in which we view language generation as a game played between an adversary and an algorithm: the adversary generates strings from an unknown language K, known only to come from a countable collection of candidate languages, and after observing a finite set of these strings, the algorithm must generate new strings from the language K that it hasn't seen before. This formalism highlights an important tension: the trade-off between validity (that the algorithm should only produce strings that come from the language) and breadth (that the algorithm should be able to produce "many" strings from the language). This validity-breadth trade-off is a central issue in applied work on language generation as well, where it arises in the balance between hallucination, when models generate invalid utterances, and mode collapse, when models only generate from a very restricted set of feasible outputs. Despite its importance, this trade-off has been challenging to study quantitatively. In this work we develop ways of quantifying this trade-off, by formalizing the notion of breadth through measures of density. Roughly speaking, the density of one language L in another language L' is the limiting fraction of strings from L among the strings of L', where we take the limit over longer and longer finite prefixes of L'. Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language K, in this asymptotic sense, and this represents an important failure of breadth that might seem necessary in any solution to the problem. 
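In symbols, one natural reading of this definition (the notation here is illustrative, not necessarily the paper's): writing $L'_n$ for the set of the first $n$ strings of $L'$ under a fixed enumeration, the density of $L$ in $L'$ can be taken as

```latex
\mu(L \mid L') \;=\; \liminf_{n \to \infty} \frac{|L \cap L'_n|}{n}
```

with a liminf rather than a limit, since the finite-prefix fractions need not converge.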
We show here that such a failure is not in fact necessary: we provide an algorithm for language generation in the limit whose outputs have strictly positive density in the true language K. We also study the internal representations built by algorithms for this problem — the sequence of hypothesized candidate languages they iterate through as they perform generation — showing a precise sense in which the strongest form of breadth achievable is one that may need to "oscillate" indefinitely between hypothesized representations of high density and low density. Our analysis introduces a novel topology on language families, with notions of convergence and limit points in this topology playing a key role in the analysis.
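To make the prefix-fraction idea concrete, here is a small sketch (the names `empirical_density` and `all_binary` are ours for illustration; length-lexicographic enumeration is one natural choice of ordering, and the example below is only a toy analogue of the non-convergence the abstract describes, not the paper's construction):

```python
from itertools import count, islice, product

def all_binary():
    """Enumerate all nonempty binary strings in length-lexicographic order."""
    for k in count(1):
        for bits in product('01', repeat=k):
            yield ''.join(bits)

def empirical_density(in_L, enumerate_Lprime, n):
    """Fraction of the first n strings of L' that also lie in L:
    a finite-prefix approximation of the density of L in L'."""
    prefix = islice(enumerate_Lprime(), n)
    return sum(1 for s in prefix if in_L(s)) / n

# Toy example: L = even-length binary strings inside L' = all binary strings.
# The prefix fractions do not converge: they oscillate between roughly 2/3
# (just after an even-length block) and 1/3 (just after an odd-length block),
# so the limiting (liminf) density is 1/3.
even_length = lambda s: len(s) % 2 == 0
print(empirical_density(even_length, all_binary, 2**9 - 2))   # end of length-8 block
print(empirical_density(even_length, all_binary, 2**10 - 2))  # end of length-9 block
```

The prefix lengths `2**9 - 2` and `2**10 - 2` are exactly the counts of nonempty binary strings of length at most 8 and at most 9, so each probe lands at the end of a full length block.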

Authors

Keywords

  • large language models
  • language learning
  • hallucination
  • mode collapse
  • density
  • topology
  • convergence
  • limit points

Context

Venue: IEEE Symposium on Foundations of Computer Science (FOCS)
Archive span: 1975-2025
Indexed papers: 3809
Paper id: 399984804959328186