FOCS 2025 Conference Paper
Density Measures for Language Generation
- Jon M. Kleinberg
- Fan Wei
The recent successes of large language models (LLMs) have led to a surge of theoretical research into the properties of language generation. A recent line of work has proposed an abstract view of the question — called language generation in the limit — in which we view language generation as a game played between an adversary and an algorithm: the adversary generates strings from an unknown language K, known only to come from a countable collection of candidate languages, and after observing a finite set of these strings, the algorithm must generate new strings from the language K that it hasn't seen before.

This formalism highlights an important tension: the trade-off between validity (that the algorithm should only produce strings that come from the language) and breadth (that the algorithm should be able to produce "many" strings from the language). This validity-breadth trade-off is a central issue in applied work on language generation as well, where it arises in the balance between hallucination, when models generate invalid utterances, and mode collapse, when models only generate from a very restricted set of feasible outputs.

Despite its importance, this trade-off has been challenging to study quantitatively. In this work we develop ways of quantifying this trade-off, by formalizing the notion of breadth through measures of density. Roughly speaking, the density of one language L in another language L' is the limiting fraction of strings from L among the strings of L', where we take the limit over longer and longer finite prefixes of L'. Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language K, in this asymptotic sense, and this represents an important failure of breadth that might seem necessary in any solution to the problem.
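To make the density notion concrete, the following sketch computes the sequence of prefix fractions whose limit (when it exists) is the density of L in L'. The helper names (`density_prefix_fractions`, the example languages) are illustrative assumptions, not part of the paper; exact rationals are used to avoid floating-point drift.

```python
from fractions import Fraction

def density_prefix_fractions(in_L, enum_Lprime, n_terms):
    """Return the fractions |{first n strings of L'} ∩ L| / n for n = 1..n_terms.

    in_L: membership predicate for L.
    enum_Lprime: iterator enumerating the strings of L' in some fixed order.
    The density of L in L' is the limit of these fractions as n grows,
    when that limit exists (illustrative helper, not from the paper).
    """
    count = 0
    fractions = []
    for n, s in enumerate(enum_Lprime, start=1):
        if n > n_terms:
            break
        if in_L(s):
            count += 1
        fractions.append(Fraction(count, n))
    return fractions

# Toy example: L = strings encoding even numbers, L' = all naturals as strings.
is_even = lambda s: int(s) % 2 == 0
naturals = (str(i) for i in range(10**6))
fracs = density_prefix_fractions(is_even, naturals, 1000)
# The prefix fractions tend to 1/2, so L has density 1/2 in L'.
```

A generator whose output set had zero density in K would, in this sense, see its prefix fractions tend to 0 even though every output is valid.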
We show here that such a failure is not in fact necessary: we provide an algorithm for language generation in the limit whose outputs have strictly positive density in the true language K. We also study the internal representations built by algorithms for this problem — the sequence of hypothesized candidate languages they iterate through as they perform generation — showing a precise sense in which the strongest form of breadth achievable is one that may need to "oscillate" indefinitely between hypothesized representations of high density and low density. Our analysis introduces a novel topology on language families, with notions of convergence and limit points in this topology playing a key role in the analysis.