
FOCS 2025

Density Measures for Language Generation

Conference Paper · Accepted Paper · Algorithms and Complexity · Theoretical Computer Science

Abstract

The recent successes of large language models (LLMs) have led to a surge of theoretical research into the properties of language generation. A recent line of work has proposed an abstract view of the question — called language generation in the limit — in which we view language generation as a game played between an adversary and an algorithm: the adversary generates strings from an unknown language K, known only to come from a countable collection of candidate languages, and after observing a finite set of these strings, the algorithm must generate new strings from the language K that it hasn't seen before. This formalism highlights an important tension: the trade-off between validity (that the algorithm should only produce strings that come from the language) and breadth (that the algorithm should be able to produce "many" strings from the language). This validity-breadth trade-off is a central issue in applied work on language generation as well, where it arises in the balance between hallucination, when models generate invalid utterances, and mode collapse, when models only generate from a very restricted set of feasible outputs. Despite its importance, this trade-off has been challenging to study quantitatively. In this work we develop ways of quantifying this trade-off, by formalizing the notion of breadth through measures of density. Roughly speaking, the density of one language L in another language L' is the limiting fraction of strings from L among the strings of L', where we take the limit over longer and longer finite prefixes of L'. Existing algorithms for language generation in the limit produce output sets that can have zero density in the true language K, in this asymptotic sense, and this represents an important failure of breadth that might seem necessary in any solution to the problem. 
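In symbols, one natural reading of this definition (the notation here is illustrative, not necessarily the paper's): writing $L'_n$ for the set of the first $n$ strings of $L'$ under a fixed enumeration, the density of $L$ in $L'$ can be taken as

```latex
\mu(L \mid L') \;=\; \liminf_{n \to \infty} \frac{|L \cap L'_n|}{n}
```

with a liminf rather than a limit, since the finite-prefix fractions need not converge.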
We show here that such a failure is not in fact necessary: we provide an algorithm for language generation in the limit whose outputs have strictly positive density in the true language K. We also study the internal representations built by algorithms for this problem — the sequence of hypothesized candidate languages they iterate through as they perform generation — showing a precise sense in which the strongest form of breadth achievable is one that may need to "oscillate" indefinitely between hypothesized representations of high density and low density. Our analysis introduces a novel topology on language families, with notions of convergence and limit points in this topology playing a key role in the analysis.
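To make the prefix-fraction idea concrete, here is a small sketch (the names `empirical_density` and `all_binary` are ours for illustration; length-lexicographic enumeration is one natural choice of ordering, and the example below is only a toy analogue of the non-convergence the abstract describes, not the paper's construction):

```python
from itertools import count, islice, product

def all_binary():
    """Enumerate all nonempty binary strings in length-lexicographic order."""
    for k in count(1):
        for bits in product('01', repeat=k):
            yield ''.join(bits)

def empirical_density(in_L, enumerate_Lprime, n):
    """Fraction of the first n strings of L' that also lie in L:
    a finite-prefix approximation of the density of L in L'."""
    prefix = islice(enumerate_Lprime(), n)
    return sum(1 for s in prefix if in_L(s)) / n

# Toy example: L = even-length binary strings inside L' = all binary strings.
# The prefix fractions do not converge: they oscillate between roughly 2/3
# (just after an even-length block) and 1/3 (just after an odd-length block),
# so the limiting (liminf) density is 1/3.
even_length = lambda s: len(s) % 2 == 0
print(empirical_density(even_length, all_binary, 2**9 - 2))   # end of length-8 block
print(empirical_density(even_length, all_binary, 2**10 - 2))  # end of length-9 block
```

The prefix lengths `2**9 - 2` and `2**10 - 2` are exactly the counts of nonempty binary strings of length at most 8 and at most 9, so each probe lands at the end of a full length block.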

Authors

Keywords

  • large language models
  • language learning
  • hallucination
  • mode collapse
  • density
  • topology
  • convergence
  • limit points

Context

Venue: IEEE Symposium on Foundations of Computer Science (FOCS)
Archive span: 1975-2025
Indexed papers: 3809
Paper id: 399984804959328186