Arrow Research search

Author name cluster

Noah Smith

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers (9)

NeurIPS 2025 Conference Paper

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

  • Brian Zheng
  • Alisa Liu
  • Orevaoghene Ahia
  • Jonathan Hayase
  • Yejin Choi
  • Noah Smith

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the language model vocabulary, including tokenizing by character. In this paper, we investigate the robustness of LMs to input encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We find that overall stronger models tend to be more robust, and that robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to 15%, and right-aligned digit grouping enhances large-number arithmetic by over 33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We provide evidence that both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings). However, base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less committed to their tokenizer than previously believed, and highlight the promise of intervening on tokenization at inference time to boost language model performance.
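
As an illustration of the setup described in this abstract, the following is a minimal sketch of encoding one string both canonically and character by character over the same vocabulary. The tokenizer choice ("gpt2") and the reliance on a byte-level BPE vocabulary (which contains every ASCII character as its own token) are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: one string, two tokenizations over the same vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")      # hypothetical model choice
text = "tokenization"

canonical_ids = tok.encode(text)                 # the deterministic "canonical" segmentation
char_ids = tok.convert_tokens_to_ids(list(text)) # non-canonical: one token per character

print(tok.convert_ids_to_tokens(canonical_ids))  # e.g. ['token', 'ization']
print(tok.convert_ids_to_tokens(char_ids))       # ['t', 'o', 'k', 'e', 'n', ...]
print(tok.decode(canonical_ids) == tok.decode(char_ids))  # True: both decode to the same string
```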

NeurIPS 2025 Conference Paper

FlexOLMo: Open Language Models for Flexible Data Use

  • Weijia Shi
  • Akshita Bhagia
  • Kevin Farhat
  • Niklas Muennighoff
  • Jacob Morrison
  • Evan Walsh
  • Dustin Schwenk
  • Shayne Longpre

We introduce FlexOLMo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on private datasets, and (2) data-flexible inference, where these parameters along with their associated data can be easily included or excluded from model inferences with no further training. FlexOLMo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on private datasets and later integrated through a new nonparametric routing without any joint training across datasets. FlexOLMo is trained on FLEXMIX, a corpus we curate comprising seven restricted sets, either real or realistic approximations, alongside publicly available datasets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, significantly benefiting from these restricted sets (an average 41% relative improvement) while allowing flexible opt-out at inference time (e.g., for users without appropriate licenses or permissions). Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, FlexOLMo enables training on restricted data while keeping data local and supports fine-grained control of data access at inference.
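
A schematic sketch of the data-flexible inference idea follows; it is not the FlexOLMo implementation. It combines independently trained expert modules through a fixed, embedding-based router and lets individual experts be excluded at inference time. The embedding-similarity routing rule, the top-k choice, and all names are illustrative assumptions.

```python
# Schematic sketch (not the FlexOLMo code): independently trained expert FFNs
# combined through a fixed router that scores each token against one stored
# embedding per expert (e.g., an embedding of that expert's training domain).
import torch
import torch.nn.functional as F

def moe_forward(x, experts, expert_embs, active, k=2):
    """x: [tokens, d]; experts: list of callables [m, d] -> [m, d]; expert_embs: [n_experts, d]."""
    scores = x @ expert_embs.T                               # nonparametric routing scores
    excluded = [i for i in range(len(experts)) if i not in active]
    if excluded:                                             # opt-out: excluded experts never fire
        scores[:, excluded] = float("-inf")
    top = scores.topk(k, dim=-1)
    gates = F.softmax(top.values, dim=-1)                    # renormalize over the selected top-k
    out = torch.zeros_like(x)
    for slot in range(k):
        for i, expert in enumerate(experts):
            hit = top.indices[:, slot] == i
            if hit.any():
                out[hit] += gates[hit, slot, None] * expert(x[hit])
    return out
```

Under this sketch, an excluded expert's parameters simply never participate in the forward pass, which is what makes opt-out possible without any retraining.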

NeurIPS 2025 Conference Paper

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

  • David Heineman
  • Valentin Hofmann
  • Ian Magnusson
  • Yuling Gu
  • Noah Smith
  • Hanna Hajishirzi
  • Kyle Lo
  • Jesse Dodge

Developing large language models is expensive and often involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable and useful for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark’s ability to separate better models from worse models, and noise, a benchmark’s sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce four interventions designed to directly affect signal or noise. For example, we propose that switching to a metric with better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and lower scaling law error. We also find that filtering noisy benchmarks so that they have a better signal-to-noise ratio leads to more reliable evaluations, and that averaging the output of a model's checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 465 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 50K evaluation benchmark results, totaling 200M instances.
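
A minimal sketch of one possible signal-to-noise computation is below, assuming one simple formalization: signal as the spread of scores across different models, noise as the variability of one model's score across its late-training checkpoints. The exact estimators in the paper may differ.

```python
# Minimal sketch of a signal-to-noise ratio for a benchmark; the definitions
# here (score range across models / checkpoint standard deviation) are an
# assumed formalization, not necessarily the paper's.
import numpy as np

def signal_to_noise(final_scores, checkpoint_scores):
    """final_scores: one score per model; checkpoint_scores: one model's scores
    over consecutive late-training checkpoints."""
    signal = np.max(final_scores) - np.min(final_scores)  # separation between models
    noise = np.std(checkpoint_scores)                     # step-to-step variability
    return signal / noise

# Hypothetical benchmark accuracies for four models and five checkpoints.
print(signal_to_noise([0.61, 0.58, 0.52, 0.47], [0.468, 0.471, 0.465, 0.473, 0.470]))
```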

NeurIPS 2025 Conference Paper

The Leaderboard Illusion

  • Shivalika Singh
  • Yiyang Nan
  • Alex Wang
  • Daniel Dsouza
  • Sayash Kapoor
  • Ahmet Üstün
  • Sanmi Koyejo
  • Yuntian Deng

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we found one provider testing 27 private variants before making one model public at the second position on the leaderboard. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. The top two providers have individually received an estimated 19.2% and 20.4% of all data on the arena. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. With conservative estimates, we show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on ArenaHard, a test set from the arena distribution. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.
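
The selection-bias mechanism described above (report only the best of several private variants) can be illustrated with a small simulation; the score scale, noise level, and best-of-27 setup below are illustrative assumptions, not numbers from the paper.

```python
# Small simulation of best-of-N selective disclosure: N variants with identical
# true skill are evaluated with noise, and only the maximum is published.
import numpy as np

rng = np.random.default_rng(0)
true_score, eval_noise, trials = 1200.0, 20.0, 10_000   # illustrative values

single = rng.normal(true_score, eval_noise, size=trials)
best_of_27 = rng.normal(true_score, eval_noise, size=(trials, 27)).max(axis=1)

print(single.mean())      # ~1200: a single submission is unbiased
print(best_of_27.mean())  # noticeably higher: bias from publishing only the max
```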

AAAI 2015 Conference Paper

The Utility of Text: The Case of Amicus Briefs and the Supreme Court

  • Yanchuan Sim
  • Bryan Routledge
  • Noah Smith

We explore the idea that authoring a piece of text is an act of maximizing one’s expected utility. To make this idea concrete, we consider the societally important decisions of the Supreme Court of the United States. Extensive past work in quantitative political science provides a framework for empirically modeling the decisions of justices and how they relate to text. We incorporate into such a model texts authored by amici curiae (“friends of the court” separate from the litigants) who seek to weigh in on the decision, then explicitly model their goals in a random utility model. We demonstrate the benefits of this approach in improved vote prediction and the ability to perform counterfactual analysis.

AAAI 2015 Conference Paper

Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning

  • Dan Garrette
  • Chris Dyer
  • Jason Baldridge
  • Noah Smith

Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that, in combination with a small universal set of rules, specify the syntactic configurations in which they may occur. Previous work has shown that learning sequence models for CCG tagging can be improved by using priors that are sensitive to the formal properties of CCG as well as cross-linguistic universals. We extend this approach to the task of learning a full CCG parser from weak supervision. We present a Bayesian formulation for CCG parser induction that assumes only supervision in the form of an incomplete tag dictionary mapping some word types to sets of potential categories. Our approach outperforms a baseline model trained with uniform priors by exploiting universal, intrinsic properties of the CCG formalism to bias the model toward simpler, more cross-linguistically common categories.

NeurIPS 2014 Conference Paper

Conditional Random Field Autoencoders for Unsupervised Structured Prediction

  • Waleed Ammar
  • Chris Dyer
  • Noah Smith

We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observed data using a feature-rich conditional random field (CRF). Then a reconstruction of the input is (re)generated, conditional on the latent structure, using a generative model which factorizes similarly to the CRF. The autoencoder formulation enables efficient exact inference without resorting to unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. Finally, we show competitive results with instantiations of the framework for two canonical tasks in natural language processing: part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines.
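
A schematic form of the training objective implied by this description is below: a CRF encoder predicts the latent structure from the input, and a generative model reconstructs a copy of the input from that structure, with the structure marginalized out. The notation (Λ for CRF parameters, θ for the generative model, x̂ for the reconstructed copy of x) is illustrative rather than taken from the paper.

```latex
% Schematic CRF-autoencoder objective over a corpus \mathcal{D}; notation is illustrative.
\max_{\Lambda,\,\theta} \;\sum_{x \in \mathcal{D}}
  \log \sum_{y \in \mathcal{Y}(x)}
    \underbrace{p_{\Lambda}(y \mid x)}_{\text{feature-rich CRF encoder}}\;
    \underbrace{p_{\theta}(\hat{x} \mid y)}_{\text{generative reconstruction}}
```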

NeurIPS 2010 Conference Paper

Empirical Risk Minimization with Approximations of Probabilistic Grammars

  • Noah Smith
  • Shay Cohen

Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting.
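
For concreteness, a generic form of log-loss empirical risk minimization for a fixed grammar with parameters θ might look as follows; the notation (x a sentence, y a derivation) is illustrative and not copied from the paper.

```latex
% Log-loss ERM for a fixed probabilistic grammar; notation is illustrative.
% Supervised setting: derivations y_i are observed.
\hat{\theta} \;=\; \arg\min_{\theta} \;\frac{1}{n} \sum_{i=1}^{n} -\log p_{\theta}(x_i, y_i)
% Unsupervised setting: derivations are marginalized out.
\hat{\theta} \;=\; \arg\min_{\theta} \;\frac{1}{n} \sum_{i=1}^{n} -\log \sum_{y} p_{\theta}(x_i, y)
```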

NeurIPS 2008 Conference Paper

Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction

  • Shay Cohen
  • Kevin Gimpel
  • Noah Smith

We explore a new Bayesian model for probabilistic grammars, a family of distributions over discrete structures that includes hidden Markov models and probabilistic context-free grammars. Our model extends the correlated topic model framework to probabilistic grammars, exploiting the logistic normal distribution as a prior over the grammar parameters. We derive a variational EM algorithm for that model, and then experiment with the task of unsupervised grammar induction for natural language dependency parsing. We show that our model achieves superior results over previous models that use different priors.
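
As background on the prior itself: a logistic normal draws a Gaussian vector and maps it through a softmax, so the covariance matrix can encode correlations among the events of a multinomial (here, grammar rule probabilities). The notation below is generic rather than the paper's.

```latex
% Logistic normal prior over a K-outcome multinomial \theta;
% correlations between events enter through the covariance \Sigma.
\eta \sim \mathcal{N}(\mu, \Sigma),
\qquad
\theta_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)},
\quad k = 1, \dots, K .
```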