Arrow Research search

Author name cluster

Jenia Jitsev

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

TMLR 2026 · Journal Article

Detecting generalization deficits in large language and reasoning models by using natural variations in simple problems

  • Marianna Nezhurina
  • Lucia Cipolina-Kun
  • Mehdi Cherti
  • Jenia Jitsev

Large language and reasoning models (LLMs, LRMs) are instances of foundation models exhibiting scaling laws that predict generalization improvement when increasing the pre-training scale. As such, they are supposed to possess strong generalization and therefore transfer robustly across various tasks and conditions in a few-shot or zero-shot manner. Such claims rely on various standardized benchmarks that should measure core functions like generalization and reasoning, where state-of-the-art (SOTA) models score high. We demonstrate a remarkable zero-shot generalization deficit in most SOTA models that claim strong function, including reasoning models like DeepSeek R1 or o1-mini trained at the largest scales, using a simple, short common-sense math problem formulated in concise natural language and easily solvable by humans, which we term the Alice in Wonderland (AIW) problem. The deficit manifests in strong performance fluctuations on natural variations of the simple problem template that do not change the problem structure or its difficulty at all. By testing models on further control problems of similar form, we rule out that the deficit might be rooted in minor low-level issues like natural-language or number parsing. In conventional LLMs, we observe strong overconfidence in wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. Many models showing the deficit also collapse to close to 0 accuracy on AIW problems, while still exhibiting high scores on various standardized benchmarks. We show how this illusion of strong function might be caused by leakage of test sets into training. For reasoning models, while observing clearly improved performance compared to LLMs, we still see strong fluctuations on problem variations that keep structure and difficulty unchanged. Our observations suggest that current LLMs and LRMs possess generalization deficits that can be detected by controlled, structure- and difficulty-preserving variations of simple problems, in contrast to standardized benchmarks, which contain problems of higher difficulty but fail to detect such clear deficits. Code for reproducing the experiments in the paper and the raw experimental data can be found at https://github.com/LAION-AI/AIW
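
The deficit is probed with structure- and difficulty-preserving variations of one short problem template. A minimal sketch of that evaluation pattern is below; the template wording, variable ranges, and answer rule are assumptions for illustration and may differ from the exact AIW formulation used in the paper.

```python
# Sketch: generating structure- and difficulty-preserving variations of a
# short common-sense problem (hypothetical template in the spirit of AIW;
# the exact wording used by the authors may differ).
import itertools

TEMPLATE = ("Alice has {n} brothers and she also has {m} sisters. "
            "How many sisters does Alice's brother have?")

def variations(n_values, m_values):
    """Yield (prompt, expected_answer) pairs; only the numbers change,
    never the problem structure or its difficulty."""
    for n, m in itertools.product(n_values, m_values):
        # Under this template, each brother has Alice's sisters plus Alice herself.
        yield TEMPLATE.format(n=n, m=m), m + 1

for prompt, answer in variations([2, 3, 4], [1, 2, 3]):
    print(answer, "|", prompt)
```

A model with robust generalization should score the same on every such variation; strong accuracy fluctuations across these prompts are the deficit the paper reports.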

TMLR 2026 · Journal Article

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

  • Huu Nguyen
  • Victor May
  • Harsh Raj
  • Marianna Nezhurina
  • Yishan Wang
  • Yanqi Luo
  • Vu Minh Chien
  • Taishi Nakamura

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals that are typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform models trained on other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameter/300B-token setting they match FineWeb-Edu and approach DCLM, demonstrating that the large fraction of reasoning and instruction data does not come at the cost of general-purpose language understanding. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens outperforms all strong non-permissive reference datasets and matches or exceeds smolLM2-Instruct, a strong 1.7B instruction-tuned baseline, on GSM8K, HumanEval, and MBPP, despite using over 36$\times$ fewer tokens (300B vs. $\approx$11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. The dataset, source code for reproducing the experiments, and pre-trained models are available at https://github.com/ontocord/mixturevitae.
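
The three-tier risk scheme and shard-level provenance metadata imply a simple risk-aware selection step before training. A minimal sketch follows; the metadata field names ("tier", "shard_id") and file layout are hypothetical and should be checked against the released MixtureVitae metadata.

```python
# Sketch: risk-aware shard selection from provenance metadata.
# Field names below ("tier", "shard_id") are illustrative assumptions,
# not the actual MixtureVitae schema.
import json

ALLOWED_TIERS = {1, 2}  # e.g., keep the lower-risk tiers, exclude the highest-risk one

def select_shards(metadata_path: str) -> list[str]:
    """Return shard ids whose provenance tier is in the allowed set."""
    kept = []
    with open(metadata_path) as f:
        for line in f:                      # assume one JSON record per shard
            record = json.loads(line)
            if record.get("tier") in ALLOWED_TIERS:
                kept.append(record["shard_id"])
    return kept
```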

ICLR 2025 · Conference Paper

Language models scale reliably with over-training and on downstream tasks

  • Samir Yitzhak Gadre
  • Georgios Smyrnis
  • Vaishaal Shankar
  • Suchin Gururangan
  • Mitchell Wortsman
  • Rulin Shao
  • Jean Mercat
  • Alex Fang

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20$\times$ less compute.
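
The second step (mapping loss to downstream accuracy) can be illustrated by fitting a simple parametric curve to small-scale measurements and extrapolating. The functional form and numbers below are illustrative assumptions, not the paper's fitted law.

```python
# Sketch: fitting a power-law-style relation between validation loss and
# average downstream top-1 error.  Form and data points are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def err_vs_loss(loss, a, b, c):
    # error ~ a * loss**b + c, with c acting as an irreducible-error floor
    return a * np.power(loss, b) + c

# hypothetical (loss, average top-1 error) pairs from cheap small-scale runs
loss = np.array([3.4, 3.1, 2.9, 2.7, 2.6])
err = np.array([0.72, 0.66, 0.61, 0.57, 0.55])

params, _ = curve_fit(err_vs_loss, loss, err, p0=[0.1, 2.0, 0.3], maxfev=20000)
print("predicted average top-1 error at loss 2.4:", err_vs_loss(2.4, *params))
```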

NeurIPS 2025 · Conference Paper

Learning in Compact Spaces with Approximately Normalized Transformer

  • Jörg Franke
  • Urs Spiegelhalter
  • Marianna Nezhurina
  • Jenia Jitsev
  • Frank Hutter
  • Michael Hefenbrock

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
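
The "approximate normalization via simple scalar multiplications" rests on the fact that the norm of a d-dimensional random vector with unit-variance entries concentrates tightly around sqrt(d). A minimal sketch of that motivation is below; it illustrates the scalar-normalization idea only and is not the architecture proposed in the paper.

```python
# Sketch: normalizing by a single scalar instead of per-vector norms, exploiting
# norm concentration of high-dimensional random vectors.  Illustration of the
# motivation only; not the paper's proposed architecture.
import torch

d = 4096
x = torch.randn(8, d)                       # batch of roughly unit-variance activations

exact = x / x.norm(dim=-1, keepdim=True)    # strict per-vector normalization
approx = x / d ** 0.5                       # one scalar multiplication for everything

print(exact.norm(dim=-1))                   # exactly 1.0 for every vector
print(approx.norm(dim=-1))                  # close to 1.0, with only small spread
```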

NeurIPS 2025 · Conference Paper

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

  • Marianna Nezhurina
  • Tomer Porian
  • Giovanni Puccetti
  • Tommie Kerssies
  • Romain Beaumont
  • Mehdi Cherti
  • Jenia Jitsev

In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. Taking language-vision learning as an example, we show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. Full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either a contrastive-only loss or a combined contrastive and captioning text-generative loss. For the first time, we use the derived scaling laws to compare both models and three open datasets, DataComp-1.4B, Re-LAION-1.4B and DFN-1.4B, while ensuring sufficient prediction accuracy on held-out points. From the comparison, we obtain evidence for (i) MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP and (ii) DFN-1.4B outperforming the other open datasets. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation), observing consistently the same scaling trends for models and datasets across tasks. We show that the comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison on an aligned common compute axis across a large scale span, avoiding misleading conclusions based on measurements from only a few isolated reference scales. This paves the road for guided collective improvement of open foundation models and training datasets, as scaling-law-based comparisons from various studies executed in a common frame can be combined to identify overall better procedures. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing the experiments in the paper and raw experiment data can be found at https://github.com/LAION-AI/scaling-laws-for-comparison.
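
Comparison via scaling laws boils down to fitting the same parametric curve to dense measurements from each model/dataset combination and comparing the fitted curves on a shared compute axis. A minimal sketch of that procedure, with an assumed saturating power law and made-up measurements (the paper's functional form, units, and data differ):

```python
# Sketch: comparing two pre-training setups by fitting the same scaling law
# err(C) = a * C**(-b) + c and extrapolating to a common target compute.
# All numbers are illustrative, not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    return a * np.power(compute, -b) + c

compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])        # compute, arbitrary units
err_setup_a = np.array([0.62, 0.55, 0.48, 0.43, 0.39])   # e.g., model/dataset combo A
err_setup_b = np.array([0.60, 0.52, 0.45, 0.40, 0.36])   # e.g., model/dataset combo B

for name, err in [("setup A", err_setup_a), ("setup B", err_setup_b)]:
    p, _ = curve_fit(scaling_law, compute, err, p0=[0.4, 0.2, 0.2], maxfev=20000)
    print(name, "predicted error at 10x larger compute:", scaling_law(1000.0, *p))
```

The setup whose fitted curve extrapolates to lower error at the shared target compute is the one preferred, which is the kind of decision the abstract describes.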

NeurIPS 2024 · Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

  • Jeffrey Li
  • Alex Fang
  • Georgios Smyrnis
  • Maor Ivgi
  • Matt Jordan
  • Samir Gadre
  • Hritik Bansal
  • Etash Guha

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the DCLM benchmark, framework, models, and datasets at https://www.datacomp.ai/dclm/
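
"Model-based filtering" here typically means scoring every document with a lightweight quality classifier and keeping only the highest-scoring fraction. A minimal sketch of that pattern follows; the scoring function is a toy stand-in, not the actual DCLM-Baseline classifier.

```python
# Sketch: model-based quality filtering of a web corpus.  `quality_score` is a
# toy stand-in for a trained classifier (DCLM-Baseline uses a fastText-style
# model); the heuristic and keep fraction below are illustrative assumptions.
def quality_score(text: str) -> float:
    # Stand-in: a real pipeline would return the classifier's probability
    # that the document is high quality.
    words = text.split()
    return min(len(words), 500) / 500.0

def filter_corpus(documents: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Keep the top `keep_fraction` of documents by score."""
    ranked = sorted(documents, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```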

NeurIPS 2024 · Conference Paper

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

  • Tomer Porian
  • Mitchell Wortsman
  • Jenia Jitsev
  • Ludwig Schmidt
  • Yair Carmon

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
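
The secondary result reads directly as an optimizer-tuning practice: treat AdamW's $\beta_2$ (and the learning rate) as scale-dependent hyperparameters rather than fixed defaults. A minimal sketch of such a sweep; the model, grids, and candidate values are illustrative only and encode no recommendation from the paper.

```python
# Sketch: sweeping AdamW's beta2 together with the learning rate instead of
# fixing them, since the abstract reports beta2 tuning matters at lower batch
# sizes.  Model and candidate values are illustrative assumptions.
import itertools
import torch

model = torch.nn.Linear(512, 512)          # stand-in for a language model

lr_grid = [3e-4, 1e-3]
beta2_grid = [0.95, 0.99, 0.999]           # candidates to sweep, not recommendations

for lr, beta2 in itertools.product(lr_grid, beta2_grid):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, beta2))
    # ... train briefly at the batch size of interest and record validation loss ...
```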

NeurIPS 2023 · Conference Paper

DataComp: In search of the next generation of multimodal datasets

  • Samir Yitzhak Gadre
  • Gabriel Ilharco
  • Alex Fang
  • Jonathan Hayase
  • Georgios Smyrnis
  • Thao Nguyen
  • Ryan Marten
  • Mitchell Wortsman

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
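
A common baseline in this benchmark setting is filtering candidate pairs by CLIP image-text similarity. A minimal sketch of that pattern with open_clip; the model choice, pretrained tag, and threshold are assumptions for illustration and are not the DataComp-1B curation procedure.

```python
# Sketch: CLIP-score filtering of image-text pairs, a typical baseline in this
# setting.  Model name, pretrained tag, and threshold are illustrative choices,
# not the DataComp-1B recipe.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    tokens = tokenizer([caption])
    img = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    txt = torch.nn.functional.normalize(model.encode_text(tokens), dim=-1)
    return (img @ txt.T).item()

def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    # Keep pairs whose cosine similarity exceeds the (assumed) threshold.
    return clip_score(image_path, caption) > threshold
```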

NeurIPS 2022 · Conference Paper

LAION-5B: An open large-scale dataset for training next generation image-text models

  • Christoph Schuhmann
  • Romain Beaumont
  • Richard Vencu
  • Cade Gordon
  • Ross Wightman
  • Mehdi Cherti
  • Theo Coombes
  • Aarush Katta

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive, accurate labels used in standard unimodal supervised vision learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest-neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content.
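
The nearest-neighbor indices mentioned at the end enable exploration and subset generation by similarity search over CLIP embeddings. A minimal sketch of that idea with FAISS; it illustrates the indexing pattern only and is not the actual LAION-5B tooling.

```python
# Sketch: similarity search over CLIP embeddings with a flat FAISS index,
# the kind of nearest-neighbor lookup used for dataset exploration and
# subset generation.  Embedding dimension and data are illustrative.
import numpy as np
import faiss

d = 512                                          # e.g., CLIP ViT-B/32 embedding size
embeddings = np.random.randn(100_000, d).astype("float32")
faiss.normalize_L2(embeddings)                   # cosine similarity via inner product

index = faiss.IndexFlatIP(d)
index.add(embeddings)

query = np.random.randn(1, d).astype("float32")  # would be a CLIP text/image embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)            # ids of the 10 most similar pairs
print(ids[0], scores[0])
```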