Arrow Research search

Author name cluster

Aaron Gokaslan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

ICLR 2025 Conference Paper

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

  • Marianne Arriola
  • Aaron Gokaslan
  • Justin T. Chiu
  • Zhihan Yang
  • Zhixuan Qi
  • Jiaqi Han
  • Subham Sekhar Sahoo
  • Volodymyr Kuleshov

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/

NeurIPS 2025 Conference Paper

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

  • Marianne Arriola
  • Yair Schiff
  • Hao Phung
  • Aaron Gokaslan
  • Volodymyr Kuleshov

Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network at every denoising step and incur high computational cost. Our key insight is that discrete diffusion models perform two types of computation: 1) representing clean tokens and 2) denoising corrupted tokens, which enables us to use separate modules for each task. We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks.

NeurIPS 2025 Conference Paper

Position: Benchmarking is Broken - Don't Let AI be Its Own Judge

  • Zerui Cheng
  • Stella Wohnig
  • Ruchika Gupta
  • Samiul Alam
  • Tassallah Abdullahi
  • João Alves Ribeiro
  • Christian Nielsen-Garcia
  • Saif Mir

The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that a laissez-faire approach is untenable. For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework—robust by construction rather than reliant on courtesy or goodwill. Accordingly, we dissect the systemic flaws undermining today’s evaluation ecosystem and distill the essential requirements for next-generation assessments. To concretize this position, we introduce the idea of PeerBench, a community-governed, proctored evaluation blueprint that seeks to improve security and credibility through sealed execution, item banking with rolling renewal, and delayed transparency. PeerBench is presented as a complementary, certificate-grade layer alongside open benchmarks, not a replacement. We discuss trade-offs and limits and call for further research on mechanism design, governance, and reliability guarantees. Our goal is to lay the groundwork for evaluations that restore integrity and deliver genuinely trustworthy measures of AI progress.

NeurIPS 2025 Conference Paper

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

  • Nikhil Kandpal
  • Brian Lester
  • Colin Raffel
  • Sebastian Majstorovic
  • Stella Biderman
  • Baber Abbasi
  • Luca Soldaini
  • Enrico Shippole

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

ICML 2025 Conference Paper

The Diffusion Duality

  • Subham Sekhar Sahoo
  • Justin Deschenaux
  • Aaron Gokaslan
  • Guanghan Wang
  • Justin T. Chiu
  • Volodymyr Kuleshov

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: https://s-sahoo.github.io/duo

ICML 2024 Conference Paper

Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

  • Yair Schiff
  • Chia-Hsiang Kao
  • Aaron Gokaslan
  • Tri Dao
  • Albert Gu
  • Volodymyr Kuleshov

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance. Code to reproduce our experiments is available here: https://github.com/kuleshov-group/caduceus.
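Reverse complementarity, the DNA symmetry the abstract refers to, is easy to state concretely: RC equivariance means a model f satisfies f(RC(x)) = RC(f(x)). The sketch below (an illustration of the RC operation itself, not code from the paper) shows why the symmetry matters: the same genomic content can arrive as either strand of the double helix.

```python
def reverse_complement(seq: str) -> str:
    """Reverse a DNA sequence and complement each base (A<->T, C<->G)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(seq))

# RC is an involution: applying it twice recovers the original strand.
assert reverse_complement(reverse_complement("ACGTTG")) == "ACGTTG"
print(reverse_complement("ACGTTG"))  # -> CAACGT
```

An RC-equivariant architecture bakes this symmetry into the learned features, so predictions do not depend on which strand a sequence is read from.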

NeurIPS 2024 Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

  • Jeffrey Li
  • Alex Fang
  • Georgios Smyrnis
  • Maor Ivgi
  • Matt Jordan
  • Samir Gadre
  • Hritik Bansal
  • Etash Guha

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the DCLM benchmark, framework, models, and datasets at https://www.datacomp.ai/dclm/

NeurIPS 2024 Conference Paper

Diffusion Models With Learned Adaptive Noise

  • Subham S. Sahoo
  • Aaron Gokaslan
  • Chris De
  • Volodymyr Kuleshov

Diffusion models have gained traction as powerful algorithms for synthesizing high-quality images. Central to these algorithms is the diffusion process, a set of equations which maps data to noise in a way that can significantly affect performance. In this paper, we explore whether the diffusion process can be learned from data. Our work is grounded in Bayesian inference and seeks to improve log-likelihood estimation by casting the learned diffusion process as an approximate variational posterior that yields a tighter lower bound (ELBO) on the likelihood. A widely held assumption is that the ELBO is invariant to the noise process: our work dispels this assumption and proposes multivariate learned adaptive noise (MuLAN), a learned diffusion process that applies noise at different rates across an image. Our method consists of three components: a multivariate noise schedule, adaptive input-conditional diffusion, and auxiliary variables; these components ensure that the ELBO is no longer invariant to the choice of the noise schedule as in previous works. Empirically, MuLAN sets a new state-of-the-art in density estimation on CIFAR-10 and ImageNet while matching the performance of previous state-of-the-art models with 50% fewer steps. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/MuLAN

ICML 2024 Conference Paper

Position: Standardization of Behavioral Use Clauses is Necessary for the Adoption of Responsible Licensing of AI

  • Daniel McDuff
  • Tim Korjakow
  • Scott Cambo
  • Jesse Josua Benjamin
  • Jenny Lee
  • Yacine Jernite
  • Carlos Muñoz Ferrandis
  • Aaron Gokaslan

Growing concerns over negligent or malicious uses of AI have increased the appetite for tools that help manage the risks of the technology. In 2018, licenses with behavioral-use clauses (commonly referred to as Responsible AI Licenses) were proposed to give developers a framework for releasing AI assets while placing restrictions on users to mitigate negative applications. As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses. Notable models licensed with behavioral use clauses include BLOOM and LLaMA 2 (language), Stable Diffusion (image), and GRID (robotics). This paper explores why and how these licenses have been adopted, and why and how they have been adapted to fit particular use cases. We use a mixed-methods methodology of qualitative interviews, clustering of license clauses, and quantitative analysis of license adoption. Based on this evidence we take the position that responsible AI licenses need standardization to avoid confusing users or diluting their impact. At the same time, customization of behavioral restrictions is also appropriate in some contexts (e.g., medical domains). We advocate for “standardized customization” that can meet users’ needs and can be supported via tooling.

NeurIPS 2024 Conference Paper

Simple and Effective Masked Diffusion Language Models

  • Subham S. Sahoo
  • Marianne Arriola
  • Yair Schiff
  • Aaron Gokaslan
  • Edgar Marroquin
  • Justin T. Chiu
  • Alexander Rush
  • Volodymyr Kuleshov

While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form: it is a mixture of classical masked language modeling losses, and it can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/mdlm
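The "mixture of classical masked language modeling losses" can be illustrated with a minimal NumPy sketch: a standard cross-entropy restricted to masked positions, averaged over several masking rates. The helper names and the fixed rates below are hypothetical illustrations of the general idea, not the paper's exact Rao-Blackwellized objective.

```python
import numpy as np

def masked_ce(logits, targets, mask):
    """Cross-entropy over the vocabulary, counted only at masked positions."""
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]            # per-token NLL
    return (nll * mask).sum() / max(mask.sum(), 1)

def mixture_of_masked_losses(logits, targets, rng, rates=(0.15, 0.5, 0.9)):
    """Toy stand-in for a mixture of masked-LM losses: average the masked
    cross-entropy over several random masking rates."""
    losses = []
    for r in rates:
        mask = (rng.random(len(targets)) < r).astype(float)  # mask ~ Bernoulli(r)
        losses.append(masked_ce(logits, targets, mask))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16))        # 8 token positions, vocabulary of 16
targets = rng.integers(0, 16, size=8)
loss = mixture_of_masked_losses(logits, targets, rng)
print(loss)
```

Because each term is an ordinary masked-LM cross-entropy, the mixture can be trained with the same machinery as a BERT-style encoder, which is the practical appeal the abstract highlights.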

NeurIPS 2024 Conference Paper

The GAN is dead; long live the GAN! A Modern GAN Baseline

  • Yiwen Huang
  • Aaron Gokaslan
  • Volodymyr Kuleshov
  • James Tompkin

There is a widespread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, this loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline, R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models. Code: https://www.github.com/brownvc/R3GAN
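A relativistic loss, the starting point of this recipe, scores real and fake samples against each other rather than in isolation: the generator tries to push fake critic scores above real ones, and the discriminator the reverse. The sketch below illustrates a generic relativistic pairing loss; it is an example of the loss family, not the paper's exact regularized R3GAN objective, and the function names are hypothetical.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def relativistic_g_loss(d_real, d_fake):
    """Generator loss: penalize real scores exceeding paired fake scores."""
    return float(softplus(d_real - d_fake).mean())

def relativistic_d_loss(d_real, d_fake):
    """Discriminator loss: penalize fake scores exceeding paired real scores."""
    return float(softplus(d_fake - d_real).mean())

# Toy critic scores where the discriminator currently rates real data higher:
d_real = np.array([2.0, 1.5, 3.0])
d_fake = np.array([-1.0, 0.0, -0.5])

# The discriminator is winning, so its loss is small and the generator's is large.
assert relativistic_d_loss(d_real, d_fake) < relativistic_g_loss(d_real, d_fake)
```

Because only the score difference matters, the discriminator cannot drive its loss to zero by inflating scores on real data alone, which is one intuition for why this family of losses resists mode dropping.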

ICML 2023 Conference Paper

InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models

  • Yingheng Wang
  • Yair Schiff
  • Aaron Gokaslan
  • Weishen Pan
  • Fei Wang 0001
  • Christopher De Sa
  • Volodymyr Kuleshov

While diffusion models excel at generating high-quality samples, their latent variables typically lack semantic meaning and are not suitable for representation learning. Here, we propose InfoDiffusion, an algorithm that augments diffusion models with low-dimensional latent variables that capture high-level factors of variation in the data. InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables, which improves latent space quality and prevents the latents from being ignored by expressive diffusion-based decoders. Empirically, we find that InfoDiffusion learns disentangled and human-interpretable latent representations that are competitive with state-of-the-art generative and contrastive methods, while retaining the high sample quality of diffusion models. Our method enables manipulating the attributes of generated images and has the potential to assist tasks that require exploring a learned latent space to generate quality samples, e.g., generative design.

NeurIPS 2022 Conference Paper

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

  • Hugo Laurençon
  • Lucile Saulnier
  • Thomas Wang
  • Christopher Akiki
  • Albert Villanova del Moral
  • Teven Le Scao
  • Leandro Von Werra
  • Chenghao Mou

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

NeurIPS 2021 Conference Paper

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

  • Andrew Szot
  • Alexander Clegg
  • Eric Undersander
  • Erik Wijmans
  • Yili Zhao
  • John Turner
  • Noah Maestre
  • Mustafa Mukadam

We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack – data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g., cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, stock groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from ‘hand-off problems’; and (3) SPA pipelines are more brittle than RL policies.

NeurIPS 2021 Conference Paper

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

  • Santhosh Kumar Ramakrishnan
  • Aaron Gokaslan
  • Erik Wijmans
  • Oleksandr Maksymets
  • Alexander Clegg
  • John Turner
  • Eric Undersander
  • Wojciech Galuba

We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4-3.7× larger than other building-scale datasets (MP3D, Gibson). When compared to existing photorealistic 3D datasets (Replica, MP3D, Gibson, ScanNet), rendered images from HM3D have 20-85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34-91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is ‘Pareto optimal’ in the following sense – agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on the Gibson-test dataset, suggesting that it might be time to retire that episode dataset. The HM3D dataset, analysis code, and pre-trained models are publicly released: https://aihabitat.org/datasets/hm3d/.

NeurIPS 2021 Conference Paper

TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis

  • Benjamin Attal
  • Eliot Laidlaw
  • Aaron Gokaslan
  • Changil Kim
  • Christian Richardt
  • James Tompkin
  • Matthew O'Toole

Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors now available on modern smartphones.