Arrow Research search

Author name cluster

Alex Fang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

NeurIPS Conference 2025 Conference Paper

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

  • Alex Fang
  • Hadi Pouransari
  • Matt Jordan
  • Alexander Toshev
  • Vaishaal Shankar
  • Ludwig Schmidt
  • Tom Gunter

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.
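
The document-level manipulation described above can be pictured as assigning each document a repeat count under a fixed token budget. A minimal sketch, assuming a per-document quality score and a simple greedy repetition policy (illustrative placeholders, not the paper's actual procedure):

```python
# Illustrative sketch: build a token-budgeted training set by repeating
# higher-quality documents more often. The quality field and the greedy
# policy are assumptions for illustration, not the paper's method.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    num_tokens: int
    quality: float  # e.g., a filtering-model score in [0, 1]

def build_repeated_dataset(docs, token_budget, max_repeats=10):
    """Greedily fill the budget, giving higher-quality documents more repeats."""
    ranked = sorted(docs, key=lambda d: d.quality, reverse=True)
    plan, used = [], 0
    for d in ranked:
        # Scale repeats with quality, capped at max_repeats.
        repeats = max(1, min(max_repeats, round(d.quality * max_repeats)))
        for _ in range(repeats):
            if used + d.num_tokens > token_budget:
                return plan
            plan.append(d.doc_id)
            used += d.num_tokens
    return plan

docs = [Doc("a", 1000, 0.9), Doc("b", 1000, 0.4), Doc("c", 500, 0.7)]
print(build_repeated_dataset(docs, token_budget=5000))
```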

ICLR Conference 2025 Conference Paper

Language models scale reliably with over-training and on downstream tasks

  • Samir Yitzhak Gadre
  • Georgios Smyrnis
  • Vaishaal Shankar
  • Suchin Gururangan
  • Mitchell Wortsman
  • Rulin Shao
  • Jean Mercat
  • Alex Fang

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute.
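
The loss-to-downstream mapping can be illustrated with a small curve fit. A minimal sketch using scipy, with a saturating exponential form and synthetic measurements chosen for illustration rather than taken from the paper:

```python
# Illustrative sketch: fit a smooth mapping from validation loss to average
# top-1 downstream error, in the spirit of the paper's second contribution.
# The functional form and the data points are assumptions for illustration.
import numpy as np
from scipy.optimize import curve_fit

def downstream_error(loss, eps, k, gamma):
    # Error rises with loss and saturates at eps for large loss.
    return eps - k * np.exp(-gamma * loss)

# Hypothetical small-scale measurements: (validation loss, avg top-1 error).
losses = np.array([2.2, 2.4, 2.6, 2.9, 3.3, 3.8])
errors = np.array([0.52, 0.56, 0.60, 0.64, 0.68, 0.71])

params, _ = curve_fit(downstream_error, losses, errors, p0=[0.8, 1.0, 1.0])

# Extrapolate to the predicted validation loss of a (hypothetical) larger run.
print("predicted avg top-1 error at loss 2.0:", downstream_error(2.0, *params))
```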

NeurIPS Conference 2024 Conference Paper

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

  • Yiping Wang
  • Yifang Chen
  • Wendan Yan
  • Alex Fang
  • Wenjing Zhou
  • Kevin Jamieson
  • Simon S. Du

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a method inspired by the CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement. Second, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark DataComp [Gadre et al., 2023]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN [Fang et al., 2023] and HYPE [Kim et al., 2024], we can boost average performance on downstream tasks by 0.9%, achieving a new state-of-the-art on the DataComp-medium benchmark.
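
Both metrics can be pictured as operations on precomputed embeddings. A minimal sketch, assuming L2-normalized CLIP image and text embeddings; the batch-normalization term and the top-k target similarity are simplified stand-ins for negCLIPLoss and NormSim, not their exact definitions:

```python
# Illustrative sketch of CLIPScore-style selection with an in-batch
# contrastive normalization term and a target-similarity metric.
import numpy as np

def clip_scores(img_emb, txt_emb):
    """Per-sample cosine alignment between paired image/text embeddings."""
    return np.sum(img_emb * txt_emb, axis=1)

def batch_normalized_scores(img_emb, txt_emb, temperature=0.07):
    """Paired similarity minus log-sum-exp terms over in-batch contrastive
    pairs, loosely mirroring a training-loss-based quality score."""
    logits = img_emb @ txt_emb.T / temperature       # (B, B) similarity matrix
    pos = np.diag(logits)
    norm_i2t = np.log(np.exp(logits).sum(axis=1))    # image-to-text partition
    norm_t2i = np.log(np.exp(logits).sum(axis=0))    # text-to-image partition
    return pos - 0.5 * (norm_i2t + norm_t2i)

def target_similarity(img_emb, target_emb, k=5):
    """Mean of the top-k cosine similarities to a known target/downstream set."""
    sims = img_emb @ target_emb.T
    return np.sort(sims, axis=1)[:, -k:].mean(axis=1)

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
img, txt, tgt = (unit(rng.normal(size=(8, 16))) for _ in range(3))
keep = np.argsort(batch_normalized_scores(img, txt))[-4:]   # keep top half
print(keep, target_similarity(img, tgt)[keep])
```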

ICLR Conference 2024 Conference Paper

Data Filtering Networks

  • Alex Fang
  • Albin Madappally Jose
  • Amit Jain
  • Ludwig Schmidt
  • Alexander Toshev
  • Vaishaal Shankar

Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a *data filtering network* (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best-performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, outperforming larger models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI’s WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high-performance data filtering networks can be trained from scratch using only publicly available data.
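
Applying a trained filtering network reduces to scoring candidate image-text pairs and keeping the highest-scoring fraction. A minimal sketch using the open_clip API as a stand-in scorer; the model name, keep fraction, and data handling are placeholders rather than the released DFN configuration:

```python
# Illustrative sketch: score image-text pairs with a filtering model and keep
# the highest-scoring fraction of the candidate pool.
import torch
import open_clip

# Stand-in for a trained filtering network; not the released DFN weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def filter_score(images, captions):
    """Cosine similarity between image and caption embeddings, one per pair."""
    img = model.encode_image(torch.stack([preprocess(im) for im in images]))
    txt = model.encode_text(tokenizer(captions))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)

def filter_pool(images, captions, keep_fraction=0.15):
    """Return indices of the pairs whose score is in the top keep_fraction."""
    scores = filter_score(images, captions)
    cutoff = torch.quantile(scores, 1 - keep_fraction)
    return [i for i, s in enumerate(scores) if s >= cutoff]
```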

NeurIPS Conference 2024 Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

  • Jeffrey Li
  • Alex Fang
  • Georgios Smyrnis
  • Maor Ivgi
  • Matt Jordan
  • Samir Gadre
  • Hritik Bansal
  • Etash Guha

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the DCLM benchmark, framework, models, and datasets at https://www.datacomp.ai/dclm/
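
Model-based filtering for text follows the same pattern: score each document with a lightweight quality classifier and keep the top fraction. A minimal sketch using fastText as an example classifier; the model file, label name, and keep fraction are placeholders, not the DCLM-Baseline recipe:

```python
# Illustrative sketch: keep web documents whose quality-classifier score is in
# the top fraction. "quality_model.bin" and "__label__hq" are placeholders for
# a trained fastText classifier and its high-quality label.
import fasttext

model = fasttext.load_model("quality_model.bin")

def quality_score(text):
    """Probability the classifier assigns to the high-quality label."""
    # fastText expects single-line input, so strip newlines before predicting.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

def filter_documents(docs, keep_fraction=0.1):
    """Return the top keep_fraction of documents by quality score."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(docs) * keep_fraction))]
```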

NeurIPS Conference 2023 Conference Paper

DataComp: In search of the next generation of multimodal datasets

  • Samir Yitzhak Gadre
  • Gabriel Ilharco
  • Alex Fang
  • Jonathan Hayase
  • Georgios Smyrnis
  • Thao Nguyen
  • Ryan Marten
  • Mitchell Wortsman

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.

NeurIPS Conference 2023 Conference Paper

Does progress on ImageNet transfer to real-world datasets?

  • Alex Fang
  • Simon Kornblith
  • Ludwig Schmidt

Does progress on ImageNet transfer to real-world datasets? We investigate this question by evaluating ImageNet pre-trained models with varying accuracy (57% to 83%) on six practical image classification datasets. In particular, we study datasets collected with the goal of solving real-world tasks (e.g., classifying images from camera traps or satellites), as opposed to web-scraped benchmarks collected for comparing models. On multiple datasets, models with higher ImageNet accuracy do not consistently yield performance improvements. For certain tasks, interventions such as data augmentation improve performance even when architectures do not. We hope that future benchmarks will include more diverse datasets to encourage a more comprehensive approach to improving learning algorithms.

NeurIPS Conference 2023 Conference Paper

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

  • Wanrong Zhu
  • Jack Hessel
  • Anas Awadalla
  • Samir Yitzhak Gadre
  • Jesse Dodge
  • Alex Fang
  • Youngjae Yu
  • Ludwig Schmidt

In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, and technology. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, and similar content, the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.
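
The interleaving step can be framed as a linear assignment problem over a CLIP similarity matrix. A minimal sketch, assuming precomputed, L2-normalized CLIP embeddings for a document's sentences and candidate images; the real pipeline operates at the document level with additional constraints:

```python
# Illustrative sketch: assign each image in a document to a sentence by
# maximizing total CLIP similarity via linear assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_images_to_sentences(image_emb, sentence_emb):
    """Return (image_index, sentence_index) pairs maximizing total similarity."""
    sim = image_emb @ sentence_emb.T              # (num_images, num_sentences)
    rows, cols = linear_sum_assignment(-sim)      # negate: the solver minimizes cost
    return list(zip(rows.tolist(), cols.tolist()))

# Toy stand-ins for precomputed CLIP embeddings.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(3, 512)); sents = rng.normal(size=(10, 512))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
print(assign_images_to_sentences(imgs, sents))
```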

NeurIPS Conference 2023 Conference Paper

Neural Priming for Sample-Efficient Adaptation

  • Matthew Wallingford
  • Vivek Ramanujan
  • Alex Fang
  • Aditya Kusupati
  • Roozbeh Mottaghi
  • Aniruddha Kembhavi
  • Ludwig Schmidt
  • Ali Farhadi

We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall relevant data seen throughout pretraining and condition its parameters on it, thereby priming it for the test distribution. Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in accuracy on ImageNet and a 3.81% accuracy improvement on average across standard transfer learning benchmarks. Further, using our test time inference scheme, we see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the effectiveness of Neural Priming in addressing the common challenge of limited labeled data and changing distributions. Code and models are open-sourced at https://www.github.com/RAIVNLab/neural-priming.
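
The recall-then-update loop can be pictured as nearest-neighbor retrieval over an index of pretraining embeddings followed by a lightweight parameter update. A minimal sketch, assuming a FAISS index of pretraining image embeddings and a linear probe as the update; this is a simplification, not the paper's exact procedure:

```python
# Illustrative sketch: retrieve pretraining examples near the class-name text
# embeddings, then fit a lightweight head on the retrieved subset.
import numpy as np
import faiss
from sklearn.linear_model import LogisticRegression

def prime(class_text_emb, index, pretrain_img_emb, k=100):
    """Retrieve the k nearest pretraining embeddings for each class embedding."""
    _, idx = index.search(class_text_emb.astype(np.float32), k)
    images = pretrain_img_emb[idx.reshape(-1)]
    labels = np.repeat(np.arange(len(class_text_emb)), k)
    return images, labels

# Hypothetical data: 2 classes and 1,000 pretraining image embeddings (dim 64).
rng = np.random.default_rng(0)
pretrain = rng.normal(size=(1000, 64)).astype(np.float32)
index = faiss.IndexFlatIP(64)        # inner-product index (cosine if normalized)
index.add(pretrain)
class_emb = rng.normal(size=(2, 64))

images, labels = prime(class_emb, index, pretrain, k=50)
head = LogisticRegression(max_iter=1000).fit(images, labels)  # lightweight update
```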

ICLR Conference 2023 Conference Paper

Neural Radiance Field Codebooks

  • Matthew Wallingford
  • Aditya Kusupati
  • Alex Fang
  • Vivek Ramanujan
  • Aniruddha Kembhavi
  • Roozbeh Mottaghi
  • Ali Farhadi

Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks. Learning such representations for complex scenes and tasks remains an open challenge. Towards this goal, we introduce Neural Radiance Field Codebooks (NRC), a scalable method for learning object-centric representations through novel view reconstruction. NRC learns to reconstruct scenes from novel views using a dictionary of object codes which are decoded through a volumetric renderer. This enables the discovery of reoccurring visual and geometric patterns across scenes which are transferable to downstream tasks. We show that NRC representations transfer well to object navigation in THOR, outperforming 2D and 3D representation learning methods by 3.1% success rate. We demonstrate that our approach is able to perform unsupervised segmentation for more complex synthetic (THOR) and real scenes (NYU Depth) better than prior methods (0.101 ARI). Finally, we show that NRC improves on the task of depth ordering by 5.5% accuracy in THOR.

ICML Conference 2022 Conference Paper

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

  • Alex Fang
  • Gabriel Ilharco
  • Mitchell Wortsman
  • Yuhao Wan
  • Vaishaal Shankar
  • Achal Dave
  • Ludwig Schmidt

Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.

ICML Conference 2020 Conference Paper

Evaluating Machine Accuracy on ImageNet

  • Vaishaal Shankar
  • Rebecca Roelofs
  • Horia Mania
  • Alex Fang
  • Benjamin Recht
  • Ludwig Schmidt

We evaluate a wide range of ImageNet models with five trained human labelers. In our year-long experiment, trained humans first annotated 40,000 images from the ImageNet and ImageNetV2 test sets with multi-class labels to enable a semantically coherent evaluation. Then we measured the classification accuracy of the five trained humans on the full task with 1,000 classes. Only the latest models from 2020 are on par with our best human labeler, and human accuracy on the 590 object classes is still 4% and 10% higher than the best model on ImageNet and ImageNetV2, respectively. Moreover, humans achieve the same accuracy on ImageNet and ImageNetV2, while all models see a consistent accuracy drop. Overall, our results show that there is still substantial room for improvement on ImageNet and that direct accuracy comparisons between humans and machines may overstate machine performance.

ICML Conference 2020 Conference Paper

Neural Kernels Without Tangents

  • Vaishaal Shankar
  • Alex Fang
  • Wenshuo Guo
  • Sara Fridovich-Keil
  • Jonathan Ragan-Kelley
  • Ludwig Schmidt
  • Benjamin Recht

We investigate the connections between neural networks and simple building blocks in kernel space. In particular, using well established feature space tools such as direct sum, averaging, and moment lifting, we present an algebra for creating “compositional” kernels from bags of features. We show that these operations correspond to many of the building blocks of “neural tangent kernels (NTK)”. Experimentally, we show that there is a correlation in test error between neural network architectures and the associated kernels. We construct a simple neural network architecture using only 3x3 convolutions, 2x2 average pooling, ReLU, and optimized with SGD and MSE loss that achieves 96% accuracy on CIFAR10, and whose corresponding compositional kernel achieves 90% accuracy. We also use our constructions to investigate the relative performance of neural networks, NTKs, and compositional kernels in the small dataset regime. In particular, we find that compositional kernels outperform NTKs and neural networks outperform both kernel methods.
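
The kernel algebra described here maps onto simple operations on Gram matrices: concatenating feature spaces adds kernels, and averaging a feature map over a bag of elements (such as image patches) averages kernel values over all pairs of elements. A minimal sketch of these two operations on toy data; it illustrates the algebra rather than reproducing the paper's specific kernels:

```python
# Illustrative sketch of compositional-kernel building blocks.
# Direct sum of feature maps [phi1(x), phi2(x)] -> kernel k1 + k2.
# Averaging a feature map over a bag -> mean kernel value over element pairs.
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def rbf_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def direct_sum(k1, k2):
    """Gram matrix of the concatenated feature map: elementwise sum."""
    return k1 + k2

def bag_average_kernel(base, bags_x, bags_y):
    """Gram matrix of the bag-averaged feature map: mean over element pairs."""
    K = np.zeros((len(bags_x), len(bags_y)))
    for i, bx in enumerate(bags_x):
        for j, by in enumerate(bags_y):
            K[i, j] = base(bx, by).mean()
    return K

# Toy "images" as bags of four 2-D patches each.
rng = np.random.default_rng(0)
bags = [rng.normal(size=(4, 2)) for _ in range(5)]
K_avg = bag_average_kernel(rbf_kernel, bags, bags)
X = np.vstack([b.mean(axis=0) for b in bags])
K = direct_sum(linear_kernel(X, X), K_avg)       # compositional Gram matrix
print(K.shape)
```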