Arrow Research search

Author name cluster

Mayee Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

NeurIPS 2025 · Conference Paper

Weaver: Shrinking the Generation-Verification Gap by Scaling Compute for Verification

  • Jon Saad-Falcon
  • Estefany Kelly Buchanan
  • Mayee Chen
  • Tzu-Heng (Brian) Huang
  • Brendan McLaughlin
  • Tanvir Bhathal
  • Shang Zhu
  • Ben Athiwaratkun

Verifiers can improve language model (LM) capabilities by providing feedback or selecting the best response from a pool of generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean for formal proofs). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers. To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. First, we find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in the verifiers. To reduce the dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses several challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these challenges by using dataset statistics to normalize outputs and filter specific verifiers. We study the effectiveness of Weaver in repeated sampling settings, where a model generates multiple candidate responses at test time and a verifier is used to select the correct one. Our evaluations demonstrate that Weaver significantly improves pass@1 performance across several reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator and an ensemble of smaller judge and reward models as the verifiers (86.2% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training interventions. To make Weaver more efficient, we train a compact 400M cross-encoder using Weaver's combined output scores. This distilled model retains 98.7% of Weaver's full accuracy while reducing verification compute by up to 99.97%.
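The combination step the abstract describes, normalizing the scores of several imperfect verifiers and weighting them without labels, can be sketched roughly as follows. This is not the Weaver implementation: the weight estimate here (agreement with an unweighted majority vote) is a common weak-supervision stand-in, and all function names and values are illustrative.

```python
"""Sketch only: label-free weighted ensemble of weak verifiers."""
import numpy as np

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize one verifier's scores over the candidate pool."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def select_response(verifier_scores: np.ndarray) -> int:
    """verifier_scores: (n_verifiers, n_candidates) raw scores.

    Returns the index of the candidate with the highest combined score.
    """
    norm = np.stack([normalize(v) for v in verifier_scores])
    votes = (norm > 0.5).astype(float)            # binarize each verifier
    consensus = votes.mean(axis=0) > 0.5          # unweighted majority vote
    # Label-free weight proxy (an assumption, not Weaver's estimator):
    # how often each verifier agrees with the consensus.
    weights = (votes == consensus).mean(axis=1)
    weights = weights / weights.sum()
    combined = weights @ norm                     # weighted ensemble score
    return int(np.argmax(combined))

# Toy usage: 3 imperfect verifiers scoring 4 candidate responses.
scores = np.array([[0.2, 0.9, 0.4, 0.1],
                   [0.3, 0.8, 0.5, 0.2],
                   [0.9, 0.1, 0.2, 0.3]])  # a low-quality, disagreeing verifier
print(select_response(scores))             # -> 1
```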

NeurIPS 2024 · Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

  • Jeffrey Li
  • Alex Fang
  • Georgios Smyrnis
  • Maor Ivgi
  • Matt Jordan
  • Samir Gadre
  • Hritik Bansal
  • Etash Guha

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B-parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state of the art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the DCLM benchmark, framework, models, and datasets at https://www.datacomp.ai/dclm/
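The model-based filtering step the abstract highlights amounts to scoring raw documents with a learned quality classifier and keeping only the top-scoring fraction. The sketch below illustrates that pattern only; it is not the DCLM pipeline, and `quality_model` is a hypothetical stand-in for whatever classifier a participant trains.

```python
"""Sketch only: classifier-based quality filtering of a pretraining corpus."""
from typing import Callable, Iterable, List, Tuple

def filter_corpus(
    documents: Iterable[str],
    quality_model: Callable[[str], float],  # assumed: returns P(high quality)
    keep_fraction: float = 0.1,
) -> List[str]:
    """Keep the top `keep_fraction` of documents by predicted quality."""
    scored: List[Tuple[float, str]] = [(quality_model(d), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [doc for _, doc in scored[:n_keep]]

# Toy usage with a trivial stand-in scorer (longer documents score higher).
docs = ["short spam", "a longer, coherent paragraph about a real topic " * 3]
print(filter_corpus(docs, quality_model=lambda d: len(d) / 1000, keep_fraction=0.5))
```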

NeurIPS 2023 · Conference Paper

A case for reframing automated medical image classification as segmentation

  • Sarah Hooper
  • Mayee Chen
  • Khaled Saab
  • Kush Bhatia
  • Curtis Langlotz
  • Christopher Ré

Image classification and segmentation are common applications of deep learning to radiology. While many tasks can be framed using either classification or segmentation, classification has historically been cheaper to label and more widely used. However, recent work has drastically reduced the cost of training segmentation networks. In light of this recent work, we reexamine the choice of training classification vs. segmentation models. First, we use an information-theoretic approach to analyze why segmentation and classification models may achieve different performance on the same dataset and overarching task. We then implement multiple methods for using segmentation models to classify medical images, which we call segmentation-for-classification, and compare these methods against traditional classification on three retrospective datasets. We use our analysis and experiments to summarize the benefits of switching from classification to segmentation, including: improved sample efficiency, enabling improved performance with fewer labeled images (up to an order of magnitude fewer), on low-prevalence classes, and on certain rare subgroups (up to 161.1% improved recall); improved robustness to spurious correlations (up to 44.8% improved robust AUROC); and improved model interpretability, evaluation, and error analysis.
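One simple way to derive an image-level label from a segmentation model, in the spirit of the segmentation-for-classification methods the abstract mentions, is to threshold the predicted abnormal area. The sketch below is illustrative only and does not reproduce the paper's specific methods; `prob_mask` is assumed to come from an already-trained segmentation network.

```python
"""Sketch only: turning a segmentation output into a classification decision."""
import numpy as np

def classify_from_segmentation(
    prob_mask: np.ndarray,           # (H, W) per-pixel P(abnormal) from a seg model
    pixel_threshold: float = 0.5,
    min_area_fraction: float = 0.001,
) -> bool:
    """Return True (abnormal) if enough pixels are confidently segmented."""
    abnormal_area = (prob_mask > pixel_threshold).mean()
    return bool(abnormal_area >= min_area_fraction)

# Toy usage: a 64x64 probability map with a small high-confidence region.
mask = np.zeros((64, 64))
mask[10:14, 10:14] = 0.9
print(classify_from_segmentation(mask))  # True: 16/4096 pixels exceed the threshold
```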

NeurIPS 2023 · Conference Paper

Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification

  • Neel Guha
  • Mayee Chen
  • Kush Bhatia
  • Azalia Mirhoseini
  • Frederic Sala
  • Christopher Ré

Recent work has shown that language models' (LMs) prompt-based learning capabilities make them well suited for automating data labeling in domains where manual annotation is expensive. The challenge is that while writing an initial prompt is cheap, improving a prompt is costly: practitioners often require significant labeled data in order to evaluate the impact of prompt modifications. Our work asks whether it is possible to improve prompt-based learning without additional labeled data. We approach this problem by attempting to modify the predictions of a prompt, rather than the prompt itself. Our intuition is that accurate predictions should also be consistent: samples which are similar under some feature representation should receive the same prompt prediction. We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions, and uses the consistency between the LM predictions for neighboring samples to identify mispredictions. Embroid then uses these neighborhoods to create additional predictions for each sample, and combines these predictions with a simple latent variable graphical model in order to generate a final corrected prediction. In addition to providing a theoretical analysis of Embroid, we conduct a rigorous empirical evaluation across six different LMs and up to 95 different tasks. We find that (1) Embroid substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also realizes improvements for more sophisticated prompting strategies (e.g., chain-of-thought), and (3) can be specialized to domains like law through the embedding functions.
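The neighborhood-consistency idea can be sketched as follows. This is a simplified illustration rather than Embroid itself: where Embroid combines the neighborhood votes with a latent variable graphical model, the sketch below simply majority-votes the LM prediction against per-embedding neighbor votes, and embeddings and predictions are assumed to be precomputed.

```python
"""Sketch only: smoothing prompt predictions via embedding neighborhoods."""
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smooth_predictions(
    embeddings_list: list,     # each entry: (n_samples, dim) array, one per embedding fn
    lm_preds: np.ndarray,      # (n_samples,) binary LM prompt predictions
    k: int = 5,
) -> np.ndarray:
    votes = [lm_preds.astype(float)]
    for emb in embeddings_list:
        nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
        _, idx = nn.kneighbors(emb)                 # idx[:, 0] is the sample itself
        neighbor_majority = lm_preds[idx[:, 1:]].mean(axis=1) > 0.5
        votes.append(neighbor_majority.astype(float))
    # Final label: majority over the original prediction and per-embedding votes.
    return (np.mean(votes, axis=0) > 0.5).astype(int)

# Toy usage: two clusters; one point's LM prediction disagrees with its cluster.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (10, 8)), rng.normal(3, 0.1, (10, 8))])
preds = np.array([0] * 10 + [1] * 10)
preds[3] = 1                                        # a likely misprediction
print(smooth_predictions([emb], preds, k=5)[3])     # -> 0 (corrected by neighbors)
```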

NeurIPS 2023 · Conference Paper

Skill-it! A data-driven skills framework for understanding and training language models

  • Mayee Chen
  • Nicholas Roberts
  • Kush Bhatia
  • Jue Wang
  • Ce Zhang
  • Frederic Sala
  • Christopher Ré

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 37.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework to the RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
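The online sampling loop the abstract alludes to can be illustrated with a much-simplified sketch: periodically re-weight the skill mixture toward skills with higher current validation loss. This omits Skill-It's use of the learned skill graph, and `train_on_mixture` and `eval_losses` are hypothetical callbacks a training harness would supply.

```python
"""Sketch only: online re-weighting of a training mixture over skills."""
import numpy as np

def skill_mixture_training(
    n_skills: int,
    n_rounds: int,
    train_on_mixture,      # assumed callable(weights: np.ndarray) -> None
    eval_losses,           # assumed callable() -> np.ndarray of per-skill val losses
    temperature: float = 1.0,
) -> np.ndarray:
    weights = np.full(n_skills, 1.0 / n_skills)   # start from a uniform mixture
    for _ in range(n_rounds):
        train_on_mixture(weights)                 # sample one round of data and train
        losses = eval_losses()
        scaled = np.exp(losses / temperature)     # upweight skills that lag behind
        weights = scaled / scaled.sum()
    return weights

# Toy usage with a fake harness: skill 2 stays hard, so it ends up upweighted.
fake_losses = np.array([0.5, 0.4, 2.0])
final = skill_mixture_training(
    3, 5,
    train_on_mixture=lambda w: None,
    eval_losses=lambda: fake_losses,
)
print(final.round(2))   # skill 2 (the hardest) receives roughly 70% of the mixture
```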