Arrow Research · Search

Author name cluster

Bonnie Berger

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI 2026 · Conference Paper

Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports

  • Sidhika Balachandar
  • Shuvom Sadhuka
  • Bonnie Berger
  • Emma Pierson
  • Nikhil Garg

Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, e.g., predicting infrastructure problems. In this setting, government officials aim to identify in which neighborhoods incidents like potholes or rodents occur. The true state of incidents is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting. First, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. We show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or only rating data. Finally, we quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.
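
To make the multiview, multioutput idea concrete, here is a minimal sketch: one GNN trunk with two output heads, trained jointly on sparsely observed (unbiased) ratings and densely observed (biased) reports. All shapes, the toy graph, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: shared GNN trunk, one head per data view (ratings vs. reports).
import torch
import torch.nn as nn

class TwoHeadGNN(nn.Module):
    def __init__(self, in_dim, hid_dim, n_incident_types):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)
        self.rating_head = nn.Linear(hid_dim, n_incident_types)  # latent-state view
        self.report_head = nn.Linear(hid_dim, n_incident_types)  # biased-report view

    def forward(self, x, adj_norm):
        # Two rounds of mean-style message passing over the neighborhood graph.
        h = torch.relu(adj_norm @ self.lin1(x))
        h = torch.relu(adj_norm @ self.lin2(h))
        return self.rating_head(h), self.report_head(h)

# Toy data: 5 neighborhoods, random normalized adjacency and features.
n, d, k = 5, 8, 3
adj = torch.rand(n, n); adj_norm = adj / adj.sum(1, keepdim=True)
x = torch.randn(n, d)
ratings, reports = torch.randn(n, k), torch.randn(n, k)
rating_mask = torch.rand(n, k) < 0.5      # ratings observed only for some cells

model = TwoHeadGNN(d, 16, k)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    pred_rating, pred_report = model(x, adj_norm)
    loss = ((pred_rating - ratings)[rating_mask] ** 2).mean() \
         + 0.5 * ((pred_report - reports) ** 2).mean()   # weighting is arbitrary here
    opt.zero_grad(); loss.backward(); opt.step()
```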

FOCS 2025 · Conference Paper

Efficiently Batching Unambiguous Interactive Proofs

  • Bonnie Berger
  • Rohan Goyal
  • Matthew M. Hong
  • Yael Tauman Kalai

We show that if a language $\mathcal{L}$ admits a public-coin unambiguous interactive proof (UIP) with round complexity $\ell$, where $a$ bits are communicated per round, then the batch language $\mathcal{L}^{\otimes k}$, i.e., the set of $k$-tuples of statements all belonging to $\mathcal{L}$, has an unambiguous interactive proof with round complexity $\ell \cdot \operatorname{polylog}(k)$ and per-round communication of $a \cdot \ell \cdot \operatorname{polylog}(k) + \operatorname{poly}(\ell)$ bits, assuming the verifier in the UIP has depth bounded by $\operatorname{polylog}(k)$. Prior to this work, the best known batch UIP for $\mathcal{L}^{\otimes k}$ required communication complexity at least $(\operatorname{poly}(a) \cdot k^{\epsilon} + k) \cdot \ell^{1/\epsilon}$ for any arbitrarily small constant $\epsilon > 0$ (Reingold-Rothblum-Rothblum, STOC 2016). As a corollary of our result, we obtain a doubly efficient proof system, that is, a proof system whose proving overhead is polynomial in the time of the underlying computation, for any language computable in polynomial space and in time at most $n^{O\left(\sqrt{\log n / \log \log n}\right)}$. This expands the state of the art of doubly efficient proof systems: prior to our work, such systems were known for languages computable in polynomial space and in time $n^{(\log n)^{\delta}}$ for a small constant $\delta > 0$, significantly smaller than $1/2$ (Reingold-Rothblum-Rothblum, STOC 2016).
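
A side-by-side reading of the bounds quoted above, with all polylogarithmic factors in $k$ absorbed into a single $\operatorname{polylog}(k)$; the total-cost product in the last line is my arithmetic, not a statement from the paper.

```latex
% Prior work (RRR, STOC 2016), for any constant eps > 0:
\[
  \mathrm{CC}_{\mathrm{prior}}
    \;\ge\; \bigl(\operatorname{poly}(a)\cdot k^{\epsilon} + k\bigr)\cdot \ell^{1/\epsilon}.
\]
% This work: rounds and per-round bits, assuming a polylog(k)-depth verifier:
\[
  \text{rounds} = \ell\cdot\operatorname{polylog}(k),
  \qquad
  \text{bits per round} = a\cdot\ell\cdot\operatorname{polylog}(k) + \operatorname{poly}(\ell),
\]
% so multiplying rounds by bits per round gives a total that is only
% polylogarithmic in k, rather than polynomial in k:
\[
  \mathrm{CC}_{\mathrm{new}}
    \;=\; \bigl(a\cdot\ell^{2} + \ell\cdot\operatorname{poly}(\ell)\bigr)\cdot\operatorname{polylog}(k).
\]
```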

NeurIPS 2025 · Conference Paper

Evaluating multiple models using labeled and unlabeled data

  • Divya Shanmugam
  • Shuvom Sadhuka
  • Manish Raghavan
  • John Guttag
  • Bonnie Berger
  • Emma Pierson

It is difficult to evaluate machine learning classifiers without large labeled datasets, which are often unavailable. In contrast, unlabeled data is plentiful, but not easily used for evaluation. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. The key idea is to estimate the joint distribution of ground truth labels and classifier scores using a semi-supervised mixture model. The semi-supervised mixture model allows SSME to learn from three sources of information: unlabeled data, multiple classifiers, and probabilistic classifier scores. Once fit, the mixture model enables estimation of any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or AUC). We derive theoretical bounds on the error of these estimates, showing that estimation error decreases with the number of classifiers and the amount of unlabeled data. We present experiments in four domains where obtaining large labeled datasets is often impractical: healthcare, content moderation, molecular property prediction, and text classification. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best method.
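
A minimal sketch of the semi-supervised mixture idea for binary labels and m probabilistic classifiers: model the logit scores per class as diagonal Gaussians, run EM where labeled points have fixed responsibilities, then read a metric off the fitted posterior. The Gaussian family, EM details, and toy data are stand-in assumptions, not the paper's model.

```python
import numpy as np

def ssme_em(scores, labels, n_iter=50):
    """scores: (n, m) classifier probabilities; labels: (n,) in {0, 1, -1=unlabeled}."""
    z = np.log(scores / (1 - scores))            # logits
    resp = np.full(len(z), 0.5)                  # P(y=1 | data), init uniform
    resp[labels == 1], resp[labels == 0] = 1.0, 0.0
    for _ in range(n_iter):
        # M-step: class prior and per-class Gaussian parameters.
        pi = resp.mean()
        mu1 = (resp[:, None] * z).sum(0) / resp.sum()
        mu0 = ((1 - resp)[:, None] * z).sum(0) / (1 - resp).sum()
        var1 = (resp[:, None] * (z - mu1) ** 2).sum(0) / resp.sum() + 1e-6
        var0 = ((1 - resp)[:, None] * (z - mu0) ** 2).sum(0) / (1 - resp).sum() + 1e-6
        # E-step: update responsibilities for unlabeled points only.
        ll1 = -0.5 * (((z - mu1) ** 2 / var1) + np.log(var1)).sum(1) + np.log(pi)
        ll0 = -0.5 * (((z - mu0) ** 2 / var0) + np.log(var0)).sum(1) + np.log(1 - pi)
        post = 1 / (1 + np.exp(ll0 - ll1))
        unl = labels == -1
        resp[unl] = post[unl]
    return resp

# Toy setup: 3 classifiers, 500 points, only ~5% labeled.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
scores = np.clip(rng.normal(0.3 + 0.4 * y[:, None], 0.15, (500, 3)), 0.01, 0.99)
labels = np.where(rng.random(500) < 0.05, y, -1)
resp = ssme_em(scores, labels)
# Estimated accuracy of classifier 0 = expected agreement with the latent label.
acc0 = (resp * (scores[:, 0] > .5) + (1 - resp) * (scores[:, 0] <= .5)).mean()
print("estimated accuracy of classifier 0:", round(float(acc0), 3))
```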

ICML 2024 · Conference Paper

AlphaFold Meets Flow Matching for Generating Protein Ensembles

  • Bowen Jing 0002
  • Bonnie Berger
  • Tommi S. Jaakkola

The biological functions of proteins often depend on dynamic structural ensembles. In this work, we develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins. We repurpose highly accurate single-state predictors such as AlphaFold and ESMFold and fine-tune them under a custom flow matching framework to obtain sequence-conditioned generative models of protein structure called AlphaFlow and ESMFlow. When trained and evaluated on the PDB, our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling. When further trained on ensembles from all-atom MD, our method accurately captures conformational flexibility, positional distributions, and higher-order ensemble observables for unseen proteins. Moreover, our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories, demonstrating its potential as a proxy for expensive physics-based simulations. Code is available at https://github.com/bjing2016/alphaflow.
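
For context, here is the textbook linear-path flow matching training step that such a fine-tuning approach builds on. The paper's custom framework differs in important details (the backbone is a repurposed structure predictor, not the toy network below); this is only a generic sketch of the objective.

```python
import torch

class TinyVelocityNet(torch.nn.Module):
    """Stand-in for the fine-tuned predictor: maps (noisy coords, time,
    conditioning) to a per-atom velocity."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(5, 3)
    def forward(self, xt, t, cond):
        b, n, _ = xt.shape
        feats = torch.cat([xt,
                           t.view(b, 1, 1).expand(b, n, 1),
                           cond.view(b, 1, 1).expand(b, n, 1)], dim=-1)
        return self.net(feats)

def flow_matching_step(model, x1, cond, opt):
    """x1: (b, n, 3) target structure coordinates; cond: per-example conditioning."""
    x0 = torch.randn_like(x1)               # noise endpoint of the path
    t = torch.rand(x1.shape[0])             # random time in [0, 1]
    tt = t.view(-1, 1, 1)
    xt = tt * x1 + (1 - tt) * x0            # point on the linear path
    v_target = x1 - x0                      # that path's constant velocity
    loss = ((model(xt, t, cond) - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = TinyVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
print(flow_matching_step(model, torch.randn(2, 10, 3), torch.randn(2), opt))
```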

ICML 2024 · Conference Paper

Dirichlet Flow Matching with Applications to DNA Sequence Design

  • Hannes Stärk
  • Bowen Jing 0002
  • Chenyu Wang 0003
  • Gabriele Corso
  • Bonnie Berger
  • Regina Barzilay
  • Tommi S. Jaakkola

Discrete diffusion or flow models could enable faster and more controllable sequence generation than autoregressive models. We show that naive linear flow matching on the simplex is insufficient toward this goal since it suffers from discontinuities in the training target and further pathologies. To overcome this, we develop Dirichlet flow matching on the simplex based on mixtures of Dirichlet distributions as probability paths. In this framework, we derive a connection between the mixtures’ scores and the flow’s vector field that allows for classifier and classifier-free guidance. Further, we provide distilled Dirichlet flow matching, which enables one-step sequence generation with minimal performance hits, resulting in $O(L)$ speedups compared to autoregressive models. On complex DNA sequence generation tasks, we demonstrate superior performance compared to all baselines in distributional metrics and in achieving desired design targets for generated sequences. Finally, we show that our classifier-free guidance approach improves unconditional generation and is effective for generating DNA that satisfies design targets.
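
One way to picture a Dirichlet probability path on the simplex, sketched below: draw the noisy sample at time $t$ from $\mathrm{Dir}(\mathbf{1} + t\,e_i)$, which is uniform on the simplex at $t=0$ and concentrates on vertex $e_i$ as $t$ grows. This parameterization is my reading of the idea and may differ from the paper's exact schedule.

```python
import numpy as np

def sample_path(target_idx, t, k, rng):
    """Sample x_t ~ Dir(alpha), alpha = 1 everywhere except 1 + t at the target."""
    alpha = np.ones(k)
    alpha[target_idx] += t
    return rng.dirichlet(alpha)

rng = np.random.default_rng(0)
for t in [0.0, 1.0, 10.0, 100.0]:
    x = sample_path(target_idx=2, t=t, k=4, rng=rng)
    print(t, np.round(x, 3))   # mass drifts toward coordinate 2 as t increases
```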

ICLR 2024 · Conference Paper

Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms

  • Bowen Jing 0002
  • Tommi S. Jaakkola
  • Bonnie Berger

Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
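
The core computational trick is standard and easy to demonstrate: by the cross-correlation theorem, scoring all rigid translations of a ligand field against a protein field costs one FFT round-trip instead of one evaluation per pose. In the paper the fields are multi-channel outputs of equivariant GNNs (with channel-wise correlations summed); the random single-channel grids below are only a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
protein = rng.standard_normal((32, 32, 32))   # stand-in scalar field on a grid
ligand = rng.standard_normal((32, 32, 32))

# Cross-correlation theorem: corr(p, l) = IFFT(FFT(p) * conj(FFT(l))).
scores = np.fft.ifftn(np.fft.fftn(protein) * np.conj(np.fft.fftn(ligand))).real
best = np.unravel_index(scores.argmax(), scores.shape)
print("best translation (voxels):", best, "score:", scores[best])
```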

NeurIPS 2024 · Conference Paper

Generative Modeling of Molecular Dynamics Trajectories

  • Bowen Jing
  • Hannes Stärk
  • Tommi Jaakkola
  • Bonnie Berger

Molecular dynamics (MD) is a powerful technique for studying microscopic phenomena, but its computational cost has driven significant interest in the development of deep learning-based surrogate models. We introduce generative modeling of molecular trajectories as a paradigm for learning flexible multi-task surrogate models of MD from data. By conditioning on appropriately chosen frames of the trajectory, we show such generative models can be adapted to diverse tasks such as forward simulation, transition path sampling, and trajectory upsampling. By alternatively conditioning on part of the molecular system and inpainting the rest, we also demonstrate the first steps towards dynamics-conditioned molecular design. We validate the full set of these capabilities on tetrapeptide simulations and show preliminary results on scaling to protein monomers. Altogether, our work illustrates how generative modeling can unlock value from MD data towards diverse downstream tasks that are not straightforward to address with existing methods or even MD itself. Code is available at https://github.com/bjing2016/mdgen.
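
A small illustration of how a single trajectory-level generative model can serve several tasks purely through its conditioning mask (which frames or atoms are observed versus inpainted). The task names follow the abstract; the mask shapes are schematic assumptions.

```python
import numpy as np

T, N = 100, 10                     # frames x particles in a toy trajectory
mask = {}                          # True = conditioned on, False = generated

mask["forward_simulation"] = np.zeros((T, N), bool)
mask["forward_simulation"][0] = True            # given the first frame, roll forward

mask["upsampling"] = np.zeros((T, N), bool)
mask["upsampling"][::10] = True                 # given every 10th frame, fill between

mask["transition_path"] = np.zeros((T, N), bool)
mask["transition_path"][0] = mask["transition_path"][-1] = True  # both endpoints

mask["design"] = np.zeros((T, N), bool)
mask["design"][:, :5] = True                    # condition on part of the system,
                                                # inpaint the rest
```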

TMLR 2023 · Journal Article

Causally-guided Regularization of Graph Attention Improves Generalizability

  • Alexander P Wu
  • Thomas Markovich
  • Bonnie Berger
  • Nils Yannick Hammerla
  • Rohit Singh

Graph attention networks estimate the relational importance of node neighbors to aggregate relevant information over local neighborhoods for a prediction task. However, the inferred attentions are vulnerable to spurious correlations and connectivity in the training data, hampering the generalizability of models. We introduce CAR, a general-purpose regularization framework for graph attention networks. Embodying a causal inference approach based on invariant prediction, CAR aligns the attention mechanism with the causal effects of active interventions on graph connectivity in a scalable manner. CAR is compatible with a variety of graph attention architectures, and we show that it systematically improves generalizability on various node classification tasks. Our ablation studies indicate that CAR homes in on the aspects of graph structure most pertinent to the prediction (e.g., homophily), and does so more effectively than alternative approaches. Finally, we also show that CAR enhances interpretability of attention coefficients by accentuating node-neighbor relations that point to causal hypotheses.
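
A heavily simplified sketch of the stated idea: regularize attention toward the causal effect of intervening on graph connectivity, measured here as the prediction change from deleting one edge. The penalty form, the effect measure, and the toy model are illustrative assumptions, not the paper's formulation (which is designed to be scalable, unlike this brute-force loop).

```python
import torch

class TinyGNN(torch.nn.Module):
    """A toy message-passing model, just enough to define edge interventions."""
    def __init__(self, d):
        super().__init__()
        self.lin = torch.nn.Linear(d, d)
    def forward(self, x, edges):
        out = x.clone()
        for s, t in edges:
            out[t] = out[t] + self.lin(x[s])
        return out

def car_penalty(model, x, edges, attn):
    """Penalize mismatch between attention weights and per-edge intervention effects."""
    base = model(x, edges)
    effects = torch.stack([
        (model(x, edges[:i] + edges[i + 1:]) - base).norm()   # delete edge i
        for i in range(len(edges))
    ])
    effects = effects / (effects.sum() + 1e-8)                # normalize like attention
    return ((attn - effects) ** 2).mean()

x = torch.randn(4, 8)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
attn = torch.softmax(torch.randn(len(edges)), dim=0)          # stand-in attention
print(car_penalty(TinyGNN(8), x, edges, attn))
# In training: total_loss = task_loss + lam * car_penalty(...)
```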

ICLR 2022 · Conference Paper

Granger causal inference on DAGs identifies genomic loci regulating transcription

  • Alexander P. Wu
  • Rohit Singh 0001
  • Bonnie Berger

When a dynamical system can be modeled as a sequence of observations, Granger causality is a powerful approach for detecting predictive interactions between its variables. However, traditional Granger causal inference has limited utility in domains where the dynamics need to be represented as directed acyclic graphs (DAGs) rather than as a linear sequence, such as with cell differentiation trajectories. Here, we present GrID-Net, a framework based on graph neural networks with lagged message passing for Granger causal inference on DAG-structured systems. Our motivating application is the analysis of single-cell multimodal data to identify genomic loci that mediate the regulation of specific genes. To our knowledge, GrID-Net is the first single-cell analysis tool that accounts for the temporal lag between a genomic locus becoming accessible and its downstream effect on a target gene's expression. We applied GrID-Net to multimodal single-cell assays that profile chromatin accessibility (ATAC-seq) and gene expression (RNA-seq) in the same cell and show that it dramatically outperforms existing methods for inferring regulatory locus-gene links, achieving up to 71% greater agreement with independent population genetics-based estimates. By extending Granger causality to DAG-structured dynamical systems, our work unlocks new domains for causal analyses and, more specifically, opens a path towards elucidating gene regulatory interactions relevant to cellular differentiation and complex human diseases at unprecedented scale and resolution.
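
A minimal sketch of lagged message passing on a DAG: at each step, a node's state aggregates its DAG parents' states from the previous step, so information propagates with an explicit lag along the partial order. This illustrates only the mechanism named above, with an arbitrary mixing rule, not GrID-Net itself.

```python
import numpy as np

def lagged_propagate(x, parents, n_steps):
    """x: (n, d) initial node features; parents[v] = list of DAG parents of v."""
    h = x.copy()
    for _ in range(n_steps):
        h_prev = h.copy()                   # states at lag s-1
        for v, ps in parents.items():
            if ps:
                h[v] = 0.5 * h_prev[v] + 0.5 * np.mean([h_prev[p] for p in ps], axis=0)
    return h

# Toy DAG: 0 -> 1 -> 3 and 0 -> 2 -> 3 (e.g., a branching trajectory).
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
x = np.eye(4)
print(lagged_propagate(x, parents, n_steps=2))
```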

ICLR 2021 · Conference Paper

Multi-resolution modeling of a discrete stochastic process identifies causes of cancer

  • Adam Uri Yaari
  • Maxwell Sherman
  • Oliver Clarke Priebe
  • Po-Ru Loh
  • Boris Katz
  • Andrei Barbu
  • Bonnie Berger

Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation. Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes.
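
As background for why a closed-form posterior enables fast prediction, here is the classical Poisson-Gamma conjugacy that SPG extends: with rate $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$ and counts $x_1,\dots,x_n \sim \mathrm{Poisson}(\lambda)$, the posterior is $\mathrm{Gamma}(\alpha + \sum_i x_i,\; \beta + n)$. This is the standard textbook identity, not the SPG model itself.

```python
import numpy as np

alpha, beta = 2.0, 1.0                    # Gamma prior (shape, rate)
rng = np.random.default_rng(0)
lam_true = rng.gamma(alpha, 1 / beta)     # numpy's gamma takes (shape, scale)
x = rng.poisson(lam_true, size=50)        # toy observed mutation counts

alpha_post = alpha + x.sum()              # closed-form conjugate update
beta_post = beta + len(x)
print("true rate:", lam_true, "posterior mean:", alpha_post / beta_post)
```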

NeurIPS 2020 · Conference Paper

Learning Mutational Semantics

  • Brian Hie
  • Ellen Zhong
  • Bryan Bryson
  • Bonnie Berger

In many natural domains, changing a small part of an entity can transform its semantics; for example, a single word change can alter the meaning of a sentence, or a single amino acid change can mutate a viral protein to escape antiviral treatment or immunity. Although identifying such mutations can be desirable (for example, therapeutic design that anticipates avenues of viral escape), the rules governing semantic change are often hard to quantify. Here, we introduce the problem of identifying mutations with a large effect on semantics, but where valid mutations are under complex constraints (for example, English grammar or biological viability), which we refer to as constrained semantic change search (CSCS). We propose an unsupervised solution based on language models that simultaneously learn continuous latent representations. We report good empirical performance on CSCS of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution.
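
A minimal sketch of the CSCS ranking rule described above: prefer candidate single-token mutations that combine large semantic change (embedding distance) with high plausibility under the language model (so the mutation stays grammatical or viable). The `embed` and `token_logprob` interfaces and the additive combination are stand-in assumptions; the toy implementations below exist only so the sketch runs.

```python
import numpy as np

def cscs_rank(sequence, position, vocab, embed, token_logprob):
    base_emb = embed(sequence)
    ranked = []
    for tok in vocab:
        if tok == sequence[position]:
            continue
        mutant = sequence[:position] + [tok] + sequence[position + 1:]
        semantic_change = np.linalg.norm(embed(mutant) - base_emb)
        fitness = token_logprob(sequence, position, tok)    # grammar / viability
        ranked.append((semantic_change + fitness, tok))     # combined CSCS score
    return sorted(ranked, reverse=True)

# Toy stand-ins so the sketch runs end-to-end (hypothetical, not a real model):
vocab = list("ACDEFGHIKLMNPQRSTVWY")
embed = lambda s: np.array([ord(c) for c in s], float)
token_logprob = lambda s, i, t: -abs(ord(t) - ord(s[i])) / 10.0
print(cscs_rank(list("ACDE"), 2, vocab, embed, token_logprob)[:3])
```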

ICLR 2020 · Conference Paper

Reconstructing continuous distributions of 3D protein structure from cryo-EM images

  • Ellen D. Zhong
  • Tristan Bepler
  • Joseph H. Davis
  • Bonnie Berger

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structure of proteins and other macromolecular complexes at near-atomic resolution. In single particle cryo-EM, the central problem is to reconstruct the 3D structure of a macromolecule from $10^4$–$10^7$ noisy and randomly oriented 2D projection images. However, the imaged protein complexes may exhibit structural variability, which complicates reconstruction and is typically addressed using discrete clustering approaches that fail to capture the full range of protein dynamics. Here, we introduce a novel method for cryo-EM reconstruction that extends naturally to modeling continuous generative factors of structural heterogeneity. This method encodes structures in Fourier space using coordinate-based deep neural networks, and trains these networks from unlabeled 2D cryo-EM images by combining exact inference over image orientation with variational inference for structural heterogeneity. We demonstrate that the proposed method, termed cryoDRGN, can perform ab-initio reconstruction of 3D protein complexes from simulated and real 2D cryo-EM image data. To our knowledge, cryoDRGN is the first neural network-based approach for cryo-EM reconstruction and the first end-to-end method for directly reconstructing continuous ensembles of protein structures from cryo-EM images.
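
A minimal sketch of the decoder idea: a coordinate-based MLP that maps a 3D Fourier-space coordinate plus a per-image latent heterogeneity vector z to a volume value, so structure is a continuous function of (coordinate, z) rather than a voxel grid. Layer sizes are arbitrary, and the positional encoding cryoDRGN uses is omitted.

```python
import torch

class CoordDecoder(torch.nn.Module):
    def __init__(self, z_dim=8, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3 + z_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )
    def forward(self, coords, z):
        # coords: (n, 3) Fourier-space points; z: (z_dim,) per-image latent.
        zt = z.expand(coords.shape[0], -1)
        return self.net(torch.cat([coords, zt], dim=-1)).squeeze(-1)

dec = CoordDecoder()
coords = torch.rand(100, 3) - 0.5          # e.g., points on one central slice
print(dec(coords, torch.randn(8)).shape)   # torch.Size([100])
```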

NeurIPS 2019 · Conference Paper

Explicitly disentangling image content from translation and rotation with spatial-VAE

  • Tristan Bepler
  • Ellen Zhong
  • Kotaro Kelley
  • Edward Brignole
  • Bonnie Berger

Given an image dataset, we are often interested in finding data generative factors that encode semantic content independently from pose variables such as rotation and translation. However, current disentanglement approaches do not impose any specific structure on the learned latent representations. We propose a method for explicitly disentangling image rotation and translation from other unstructured latent factors in a variational autoencoder (VAE) framework. By formulating the generative model as a function of the spatial coordinate, we make the reconstruction error differentiable with respect to latent translation and rotation parameters. This formulation allows us to train a neural network to perform approximate inference on these latent variables while explicitly constraining them to only represent rotation and translation. We demonstrate that this framework, termed spatial-VAE, effectively learns latent representations that disentangle image rotation and translation from content and improves reconstruction over standard VAEs on several benchmark datasets, including applications to modeling continuous 2-D views of proteins from single particle electron microscopy and galaxies in astronomical images.
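
A minimal sketch of the coordinate trick: apply the latent rotation and translation to the pixel coordinate grid before decoding, so pose is represented explicitly and the reconstruction is differentiable in the pose parameters. The decoder is a toy, and the unstructured content latent that spatial-VAE also feeds to the decoder is omitted for brevity.

```python
import torch

def transform_coords(coords, theta, dxy):
    """coords: (p, 2) pixel coordinates; theta: scalar angle; dxy: (2,) shift."""
    c, s = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return coords @ rot.T + dxy

decoder = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                              torch.nn.Linear(32, 1))
# Pixel grid for an 8x8 image, then a rotated + shifted reconstruction.
g = torch.linspace(-1, 1, 8)
coords = torch.stack(torch.meshgrid(g, g, indexing="ij"), -1).reshape(-1, 2)
theta = torch.tensor(0.3, requires_grad=True)
dxy = torch.zeros(2, requires_grad=True)
img = decoder(transform_coords(coords, theta, dxy)).reshape(8, 8)
img.sum().backward()                       # gradients flow to pose parameters
print(theta.grad, dxy.grad)
```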

FOCS 1993 · Conference Paper

Near-Linear Cost Sequential and Distributed Constructions of Sparse Neighborhood Covers

  • Baruch Awerbuch
  • Bonnie Berger
  • Lenore Cowen
  • David Peleg

This paper introduces the first near-linear (specifically, $O(E \log n + n \log^2 n)$) time algorithm for constructing a sparse neighborhood cover in sequential and distributed environments. This automatically implies analogous improvements (from quadratic to near-linear) to all the results in the literature that rely on network decompositions, both in sequential and distributed domains, including adaptive routing schemes with $\tilde{O}(1)$ stretch and memory, small edge cuts in planar graphs, sequential algorithms for dynamic approximate shortest paths with $\tilde{O}(E)$ cost for edge insertion/deletion and $\tilde{O}(1)$ time to answer shortest-path queries, weight- and distance-preserving graph spanners with $\tilde{O}(E)$ running time and space, and distributed asynchronous "from-scratch" breadth-first-search and network synchronizer constructions with $\tilde{O}(1)$ message and space overhead (down from $O(n)$).

FOCS 1989 · Conference Paper

Efficient NC Algorithms for Set Cover with Applications to Learning and Geometry

  • Bonnie Berger
  • John Rompel
  • Peter W. Shor

NC approximation algorithms are given for the unweighted and weighted set cover problems. The algorithms use a linear number of processors and give a cover that has at most log n times the optimal size/weight, thus matching the performance of the best sequential algorithms. The set cover algorithm is applied to learning theory, providing an NC algorithm for learning the concept class obtained by taking the closure under finite union or finite intersection of any concept class of finite VC dimension which has an NC hypothesis finder. In addition, a linear-processor NC algorithm is given for a variant of the set cover problem and used to obtain NC algorithms for several problems in computational geometry.
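
For reference, the sequential greedy baseline whose log n approximation the NC algorithms match: repeatedly pick the set covering the most uncovered elements. The paper's contribution is achieving a comparable guarantee in NC, i.e., in parallel polylogarithmic time; the greedy loop below is inherently sequential.

```python
def greedy_set_cover(universe, sets):
    """Classic greedy: ln(n)-approximate cover of `universe` by indices into `sets`."""
    uncovered, cover = set(universe), []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("instance is not coverable")
        cover.append(best)
        uncovered -= sets[best]
    return cover

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 5}]
print(greedy_set_cover({1, 2, 3, 4, 5, 6}, sets))   # e.g. [0, 2]
```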

FOCS 1989 · Conference Paper

Simulating $(\log^c n)$-wise Independence in NC

  • Bonnie Berger
  • John Rompel

A general framework is developed for removing randomness from randomized NC algorithms whose analysis uses only polylogarithmic independence. Previously, no techniques were known to determinize those RNC algorithms depending on more than constant independence. One application of the techniques is an NC algorithm for the set discrepancy problem, which can be used to obtain many other NC algorithms, including a better NC edge-coloring algorithm. As another application, an NC algorithm for the hypergraph coloring problem is provided.