Arrow Research search

Author name cluster

Jeff Shen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers (4)

NeurIPS 2025 Conference Paper

AION-1: Omnimodal Foundation Model for Astronomical Sciences

  • Liam Parker
  • Francois Lanusse
  • Jeff Shen
  • Ollie Liu
  • Tom Hehir
  • Leopoldo Sarra
  • Lucas Meyer
  • Micah Bowles

While foundation models have shown promise across a variety of fields, astronomy lacks a unified framework for joint modeling across its highly diverse data modalities. In this paper, we present AION-1, the first large-scale family of multimodal foundation models for astronomy. AION-1 enables arbitrary transformations between heterogeneous data types using a two-stage architecture: modality-specific tokenization followed by transformer-based masked modeling of cross-modal token sequences. Trained on over 200M astronomical objects, AION-1 demonstrates strong performance across regression, classification, generation, and object retrieval tasks. Beyond astronomy, AION-1 provides a scalable blueprint for multimodal scientific foundation models that can seamlessly integrate heterogeneous combinations of real-world observations. Our model release is entirely open source, including the dataset, training script, and weights.
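The two-stage architecture described in the abstract can be illustrated with a toy sketch. This is not AION-1's actual implementation (which uses learned tokenizers and a transformer); the binning tokenizer, modality names, and masking helper below are hypothetical stand-ins that only show the shape of "tokenize each modality, concatenate, then mask for cross-modal prediction":

```python
import random

def tokenize(modality_name, values, vocab_size=256):
    """Toy modality-specific tokenizer: bin continuous values in [0, 1)
    into discrete tokens. (A stand-in for a learned tokenizer.)"""
    return [(modality_name, int(min(max(v, 0.0), 0.999) * vocab_size))
            for v in values]

def mask_sequence(tokens, mask_frac=0.3, seed=0):
    """Mask a random fraction of the cross-modal token sequence.
    Returns the masked sequence and the indices that were hidden."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_frac))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked = [("MASK", None) if i in hidden else t
              for i, t in enumerate(tokens)]
    return masked, sorted(hidden)

# Stage 1: tokenize two hypothetical modalities, then concatenate
# into a single cross-modal token sequence.
image_tokens = tokenize("image", [0.1, 0.5, 0.9])
spectrum_tokens = tokenize("spectrum", [0.2, 0.8])
sequence = image_tokens + spectrum_tokens

# Stage 2: hide part of the joint sequence; a transformer would be
# trained to predict the hidden tokens from the visible ones, which is
# what lets the model translate between modalities at inference time.
masked, hidden_idx = mask_sequence(sequence)
```

Because any subset of positions can be masked, the same trained model can condition on whichever modalities are available and reconstruct the rest, which is what "arbitrary transformations between heterogeneous data types" refers to.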

NeurIPS 2025 Conference Paper

Predicting partially observable dynamical systems via diffusion models with a multiscale inference scheme

  • Rudy Morel
  • Francesco Ramunno
  • Jeff Shen
  • Alberto Bietti
  • Kyunghyun Cho
  • Miles Cranmer
  • Siavash Golkar
  • Olexandr Gugnin

Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun’s surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.
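The "fine-grained near the present, coarser farther away" schedule in the abstract can be sketched in a few lines. The paper's actual inference scheme operates inside a diffusion model; the helper below is only a hypothetical illustration of one such time grid, using geometric spacing so that covering a long horizon needs only logarithmically many points:

```python
def multiscale_grid(horizon, base_step=1, factor=2):
    """Build a time grid that is dense near the present (t = 0) and whose
    spacing grows by `factor` toward `horizon`. One hypothetical
    realization of a fine-near, coarse-far inference schedule."""
    grid, t, step = [0], 0, base_step
    while t + step <= horizon:
        t += step
        grid.append(t)
        step *= factor  # coarsen as we move away from the present
    return grid

grid = multiscale_grid(100)  # e.g. [0, 1, 3, 7, 15, 31, 63]
```

Conditioning a rollout on such a grid lets recent states constrain the prediction at full temporal resolution while distant past states still enter the context, capturing long-range dependencies without the cost of a uniformly fine grid.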

NeurIPS 2024 Conference Paper

The Multimodal Universe: Enabling Large-Scale Machine Learning with 100 TB of Astronomical Scientific Data

  • Eirini Angeloudi
  • Jeroen Audenaert
  • Micah Bowles
  • Benjamin M. Boyd
  • David Chemaly
  • Brian Cherinka
  • Ioana Ciucă
  • Miles Cranmer

We present the Multimodal Universe, a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, our dataset contains hundreds of millions of astronomical observations, constituting 100 TB of multi-channel and hyper-spectral images, spectra, multivariate time series, as well as a wide variety of associated scientific measurements and metadata. In addition, we include a range of benchmark tasks representative of standard practices for machine learning methods in astrophysics. This massive dataset will enable the development of large multi-modal models specifically targeted towards scientific applications. All code used to compile the dataset, and a description of how to access the data, are available at https://github.com/MultimodalUniverse/MultimodalUniverse

NeurIPS 2024 Conference Paper

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

  • Ruben Ohana
  • Michael McCabe
  • Lucas Meyer
  • Rudy Morel
  • Fruzsina J. Agocs
  • Miguel Beneitez
  • Marsha Berger
  • Blakesley Burkhart

Machine learning-based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software developers to provide 15 TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite. To facilitate usage of the Well, we provide a unified PyTorch interface for training and evaluating models. We demonstrate the function of this library by introducing example baselines that highlight the new challenges posed by the complex dynamics of the Well. The code and data are available at https://github.com/PolymathicAI/the_well