JMLR Journal 2025 Journal Article
ClimSim-Online: A Large Multi-Scale Dataset and Framework for Hybrid Physics-ML Climate Emulation
- Sungduk Yu
- Zeyuan Hu
- Akshay Subramaniam
- Walter Hannah
- Liran Peng
- Jerry Lin
- Mohamed Aziz Bhouri
- Ritwik Gupta
Modern climate projections lack adequate spatial and temporal resolution due to computational constraints, leading to inaccuracies in representing critical processes like thunderstorms that occur on the sub-resolution scale. Hybrid methods combining physics with machine learning (ML) offer faster, higher fidelity climate simulations by outsourcing compute-hungry, high-resolution simulations to ML emulators. However, these hybrid physics-ML simulations require domain-specific data and workflows that have been inaccessible to many ML experts. This paper is an extended version of our NeurIPS award-winning ClimSim dataset paper. The ClimSim dataset includes 5.7 billion pairs of multivariate input/output vectors spanning ten years at high temporal resolution, capturing the influence of high-resolution, high-fidelity physics on a host climate simulator's macro-scale state. In this extended version, we introduce a significant new contribution in Section 5, which provides a cross-platform, containerized pipeline to integrate ML models into operational climate simulators for hybrid testing. We also implement various baselines of ML models and hybrid simulators to highlight the ML challenges of building stable, skillful emulators. The data (https://huggingface.co/datasets/LEAP/ClimSim_high-res, also in a low-resolution version at https://huggingface.co/datasets/LEAP/ClimSim_low-res and an aquaplanet version at https://huggingface.co/datasets/LEAP/ClimSim_low-res_aqua-planet) and code (https://leap-stc.github.io/ClimSim and https://github.com/leap-stc/climsim-online) are publicly released to support the development of hybrid physics-ML and high-fidelity climate simulations. [abs] [ pdf ][ bib ] [ code ] © JMLR 2025. ( edit, beta )