
Author name cluster

Gengchen Mai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI Conference 2026 Conference Paper

EcoDiffusion: Uncertainty-Aware Emulation of Ecosystem Processes with Conditional Diffusion for Long Sequences with Single-Step Initialization

  • Ruohan Li
  • Zhihao Wang
  • Xiaowei Jia
  • Gengchen Mai
  • Lei Ma
  • George C. Hurtt
  • Quan Shen
  • Zhili Li

Terrestrial ecosystems constitute a major component of the global carbon sink and play a critical role in regulating the global carbon cycle. Although process-based models such as the Ecosystem Demography (ED) model are widely used to simulate these dynamics in both research and applications, they remain computationally intensive and are not well suited for large-scale (e.g., global) projections at high spatial and temporal resolution, or under a wide range of future scenarios. AI-based emulators of process-based physical models have emerged as a promising way to accelerate the computation. However, developing emulators for ecosystem processes raises several challenges, including error accumulation over long sequences, single-step initial conditions, and high-dimensional environmental conditions. Existing works often rely on time-series patterns in look-back windows, which are ill-suited to problems with single-step initial conditions. Moreover, they often do not quantify uncertainty, making it hard to know when the approximations are reliable and when the results may need to be updated, e.g., by the process-based models. To address these limitations, we introduce EcoDiffusion, a conditional diffusion framework tailored for ecosystem dynamics emulation. We evaluated EcoDiffusion at locations distributed worldwide under different scenarios and showed significant improvements over existing models.
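
As a rough illustration of what a conditional diffusion emulator with a single-step initial condition could look like, here is a minimal DDPM-style sampling sketch. All names and shapes here (EcoDenoiser, state_dim, env_dim, the noise schedule) are assumptions for illustration, not the EcoDiffusion architecture.

```python
# Minimal sketch of conditional diffusion sampling for trajectory emulation.
# The denoiser, dimensions, and schedule are illustrative assumptions.
import torch
import torch.nn as nn

class EcoDenoiser(nn.Module):
    """Predicts noise for an ecosystem-state sequence, conditioned on a
    single initial state and per-site environmental covariates."""
    def __init__(self, seq_len=120, state_dim=8, env_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_len * state_dim + state_dim + env_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, seq_len * state_dim),
        )
        self.seq_len, self.state_dim = seq_len, state_dim

    def forward(self, x_t, t, init_state, env):
        flat = torch.cat([x_t.flatten(1), init_state, env, t[:, None]], dim=1)
        return self.net(flat).view(-1, self.seq_len, self.state_dim)

@torch.no_grad()
def sample(model, init_state, env, n_steps=50):
    """Ancestral sampling: start from noise and iteratively denoise the
    whole trajectory given only the single-step initial condition."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(init_state.shape[0], model.seq_len, model.state_dim)
    for i in reversed(range(n_steps)):
        t = torch.full((x.shape[0],), i / n_steps)
        eps = model(x, t, init_state, env)
        x = (x - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:  # add noise on all but the final step
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x
```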

NeurIPS Conference 2025 Conference Paper

4KAgent: Agentic Any Image to 4K Super-Resolution

  • Yushen Zuo
  • Qi Zheng
  • Mingyang Wu
  • Xinrui Jiang
  • Renjie Li
  • Jian Wang
  • Yide Zhang
  • Gengchen Mai

We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at $256\times 256$, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) a Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) a Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-experts policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting a new state of the art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging such as fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We release all code, models, and results at https://4kagent.github.io.
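
The recursive execution-reflection loop with a quality-driven mixture-of-experts policy can be sketched as below. The expert pool, quality scorer, and stopping thresholds are placeholder names, not the released 4KAgent implementation (see the project page for the real code).

```python
# Sketch of a quality-driven execution-reflection loop; all callables are
# placeholders (experts: list of image -> image, score: image -> float).
def restore(image, experts, score, max_rounds=4, target=0.8):
    """At each round, every candidate expert proposes a restoration
    (execution); keep the highest-scoring output (reflection) and iterate
    until the quality target or the round budget is reached."""
    best, best_q = image, score(image)
    for _ in range(max_rounds):
        candidates = [expert(best) for expert in experts]  # execution
        cand = max(candidates, key=score)                  # reflection
        q = score(cand)
        if q <= best_q:   # no expert improved the image; stop early
            break
        best, best_q = cand, q
        if best_q >= target:
            break
    return best
```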

NeurIPS Conference 2025 Conference Paper

LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space

  • Zhangyu Wang
  • Zeping Liu
  • Jielu Zhang
  • Zhongliang Zhou
  • Qian Cao
  • Nemin Wu
  • Lan Mu
  • Yang Song

Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
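
A minimal sketch of the SHDD idea as the abstract describes it: a geolocation is encoded as the spherical harmonics coefficients of a Dirac delta at that point, and decoded by mode-seeking over the reconstructed spherical function. The degree cutoff and the brute-force grid search below stand in for the paper's actual decoder.

```python
# Sketch of SHDD encode/decode under simplifying assumptions (degree
# cutoff L, grid-search mode-seeking); not the paper's implementation.
import numpy as np
from scipy.special import sph_harm

def shdd_encode(lon_deg, lat_deg, L=10):
    """Coefficients of a Dirac delta at a point: c_{l,m} = conj(Y_{l,m})."""
    theta = np.deg2rad(lon_deg % 360.0)   # azimuth in [0, 2*pi)
    phi = np.deg2rad(90.0 - lat_deg)      # colatitude in [0, pi]
    return np.array([sph_harm(m, l, theta, phi).conj()
                     for l in range(L + 1) for m in range(-l, l + 1)])

def shdd_decode(coeffs, L=10, res=1.0):
    """Mode-seeking decode: evaluate the truncated expansion on a lon/lat
    grid and return the argmax location."""
    lons = np.arange(0.0, 360.0, res)
    lats = np.arange(-90.0, 90.0 + res, res)
    theta = np.deg2rad(lons)[None, :]
    phi = np.deg2rad(90.0 - lats)[:, None]
    field = np.zeros((lats.size, lons.size), dtype=complex)
    idx = 0
    for l in range(L + 1):
        for m in range(-l, l + 1):
            field += coeffs[idx] * sph_harm(m, l, theta, phi)
            idx += 1
    i, j = np.unravel_index(np.argmax(field.real), field.shape)
    return lons[j], lats[i]
```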

NeurIPS Conference 2025 Conference Paper

TreeFinder: A US-Scale Benchmark Dataset for Individual Tree Mortality Monitoring Using High-Resolution Aerial Imagery

  • Zhihao Wang
  • Cooper Li
  • Ruichen Wang
  • Lei Ma
  • George Hurtt
  • Xiaowei Jia
  • Gengchen Mai
  • Zhili Li

Monitoring individual tree mortality at scale is crucial for understanding forest loss, ecosystem resilience, carbon fluxes, and climate-induced impacts. However, such fine-grained monitoring faces major challenges on both the data and methodology sides: (1) finding isolated individual-level tree deaths requires high-resolution remote sensing images with broad coverage, and (2) compared to regular geo-objects (e.g., buildings), dead trees often exhibit weaker contrast and high variability across tree types, landscapes, and ecosystems. Existing datasets on tree mortality primarily rely on moderate-resolution satellite imagery (e.g., 30m resolution), which aims to detect large-patch wipe-outs but cannot recognize individual-level tree mortality events. Several efforts have explored alternatives via very-high-resolution drone imagery. However, drone images are expensive and can only be collected at local scales, and are therefore not suitable for national-scale applications and beyond. To bridge these gaps, we introduce TreeFinder, the first high-resolution remote sensing benchmark dataset designed for individual-level tree mortality mapping across the Contiguous United States (CONUS). Specifically, the dataset uses NAIP imagery at 0.6m resolution that provides wall-to-wall coverage of the entire CONUS. TreeFinder contains images with pixel-level labels generated via extensive manual annotation covering forested areas in 48 states with over 23,000 hectares. All annotations are rigorously validated using multi-temporal NAIP images and auxiliary vegetation indices from remote sensing imagery. Moreover, TreeFinder includes multiple evaluation scenarios to test models' ability to generalize across geographic regions, climate zones, and forests with different plant functional types. Finally, we develop benchmarks using a suite of semantic segmentation models, including both convolutional architectures and more recent foundation models based on vision transformers for general and remote sensing images. Our dataset and code are publicly available on Kaggle and GitHub: https://www.kaggle.com/datasets/zhihaow/tree-finder and https://github.com/zhwang0/treefinder.
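
Segmentation benchmarks of this kind are typically scored with intersection-over-union; a minimal sketch for binary dead-tree masks follows. The label convention (1 = dead tree, 0 = background) is an assumption, not TreeFinder's documented format.

```python
# Sketch of binary IoU for pixel-level mortality masks; label convention
# (1 = dead tree) is assumed for illustration.
import numpy as np

def binary_iou(pred, target):
    """IoU between two binary masks of identical shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter / union) if union else 1.0  # both empty: agreement
```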

ICLR Conference 2024 Conference Paper

GeoLLM: Extracting Geospatial Knowledge from Large Language Models

  • Rohin Manvi
  • Samar Khanna
  • Gengchen Mai
  • Marshall Burke
  • David B. Lobell
  • Stefano Ermon

The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power. Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks. We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density. We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson's $r^2$) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature. With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe. Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well. Code is available on the project website: https://rohinmanvi.github.io/GeoLLM
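
A hedged sketch of a GeoLLM-style prompt, combining raw coordinates with nearby-place context drawn from OpenStreetMap; the field wording and the rating scale here are illustrative approximations, not the paper's exact template.

```python
# Sketch of prompt construction for map-augmented geospatial queries.
# Field names, ordering, and the 0.0-9.9 scale are assumptions.
def build_prompt(lat, lon, address, nearby_places, task="Population Density"):
    """nearby_places: list of (distance_km, bearing, name) tuples drawn
    from OpenStreetMap, approximating the paper's map-based context."""
    places = "\n".join(f"{d:.1f} km {b}: {name}"
                       for d, b, name in nearby_places)
    return (
        f"Coordinates: ({lat:.4f}, {lon:.4f})\n"
        f"Address: {address}\n"
        f"Nearby Places:\n{places}\n\n"
        f"{task} (on a scale from 0.0 to 9.9):"
    )

print(build_prompt(37.4275, -122.1697, "Stanford, CA, United States",
                   [(1.2, "north-east", "Palo Alto"),
                    (5.3, "east", "Menlo Park")]))
```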

ICML Conference 2024 Conference Paper

MC-GTA: Metric-Constrained Model-Based Clustering using Goodness-of-fit Tests with Autocorrelations

  • Zhangyu Wang
  • Gengchen Mai
  • Krzysztof Janowicz
  • Ni Lao

A wide range of (multivariate) temporal (1D) and spatial (2D) data analysis tasks, such as grouping vehicle sensor trajectories, can be formulated as clustering with given metric constraints. Existing metric-constrained clustering algorithms overlook the rich correlation between feature similarity and metric distance, i.e., metric autocorrelation. The model-based variants of these clustering algorithms (e.g., TICC and STICC) achieve SOTA performance, yet suffer from computational instability and complexity by using a metric-constrained Expectation-Maximization procedure. To address these two problems, we propose a novel clustering algorithm, MC-GTA (Model-based Clustering via Goodness-of-fit Tests with Autocorrelations). Its objective is composed only of pairwise weighted sums of feature similarity terms (squared Wasserstein-2 distance) and metric autocorrelation terms (a novel multivariate generalization of the classic semivariogram). We show that MC-GTA effectively minimizes the total hinge loss for intra-cluster observation pairs that fail goodness-of-fit tests, i.e., that do not statistically originate from the same distribution. Experiments on 1D/2D synthetic and real-world datasets demonstrate that MC-GTA successfully incorporates metric autocorrelation. It outperforms strong baselines by large margins (up to 14.3% in ARI and 32.1% in NMI) with faster and more stable optimization (>10x speedup).
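
To make the objective concrete, here is a minimal sketch of its two ingredients for one intra-cluster pair: the squared Wasserstein-2 distance between fitted Gaussians, hinged against a semivariogram threshold at the pair's metric distance. The Gaussian fitting and the semivariogram estimator gamma are placeholders for the paper's constructions.

```python
# Sketch of a per-pair hinge term; gamma and the Gaussian fits are
# placeholders, not the paper's estimators.
import numpy as np
from scipy.linalg import sqrtm

def w2_sq(m1, C1, m2, C2):
    """Squared Wasserstein-2 distance between Gaussians N(m1,C1), N(m2,C2)."""
    r = sqrtm(C2)
    cross = sqrtm(r @ C1 @ r)
    return float(np.sum((m1 - m2) ** 2)
                 + np.trace(C1 + C2 - 2 * cross.real))

def pair_loss(m1, C1, m2, C2, metric_dist, gamma):
    """Hinge loss: penalize an intra-cluster pair whose feature discrepancy
    exceeds what the semivariogram predicts at its metric distance."""
    return max(0.0, w2_sq(m1, C1, m2, C2) - gamma(metric_dist))
```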

NeurIPS Conference 2024 Conference Paper

TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

  • Nemin Wu
  • Qian Cao
  • Zhangyu Wang
  • Zeping Liu
  • Yanlin Qi
  • Jielu Zhang
  • Joshua Ni
  • Xiaobai Yao

Spatial representation learning (SRL) aims at learning general-purpose neural network representations from various types of spatial data (e.g., points, polylines, polygons, networks, images, etc.) in their native formats. Learning good spatial representations is a fundamental problem for various downstream applications such as species distribution modeling, weather forecasting, trajectory generation, geographic question answering, etc. Even though SRL has become the foundation of almost all geospatial artificial intelligence (GeoAI) research, we have not yet seen significant efforts to develop an extensive deep learning framework and benchmark to support SRL model development and evaluation. To fill this gap, we propose TorchSpatial, a learning framework and benchmark for location (point) encoding, which is one of the most fundamental data types of spatial representation learning. TorchSpatial contains three key components: 1) a unified location encoding framework that consolidates 15 commonly recognized location encoders, ensuring scalability and reproducibility of the implementations; 2) the LocBench benchmark tasks encompassing 7 geo-aware image classification and 10 geo-aware image regression datasets; 3) a comprehensive suite of evaluation metrics to quantify geo-aware models' overall performance as well as their geographic bias, with a novel Geo-Bias Score metric. Finally, we provide a detailed analysis and insights into the model performance and geographic bias of different location encoders. We believe TorchSpatial will foster future advancement of spatial representation learning and spatial fairness in GeoAI research. The TorchSpatial model framework and LocBench benchmark are available at https://github.com/seai-lab/TorchSpatial, and the Geo-Bias Score evaluation framework is available at https://github.com/seai-lab/PyGBS.
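
As an illustration of what a geographic-bias style metric measures (not the actual Geo-Bias Score, which is defined in the PyGBS repository), the sketch below contrasts a model's overall accuracy with the spread of its accuracy across coarse spatial bins.

```python
# Sketch of a crude geographic-disparity proxy; illustration only,
# not TorchSpatial's Geo-Bias Score.
import numpy as np

def regional_accuracy_spread(correct, lons, lats, bin_deg=30.0):
    """correct: boolean per-sample hits; lons/lats in degrees.
    Returns overall accuracy and the std of per-bin accuracy."""
    kx = np.floor(lons / bin_deg).astype(int)
    ky = np.floor(lats / bin_deg).astype(int)
    bins = {}
    for c, x, y in zip(correct, kx, ky):
        bins.setdefault((x, y), []).append(c)
    per_bin = np.array([np.mean(v) for v in bins.values()])
    return float(np.mean(correct)), float(np.std(per_bin))
```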

ICML Conference 2023 Conference Paper

CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations

  • Gengchen Mai
  • Ni Lao
  • Yutong He
  • Jiaming Song
  • Stefano Ermon

Geo-tagged images are publicly available in large quantities, whereas labels such as object classes are rather scarce and expensive to collect. Meanwhile, contrastive learning has achieved tremendous success in various natural image and language tasks with limited labeled data. However, existing methods fail to fully leverage geospatial information, which can be paramount to distinguishing objects that are visually similar. To directly leverage the abundant geospatial information associated with images in pre-training, fine-tuning, and inference stages, we present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images. We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images, which can be transferred to downstream supervised tasks such as image classification. Experiments show that CSP can improve model performance on both iNat2018 and fMoW datasets. In particular, on iNat2018, CSP significantly boosts model performance, with 10-34% relative improvement across various labeled-training-data sampling ratios.
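
The dual-encoder contrastive objective can be sketched as a symmetric InfoNCE loss that aligns each image embedding with the embedding of its geo-location; the encoder architectures and temperature value here are assumptions, not CSP's exact configuration.

```python
# Sketch of a symmetric image-location InfoNCE loss; temperature and
# embedding dimensions are illustrative.
import torch
import torch.nn.functional as F

def image_location_infonce(img_emb, loc_emb, temperature=0.07):
    """img_emb, loc_emb: (N, D) embeddings for N geo-tagged images and
    their locations; matched pairs share a row index."""
    img = F.normalize(img_emb, dim=1)
    loc = F.normalize(loc_emb, dim=1)
    logits = img @ loc.t() / temperature          # (N, N) similarities
    labels = torch.arange(img.shape[0], device=img.device)
    # symmetric: image-to-location and location-to-image directions
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```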

ICLR Conference 2020 Conference Paper

Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells

  • Gengchen Mai
  • Krzysztof Janowicz
  • Bo Yan 0003
  • Rui Zhu 0008
  • Ling Cai 0002
  • Ni Lao

Unsupervised text encoding models have recently fueled substantial progress in NLP. The key idea is to use neural networks to convert words in texts to vector space representations (embeddings) based on word positions in a sentence and their contexts, which are suitable for end-to-end training of downstream tasks. We see a strikingly similar situation in spatial analysis, which focuses on incorporating both absolute positions and spatial contexts of geographic objects such as POIs into models. A general-purpose representation model for space is valuable for a multitude of tasks. However, no such general model exists to date beyond simply applying discretization or feed-forward nets to coordinates, and little effort has been put into jointly modeling distributions with vastly different characteristics, which commonly emerge from GIS data. Meanwhile, Nobel Prize-winning neuroscience research shows that grid cells in mammals provide a multi-scale periodic representation that functions as a metric for location encoding and is critical for recognizing places and for path integration. Therefore, we propose a representation learning model called Space2Vec to encode the absolute positions and spatial relationships of places. We conduct experiments on two real-world geographic datasets for two different tasks: 1) predicting types of POIs given their positions and context, and 2) image classification leveraging geo-locations. Results show that, because of its multi-scale representations, Space2Vec outperforms well-established ML approaches such as RBF kernels, multi-layer feed-forward nets, and tile embedding approaches for location modeling and image classification tasks. Detailed analysis shows that each baseline can at best handle distributions at a single scale but performs poorly at other scales, whereas Space2Vec's multi-scale representation can handle distributions at different scales.
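
A minimal sketch of a Space2Vec-style multi-scale encoding: project 2D coordinates onto three unit directions 120 degrees apart (echoing the grid-cell motivation) and apply sinusoids at geometrically spaced wavelengths. The scale range below is illustrative, not the paper's setting.

```python
# Sketch of a multi-scale grid-cell-inspired position encoding;
# wavelength range and scale count are illustrative assumptions.
import numpy as np

def space2vec_encode(xy, n_scales=16, lambda_min=1.0, lambda_max=1e6):
    """xy: (N, 2) coordinates. Returns (N, n_scales * 3 * 2) features."""
    angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (3, 2)
    proj = xy @ dirs.T                                          # (N, 3)
    g = (lambda_max / lambda_min) ** (1.0 / (n_scales - 1))
    scales = lambda_min * g ** np.arange(n_scales)              # wavelengths
    phase = proj[:, None, :] * (2 * np.pi / scales)[None, :, None]
    feats = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)
    return feats.reshape(xy.shape[0], -1)
```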