
AAAI 2026

CLIPPan: Adapting CLIP as a Supervisor for Unsupervised Pansharpening

Conference Paper · AAAI Technical Track on Computer Vision IV · Artificial Intelligence

Abstract

Despite remarkable advances in supervised pansharpening neural networks, these methods face a resolution domain-adaptation challenge due to the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution scenarios. To bridge this gap, we propose CLIPPan, an unsupervised pansharpening framework that enables model training directly at full resolution by using CLIP, a vision-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging because of its inherent bias toward natural images and its limited understanding of the pansharpening task. We therefore first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss that integrates semantic language constraints, aligning image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions); this enables CLIPPan to use language as a powerful supervisory signal and to guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
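The paper itself is not reproduced on this page, but the abstract's idea of aligning an image-level fusion transition with a textual transition can be illustrated with a CLIP-directional loss. The sketch below is a minimal, hypothetical rendering of that idea, assuming OpenAI's `clip` package; the prompt wording, the three-band handling of multispectral inputs, and the exact loss form are assumptions for illustration, not the authors' formulation.

```python
# Illustrative sketch only (not the authors' code): a CLIP-directional
# semantic loss in the spirit of the abstract, using OpenAI's `clip`.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Hypothetical protocol-aligned prompts describing the fusion transition.
SRC_PROMPT = "a low-resolution multispectral satellite image"
TGT_PROMPT = "a high-resolution pansharpened multispectral satellite image"

@torch.no_grad()
def text_direction(src_prompt: str, tgt_prompt: str) -> torch.Tensor:
    """Direction between the two prompt embeddings in CLIP text space."""
    tokens = clip.tokenize([src_prompt, tgt_prompt]).to(device)
    feats = F.normalize(model.encode_text(tokens).float(), dim=-1)
    return F.normalize(feats[1] - feats[0], dim=-1)

def semantic_direction_loss(lrms_up: torch.Tensor,
                            fused: torch.Tensor) -> torch.Tensor:
    """Align the LRMS -> fused image transition with the prompt transition.

    Both inputs are (N, 3, 224, 224) CLIP-preprocessed tensors; for true
    multispectral data one would first select or project three bands (an
    assumption here, standing in for the paper's CLIP adaptation step).
    """
    f_src = F.normalize(model.encode_image(lrms_up).float(), dim=-1)
    f_tgt = F.normalize(model.encode_image(fused).float(), dim=-1)
    img_dir = F.normalize(f_tgt - f_src, dim=-1)
    txt_dir = text_direction(SRC_PROMPT, TGT_PROMPT)
    # 1 - cosine similarity between image and text transition directions.
    return (1.0 - (img_dir * txt_dir).sum(dim=-1)).mean()
```

In a training loop, `fused` would be the pansharpening network's output (with gradients) and `lrms_up` the upsampled low-resolution input, so minimizing the loss pushes the fusion step to move through CLIP space in the direction the prompts describe.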

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
319341258150250195