AAAI 2026 Conference Paper
SGPFeat: Semantic and Geometric Priors for Multi-modal Image Matching
- Yuxin Deng
- Botian Wang
- Kaining Zhang
- Hao Zhang
- Jiayi Ma
Multi-modal image matching is a fundamental task in multi-view and multi-modal image processing. Its key challenge lies in extracting features that remain consistent despite drastic appearance variations across modalities. However, learning such features is hindered by the scarcity and inaccurate alignment of existing multi-modal datasets. To address this, we propose a knowledge distillation framework, termed SGPFeat, that transfers rich prior knowledge from large-scale unimodal tasks to enhance multi-modal representation learning. Specifically, semantic priors from a vision foundation model guide the feature extractor to identify shared semantic structures across modalities, enabling better generalization under large appearance gaps. In parallel, geometric priors derived from accurately aligned visible-light datasets improve detection precision on noisily aligned multi-modal pairs. Furthermore, we introduce a Heterogeneous Feature Aggregation (HFA) module to facilitate effective distillation and feature representation. Extensive experiments demonstrate that semantic and geometric priors bring significant improvements to SGPFeat across diverse multi-modal image matching benchmarks.
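The two-prior distillation objective described above can be pictured as a weighted sum of a semantic term (aligning student features with a frozen vision-foundation-model teacher) and a geometric term (matching the student's keypoint heatmap to one produced by a detector trained on well-aligned visible-light data). The following is a minimal NumPy sketch under those assumptions; the function names, the cosine-similarity and MSE loss forms, and the weighting scheme are illustrative placeholders, not the paper's actual formulation.

```python
import numpy as np

def semantic_distill_loss(student_feat, teacher_feat, eps=1e-8):
    """Semantic prior (illustrative): pull student descriptors toward a
    frozen foundation-model teacher via cosine distance."""
    s = student_feat / (np.linalg.norm(student_feat, axis=-1, keepdims=True) + eps)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + eps)
    # 1 - cos similarity, averaged over all feature vectors
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def geometric_distill_loss(student_heatmap, teacher_heatmap):
    """Geometric prior (illustrative): MSE between the student's keypoint
    heatmap and a teacher heatmap from accurately aligned visible-light data."""
    return float(np.mean((student_heatmap - teacher_heatmap) ** 2))

def total_loss(s_feat, t_feat, s_hm, t_hm, lam=1.0):
    """Combined objective: semantic term + lam * geometric term
    (lam is a hypothetical balancing weight)."""
    return semantic_distill_loss(s_feat, t_feat) + lam * geometric_distill_loss(s_hm, t_hm)
```

When the student exactly reproduces both teachers, each term vanishes, so the sketch behaves as a sanity check that the two priors supervise complementary outputs (descriptors vs. detections) of the same extractor.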