ControlFuse: Instruction-guided Multi-Granularity Controllable Image Fusion

Libo Zhao; Xiaoli Zhang; Zeyu Wang

doi:10.1609/aaai.v40i16.38321

AAAI 2026

Conference Paper; AAAI Technical Track on Computer Vision XIII; Artificial Intelligence

Abstract

Infrared and Visible Image Fusion (IVIF) produces enhanced images by fusing complementary visual information from the two modalities. However, most existing methods generate fixed outputs and cannot adapt to user-specific requirements. Recent text-guided approaches offer partial control but operate only at the global or semantic level and lack instance-level control. This limitation arises from two challenges: first, the lack of datasets that directly link textual instructions with corresponding spatial annotations, and second, the use of coarse cross-modal alignment methods that struggle to match textual instructions precisely with visual features. To overcome these challenges, we propose ControlFuse, a controllable IVIF framework that enables multi-granularity fusion at the global, semantic, and instance levels, guided by user instructions. First, we construct an automated multi-granularity dataset that provides explicit text-mask correspondences at these three levels. Second, inspired by manifold geometry, we design a Multimodal Feature Interaction Module (MFIM) comprising a Feature Manifold Converter (FMC) and Curvature-Guided Interaction (CGI). FMC projects textual and visual features into a unified manifold space, while CGI uses manifold curvature as a geometric cue to refine cross-modal alignment. Extensive experiments show that ControlFuse outperforms state-of-the-art methods in robustness and flexibility.
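To make the three control granularities concrete, the following is a minimal illustrative sketch and not the authors' method. It assumes that an instruction has already been resolved to a binary mask, and it replaces the learned fusion network with a simple per-pixel weighted blend. The names `mask_for` and `blend`, the label maps, and the weights are all hypothetical; how ControlFuse derives masks from text through MFIM is not shown here.

```python
# Illustrative only: mask-weighted blending at three granularities.
# This is NOT the ControlFuse network; all names and weights are assumptions.

def mask_for(level, semantic_map, instance_map, target=None):
    """Build a boolean mask for a (hypothetical) already-parsed instruction.

    level  -- "global", "semantic", or "instance"
    target -- class id (semantic) or instance id (instance); unused for global
    """
    if level == "global":
        return [[True] * len(row) for row in semantic_map]
    if level == "semantic":
        return [[c == target for c in row] for row in semantic_map]
    if level == "instance":
        return [[i == target for i in row] for row in instance_map]
    raise ValueError(f"unknown level: {level}")


def blend(ir, vis, mask, ir_weight=0.9, base_weight=0.5):
    """Blend infrared and visible pixels; masked pixels favour infrared."""
    fused = []
    for ir_row, vis_row, m_row in zip(ir, vis, mask):
        row = []
        for i, v, m in zip(ir_row, vis_row, m_row):
            w = ir_weight if m else base_weight
            row.append(w * i + (1 - w) * v)
        fused.append(row)
    return fused


# Toy 2x3 inputs: class 1 covers two columns, split into instances 1 and 2.
semantic_map = [[0, 1, 1], [0, 1, 1]]
instance_map = [[0, 1, 2], [0, 1, 2]]
ir = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
vis = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

fused_global = blend(ir, vis, mask_for("global", semantic_map, instance_map))
fused_semantic = blend(ir, vis, mask_for("semantic", semantic_map, instance_map, 1))
fused_instance = blend(ir, vis, mask_for("instance", semantic_map, instance_map, 2))
```

With these toy inputs the global mask changes every pixel, the semantic mask changes both columns of class 1, and the instance mask changes only the column belonging to instance 2, which is the distinction the abstract draws between the three levels.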
