Arrow Research search

Author name cluster

Bing Cao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
1 author row

Possible papers

11

AAAI Conference 2026 Conference Paper

Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion

  • Xingxin Xu
  • Bing Cao
  • Dongdong Li
  • Qinghua Hu
  • Pengfei Zhu

Image fusion aims to integrate comprehensive information from images acquired through multiple sources. However, images captured by diverse sensors often encounter various degradations that can negatively affect fusion quality. Traditional fusion methods generally treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them; notably, the dominant regions in one modality of a fused image often indicate areas where the other modality might benefit from enhancement. Inspired by this observation, we introduce the concept of dominant regions for image enhancement and present a Dynamic Relative EnhAnceMent framework for Image Fusion (Dream-IF). This framework quantifies the relative dominance of each modality across different layers and leverages this information to facilitate reciprocal cross-modal enhancement. By integrating the relative dominance derived from image fusion, our approach supports not only image restoration but also a broader range of image enhancement applications. Furthermore, we employ prompt-based encoding to capture degradation-specific details, which dynamically steer the restoration process and promote coordinated enhancement in both multi-modal image fusion and image enhancement scenarios. Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts.
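
A minimal sketch of the relative-dominance idea described above: a per-pixel dominance map computed from two modality feature maps decides where each branch gets enhanced by the other. The activation-energy heuristic and all names here are illustrative assumptions, not the Dream-IF implementation.

    # Hypothetical sketch: per-pixel relative dominance of modality A over modality B.
    import torch

    def relative_dominance(feat_a, feat_b, eps=1e-6):
        """feat_*: (B, C, H, W). Returns dominance of A in [0, 1] per spatial location."""
        energy_a = feat_a.abs().mean(dim=1, keepdim=True)  # activation energy per location
        energy_b = feat_b.abs().mean(dim=1, keepdim=True)
        return energy_a / (energy_a + energy_b + eps)

    feat_ir, feat_vis = torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)
    dom_ir = relative_dominance(feat_ir, feat_vis)
    # Where the infrared branch dominates, enhance the visible branch with it, and vice versa.
    feat_vis_enh = feat_vis + dom_ir * feat_ir
    feat_ir_enh = feat_ir + (1 - dom_ir) * feat_vis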

AAAI Conference 2026 Conference Paper

Reconcile Gradient Modulation for Harmony Multimodal Learning

  • Xiyuan Gao
  • Bing Cao
  • Baoquan Gong
  • Pengfei Zhu

Multimodal learning frequently faces two coupled challenges: modality imbalance, where dominant modalities suppress others during training, and modality conflict, where opposing gradient directions hinder optimization. Existing methods typically address these issues in isolation, yet they are intrinsically correlated and most fundamentally reflected in the gradient space: severe imbalance may obscure conflicts, while suppressing conflict may homogenize features and worsen imbalance, degrading fusion performance. To jointly address this coupled challenge, we propose Reconcile Gradient Modulation (RGM), a unified framework that adaptively adjusts gradient magnitude and direction for harmonious multimodal learning. The core of RGM is SynOrth Grad, which minimizes Dirichlet energy to perform minimal gradient surgery. It enhances cooperative synergy when modalities are aligned and enforces orthogonality to preserve uniqueness in conflict situations, thus promoting stable and balanced learning. To guide this modulation, we propose Cumulative Gradient Energy (CGE) as a convergence-guaranteed measure of modality-wise progress, and construct a Balance-nonConflict Plane (BCP) for real-time diagnosis and control of training dynamics. Experiments on diverse benchmarks validate the effectiveness and generalizability of our method, consistently outperforming counterparts designed to handle multimodal imbalance or conflict independently.
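
As a rough illustration of gradient-level reconciliation, the sketch below applies a simple orthogonal projection when two modality gradients conflict (PCGrad-style). The paper's SynOrth Grad instead minimizes Dirichlet energy, so this is only an assumed stand-in for the general idea.

    # Hypothetical stand-in: orthogonal projection removes conflicting gradient components.
    import torch

    def reconcile_gradients(g_a, g_b, eps=1e-12):
        """g_a, g_b: flattened gradients of shared parameters from two modality losses."""
        dot = torch.dot(g_a, g_b)
        if dot >= 0:          # aligned: keep both gradients (cooperation case)
            return g_a, g_b
        # Conflict: project each gradient onto the orthogonal complement of the other.
        g_a_new = g_a - dot / (g_b.norm() ** 2 + eps) * g_b
        g_b_new = g_b - dot / (g_a.norm() ** 2 + eps) * g_a
        return g_a_new, g_b_new

    g_audio = torch.randn(10)
    g_video = -g_audio + 0.1 * torch.randn(10)          # nearly opposite -> conflict
    ga, gb = reconcile_gradients(g_audio, g_video)
    print(torch.dot(ga, g_video).item(), torch.dot(gb, g_audio).item())  # both ~0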

AAAI Conference 2025 Conference Paper

Asymmetric Reinforcing Against Multi-Modal Representation Bias

  • Xiyuan Gao
  • Bing Cao
  • Pengfei Zhu
  • Nannan Wang
  • Qinghua Hu

The strength of multimodal learning lies in its ability to integrate information from various sources, providing rich and comprehensive insights. However, in real-world scenarios, multi-modal systems often face the challenge of dynamic modality contributions: the dominance of different modalities may change with the environment, leading to suboptimal performance in multimodal learning. Current methods mainly enhance weak modalities to balance multimodal representation bias, which inevitably optimizes from a partial-modality perspective and easily degrades the performance of dominant modalities. To address this problem, we propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Moreover, we provide an in-depth analysis showing that optimizing only certain modalities can cause information loss and prevent the full advantages of multimodal data from being leveraged. By exploring modality dominance and narrowing the contribution gaps between modalities, we significantly improve the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
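
A hedged sketch of asymmetric reinforcement: the currently weaker modality's unimodal loss is up-weighted while the dominant branch keeps its own objective. The confidence-gap weighting below is an illustrative simplification standing in for the paper's conditional-mutual-information criterion.

    # Hypothetical confidence-gap weighting; the paper's criterion is based on
    # conditional mutual information, which is not reproduced here.
    import torch
    import torch.nn.functional as F

    def asymmetric_losses(logits_a, logits_b, target):
        loss_a = F.cross_entropy(logits_a, target)
        loss_b = F.cross_entropy(logits_b, target)
        with torch.no_grad():  # per-modality confidence on the ground-truth class
            conf_a = F.softmax(logits_a, dim=1).gather(1, target[:, None]).mean()
            conf_b = F.softmax(logits_b, dim=1).gather(1, target[:, None]).mean()
        # Reinforce only the currently weaker modality, in proportion to the gap,
        # while leaving the dominant modality's objective untouched.
        if conf_a < conf_b:
            return (1 + conf_b - conf_a) * loss_a + loss_b
        return loss_a + (1 + conf_a - conf_b) * loss_b

    logits_img = torch.randn(8, 5, requires_grad=True)
    logits_aud = torch.randn(8, 5, requires_grad=True)
    labels = torch.randint(0, 5, (8,))
    asymmetric_losses(logits_img, logits_aud, labels).backward()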

NeurIPS Conference 2025 Conference Paper

Multimodal Negative Learning

  • Baoquan Gong
  • Xiyuan Gao
  • Pengfei Zhu
  • Qinghua Hu
  • Bing Cao

Multimodal learning systems often encounter challenges related to modality imbalance, where a dominant modality may overshadow others, thereby hindering the learning of weak modalities. Conventional approaches often force weak modalities to align with dominant ones in "Learning to be (the same)" (Positive Learning), which risks suppressing the unique information inherent in the weak modalities. To address this challenge, we offer a new learning paradigm: "Learning Not to be" (Negative Learning). Instead of enhancing weak modalities' target-class predictions, the dominant modalities dynamically guide the weak modality to suppress non-target classes. This stabilizes the decision space and preserves modality-specific information, allowing weak modalities to retain their unique information without being over-aligned. We further analyze multimodal learning from a robustness perspective and theoretically derive the Multimodal Negative Learning (MNL) framework, which introduces a dynamic guidance mechanism tailored for negative learning. Our method provably tightens the robustness lower bound of multimodal learning by increasing the Unimodal Confidence Margin (UCoM) and reduces the empirical error of weak modalities, particularly under noisy and imbalanced scenarios. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generalizability of our approach against competing methods. The code will be available at: https://github.com/BaoquanGong/Multimodal-Negative-Learning.git
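
A minimal sketch of a "learn not to be" objective: the weak branch is trained to push probability mass off non-target classes, with a per-sample guide taken from the dominant branch's confidence. This is an assumed simplification, not the authors' MNL loss.

    # Hypothetical "learning not to be" loss; not the released MNL objective.
    import torch
    import torch.nn.functional as F

    def negative_learning_loss(weak_logits, dominant_logits, target):
        """weak_logits, dominant_logits: (B, C); target: (B,) class indices."""
        B, C = weak_logits.shape
        p_weak = F.softmax(weak_logits, dim=1)
        # The dominant branch's confidence on the true class acts as a per-sample guide.
        guide = F.softmax(dominant_logits, dim=1).gather(1, target[:, None]).squeeze(1).detach()
        non_target = torch.ones_like(p_weak).scatter_(1, target[:, None], 0.0)
        # Suppress every non-target class: maximize log(1 - p_k) for k != target.
        nl = -(non_target * torch.log1p(-p_weak.clamp(max=1 - 1e-6))).sum(dim=1) / (C - 1)
        return (guide * nl).mean()

    logits_weak = torch.randn(4, 10, requires_grad=True)
    logits_dom = torch.randn(4, 10)
    y = torch.randint(0, 10, (4,))
    negative_learning_loss(logits_weak, logits_dom, y).backward()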

AAAI Conference 2024 Conference Paper

Bi-directional Adapter for Multimodal Tracking

  • Bing Cao
  • Junliang Guo
  • Pengfei Zhu
  • Qinghua Hu

Due to the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Considering the limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.) are introduced to compensate for this deficiency and enable all-weather object tracking in complex environments. However, since acquiring sufficient multi-modal tracking data is hard and the dominant modality changes with the open environment, most existing techniques fail to extract multi-modal complementary information dynamically, yielding unsatisfactory tracking performance. To handle this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter that cross-prompts multiple modalities mutually. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract features of each modality separately using a frozen, pre-trained foundation model. We develop a simple but effective lightweight feature adapter to transfer modality-specific information from one modality to another, performing visual feature prompt fusion in an adaptive manner. By adding only a few (0.32M) trainable parameters, our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods. Our code is available at: https://github.com/SparkTempest/BAT.
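
A hedged sketch of a lightweight bottleneck adapter shared between two frozen, parameter-shared encoder branches, in the spirit of the universal bi-directional adapter; the dimensions and zero-initialized up-projection are assumptions, not the released BAT module.

    # Hypothetical adapter module; dimensions and initialization are assumptions.
    import torch
    import torch.nn as nn

    class BiDirectionalAdapter(nn.Module):
        """Down-project, transform, and up-project tokens from one branch so they can
        prompt the other; the same module is applied in both directions."""
        def __init__(self, dim=768, bottleneck=16):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)   # start as a no-op prompt
            nn.init.zeros_(self.up.bias)

        def forward(self, tokens):           # tokens: (B, N, dim)
            return self.up(torch.relu(self.down(tokens)))

    adapter = BiDirectionalAdapter()
    rgb_tokens = torch.randn(2, 196, 768)    # from the frozen RGB encoder branch
    tir_tokens = torch.randn(2, 196, 768)    # from the frozen thermal encoder branch
    # Cross-modal prompting: each branch receives the other branch's adapted features.
    rgb_prompted = rgb_tokens + adapter(tir_tokens)
    tir_prompted = tir_tokens + adapter(rgb_tokens)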

NeurIPS Conference 2024 Conference Paper

Conditional Controllable Image Fusion

  • Bing Cao
  • Xingxin Xu
  • Pengfei Zhu
  • Qilong Wang
  • Qinghua Hu

Image fusion aims to integrate complementary information from multiple input images acquired through various sources to synthesize a new fused image. Existing methods usually employ distinct constraint designs tailored to specific scenes, forming fixed fusion paradigms. However, this data-driven fusion approach is challenging to deploy in varying scenarios, especially in rapidly changing environments. To address this issue, we propose a conditional controllable fusion (CCF) framework for general image fusion tasks without specific training. Owing to the dynamic differences among samples, CCF in practice employs specific fusion constraints for each individual sample. Given the powerful generative capabilities of the denoising diffusion model, we first inject the specific constraints into the pre-trained DDPM as adaptive fusion conditions. The appropriate conditions are dynamically selected so that the fusion process remains responsive to the specific requirements of each reverse diffusion stage. Thus, CCF enables conditionally calibrating the fused images step by step. Extensive experiments validate our effectiveness in general fusion tasks across diverse scenarios against competing methods without additional training. The code is publicly available.
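
A rough sketch of injecting a fusion constraint into the reverse diffusion process as gradient guidance on the current clean-image estimate. The sampler interface, step size, and the max-intensity constraint are illustrative placeholders, not the CCF implementation.

    # Hypothetical guidance step; the sampler wiring and constraint are placeholders.
    import torch

    def guided_estimate(pred_x0, constraint_fn, step_size=0.1):
        """Nudge the model's current clean-image estimate toward satisfying a fusion
        constraint before it is fed back into the next reverse-diffusion step."""
        x0 = pred_x0.detach().requires_grad_(True)
        grad = torch.autograd.grad(constraint_fn(x0), x0)[0]
        return pred_x0 - step_size * grad

    # Example constraint: keep the fused estimate close to the brighter of the two sources.
    ir, vis = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
    constraint = lambda x: ((x - torch.maximum(ir, vis)) ** 2).mean()
    x0_hat = torch.rand(1, 1, 32, 32)        # would come from the pre-trained DDPM
    x0_corrected = guided_estimate(x0_hat, constraint)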

IJCAI Conference 2024 Conference Paper

Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

  • Yiming Sun
  • Bing Cao
  • Pengfei Zhu
  • Qinghua Hu

Infrared and visible image fusion aims to integrate modality strengths for visually enhanced, informative images. Visible imaging in real-world scenarios is susceptible to dynamic environmental brightness fluctuations, leading to texture degradation. Existing fusion methods lack robustness against such brightness perturbations, significantly compromising the visual fidelity of the fused imagery. To address this challenge, we propose the Brightness Adaptive multimodal dynamic fusion framework (BA-Fusion), which achieves robust image fusion despite dynamic brightness fluctuations. Specifically, we introduce a Brightness Adaptive Gate (BAG) module, which is designed to dynamically select features from brightness-related channels for normalization, while preserving brightness-independent structural information within the source images. Furthermore, we propose a brightness consistency loss function to optimize the BAG module. The entire framework is tuned via an alternating training strategy. Extensive experiments validate that our method surpasses state-of-the-art methods in preserving multi-modal image information and visual fidelity, while exhibiting remarkable robustness across varying brightness levels. Our code is available at: https://github.com/SunYM2020/BA-Fusion.
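
A minimal sketch of a brightness-adaptive gate: a channel-wise gate decides how much of the normalized (brightness-corrected) feature versus the raw feature to pass through. The instance-norm and squeeze-style gate are assumptions made for illustration, not the released BAG module.

    # Hypothetical gate module; not the released BA-Fusion code.
    import torch
    import torch.nn as nn

    class BrightnessAdaptiveGate(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.norm = nn.InstanceNorm2d(channels, affine=True)  # strips per-sample brightness statistics
            self.gate = nn.Sequential(                            # channel-wise gate from global context
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x):                 # x: (B, C, H, W)
            g = self.gate(x)                  # (B, C, 1, 1); ~1 for brightness-sensitive channels
            return g * self.norm(x) + (1 - g) * x

    bag = BrightnessAdaptiveGate(32)
    out = bag(torch.randn(2, 32, 64, 64))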

NeurIPS Conference 2024 Conference Paper

Test-Time Dynamic Image Fusion

  • Bing Cao
  • Yinan Xia
  • Yi Ding
  • Changqing Zhang
  • Qinghua Hu

The inherent challenge of image fusion lies in capturing the correlation of multi-source images and comprehensively integrating effective information from different sources. Most existing techniques fail to perform dynamic image fusion and notably lack theoretical guarantees, leading to potential deployment risks in this field. Is it possible to conduct dynamic image fusion with a clear theoretical justification? In this paper, we give our solution from a generalization perspective. We reveal the generalized form of image fusion and derive a new test-time dynamic image fusion paradigm that provably reduces the upper bound of the generalization error. Specifically, we decompose the fused image into multiple components corresponding to its source data. The decomposed components represent the effective information from the source data, so the gap between them reflects the Relative Dominability (RD) of the uni-source data in constructing the fused image. Theoretically, we prove that the key to reducing generalization error hinges on the negative correlation between the RD-based fusion weight and the uni-source reconstruction loss. Intuitively, RD dynamically highlights the dominant regions of each source and can be naturally converted to the corresponding fusion weight, achieving robust results. Extensive experiments and in-depth analysis on multiple benchmarks confirm our findings and superiority. Our code is available at https://github.com/Yinan-Xia/TTD.
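
A hedged sketch of a test-time fusion weight that is negatively correlated with each source's reconstruction error, as a stand-in for the Relative Dominability weighting; the shapes, softmax temperature, and error measure are illustrative assumptions rather than the released TTD code.

    # Hypothetical reconstruction-error weighting; not the released TTD code.
    import torch
    import torch.nn.functional as F

    def dynamic_fusion(source_a, source_b, recon_a, recon_b, tau=0.1):
        """source_*/recon_*: (B, C, H, W). Pixels a source reconstructs well get higher weight."""
        err_a = (recon_a - source_a).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
        err_b = (recon_b - source_b).abs().mean(dim=1, keepdim=True)
        # Fusion weight is negatively correlated with the uni-source reconstruction error.
        w = F.softmax(torch.cat([-err_a, -err_b], dim=1) / tau, dim=1)  # (B, 2, H, W)
        return w[:, :1] * source_a + w[:, 1:] * source_b

    ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
    ir_rec, vis_rec = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)  # from uni-source decoders
    fused = dynamic_fusion(ir, vis, ir_rec, vis_rec)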

AAAI Conference 2020 Conference Paper

Auto-GAN: Self-Supervised Collaborative Learning for Medical Image Synthesis

  • Bing Cao
  • Han Zhang
  • Nannan Wang
  • Xinbo Gao
  • Dinggang Shen

In various clinical scenarios, medical images are crucial for disease diagnosis and treatment. Different modalities of medical images provide complementary information and jointly help doctors make accurate clinical decisions. However, due to clinical and practical restrictions, certain imaging modalities may be unavailable or incomplete. To impute missing data with adequate clinical accuracy, we propose a self-supervised collaborative learning framework to synthesize missing modalities for medical images. The proposed method comprehensively utilizes all available information correlated to the target modality from multi-source-modality images to generate any missing modality with a single model. Different from existing methods, we introduce an autoencoder network as a novel self-supervised constraint, which provides target-modality-specific information to guide generator training. In addition, we design a modality mask vector as the target modality label. Experiments on multiple medical image databases demonstrate the strong generalization ability as well as the specificity of our method compared with other state-of-the-art approaches.
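
A minimal sketch of the modality-mask idea: a one-hot target-modality label is broadcast to a spatial map and concatenated with the available source modalities so that a single generator can synthesize any missing modality. The toy network and shapes are placeholders, not the Auto-GAN architecture.

    # Hypothetical toy setup; shapes and the network are placeholders.
    import torch
    import torch.nn as nn

    num_modalities, B, H, W = 3, 2, 64, 64
    sources = torch.randn(B, num_modalities, H, W)       # missing modalities can be zeroed out
    target_id = torch.tensor([2, 0])                      # which modality to synthesize per sample
    mask = torch.zeros(B, num_modalities, H, W)
    mask[torch.arange(B), target_id] = 1.0                # spatially broadcast one-hot target label

    generator = nn.Sequential(                            # toy stand-in for the synthesis network
        nn.Conv2d(2 * num_modalities, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )
    synthesized = generator(torch.cat([sources, mask], dim=1))   # (B, 1, H, W) target modality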

IJCAI Conference 2019 Conference Paper

Multi-Margin based Decorrelation Learning for Heterogeneous Face Recognition

  • Bing Cao
  • Nannan Wang
  • Xinbo Gao
  • Jie Li
  • Zhifeng Li

Heterogeneous face recognition (HFR) refers to matching face images acquired from different domains, with wide applications in security scenarios. However, HFR is still a challenging problem due to the significant cross-domain discrepancy and the lack of sufficient training data in different domains. This paper presents a deep neural network approach, namely Multi-Margin based Decorrelation Learning (MMDL), to extract decorrelated representations in a hyperspherical space for cross-domain face images. The proposed framework can be divided into two components: a heterogeneous representation network and decorrelation representation learning. First, we employ a large number of accessible visual face images to train the heterogeneous representation network. The decorrelation layer then projects the output of the first component into a decorrelated latent subspace and obtains the decorrelated representation. In addition, we design a multi-margin loss (MML), which consists of a tetrad margin loss (TML) and a heterogeneous angular margin loss (HAML), to constrain the proposed framework. Experimental results on two challenging heterogeneous face databases show that our approach achieves superior performance on both verification and recognition tasks compared with state-of-the-art methods.
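
A rough sketch of an additive angular margin in a hyperspherical space, illustrating the flavor of the heterogeneous angular margin loss (HAML); the margin, scale, and pairing scheme are assumptions, and the tetrad margin loss (TML) is not shown.

    # Hypothetical margin layer; margin and scale values are assumptions.
    import torch
    import torch.nn.functional as F

    def angular_margin_logits(embeddings, class_weights, labels, margin=0.3, scale=32.0):
        """embeddings: (B, D); class_weights: (C, D); labels: (B,) identity indices."""
        cos = F.normalize(embeddings) @ F.normalize(class_weights).t()    # cosines on the hypersphere
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        return scale * torch.where(target, torch.cos(theta + margin), cos)  # margin on the true class

    emb = torch.randn(4, 128)                 # cross-domain face embeddings
    weights = torch.randn(10, 128)            # identity prototypes
    ids = torch.randint(0, 10, (4,))
    loss = F.cross_entropy(angular_margin_logits(emb, weights, ids), ids)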

AAAI Conference 2018 Conference Paper

Asymmetric Joint Learning for Heterogeneous Face Recognition

  • Bing Cao
  • Nannan Wang
  • Xinbo Gao
  • Jie Li

Heterogeneous face recognition (HFR) refers to matching a probe face image taken from one modality to face images acquired from another modality. It plays an important role in security scenarios. However, HFR is still a challenging problem due to the great discrepancies between cross-modality images. This paper proposes an asymmetric joint learning (AJL) approach to handle this issue. The proposed method mutually transforms the cross-modality differences by incorporating synthesized images into the learning process, which provides more discriminative information. Although the aggregated data augment the intra-class scale, they also reduce the inter-class diversity (i.e., discriminative information). We therefore develop the AJL model to balance this dilemma. Finally, we obtain the similarity score between two heterogeneous face images through the log-likelihood ratio. Extensive experiments on viewed sketch, forensic sketch, and near-infrared image databases show that the proposed AJL-HFR method achieves superior performance in comparison to state-of-the-art methods.
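
A minimal sketch of a log-likelihood-ratio verification score under a simple diagonal-Gaussian model of the feature difference, where same-identity differences are assumed to have smaller variance than different-identity ones; the variances here are illustrative constants, whereas the paper learns its model from data.

    # Hypothetical verification score; the variances are illustrative constants.
    import math
    import torch

    def llr_score(f1, f2, var_same=0.5, var_diff=2.0):
        """Log-likelihood ratio: higher means the two features more likely share an identity."""
        d2 = torch.sum((f1 - f2) ** 2).item()
        dim = f1.numel()
        log_p_same = -0.5 * (d2 / var_same + dim * math.log(2 * math.pi * var_same))
        log_p_diff = -0.5 * (d2 / var_diff + dim * math.log(2 * math.pi * var_diff))
        return log_p_same - log_p_diff

    probe = torch.randn(128)                  # e.g., sketch-domain feature
    gallery = torch.randn(128)                # e.g., photo-domain feature
    score = llr_score(probe, gallery)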