Arrow Research search

Author name cluster

Cong Bai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers

11

AAAI Conference 2026 Conference Paper

PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image–Text Retrieval

  • Pengxiang Ouyang
  • Qing Ma
  • Zheng Wang
  • Cong Bai

Remote sensing (RS) image–text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image–text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive–Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image–text retrieval tasks.

JBHI Journal 2026 Journal Article

Refocal Loss in Transformer for Long-Tailed Multi-Granularity Cataract Classification

  • Qiong Wang
  • Yan Wang
  • Hongdi Sun
  • Yu Feng
  • Zhe Dong
  • Cong Bai

Different cataract types and severities usually require different countermeasures. For automatic cataract diagnosis, existing cataract classification methods group cataracts into common types, such as nuclear cataract, cortical cataract, and posterior subcapsular cataract, while existing cataract grading works aim to achieve fine-grained evaluation of the severity of the most common types of cataract. The severity assessment differs among various types of cataracts, and existing work is limited in predicting various cataract types at different granularity levels. In order to improve diagnostic efficiency, our study explores this matter in the context of multi-granularity cataract classification. Firstly, a large-scale dataset called Multi-Granularity Long-Tailed Cataract is collected. Secondly, an end-to-end training network is proposed, in which the Transformer is investigated for the extraction of multi-granularity cataract features. Moreover, considering the imbalanced cataract data with the long-tailed distribution, the Refocal loss is proposed to rebalance the loss contribution of different classes by enhancing the reciprocal value of the effective number of samples. Compared with state-of-the-art methods, the experiments conducted on the multi-granularity cataract classification dataset demonstrate that the proposed model achieves the highest Precision of 78.22%, F1-score of 68.35%, Kappa of 64.38% and MCC of 64.49%, indicating that the proposed framework is promising in offering physicians reliable quantitative evaluations for multi-granularity cataract classification, which can help guide appropriate treatment decisions before the patient's cataracts worsen.

AAAI Conference 2025 Conference Paper

Dust-Mamba: An Efficient Dust Storm Detection Network with Multiple Data Sources

  • Cong Bai
  • Zhonghao Lin
  • Jinglin Zhang
  • Shengyong Chen

Accurate detection of dust storms is challenging due to complex meteorological interactions. With the development of deep learning, deep neural networks have been increasingly applied to dust storm detection, offering better learning and generalization capabilities compared to traditional physical modeling. However, existing methods face some limitations, leading to performance bottlenecks in dust storm detection. From the task perspective, existing research focuses on occurrence detection while neglecting intensity detection. From the data perspective, existing research fails to explore the utilization of multi-source data. From the model perspective, most models are built on convolutional neural networks, which have an inherent limitation in capturing long-range dependencies. To address these challenges, this study proposes Dust-Mamba. To the best of our knowledge, this study is the first attempt to accomplish both the occurrence and intensity detection of dust storms with advanced deep learning technology. In Dust-Mamba, multi-source data is introduced to provide a comprehensive perspective, and Mamba and attention are applied to boost feature selection while maintaining long-range modeling capability. Additionally, this study proposes Structure Sharing Transfer Learning Strategies for intensity detection, which further enhance the performance of Dust-Mamba with minimal time cost. As shown by experiments, Dust-Mamba achieves Dice scores of 0.963 for occurrence detection and 0.560 for intensity detection, surpassing several baseline models. In conclusion, this study offers valuable baselines for dust storm detection, with significant reference value and promising application potential.

NeurIPS Conference 2025 Conference Paper

IDOL: Meeting Diverse Distribution Shifts with Prior Physics for Tropical Cyclone Multi-Task Estimation

  • Hanting Yan
  • Pan Mu
  • Shiqi Zhang
  • Yuchao Zhu
  • Jinglin Zhang
  • Cong Bai

Tropical Cyclone (TC) estimation aims to accurately estimate various TC attributes in real time. However, distribution shifts arising from the complex and dynamic nature of TC environmental fields, such as varying geographical conditions and seasonal changes, present significant challenges to reliable estimation. Most existing methods rely on multi-modal fusion for feature extraction but overlook the intrinsic distribution of feature representations, leading to poor generalization under out-of-distribution (OOD) scenarios. To address this, we propose an effective Identity Distribution-Oriented Physical Invariant Learning framework (IDOL), which imposes identity-oriented constraints to regulate the feature space under the guidance of prior physical knowledge, thereby handling distribution variability with physical invariance. Specifically, the proposed IDOL employs the wind field model and dark correlation knowledge of TC to model task-shared and task-specific identity tokens. These tokens capture task dependencies and intrinsic physical invariances of TC, enabling robust estimation of TC wind speed, pressure, inner-core, and outer-core size under distribution shifts. Extensive experiments conducted on multiple datasets and tasks demonstrate the superior performance of the proposed IDOL, verifying that imposing identity-oriented constraints based on prior physical knowledge can effectively mitigate diverse distribution shifts in TC estimation.

ECAI Conference 2025 Conference Paper

LS-Mamba: Spatial-Spectral Mamba for Multispectral Cloud Image Semantic Segmentation

  • Qiong Wang
  • Zhiying Hu 0005
  • Cong Bai

Cloud semantic segmentation, which assigns semantic labels to each pixel in multispectral images, plays a critical role in weather analysis and climate studies. Despite recent advancements in deep learning and the emergence of the Mamba architecture, existing methods for cloud segmentation continue to face significant challenges. In particular, current approaches often fall short in effectively modeling the complex relationships among spectral channels, which can lead to ambiguous representations and result in misclassification, especially of spectrally similar cloud types. Additionally, while Mamba excels at long-range modeling, it often overlooks local 2D structural dependencies, resulting in inaccuracies for clouds with complex spatial distributions. To address these challenges, we propose a novel cloud semantic segmentation model based on Spatial-Spectral Mamba. We design a spectral Mamba block (SpeMamba) to capture intricate intraspectral relationships to improve discrimination between confused cloud types, and also design a spatial Mamba block to model local-global dependencies through local-global scanning, preserving fine-grained spatial structures while maintaining global features. The proposed method is evaluated on the Himawari-8 image dataset, and the experimental results demonstrate its effectiveness, achieving new state-of-the-art performance. Code is available at https://github.com/Zjut-MultimediaPlus/LS-Mamba.

AAAI Conference 2025 Conference Paper

TC-Diffuser: Bi-Condition Multi-Modal Diffusion for Tropical Cyclone Forecasting

  • Shiqi Zhang
  • Pan Mu
  • Cheng Huang
  • Jinglin Zhang
  • Cong Bai

Tropical cyclones (TCs) are complex weather systems with strong winds and heavy rainfall, causing substantial loss of life and property. Therefore, accurate TC forecasting is crucial for the effective prevention of disasters caused by TCs. TC forecasting can be regarded as a spatio-temporal prediction problem. It has been proven that using multi-modal data can effectively introduce atmospheric information to achieve better prediction results and higher interpretability, but it also inevitably introduces noise into the prediction process. The diffusion model's unique noise modeling capability can reduce prediction noise when using multi-modal datasets. However, adapting it to TC forecasting has two main challenges: how to extract valuable information from multi-modal data, and how to utilize that information to guide the generation process. For the first challenge, while recent methods can predict multiple TC attributes using multi-modal data, they often overlook the interdependence of multiple attributes and the semantic gap between modalities. Considering the interdependence of attributes, we propose two condition generators that capture the commonalities and characteristics of TC attributes, extracting spatio-temporal and environmental features and incorporating expert knowledge. To reduce the semantic gap between multi-modal data, we introduce the PGSA-LSTM module to map primary and auxiliary modalities. For the second challenge, we propose a novel Bi-condition diffusion model that sequentially processes conditions from the characteristics to commonalities of attributes, thereby expanding the guidance information that the diffusion model can accept. Our results surpass state-of-the-art deep learning models and outperform the numerical weather prediction model used by the China Central Meteorological Observatory. TC-Diffuser shows high generalizability across global ocean areas, strong robustness in handling missing data, and higher computational efficiency.

ICML Conference 2025 Conference Paper

TCP-Diffusion: A Multi-modal Diffusion Model for Global Tropical Cyclone Precipitation Forecasting with Change Awareness

  • Cheng Huang
  • Pan Mu
  • Cong Bai
  • Peter AG Watson

Deep learning methods have made significant progress in regular rainfall forecasting, yet the more hazardous tropical cyclone (TC) rainfall has not received the same attention. While regular rainfall models can offer valuable insights for designing TC rainfall forecasting models, most existing methods suffer from cumulative errors and lack physical consistency. Additionally, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. To address these issues, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for forecasting of TC precipitation given an existing TC in any location globally. It forecasts rainfall around the TC center for the next 12 hours at 3-hourly resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend and gives our model the capability of rainfall change awareness, reducing cumulative errors and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-modal framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF).

AAAI Conference 2025 Conference Paper

Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

  • Zekai Zhang
  • Qinghui Chen
  • Maomao Xiong
  • Shijiao Ding
  • Zhanzhi Su
  • Xinjie Yao
  • Yiming Sun
  • Cong Bai

Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects, which often leads to suboptimal performance due to the inclusion of irrelevant pixels. In addition, data scarcity has left the application of LVLMs in industrial scenarios largely unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO) containing 80K+ samples. MMIO covers diverse industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning, and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides an RTVP specifically for industrial zero-shot tasks. RTVP has two significant advantages: First, this paper designs an expert-guided large model domain adaptation mechanism and an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers text-visual prompt interactions ignored by previous LVLMs, improving visual and textual content understanding. RTVP achieves state-of-the-art results with 42.2% and 24.7% AP in the zero-shot and closed scenes of MMIO, respectively.

ECAI Conference 2024 Conference Paper

Phy-CoCo: Physical Constraint-Based Correlation Learning for Tropical Cyclone Intensity and Size Estimation

  • Hanting Yan
  • Pan Mu
  • Cheng Huang
  • Jinglin Zhang
  • Cong Bai

Tropical Cyclone (TC) estimation aims to estimate various attributes of TC in real-time to alleviate and prevent disasters caused by violent TCs. As artificial intelligence technology advances, various deep learning-based multi-task estimation approaches have been proposed. However, most of them only focus on extracting common features of tasks, disregarding potential negative transfer and task interactions between different tasks. This paper is thus motivated to propose a Physical Constraint-based Correlation (Phy-CoCo) learning framework from the perspective of Multi-Task Learning (MTL). Specifically, for task-specific feature learning, we introduce Correlation Modeling (CoM) based on Centrally Expanded Pooling (CEP). Furthermore, for cross-task interaction, we propose a Multi-Domain Recurrent Convolution (MDRC) module to incorporate physical constraints into MTL. These physical constraints enable the transformation of different task features by simulating the physical relations among different attributes of TC. Lastly, in combination with a task-shared network that leverages the hybrid fusion of multi-modal data, our MTL framework accurately estimates various TC attributes. Extensive experiments conducted on our constructed dataset demonstrate that the proposed Phy-CoCo outperforms previous methods in TC estimation in terms of estimation error, verifying the potential of the physics-incorporated MTL model.

AAAI Conference 2023 Conference Paper

MGTCF: Multi-Generator Tropical Cyclone Forecasting with Heterogeneous Meteorological Data

  • Cheng Huang
  • Cong Bai
  • Sixian Chan
  • Jinglin Zhang
  • YuQuan Wu

Accurate forecasting of tropical cyclones (TCs) plays a critical role in preventing and defending against TC disasters, motivating the search for more accurate prediction methods. Deep learning methods are increasingly being implemented to make TC prediction more accurate. However, most existing methods lack a generic framework for adapting heterogeneous meteorological data and do not focus on the importance of the environment. Therefore, we propose a Multi-Generator Tropical Cyclone Forecasting model (MGTCF), a generic, extensible, multi-modal TC prediction model with the key modules of Generator Chooser Network (GC-Net) and Environment Net (Env-Net). The proposed method can utilize heterogeneous meteorological data efficiently and mine environmental factors. In addition, the Multi-generator with Generator Chooser Net is proposed to tackle the drawbacks of single-generator TC prediction methods: the prediction of undesired out-of-distribution samples and the problems stemming from insufficient learning ability. To prove the effectiveness of MGTCF, we conduct extensive experiments on the China Meteorological Administration Tropical Cyclone Best Track Dataset. MGTCF obtains better performance compared with other deep learning methods and outperforms the official prediction method of the China Central Meteorological Observatory in most indexes.

ICRA Conference 2023 Conference Paper

SGPT: The Secondary Path Guides the Primary Path in Transformers for HOI Detection

  • Sixian Chan 0001
  • Weixiang Wang
  • Zhanpeng Shao
  • Cong Bai

HOI detection is essential for human-computer interaction, especially in behavior detection and robot manipulation. Existing mainstream transformer methods for HOI detection focus on single-stream detection only, e.g., $image \rightarrow HOI(\mathcal{P}_{1})$ or $image \rightarrow HO \rightarrow I(\mathcal{P}_{2})$. Each path has its own strengths, so we propose a novel method in which the Secondary path $(\mathcal{P}_{2})$ Guides the Primary path $(\mathcal{P}_{1})$ in Transformers (SGPT). SGPT contains two core modules: the Dual-Path Consistency (DPC) module and the Instance Interaction Attention (IIA) module. DPC keeps human, object, and interaction consistent across the dual paths and lets $\mathcal{P}_{2}$ guide $\mathcal{P}_{1}$ to learn more meaningful features. IIA fuses human and object to enhance interaction in $\mathcal{P}_{2}$, which allows instances to constrain interactions. The proposed dual paths are employed during training, and only $\mathcal{P}_{1}$ is used for inference. Hence, SGPT improves generalization without increasing model capacity, outperforming the state of the art on the HICO-DET and V-COCO datasets. The code of this work is available at https://github.com/visualVk/sgpt.git.