Arrow Research search

Author name cluster

Bin Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

44 papers
2 author rows

Possible papers

44

JBHI Journal 2026 Journal Article

A Conditional GAN-based Framework for Sparse sEMG Data Augmentation with Muscle Synergy Prior Constraints

  • Meiju Li
  • Zijun Wei
  • Zhi-Qiang Zhang
  • Bin Yang
  • Sheng Quan Xie

The scarcity of high-quality surface electromyography (sEMG) data, caused by ethical constraints, privacy concerns, and noise interference, poses significant challenges for developing robust deep learning models in sEMG analysis. Multi-channel sEMG signals exhibit complex inter-channel correlations reflecting neuromuscular coordination. However, existing generative methods suffer from error accumulation in sequential channel generation, insufficient inter-channel relationship modeling, and a lack of physiological constraints, producing signals that appear valid in data-driven terms but are physiologically implausible, compromising biological fidelity for clinical applications. To address these fundamental limitations, we propose a Muscle Synergy-Constrained Conditional GAN (MS-cGAN) framework that simultaneously generates multi-channel sEMG signals while preserving biomechanical fidelity. First, we design a novel Graph Convolutional Network (GCN)-based generator architecture tailored to sparse sEMG signals, which captures and models complex inter-channel relationships through graph-based representation learning, thereby circumventing error accumulation by exploiting the inherent inter-channel correlations. Second, we integrate Muscle Synergy (MS) prior constraints as dynamic loss functions grounded in MS theory, which confine generator optimization to a physiologically plausible parameter space and ensure that generated signals maintain synergistic consistency with the underlying physiological mechanisms. Finally, experiments on IRASS datasets and public datasets (NinaPro DB1 and DB2) demonstrate that MS-cGAN significantly improves signal authenticity and enhances downstream task performance compared to traditional GANs and state-of-the-art diffusion models. The generated data effectively supplement scarce sEMG datasets and improve kinematic prediction precision for deep learning models.

AAAI Conference 2026 Conference Paper

Domain-Aware Suppression and Aggregation for Federated DG ReID

  • Zhixi Yu
  • Wei Liu
  • Wenke Huang
  • Bin Yang
  • Qian Bie
  • Guancheng Wan
  • Xin Xu

Federated domain generalization in person re-identification (FedDG-ReID) aims to learn a privacy-preserving server model from decentralized client source domains that generalizes to unseen domains. Existing approaches enhance the generalizability of the server model by increasing the diversity of client person data. However, these methods overlook that ReID model parameters are easily biased by client-specific data distributions, leading to the capture of excessive domain-specific identity information. Such identity information (e.g., clothing style) conflicts with identity cues in unseen domains, thereby hindering the generalization ability of the server model. To address this, we propose FedSupWA, a novel FedDG-ReID framework that mainly consists of Domain-aware Parameter Suppression (DPS) and Domain-invariant Weighted Aggregation (DWA). Specifically, DPS adaptively attenuates the update magnitude of parameters based on how well they fit the client's domain, encouraging the model to focus on more generalized, domain-independent identity information, such as pedestrian contours and other information that is consistent across domains. DWA enhances the server model's generalization by evaluating how effectively each client model maintains the consistency of pedestrian identities, using this to measure the importance of the learned domain-independent identity information and assigning greater aggregation weights to clients that contribute more generalized information. Extensive experiments demonstrate the effectiveness of FedSupWA, showing that it achieves state-of-the-art performance.

AAAI Conference 2026 Conference Paper

DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation

  • Mohamed Abdelsamad
  • Michael Ulrich
  • Bin Yang
  • Miao Zhang
  • Yakov Miron
  • Abhinav Valada

Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantic distributions. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmap distillation offers a scalable and effective paradigm for learning robust 3D representations.
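
The Zipf-Sinkhorn step described above (standard Sinkhorn-Knopp balancing, but with a power-law target marginal over prototype usage) can be sketched as follows. This is a minimal illustration under assumed defaults; the function name, alpha, eps, and iteration count are assumptions, not the paper's implementation:

```python
import numpy as np

def zipf_sinkhorn(scores, alpha=1.0, eps=0.5, n_iters=50):
    """Sinkhorn-Knopp over point-to-prototype scores, with a Zipfian
    (power-law) target marginal on prototype usage instead of uniform."""
    n, k = scores.shape
    # Target column marginal: p_j proportional to j^(-alpha), j = 1..k.
    col_prior = np.arange(1, k + 1, dtype=float) ** -alpha
    col_prior /= col_prior.sum()
    row_prior = np.full(n, 1.0 / n)            # each point used equally
    Q = np.exp((scores - scores.max()) / eps)  # temperature-scaled scores
    Q /= Q.sum()
    for _ in range(n_iters):
        Q *= (row_prior / Q.sum(axis=1))[:, None]  # fit row marginal
        Q *= (col_prior / Q.sum(axis=0))[None, :]  # fit Zipfian column marginal
    return Q  # soft assignments whose column mass follows the power-law prior
```

After convergence, frequent prototypes absorb most of the assignment mass by construction, mirroring an unbalanced semantic distribution rather than forcing uniform prototype usage.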

AAAI Conference 2026 Conference Paper

MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

  • Jing Li
  • Yifan Wang
  • Jiafeng Yan
  • Renlong Zhang
  • Bin Yang

Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose a degradation-aware channel attention module (DCAM), which employs degradation prototype decomposition to facilitate multi-modal feature interaction in the channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.

AAAI Conference 2026 Conference Paper

Rethinking Irregular Time Series Forecasting: A Simple Yet Effective Baseline

  • Xvyuan Liu
  • Xiangfei Qiu
  • Xingjian Wu
  • Zhengyu Li
  • Chenjuan Guo
  • Jilin Hu
  • Bin Yang

The forecasting of irregular multivariate time series (IMTS) is crucial in key areas such as healthcare, biomechanics, climate science, and astronomy. However, achieving accurate and practical predictions is challenging due to two main factors. First, the inherent irregularity and data missingness in irregular time series make modeling difficult. Second, most existing methods are typically complex and resource-intensive. In this study, we propose a general framework called APN to address these challenges. Specifically, we design a novel Time-Aware Patch Aggregation (TAPA) module that achieves adaptive patching. By learning dynamically adjustable patch boundaries and a time-aware weighted averaging strategy, TAPA transforms the original irregular sequences into high-quality, regularized representations in a channel-independent manner. Additionally, we use a simple query module to effectively integrate historical information while maintaining the model's efficiency. Finally, predictions are made by a shallow MLP. Experimental results on multiple real-world datasets show that APN outperforms existing state-of-the-art methods in both efficiency and accuracy.
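
The time-aware weighted averaging behind TAPA can be pictured with a simplified stand-in: fixed Gaussian windows around assumed patch centres replace APN's learnable, dynamically adjusted boundaries, and every observation contributes to a patch with a weight that decays with its temporal distance. All names and the kernel choice here are illustrative assumptions:

```python
import numpy as np

def time_aware_patch_aggregate(t, x, centers, width):
    """Aggregate an irregularly sampled series (timestamps t, values x)
    into one representation per patch by time-aware weighted averaging.
    Gaussian windows of a fixed width stand in for learnable boundaries."""
    # w[i, j]: contribution of observation j to patch i, decaying with
    # the distance between timestamp t[j] and patch centre centers[i].
    w = np.exp(-((centers[:, None] - t[None, :]) / width) ** 2)
    w /= w.sum(axis=1, keepdims=True)  # normalise weights per patch
    return w @ x                        # regularised value per patch
```

Because each patch's weights are normalised, aggregating a constant series returns that constant, and the output is a regular sequence of one value per patch regardless of the input's sampling pattern.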

AAAI Conference 2026 Conference Paper

Robust Pedestrian Detection with Uncertain Modality

  • Qian Bie
  • Xiao Wang
  • Bin Yang
  • Zhixi Yu
  • Jun Chen
  • Xin Xu

Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24-hour surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. In contrast, near-infrared (NIR) imaging captures texture under low-light conditions, effectively alleviating the performance issues of RGB and the detail loss of TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB–NIR–TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenges existing CMPD methods: they fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement module to ensure the reliability of information within each modality. Furthermore, we design a Modality-Aware Interaction (MAI) module that adaptively activates or deactivates its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities. AUNet enables accurate modality validation and robust inference without fixed modality pairings, facilitating the effective fusion of RGB, NIR, and TIR information across diverse inputs.

AAAI Conference 2026 Conference Paper

Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

  • Junkai Lu
  • Peng Chen
  • Chenjuan Guo
  • Yang Shu
  • Meng Wang
  • Bin Yang

Time series forecasting is critical for decision making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which poses significant challenges for existing long-term time series forecasting methods. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mixture-of-experts (MoE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on multiple real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions.

AAAI Conference 2026 Conference Paper

UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization

  • Cuiqun Chen
  • Qi Chen
  • Bin Yang
  • Xingyi Zhang

Cross-view geo-localization (CVGL) matches query images (e.g., drone) to geographically corresponding opposite-view imagery (e.g., satellite). While supervised methods achieve strong performance, their reliance on extensive pairwise annotations limits scalability. Unsupervised alternatives avoid annotation costs but suffer from noisy pseudo-labels due to intrinsic cross-view domain gaps. To address these limitations, we propose UniABG, a novel dual-stage unsupervised cross-view geo-localization framework integrating adversarial view bridging with graph-based correspondence calibration. Our approach first employs View-Aware Adversarial Bridging (VAAB) to model view-invariant features and enhance pseudo-label robustness. Subsequently, Heterogeneous Graph Filtering Calibration (HGFC) refines cross-view associations by constructing dual inter-view structure graphs, achieving reliable view correspondence. Extensive experiments demonstrate state-of-the-art unsupervised performance, showing that UniABG improves Satellite → Drone AP by +10.63% on University-1652 and +16.73% on SUES-200, even surpassing supervised baselines.

IJCAI Conference 2025 Conference Paper

An Empirical Study of Federated Prompt Learning for Vision Language Model

  • Zhihao Wang
  • Wenke Huang
  • Tian Chen
  • Zekun Shi
  • Guancheng Wan
  • Yu Qiao
  • Bin Yang
  • Jian Wang

The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning (LPT) and vision prompt learning (VPT) under data heterogeneity challenges, including label skew and domain shift. We conduct extensive experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Furthermore, we explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist, including leveraging both prompt types when computational resources allow. Our findings offer practical insights into optimizing prompt learning in federated settings, contributing to the broader deployment of VLMs in privacy-preserving environments.

AAAI Conference 2025 Conference Paper

Assessing Pre-Trained Models for Transfer Learning Through Distribution of Spectral Components

  • Tengxue Zhang
  • Yang Shu
  • Xinyang Chen
  • Yifei Long
  • Chenjuan Guo
  • Bin Yang

Pre-trained model assessment for transfer learning aims to identify the optimal candidate for downstream tasks from a model hub, without the need for time-consuming fine-tuning. Existing advanced works mainly focus on analyzing the intrinsic characteristics of the entire features extracted by each pre-trained model or how well such features fit the target labels. This paper proposes a novel perspective for pre-trained model assessment through the Distribution of Spectral Components (DISCO). Through singular value decomposition of features extracted from pre-trained models, we investigate different spectral components and observe that they possess distinct transferability, contributing diversely to the fine-tuning performance. Inspired by this, we propose an assessment method based on the distribution of spectral components, which measures the proportions of their corresponding singular values. Pre-trained models with features concentrating on more transferable components are regarded as better choices for transfer learning. We further leverage the labels of downstream data to better estimate the transferability of each spectral component and derive the final assessment criterion. Our proposed method is flexible and can be applied to both classification and regression tasks. We conduct comprehensive experiments across three benchmarks and two tasks, including image classification and object detection, demonstrating that our method achieves state-of-the-art performance in choosing proper pre-trained models from the model hub for transfer learning.

NeurIPS Conference 2025 Conference Paper

CrossAD: Time Series Anomaly Detection with Cross-scale Associations and Cross-window Modeling

  • Beibu Li
  • Qichao Shentu
  • Yang Shu
  • Hui Zhang
  • Ming Li
  • Ning Jin
  • Bin Yang
  • Chenjuan Guo

Time series anomaly detection plays a crucial role in a wide range of real-world applications. Given that time series data can exhibit different patterns at different sampling granularities, multi-scale modeling has proven beneficial for uncovering latent anomaly patterns that may not be apparent at a single scale. However, existing methods often model multi-scale information independently or rely on simple feature fusion strategies, neglecting the dynamic changes in cross-scale associations that occur during anomalies. Moreover, most approaches perform multi-scale modeling based on fixed sliding windows, which limits their ability to capture comprehensive contextual information. In this work, we propose CrossAD, a novel framework for time series Anomaly Detection that takes Cross-scale associations and Cross-window modeling into account. We propose a cross-scale reconstruction that reconstructs fine-grained series from coarser series, explicitly capturing cross-scale associations. Furthermore, we design a query library and incorporate global multi-scale context to overcome the limitations imposed by fixed window sizes. Extensive experiments conducted on seven real-world datasets using nine evaluation metrics validate the effectiveness of CrossAD, demonstrating state-of-the-art performance in anomaly detection.

NeurIPS Conference 2025 Conference Paper

DBLoss: Decomposition-based Loss Function for Time Series Forecasting

  • Xiangfei Qiu
  • Xingjian Wu
  • Hanyin Cheng
  • Xvyuan Liu
  • Chenjuan Guo
  • Jilin Hu
  • Bin Yang

Time series forecasting holds significant value in various domains such as economics, traffic, energy, and AIOps, as accurate predictions facilitate informed decision-making. However, the existing Mean Squared Error (MSE) loss function sometimes fails to accurately capture the seasonality or trend within the forecasting horizon, even when decomposition modules are used in the forward propagation to model the trend and seasonality separately. To address these challenges, we propose a simple yet effective Decomposition-Based Loss function called DBLoss. This method uses exponential moving averages to decompose the time series into seasonal and trend components within the forecasting horizon, and then calculates the loss for each of these components separately, followed by weighting them. As a general loss function, DBLoss can be combined with any deep learning forecasting model. Extensive experiments demonstrate that DBLoss significantly improves the performance of state-of-the-art models across diverse real-world datasets and provides a new perspective on the design of time series loss functions.
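
A rough sketch of the decomposition-based idea, assuming a simple recursive EMA as the trend extractor and equal component weights (function names and defaults are illustrative, not the paper's exact formulation):

```python
import numpy as np

def ema(x, alpha=0.3):
    """Exponential moving average along the time axis (trend extractor)."""
    out = np.empty_like(x, dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

def db_loss(pred, target, alpha=0.3, w_trend=0.5, w_season=0.5):
    """Decomposition-based loss: EMA splits each series into trend and
    seasonal components; MSE is computed per component, then weighted."""
    pred_t, tgt_t = ema(pred, alpha), ema(target, alpha)
    pred_s, tgt_s = pred - pred_t, target - tgt_t       # seasonal residuals
    mse = lambda a, b: np.mean((a - b) ** 2)
    return w_trend * mse(pred_t, tgt_t) + w_season * mse(pred_s, tgt_s)
```

A perfect forecast incurs zero loss, and a mismatch in either the trend or the seasonal component raises the loss in proportion to that component's weight, which is what lets the loss penalize trend and seasonality errors separately.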

NeurIPS Conference 2025 Conference Paper

Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective

  • Xingjian Wu
  • Xiangfei Qiu
  • Hanyin Cheng
  • Zhengyu Li
  • Jilin Hu
  • Chenjuan Guo
  • Bin Yang

Time Series Forecasting has made significant progress with the help of the Patching technique, which partitions a time series into multiple patches to effectively retain contextual semantic information in a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which fixes the representation space and thus yields insufficiently expressive representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming to fully exploit the information of the contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of the SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plug-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.

NeurIPS Conference 2025 Conference Paper

Learning to Factorize Spatio-Temporal Foundation Models

  • Siru Zhong
  • Junjie Qiu
  • Yangyu Wu
  • Xingchen Zou
  • Zhongwen Rao
  • Bin Yang
  • Chenjuan Guo
  • Hao Xu

Spatio-Temporal Foundation Models (STFMs) promise zero/few-shot generalization across various datasets, yet joint spatio-temporal pretraining is computationally prohibitive and struggles with domain-specific spatial correlations. To this end, we introduce FactoST, a factorized STFM that decouples universal temporal pretraining from spatio-temporal adaptation. The first stage pretrains a space-agnostic backbone with multi-frequency reconstruction and domain-aware prompting, capturing cross-domain temporal regularities at low computational cost. The second stage freezes or further fine-tunes the backbone and attaches an adapter that fuses spatial metadata, sparsifies interactions, and aligns domains with continual memory replay. Extensive forecasting experiments reveal that, in the few-shot setting, FactoST reduces MAE by up to 46.4% versus UniST, uses 46.2% fewer parameters, and achieves 68% faster inference than OpenCity, while remaining competitive with expert models. We believe this factorized view offers a practical and scalable path toward truly universal STFMs. The code will be released upon notification.

JBHI Journal 2025 Journal Article

MACTFusion: Lightweight Cross Transformer for Adaptive Multimodal Medical Image Fusion

  • Xinyu Xie
  • Xiaozhi Zhang
  • Xinglong Tang
  • Jiaxi Zhao
  • Dongping Xiong
  • Lijun Ouyang
  • Bin Yang
  • Hong Zhou

Multimodal medical image fusion aims to integrate complementary information from different modalities of medical images. Deep learning methods, especially recent vision Transformers, have effectively improved image fusion performance. However, Transformers have limitations in image fusion, such as the lack of local feature extraction and cross-modal feature interaction, resulting in insufficient multimodal feature extraction and integration. In addition, Transformers incur a high computational cost. To address these challenges, in this work we develop an adaptive cross-modal fusion strategy for unsupervised multimodal medical image fusion. Specifically, we propose a novel lightweight cross Transformer based on a cross multi-axis attention mechanism. It includes cross-window attention and cross-grid attention to mine and integrate both local and global interactions of multimodal features. The cross Transformer is further guided by a spatial adaptation fusion module, which allows the model to focus on the most relevant information. Moreover, we design a special feature extraction module that combines multiple gradient residual dense convolutional and Transformer layers to obtain local features from coarse to fine and capture global features. The proposed strategy significantly boosts fusion performance while minimizing computational costs. Extensive experiments, including clinical brain tumor image fusion, have shown that our model can achieve clearer texture details and better visual quality than other state-of-the-art fusion methods.

ICML Conference 2025 Conference Paper

Non-asymptotic Error Bounds in W2-Distance with Sqrt(d) Dimension Dependence and First Order Convergence for Langevin Monte Carlo beyond Log-Concavity

  • Bin Yang
  • Xiaojie Wang

Generating samples from a high-dimensional probability distribution is a fundamental task with wide-ranging applications in scientific computing, statistics, and machine learning. This article revisits the popular Langevin Monte Carlo (LMC) sampling algorithms and provides a non-asymptotic error analysis in $\mathcal{W}_2$-distance in a non-convex setting. In particular, we prove an error bound $O(\sqrt{d} h)$, which guarantees a mixing time $\tilde{O}(\sqrt{d} \epsilon^{-1})$ to achieve the accuracy tolerance $\epsilon$, under certain log-smooth conditions and the assumption that the target distribution satisfies a log-Sobolev inequality, as opposed to the strongly log-concave condition used in (Li et al., 2019; 2022). This bound matches the best one in the strongly log-concave case and improves upon the best-known convergence rates in non-convex settings. To prove it, we establish a new framework of uniform-in-time convergence for discretizations of SDEs. Distinct from (Li et al., 2019; 2022), we start from the finite-time mean-square fundamental convergence theorem, which, combined with uniform-in-time moment bounds of LMC and the exponential ergodicity of SDEs in the non-convex setting, gives the desired uniform-in-time convergence. Our framework also applies to the case when the gradient of the potential $U$ is non-globally Lipschitz with superlinear growth, for which modified LMC samplers are proposed and analyzed, with a non-asymptotic error bound in $\mathcal{W}_2$-distance obtained. Numerical experiments corroborate the theoretical analysis.
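
For concreteness, the LMC iteration analysed here is the standard unadjusted Langevin scheme $x_{k+1} = x_k - h \nabla U(x_k) + \sqrt{2h}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$. A minimal sketch (the paper's modified samplers for superlinearly growing gradients are not shown):

```python
import numpy as np

def langevin_monte_carlo(grad_U, x0, h, n_steps, rng):
    """Unadjusted Langevin Monte Carlo targeting the density exp(-U):
    x_{k+1} = x_k - h * grad_U(x_k) + sqrt(2h) * xi_k, xi_k ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# For U(x) = ||x||^2 / 2 (standard Gaussian target), grad_U is the identity
# map, and long chains should match N(0, I) up to the discretization bias.
```

Running many parallel chains on the Gaussian example and checking their empirical mean and variance is a quick sanity test of the step size $h$.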

NeurIPS Conference 2025 Conference Paper

Rethinking Fair Federated Learning from Parameter and Client View

  • Kaiqi Guan
  • Wenke Huang
  • Xianda Guo
  • Yueyang Yuan
  • Bin Yang
  • Mang Ye

Federated Learning is a promising technique that enables collaborative machine learning while preserving participant privacy. With respect to multi-party collaboration, achieving performance fairness is a critical challenge in federated systems. Existing explorations mainly focus on enforcing fairness across all parameters and consistently protecting weak clients to achieve performance fairness in federation. However, these approaches neglect two critical issues. 1) Parameter Redundancy: Redundant parameters that are unnecessary for fairness training may conflict with critical parameter updates, thereby leading to performance degradation. 2) Persistent Protection: Current fairness mechanisms persistently enhance weak clients throughout the entire training cycle, hindering global optimization and causing lower performance alongside unfairness. To address these, we propose a strategy with two key components. First, a parameter adjustment with mask and rescale, which discards redundant parameters and highlights critical ones, preserving key parameter updates and reducing conflicts. Second, observing that the federated training process exhibits distinct characteristics across different phases, we propose a dynamic aggregation strategy that adaptively weights clients based on local update directions and performance variations. Empirical results on single-domain and cross-domain scenarios demonstrate the effectiveness of the proposed solution and the efficiency of its crucial modules. The code is available at https://github.com/guankaiqi/FedPW.
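
The mask-and-rescale component might look like a top-k magnitude mask whose surviving entries are rescaled to preserve the update's L1 mass. This is only an illustrative reading of the described strategy, not the released FedPW code (the function name and keep ratio are assumptions):

```python
import numpy as np

def mask_and_rescale(update, keep_ratio=0.2):
    """Keep only the largest-magnitude entries of a client update (mask)
    and rescale the survivors so the update's total L1 mass is preserved."""
    flat = np.abs(update).ravel()
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.partition(flat, -k)[-k]        # k-th largest magnitude
    mask = np.abs(update) >= thresh            # discard redundant entries
    masked = update * mask
    scale = np.abs(update).sum() / max(np.abs(masked).sum(), 1e-12)
    return masked * scale                      # highlight critical entries
```

Dropping the small entries removes redundant directions that could conflict with critical ones, while the rescaling keeps the surviving updates at the original overall magnitude.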

JBHI Journal 2025 Journal Article

SpectFusion: Cross-modal Spectrum-aware Attention Network for Unsupervised Multimodal Medical Image Fusion

  • Lamei Wang
  • Xinyu Xie
  • Youxi Yang
  • Dongping Xiong
  • Hong Zhou
  • Bin Yang
  • Kok Lay Teo
  • Bingo Wing-Kuen Ling

Medical image fusion aims to synthesize relevant and complementary information from different modalities, thereby enhancing clinical diagnosis. Current deep learning-based fusion approaches, particularly Transformer-based architectures, have achieved remarkable results due to their strong capacity for modeling long-range dependencies. However, there are still limitations in capturing sufficient global information because of the window-based local attention mechanism. Moreover, existing fusion schemes predominantly focus on spatial features while rarely considering spectral features, thus affecting the fusion performance. To address these challenges, we propose a new unsupervised cross-modal spectrum-aware fusion framework, named SpectFusion, for medical image fusion. Specifically, we devise a spatial-spectrum hybrid block, which effectively extracts fine-grained local features via a gradient retention strategy in the spatial domain, and captures global features with an image-wide receptive field through Fourier convolution in the frequency domain. Furthermore, we develop a novel cross-modal spectrum-aware attention to facilitate spatial-spectrum information interactions during fusion. It dynamically guides the retention of relevant spectral components while integrating multimodal spatial features. Additionally, to achieve more precise alignment of image pairs, we incorporate a refined registration module to correct minor local deviations. We also define corresponding frequency and spatial domain losses to jointly constrain the proposed SpectFusion. By leveraging spatial-spectrum information interactions, fine-grained fusion can be adaptively realized. Extensive experiments, including clinical brain tumor image fusion, demonstrate that SpectFusion outperforms other state-of-the-art methods both qualitatively and quantitatively. We show that SpectFusion can boost performance in downstream tasks such as multimodal medical image segmentation.
The code is available at https://github.com/PlumW/SpectFusion.

AAAI Conference 2025 Conference Paper

TokenMatcher: Diverse Tokens Matching for Unsupervised Visible-Infrared Person Re-Identification

  • Xiao Wang
  • Lekai Liu
  • Bin Yang
  • Mang Ye
  • Zheng Wang
  • Xin Xu

Unsupervised visible-infrared person re-identification (US-VI-ReID) seeks to match infrared and visible images of the same individual without the use of annotations. Current methods typically derive cross-modal correspondences through a single global feature matching process for generating pseudo labels and learning modality-invariant features. However, this matching approach is hindered by both intra-modality and inter-modality discrepancies, which result in imprecise measurements. As a consequence, the clustering of individuals with a single global feature is often incomplete and unreliable, leading to suboptimal performance in cross-modal clustering tasks. To address these challenges and to extract cross-modality discriminative identity information, we propose TokenMatcher, which encompasses three key components: Diverse Tokens Matching (DTM), Diverse Tokens Neighbor Learning (DTNL), and the Homogeneous Fusion (HF) Module. DTM utilizes multiple class tokens within the visual transformer framework to capture diverse embedding representations, thereby facilitating the integration of fine-grained information essential for reliable cross-modality correspondences. DTNL enhances the intra-modality and inter-modality consistency among diverse tokens by refining neighborhood sets with insights from neighboring tokens and camera information, promoting robust neighborhood learning and fostering discriminative identity information. Additionally, the HF module consolidates clusters of the same identity while effectively separating those of different identities. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets demonstrate the efficacy of the proposed method.

NeurIPS Conference 2025 Conference Paper

Unbiased Prototype Consistency Learning for Multi-Modal and Multi-Task Object Re-Identification

  • Zhongao Zhou
  • Bin Yang
  • Wenke Huang
  • Jun Chen
  • Mang Ye

In object re-identification (ReID) task, both cross-modal and multi-modal retrieval methods have achieved notable progress. However, existing approaches are designed for a specific modality and category (person or vehicle) retrieval task, lacking generalizability to others. Acquiring multiple task-specific models would result in wasteful allocation of both training and deployment resources. To address the practical requirements for unified retrieval, we introduce Multi-Modal and Multi-Task object ReID ($\rm {M^3T}$-ReID). The $\rm {M^3T}$-ReID task aims to utilize a unified model to simultaneously achieve retrieval tasks across different modalities and different categories. Specifically, to tackle the challenges of modality distribution divergence and category semantics discrepancy posed in $\rm {M^3T}$-ReID, we design a novel Unbiased Prototype Consistency Learning (UPCL) framework, which consists of two main modules: Unbiased Prototypes-guided Modality Enhancement (UPME) and Cluster Prototype Consistency Regularization (CPCR). UPME leverages modality-unbiased prototypes to simultaneously enhance cross-modal shared features and multi-modal fused features. Additionally, CPCR regulates discriminative semantics learning with category-consistent information through prototypes clustering. Under the collaborative operation of these two modules, our model can simultaneously learn robust cross-modal shared feature and multi-modal fused feature spaces, while also exhibiting strong category-discriminative capabilities. Extensive experiments on multi-modal datasets RGBNT201 and RGBNT100 demonstrate that our UPCL framework achieves exceptional performance on $\rm {M^3T}$-ReID. The code is available at https://github.com/ZhouZhongao/UPCL.

NeurIPS Conference 2024 Conference Paper

Empowering Visible-Infrared Person Re-Identification with Large Foundation Models

  • Zhangyi Hu
  • Bin Yang
  • Mang Ye

Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal retrieval task due to significant modality differences, primarily resulting from the absence of color information in the infrared modality. The development of large foundation models like Large Language Models (LLMs) and Vision Language Models (VLMs) motivates us to explore a feasible solution to empower VI-ReID with off-the-shelf large foundation models. To this end, we propose a novel Text-enhanced VI-ReID framework driven by Large Foundation Models (TVI-LFM). The core idea is to enrich the representation of the infrared modality with textual descriptions automatically generated by VLMs. Specifically, we incorporate a pre-trained VLM to extract textual features from texts generated by a VLM and augmented by an LLM, and incrementally fine-tune the text encoder to minimize the domain gap between generated texts and original visual modalities. Meanwhile, to enhance the infrared modality with extracted textual representations, we leverage the modality alignment capabilities of VLMs and VLM-generated feature-level filters. This enables the text model to learn complementary features from the infrared modality, ensuring the semantic structural consistency between the fusion modality and the visible modality. Furthermore, we introduce modality joint learning to align features across all modalities, ensuring that textual features maintain stable semantic representation of overall pedestrian appearance during complementary information learning. Additionally, a modality ensemble retrieval strategy is proposed to leverage complementary strengths of each query modality to improve retrieval effectiveness and robustness. Extensive experiments on three expanded VI-ReID datasets demonstrate that our method significantly improves the retrieval performance, paving the way for the utilization of large foundation models in downstream multi-modal retrieval tasks.

NeurIPS Conference 2023 Conference Paper

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • Ge Zheng
  • Bin Yang
  • Jiajin Tang
  • Hong-Yu Zhou
  • Sibei Yang

A long-standing goal of AI systems is to perform complex multimodal reasoning like humans. Recently, large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking. However, the transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation and the limitations in terms of flexibility, generalizability, and explainability. To evoke CoT reasoning in multimodality, this work first conducts an in-depth analysis of these challenges posed by multimodality and presents two key insights: “keeping critical thinking” and “letting everyone do their jobs” in multimodal CoT reasoning. Furthermore, this study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning by first dividing the reasoning responsibility of LLMs into reasoning and recognition and then integrating the visual recognition capability of visual models into the joint reasoning process. The rationales generated by DDCoT not only improve the reasoning abilities of both large and small language models in zero-shot prompting and fine-tuning learning, significantly outperforming state-of-the-art methods but also exhibit impressive generalizability and explainability.
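The negative-space prompting idea — ask the LLM to decompose the question and flag recognition sub-questions as uncertain rather than guess — might be paraphrased as follows (the prompt wording and function name are illustrative, not the paper's exact prompt):

```python
def ddcot_style_prompt(question, context=""):
    """Build a prompt that (i) splits the problem into sub-questions and
    (ii) instructs the LLM to mark sub-questions requiring visual
    recognition as 'Uncertain' instead of hallucinating an answer, so a
    vision model can handle them in a second stage."""
    return (
        "Given the question below, decompose it into sub-questions.\n"
        "Answer each sub-question ONLY if it can be answered from text "
        "alone; otherwise reply 'Uncertain: needs visual recognition' "
        "so a visual model can answer it instead.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Sub-questions and answers:"
    )

p = ddcot_style_prompt("What color is the largest object?")
print(p)
```

The 'Uncertain' markers are what separate the reasoning duty (kept by the LLM) from the recognition duty (delegated to the visual model), matching the "letting everyone do their jobs" insight.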

AAAI Conference 2022 Conference Paper

Hyperverlet: A Symplectic Hypersolver for Hamiltonian Systems

  • Frederik Baymler Mathiesen
  • Bin Yang
  • Jilin Hu

Hamiltonian systems represent an important class of dynamical systems such as pendulums, molecular dynamics, and cosmic systems. The choice of solvers is significant to the accuracy when simulating Hamiltonian systems, where symplectic solvers are of particular importance. Recent advances in neural network-based hypersolvers, though achieving competitive results, still lack the symplecticity necessary for reliable simulations, especially over long time horizons. To alleviate this, we introduce Hyperverlet, a new hypersolver that composes the traditional, symplectic velocity Verlet solver with symplectic neural network-based solvers. More specifically, we propose a parameterization of symplectic neural networks and prove that the hyperbolic tangent is r-finite, expanding the set of allowable activation functions for symplectic neural networks and improving the accuracy. Extensive experiments on a spring-mass and a pendulum system justify the design choices and suggest that Hyperverlet outperforms both traditional solvers and hypersolvers.
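The symplectic base integrator that Hyperverlet augments is classical velocity Verlet; a minimal sketch on a pendulum (the learned correction network is omitted) shows the near-constant energy over long horizons that motivates symplecticity:

```python
import math

def velocity_verlet(q, p, force, dt, steps, mass=1.0):
    """Classical symplectic velocity Verlet: half-step momentum kick,
    full-step position drift, half-step momentum kick."""
    traj = [(q, p)]
    f = force(q)
    for _ in range(steps):
        p_half = p + 0.5 * dt * f
        q = q + dt * p_half / mass
        f = force(q)
        p = p_half + 0.5 * dt * f
        traj.append((q, p))
    return traj

# Pendulum: H(q, p) = p^2/2 - cos(q), so force = -dH/dq = -sin(q).
energy = lambda q, p: 0.5 * p * p - math.cos(q)
traj = velocity_verlet(q=1.0, p=0.0, force=lambda q: -math.sin(q),
                       dt=0.05, steps=2000)
drift = max(abs(energy(q, p) - energy(1.0, 0.0)) for q, p in traj)
```

Over 2000 steps the energy error stays bounded (here well below 1e-3) rather than accumulating, which is precisely the behavior a non-symplectic learned solver can lose.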

NeurIPS Conference 2022 Conference Paper

Interpreting Operation Selection in Differentiable Architecture Search: A Perspective from Influence-Directed Explanations

  • Miao Zhang
  • Wei Huang
  • Bin Yang

The Differentiable ARchiTecture Search (DARTS) has dominated the neural architecture search community due to its search efficiency and simplicity. DARTS leverages continuous relaxation to convert the intractable operation selection problem into a continuous magnitude optimization problem which can be easily handled with gradient descent, but it poses an additional challenge in measuring the operation importance or selecting an architecture from the optimized magnitudes. The vanilla DARTS assumes the optimized magnitudes reflect the importance of operations, while more recent works find this naive assumption leads to poor generalization and lacks theoretical guarantees. In this work, we leverage influence functions, the functional derivatives of the loss function, to theoretically reveal the operation selection part in DARTS and estimate the candidate operation importance by approximating its influence on the supernet with Taylor expansions. We show the operation strength is not only related to the magnitude but also second-order information, leading to a fundamentally new criterion for operation selection in DARTS, named Influential Magnitude. Empirical studies across different tasks on several spaces show that vanilla DARTS and its variants can avoid most failures by leveraging the proposed theory-driven operation selection criterion.
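The core intuition — operation importance depends on curvature as well as magnitude — can be illustrated with a finite-difference second-order Taylor estimate of the loss change when an architecture weight is zeroed (toy one-dimensional losses, not the paper's supernet influence functions):

```python
def influence_importance(loss, a, eps=1e-4):
    """Second-order Taylor estimate of the loss increase when an
    operation's architecture weight a is set to zero:
    L(0) - L(a) ~ -a L'(a) + (a^2 / 2) L''(a)."""
    g = (loss(a + eps) - loss(a - eps)) / (2 * eps)              # L'(a)
    h = (loss(a + eps) - 2 * loss(a) + loss(a - eps)) / eps ** 2  # L''(a)
    return -a * g + 0.5 * a * a * h

# Two toy operations at their optimized magnitudes: op2 has the LARGER
# magnitude, but the loss is far more curved around op1, so removing
# op1 actually hurts more.
L1 = lambda a: 10.0 * (a - 0.4) ** 2   # op1: magnitude 0.4, high curvature
L2 = lambda a: 0.1 * (a - 0.6) ** 2    # op2: magnitude 0.6, low curvature
imp1 = influence_importance(L1, 0.4)
imp2 = influence_importance(L2, 0.6)
```

The magnitude criterion would keep op2 (0.6 > 0.4), while the curvature-aware estimate correctly ranks op1 as more influential — a toy instance of why second-order information matters.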

AAAI Conference 2022 Conference Paper

REMOTE: Reinforced Motion Transformation Network for Semi-supervised 2D Pose Estimation in Videos

  • Xianzheng Ma
  • Hossein Rahmani
  • Zhipeng Fan
  • Bin Yang
  • Jun Chen
  • Jun Liu

Existing approaches for 2D pose estimation in videos often require a large number of dense annotations, which are costly and labor intensive to acquire. In this paper, we propose a semi-supervised REinforced MOtion Transformation nEtwork (REMOTE) to leverage a few labeled frames and temporal pose variations in videos, which enables effective learning of 2D pose estimation in sparsely annotated videos. Specifically, we introduce a Motion Transformer (MT) module to perform cross frame reconstruction, aiming to learn motion dynamic knowledge in videos. Besides, a novel reinforcement learning-based Frame Selection Agent (FSA) is designed within our framework, which is able to harness informative frame pairs on the fly to enhance the pose estimator under our cross reconstruction mechanism. We conduct extensive experiments that show the efficacy of our proposed REMOTE framework.

IROS Conference 2022 Conference Paper

Simultaneous Depth Estimation and Localization for Cell Manipulation Based on Deep Learning

  • Zengshuo Wang
  • Huiying Gong
  • Ke Li 0026
  • Bin Yang
  • Yue Du
  • Yaowei Liu
  • Xin Zhao 0010
  • Mingzhu Sun

Visual localization, which is a key technology to realize the automation of cell manipulation, has been widely studied. Since the depth of field of the microscope is narrow, the planar localization and depth estimation are usually coupled together. At present, most methods adopt the serial working mode of focusing first and then planar localization, but they usually do not have good real-time performance and stability. In this paper, a simultaneous depth estimation and localization network was developed for cell manipulation. The network takes a focused image and a defocus-offset image as inputs, and outputs the defocus in the depth direction and the offset in the plane at the same time after going through defocus-offset information extraction, defocus classification mapping and offset regression mapping. To train and test our network, we also create two datasets: an Adherent Cell dataset and an Injection Micropipette dataset. The experimental results demonstrated that the proposed method achieves the detection of all test samples with a frame rate of more than 40 Hz, and the maximum errors of depth estimation and localization are $2.44\,\mu m$ and $0.49\,\mu m$, respectively. The proposed method has good stability, which is mainly reflected in its strong generalization ability and anti-noise ability.

IJCAI Conference 2022 Conference Paper

Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting

  • Razvan-Gabriel Cirstea
  • Chenjuan Guo
  • Bin Yang
  • Tung Kieu
  • Xuanyi Dong
  • Shirui Pan

A variety of real-world applications rely on far future information to make decisions, thus calling for efficient and accurate long sequence multivariate time series forecasting. While recent attention-based forecasting models show strong abilities in capturing long-term dependencies, they still suffer from two key limitations. First, canonical self attention has a quadratic complexity w.r.t. the input time series length, thus falling short in efficiency. Second, different variables’ time series often have distinct temporal dynamics, which existing studies fail to capture, as they use the same model parameter space, e.g., projection matrices, for all variables’ time series, thus falling short in accuracy. To ensure high efficiency and accuracy, we propose Triformer, a triangular, variable-specific attention. (i) Linear complexity: we introduce a novel patch attention with linear complexity. When stacking multiple layers of the patch attentions, a triangular structure is proposed such that the layer sizes shrink exponentially, thus maintaining linear complexity. (ii) Variable-specific parameters: we propose a light-weight method to enable distinct sets of model parameters for different variables’ time series to enhance accuracy without compromising efficiency and memory usage. Strong empirical evidence on four datasets from multiple domains justifies our design choices, and it demonstrates that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency. Source code is publicly available at https://github.com/razvanc92/triformer.
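The patch attention and the shrinking triangular stack can be sketched on a toy scalar sequence (the scalar "attention" here is only illustrative): each layer attends only within fixed-size patches, then keeps one pooled token per patch, so layer sizes shrink geometrically and the total cost stays linear in the input length.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def patch_attention_layer(seq, patch):
    """Attend only within fixed-size patches (cost O(len(seq) * patch),
    i.e. linear in the sequence length), then keep one pooled token per
    patch so the next layer is `patch` times smaller."""
    out = []
    for i in range(0, len(seq), patch):
        chunk = seq[i:i + patch]
        w = softmax(chunk)                      # toy scalar attention scores
        out.append(sum(wi * xi for wi, xi in zip(w, chunk)))  # pooled token
    return out

seq = [float(t % 7) for t in range(64)]
layer_sizes = []
while len(seq) > 1:
    layer_sizes.append(len(seq))
    seq = patch_attention_layer(seq, patch=4)
```

Stacking gives layer sizes 64 → 16 → 4: a geometric series, so the summed cost over all layers remains O(n) in the original length — the triangular structure the title refers to.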

NeurIPS Conference 2022 Conference Paper

Weighted Mutual Learning with Diversity-Driven Model Compression

  • Miao Zhang
  • Li Wang
  • David Campos
  • Wei Huang
  • Chenjuan Guo
  • Bin Yang

Online distillation attracts attention from the community as it simplifies the traditional two-stage knowledge distillation process into a single stage. Online distillation collaboratively trains a group of peer models, which are treated as students, and all students gain extra knowledge from each other. However, memory consumption and diversity among peers are two key challenges to the scalability and quality of online distillation. To address the two challenges, this paper presents a framework called Weighted Mutual Learning with Diversity-Driven Model Compression (WML) for online distillation. First, on the basis of a hierarchical structure in which peers share different parts, we leverage structured network pruning to generate diversified peer models and reduce the memory requirements. Second, rather than taking the average of peers, this paper, for the first time, leverages a bi-level formulation to estimate the relative importance of peers in closed form, to further boost the effectiveness of the distillation from each other. Extensive experiments show the generalization of the proposed framework, which outperforms existing online distillation methods on a variety of deep neural networks. More interestingly, as a byproduct, WML produces a series of pruned models under different model sizes in a single run, which also achieves competitive results compared with existing channel pruning methods.
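The mutual-learning term — each peer distilling from a weighted combination of the others rather than their plain average — can be sketched as follows (the weights are fixed here for illustration; in the paper they come from the bi-level closed-form estimate):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_mutual_loss(peer_probs, weights):
    """Mutual-learning loss per peer: peer i mimics each other peer j,
    with peer j's contribution scaled by its importance weight instead
    of the uniform 1/(n-1) of plain mutual learning."""
    losses = []
    for i, p in enumerate(peer_probs):
        losses.append(sum(weights[j] * kl(q, p)
                          for j, q in enumerate(peer_probs) if j != i))
    return losses

# Three peers' predicted class distributions on one sample.
peers = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
losses = weighted_mutual_loss(peers, weights=[1 / 3] * 3)
```

When all peers agree the loss vanishes; raising a strong peer's weight pulls the others toward it more strongly, which is the lever the bi-level optimization tunes.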

ICRA Conference 2021 Conference Paper

A Model-Free Synchronous Control of Humanoid Robot Finger

  • Ziqi Liu
  • Li Jiang 0001
  • Bin Yang
  • Chongyang Li
  • Ming Cheng
  • Shaowei Fan
  • Dapeng Yang 0001

For a multi-fingered robot hand, individual control over single joints cannot guarantee their fine collaboration. To achieve high-precision synchronization, we introduce a theory of synchronous control to multi-fingered robot hands. This paper presents a new model-free, cross-coupling control strategy, which has been tested on humanoid robot fingers and shows high positioning performance. To realize the mutual influence among the joint controllers, we establish a synchronization error through the differential disposal of adjacent actuator errors; position errors and synchronization errors are then incorporated into a unified control frame. Meanwhile, considering the complex dynamic formulations of the dexterous hand and the characteristics of the control system, a model-free, cross-coupled trajectory tracking method is introduced that does not require explicit dynamic modeling parameters. Finally, we tested our method on the multi-fingered hand platform HIT/DLR-II. The results show that the new method has superior performance over traditional non-synchronous approaches.
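The cross-coupling idea — feed each joint both its own position error and a synchronization error built from the difference of adjacent errors — can be sketched with two first-order toy joints (gains and dynamics are invented for illustration, not the HIT/DLR-II controller):

```python
def simulate(ks, steps=400, dt=0.01, kp=4.0):
    """Two joints with unequal actuator effectiveness track the same
    target. Control law: u_i = kp * e_i + ks * (e_i - e_other), where
    the second term is the cross-coupled synchronization error; ks = 0
    recovers plain independent (non-synchronous) control.
    Returns the worst desynchronization |x0 - x1| seen."""
    x = [0.0, 0.0]          # joint positions
    gain = [1.0, 0.6]       # joint 1 responds more sluggishly
    target = 1.0
    max_desync = 0.0
    for _ in range(steps):
        e = [target - xi for xi in x]
        u = [kp * e[0] + ks * (e[0] - e[1]),
             kp * e[1] + ks * (e[1] - e[0])]
        x = [xi + dt * gi * ui for xi, gi, ui in zip(x, gain, u)]
        max_desync = max(max_desync, abs(x[0] - x[1]))
    return max_desync

coupled = simulate(ks=12.0)
uncoupled = simulate(ks=0.0)
```

With coupling, the faster joint is braked and the slower one pushed whenever they drift apart, so the peak desynchronization drops well below the independent-control case — the effect the unified control frame formalizes.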

IJCAI Conference 2021 Conference Paper

Unsupervised Path Representation Learning with Curriculum Negative Sampling

  • Sean Bin Yang
  • Chenjuan Guo
  • Jilin Hu
  • Jian Tang
  • Bin Yang

Path representations are critical in a variety of transportation applications, such as estimating path ranking in path recommendation systems and estimating path travel time in navigation systems. Existing studies often learn task-specific path representations in a supervised manner, which require a large amount of labeled training data and generalize poorly to other tasks. We propose an unsupervised learning framework Path InfoMax (PIM) to learn generic path representations that work for different downstream tasks. We first propose a curriculum negative sampling method, for each input path, to generate a small amount of negative paths, by following the principles of curriculum learning. Next, PIM employs mutual information maximization to learn path representations from both a global and a local view. In the global view, PIM distinguishes the representations of the input paths from those of the negative paths. In the local view, PIM distinguishes the input path representations from the representations of the nodes that appear only in the negative paths. This enables the learned path representations to encode both global and local information at different scales. Extensive experiments on two downstream tasks, ranking score estimation and travel time estimation, using two road network datasets suggest that PIM significantly outperforms other unsupervised methods and can also be used as a pre-training method to enhance supervised path representation learning.
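Curriculum negative sampling can be sketched as generating negatives that share progressively more nodes with the input path, so later negatives are harder to distinguish (the node pool and replacement scheme are illustrative, not the paper's exact procedure):

```python
import random

def curriculum_negatives(path, node_pool, n_neg=4, seed=0):
    """For an input path, generate negatives of increasing difficulty:
    early negatives replace most nodes (easy to tell apart), later ones
    replace only a few (hard), following a curriculum."""
    rng = random.Random(seed)
    negatives = []
    for k in range(n_neg):
        # Fraction of original nodes kept grows with k.
        keep = int(len(path) * (k + 1) / (n_neg + 1))
        neg = list(path)
        for i in rng.sample(range(len(path)), len(path) - keep):
            neg[i] = rng.choice(node_pool)
        negatives.append(neg)
    return negatives

path = list(range(10))                      # node IDs along the input path
negs = curriculum_negatives(path, node_pool=list(range(100, 200)))
overlaps = [sum(a == b for a, b in zip(n, path)) for n in negs]
```

The overlap with the input path rises monotonically across the generated negatives, giving the contrastive objective a steadily harder discrimination task.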

IJCAI Conference 2019 Conference Paper

Outlier Detection for Time Series with Recurrent Autoencoder Ensembles

  • Tung Kieu
  • Bin Yang
  • Chenjuan Guo
  • Christian S. Jensen

We propose two solutions to outlier detection in time series based on recurrent autoencoder ensembles. The solutions exploit autoencoders built using sparsely-connected recurrent neural networks (S-RNNs). Such networks make it possible to generate multiple autoencoders with different neural network connection structures. The two solutions are ensemble frameworks, specifically an independent framework and a shared framework, both of which combine multiple S-RNN based autoencoders to enable outlier detection. This ensemble-based approach aims to reduce the effects of some autoencoders being overfitted to outliers, thereby improving overall detection quality. Experiments with two large real-world time series data sets, including univariate and multivariate time series, offer insight into the design properties of the proposed frameworks and demonstrate that the resulting solutions are capable of outperforming both baselines and the state-of-the-art methods.
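The ensemble principle — combine reconstructors with different sparse connection structures and score with a robust statistic so no single overfitted member dominates — can be sketched with toy reconstructors standing in for the S-RNN autoencoders (member design and median combination are a simplified illustration):

```python
import random
import statistics

def build_ensemble(n_members, window, seed=0):
    """Each member 'reconstructs' x[t] from a different random subset of
    lags within the past window -- a stand-in for the differently
    sparsely-connected autoencoders of the paper."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(1, window + 1), k=rng.randint(2, window)))
            for _ in range(n_members)]

def outlier_scores(series, ensemble, window=5):
    scores = []
    for t in range(window, len(series)):
        errs = [abs(series[t] - sum(series[t - lag] for lag in lags) / len(lags))
                for lags in ensemble]
        # Median across members damps any single badly-fitted member.
        scores.append(statistics.median(errs))
    return scores

series = [float((-1) ** t) for t in range(60)]   # regular alternating signal
series[40] = 8.0                                  # injected point outlier
scores = outlier_scores(series, build_ensemble(8, 5))
peak = max(range(len(scores)), key=scores.__getitem__)
```

The injected spike receives by far the largest median reconstruction error, while the surrounding points — whose windows contain the spike for only some members — stay low, illustrating why combining diverse members is more robust than any one of them.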

JBHI Journal 2019 Journal Article

Simultaneous Volumetric Segmentation of Vertebral Bodies and Intervertebral Discs on Fat-Water MR Images

  • Faezeh Fallah
  • Sven Stephan Walter
  • Fabian Bamberg
  • Bin Yang

Fat-water magnetic resonance (MR) images allow automated noninvasive analysis of morphological properties and fat fractions of vertebral bodies (VBs) and intervertebral discs (IVDs) that constitute an important part of human biomechanical systems. In this paper, we propose a fully automated approach for simultaneously segmenting multiple VBs and IVDs on fat-water MR images without prior localization or geometry estimation. This method involved a hierarchical random forest (HRF) classifier and a hierarchical conditional random field (HCRF) that encoded a multiresolution image pyramid based on a set of multiscale local and contextual features. The HRF classifier employed penalized multivariate linear discriminants and SMOTE Bagging to handle limited and imbalanced training data with large feature dimension. The HCRF estimated optimum labels according to their spatial and hierarchical consistencies by using the layer-wise significant features determined over the trained HRF classifier. To handle variable sample numbers at different resolutions, resolution-specific hyper-parameters were used. This method was trained and evaluated for segmenting 15 thoracic and lumbar VBs and their IVDs on fat-water MR images of a subset of a large cohort data set. It was further evaluated for segmenting seven IVDs of the lower spine on fat-water images of a public grand challenge. These evaluations showed that the method achieves accuracy comparable to the state of the art while incurring less computational burden thanks to simultaneous localization and segmentation.

IROS Conference 2008 Conference Paper

Self-Localization with RFID snapshots in densely tagged environments

  • Philipp Vorst
  • Sebastian Schneegans
  • Bin Yang
  • Andreas Zell

In this paper we show that, despite some disadvantageous properties of radio frequency identification (RFID), it is possible to localize a mobile robot quite accurately in environments which are densely tagged. We therefore employ a recently presented probabilistic fingerprinting technique called RFID snapshots. This method interprets short series of RFID measurements as feature vectors and is able to position a mobile robot after a training phase. It requires no explicit sensor model and is capable of exploiting given tag infrastructures, e.g., provided by supermarket shelves containing labeled products.
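A simplified, non-probabilistic variant of the fingerprinting idea — store a tag-detection-rate vector per training pose, then localize at the pose whose stored snapshot is closest to the current measurement — can be sketched as follows (tag IDs, poses, and rates are invented; the paper's RFID-snapshot method is probabilistic, not nearest-neighbor):

```python
import math

def nearest_fingerprint(query, fingerprints):
    """Return the training pose whose stored detection-rate vector is
    closest (Euclidean distance) to the current measurement vector."""
    best_pose, best_d = None, float("inf")
    for pose, vec in fingerprints.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, query)))
        if d < best_d:
            best_pose, best_d = pose, d
    return best_pose

# Detection rates of 4 tags over short measurement series at known poses.
fps = {(0.0, 0.0): [0.9, 0.1, 0.0, 0.0],
       (1.0, 0.0): [0.2, 0.8, 0.3, 0.0],
       (1.0, 1.0): [0.0, 0.3, 0.9, 0.6]}
pose = nearest_fingerprint([0.1, 0.7, 0.35, 0.05], fps)
```

Because the feature vector is built purely from which tags answered how often, no explicit RF sensor model is needed — the key practical advantage the abstract highlights.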