Arrow Research search

Author name cluster

Yifan Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers

54

AAAI Conference 2026 Conference Paper

BraSTORM: A Dual-Branch Self-Supervised Framework for EEG Representation Learning via Input-Level Spatio-Temporal Decomposition

  • Yifan Wang
  • Der-Horng Lee
  • Bruce X.B. Yu

Prevalent pre-training strategies for Brain-Computer Interfaces (BCIs) are often constrained by spatio-temporal entanglement. This critical issue arises from processing multi-channel Electroencephalography (EEG) signals as monolithic sequences, which intertwines the signal's temporal dynamics with its spatial topography and hinders the learning of robust and generalizable representations. To address this, we introduce BraSTORM, a framework that explicitly disentangles EEG data into separate temporal and spatial streams at the input level. The two streams are processed by parallel encoders trained with a composite dual objective: a masked signal reconstruction loss captures fine-grained, intra-modal details, while a cross-modal contrastive loss enforces high-level semantic alignment. Extensive fine-tuning experiments on six benchmarks covering three major BCI downstream tasks—Emotion Recognition, Sleep Staging, and Motor Imagery—demonstrate that BraSTORM achieves state-of-the-art performance. Our findings validate that resolving spatio-temporal entanglement at the input level yields a competitive pre-training framework for the BCI field.
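The composite dual objective described in this abstract can be illustrated with a minimal NumPy sketch; the function names, the InfoNCE form of the contrastive term, and the weighting `alpha` are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, mask):
    """Mean squared error restricted to the masked positions."""
    diff = (x - x_hat) ** 2
    return float(diff[mask].mean())

def infonce_loss(z_a, z_b, temperature=0.1):
    """Symmetric-style InfoNCE aligning paired embeddings from two streams."""
    # L2-normalize embeddings row-wise
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: the i-th temporal view pairs with
    # the i-th spatial view of the same EEG segment
    return float(-np.mean(np.diag(log_probs)))

def dual_objective(x, x_hat, mask, z_temporal, z_spatial, alpha=1.0):
    """Composite loss: masked reconstruction plus cross-modal contrastive term."""
    return (masked_reconstruction_loss(x, x_hat, mask)
            + alpha * infonce_loss(z_temporal, z_spatial))
```

The reconstruction term sees only masked entries, while the contrastive term operates on the encoder outputs of the two streams, mirroring the intra-modal/cross-modal split the abstract describes.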

JBHI Journal 2026 Journal Article

Dual-Cross Tri-Level Routing Transformer Based Metric Learning Network for Epileptic Seizure Prediction Using a Single-Channel iEEG

  • Yifan Wang
  • Weidong Yan
  • Yulan Ma
  • Liang Qiao
  • Tao Yu
  • Jingyu Liu

With the development of deep brain stimulation techniques, single-channel intracranial electroencephalography (iEEG) based seizure prediction is a necessary and urgently needed tool for epilepsy closed-loop neuromodulation. However, previous prediction methods based on multi-channel scalp signals relied heavily on spatial information, failing to fully exploit the interdependencies between the temporal scales and spectral rhythms of single-channel iEEG. Additionally, current contrastive learning strategies can lead to model overfitting by excessively learning the feature distances in small samples, limiting the precision of seizure prediction. To tackle the above issues, we propose a novel dual-cross tri-level routing transformer based metric learning network (DC-TRT-MLNet) for epileptic seizure prediction from single-channel iEEG. First, a scale-rhythm dual-cross (DC) graph attention network is introduced to construct the dependent relationships across multi-scale temporal and multi-rhythm spectral features. Second, we design a tri-level routing transformer (TRT) network to comprehensively refine the routing features with the greatest seizure potential while eliminating redundant information. Finally, a hard triplet optimization based metric learning (ML) strategy is developed to iteratively optimize the intra-class and inter-class distances of inter-ictal and pre-ictal routing features. Competitive experimental results on a private Xuanwu Single-Channel iEEG dataset validate the effectiveness of the proposed method, demonstrating the superior prediction performance of DC-TRT-MLNet compared with state-of-the-art methods. Our study may offer a new solution for intracranial single-channel seizure prediction.

AAAI Conference 2026 Conference Paper

Evidence-aware Integration and Domain Identification of Spatial Transcriptomics Data

  • Wei Zhang
  • Siyu Yi
  • Lezhi Chen
  • Yifan Wang
  • Ziyue Qiao
  • Yongdao Zhou
  • Wei Ju

Spatial transcriptomics (ST) enables joint profiling of gene expression and spatial positions, thereby revealing spatially resolved biological functions. However, many existing ST analysis methods often fail to explicitly quantify the belief and uncertainty in decisions caused by noisy ST data, making it difficult to handle spots of varying quality in a fine-grained manner. In addition, domain identification is a fundamental and critical task in ST, but commonly used models that separate expression learning and clustering often struggle to learn cluster-friendly latent representations effectively. To address these issues, we propose PREST, a prototype-based evidence-aware integration framework for ST data. PREST performs multi-scale representation learning with fine-grained attention fusion and introduces learnable class prototypes to quantify belief and uncertainty in model decisions. We aim to align overall belief scores with latent semantic information to enhance uncertainty quantification and prototype learning, thereby promoting the learning of clustering-friendly representations. PREST further integrates an uncertainty-aware reconstruction module and spatial regularization to reduce overfitting to unreliable spots and promote denoised, discriminative representations. Extensive experiments on several benchmark datasets validate the effectiveness and superiority of our proposed PREST across various downstream tasks.

AAAI Conference 2026 Conference Paper

FairGC: Fostering Individual and Group Fairness for Deep Graph Clustering

  • Haodong Zhang
  • Xinyue Wang
  • Tao Ren
  • Yifan Wang
  • Siyu Yi
  • Fanchun Meng
  • Zeyu Ma
  • Qingqing Long

The widespread adoption of graph neural networks (GNNs) has brought increased attention to fairness issues related to sensitive attributes, such as gender and race, in practical scenarios. However, this concern remains largely unexplored in the context of graph clustering. Conventional fair graph clustering methods primarily depend on spectral clustering approaches. Meanwhile, we argue that existing graph learning works mainly focus on a single type of fairness, whereas graph clustering should achieve group equality-informed individual fairness. In this paper, we introduce for the first time a fairness-aware framework termed FairGC for deep graph clustering, which integrates the dual objectives of individual and group fairness while maintaining accurate clustering results. Specifically, we construct two views with distinct semantics using Siamese encoders. Then, we apply multi-step random walks on view-specific affinity graphs to capture high-order affinities of node pairs, thereby reformulating the contrastive learning with a focus on individual similarity. Besides, we utilize adversarial learning by making node representations independent of the estimated sensitive attributes to further eliminate group biases of clustering results. Extensive experiments on four benchmarks demonstrate the effectiveness and superiority of our proposed framework FairGC.

AAAI Conference 2026 Conference Paper

MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

  • Jing Li
  • Yifan Wang
  • Jiafeng Yan
  • Renlong Zhang
  • Bin Yang

Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby compromising fusion performance; and 2) rely on fixed network architectures, limiting their adaptability to diverse degradation scenarios. To address these issues, we propose a one-stop degradation-aware image fusion framework for multi-degradation scenarios driven by a large language model (MdaIF). Given the distinct scattering characteristics of different degradation scenarios (e.g., haze, rain, and snow) in atmospheric transmission, a mixture-of-experts (MoE) system is introduced to tackle image fusion across multiple degradation scenarios. To adaptively extract diverse weather-aware degradation knowledge and scene feature representations, collectively referred to as the semantic prior, we employ a pre-trained vision-language model (VLM) in our framework. Guided by the semantic prior, we propose a degradation-aware channel attention module (DCAM), which employs degradation prototype decomposition to facilitate multi-modal feature interaction in the channel domain. In addition, to achieve effective expert routing, the semantic prior and channel-domain modulated features are utilized to guide the MoE, enabling robust image fusion in complex degradation scenarios. Extensive experiments validate the effectiveness of our MdaIF, demonstrating superior performance over SOTA methods.

AAAI Conference 2026 Conference Paper

TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization

  • Maowei Jiang
  • Zihang Wang
  • Qi Wang
  • Peter Búš
  • Moquan Cheng
  • Yifan Wang
  • Quangao Liu
  • Ruiqi Li

Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning performance of large language models (LLMs), with approaches such as Group Relative Policy Optimization (GRPO) showing promising results. However, GRPO and its variants struggle with collapsed groups (i.e., all-correct or all-incorrect completions), leading to zero-variance rewards and ineffective gradient signals. Moreover, focusing solely on final answer correctness while ignoring the reasoning process, along with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. TAPO incorporates three core techniques: (1) Dynamic Teacher Injection (DTI), which selectively injects high-quality or adversarial examples to restore effective gradient signals in collapsed groups; (2) Perturbed Answer Injection (PAI), which injects partially correct completions to provide contrastive supervision, separating trajectories with sound reasoning but wrong final answers from the rest; and (3) InfoLen-Aware Reward Shaping, a fine-grained reward strategy that penalizes outputs based on both length and semantic redundancy, encouraging concise yet informative responses. Extensive experimental results demonstrate that TAPO significantly improves the mathematical reasoning capabilities of LLMs across multiple challenging benchmarks, outperforming the GRPO baseline by a substantial margin. Component-wise ablations further validate the contribution of each proposed technique.

AAAI Conference 2025 Conference Paper

Aligning Composed Query with Image via Discriminative Perception from Negative Correspondences

  • Yifan Wang
  • Wuliang Huang
  • Chun Yuan

The task of composed image retrieval aims to match a multi-modal query, composed of a reference image and a modification sentence, with the target image. Most current approaches narrow the distances between composed queries and targets by investigating matched correspondences in positive triplets. Nevertheless, they are inclined to rely heavily on partial correlations. As negative correspondences are underestimated, semantic clues that distinguish the target from mismatched candidates are obscured by incomplete associations. Moreover, the correlations between the modification textual features and the visual variations from the reference to the candidates are imperative for further strengthening semantic discrimination. In this paper, we propose DIscriminative Perception from NEgative Correspondences (DIPNEC) to address the aforementioned issues. To encourage awareness of the differences between matched and mismatched correspondences, DIPNEC introduces optimal transport with semantic preservation for reassignments on hard negative triplets. Besides, Difference Quantization Alignments (DQA) and Composed Word-level Alignments (CWA) jointly determine the matching scores between multi-modal queries and candidates. Specifically, DQA concentrates on the correlations of textual features with source-to-target visual differences, and CWA further emphasizes the differentiated semantics. DIPNEC demonstrates competitive performance in experiments and ablation studies on the widely used FashionIQ and CIRR datasets.

TMLR Journal 2025 Journal Article

B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

  • Yifan Wang
  • Sukrut Rao
  • Ji-Ung Lee
  • Mayank Jobanputra
  • Vera Demberg

Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.

NeurIPS Conference 2025 Conference Paper

DAAC: Discrepancy-Aware Adaptive Contrastive Learning for Medical Time series

  • Yifan Wang
  • Hongfeng Ai
  • Ruiqi Li
  • Maowei Jiang
  • Quangao Liu
  • Jiahua Dong
  • Ruiyuan Kang
  • Alan Liang

Medical time-series data play a vital role in disease diagnosis but suffer from limited labeled samples and single-center bias, which hinder model generalization and lead to overfitting. To address these challenges, we propose DAAC (Discrepancy-Aware Adaptive Contrastive learning), a learnable multi-view contrastive framework that integrates external normal samples and enhances feature learning through adaptive contrastive strategies. DAAC consists of two key modules: (1) a Discrepancy Estimator, built upon a GAN-enhanced encoder-decoder architecture, captures the distribution of normal data and computes reconstruction errors as indicators of abnormality; these discrepancy features augment the target dataset to mitigate overfitting. (2) an Adaptive Contrastive Learner uses multi-head attention to extract discriminative representations by contrasting embeddings across multiple views and data granularities (subject, trial, epoch, and temporal levels), eliminating the need for handcrafted positive-negative sample pairs. Extensive experiments on three clinical datasets—covering Alzheimer's disease, Parkinson's disease, and myocardial infarction—demonstrate that DAAC significantly outperforms existing methods, even when only 10% of labeled data is available, showing strong generalization and diagnostic performance. Our code is available at https://github.com/CUHKSZ-MED-BioE/DAAC.

AAAI Conference 2025 Conference Paper

DisCo: Graph-Based Disentangled Contrastive Learning for Cold-Start Cross-Domain Recommendation

  • Hourun Li
  • Yifan Wang
  • Zhiping Xiao
  • Jia Yang
  • Changling Zhou
  • Ming Zhang
  • Wei Ju

Recommender systems are widely used in various real-world applications, but they often encounter the persistent challenge of the user cold-start problem. Cross-domain recommendation (CDR), which leverages user interactions from one domain to improve prediction performance in another, has emerged as a promising solution. However, users with similar preferences in the source domain may exhibit different interests in the target domain. Therefore, directly transferring embeddings may introduce irrelevant source-domain collaborative information. In this paper, we propose a novel graph-based disentangled contrastive learning framework to capture fine-grained user intent and filter out irrelevant collaborative information, thereby avoiding negative transfer. Specifically, for each domain, we use a multi-channel graph encoder to capture diverse user intents. We then construct the affinity graph in the embedding space and perform multi-step random walks to capture high-order user similarity relationships. Treating one domain as the target, we propose a disentangled intent-wise contrastive learning approach, guided by user similarity, to refine the bridging of user intents across domains. Extensive experiments on four benchmark CDR datasets demonstrate that DisCo consistently outperforms existing state-of-the-art baselines, thereby validating the effectiveness of both DisCo and its components.

ICLR Conference 2025 Conference Paper

Discovering Influential Neuron Path in Vision Transformers

  • Yifan Wang
  • Yifei Liu
  • Yingdong Shi
  • Changming Li
  • Anqi Pang
  • Sibei Yang
  • Jingyi Yu 0001
  • Kan Ren

Vision Transformer models exhibit immense power yet remain opaque to human understanding, posing challenges and risks for practical applications. While prior research has attempted to demystify these models through input attribution and neuron role analysis, there has been a notable gap in considering layer-level information and the holistic path of information flow across layers. In this paper, we investigate the significance of influential neuron paths within vision Transformers: a path of neurons from the model input to the output that most significantly impacts model inference. We first propose a joint influence measure to assess the contribution of a set of neurons to the model outcome, and we further provide a layer-progressive neuron locating approach that efficiently selects the most influential neuron at each layer to discover the crucial neuron path from input to output within the target model. Our experiments demonstrate that our method outperforms existing baseline solutions at finding the most influential neuron path along which information flows. Additionally, the neuron paths illustrate that vision Transformers exhibit a specific inner working mechanism for processing visual information within the same image category. We further analyze the key effects of these neurons on the image classification task, showing that the found neuron paths already preserve the model's capability on downstream tasks, which may also shed some light on real-world applications like model pruning. The project website, including implementation code, is available at https://foundation-model-research.github.io/NeuronPath/.

NeurIPS Conference 2025 Conference Paper

Dual Prototype-Enhanced Contrastive Framework for Class-Imbalanced Graph Domain Adaptation

  • Xin Ma
  • Yifan Wang
  • Siyu Yi
  • Wei Ju
  • Junyu Luo
  • Yusheng Zhao
  • Xiao Luo
  • Jiancheng Lv

Graph transfer learning, especially in unsupervised domain adaptation, aims to transfer knowledge from a label-abundant source graph to an unlabeled target graph. However, most existing approaches overlook the common issue of label imbalance in the source domain, typically assuming a balanced label distribution that rarely holds in practice. Moreover, they face challenges arising from biased knowledge in the source graph and substantial domain distribution shifts. To remedy the above challenges, in this paper we propose a dual-branch prototype-enhanced contrastive framework for class-imbalanced graph domain adaptation. Specifically, we introduce a dual-branch graph encoder to capture both local and global information, generating class-specific prototypes from a distilled anchor set. Then, a prototype-enhanced contrastive learning framework is introduced. On the one hand, we encourage class alignment between the two branches based on the constructed prototypes to alleviate the bias introduced by class imbalance. On the other hand, we infer pseudo-labels for the target domain and align sample pairs across domains that share similar semantics to reduce domain discrepancies. Experimental results show that our ImGDA outperforms state-of-the-art methods across multiple datasets and settings. The code is available at: https://github.com/maxin88scu/ImGDA.

IROS Conference 2025 Conference Paper

EIC Framework for Hand Exoskeletons Based on a Multimodal Large Language Model

  • Houcheng Li
  • Zhenchan Su
  • Honglei Guo
  • Yifan Wang
  • Zeyu Liu
  • Long Cheng

Current hand exoskeleton interaction methods primarily focus on recognizing a limited range of hand motion intentions and rely on pre-programmed control to execute predefined commands. However, these approaches face significant limitations when confronted with unanticipated or non-predefined scenarios, such as performing various gestures or grasping different objects. To address this challenge, this paper proposes an embodied interaction control (EIC) framework for hand exoskeletons based on a multimodal large language model (MLLM). First, an embodied interaction method leveraging multi-modal fusion of speech and image information is developed, enabling more intuitive, hands-free, accurate, and robust human-robot interaction. By utilizing multi-modal data, the MLLM infers the user's hand motion intentions and generates corresponding motion plans for the exoskeleton. An underlying control strategy then executes the motion plan. Notably, leveraging the advanced reasoning and code-generation capabilities of MLLMs, the framework can generate undefined gestures and grasping actions. Finally, experimental results validate the effectiveness and generalizability of the EIC framework.

AAAI Conference 2025 Conference Paper

Enhancing Large Language Model Performance with Gradient-Based Parameter Selection

  • Haoling Li
  • Xin Zhang
  • Xiao Liu
  • Yeyun Gong
  • Yifan Wang
  • Qi Chen
  • Peng Cheng

Large language models (LLMs) have revolutionized numerous fields of research, driving significant advancements in natural language processing, machine translation, and beyond. Although their extensive number of parameters contributes substantially to this success, existing studies indicate that not all model parameters hold equal importance, which leads to redundancy during the parameter update process. Recent works on reducing redundant parameter updates for LLMs either lack task-specific data information, potentially leading to suboptimal model performance, or discard transformer components or insignificant parameters, limiting the model's scalability across different tasks and potentially compromising the LLM structure. To address these issues and further enhance the performance of LLMs, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters based on gradient information specific to the target tasks. Specifically, after calculating gradients during backpropagation, we measure their absolute values and mask those with small absolute values. Our empirical results across training paradigms such as SFT and DPO and various domains of tasks demonstrate that GMT not only preserves the original network structure but also enhances the potential performance of LLMs. Further analysis indicates that GMT is insensitive to the mask ratio and possesses computational efficiency comparable to the vanilla training approach.
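The masking step this abstract describes (measure gradient magnitudes, mask the small ones, update only the rest) can be sketched in NumPy; the function name, the `mask_ratio` parameter, and the thresholding details are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gradient_mask_update(weights, grads, lr=0.01, mask_ratio=0.5):
    """Update only the parameters whose gradient magnitude is largest.

    mask_ratio is the fraction of entries with the *smallest* absolute
    gradients that are masked out (left unchanged this step).
    """
    flat = np.abs(grads).ravel()
    k = int(len(flat) * mask_ratio)
    # k-th smallest absolute gradient serves as the keep threshold
    threshold = np.partition(flat, k)[k] if k < len(flat) else np.inf
    mask = np.abs(grads) >= threshold  # keep large-magnitude gradients
    return weights - lr * grads * mask
```

Because the mask only zeroes some gradient entries at update time, the network structure itself is untouched, which matches the abstract's claim that GMT preserves the original architecture.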

NeurIPS Conference 2025 Conference Paper

From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction

  • Zhida Zhao
  • Talas Fu
  • Yifan Wang
  • Lijun Wang
  • Huchuan Lu

Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic human-like anticipatory perception, yielding more reliable planning performance. To improve the efficiency of video forecasting, we further introduce a parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front-camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code will be released at https://github.com/6550Zhao/Policy-World-Model.

ICML Conference 2025 Conference Paper

From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications

  • Ajay Kumar Jaiswal
  • Yifan Wang
  • Lu Yin 0006
  • Shiwei Liu 0003
  • Runjin Chen
  • Jiawei Zhao
  • Ananth Grama
  • Yuandong Tian

Large Language Model (LLM) weight matrices can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike previous works that pivot around developing novel matrix decomposition algorithms, in this work we study the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspaces. First, we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. Second, we empirically establish a consequential relationship between the gradient dynamics and the low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient-dynamics perspective illustrates that LRCs tend to have better fine-tuning capabilities, and their standalone fine-tuning can closely mimic (and sometimes outperform) the training loss trajectory and performance of full fine-tuning with a notable reduction in memory and compute footprint. All code and checkpoints will be released.
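A minimal NumPy sketch of the underlying low-rank ideas (the spectral-energy criterion and both function names are illustrative assumptions, not WeLore's actual LRC/N-LRC categorization rule):

```python
import numpy as np

def low_rank_ratio(W, energy=0.9):
    """Smallest rank capturing `energy` fraction of the squared singular
    values, as a fraction of full rank (lower = more compressible)."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    return k / len(s)

def truncated_svd_compress(W, rank):
    """Rank-`rank` approximation of W via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```

A matrix with a small `low_rank_ratio` would play the role of an LRC here: truncating it loses little, so it is a good candidate for compression and low-rank fine-tuning, while matrices with ratios near 1 would be left intact.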

ECAI Conference 2025 Conference Paper

FTCFormer: Fuzzy Token Clustering Transformer for Image Classification

  • Muyi Bao
  • Changyu Zeng
  • Yifan Wang
  • Zhengni Yang
  • Zimu Wang
  • Guangliang Cheng
  • Jun Qi 0001
  • Wei Wang 0042

Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meaning of image regions and resulting in suboptimal feature representations. To address this issue, we propose the Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on semantic meaning instead of spatial position. It allocates fewer tokens to less informative regions and more tokens to semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline: gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets, and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.

ICRA Conference 2025 Conference Paper

Fusionsense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

  • Irving Fang
  • Kairui Shi
  • Xujin He
  • Siqi Tan
  • Yifan Wang
  • Hanwen Zhao
  • Hung-Jui Huang
  • Wenzhen Yuan 0001

Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and commonsense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning, and local geometric constraints. This advancement results in fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, enabling more downstream manipulation or navigation tasks. Experiments on real-world data suggest that our framework outperforms previous state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.

AAAI Conference 2025 Conference Paper

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

  • Junyi Chen
  • Weicai Ye
  • Yifan Wang
  • Danpeng Chen
  • Di Huang
  • Wanli Ouyang
  • Guofeng Zhang
  • Yu Qiao

3D Gaussian Splatting (3DGS) has shown promising performance in novel view synthesis. Previous methods adapt it to obtain surfaces of either individual 3D objects or limited scenes. In this paper, we make the first attempt to tackle the challenging task of large-scale scene surface reconstruction. This task is particularly difficult due to the high GPU memory consumption, the different levels of detail required for geometric representation, and noticeable inconsistencies in appearance. To this end, we propose GigaGS, the first work on high-quality surface reconstruction for large-scale scenes using 3DGS. GigaGS first applies a partitioning strategy based on the mutual visibility of spatial regions, which effectively groups cameras for parallel processing. To enhance surface quality, we also propose novel multi-view photometric and geometric consistency constraints based on a Level-of-Detail representation. In doing so, our method can reconstruct detailed surface structures. Comprehensive experiments are conducted on various datasets, and the consistent improvement demonstrates the superiority of GigaGS.

IJCAI Conference 2025 Conference Paper

Granular-Ball-Induced Multiple Kernel K-Means

  • Shuyin Xia
  • Yifan Wang
  • Lifeng Shen
  • Guoyin Wang

Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can make it difficult to accurately capture a dataset's inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, impairing their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit the data distribution with balls, from coarse to fine levels. Each ball can enclose data points based on a density consistency measurement. Such a ball-based data description thus improves computational efficiency and robustness to unknown noise. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.
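For readers unfamiliar with the baseline this paper improves on, the plain multiple-kernel K-means assignment step can be sketched in a few lines. Everything below (the toy RBF and linear kernels, the fixed equal weights, the two-anchor initialization) is an illustrative assumption for a two-cluster case, not the granular-ball method itself:

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    # Pairwise squared distances -> RBF Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def linear_kernel(X):
    return X @ X.T

def mkkm(kernels, weights, n_iter=10):
    """Two-cluster multi-kernel K-means sketch: fix the kernel weights,
    combine the Gram matrices, then assign each point by its kernel-space
    distance ||phi(x)-mu_c||^2 = K(x,x) - (2/|c|) sum_j K(x,j)
                                 + (1/|c|^2) sum_{j,l} K(j,l)."""
    K = sum(w * Km for w, Km in zip(weights, kernels))
    n = K.shape[0]
    diag = np.diag(K)
    # Deterministic init: point 0 plus the point farthest from it in kernel space.
    anchors = [0, int(np.argmax(diag + diag[0] - 2.0 * K[:, 0]))]
    labels = np.argmin([diag + diag[a] - 2.0 * K[:, a] for a in anchors], axis=0)
    for _ in range(n_iter):
        dist = np.empty((n, 2))
        for c in range(2):
            idx = np.flatnonzero(labels == c)
            dist[:, c] = (diag - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean()) if idx.size else np.inf
        labels = dist.argmin(axis=1)
    return labels

# Demo: two well-separated 2D blobs, equal kernel weights.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
labels = mkkm([rbf_kernel(X), linear_kernel(X)], [0.5, 0.5])
```

The point-to-point Gram matrices here are exactly the per-sample relationships whose cost and noise-sensitivity the granular-ball representation is designed to reduce.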

AAAI Conference 2025 Conference Paper

Mitigating Pervasive Modality Absence Through Multimodal Generalization and Refinement

  • Wuliang Huang
  • Yiqiang Chen
  • Xinlong Jiang
  • Chenlong Gao
  • Teng Zhang
  • Qian Chen
  • Yifan Wang

The performance of multimodal models often deteriorates when modality absence occurs. The absence disrupts the learned inter-modal correlations, resulting in biased multimodal representations. This challenge is especially pronounced when the absence is pervasive, affecting both the training and inference phases. Recent studies have attempted to reconstruct the missing information; however, most of them require complete supervision, which is seldom available in scenarios of pervasive absence. The quality of reconstruction remains a critical issue. Alternatively, others aim to learn robust representations from the available modalities but the substantial variations and biases are not fully addressed. This paper introduces the Multimodal Generalization and Refinement (MGR) framework to mitigate the issue of pervasive modality absence. MGR begins by acquiring generalized multimodal representations and iteratively refines them to recognize and calibrate the biased representations. Initially, multimodal samples with absence are embedded through foundation models, and MGR integrates independent unimodal features to further enhance generalization. Additionally, a novel mixed-context prompt is adopted to identify biases in both features and correlations. A redistribution operation can then refine these biases through graph pooling, culminating in robust and calibrated multimodal representations, which are suitable for downstream tasks. Comprehensive experiments on four benchmark datasets demonstrate that the proposed MGR framework outperforms state-of-the-art methods, effectively mitigating the impact of pervasive modality absence.

ICLR Conference 2025 Conference Paper

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

  • Qi Wu
  • Yubo Zhao
  • Yifan Wang
  • Xinhang Liu
  • Yu-Wing Tai
  • Chi-Keung Tang

While previous approaches to 3D human motion generation have achieved notable success, they often rely on extensive training and are limited to specific tasks. To address these challenges, we introduce Motion-Agent, an efficient conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text. This is accomplished by encoding and quantizing motions into discrete tokens that align with the language model's vocabulary. With only 1-3% of the model's parameters fine-tuned using adapters, MotionLLM delivers performance on par with diffusion models and other transformer-based methods trained from scratch. By integrating MotionLLM with GPT-4 without additional training, Motion-Agent is able to generate highly complex motion sequences through multi-turn conversations, a capability that previous models have struggled to achieve. Motion-Agent supports a wide range of motion-language tasks, offering versatile capabilities for generating and customizing human motion through interactive conversational exchanges.

JBHI Journal 2025 Journal Article

P2TC: A Lightweight Pyramid Pooling Transformer-CNN Network for Accurate 3D Whole Heart Segmentation

  • Hengfei Cui
  • Yifan Wang
  • Fan Zheng
  • Yan Li
  • Yanning Zhang
  • Yong Xia

Cardiovascular disease is a leading global cause of death, requiring accurate heart segmentation for diagnosis and surgical planning. Deep learning methods have been demonstrated to achieve superior performance in cardiac structure segmentation. However, there are still limitations in 3D whole heart segmentation, such as inadequate spatial context modeling, difficulty in capturing long-distance dependencies, high computational complexity, and limited representation of local high-level semantic information. To tackle the above problems, we propose a lightweight Pyramid Pooling Transformer-CNN (P2TC) network for accurate 3D whole heart segmentation. The proposed architecture comprises a dual encoder-decoder structure with a 3D pyramid pooling Transformer for multi-scale information fusion and a lightweight large-kernel Convolutional Neural Network (CNN) for local feature extraction. The decoder has two branches for precise segmentation and contextual residual handling. The first branch is used to generate segmentation masks for pixel-level classification based on the features extracted by the encoder to achieve accurate segmentation of cardiac structures. The second branch highlights contextual residuals across slices, enabling the network to better handle variations and boundaries. Extensive experimental results on the Multi-Modality Whole Heart Segmentation (MM-WHS) 2017 challenge dataset demonstrate that P2TC outperforms the most advanced methods, achieving Dice scores of 92.6% and 88.1% in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) modalities respectively, surpassing the baseline model by 1.5% and 1.7% and achieving state-of-the-art segmentation results.
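The Dice score quoted above is the standard overlap metric for segmentation; a minimal sketch of how it is computed for binary masks (the `eps` smoothing term for empty masks is an assumption, not taken from the paper):

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` overlap in one voxel out of two per mask, giving a Dice score of 0.5.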

IJCAI Conference 2025 Conference Paper

PALA: Class-imbalanced Graph Domain Adaptation via Prototype-anchored Learning and Alignment

  • Xin Ma
  • Yifan Wang
  • Siyu Yi
  • Wei Ju
  • Bei Wu
  • Ziyue Qiao
  • Chenwei Tang
  • Jiancheng Lv

Graph domain adaptation is a key subfield of graph transfer learning that aims to bridge domain gaps by transferring knowledge from a label-rich source graph to an unlabeled target graph. However, most existing methods assume balanced labels in the source graph, which often fails in practice and leads to biased knowledge transfer. To address this, in this paper, we propose a prototype-anchored learning and alignment framework for class-imbalanced graph domain adaptation. Specifically, we incorporate pointwise node mutual information into the graph encoder to capture high-order topological proximity and learn generalized node representations. Leveraging this, we then introduce categorical prototypes with adversarial proto-instances for prototype-anchored learning and recalibration to represent the source graph under an imbalanced class distribution. Finally, we introduce a weighted prototype contrastive adaptation strategy that aligns target pseudo-labels with source prototypes to handle class imbalance during adaptation. Extensive experiments show that our PALA outperforms the state-of-the-art methods. Our code is available at https://github.com/maxin88scu/PALA.

NeurIPS Conference 2025 Conference Paper

Towards Robust Uncertainty Calibration for Composed Image Retrieval

  • Yifan Wang
  • Wuliang Huang
  • Yufan Wen
  • Shunning Liu
  • Chun Yuan

The interactive task of composed image retrieval aims to retrieve the most relevant images with the bi-modal query, consisting of a reference image and a modification sentence. Despite significant efforts to bridge the heterogeneous gap within the bi-modal query and leverage contrastive learning to reduce the disparity between positive and negative triplets, prior methods often fail to ensure reliable matching due to aleatoric and epistemic uncertainty. Specifically, the aleatoric uncertainty stems from underlying semantic correlations within candidate instances and annotation noise, and the epistemic uncertainty is usually caused by overconfidence in dominant semantic categories. In this paper, we propose Robust UNcertainty Calibration (RUNC) to quantify the uncertainty and calibrate the imbalanced semantic distribution. To mitigate semantic ambiguity in similarity distribution between fusion queries and targets, RUNC maximizes the matching evidence by utilizing a high-order conjugate prior distribution to fit the semantic covariances in candidate samples. With the estimated uncertainty coefficient of each candidate, the target distribution is calibrated to encourage balanced semantic alignment. Additionally, we minimize the ambiguity in the fusion evidence when forming the unified query by incorporating orthogonal constraints on explicit textual embeddings and implicit queries, to reduce the representation redundancy. Extensive experiments and ablation analysis on benchmark datasets FashionIQ and CIRR verify the robustness of RUNC in predicting reliable retrieval results from a large image gallery.

NeurIPS Conference 2025 Conference Paper

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

  • Ruilin Luo
  • Zhuofan Zheng
  • Lei Wang
  • Yifan Wang
  • Xinzhe Ni
  • Zicheng Lin
  • Songtao Jiang
  • Yiyao Yu

Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (i) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (ii) a lack of automated methods for process labeling within multimodal contexts persists; (iii) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal pRocess-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks.

IJCAI Conference 2024 Conference Paper

A Survey of Data-Efficient Graph Learning

  • Wei Ju
  • Siyu Yi
  • Yifan Wang
  • Qingqing Long
  • Junyu Luo
  • Zhiping Xiao
  • Ming Zhang

Graph-structured data, prevalent in domains ranging from social networks to biochemical analysis, serve as the foundation for diverse real-world systems. While graph neural networks demonstrate proficiency in modeling this type of data, their success often relies on significant amounts of labeled data, posing a challenge in practical scenarios with limited annotation resources. To tackle this problem, tremendous efforts have been devoted to enhancing graph machine learning performance under low-resource settings by exploring various approaches to minimal supervision. In this paper, we introduce the novel concept of Data-Efficient Graph Learning (DEGL) as a research frontier, and present the first survey that summarizes the current progress of DEGL. We begin by highlighting the challenges inherent in training models that depend on large amounts of labeled data, paving the way for our exploration into DEGL. Next, we systematically review recent advances on this topic from several key aspects, including self-supervised graph learning, semi-supervised graph learning, and few-shot graph learning. We also outline promising directions for future research, contributing to the evolution of graph machine learning.

ICML Conference 2024 Conference Paper

A Theory of Fault-Tolerant Learning

  • Changlong Wu
  • Yifan Wang
  • Ananth Grama

Developing machine learning models that account for potential faults encountered in real-world environments presents a fundamental challenge for mission-critical applications. In this paper, we introduce a novel theoretical framework grounded in learning theory for dealing with faults. In particular, we propose a framework called fault-tolerant PAC learning, aimed at identifying the most fault-tolerant models from a given hypothesis class (such as neural networks). We show that if faults occur randomly, fault-tolerant learning is equivalent to regular PAC learning. However, for adversarial faults, we show that the sample complexity of fault-tolerant PAC learning can grow linearly w.r.t. the number of perturbing functions induced by the faults, even for a hypothesis class with VC-dimension 1. We then provide a matching upper bound by restricting the number of perturbing functions. Finally, we show that the linear dependency on the number of perturbing functions can be substantially improved for deletion faults in neural networks. Our work provides a powerful formal framework and avenues for a number of future investigations on the precise characterization of fault-tolerant learning.

ECAI Conference 2024 Conference Paper

B2MAPO: A Batch-by-Batch Multi-Agent Policy Optimization to Balance Performance and Efficiency

  • Wenjing Zhang
  • Wei Zhang 0331
  • Wenqing Hu
  • Yifan Wang

Most multi-agent reinforcement learning approaches adopt two types of policy optimization methods that update policies either simultaneously or sequentially. Simultaneously updating the policies of all agents introduces a non-stationarity problem. Although sequentially updating policies agent-by-agent in an appropriate order improves policy performance, it is prone to low efficiency due to sequential execution, resulting in longer model training and execution time. Intuitively, partitioning the policies of all agents according to their interdependence and updating the joint policy batch-by-batch can effectively balance performance and efficiency. However, determining the optimal batch partition of policies and the batch updating order are challenging problems. Firstly, a sequential batched policy updating scheme, B2MAPO (Batch by Batch Multi-Agent Policy Optimization), is proposed with a theoretical guarantee of a monotonically, incrementally tightened bound. Secondly, a universal modularized plug-and-play B2MAPO hierarchical framework, which satisfies the CTDE principle, is designed to conveniently integrate any MARL models and fully exploit and merge their merits, including policy optimality and inference efficiency. Next, a DAG-based B2MAPO algorithm is devised as a carefully designed implementation of the B2MAPO framework. The upper layer employs the PPO algorithm with an attention mechanism to reveal the interdependence between policies and generates DAGs of agents, which are used to produce the optimal batch sequence through topological sorting. The lower layer trains two joint policies in parallel and periodically minimizes the KL divergence between them. One joint policy is sequentially updated according to the B2MAPO scheme with the batch sequence; the other, derived joint policy is simultaneously updated by MAPPO. During decentralized execution, only the derived joint policy is used for decision-making. Comprehensive experimental results on the StarCraft II Multi-Agent Challenge and Google Research Football demonstrate that the DAG-based B2MAPO algorithm outperforms baseline methods. Meanwhile, compared with A2PO, our algorithm reduces model training and execution time by 60.4% and 78.7%, respectively.

AAAI Conference 2024 Conference Paper

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

  • Linglin Jing
  • Sheng Xu
  • Yifan Wang
  • Yuzhe Zhou
  • Tao Shen
  • Zhigang Ji
  • Hui Fang
  • Zhen Li

Accurate identification of protein nucleic-acid-binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that may ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address this issue, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues that exploits both protein geometric structure and sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge for better binding-residue prediction. Extensive experimental results demonstrate that our approach outperforms the next-best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8%/17.3% in terms of the harmonic mean of precision and recall (F1 score) and 11.9%/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind.
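The two metrics quoted (F1 and MCC) are fully determined by confusion-matrix counts; a small helper shows the standard formulas (the example counts in the test are hypothetical, not from the paper's datasets):

```python
import math

def f1_mcc(tp, fp, fn, tn):
    """F1 (harmonic mean of precision and recall) and Matthews
    correlation coefficient from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc
```

MCC is often preferred alongside F1 for residue prediction because it accounts for true negatives, which dominate when binding residues are rare.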

AAAI Conference 2024 Conference Paper

Deep Reinforcement Learning for Early Diagnosis of Lung Cancer

  • Yifan Wang
  • Qining Zhang
  • Lei Ying
  • Chuan Zhou

Lung cancer remains the leading cause of cancer-related death worldwide, and early diagnosis of lung cancer is critical for improving the survival rate of patients. Performing annual low-dose computed tomography (LDCT) screening among high-risk populations is the primary approach for early diagnosis. However, after each screening, whether to continue monitoring (with follow-up screenings) or to order a biopsy for diagnosis remains a challenging decision to make. Continuing with follow-up screenings may lead to delayed diagnosis but ordering a biopsy without sufficient evidence incurs unnecessary risk and cost. In this paper, we tackle the problem by an optimal stopping approach. Our proposed algorithm, called EarlyStop-RL, utilizes the structure of the Snell envelope for optimal stopping, and model-free deep reinforcement learning for making diagnosis decisions. Through evaluating our algorithm on a commonly used clinical trial dataset (the National Lung Screening Trial), we demonstrate that EarlyStop-RL has the potential to greatly enhance risk assessment and early diagnosis of lung cancer, surpassing the performance of two widely adopted clinical models, namely the Lung-RADS and the Brock model.
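The Snell envelope the abstract refers to is the classical value process for optimal stopping, computable by backward induction. The sketch below is a generic toy on a recombining binomial tree under assumed dynamics, not the authors' EarlyStop-RL algorithm:

```python
import numpy as np

def snell_envelope(payoff, p=0.5):
    """Backward induction on a recombining binomial tree.
    payoff[t] holds the immediate stopping reward at the t+1 nodes of
    level t; node i at level t moves to node i (down, prob 1-p) or
    node i+1 (up, prob p).  The Snell envelope satisfies
    U_T = g_T and U_t = max(g_t, E[U_{t+1}]), i.e. stop as soon as the
    immediate reward matches the best expected continuation."""
    T = len(payoff) - 1
    U = [np.asarray(g, dtype=float) for g in payoff]
    for t in range(T - 1, -1, -1):
        cont = p * U[t + 1][1:] + (1 - p) * U[t + 1][:-1]
        U[t] = np.maximum(U[t], cont)
    return U[0][0], U

# Toy example: undiscounted American put, price 100, up x1.1 / down x0.9,
# strike 100 -> stopping rewards per tree level.
payoff = [[0.0], [10.0, 0.0], [19.0, 1.0, 0.0]]
value, U = snell_envelope(payoff)
```

By construction the envelope dominates the immediate payoff at every node, which is the property that makes "continue monitoring vs. stop (biopsy)" decisions well-defined.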

AAAI Conference 2024 Conference Paper

DME: Unveiling the Bias for Better Generalized Monocular Depth Estimation

  • Songsong Yu
  • Yifan Wang
  • Yunzhi Zhuge
  • Lijun Wang
  • Huchuan Lu

This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in monocular depth estimation, indicating that the imbalanced depth distribution in training data may be the cause of limited generalization ability. Second, the imbalanced and long-tail distribution of depth values extends beyond the dataset scale, and also manifests within each individual image, further exacerbating the challenge of monocular depth estimation. Motivated by the above findings, we propose the Distance-aware Multi-Expert (DME) depth estimation model. Unlike prior methods that handle different depth ranges indiscriminately, DME adopts a divide-and-conquer philosophy where each expert is responsible for depth estimation of regions within a specific depth range. As such, the depth distribution seen by each expert is more uniform and can be more easily predicted. A pixel-level routing module is further designed and learned to stitch the predictions of all experts into the final depth map. Experiments show that DME achieves state-of-the-art performance on both NYU-Depth v2 and KITTI, and also delivers favorable zero-shot generalization capability on unseen datasets.

ICRA Conference 2024 Conference Paper

EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction

  • Irving Fang
  • Yuzhong Chen
  • Yifan Wang
  • Jianghan Zhang
  • Qiushi Zhang
  • Jiali Xu
  • Xibo He
  • Weibo Gao

A robot’s ability to anticipate the 3D action target location of a hand’s movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target’s 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.

NeurIPS Conference 2024 Conference Paper

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

  • Yichi Zhang
  • Yao Huang
  • Yitong Sun
  • Chang Liu
  • Zhe Zhao
  • Zhengwei Fang
  • Yifan Wang
  • Huanran Chen

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.

NeurIPS Conference 2024 Conference Paper

NeuRodin: A Two-stage Framework for High-Fidelity Neural Surface Reconstruction

  • Yifan Wang
  • Di Huang
  • Weicai Ye
  • Guofeng Zhang
  • Wanli Ouyang
  • Tong He

Signed Distance Function (SDF)-based volume rendering has demonstrated significant capabilities in surface reconstruction. Although promising, SDF-based methods often fail to capture detailed geometric structures, resulting in visible defects. By comparing SDF-based volume rendering to density-based volume rendering, we identify two main factors within the SDF-based approach that degrade surface quality: SDF-to-density representation and geometric regularization. These factors introduce challenges that hinder the optimization of the SDF field. To address these issues, we introduce NeuRodin, a novel two-stage neural surface reconstruction framework that not only achieves high-fidelity surface reconstruction but also retains the flexible optimization characteristics of density-based methods. NeuRodin incorporates innovative strategies that facilitate transformation of arbitrary topologies and reduce artifacts associated with density bias. Extensive evaluations on the Tanks and Temples and ScanNet++ datasets demonstrate the superiority of NeuRodin, showing strong reconstruction capabilities for both indoor and outdoor environments using solely posed RGB captures. Project website: https://open3dvlab.github.io/NeuRodin/

IROS Conference 2024 Conference Paper

Online Rotor Fault Detection and Isolation for Vertical Takeoff and Landing Vehicles

  • Jiaqi Lian
  • Neeraj Gandhi
  • Yifan Wang
  • Linh Thi Xuan Phan

Vertical take-off and landing (VTOL) vehicles are becoming increasingly popular for real-world transport, but, as with any vehicle, guaranteeing safety is both extremely critical and highly challenging due to issues like rotor faults. Existing fault detection and isolation (FDI) techniques usually focus on multirotor or fixed-wing systems rather than hybrid VTOLs. Since VTOLs have both rotors and ailerons, a fault in a rotor may be masked by the (correctly working) ailerons, making it much more difficult to detect. However, this masking only occurs when the ailerons are in use (e.g., during cruising), leaving takeoff and landing vulnerable to crashes. This paper presents an online rotor fault detection and isolation (FDI) method for VTOLs. The approach uses pose analysis and aileron command data to quickly and accurately identify the faulty rotor and to compute the severity of the fault. Our method works for hard-to-detect fault scenarios, such as small-severity faults that are masked during cruise flight but not during vertical motion. We evaluated our technique in a SITL PX4 simulation of a modified Deltaquad QuadPlane. The results show that our FDI technique can quickly detect and isolate faults in real time (within 1s-2.5s), achieve a high isolation success rate (91.67%) across six rotors, and estimate the severity of faults to within 2%. When applying a simple recovery process post-isolation, the system consistently achieved safe landing.
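As background, a common building block in FDI pipelines of this kind is residual thresholding: compare measured behavior against a model's expectation and flag sustained deviations. The window size, threshold, and reporting convention below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def detect_fault(measured, expected, window=5, threshold=0.5):
    """Return the index of the first sample at which the moving average
    of the absolute residual |measured - expected| exceeds `threshold`
    (index of the last sample in the first offending window), or -1
    when no window crosses it."""
    residual = np.abs(np.asarray(measured, float) - np.asarray(expected, float))
    smoothed = np.convolve(residual, np.ones(window) / window, mode="valid")
    hits = np.flatnonzero(smoothed > threshold)
    return int(hits[0]) + window - 1 if hits.size else -1
```

Averaging over a window trades detection latency for robustness to transient disturbances, the same latency-versus-false-alarm trade-off reflected in the 1s-2.5s detection times reported above.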

ICRA Conference 2024 Conference Paper

Quasi-static Path Planning for Continuum Robots By Sampling on Implicit Manifold

  • Yifan Wang
  • Yue Chen

Continuum robots (CR) offer excellent dexterity and compliance in contrast to rigid-link robots, making them suitable for navigating through, and interacting with, confined environments. However, the study of path planning for CRs while considering external elastic contact is limited. The challenge lies in the fact that CRs can have multiple possible configurations when in contact, rendering the forward kinematics not well-defined, and characterizing the set of feasible robot configurations is non-trivial. In this paper, we propose to perform quasi-static path planning on an implicit manifold. We model elastic obstacles as external potential fields and formulate the robot statics in the potential field as the extremal trajectory of an optimal control problem. We show that the set of stable robot configurations is a smooth manifold diffeomorphic to a submanifold embedded in the product space of the CR actuation and base internal wrench. We then propose to perform path planning on this manifold using AtlasRRT*, a sampling-based planner dedicated to planning on implicit manifolds. Simulations in different operation scenarios were conducted and the results show that the proposed planner outperforms Euclidean space planners in terms of success rate and computational efficiency.

IJCAI Conference 2024 Conference Paper

Rank and Align: Towards Effective Source-free Graph Domain Adaptation

  • Junyu Luo
  • Zhiping Xiao
  • Yifan Wang
  • Xiao Luo
  • Jingyang Yuan
  • Wei Ju
  • Langechuan Liu
  • Ming Zhang

Graph neural networks (GNNs) have achieved impressive performance in graph domain adaptation. However, extensive source graphs could be unavailable in real-world scenarios due to privacy and storage concerns. To this end, we investigate an underexplored yet practical problem of source-free graph domain adaptation, which transfers knowledge from source models instead of source graphs to a target domain. To solve this problem, we introduce a novel GNN-based approach called Rank and Align (RNA), which ranks graph similarities with spectral seriation for robust semantics learning, and aligns inharmonic graphs with harmonic graphs which are close to the source domain for subgraph extraction. In particular, to overcome label scarcity, we employ the spectral seriation algorithm to infer the robust pairwise rankings, which can guide semantic learning using a similarity learning objective. To depict distribution shifts, we utilize spectral clustering and the silhouette coefficient to detect harmonic graphs, which the source model can easily classify. To reduce potential domain discrepancy, we extract domain-invariant subgraphs from inharmonic graphs by an adversarial edge sampling process, which guides the invariant learning of GNNs. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed RNA.

AAAI Conference 2024 Conference Paper

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

  • Qianrui Zhou
  • Hua Xu
  • Hao Li
  • Hanlei Zhang
  • Xiaohan Zhang
  • Yifan Wang
  • Kai Gao

Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend a user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from the text, video and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and ground truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The codes are released at https://github.com/thuiar/TCL-MAP.

JBHI Journal 2023 Journal Article

An Improved Combination of Faster R-CNN and U-Net Network for Accurate Multi-Modality Whole Heart Segmentation

  • Hengfei Cui
  • Yifan Wang
  • Yan Li
  • Di Xu
  • Lei Jiang
  • Yong Xia
  • Yanning Zhang

Detailed information of substructures of the whole heart is usually vital in the diagnosis of cardiovascular diseases and in 3D modeling of the heart. Deep convolutional neural networks have been demonstrated to achieve state-of-the-art performance in 3D cardiac structures segmentation. However, when dealing with high-resolution 3D data, current methods employing tiling strategies usually degrade segmentation performance due to GPU memory constraints. This work develops a two-stage multi-modality whole heart segmentation strategy, which adopts an improved Combination of Faster R-CNN and 3D U-Net (CFUN+). More specifically, the bounding box of the heart is first detected by Faster R-CNN, and then the original Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) images of the heart aligned with the bounding box are input into 3D U-Net for segmentation. The proposed CFUN+ method redefines the bounding box loss function by replacing the previous Intersection over Union (IoU) loss with Complete Intersection over Union (CIoU) loss. Meanwhile, the integration of the edge loss makes the segmentation results more accurate, and also improves the convergence speed. The proposed method achieves an average Dice score of 91.1% on the Multi-Modality Whole Heart Segmentation (MM-WHS) 2017 challenge CT dataset, which is 5.2% higher than the baseline CFUN model, and achieves state-of-the-art segmentation results. In addition, the segmentation speed of a single heart has been dramatically improved from a few minutes to less than 6 seconds.

ICML Conference 2023 Conference Paper

Learning Functional Distributions with Private Labels

  • Changlong Wu
  • Yifan Wang
  • Ananth Grama
  • Wojciech Szpankowski

We study the problem of learning functional distributions in the presence of noise. A functional is a map from the space of features to distributions over a set of labels, and is often assumed to belong to a known class of hypotheses $\mathcal{F}$. Features are generated by a general random process and labels are sampled independently from feature-dependent distributions. In privacy sensitive applications, labels are passed through a noisy kernel. We consider online learning, where at each time step, a predictor attempts to predict the actual (label) distribution given only the features and noisy labels in prior steps. The performance of the predictor is measured by the expected KL-risk that compares the predicted distributions to the underlying truth. We show that the minimax expected KL-risk is of order $\tilde{\Theta}(\sqrt{T\log|\mathcal{F}|})$ for finite hypothesis class $\mathcal{F}$ and any non-trivial noise level. We then extend this result to general infinite classes via the concept of stochastic sequential covering and provide matching lower and upper bounds for a wide range of natural classes.

AAAI Conference 2022 Conference Paper

DisenCite: Graph-Based Disentangled Representation Learning for Context-Specific Citation Generation

  • Yifan Wang
  • Yiping Song
  • Shuai Li
  • Chaoran Cheng
  • Wei Ju
  • Ming Zhang
  • Sheng Wang

Citing and describing related literature are crucial to scientific writing. Many existing approaches show encouraging performance in citation recommendation, but are unable to accomplish the more challenging and onerous task of citation text generation. In this paper, we propose a novel disentangled representation based model DisenCite to automatically generate the citation text through integrating paper text and citation graph. A key novelty of our method compared with existing approaches is to generate context-specific citation text, empowering the generation of different types of citations for the same paper. In particular, we first build and make available a graph enhanced contextual citation dataset (GCite) with 25K edges in different types characterized by citation contained sections over 4.8K research papers. Based on this dataset, we encode each paper according to both textual contexts and structure information in the heterogeneous citation graph. The resulting paper representations are then disentangled by the mutual information regularization between this paper and its neighbors in graph. Extensive experiments demonstrate the superior performance of our method compared with state-of-the-art approaches. We further conduct ablation and case studies to confirm that the improvement of our method comes from generating the context-specific citation through incorporating the citation graph.

IJCAI Conference 2022 Conference Paper

Learnability of Competitive Threshold Models

  • Yifan Wang
  • Guangmo Tong

Modeling the spread of social contagions is central to various applications in social computing. In this paper, we study the learnability of the competitive threshold model from a theoretical perspective. We demonstrate how competitive threshold models can be seamlessly simulated by artificial neural networks with finite VC dimensions, which enables analytical sample complexity and generalization bounds. Based on the proposed hypothesis space, we design efficient algorithms under the empirical risk minimization scheme. The theoretical insights are finally translated into practical and explainable modeling methods, the effectiveness of which is verified through a sanity check over a few synthetic and real datasets. The experimental results promisingly show that our method enjoys a decent performance without using excessive data points, outperforming off-the-shelf methods.

JBHI Journal 2022 Journal Article

Learning From Highly Confident Samples for Automatic Knee Osteoarthritis Severity Assessment: Data From the Osteoarthritis Initiative

  • Yifan Wang
  • Zhaori Bi
  • Yuxue Xie
  • Tao Wu
  • Xuan Zeng
  • Shuang Chen
  • Dian Zhou

Knee osteoarthritis (OA) is a chronic disease that considerably reduces patients’ quality of life. Preventive therapies require early detection and lifetime monitoring of OA progression. In the clinical environment, the severity of OA is classified by the Kellgren and Lawrence (KL) grading system, ranging from KL-0 to KL-4. Recently, deep learning methods were applied to OA severity assessment to improve accuracy and efficiency. However, this task is still challenging due to the ambiguity between adjacent grades, especially in early-stage OA. Low-confidence samples, which are less representative than the typical ones, undermine the training process. Targeting the uncertainty in the OA dataset, we propose a novel learning scheme that dynamically separates the data into two sets according to their reliability. Besides, we design a hybrid loss function to help the CNN learn from the two sets accordingly. With the proposed approach, we emphasize the typical samples and control the impacts of low-confidence cases. Experiments are conducted in a five-fold manner on the five-class task and the early-stage OA task. Our method achieves a mean accuracy of 70.13% on the five-class OA assessment task, which outperforms all other state-of-the-art methods. Despite early-stage OA detection still benefiting from the human intervention of lesion region selection, our approach achieves superior performance on the KL-0 vs. KL-2 task. Moreover, we design an experiment to validate large-scale automatic data refining during training. The result verifies the ability to characterize low-confidence samples. The dataset used in this paper was obtained from the Osteoarthritis Initiative.

IJCAI Conference 2022 Conference Paper

TGNN: A Joint Semi-supervised Framework for Graph-level Classification

  • Wei Ju
  • Xiao Luo
  • Meng Qu
  • Yifan Wang
  • Chong Chen
  • Minghua Deng
  • Xian-Sheng Hua
  • Ming Zhang

This paper studies semi-supervised graph classification, a crucial task with a wide range of applications in social network analysis and bioinformatics. Recent works typically adopt graph neural networks to learn graph-level representations for classification, failing to explicitly leverage features derived from graph topology (e.g., paths). Moreover, when labeled data is scarce, these methods are far from satisfactory due to their insufficient topology exploration of unlabeled data. We address the challenge by proposing a novel semi-supervised framework called Twin Graph Neural Network (TGNN). To explore graph structural information from complementary views, our TGNN has a message passing module and a graph kernel module. To fully utilize unlabeled data, for each module, we calculate the similarity of each unlabeled graph to other labeled graphs in the memory bank and our consistency loss encourages consistency between two similarity distributions in different embedding spaces. The two twin modules collaborate with each other by exchanging instance similarity knowledge to fully explore the structure information of both labeled and unlabeled data. We evaluate our TGNN on various public datasets and show that it achieves strong performance.

JBHI Journal 2022 Journal Article

Unsupervised Cross-Modality Domain Adaptation Network for X-Ray to CT Registration

  • Shiqiang Zheng
  • Xin Yang
  • Yifan Wang
  • Mingyue Ding
  • Wenguang Hou

2D/3D registration that achieves high accuracy and real-time computation is one of the enabling technologies for radiotherapy and image-guided surgeries. Recently, the Convolutional Neural Network (CNN) has been explored to significantly improve the accuracy and efficiency of 2D/3D registration. A pair of intraoperative 2-D x-ray images and synthetic data from pre-operative volume are often required to model the nonconvex mappings between registration parameters and image residual. However, a large clinical dataset collection with accurate poses for x-ray images can be very challenging or even impractical, while exclusive training on synthetic data can frequently cause performance degradation when tested on x-rays. Thus, we propose to train a model on the source domain (i.e., synthetic data) to build the appearance-pose relationship first and then use an unsupervised cross-modality domain adaptation network (UCMDAN) to adapt the model to the target domain (i.e., X-rays) through adversarial learning. We propose to narrow the significant domain gap by alignment in both pixel and feature space. In particular, the image appearance transformation and domain-invariant feature learning from multiple aspects are conducted synergistically. Extensive experiments on CT and CBCT datasets show that the proposed UCMDAN outperforms the existing state-of-the-art domain adaptation approaches.

AAAI Conference 2022 Conference Paper

You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

  • Dezhuang Li
  • Ruoqi Li
  • Lijun Wang
  • Yifan Wang
  • Jinqing Qi
  • Lu Zhang
  • Ting Liu
  • Qingquan Xu

We present YOFO (You Only inFer Once), a new paradigm for referring video object segmentation (RVOS) that operates in a one-stage manner. Our key insight is that the language descriptor should serve as target-specific guidance to identify the target object, while a direct feature fusion of image and language can increase feature complexity and thus may be sub-optimal for RVOS. To this end, we propose a meta-transfer module, which is trained in a learning-to-learn fashion and aims to transfer the target-specific information from the language domain to the image domain, while discarding the uncorrelated complex variations of language description. To bridge the gap between the image and language domains, we develop a multi-scale cross-modal feature mining block that aggregates all the essential features required by RVOS from both domains and generates regression labels for the meta-transfer module. The whole system can be trained in an end-to-end manner and shows competitive performance against state-of-the-art two-stage approaches.

NeurIPS Conference 2018 Conference Paper

Dual Principal Component Pursuit: Improved Analysis and Efficient Algorithms

  • Zhihui Zhu
  • Yifan Wang
  • Daniel Robinson
  • Daniel Naiman
  • René Vidal
  • Manolis Tsakiris

Recent methods for learning a linear subspace from data corrupted by outliers are based on convex L1 and nuclear norm optimization and require the dimension of the subspace and the number of outliers to be sufficiently small [27]. In sharp contrast, the recently proposed Dual Principal Component Pursuit (DPCP) method [22] can provably handle subspaces of high dimension by solving a non-convex L1 optimization problem on the sphere. However, its geometric analysis is based on quantities that are difficult to interpret and are not amenable to statistical analysis. In this paper we provide a refined geometric analysis and a new statistical analysis that show that DPCP can tolerate as many outliers as the square of the number of inliers, thus improving upon other provably correct robust PCA methods. We also propose a scalable Projected Sub-Gradient Descent method (DPCP-PSGD) for solving the DPCP problem and show it admits linear convergence even though the underlying optimization problem is non-convex and non-smooth. Experiments on road plane detection from 3D point cloud data demonstrate that DPCP-PSGD can be more efficient than the traditional RANSAC algorithm, which is one of the most popular methods for such computer vision applications.

NeurIPS Conference 2017 Conference Paper

Identifying Outlier Arms in Multi-Armed Bandit

  • Honglei Zhuang
  • Chi Wang
  • Yifan Wang

We study a novel problem lying at the intersection of two areas: multi-armed bandit and outlier detection. Multi-armed bandit is a useful tool to model the process of incrementally collecting data for multiple objects in a decision space. Outlier detection is a powerful method to narrow down the attention to a few objects after the data for them are collected. However, no one has studied how to detect outlier objects while incrementally collecting data for them, which is necessary when data collection is expensive. We formalize this problem as identifying outlier arms in a multi-armed bandit. We propose two sampling strategies with theoretical guarantees, and analyze their sampling efficiency. Our experimental results on both synthetic and real data show that our solution saves 70-99% of data collection cost compared to the baseline while having nearly perfect accuracy.

NeurIPS Conference 2017 Conference Paper

MarrNet: 3D Shape Reconstruction via 2.5D Sketches

  • Jiajun Wu
  • Yifan Wang
  • Tianfan Xue
  • Xingyuan Sun
  • Bill Freeman
  • Josh Tenenbaum

3D object reconstruction from a single image is a highly under-determined problem, requiring strong prior knowledge of plausible 3D shapes. This introduces challenges for learning-based approaches, as 3D object annotations in real images are scarce. Previous work chose to train on synthetic data with ground truth 3D information, but suffered from the domain adaptation issue when tested on real data. In this work, we propose an end-to-end trainable framework, sequentially estimating 2.5D sketches and 3D object shapes. Our disentangled, two-step formulation has three advantages. First, compared to full 3D shape, 2.5D sketches are much easier to recover from a 2D image, and to transfer from synthetic to real data. Second, for 3D reconstruction from the 2.5D sketches, we can easily transfer the learned model on synthetic data to real images, as rendered 2.5D sketches are invariant to object appearance variations in real images, including lighting, texture, etc. This further relieves the domain adaptation problem. Third, we derive differentiable projective functions from 3D shape to 2.5D sketches, making the framework end-to-end trainable on real images, requiring no real-image annotations. Our framework achieves state-of-the-art performance on 3D shape reconstruction.