Arrow Research search

Author name cluster

Jie Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

93 papers
2 author rows

Possible papers (93)

AAAI 2026 · Conference Paper

AdaDepth: Exploiting Inherent Scene Information for Self-Supervised Depth Estimation in Dynamic Scenes

  • Xuanang Gao
  • Xiongbin Wu
  • Zhiwei Ning
  • Runze Yang
  • Zhonglong Zheng
  • Jie Yang
  • Wei Liu

Self-supervised monocular depth estimation methods severely compromise accuracy on dynamic objects due to their static-scene assumption. Existing approaches for dynamic scenes suffer from two critical shortcomings: 1) reliance on supervised segmentation models (requiring costly annotations) or computationally intensive multi-branch models to isolate moving objects, and 2) simple integration of 2D/3D motion flow without reliable supervision for dynamic objects. We propose AdaDepth, a two-stage framework that jointly performs unsupervised scene decomposition and dynamic-aware depth learning. In the initial structural stage, our geometry-motion joint scene decomposition (GMoDecomp) module ensures the robust generation of a depth prior and simultaneously partitions the scene into multiple regions through the fusion of geometric and motion cues. In the region-adaptive refinement stage, we exploit the depth prior and decomposed regions to introduce motion-aware and geometry-consistent constraints, effectively improving depth estimation in dynamic scenes. AdaDepth achieves accurate depth prediction in highly dynamic scenes without relying on external labels or specialized segmentation models. Extensive experiments on KITTI, Cityscapes, and Waymo Open demonstrate its superiority over state-of-the-art approaches.
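
The tension the abstract describes is that the standard photometric objective assumes a static scene. As a rough illustration only (not the authors' implementation), the sketch below shows a photometric reprojection loss that down-weights suspected dynamic regions with a mask; `region_mask` is a hypothetical stand-in for GMoDecomp's decomposed regions.

```python
import torch

def masked_photometric_loss(target, warped, region_mask):
    """L1 photometric reprojection loss with a dynamic-region mask.

    target, warped: (B, 3, H, W); `warped` is the source view synthesized
    into the target frame from predicted depth and camera pose.
    region_mask:    (B, 1, H, W) in [0, 1], ~0 over suspected moving
    objects (hypothetical stand-in for a scene decomposition).
    """
    l1 = (target - warped).abs().mean(dim=1, keepdim=True)  # (B, 1, H, W)
    # Masked mean: moving objects violate the static-scene assumption,
    # so their pixels contribute little to this loss term.
    return (region_mask * l1).sum() / region_mask.sum().clamp(min=1.0)
```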

AAAI 2026 · Conference Paper

Bridging the Modality Reliability Gap in Drug-Target Interaction Prediction via a Confidence-aware Multimodal Fusion Framework

  • Jie Yang
  • Junxiong Zhang
  • Kun Qian
  • Qingyu Yang
  • Weikai Li
  • Zhen Cheng

With the rapid advancement of deep learning, drug-target interaction (DTI) prediction has seen substantial performance gains. However, existing methodologies face a critical, yet unaddressed challenge: the Modality Reliability Gap. This gap arises from the unpredictable variance in the informativeness and reliability of 1D sequence versus 3D structural data across different drug-target pairs, critically limiting model robustness and domain generalization. To overcome it, we introduce DrugCMF, a novel drug-target interaction prediction method built on a Confidence-aware Multimodal Fusion framework designed specifically to bridge the Modality Reliability Gap. Specifically, DrugCMF employs a four-stage approach: (1) it extracts rich features by utilizing four pre-trained models to obtain token-level embeddings from both 1D sequences and 3D structures; (2) it preserves modality informativeness by independently learning interaction patterns within each modality through a Token-level Interaction module; (3) it explicitly quantifies the reliability gap by employing a novel confidence estimation mechanism to dynamically learn weights for each modality; and (4) it bridges the gap by using these confidence scores to guide a learnable cross-modal fusion module, adaptively fusing information from the most trustworthy source. By methodically addressing the Modality Reliability Gap, DrugCMF significantly outperforms SOTA methods.
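
To make stages (3) and (4) concrete, here is a minimal, hypothetical sketch of confidence-weighted fusion: a small head scores each modality embedding, and a softmax over those scores weights the fusion. The paper's actual confidence estimator and fusion module are more elaborate; this only illustrates the weighting principle.

```python
import torch
import torch.nn as nn

class ConfidenceFusion(nn.Module):
    """Toy confidence-aware fusion over per-modality embeddings."""

    def __init__(self, dim):
        super().__init__()
        self.conf_head = nn.Linear(dim, 1)  # scores one modality embedding

    def forward(self, embs):
        # embs: (B, M, D) with M modalities, e.g. 1D-sequence and
        # 3D-structure views of a drug-target pair.
        conf = self.conf_head(embs).squeeze(-1)       # (B, M) raw scores
        w = torch.softmax(conf, dim=-1)               # reliability weights
        fused = (w.unsqueeze(-1) * embs).sum(dim=1)   # (B, D) fused vector
        return fused, w
```

Returning the weights alongside the fused vector makes the per-pair modality reliance inspectable, which is in the spirit of the gap the paper quantifies.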

JBHI 2026 · Journal Article

Camera-Based Respiratory Imaging System for Monitoring Infant Thoracoabdominal Patterns of Respiration

  • Dongmin Huang
  • Yongshen Zeng
  • Yingen Zhu
  • Xiaoyan Song
  • Liping Pan
  • Jie Yang
  • Yanrong Wang
  • Hongzhou Lu

Existing respiratory monitoring techniques primarily focus on respiratory rate measurement, neglecting the potential of using thoracoabdominal patterns of respiration for infant lung health assessment. To bridge this gap, we exploit the unique spatial redundancy of a camera sensor to analyze infant thoracoabdominal respiratory motion. Specifically, we propose a camera-based respiratory imaging (CRI) system that utilizes optical flow to construct a spatio-temporal respiratory imager for comparing the infant chest and abdominal respiratory motion, and employs deep learning algorithms to identify infant abdominal, thoracoabdominal synchronous, and thoracoabdominal asynchronous patterns of respiration. To alleviate the challenges posed by limited clinical training data and subject variability, we introduce a novel multiple-expert contrastive learning (MECL) strategy to CRI. It enriches training samples by reversing and pairing different-class data, and promotes the representation consistency of same-class data through multi-expert collaborative optimization. Clinical validation involving 44 infants shows that MECL achieves 70% sensitivity and 80.21% specificity, which validates the feasibility of CRI for respiratory pattern recognition. This work investigates a novel video-based approach for assessing infant thoracoabdominal patterns of respiration, revealing a new value stream of video health monitoring in neonatal care.

AAAI 2026 · Conference Paper

Fair Graph Learning with Limited Sensitive Attribute Information

  • Zichong Wang
  • Jie Yang
  • Jun Zhuang
  • Puqing Jiang
  • Mingzhe Chen
  • Ye Hu
  • Wenbin Zhang

Graph neural networks (GNNs) excel at modeling graph-structured data but often inherit and amplify biases, leading to substantial efforts in developing fair GNNs. However, most existing approaches assume full access to sensitive attribute information, which is often impractical in real-world scenarios due to privacy concerns or risks of discrimination. To address this limitation, this paper focuses on graph fairness with limited sensitive attribute information, ensuring applicability to real-world contexts where current methods fall short. Specifically, we introduce an innovative fairness optimization strategy, propose a novel framework named FGLISA, and provide a theoretical perspective linking limited sensitive attribute information access to fairness objectives, thus enabling fair graph learning in real-world applications with limited sensitive attribute information. Experiments on diverse real-world datasets and tasks validate the effectiveness of our approach in achieving both fairness and predictive performance.

AAAI 2026 · Conference Paper

MACRec: A Multi-View Subspace Alignment Framework for Contrastive Sampling Calibration in Recommendation

  • Junping Liu
  • Mingchao Yu
  • Xinrong Hu
  • Rui Yan
  • Wanqing Li
  • Jie Yang
  • Yi Guo

Graph Contrastive Learning (GCL) has proven effective in mitigating data sparsity and enhancing representation learning for recommendation. Yet, most GCL frameworks indiscriminately treat all non-anchor nodes as negatives during contrastive sampling, often leading to the false negative problem where semantically similar nodes are incorrectly repelled. Previous attempts to mitigate this issue rely on predetermined heuristics or local neighborhood mining, which struggle to reliably identify false negatives. More critically, they often overlook authentic user-item interactions for anchoring sample relationships. To this end, this paper presents MACRec, a Multi-View subspace-Alignment framework designed to Calibrate contrastive sampling in GCL-based Recommendation. MACRec comprises three core components: (1) a Multi-View Affinity (MVA) module that captures consistent semantic relations across multiple augmentations via self-expression modeling; (2) a Cross-Subspace Alignment (CSA) mechanism that leverages authentic user-item behavioral interactions to enforce semantic consistency across user and item subspaces; and (3) a Calibration-based Contrastive Reweighting (CCR) strategy to dynamically down-weight potential false negatives during the contrastive learning process. Extensive experiments on three real-world benchmarks demonstrate that MACRec consistently improves performance across various augmentation backbones, achieving up to 14.55% relative gains.
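
The CCR idea of down-weighting suspected false negatives can be illustrated with a weighted InfoNCE. The sketch below assumes per-negative calibration weights are already available; how MACRec derives them from the MVA/CSA modules is not shown here.

```python
import torch
import torch.nn.functional as F

def reweighted_infonce(anchor, candidates, neg_weight, tau=0.2):
    """InfoNCE where each negative carries a calibration weight.

    anchor:     (B, D) anchor embeddings; candidates[:, 0] is the positive.
    candidates: (B, K+1, D) positive plus K negatives per anchor.
    neg_weight: (B, K) in [0, 1]; small values down-weight suspected
    false negatives so similar nodes are not forcefully repelled.
    """
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    sim = torch.einsum('bd,bkd->bk', anchor, candidates) / tau  # (B, K+1)
    logits = sim.exp()
    pos = logits[:, 0]
    neg = (neg_weight * logits[:, 1:]).sum(dim=-1)
    return (-(pos / (pos + neg)).log()).mean()
```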

AAAI 2026 · Conference Paper

Multi-View Clustering with Granularity-Aware Pseudo Supervision

  • Jie Yang
  • Cheng-You Lu
  • Zhongli Wang
  • Hsiang-Ting Chen
  • Guang-Kui Xu
  • Chenglong Zhang
  • Shuting Dong
  • Xinyan Liang

Modern multi-view clustering (MVC) is dominated by two paradigms: multi-view fusion and pseudo-label-guided learning. Pseudo-labeling methods can suffer from confirmation bias; their reliance on fixed-granularity supervision from an initial clustering can cause learned embeddings to drift from the data's true structure and lose discriminative power. Conversely, fusion methods excel at integrating information but often struggle to robustly differentiate between high-quality and noisy views, which can obscure final cluster boundaries and degrade performance. To address these complementary challenges, we propose GAPS (Granularity-Aware Pseudo Supervision), a novel MVC framework. GAPS introduces a granularity-aware supervision mechanism that generates a full hierarchy of pseudo-labels, enabling the selection of a supervision level that best aligns with the data's intrinsic multi-scale structure. Furthermore, to ensure a high-quality supervisory signal, it incorporates a reliability-aware view selection strategy using a novel Separation-Compactness Index (SCI) to identify and leverage the most informative view for pseudo-label generation. This dual approach ensures the supervisory signal is both structurally adaptive and derived from the most reliable source, leading to highly effective final representations. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and superiority of GAPS over other competitors.

JBHI 2026 · Journal Article

Neuro-BERT: Rethinking Masked Autoencoding for Self-Supervised Neurological Pretraining

  • Di Wu
  • Siyuan Li
  • Jie Yang
  • Mohamad Sawan

Deep learning associated with neurological signals is poised to drive major advancements in diverse fields such as medical diagnostics, neurorehabilitation, and brain-computer interfaces. The challenge in harnessing the full potential of these signals lies in the dependency on extensive, high-quality annotated data, which is often scarce and expensive to acquire, requiring specialized infrastructure and domain expertise. To address the appetite for data in deep learning, we present Neuro-BERT, a self-supervised pre-training framework for neurological signals based on masked autoencoding in the Fourier domain. The intuition behind our approach is simple: the frequency and phase distributions of neurological signals can reveal intricate neurological activities. We propose a novel pre-training task dubbed Fourier Inversion Prediction (FIP), which randomly masks out a portion of the input signal and then predicts the missing information using the Fourier inversion theorem. Pre-trained models can be used for various downstream tasks such as sleep stage classification and gesture recognition. Unlike contrastive methods, which rely strongly on carefully hand-crafted augmentations and Siamese structures, our approach works well with a simple transformer encoder and no augmentation requirements. By evaluating our method on several benchmark datasets, we show that Neuro-BERT improves downstream neurological-related tasks by a large margin.
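
A minimal sketch of what a Fourier-domain masked-prediction objective could look like, assuming the model regresses the full waveform from a masked input and is supervised on Fourier coefficients; the paper's exact FIP formulation may differ.

```python
import torch

def fip_loss(model, x, mask):
    """Sketch of a Fourier Inversion Prediction-style objective (assumed form).

    x:    (B, T) raw signal (e.g., one EEG channel).
    mask: (B, T) boolean, True where the signal is hidden from the model.
    The model sees the masked signal and predicts the full waveform;
    comparing Fourier coefficients penalizes errors in the frequency
    and phase content of the missing span directly.
    """
    pred = model(x.masked_fill(mask, 0.0))                 # (B, T)
    err = torch.fft.rfft(pred, dim=-1) - torch.fft.rfft(x, dim=-1)
    return (err.real ** 2 + err.imag ** 2).mean()
```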

JBHI 2026 · Journal Article

NeuroCLIP: A Multimodal Contrastive Learning Method for rTMS-treated Methamphetamine Addiction Analysis

  • Chengkai Wang
  • Di Wu
  • Yunsheng Liao
  • Wenyao Zheng
  • Ziyi Zeng
  • Xurong Gao
  • Hemmings Wu
  • Zhoule Zhu

Methamphetamine dependence poses a significant global health challenge, yet its assessment and the evaluation of treatments like repetitive transcranial magnetic stimulation (rTMS) frequently depend on subjective self-reports, which may introduce uncertainties. While objective neuroimaging modalities such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) offer alternatives, their individual limitations and the reliance on conventional, often hand-crafted, feature extraction can compromise the reliability of derived biomarkers. To overcome these limitations, we propose NeuroCLIP, a novel deep learning framework integrating simultaneously recorded EEG and fNIRS data through a progressive learning strategy. This approach offers a robust and trustworthy data-driven biomarker for methamphetamine addiction. Validation experiments show that NeuroCLIP significantly improves discrimination between methamphetamine-dependent individuals and healthy controls compared to models using EEG or fNIRS alone. Furthermore, the proposed framework facilitates objective, brain-based evaluation of rTMS treatment efficacy, demonstrating measurable shifts in neural patterns towards healthy control profiles after treatment. Critically, we establish the trustworthiness of the multimodal data-driven biomarker by showing its strong correlation with psychometrically validated craving scores. These findings suggest that the biomarker derived from EEG-fNIRS data via NeuroCLIP offers enhanced robustness and reliability over single-modality approaches, providing a valuable tool for addiction neuroscience research and potentially improving clinical assessments.
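
The name suggests a CLIP-style objective over paired EEG/fNIRS windows. The following symmetric InfoNCE is an assumed sketch of that pairing idea, not the paper's progressive-learning recipe.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(z_eeg, z_fnirs, tau=0.07):
    """Symmetric InfoNCE between paired EEG and fNIRS embeddings.

    z_eeg, z_fnirs: (B, D); row i of each comes from the same
    simultaneously recorded window (a positive pair), and all other
    rows in the batch serve as negatives.
    """
    z_eeg = F.normalize(z_eeg, dim=-1)
    z_fnirs = F.normalize(z_fnirs, dim=-1)
    logits = z_eeg @ z_fnirs.t() / tau            # (B, B) similarity
    target = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```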

JBHI 2026 · Journal Article

Order-Aware Deep Learning for Drug Combination Benefit Prediction in Cancer Cell Lines

  • Xuan Liu
  • Jian Zhang
  • Jie Yang
  • Shichao Liu
  • Wen Zhang

Drug combination therapy has exhibited favorable effects in treating cancer patients, with less toxicity and adverse reactions compared to monotherapy. To accelerate the discovery of therapeutic drug combinations, numerous computational methods have been developed to predict drug synergy in cancer cell lines, typically modeling the task as binary classification (synergistic vs. non-synergistic) or regression (continuous synergy scores). Yet, a recent study proposes categorizing drug combination benefits into multiple ordered classes (e.g., synergy, Bliss additivity, independent action) based on clinical activities, and suggests that drug combinations remain valuable if they reduce cancer cell viability, even without defined synergy. To distinguish various levels of combination benefits, we present a novel order-aware deep learning model, called OrderCombo. Specifically, OrderCombo extracts the drug representation via a pretrained chemical language model and the cell line representation via an omics-oriented linear network. Then, these representations are fused into a unified embedding for each drug-drug-cell line triplet, by leveraging a hybrid encoder that combines concatenation-based dependencies and attention-based interactions. Finally, an ordinal contrastive loss is designed to promote a discriminative embedding space and maintain class ordinality, thereby improving the predictions of drug combination benefits. We evaluate OrderCombo on a large-scale combination benefit dataset, and in silico results show that our method outperforms the state-of-the-art baselines in terms of prediction accuracy, while maintaining robust generalization to unseen drug pairs and cell lines. Substantial case studies further demonstrate OrderCombo's potential value in discovering novel anticancer drug combinations across different therapeutic levels.

AAAI 2026 · Conference Paper

SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data

  • Mingkun Yang
  • Ran Zhu
  • Qing Wang
  • Jie Yang

Split Federated Learning is a system-efficient federated learning paradigm that leverages the rich computing resources at a central server to train model partitions. Data heterogeneity across silos, however, presents a major challenge undermining the convergence speed and accuracy of the global model. This paper introduces Step-wise Momentum Fusion (SMoFi), an effective and lightweight framework that counteracts gradient divergence arising from data heterogeneity by synchronizing the momentum buffers across server-side optimizers. To control gradient divergence over the training process, we design a staleness-aware alignment mechanism that imposes constraints on gradient updates of the server-side submodel at each optimization step. Extensive validations on multiple real-world datasets show that SMoFi consistently improves global model accuracy (up to 7.1%) and convergence speed (up to 10.25x). Furthermore, SMoFi has a greater impact with more clients involved and deeper learning models, making it particularly suitable for model training in resource-constrained contexts.
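
A minimal sketch of the momentum-buffer synchronization idea for server-side torch.optim.SGD instances; the staleness-aware alignment mechanism the paper designs on top is not reproduced here.

```python
import torch

@torch.no_grad()
def fuse_momentum(optimizers, weights=None):
    """Average SGD momentum buffers across per-client server optimizers.

    Each server-side optimizer trains one client's submodel partition;
    replacing every momentum buffer with the (weighted) mean damps the
    gradient divergence caused by heterogeneous client data.
    """
    weights = weights or [1.0 / len(optimizers)] * len(optimizers)
    for g in range(len(optimizers[0].param_groups)):
        # Parameters are matched positionally: all optimizers mirror
        # the same server-side submodel architecture.
        for params in zip(*[opt.param_groups[g]['params'] for opt in optimizers]):
            bufs = [opt.state[p].get('momentum_buffer')
                    for opt, p in zip(optimizers, params)]
            if any(b is None for b in bufs):
                continue  # no step taken yet for this parameter
            fused = sum(w * b for w, b in zip(weights, bufs))
            for b in bufs:
                b.copy_(fused)
```

Calling `fuse_momentum` after each `optimizer.step()` is the "step-wise" part: the per-client update directions are pulled together at every optimization step rather than only at round boundaries.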

AAAI 2026 · Conference Paper

Trustworthy Classification for Complex Social Surveys: A Memory-Enhanced Hierarchical Framework with Calibrated Uncertainty

  • Zeqiang Wang
  • Rebecca Oldroyd
  • Yuqi Wang
  • Jiageng Wu
  • Jie Yang
  • Wei Wang
  • Nishanth R. Sastry
  • Jon Johnson

Automated classification of complex social survey questionnaires is crucial for large-scale social science research but faces significant reliability challenges due to intricate hierarchical label structures, severe class imbalance, semantic ambiguity, and incomplete data coverage. Conventional classification methods often struggle with these combined complexities, yielding results that lack trustworthiness. We introduce HOCM, a framework designed for trustworthy classification in complex, real-world taxonomies. It features two synergistic components: (1) memory-enhanced contrastive learning, tailored to learn robust representations from noisy, imbalanced data by leveraging quality-aware category memory banks; and (2) hierarchical uncertainty calibration, which enforces taxonomic consistency while providing reliable confidence estimates and identifying inputs falling outside well-represented known categories. Our evaluation on a large-scale, real-world social survey dataset—a challenging exemplar of our target problem class—demonstrates that HOCM maintains strong accuracy on known classes while effectively identifying uncertain cases, significantly boosting accuracy on confident predictions. Furthermore, it adeptly detects low-resource/unknown categories. HOCM provides a more reliable automated classification tool, enabling efficient expert review and enhancing the trustworthiness of analysis in domains with complex, hierarchical data.

IROS 2025 · Conference Paper

Automatic MILP Model Construction for Multi-Robot Task Allocation and Scheduling Based on Large Language Models

  • Mingming Peng
  • Zhendong Chen
  • Jie Yang
  • Jin Huang
  • Zhengqi Shi
  • Qihao Liu
  • Xinyu Li 0001
  • Liang Gao 0001

With the accelerated development of Industry 4.0, intelligent manufacturing systems increasingly require efficient task allocation and scheduling in multi-robot systems. However, existing methods rely on domain expertise and face challenges in adapting to dynamic production constraints. Additionally, enterprises have high privacy requirements for production scheduling data, which prevents the use of cloud-based large language models (LLMs) for solution development. To address these challenges, there is an urgent need for an automated modeling solution that meets data privacy requirements. This study proposes a knowledge-augmented mixed integer linear programming (MILP) automated formulation framework, integrating local LLMs with domain-specific knowledge bases to generate executable code from natural language descriptions automatically. The framework employs a knowledge-guided DeepSeek-R1-Distill-Qwen-32B model to extract complex spatiotemporal constraints (82% average accuracy) and leverages a supervised fine-tuned Qwen2.5-Coder-7B-Instruct model for efficient MILP code generation (90% average accuracy). Experimental results demonstrate that the framework successfully achieves automatic modeling in the aircraft skin manufacturing case while ensuring data privacy and computational efficiency. This research provides a low-barrier and highly reliable technical path for modeling in complex industrial scenarios.
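
For readers unfamiliar with the target formalism, here is a toy hand-written MILP for multi-robot task allocation in PuLP, the kind of executable model the framework aims to generate automatically from natural-language descriptions. The instance is illustrative only, not the paper's generated code.

```python
import pulp

# Toy instance: assign 4 tasks to 2 robots, minimizing the makespan.
dur = {('r1', 't1'): 4, ('r1', 't2'): 2, ('r1', 't3'): 3, ('r1', 't4'): 5,
       ('r2', 't1'): 3, ('r2', 't2'): 4, ('r2', 't3'): 2, ('r2', 't4'): 4}
robots = ['r1', 'r2']
tasks = ['t1', 't2', 't3', 't4']

prob = pulp.LpProblem('mrta', pulp.LpMinimize)
x = pulp.LpVariable.dicts('x', (robots, tasks), cat='Binary')
makespan = pulp.LpVariable('makespan', lowBound=0)
prob += makespan                                  # objective: min makespan
for t in tasks:                                   # each task done exactly once
    prob += pulp.lpSum(x[r][t] for r in robots) == 1
for r in robots:                                  # robot load bounds makespan
    prob += pulp.lpSum(dur[r, t] * x[r][t] for t in tasks) <= makespan

prob.solve(pulp.PULP_CBC_CMD(msg=False))
plan = {r: [t for t in tasks if x[r][t].value() > 0.5] for r in robots}
print(plan, pulp.value(makespan))
```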

AAAI 2025 · Conference Paper

CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs

  • Siyu Wang
  • Cailian Chen
  • Xinyi Le
  • Qimin Xu
  • Lei Xu
  • Yanzhou Zhang
  • Jie Yang

Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain, and storage costs are substantial. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method with spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.

AAAI 2025 · Conference Paper

Collaborative Similarity Fusion and Consistency Recovery for Incomplete Multi-view Clustering

  • Bingbing Jiang
  • Chenglong Zhang
  • Xinyan Liang
  • Peng Zhou
  • Jie Yang
  • Xingyu Wu
  • Junyi Guan
  • Weiping Ding

As partial samples are often absent in certain views, incomplete multi-view clustering has become a challenging task. To tackle data with missing views, current methods either utilize the data similarity relations to recover missing samples or primarily consider the available information of existing samples, typically facing some inherent limitations. Firstly, traditional solutions cannot fully explore the potential information contained in missing samples due to their omission strategy, leading to sub-optimal graphs. Moreover, most methods mainly focus on data recovery at the view level, ignoring the differences among available/missing samples in various views. To this end, we propose a collaborative Similarity Fusion and Consistency Recovery (SFCR) method, which resolves the incomplete multi-view clustering problem by learning a unified similarity graph and recovering missing samples with consistent structures. Specifically, to learn a reliable graph compatible across views, a novel view-to-sample fusion model is designed to adaptively coalesce the view-wise similarities among available samples, not only preserving the complementarity and consistency among views but also properly balancing different samples. Furthermore, the missing samples are effectively recovered under the guidance of the fused similarity graph, so as to maintain the consistent structure of recovered data across views. In this way, the similarity learning and the missing data recovery benefit from each other in a collaborative reinforcement manner. Meanwhile, SFCR can directly obtain the final clustering labels without additional post-processing. Extensive experiments demonstrate the effectiveness and superiority of SFCR.

AAAI 2025 · Conference Paper

Enhanced Density Peak Clustering for High-Dimensional Data

  • Zhongli Wang
  • Jie Yang
  • Junyi Guan
  • Chenglong Zhang
  • Xinyan Liang
  • Bingbing Jiang
  • Weiguo Sheng

As a foundational clustering paradigm, Density Peak Clustering (DPC) partitions samples into clusters based on their density peaks, garnering widespread attention. However, traditional DPC methods usually focus on high-density regions, neglecting representative peaks in relatively low-density areas, particularly in datasets with varying densities and multiple peaks. Moreover, existing DPC variants struggle to identify clusters correctly in high-dimensional spaces due to the indistinct distance differences among samples and sparse data distributions. Additionally, existing methods typically adopt a one-step label assignment strategy, making them prone to cascading errors when initial misassignments occur. To address these challenges, we propose an Enhanced Density Peak Clustering (EDPC) method, which creatively incorporates multilayer perceptron (MLP)-based dimensionality reduction and a hierarchical label assignment strategy to significantly improve clustering performance in high-dimensional scenarios. Specifically, we introduce an effective selection condition that combines average densities and density-related distances to generate potential cluster centers, ensuring that peaks across different density regions are considered simultaneously. Furthermore, an MLP, guided by pseudo-labels from sub-clusters, is designed to learn low-dimensional embeddings for high-dimensional data, preserving data locality while enhancing clusterability. Extensive experiments demonstrate the effectiveness and superiority of EDPC against state-of-the-art DPC methods.
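
For context, these are the two quantities classic DPC computes per sample, on which EDPC's center-selection condition builds; a numpy sketch (the MLP-based embedding and hierarchical assignment are not shown).

```python
import numpy as np

def density_peaks(X, dc):
    """Classic DPC quantities: local density rho and distance delta.

    rho_i counts neighbors within cutoff dc; delta_i is the distance to
    the nearest point of strictly higher density. Cluster centers are
    points with both large rho and large delta.
    """
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # (n, n) distances
    rho = (d < dc).sum(axis=1) - 1                         # exclude self
    delta = np.zeros(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        # Highest-density point gets the maximal distance by convention.
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta
```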

NeurIPS 2025 · Conference Paper

Flick: Empowering Federated Learning with Commonsense Knowledge

  • Ran Zhu
  • Mingkun Yang
  • Shiqiang Wang
  • Jie Yang
  • Qing Wang

Federated Learning (FL) has emerged as a privacy-preserving framework for training models on data generated at the edge. However, the heterogeneity of data silos (e.g., label skew and domain shift) often leads to inconsistent learning objectives and suboptimal model performance. Inspired by the data-driven approach, we propose Flick, a novel data generation framework for heterogeneous Federated Learning with Commonsense Knowledge from Large Language Models (LLMs). In Flick, the client performs the local data summary to capture client-specific knowledge in textual form. The central server then distills task-relevant, high-quality knowledge from the out-of-the-box LLM, guided by cross-client-specific insights, to generate informative text prompts. These prompts direct a generative model in producing synthetic data, enabling global model fine-tuning and local data compensation. This process gradually aligns the label and feature distributions across clients. Extensive results on three datasets demonstrate that Flick improves the global model accuracy by up to 11.43%, and accelerates convergence by up to 12.9×, validating its effectiveness in addressing data heterogeneity.

ICLR 2025 · Conference Paper

From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks

  • Jie Yang
  • Yuwen Wang
  • Kaixuan Chen 0004
  • Tongya Zheng
  • Yihe Zhou
  • Zhenbang Xiao
  • Ji Cao 0001
  • Mingli Song

Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering competitive prediction performance comparable to state-of-the-art counterparts.

AAAI 2025 · Conference Paper

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

  • Yirui Chen
  • Xudong Huang
  • Quan Zhang
  • Wei Li
  • Mingjian Zhu
  • Qiangyu Yan
  • Simiao Li
  • Hanting Chen

The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation has made the IMDL task hard to advance. In this paper, we build a local manipulation data generation pipeline that integrates the powerful capabilities of SAM, LLM, and generative models. Upon this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale, GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content, GIM encompasses a broad range of image classes. 3) Diverse generative manipulation, the images are manipulated with state-of-the-art generators and across various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce the GIM benchmark with two settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.

NeurIPS 2025 · Conference Paper

Glocal Information Bottleneck for Time Series Imputation

  • Jie Yang
  • Kexin Zhang
  • Guibin Zhang
  • Philip S Yu
  • Kaize Ding

Time Series Imputation (TSI), which aims to recover missing values in temporal data, remains a fundamental challenge due to the complex and often high-rate missingness in real-world scenarios. Existing models typically optimize the point-wise reconstruction loss, focusing on recovering numerical values (local information). However, we observe that under high missing rates, these models still perform well in the training phase yet produce poor imputations and distorted latent representation distributions (global information) in the inference phase. This reveals a critical optimization dilemma: current objectives lack global guidance, leading models to overfit local noise and fail to capture global information of the data. To address this issue, we propose a new training paradigm, Glocal Information Bottleneck (Glocal-IB). Glocal-IB is model-agnostic and extends the standard IB framework by introducing a Global Alignment loss, derived from a tractable mutual information approximation. This loss aligns the latent representations of masked inputs with those of their originally observed counterparts. It helps the model retain global structure and local details while suppressing noise caused by missing values, giving rise to better generalization under high missingness. Extensive experiments on nine datasets confirm that Glocal-IB leads to consistently improved performance and aligned latent representations under missingness. Our code implementation is available at https://github.com/Muyiiiii/NeurIPS-25-Glocal-IB.
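
An assumed-form sketch of combining the usual point-wise reconstruction loss with a global alignment term that pulls the latent of the masked series toward that of the fully observed one. The paper derives its alignment term from a mutual-information approximation; the cosine version below is a simple stand-in.

```python
import torch
import torch.nn.functional as F

def glocal_objective(encoder, decoder, x_full, x_masked, mask, lam=1.0):
    """Reconstruction + global-alignment objective (assumed form).

    x_full:   (B, T, C) fully observed training series.
    x_masked: x_full with entries zeroed where mask is True.
    mask:     (B, T, C) boolean, True = artificially hidden entry.
    """
    z_full = encoder(x_full)          # (B, D) latent of the clean series
    z_masked = encoder(x_masked)      # (B, D) latent under missingness
    recon = decoder(z_masked)         # (B, T, C) imputed series
    local = ((recon - x_full) ** 2)[mask].mean()   # recover hidden values
    # Global term: keep the masked-input representation close to the
    # clean one, preserving global structure under high missing rates.
    align = 1.0 - F.cosine_similarity(z_masked, z_full.detach(), dim=-1).mean()
    return local + lam * align
```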

AAAI 2025 · Conference Paper

Holistic Semantic Representation for Navigational Trajectory Generation

  • Ji Cao
  • Tongya Zheng
  • Qinghong Guo
  • Yu Wang
  • Junshu Dai
  • Shunyu Liu
  • Jie Yang
  • Jie Song

Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.

JBHI 2025 · Journal Article

Memory-Efficient Intrinsic Gating Adaptation for Enhanced On-Device Epilepsy Diagnosis

  • Shanjin Li
  • Di Wu
  • Shiqi Zhao
  • Jie Yang
  • Mohamad Sawan

Recently, advances in neuroscience and the rise of artificial intelligence have significantly enhanced the capabilities of epilepsy diagnosis. While EEG-based diagnosis offers a promising avenue for detecting and predicting seizure activity, practical implementation in real-world scenarios remains hindered by the heterogeneity of epilepsy and the variability of patient-specific biomarkers over time. Conventional deep learning models, trained on historical EEG, often fail to adapt to such biomarker variations, leading to degraded performance. Moreover, the computational and memory constraints of edge devices further exacerbate the challenge of on-device learning. To address these challenges, we introduce a novel framework, Memory-Efficient Intrinsic Gating Adaptation (MEIGA), designed to enhance real-world epilepsy diagnosis on resource-constrained edge devices. Our approach pre-trains a model using historical EEG data and employs lightweight adapter networks for efficient on-device tuning across new sessions, addressing session-to-session variability. By leveraging Direct Feedback Alignment (DFA), MEIGA reduces memory usage and computational overhead while maintaining high classification accuracy. Extensive experiments on the CHB-MIT epilepsy dataset demonstrate that MEIGA outperforms the pretrained-only Vision Transformer baseline, raising seizure prediction accuracy from 47.88% to 86.77% with only 3,908 tunable parameters (5.05% of the backbone). For seizure detection, MEIGA improves accuracy from 85.06% to 96.29% by adapting 2,008 parameters (17.40% of the base architecture). Further experiments on the AES dataset demonstrate that MEIGA consistently delivers strong performance across subjects and scales effectively to larger networks.
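
The memory saving comes from Direct Feedback Alignment, which replaces the transposed-weight backward pass with fixed random feedback. Below is a self-contained numpy sketch of one DFA update for a two-layer network; MEIGA applies the idea to adapter tuning, which is not shown here.

```python
import numpy as np

def dfa_step(x, y, W1, W2, B1, lr=0.01):
    """One Direct Feedback Alignment update for a 2-layer network.

    x: (B, n_in) inputs, y: (B, n_out) targets.
    W1: (n_in, n_hid), W2: (n_hid, n_out) trainable weights.
    B1: (n_out, n_hid) FIXED random feedback matrix: the output error
    reaches the hidden layer through B1 instead of W2.T, so no
    transposed weights (and far less memory) are needed on-device.
    """
    h = np.tanh(x @ W1)                  # hidden activations
    out = h @ W2                         # linear readout (squared error)
    e = out - y                          # output error
    dW2 = h.T @ e                        # readout update (local)
    dh = (e @ B1) * (1.0 - h ** 2)       # random feedback, not backprop
    dW1 = x.T @ dh
    W1 -= lr * dW1
    W2 -= lr * dW2
    return (e ** 2).mean()
```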

IJCAI 2025 · Conference Paper

Multi-view Clustering via Multi-granularity Ensemble

  • Jie Yang
  • Wei Chen
  • Feng Liu
  • Peng Zhou
  • Zhongli Wang
  • Xinyan Liang
  • Bingbing Jiang

Multi-view clustering aims to integrate complementary information from multiple views to improve clustering performance. However, existing ensemble-based methods suffer from information loss due to their reliance on single-granularity labels, limiting the discriminative capability of learned representations. Meanwhile, representation and graph fusion-based approaches face challenges such as explicit view alignment and manual weight tuning, making them less effective for heterogeneous views with varying data distributions. To address these limitations, we propose a novel multi-view clustering framework via Multi-granularity Ensemble (MGE), fully using the multi-granularity information across diverse views for accurate and consistent clustering. Specifically, MGE first modifies the hierarchical clustering and then leverages it on each view (including the fused view) to achieve multi-granularity labels. Moreover, the cross-view and cross-granularity fusion strategy is designed to learn a robust co-association similarity matrix, which effectively preserves the fine-grained and coarse-grained structures of multi-view data and facilitates subsequent clustering. Therefore, MGE can provide a comprehensive representation of local and global patterns within data, eliminating the requirement for view alignment and weight tuning. Experiments demonstrate that MGE consistently outperforms state-of-the-art methods across multiple datasets, validating its effectiveness and superiority in handling heterogeneous views.
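
The ensemble's backbone is the co-association matrix. The sketch below shows the standard construction from multiple labelings (e.g., hierarchy cuts at several granularities on several views); MGE's specific cross-view and cross-granularity weighting is not reproduced.

```python
import numpy as np

def co_association(labelings, weights=None):
    """Build a co-association similarity from multiple clusterings.

    labelings: list of (n,) integer label arrays.
    Returns S where S[i, j] is the (weighted) fraction of clusterings
    that place samples i and j in the same cluster; a final clustering
    on S yields the consensus partition.
    """
    n = len(labelings[0])
    weights = weights or [1.0 / len(labelings)] * len(labelings)
    S = np.zeros((n, n))
    for w, lab in zip(weights, labelings):
        S += w * (lab[:, None] == lab[None, :])   # same-cluster indicator
    return S
```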

AAAI 2025 · Conference Paper

Revisiting Interpolation for Noisy Label Correction

  • Yuanzhuo Xu
  • Xiaoguang Niu
  • Jie Yang
  • Ruiyi Su
  • Jian Zhang
  • Shubo Liu
  • Steve Drew

Label correction methods are popular for their simple architecture in learning with noisy labels. However, they suffer severely from false label correction and achieve subpar performance compared with state-of-the-art methods. In this paper, we revisit the label correction methods through theoretical analysis of gradient scaling and demonstrate that the sample-wise dynamic and class-wise uniformity of interpolation weight prevents memorization of the mislabeled samples. We then propose DULC, a simple yet effective label correction method that uses the normalized Jensen-Shannon divergence (JSD) metric as the interpolation weight to promote sample-wise dynamic and class-wise uniformity. Additionally, we provide theoretical evidence that sharpening predictions in label correction facilitates the memorization of the true class, and we achieve it by employing the augmentation strategy along with the sharpening function. Extensive experiments on CIFAR-10, CIFAR-100, TinyImageNet, WebVision and Clothing1M datasets demonstrate substantial improvements over state-of-the-art methods.
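
A sketch of the central interpolation step under one plausible reading: the normalized JSD between the model's prediction and the given label acts as the interpolation weight, so samples where the two disagree lean toward the prediction. The direction of the weighting is an assumption here, and the sharpening and augmentation the paper adds are omitted.

```python
import math
import torch

def jsd_interpolation(pred, given_onehot):
    """Correct labels by interpolating with a normalized-JSD weight.

    pred:         (B, C) softmax predictions.
    given_onehot: (B, C) possibly noisy one-hot labels.
    JSD is bounded by log(2), so w lies in [0, 1]: agreeing samples
    keep their labels, strongly disagreeing ones follow the model.
    """
    m = 0.5 * (pred + given_onehot)
    def kl(p, q):
        return (p * (p.clamp_min(1e-8) / q.clamp_min(1e-8)).log()).sum(-1)
    jsd = 0.5 * kl(pred, m) + 0.5 * kl(given_onehot, m)
    w = (jsd / math.log(2.0)).clamp(0.0, 1.0).unsqueeze(-1)
    return (1.0 - w) * given_onehot + w * pred   # corrected soft label
```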

JBHI 2025 · Journal Article

SLoRD: Structural Low-Rank Descriptors for Shape Consistency in Vertebrae Segmentation

  • Xin You
  • Yixin Lou
  • Minghui Zhang
  • Jie Yang
  • Yun Gu

Automatic and precise multi-class vertebrae segmentation from CT images is crucial for various clinical applications. However, due to similar appearances between adjacent vertebrae and the existence of various pathologies, existing single-stage and multi-stage methods suffer from imprecise vertebrae segmentation. Essentially, these methods fail to explicitly impose both contour precision and intra-vertebrae voxel consistency constraints synchronously, resulting in the intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. In this work, we intend to label complete binary masks with sequential indices to address that challenge. Specifically, a contour generation network is proposed based on Structural Low-Rank Descriptors for shape consistency, termed SLoRD. For a structural representation of vertebral contours, we adopt the spherical coordinate system and devise the spherical centroid to calculate contour descriptors. Due to vertebrae’s similar appearances, basic contour descriptors can be acquired offline to restore original contours. Therefore, SLoRD leverages these contour priors and explicit shape constraints to facilitate regressed contour points close to vertebral surfaces. Quantitative and qualitative evaluations on VerSe 2019 and 2020 demonstrate the superior performance of our framework over other single-stage and multi-stage state-of-the-art (SOTA) methods. Further, SLoRD is a plug-and-play framework to refine the segmentation inconsistency existing in coarse predictions from other approaches.

ICRA 2025 · Conference Paper

Unlock the Power of Unlabeled Data in Language Driving Model

  • Chaoqun Wang 0012
  • Jie Yang
  • Xiaobin Hong 0002
  • Ruimao Zhang

Recent Vision-based Large Language Models (VisionLLMs) for autonomous driving have seen rapid advancements. However, such progress depends heavily on large-scale high-quality annotated data, which is costly and labor-intensive. To address this issue, we propose unlocking the value of abundant yet unlabeled data to improve the language-driving model in a semi-supervised learning manner. Specifically, we first introduce a series of template-based prompts to extract scene information, generating questions that create pseudo-answers for the unlabeled data based on a model trained with limited labeled data. Next, we propose a Self-Consistency Refinement method to improve the quality of these pseudo-annotations, which are later used for further training. By utilizing a pre-trained VisionLLM (e.g., InternVL), we build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods. Extensive experiments on the DriveLM benchmark show that our approach performs well with just 5% labeled data, achieving competitive performance against models trained with full datasets. In particular, our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.

NeurIPS 2025 · Conference Paper

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

  • Li Kang
  • Xiufeng Song
  • Heng Zhou
  • Yiran Qin
  • Jie Yang
  • Xiaohong Liu
  • Philip Torr
  • Lei Bai

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

IROS 2024 · Conference Paper

Bayesian Deep Predictive Coding for Snake-like Robotic Control in Unknown Terrains

  • William Ziming Qu
  • Jessica Ziyu Qu
  • Li Li
  • Jie Yang
  • Yuanyuan Jia

Effectively modeling the spatio-temporal interactions both internally and externally is a challenge in controlling multi-linked snake robots. This paper presents an effective method based on deep predictive coding, SnakeFormer, to address the aforementioned issue. The main contributions include: 1) Deriving a variational free energy function with two innovative regularization terms through Bayesian probabilistic analysis, offering a novel perspective on simulating the interactions between the agent and the environment; 2) Introducing an interaction-attention model within a Transformer structure for predicting dynamics, and collaboratively addressing path planning and obstacle avoidance tasks; 3) Improving gait stability and motion efficiency by incorporating serpenoid embedding and optimizing self-attention computations. Preliminary experiments and comparative analysis with baseline models fully validate the effectiveness and generalizability of the method.
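
For readers unfamiliar with the serpenoid prior mentioned in contribution 3, the classic gait assigns each joint a phase-shifted sinusoid, producing the travelling body wave snake robots use to locomote. A small sketch follows; how SnakeFormer embeds this prior is not shown.

```python
import numpy as np

def serpenoid_angles(n_joints, t, A=0.6, omega=2.0, beta=0.5, gamma=0.0):
    """Joint angles of the classic serpenoid gait at time t.

    A:     wave amplitude (rad)       omega: temporal frequency (rad/s)
    beta:  phase shift between joints gamma: turning bias offset
    Returns an (n_joints,) array of commanded joint angles.
    """
    i = np.arange(n_joints)
    return A * np.sin(omega * t + beta * i) + gamma
```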

IJCAI 2024 · Conference Paper

Efficient Multi-view Unsupervised Feature Selection with Adaptive Structure Learning and Inference

  • Chenglong Zhang
  • Yang Fang
  • Xinyan Liang
  • Han Zhang
  • Peng Zhou
  • Xingyu Wu
  • Jie Yang
  • Bingbing Jiang

As data with diverse representations become high-dimensional, multi-view unsupervised feature selection has been an important learning paradigm. Generally, existing methods encounter the following challenges: (i) traditional solutions either concatenate different views or introduce extra parameters to weight them, affecting the performance and applicability; (ii) emphasis is typically placed on graph construction, yet disregarding the clustering information of data; (iii) exploring the similarity structure of all samples from the original features is suboptimal and extremely time-consuming. To solve this dilemma, we propose an efficient multi-view unsupervised feature selection (EMUFS) to construct bipartite graphs between samples and anchors. Specifically, a parameter-free manner is devised to collaboratively fuse the membership matrices and graphs to learn the compatible structure information across all views, naturally balancing different views. Moreover, EMUFS leverages the similarity relations of data in the feature subspace induced by the l2,0-norm to dynamically update the graph. Accordingly, the cluster information of anchors can be accurately propagated to samples via the graph structure and further guide feature selection, enhancing the quality of selected features while reducing the computational cost of the solution process. A convergent optimization is developed to solve the formulated problem, and experiments demonstrate the effectiveness and efficiency of EMUFS.

AAAI 2024 · Conference Paper

Fair Graph Learning Using Constraint-Aware Priority Adjustment and Graph Masking in River Networks

  • Erhu He
  • Yiqun Xie
  • Alexander Sun
  • Jacob Zwart
  • Jie Yang
  • Zhenong Jin
  • Yang Wang
  • Hassan Karimi

Accurate prediction of water quality and quantity is crucial for sustainable development and human well-being. However, existing data-driven methods often suffer from spatial biases in model performance due to heterogeneous data, limited observations, and noisy sensor data. To overcome these challenges, we propose Fair-Graph, a novel graph-based recurrent neural network that leverages interrelated knowledge from multiple rivers to predict water flow and temperature within large-scale stream networks. Additionally, we introduce node-specific graph masks for information aggregation and adaptation to enhance prediction over heterogeneous river segments. To reduce performance disparities across river segments, we introduce a centralized coordination strategy that adjusts training priorities for segments. We evaluate the prediction of water temperature within the Delaware River Basin, and the prediction of streamflow using simulated data from the U.S. National Water Model in the Houston River network. The results showcase improvements in predictive performance and highlight the proposed model's ability to maintain spatial fairness over different river segments.

JBHI 2024 · Journal Article

Generalized Camera-Based Infant Sleep-Wake Monitoring in NICUs: A Multi-Center Clinical Trial

  • Dongmin Huang
  • Dongfang Yu
  • Yongshen Zeng
  • Xiaoyan Song
  • Liping Pan
  • Junli He
  • Lirong Ren
  • Jie Yang

Infant sleep-wake behavior is an essential indicator of physiological and neurological system maturity, whose circadian transition is important for evaluating the recovery of preterm infants from inadequate physiological function and cognitive disorders. Recently, camera-based infant sleep-wake monitoring has been investigated, but the challenges of generalization caused by variance in infants and clinical environments have not been addressed for this application. In this paper, we conducted a multi-center clinical trial at four hospitals to improve the generalization of camera-based infant sleep-wake monitoring. Using the face videos of 64 term and 39 preterm infants recorded in NICUs, we proposed a novel sleep-wake classification strategy, called consistent deep representation constraint (CDRC), that forces the convolutional neural network (CNN) to make consistent predictions for samples from different conditions but with the same label, to address the variances caused by infants and environments. The clinical validation shows that by using CDRC, all CNN backbones obtain over 85% accuracy, sensitivity, and specificity in both the cross-age and cross-environment experiments, improving on their counterparts without CDRC by almost 15% in all metrics. This demonstrates that by improving the consistency of the deep representation of samples with the same state, we can significantly improve the generalization of infant sleep-wake classification.

IJCAI 2024 · Conference Paper

Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds

  • Jiageng Wu
  • Xian Wu
  • Jie Yang

Clinical reasoning refers to the cognitive process that physicians employ in evaluating and managing patients. This process typically involves suggesting necessary examinations, diagnosing patients’ diseases, and selecting appropriate therapies, etc. Accurate clinical reasoning requires extensive medical knowledge and rich clinical experience, setting a high bar for physicians. This is particularly challenging in developing countries due to the overwhelming number of patients and limited physician resources, contributing significantly to global health inequity and necessitating automated clinical reasoning approaches. Recently, the emergence of large language models (LLMs) such as ChatGPT and GPT-4 have demonstrated their potential in clinical reasoning. However, these LLMs are prone to hallucination problems, and the reasoning process of LLMs may not align with the clinical decision pathways of physicians. In this study, we introduce a novel framework, In-Context Padding (ICP), to enhance LLMs reasoning with medical knowledge. Specifically, we infer critical clinical reasoning elements (referred to as knowledge seeds) and use these as anchors to guide the generation process of LLMs. Experiments on two clinical question datasets validate that ICP significantly improves the clinical reasoning ability of LLMs.

NeurIPS 2024 · Conference Paper

Kernel PCA for Out-of-Distribution Detection

  • Kun Fang
  • Qinghua Tao
  • Kexin Lv
  • Mingzhen He
  • Xiaolin Huang
  • Jie Yang

Out-of-Distribution (OoD) detection is vital for the reliability of Deep Neural Networks (DNNs). Existing works have shown the insufficiency of Principal Component Analysis (PCA) straightforwardly applied on the features of DNNs in detecting OoD data from In-Distribution (InD) data. The failure of PCA suggests that the network features of OoD and InD data are not well separated by a simple linear projection, which instead can be resolved through proper non-linear mappings. In this work, we leverage the framework of Kernel PCA (KPCA) for OoD detection, and seek suitable non-linear kernels that advocate the separability between InD and OoD data in the subspace spanned by the principal components. Besides, explicit feature mappings induced from the devoted task-specific kernels are adopted so that the KPCA reconstruction error for new test samples can be efficiently obtained with large-scale data. Extensive theoretical and empirical results on multiple OoD data sets and network structures verify the superiority of our KPCA detector in efficiency and efficacy with state-of-the-art detection performance.
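
The detection principle, minus the paper's task-specific kernels and explicit feature maps, can be sketched with scikit-learn's generic RBF KPCA: fit on InD features, then score test samples by how poorly they are reconstructed from the principal subspace. The features and hyperparameters below are illustrative stand-ins.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
feats_ind = rng.normal(size=(500, 64))     # stand-in for InD DNN features
kpca = KernelPCA(n_components=32, kernel='rbf', gamma=0.05,
                 fit_inverse_transform=True).fit(feats_ind)

def ood_score(feats):
    """Reconstruction error in feature space: larger = more OoD-like."""
    recon = kpca.inverse_transform(kpca.transform(feats))
    return np.linalg.norm(feats - recon, axis=1)
```

A threshold on `ood_score` (e.g., a high percentile of InD scores) then turns the detector into a binary InD/OoD decision.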

NeurIPS 2024 · Conference Paper

KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension

  • Jie Yang
  • Wang Zeng
  • Sheng Jin
  • Lumin Xu
  • Wentao Liu
  • Chen Qian
  • Ruimao Zhang

Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM's superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.

NeurIPS 2024 · Conference Paper

MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey

  • Xian Wu
  • Yutian Zhao
  • Yunyan Zhang
  • Jiageng Wu
  • Zhihong Zhu
  • Yingying Zhang
  • Yi Ouyang
  • Ziheng Zhang

Large language models (LLMs) have demonstrated remarkable capabilities in language understanding and generation, leading to their widespread adoption across various fields. Among these, the medical field is particularly well-suited for LLM applications, as many medical tasks can be enhanced by LLMs. Despite the existence of benchmarks for evaluating LLMs in medical question-answering and exams, there remains a notable gap in assessing LLMs' performance in supporting patients throughout their entire hospital visit journey in real-world clinical practice. In this paper, we address this gap by dividing a typical patient's clinical journey into four stages: planning, access, delivery and ongoing care. For each stage, we introduce multiple tasks and corresponding datasets, resulting in a comprehensive benchmark comprising 12 datasets, of which five are newly introduced, and seven are constructed from existing datasets. This proposed benchmark facilitates a thorough evaluation of LLMs' effectiveness across the entire patient journey, providing insights into their practical application in clinical settings. Additionally, we evaluate three categories of LLMs against this benchmark: 1) proprietary LLM services such as GPT-4; 2) public LLMs like QWen; and 3) specialized medical LLMs, like HuatuoGPT2. Through this extensive evaluation, we aim to provide a better understanding of LLMs' performance in the medical domain, ultimately contributing to their more effective deployment in healthcare settings.

ICRA 2024 · Conference Paper

MMA-Net: Multiple Morphology-Aware Network for Automated Cobb Angle Measurement

  • Zhengxuan Qiu
  • Jie Yang
  • Jiankun Wang 0001

Scoliosis diagnosis and assessment depend largely on the measurement of the Cobb angle in spine X-ray images. With the emergence of deep learning techniques that employ landmark detection, tilt prediction, and spine segmentation, automated Cobb angle measurement has become increasingly popular. However, these methods encounter difficulties such as high noise sensitivity, intricate computational procedures, and exclusive reliance on a single type of morphological information. In this paper, we introduce the Multiple Morphology-Aware Network (MMA-Net), a novel framework that improves Cobb angle measurement accuracy by integrating multiple spine morphologies as attention information. In the MMA-Net, we first feed spine X-ray images into the segmentation network to produce multiple morphological information (spine region, centerline, and boundary) and then concatenate the original X-ray image with the resulting segmentation maps as input for the regression module to perform precise Cobb angle measurement. Furthermore, we devise joint loss functions for our segmentation and regression network training, respectively. We evaluate our method on the AASCE challenge dataset and achieve superior performance with a SMAPE of 7.28% and an MAE of 3.18°, indicating strong competitiveness compared to other outstanding methods. Consequently, we can offer clinicians automated, efficient, and reliable Cobb angle measurement.
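
As background on the measured quantity: once per-vertebra endplate tilt angles are known, the Cobb angle reduces to the angle between the most-tilted vertebrae above and below the curve apex. A toy post-processing sketch follows; MMA-Net itself regresses the angles directly from the morphology-augmented input.

```python
import numpy as np

def cobb_angle(tilts_deg):
    """Largest Cobb angle from per-vertebra endplate tilts (degrees).

    With signed tilt angles in hand, the angle between the two
    most-oppositely-tilted vertebrae is their maximum pairwise
    difference (a simplification of the clinical single-curve case).
    """
    tilts = np.asarray(tilts_deg, dtype=float)
    return float(tilts.max() - tilts.min())
```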

JAIR Journal 2024 Journal Article

Opening the Analogical Portal to Explainability: Can Analogies Help Laypeople in AI-assisted Decision Making?

  • Gaole He
  • Agathe Balayn
  • Stefan Buijsman
  • Jie Yang
  • Ujwal Gadiraju

Concepts are an important construct in semantics, based on which humans understand the world with various levels of abstraction. With the recent advances in explainable artificial intelligence (XAI), concept-level explanations are receiving an increasing amount of attention from the broad research community. However, laypeople may find such explanations difficult to digest due to the potential knowledge gap and the concomitant cognitive load. Inspired by prior work that has explored analogies and sensemaking, we argue that augmenting concept-level explanations with analogical inference information from commonsense knowledge can be a potential solution to tackle this issue. To investigate the validity of our proposition, we first designed an effective analogy-based explanation generation method and collected 600 analogy-based explanations from 100 crowd workers. Next, we proposed a set of structured dimensions for the qualitative assessment of such explanations, and conducted an empirical evaluation of the generated analogies with experts. Our findings revealed significant positive correlations between the qualitative dimensions of analogies and the perceived helpfulness of analogy-based explanations, suggesting the effectiveness of the dimensions. To understand the practical utility and the effectiveness of analogy-based explanations in assisting human decision-making, we conducted a follow-up empirical study (N = 280) on a skin cancer detection task with non-expert humans and an imperfect AI system. To this end, we designed a between-subjects study spanning five different experimental conditions with varying types of explanations. The results of our study confirmed that a knowledge gap can prevent participants from understanding concept-level explanations. Consequently, when only the target domain of our designed analogy-based explanation was provided (in a specific experimental condition), participants demonstrated relatively more appropriate reliance on the AI system. In contrast to our expectations, we found that analogies were not effective in fostering appropriate reliance. We carried out a qualitative analysis of the open-ended responses from participants in the study regarding their perceived usefulness of explanations and analogies. Our findings suggest that human intuition and the perceived plausibility of analogies may have played a role in affecting user reliance on the AI system. We also found that the understanding of commonsense explanations varied with the experience of the recipient user, which points to the need for further work on personalization when leveraging commonsense explanations. In summary, although we did not find quantitative support for our hypotheses around the benefits of using analogies, we found considerable qualitative evidence suggesting the potential of high-quality analogies in aiding non-expert users in their decision-making with AI assistance. These insights can inform the design of future methods for the generation and use of effective analogy-based explanations.

NeurIPS Conference 2024 Conference Paper

OPUS: Occupancy Prediction Using a Sparse Set

  • Jiabao Wang
  • Zhaojiang Liu
  • Qiang Meng
  • Liujiang Yan
  • Ke Wang
  • Jie Yang
  • Wei Liu
  • Qibin Hou

Occupancy prediction, aiming at predicting the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels wastes computational resources, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. Firstly, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
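
For readers unfamiliar with the set-to-set loss mentioned in the abstract, the following is a minimal numpy sketch of a symmetric Chamfer distance between a predicted and a ground-truth point set. It illustrates the general loss only, not the OPUS training code; the array shapes are assumptions.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets.

    pred: (N, 3) predicted occupied-voxel centres
    gt:   (M, 3) ground-truth occupied-voxel centres
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(axis=-1)
    # Averaging the nearest-neighbour distance in both directions keeps the
    # loss symmetric, so neither missed nor spurious predictions are ignored.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```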

AAAI Conference 2024 Conference Paper

Pre-trained Online Contrastive Learning for Insurance Fraud Detection

  • Rui Zhang
  • Dawei Cheng
  • Jie Yang
  • Yi Ouyang
  • Xian Wu
  • Yefeng Zheng
  • Changjun Jiang

Medical insurance fraud has always been a crucial challenge in the healthcare industry. Existing fraud detection models mostly focus on offline learning scenarios. However, fraud patterns are constantly evolving, making it difficult for models trained on past data to detect newly emerging fraud patterns, posing a severe challenge in medical fraud detection. Moreover, current incremental learning models are mostly designed to address catastrophic forgetting, but often exhibit suboptimal performance in fraud detection. To address this challenge, this paper proposes an innovative online learning method for medical insurance fraud detection, named POCL. This method combines contrastive learning pre-training with online updating strategies. In the pre-training stage, we leverage contrastive learning pre-training to learn on historical data, enabling deep feature learning and obtaining rich risk representations. In the online learning stage, we adopt a Temporal Memory Aware Synapses online updating strategy, allowing the model to perform incremental learning and optimization based on continuously emerging new data. This ensures timely adaptation to fraud patterns and reduces forgetting of past knowledge. Our model undergoes extensive experiments and evaluations on real-world insurance fraud datasets. The results demonstrate that our model has significant advantages in accuracy compared to the state-of-the-art baseline methods, while also exhibiting lower running time and space consumption. Our source code is released at https://github.com/finint/POCL.

NeurIPS Conference 2024 Conference Paper

Rethinking Fourier Transform from A Basis Functions Perspective for Long-term Time Series Forecasting

  • Runze Yang
  • Longbing Cao
  • Jianxun Li
  • Jie Yang

The interaction between Fourier transform and deep learning opens new avenues for long-term time series forecasting (LTSF). We propose to reconsider the Fourier transform from a basis functions perspective. Specifically, the real and imaginary parts of the frequency components can be viewed as the coefficients of cosine and sine basis functions at tiered frequency levels, respectively. We argue that existing Fourier-based methods do not involve basis functions and thus fail to interpret frequency coefficients precisely or to consider the time-frequency relationship sufficiently, leading to inconsistent starting cycles and inconsistent series lengths. Accordingly, we propose a novel Fourier basis mapping (FBM) method that addresses these issues by mixing time- and frequency-domain features through Fourier basis expansion. Differing from existing approaches, FBM (i) embeds the discrete Fourier transform with basis functions, and then (ii) enables plug-and-play use in various types of neural networks for better performance. FBM extracts explicit frequency features while preserving temporal characteristics, enabling the mapping network to capture the time-frequency relationships. By incorporating our unique time-frequency features, the FBM variants can enhance any type of network, including linear, multilayer-perceptron-based, transformer-based, and Fourier-based networks, achieving state-of-the-art LTSF results on diverse real-world datasets with just one or three fully connected layers. The code is available at: https://github.com/runze1223/Fourier-Basis-Mapping.
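
To make the basis-functions reading of the DFT concrete, here is a small numpy illustration (assumptions: a real-valued series, the standard rFFT convention, and an arbitrary window length; this is not the FBM network itself). The real and imaginary parts of the frequency components act as coefficients of cosine and sine bases evaluated on the time grid.

```python
import numpy as np

T = 96                                   # assumed look-back window length
t = np.arange(T)
x = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(T)

coeffs = np.fft.rfft(x)                  # complex frequency components
freqs = np.arange(len(coeffs))           # tiered frequency levels 0 .. T/2

# Cosine/sine basis functions on the time grid, shape (T, T/2 + 1).
cos_basis = np.cos(2 * np.pi * np.outer(t, freqs) / T)
sin_basis = np.sin(2 * np.pi * np.outer(t, freqs) / T)

# Real parts weight the cosine bases, (minus) imaginary parts the sine bases;
# the DC and Nyquist terms appear once, every other frequency appears twice.
scale = np.full(len(coeffs), 2.0 / T)
scale[0] = 1.0 / T
if T % 2 == 0:
    scale[-1] = 1.0 / T
recon = cos_basis @ (scale * coeffs.real) - sin_basis @ (scale * coeffs.imag)
assert np.allclose(recon, x)             # the basis expansion reproduces the series
```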

AAAI Conference 2024 Conference Paper

SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger

  • Yuting Gao
  • Jinfeng Liu
  • Zihan Xu
  • Tong Wu
  • Enwei Zhang
  • Ke Li
  • Jie Yang
  • Wei Liu

During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. The intra-modal guidance indicates when two pairs share local similarities, enabling the model to capture many-to-many relationships between the two modalities. In addition, since the positive still dominates in the softened target distribution, we disentangle the negatives in the distribution to further boost the relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
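
As a rough illustration of the softened alignment target, the numpy sketch below blends the usual one-hot contrastive target with a distribution derived from intra-modal self-similarity. It conveys the general idea only; the temperature, blending weight, and loss form are assumptions, not the SoftCLIP objective itself.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_alignment_loss(img, txt, tau=0.07, alpha=0.2):
    """img, txt: L2-normalised embeddings of paired samples, shape (B, D)."""
    b = img.shape[0]
    logits = img @ txt.T / tau                   # cross-modal similarities
    hard = np.eye(b)                             # strict one-to-one target
    soft = softmax(img @ img.T / tau, axis=1)    # intra-modal self-similarity
    target = (1 - alpha) * hard + alpha * soft   # relaxed many-to-many target
    log_p = np.log(softmax(logits, axis=1))      # image-to-text log-probabilities
    return -(target * log_p).sum(axis=1).mean()  # soft cross-entropy
```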

AAAI Conference 2024 Conference Paper

Tackling Vision Language Tasks through Learning Inner Monologues

  • Diji Yang
  • Kezhen Chen
  • Jinmeng Rao
  • Xiaoyuan Guo
  • Yawen Zhang
  • Jie Yang
  • Yi Zhang

Visual language tasks such as Visual Question Answering (VQA) or Visual Entailment (VE) require AI models to comprehend and reason with both visual and textual content. Driven by the power of Large Language Models (LLMs), two prominent methods have emerged: (1) the hybrid integration between LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, serving as inputs for LLMs to generate final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected to LLMs' language space via further supervised fine-tuning. The first approach provides light training costs and interpretability but is hard to optimize in an end-to-end fashion. The second approach presents decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. To tackle this dilemma, we propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems by simulating Inner Monologue, a cognitive process in which an individual engages in silent verbal communication with themselves. More specifically, we enable LLMs and VLMs to interact through natural language conversation (i.e., Inner Monologue) and propose to use a two-stage training process to learn how to do Inner Monologue (self-asking questions and answering questions). IMMO is evaluated on two popular tasks and achieves competitive performance with less training data when compared with state-of-the-art models, while preserving interpretability. The results suggest that by emulating the cognitive phenomenon of internal dialogue, our approach can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models. More importantly, instead of using predefined human-crafted monologues, IMMO learns this process within the deep learning models, broadening its potential applications across various AI challenges beyond vision and language tasks.

JBHI Journal 2023 Journal Article

A Benchmark Dataset of Endoscopic Images and Novel Deep Learning Method to Detect Intestinal Metaplasia and Gastritis Atrophy

  • Jie Yang
  • Yan Ou
  • Zhiqian Chen
  • Juan Liao
  • Wenjian Sun
  • Yang Luo
  • Chunbo Luo

Endoscopy has been routinely used to diagnose stomach diseases including intestinal metaplasia (IM) and gastritis atrophy (GA). Such routine examination usually demands highly skilled radiologists to focus on a single patient with substantial time, causing the following two key challenges: 1) the dependency on the radiologist's experience leading to inconsistent diagnosis results across different radiologists; 2) limited examination efficiency due to the demanding time and energy consumption to the radiologist. This paper proposes to address these two issues in endoscopy using a novel machine learning method with three main contributions. Firstly, we build a novel and relatively large endoscopy dataset of 21,420 images from the widely used White Light Imaging (WLI) endoscopy and more recent Linked Color Imaging (LCI) endoscopy, which were annotated by experienced radiologists and validated with biopsy results, presenting a benchmark dataset. Secondly, we propose a novel machine learning model inspired by the human visual system, named local attention grouping, to effectively extract key visual features, which is further improved by learning from multiple randomly selected regional images via ensemble learning. Such a method avoids a significant problem of deep learning methods that downsample the original images to reduce input size, which can remove smaller lesions from endoscopy images. Finally, we propose a dual transfer learning strategy to train the model with co-distributed features between WLI and LCI images to further improve the performance. The experiment results, measured by accuracy, specificity, sensitivity, positive detection rate and negative detection rate, on IM are 99.18%, 98.90%, 99.45%, 99.45%, 98.91%, respectively, and on GA are 97.12%, 95.34%, 98.90%, 98.86%, 95.50%, respectively, achieving state-of-the-art performance that outperforms current mainstream deep learning models.

ICLR Conference 2023 Conference Paper

Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

  • Jie Yang
  • Ailing Zeng
  • Shilong Liu 0004
  • Feng Li 0040
  • Ruimao Zhang
  • Lei Zhang 0001

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, which unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It can provide a good initialization for the latter keypoint detection, making the training process converge fast. Second, to bring in contextual information near keypoints, we regard pose estimation as a keypoint box detection problem to learn both box positions and contents for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. In general, ED-Pose is conceptually simple without post-processing and dense heatmap supervision. It demonstrates its effectiveness and efficiency compared with both two-stage and one-stage methods. Notably, explicit box detection boosts the pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based Top-down methods under the same backbone by 1.2 AP on COCO and achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.

AAAI Conference 2023 Conference Paper

FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning

  • Yulei Qin
  • Xingyu Chen
  • Chao Chen
  • Yunhang Shen
  • Bo Ren
  • Yun Gu
  • Jie Yang
  • Chunhua Shen

Recently, webly supervised learning (WSL) has been studied to leverage numerous and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by tackling this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which only needs a few labeled examples from reality and can significantly improve performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure the image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and involved in removing distant out-of-distribution samples. In experiments, FoPro is trained on web datasets guided by a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.

IJCAI Conference 2023 Conference Paper

GreenPLM: Cross-Lingual Transfer of Monolingual Pre-Trained Language Models at Almost No Cost

  • Qingcheng Zeng
  • Lucas Garay
  • Peilin Zhou
  • Dading Chong
  • Yining Hua
  • Jiageng Wu
  • Yikang Pan
  • Han Zhou

Large pre-trained models have revolutionized natural language processing (NLP) research and applications, but high training costs and limited data resources have prevented their benefits from being shared equally amongst speakers of all the world's languages. To address issues of cross-linguistic access to such models and reduce energy consumption for sustainability during large-scale model training, this study proposes an effective and energy-efficient framework called GreenPLM that uses bilingual lexicons to directly "translate" pre-trained language models of one language into another at almost no additional cost. We validate this approach on BERT models for 18 languages and show that this framework is comparable to, if not better than, other heuristics with high training costs. In addition, given lightweight continued pre-training on limited data where available, this framework outperforms the original monolingual language models in six out of seven tested languages with up to 200x less pre-training effort. Aiming at the Leave No One Behind Principle (LNOB), our approach manages to greatly reduce inequalities between languages and energy consumption. We make our codes and models publicly available at https://github.com/qcznlp/GreenPLMs.
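
The lexicon-based "translation" can be pictured as re-mapping embedding rows while reusing the transformer body unchanged. The toy sketch below uses hypothetical vocabularies and a hypothetical lexicon to show the core bookkeeping; it is not the released GreenPLM code.

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = {"dog": 0, "cat": 1, "house": 2, "[UNK]": 3}      # source-language vocab
src_emb = rng.standard_normal((len(src_vocab), 8))            # pre-trained embeddings

tgt_vocab = {"hund": 0, "katze": 1, "haus": 2, "zug": 3}      # target-language vocab
lexicon = {"hund": "dog", "katze": "cat", "haus": "house"}    # bilingual lexicon

# Every target token whose translation exists in the source vocabulary inherits
# that row of the source embedding matrix; unmatched tokens fall back to [UNK].
tgt_emb = np.tile(src_emb[src_vocab["[UNK]"]], (len(tgt_vocab), 1))
for word, idx in tgt_vocab.items():
    src_word = lexicon.get(word)
    if src_word in src_vocab:
        tgt_emb[idx] = src_emb[src_vocab[src_word]]
```

Where data permit, a lightweight continued pre-training step would then adapt these copied rows to the target language, as the abstract describes.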

ECAI Conference 2023 Conference Paper

Stock Movement Prediction via Attention-Aware Multi-Order Relation Graph Neural Network

  • Hao Peng
  • Jie Yang

Stock Movement Prediction (SMP) is a challenging task that aims at predicting the future stock price trend of companies in the stock market. Recent advances mainly apply the Graph Convolutional Network (GCN) to learn connections among companies for SMP. However, these methods usually ignore the semantics of the specific relations (e.g., investment and share) between two entities (i.e., companies and persons) on the market knowledge graph. Meanwhile, considering the long-chain cross-shareholding structures among entities, it is difficult for GCN to obtain high-order neighbor information over long distances. To address these two problems, we present an Attention-aware Multi-order Relation GCN for SMP (AMRGCN-SMP). Specifically, an attention-aware multi-channel aggregation manner achieves the weighted fusion of nodes across multiple semantic channels. Moreover, the dynamic update of the adjacency tensor can fuse the multi-order relation representations and bring more abundant long-chain connections. Experiments on the CSI100E and CSI300E datasets demonstrate that the proposed method achieves state-of-the-art performance compared with recent advances.

AAAI Conference 2023 Conference Paper

USDNL: Uncertainty-Based Single Dropout in Noisy Label Learning

  • Yuanzhuo Xu
  • Xiaoguang Niu
  • Jie Yang
  • Steve Drew
  • Jiayu Zhou
  • Ruizhi Chen

Deep Neural Networks (DNNs) possess powerful prediction capability thanks to their over-parameterized design, although their large model complexity makes them vulnerable to noisy supervision. Recent approaches seek to eliminate the impact of noisy labels by excluding data points with large loss values and show promising performance. However, these approaches usually incur significant computation overhead and lack theoretical analysis. In this paper, we adopt a perspective that connects label noise with epistemic uncertainty. We design a simple, efficient, and theoretically provable robust algorithm named USDNL for DNNs with uncertainty-based Dropout. Specifically, we estimate the epistemic uncertainty of the network prediction after early training through single Dropout. The epistemic uncertainty is then combined with cross-entropy loss to select the clean samples during training. Finally, we theoretically show the equivalence of replacing the selection loss with a single cross-entropy loss. Compared to existing small-loss selection methods, USDNL features its simplicity for practical scenarios by only applying Dropout to a standard network, while still achieving high model accuracy. Extensive empirical results on both synthetic and real-world datasets show that USDNL outperforms other methods. Our code is available at https://github.com/kovelxyz/USDNL.
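
The selection step can be approximated in a few lines of PyTorch. The sketch below combines a dropout-based uncertainty proxy with per-sample cross-entropy to keep presumably clean samples; it is an assumed simplification of the idea, not the released USDNL code, and the keep ratio and scoring rule are illustrative choices.

```python
import torch
import torch.nn.functional as F

def uncertainty_guided_loss(model, x, y, keep_ratio=0.7):
    model.train()                              # keep dropout active for this forward pass
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    # Predictive entropy as a cheap epistemic-uncertainty proxy.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    ce = F.cross_entropy(logits, y, reduction="none")
    score = ce + entropy                       # high score = likely noisy label
    k = max(1, int(keep_ratio * y.numel()))
    keep = torch.topk(-score, k).indices       # lowest-scoring samples are kept
    return F.cross_entropy(logits[keep], y[keep])
```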

NeurIPS Conference 2022 Conference Paper

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

  • Yuanfeng Ji
  • Haotian Bai
  • Chongjian Ge
  • Jie Yang
  • Ye Zhu
  • Ruimao Zhang
  • Zhen Li
  • Lingyan Zhanng

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.

NeurIPS Conference 2022 Conference Paper

HSDF: Hybrid Sign and Distance Field for Modeling Surfaces with Arbitrary Topologies

  • Li Wang
  • Jie Yang
  • Weikai Chen
  • Xiaoxu Meng
  • Bo Yang
  • Jintao Li
  • Lin Gao

Neural implicit functions based on the signed distance field (SDF) have achieved impressive progress in reconstructing 3D models with high fidelity. However, such approaches can only represent closed shapes. Recent works based on the unsigned distance function (UDF) are proposed to handle both watertight and open surfaces. Nonetheless, as UDF is signless, its direct output is limited to point clouds, which imposes an additional challenge on extracting high-quality meshes from discrete points. To address this issue, we present a new learnable implicit representation, coded HSDF, that combines the advantages of SDF and UDF. In particular, HSDF is able to represent arbitrary topologies containing both closed and open surfaces while being compatible with existing iso-surface extraction techniques for easy field-to-mesh conversion. In addition to predicting a UDF, we propose to learn an additional sign field via a simple classifier. Unlike traditional SDF, HSDF is able to locate the surface of interest before level surface extraction by generating surface points following NDF (Chibane et al., 2020). We are then able to obtain open surfaces via an adaptive meshing approach that only instantiates regions containing surfaces into a polygon mesh. We also propose HSDF-Net, a dedicated learning framework that factorizes the learning of HSDF into two easier problems. Experiments on multiple datasets show that HSDF outperforms state-of-the-art techniques both qualitatively and quantitatively.

NeurIPS Conference 2022 Conference Paper

METS-CoV: A Dataset of Medical Entity and Targeted Sentiment on COVID-19 Related Tweets

  • Peilin Zhou
  • Zeqiang Wang
  • Dading Chong
  • Zhijiang Guo
  • Yining Hua
  • Zichang Su
  • Zhiyang Teng
  • Jiageng Wu

The COVID-19 pandemic continues to bring up various topics discussed or debated on social media. In order to explore the impact of pandemics on people's lives, it is crucial to understand the public's concerns and attitudes towards pandemic-related entities (e.g., drugs, vaccines) on social media. However, models trained on existing named entity recognition (NER) or targeted sentiment analysis (TSA) datasets have limited ability to understand COVID-19-related social media texts because these datasets are not designed or annotated from a medical perspective. In this paper, we release METS-CoV, a dataset containing medical entities and targeted sentiments from COVID-19 related tweets. METS-CoV contains 10,000 tweets with 7 types of entities, including 4 medical entity types (Disease, Drug, Symptom, and Vaccine) and 3 general entity types (Person, Location, and Organization). To further investigate tweet users' attitudes toward specific entities, 4 types of entities (Person, Organization, Drug, and Vaccine) are selected and annotated with user sentiments, resulting in a targeted sentiment dataset with 9,101 entities (in 5,278 tweets). To the best of our knowledge, METS-CoV is the first dataset to collect medical entities and corresponding sentiments of COVID-19 related tweets. We benchmark the performance of classical machine learning models and state-of-the-art deep learning models on NER and TSA tasks with extensive experiments. Results show that this dataset has vast room for improvement for both NER and TSA tasks. With rich annotations and comprehensive benchmark results, we believe METS-CoV is a fundamental resource for building better medical social media understanding tools and facilitating computational social science research, especially on epidemiological topics. Our data, annotation guidelines, benchmark models, and source code are publicly available (https://github.com/YLab-Open/METS-CoV) to ensure reproducibility.

AAAI Conference 2022 Short Paper

Multi-View Adjacency-Constrained Nearest Neighbor Clustering (Student Abstract)

  • Jie Yang
  • Chin-Teng Lin

Most existing multi-view clustering methods have problems with parameter selection and high computational complexity, and there have been very few works based on hierarchical clustering to learn the complementary information of multiple views. In this paper, we propose a Multi-view Adjacency-constrained Nearest Neighbor Clustering (MANNC) method and its parameter-free version (MANNC-PF) to overcome these limitations. Experiments on eight real-world datasets validate the superiority of the proposed methods compared with 13 current state-of-the-art methods.

JBHI Journal 2022 Journal Article

NeuroSEE: A Neuromorphic Energy-Efficient Processing Framework for Visual Prostheses

  • Chuanqing Wang
  • Jie Yang
  • Mohamad Sawan

Visual prostheses with both comprehensive visual signal processing capability and energy efficiency are becoming increasingly demanded in the age of intelligent personal healthcare, particularly with the rise of wearable and implantable devices. To address this trend, we propose NeuroSEE, a neuromorphic energy-efficient processing framework that combines a spike representation encoding technique and a bio-inspired processing method. This framework first utilizes sparse spike trains to represent visual information, and then a bio-inspired spiking neural network (SNN) is adopted to process the spike trains. The SNN model makes use of an IF neuron with multiple spike-firing rates to decrease energy consumption without compromising prediction performance. The experimental results indicate that when predicting the response of the primary visual cortex, the framework achieves state-of-the-art Pearson correlation coefficient performance. Spike-based recording and processing methods simplify the storage and transmission of redundant scene information and complex calculation processes. The framework could reduce power consumption by 15 times compared with the existing convolutional neural network (CNN) processing framework. The proposed NeuroSEE framework predicts the response of the primary visual cortex in an energy-efficient manner, making it a powerful tool for visual prostheses.

JMLR Journal 2021 Journal Article

Generalization Properties of hyper-RKHS and its Applications

  • Fanghui Liu
  • Lei Shi
  • Xiaolin Huang
  • Jie Yang
  • Johan A.K. Suykens

This paper generalizes regularized regression problems in a hyper-reproducing kernel Hilbert space (hyper-RKHS), illustrates its utility for kernel learning and out-of-sample extensions, and proves asymptotic convergence results for the introduced regression models from an approximation theory view. Algorithmically, we consider two regularized regression models with bivariate forms in this space, including kernel ridge regression (KRR) and support vector regression (SVR) endowed with hyper-RKHS, and further combine divide-and-conquer with Nyström approximation for scalability in large sample cases. This framework is general: the underlying kernel is learned from a broad class, and can be positive definite or not, which adapts to various requirements in kernel learning. Theoretically, we study the convergence behavior of regularized regression algorithms in hyper-RKHS and derive the learning rates, which go beyond the classical analysis on RKHS due to the non-trivial independence of pairwise samples and the characterisation of hyper-RKHS. Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves satisfactory performance on classification tasks.

AAAI Conference 2021 Conference Paper

MARTA: Leveraging Human Rationales for Explainable Text Classification

  • Ines Arous
  • Ljiljana Dolamic
  • Jie Yang
  • Akansha Bhardwaj
  • Giuseppe Cuccu
  • Philippe Cudré-Mauroux

Explainability is a key requirement for text classification in many application domains ranging from sentiment analysis to medical diagnosis or legal reviews. Existing methods often rely on “attention” mechanisms for explaining classification results by estimating the relative importance of input units. However, recent studies have shown that such mechanisms tend to mis-identify irrelevant input units in their explanation. In this work, we propose a hybrid human-AI approach that incorporates human rationales into attention-based text classification models to improve the explainability of classification results. Specifically, we ask workers to provide rationales for their annotation by selecting relevant pieces of text. We introduce MARTA, a Bayesian framework that jointly learns an attention-based model and the reliability of workers while injecting human rationales into model training. We derive a principled optimization algorithm based on variational inference with efficient updating rules for learning MARTA parameters. Extensive validation on real-world datasets shows that our framework significantly improves the state of the art both in terms of classification explainability and accuracy.

NeurIPS Conference 2021 Conference Paper

OctField: Hierarchical Implicit Functions for 3D Modeling

  • Jia-Heng Tang
  • Weikai Chen
  • Jie Yang
  • Bo Wang
  • Songrun Liu
  • Bo Yang
  • Lin Gao

Recent advances in localized implicit functions have enabled neural implicit representation to be scalable to large scenes. However, the regular subdivision of 3D space employed by these approaches fails to take into account the sparsity of the surface occupancy and the varying granularities of geometric details. As a result, its memory footprint grows cubically with the input volume, leading to a prohibitive computational cost even at a moderately dense decomposition. In this work, we present a learnable hierarchical implicit representation for 3D surfaces, coded OctField, that allows high-precision encoding of intricate surfaces with low memory and computational budget. The key to our approach is an adaptive decomposition of 3D scenes that only distributes local implicit functions around the surface of interest. We achieve this goal by introducing a hierarchical octree structure to adaptively subdivide the 3D space according to the surface occupancy and the richness of part geometry. As octree is discrete and non-differentiable, we further propose a novel hierarchical network that models the subdivision of octree cells as a probabilistic process and recursively encodes and decodes both octree structure and surface geometry in a differentiable manner. We demonstrate the value of OctField for a range of shape modeling and reconstruction tasks, showing superiority over alternative approaches.

IJCAI Conference 2021 Conference Paper

UniGNN: a Unified Framework for Graph and Hypergraph Neural Networks

  • Jing Huang
  • Jie Yang

Hypergraph, an expressive structure with the flexibility to model higher-order correlations among entities, has recently attracted increasing attention from various research domains. Despite the success of Graph Neural Networks (GNNs) for graph representation learning, how to adapt powerful GNN variants directly to hypergraphs remains a challenging problem. In this paper, we propose UniGNN, a unified framework for interpreting the message passing process in graph and hypergraph neural networks, which can generalize general GNN models to hypergraphs. In this framework, meticulously designed architectures aiming to deepen GNNs can also be incorporated into hypergraphs with the least effort. Extensive experiments have been conducted to demonstrate the effectiveness of UniGNN on multiple real-world datasets, where it outperforms the state-of-the-art approaches by a large margin. Especially for the DBLP dataset, we increase the accuracy from 77.4% to 88.8% in the semi-supervised hypernode classification task. We further prove that the proposed message-passing based UniGNN models are at most as powerful as the 1-dimensional Generalized Weisfeiler-Leman (1-GWL) algorithm in terms of distinguishing non-isomorphic hypergraphs. Our code is available at https://github.com/OneForward/UniGNN.
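
A single layer of such unified message passing can be written as a two-stage aggregation over the vertex-hyperedge incidence matrix. The numpy sketch below is an assumed simplification using mean aggregation, not the released UniGNN code.

```python
import numpy as np

def hypergraph_layer(X, H, W):
    """X: (n, d) vertex features, H: (n, m) incidence matrix, W: (d, d') weights.

    Assumes every hyperedge contains at least one vertex and every vertex sits
    in at least one hyperedge, so the degree normalisations are well defined.
    """
    deg_e = H.sum(axis=0)[:, None]      # vertices per hyperedge, shape (m, 1)
    E = (H.T @ X) / deg_e               # stage 1: hyperedge = mean of its vertices
    deg_v = H.sum(axis=1)[:, None]      # hyperedges per vertex, shape (n, 1)
    X_new = (H @ E) / deg_v             # stage 2: vertex = mean of incident edges
    return np.maximum(X_new @ W, 0.0)   # linear transform + ReLU
```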

AAAI Conference 2020 Conference Paper

A Generalized Framework for Edge-Preserving and Structure-Preserving Image Smoothing

  • Wei Liu
  • Pingping Zhang
  • Yinjie Lei
  • Xiaolin Huang
  • Jie Yang
  • Ian Reid

Image smoothing is a fundamental procedure in applications of both computer vision and graphics. The required smoothing properties can be different or even contradictory among different tasks. Nevertheless, the inherent smoothing nature of one smoothing operator is usually fixed and thus cannot meet the various requirements of different applications. In this paper, a non-convex, non-smooth optimization framework is proposed that can realize diverse smoothing natures, even when the desired smoothing behaviors are contradictory. To this end, we first introduce the truncated Huber penalty function, which has seldom been used in image smoothing. A robust framework is then proposed. When combined with the strong flexibility of the truncated Huber penalty function, our framework is capable of a range of applications and can outperform state-of-the-art approaches in several tasks. In addition, an efficient numerical solution is provided and its convergence is theoretically guaranteed even though the optimization framework is non-convex and non-smooth. The effectiveness and superior performance of our approach are validated through comprehensive experimental results in a range of applications.
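
For orientation, one common way to write a truncated Huber penalty is the classical Huber function capped at a constant; the snippet below is shown only as an illustration, and the paper's exact parameterization may differ.

```python
import numpy as np

def truncated_huber(x, a=1.0, b=4.0):
    """Huber penalty capped at b (one common form, not necessarily the paper's).

    Quadratic near zero, linear for moderate |x|, and flat beyond the
    truncation point, so very large differences incur no extra cost.
    """
    x = np.abs(x)
    huber = np.where(x <= a, 0.5 * x ** 2, a * x - 0.5 * a ** 2)
    return np.minimum(huber, b)
```

Because the truncated penalty is bounded, differences larger than the truncation threshold stop being penalized, which is what allows large intensity jumps such as edges and salient structures to survive the smoothing.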

AAAI Conference 2020 Conference Paper

A Human-AI Loop Approach for Joint Keyword Discovery and Expectation Estimation in Micropost Event Detection

  • Akansha Bhardwaj
  • Jie Yang
  • Philippe Cudré-Mauroux

Microblogging platforms such as Twitter are increasingly being used in event detection. Existing approaches mainly use machine learning models and rely on event-related keywords to collect the data for model training. These approaches make strong assumptions on the distribution of the relevant microposts containing the keyword – referred to as the expectation of the distribution – and use it as a posterior regularization parameter during model training. Such approaches are, however, limited as they fail to reliably estimate the informativeness of a keyword and its expectation for model training. This paper introduces a Human-AI loop approach to jointly discover informative keywords for model training while estimating their expectation. Our approach iteratively leverages the crowd to estimate both keyword-specific expectation and the disagreement between the crowd and the model in order to discover new keywords that are most beneficial for model training. These keywords and their expectation not only improve the resulting performance but also make the model training process more transparent. We empirically demonstrate the merits of our approach, both in terms of accuracy and interpretability, on multiple real-world datasets and show that our approach improves the state of the art by 24.3%.

JBHI Journal 2020 Journal Article

Characterizing Alzheimer's Disease With Image and Genetic Biomarkers Using Supervised Topic Models

  • Jie Yang
  • Xinyang Feng
  • Andrew F. Laine
  • Elsa D. Angelini

Neuroimaging and genetic biomarkers have been widely studied from discriminative perspectives towards Alzheimer's disease (AD) classification, since neuroanatomical patterns and genetic variants are jointly critical indicators for AD diagnosis. Generative methods, designed to model common occurring patterns, could potentially advance the understanding of this disease, but have not been fully explored for AD characterization. Moreover, the introduction of a supervised component into the generative process can constrain the model for more discriminative characterization. In this study, we propose an original method based on supervised topic modeling to characterize AD from a generative perspective, yet maintaining discriminative power at differentiating disease populations. Our topic modeling jointly exploits discretized image features and categorical genetic features. Diagnostic information - cognitively normal (CN), mild cognitive impairment (MCI) and AD - is introduced as a supervision variable. Experimental results on the ADNI cohort demonstrate that our model, while achieving competitive discriminative performance, can discover topics revealing both well-known and novel neuroanatomical patterns including temporal, parietal and frontal regions; as well as associations between genetic factors and neuroanatomical patterns.

JMLR Journal 2020 Journal Article

Learning Data-adaptive Non-parametric Kernels

  • Fanghui Liu
  • Xiaolin Huang
  • Chen Gong
  • Jie Yang
  • Li Li

In this paper, we propose a data-adaptive non-parametric kernel learning framework for margin-based kernel methods. In the model formulation, given an initial kernel matrix, a data-adaptive matrix with two constraints is imposed in an entry-wise scheme. Learning this data-adaptive matrix in a formulation-free strategy enlarges the margin between classes and thus improves the model flexibility. The two introduced constraints are imposed either exactly (on small data sets) or approximately (on large data sets) in our model, which provides a controllable trade-off between model flexibility and complexity with theoretical demonstration. In algorithm optimization, the objective function of our learning framework is proven to be gradient-Lipschitz continuous. Thereby, kernel and classifier/regressor learning can be efficiently optimized in a unified framework via Nesterov's acceleration. For the scalability issue, we study a decomposition-based approach to our model in the large sample case. The effectiveness of this approximation is illustrated by both empirical studies and theoretical guarantees. Experimental results on various classification and regression benchmark data sets demonstrate that our non-parametric kernel learning framework achieves good performance when compared with other representative kernel learning based algorithms.

AAAI Conference 2020 Conference Paper

Learning to Incorporate Structure Knowledge for Image Inpainting

  • Jie Yang
  • Zhiquan Qi
  • Yong Shi

This paper develops a multi-task learning framework that attempts to incorporate image structure knowledge to assist image inpainting, which is not well explored in previous works. The primary idea is to train a shared generator to simultaneously complete the corrupted image and the corresponding structures (edge and gradient), thus implicitly encouraging the generator to exploit relevant structure knowledge while inpainting. In the meantime, we also introduce a structure embedding scheme to explicitly embed the learned structure features into the inpainting process, thus providing possible preconditions for image completion. Specifically, a novel pyramid structure loss is proposed to supervise structure learning and embedding. Moreover, an attention mechanism is developed to further exploit the recurrent structures and patterns in the image to refine the generated structures and contents. Through multi-task learning, structure embedding, and attention, our framework takes advantage of the structure knowledge and outperforms several state-of-the-art methods on benchmark datasets both quantitatively and qualitatively.

AAAI Conference 2020 Conference Paper

Random Fourier Features via Fast Surrogate Leverage Weighted Sampling

  • Fanghui Liu
  • Xiaolin Huang
  • Yudong Chen
  • Jie Yang
  • Johan Suykens

In this paper, we propose a fast surrogate leverage weighted sampling strategy to generate refined random Fourier features for kernel approximation. Compared to the current state-of-the-art method that uses the leverage weighted scheme (Li et al. 2019), our new strategy is simpler and more effective. It uses kernel alignment to guide the sampling process and it can avoid the matrix inversion operator when we compute the leverage function. Given n observations and s random features, our strategy can reduce the time complexity for sampling from O(ns^2 + s^3) to O(ns^2), while achieving comparable (or even slightly better) prediction performance when applied to kernel ridge regression (KRR). In addition, we provide theoretical guarantees on the generalization performance of our approach, and in particular characterize the number of random features required to achieve statistical guarantees in KRR. Experiments on several benchmark datasets demonstrate that our algorithm achieves comparable prediction performance and takes less time when compared to (Li et al. 2019).
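
For context, the textbook random Fourier feature map for a Gaussian kernel looks as follows. This is a standard construction shown only to fix notation; the paper's contribution, the surrogate leverage-weighted way of drawing the frequencies, is not reproduced here.

```python
import numpy as np

def rff_features(X, s=200, gamma=1.0, seed=0):
    """Random Fourier features such that Z @ Z.T approximates the Gaussian
    kernel exp(-gamma * ||x - y||^2).  X: (n, d) data, s: number of features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, s))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=s)                # random phases
    return np.sqrt(2.0 / s) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((500, 5))
Z = rff_features(X)                      # (500, 200) feature matrix
K_approx = Z @ Z.T                       # low-rank approximation of the kernel matrix
```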

IJCAI Conference 2020 Conference Paper

Structured Probabilistic End-to-End Learning from Crowds

  • Zhijun Chen
  • Huimin Wang
  • Hailong Sun
  • Pengpeng Chen
  • Tao Han
  • Xudong Liu
  • Jie Yang

End-to-end learning from crowds has recently been introduced as an EM-free approach to training deep neural networks directly from noisy crowdsourced annotations. It models the relationship between true labels and annotations with a specific type of neural layer, termed the crowd layer, which can be trained using pure backpropagation. Parameters of the crowd layer, however, can hardly be interpreted as annotator reliability, as compared with the more principled probabilistic approach. The lack of probabilistic interpretation further prevents extensions of the approach to account for important factors of annotation processes, e.g., instance difficulty. This paper presents SpeeLFC, a structured probabilistic model that incorporates the constraints of probability axioms for parameters of the crowd layer, which makes it possible to explicitly model annotator reliability while benefiting from the end-to-end training of neural networks. Moreover, we propose SpeeLFC-D, which further takes into account instance difficulty. Extensive validation on real-world datasets shows that our methods improve the state-of-the-art.
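
The probability-axiom constraint on the crowd layer can be pictured as forcing each annotator's confusion matrix onto the simplex, for example with a row-wise softmax. The PyTorch sketch below is an assumed simplification of that idea, not the authors' SpeeLFC implementation.

```python
import torch
import torch.nn as nn

class ProbCrowdLayer(nn.Module):
    """Crowd layer whose parameters are valid conditional probabilities."""

    def __init__(self, num_annotators, num_classes):
        super().__init__()
        # One confusion matrix per annotator, parameterised by unconstrained logits.
        self.logits = nn.Parameter(torch.zeros(num_annotators, num_classes, num_classes))

    def forward(self, class_probs):
        # class_probs: (batch, C) classifier output over true labels.
        # A row-wise softmax makes each row a distribution P(annotation | true label).
        confusion = torch.softmax(self.logits, dim=-1)
        # Returns (batch, R, C): predicted distribution of each annotator's noisy label.
        return torch.einsum('bc,rcd->brd', class_probs, confusion)
```

Because every row is a proper distribution, the learned matrices can be read as annotator reliability, which is the interpretability the abstract contrasts with the original crowd layer.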

JBHI Journal 2019 Journal Article

Densely-Connected Multi-Magnification Hashing for Histopathological Image Retrieval

  • Yun Gu
  • Jie Yang

Content-based medical image retrieval is an important computer-aided diagnosis technique that provides clinicians with interpretative references based on visual similarity. In this paper, we focus on the task of histopathological image retrieval for breast cancer diagnosis. The densely-connected multi-magnification hashing (DCMMH) framework is proposed to generate discriminative binary codes by exploiting histopathological images with multiple magnification factors. The low-magnification images are boosted by the accumulated similarity based on local patches that also regularize the feature learning of high-magnification images. In order to fully utilize the information across different magnification levels, a densely-connected architecture is finally deployed for pairs of high- and low-magnification images. Experiments on the BreakHis dataset demonstrate that DCMMH outperforms previous hashing methods on histopathological image retrieval.

AAAI Conference 2018 Conference Paper

Mesh-Based Autoencoders for Localized Deformation Component Analysis

  • Qingyang Tan
  • Lin Gao
  • Yu-Kun Lai
  • Jie Yang
  • Shihong Xia

Spatially localized deformation components are very useful for shape analysis and synthesis in 3D geometry processing. Several methods have recently been developed, with an aim to extract intuitive and interpretable deformation components. However, these techniques suffer from fundamental limitations especially for meshes with noise or large-scale deformations, and may not always be able to identify important deformation components. In this paper we propose a novel mesh-based autoencoder architecture that is able to cope with meshes with irregular topology. We introduce sparse regularization in this framework, which along with convolutional operations, helps localize deformations. Our framework is capable of extracting localized deformation components from mesh data sets with large-scale deformations and is robust to noise. It also provides a nonlinear approach to reconstruction of meshes using the extracted basis, which is more effective than the current linear combination approach. Extensive experiments show that our method outperforms state-of-the-art methods in both qualitative and quantitative evaluations.

AAAI Conference 2018 Conference Paper

Nonlinear Pairwise Layer and Its Training for Kernel Learning

  • Fanghui Liu
  • Xiaolin Huang
  • Chen Gong
  • Jie Yang
  • Li Li

Kernel learning is a fundamental technique that has been intensively studied in the past decades. For the complicated practical tasks, the traditional “shallow” kernels (e.g., Gaussian kernel and sigmoid kernel) are not flexible enough to produce satisfactory performance. To address this shortcoming, this paper introduces a nonlinear layer in kernel learning to enhance the model flexibility. This layer is pairwise, which fully considers the coupling information among examples. So our model contains a fixed single mapping layer (i.e., a Gaussian kernel) as well as a nonlinear pairwise layer, thereby achieving better flexibility than the existing kernel structures. Moreover, the proposed structure can be seamlessly embedded to Support Vector Machines (SVM), of which the training process can be formulated as a joint optimization problem including nonlinear function learning and standard SVM optimization. We theoretically prove that the objective function is gradient-Lipschitz continuous, which further guides us how to accelerate the optimization process in a deep kernel architecture. Experimentally, we find that the proposed structure outperforms other state-of-the-art kernel-based algorithms on various benchmark datasets, and thus the effectiveness of the incorporated pairwise layer with its training approach is demonstrated.

AAAI Conference 2017 Conference Paper

Exploiting both Vertical and Horizontal Dimensions of Feature Hierarchy for Effective Recommendation

  • Zhu Sun
  • Jie Yang
  • Jie Zhang
  • Alessandro Bozzon

Feature hierarchy (FH) has proven to be effective to improve recommendation accuracy. Prior work mainly focuses on the influence of vertically affiliated features (i.e., child-parent) on user-item interactions. The relationships of horizontally organized features (i.e., siblings and cousins) in the hierarchy, however, have been little investigated. We show in real-world datasets that feature relationships in the horizontal dimension can help explain and further model user-item interactions. To fully exploit FH, we propose a unified recommendation framework that seamlessly incorporates both vertical and horizontal dimensions for effective recommendation. Our model further considers two types of semantically rich feature relationships in the horizontal dimension, i.e., complementary and alternative relationships. Extensive validation on four real-world datasets demonstrates the superiority of our approach against the state of the art. An additional benefit of our model is to provide better interpretations of the generated recommendations.

IJCAI Conference 2017 Conference Paper

MRLR: Multi-level Representation Learning for Personalized Ranking in Recommendation

  • Zhu Sun
  • Jie Yang
  • Jie Zhang
  • Alessandro Bozzon
  • Yu Chen
  • Chi Xu

Representation learning (RL) has recently proven to be effective in capturing local item relationships by modeling item co-occurrence in individual user's interaction record. However, the value of RL for recommendation has not reached the full potential due to two major drawbacks: 1) recommendation is modeled as a rating prediction problem but should essentially be a personalized ranking one; 2) multi-level organizations of items are neglected for fine-grained item relationships. We design a unified Bayesian framework MRLR to learn user and item embeddings from a multi-level item organization, thus benefiting from RL as well as achieving the goal of personalized ranking. Extensive validation on real-world datasets shows that MRLR consistently outperforms state-of-the-art algorithms.

AAAI Conference 2016 Conference Paper

Teaching-to-Learn and Learning-to-Teach for Multi-label Propagation

  • Chen Gong
  • Dacheng Tao
  • Jie Yang
  • Wei Liu

Multi-label propagation aims to transmit the multi-label information from labeled examples to unlabeled examples based on a weighted graph. Existing methods ignore the specific propagation difficulty of different unlabeled examples and conduct the propagation in an imperfect sequence, leading to the error-prone classification of some difficult examples with uncertain labels. To address this problem, this paper associates each possible label with a “teacher”, and proposes a “Multi-Label Teaching-to-Learn and Learning-to-Teach” (ML-TLLT) algorithm, so that the entire propagation process is guided by the teachers and manipulated from simple examples to more difficult ones. In the teaching-to-learn step, the teachers select the simplest examples for the current propagation by investigating both the definitiveness of each possible label of the unlabeled examples, and the dependencies between labels revealed by the labeled examples. In the learning-to-teach step, the teachers reversely learn from the learner’s feedback to properly select the simplest examples for the next propagation. Thorough empirical studies show that due to the optimized propagation sequence designed by the teachers, ML-TLLT yields generally better performance than seven state-of-the-art methods on the typical multi-label benchmark datasets.

AAAI Conference 2014 Conference Paper

ReLISH: Reliable Label Inference via Smoothness Hypothesis

  • Chen Gong
  • Dacheng Tao
  • Keren Fu
  • Jie Yang

The smoothness hypothesis is critical for graph-based semi-supervised learning. This paper defines local smoothness, based on which a new algorithm, Reliable Label Inference via Smoothness Hypothesis (ReLISH), is proposed. ReLISH has produced smoother labels than some existing methods for both labeled and unlabeled examples. Theoretical analyses demonstrate good stability and generalizability of ReLISH. Using real-world datasets, our empirical analyses reveal that ReLISH is promising for both transductive and inductive tasks, when compared with representative algorithms, including Harmonic Functions, Local and Global Consistency, Constraint Metric Learning, Linear Neighborhood Propagation, and Manifold Regularization.

AAAI Conference 2014 Conference Paper

Signed Laplacian Embedding for Supervised Dimension Reduction

  • Chen Gong
  • Dacheng Tao
  • Jie Yang
  • Keren Fu

Manifold learning is a powerful tool for solving nonlinear dimension reduction problems. By assuming that the high-dimensional data usually lie on a low-dimensional manifold, many algorithms have been proposed. However, most algorithms simply adopt the traditional graph Laplacian to encode the data locality, so the discriminative ability is limited and the embedding results are not always suitable for the subsequent classification. Instead, this paper deploys the signed graph Laplacian and proposes Signed Laplacian Embedding (SLE) for supervised dimension reduction. By exploring the label information, SLE comprehensively transfers the discrimination carried by the original data to the embedded low-dimensional space. Without perturbing the discrimination structure, SLE also retains the locality. Theoretically, we prove the immersion property by computing the rank of projection, and relate SLE to existing algorithms in the frame of patch alignment. Thorough empirical studies on synthetic and real datasets demonstrate the effectiveness of SLE.
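
As a concrete picture of the ingredient the abstract builds on, here is one standard way to assemble a label-driven signed graph Laplacian in numpy. It is an illustration of the construction only, not necessarily the exact weighting used by SLE.

```python
import numpy as np

def signed_laplacian(X, y, k=5):
    """X: (n, d) data, y: (n,) class labels.

    Same-label nearest neighbours get weight +1, different-label neighbours
    get -1; degrees are computed from absolute weights.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                 # k nearest, excluding self
        W[i, nbrs] = np.where(y[nbrs] == y[i], 1.0, -1.0)
    W = (W + W.T) / 2.0                                   # symmetrise
    D = np.diag(np.abs(W).sum(axis=1))
    return D - W                                          # signed graph Laplacian
```

A discriminative low-dimensional embedding can then be derived from the spectrum of this matrix, which is the locality-plus-label information the abstract argues the plain graph Laplacian fails to capture.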