Arrow Research search

Author name cluster

Jie Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

121 papers
2 author rows

Possible papers (121)

AAAI Conference 2026 Conference Paper

BiHiTo: Biomolecular Hierarchy-inspired Tokenization

  • Ruochong Zheng
  • Yutian Liu
  • Yian Zhao
  • Zhiwei Nie
  • Xuehan Hou
  • Chang Liu
  • Siwei Ma
  • Youdong Mao

Three-dimensional atomic arrangements of biomolecules are key to demystifying biological functions. The rapid expansion of accessible structural data, driven by advances in AI for science, highlights the critical challenge of efficiently modeling large-scale biomolecular structures, which are high-dimensional systems shaped by biological assembly principles. To address this, we introduce BiHiTo, a multi-level Biomolecular Hierarchy-inspired Tokenizer that intrinsically mimics natural biological assembly hierarchies. Specifically, we design a multi-codebook quantizer that mirrors the natural hierarchy of biomolecular structure, enabling simultaneous capture of representations spanning atomic motifs to global conformational variations. This hierarchical alignment markedly improves the biological interpretability and reconstruction fidelity of biomolecular structures. Extensive experiments demonstrate that BiHiTo delivers state-of-the-art performance and robust generalization across molecular dynamics trajectories and macromolecular complexes, facilitating advances in structure generation and dynamic conformation exploration. In reconstructing multi-conformation protein data from the CASP14 test set and the out-of-distribution (OOD) FastFolding set, our method achieves 17% and 51% reductions in RMSD, respectively, compared to Bio2Token.
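The multi-codebook idea can be illustrated with a toy residual-style quantizer, where each level encodes what the coarser levels missed. This is only an analogy for the hierarchy the abstract describes; the function names, codebook sizes, and the residual scheme are illustrative assumptions, not BiHiTo's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(x, codebook):
    """Return the index and vector of the codebook entry closest to x."""
    d = np.linalg.norm(codebook - x, axis=1)
    i = int(np.argmin(d))
    return i, codebook[i]

def hierarchical_quantize(x, codebooks):
    """Residual-style multi-codebook quantization: each level encodes what
    the coarser levels missed (a rough analogue of capturing global
    conformation first, then finer atomic-motif detail)."""
    residual, codes, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        i, q = nearest_code(residual, cb)
        codes.append(i)
        recon += q
        residual = residual - q
    return codes, recon

dim = 8
# three levels, coarse -> fine (scales and sizes are illustrative)
codebooks = [rng.normal(0, s, size=(16, dim)) for s in (1.0, 0.3, 0.1)]
x = rng.normal(size=dim)
codes, recon = hierarchical_quantize(x, codebooks)
print(codes, np.linalg.norm(x - recon))
```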

AAAI Conference 2026 Conference Paper

CoGenSAM: Codebook-Interactive Generative Labeling for Adapting SAM to Crack Segmentation

  • Zhuangzhuang Chen
  • Nuo Chen
  • Dachong Li
  • Zhiliang Lin
  • Xingyu Feng
  • Yifan Zhang
  • Jie Chen
  • Jianqiang Li

The goal of this work is to adapt Segment Anything Models (SAM) into crack segmentation tasks via automatic label generation, thus eliminating manual annotation cost. In this regard, an intuitive approach is to extract edges of crack samples and generate labels via the dilation and erosion processes for fine-tuning SAM. However, this simple solution cannot guarantee the quality of generated labels, as crack regions will be corrupted by imperfect edge detection. To this end, this paper proposes CoGenSAM, a novel Codebook-interactive Generative Labeling framework that enables annotation-free SAM fine-tuning. To achieve this, in the first stage, we pre-train a vector-quantized variational auto-encoder (VQVAE) by reconstructing the synthesized crack-like structures for learning crack-aware priors within the codebook. In the second stage, these priors help another VQVAE serve as the restoration model to restore the randomly corrupted structures into uncorrupted ones. Specifically, we propose the crack-aware contrastive-interaction to maximize the mutual information with the above priors via codebook interaction. Then, high-quality labels can be generated by restoring corrupted labels from edge detection, contributing to annotation-free fine-tuning of SAM. We collect a new dataset, Bridge2025, to address the limited availability of related bridge-oriented benchmarks. Experiments show that our performance is close to fully-supervised methods.
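The edge-to-label step mentioned above (dilation and erosion over detected crack edges) can be sketched without any imaging library; the shift-based operators below are illustrative stand-ins, not the paper's pipeline.

```python
import numpy as np

def dilate(mask, it=1):
    """Binary dilation with a 4-neighbour cross, via array shifts."""
    m = mask.astype(bool)
    for _ in range(it):
        m = (m | np.roll(m, 1, 0) | np.roll(m, -1, 0)
               | np.roll(m, 1, 1) | np.roll(m, -1, 1))
    return m

def erode(mask, it=1):
    """Binary erosion with the same cross (intersection of shifts)."""
    m = mask.astype(bool)
    for _ in range(it):
        m = (m & np.roll(m, 1, 0) & np.roll(m, -1, 0)
               & np.roll(m, 1, 1) & np.roll(m, -1, 1))
    return m

# a thin detected "crack edge" -> a filled-out label via dilate then erode
edge = np.zeros((9, 9), dtype=bool)
edge[4, 1:8] = True                      # 1-pixel-wide edge response
label = erode(dilate(edge, 2), 1)        # morphological closing-style label
print(int(edge.sum()), int(label.sum()))
```

Because dilation-then-erosion (closing) is extensive, the generated label always covers the original edge; the paper's point is that such labels are still too crude without the learned restoration step.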

AAAI Conference 2026 Conference Paper

Conditional Distribution Learning for Graph Classification

  • Jie Chen
  • Hua Mao
  • Chuanbin Liu
  • Zhu Wang
  • Xi Peng

Leveraging the diversity and quantity of data provided by various graph-structured data augmentations while preserving intrinsic semantic information is challenging. Additionally, successive layers in graph neural networks (GNNs) tend to produce more similar node embeddings, while graph contrastive learning aims to increase the dissimilarity between negative pairs of node embeddings. This inevitably results in a conflict between the message-passing mechanism (MPM) of GNNs and the contrastive learning (CL) of negative pairs within intra-views. In this paper, we propose a conditional distribution learning (CDL) method that learns graph representations from graph-structured data for semisupervised graph classification. Specifically, we present an end-to-end graph representation learning model to align the conditional distributions of weakly and strongly augmented features over the original features. This alignment enables the CDL model to effectively preserve intrinsic semantic information when both weak and strong augmentations are applied to graph-structured data. To avoid the conflict between the MPM and the CL of negative pairs, positive pairs of node representations are retained for measuring the similarity between the original features and the corresponding weakly augmented features. Extensive experiments with several benchmark graph datasets demonstrate the effectiveness of the proposed CDL method.

AAAI Conference 2026 Conference Paper

Deep Inverse Shading: Consistent Albedo and Surface Detail Recovery via Generative Refinement

  • Jiacheng Wu
  • Ruiqi Zhang
  • Jie Chen

Reconstructing human avatars using generative priors is essential for achieving versatile and realistic avatar models. Traditional approaches often rely on volumetric representations guided by generative models, but these methods require extensive volumetric rendering queries, leading to slow training. Alternatively, surface-based representations offer faster optimization through differentiable rasterization, yet they are typically limited by vertex count, restricting mesh resolution and scalability when combined with generative priors. Moreover, integrating generative priors into physically based human avatar modeling remains largely unexplored. To address these challenges, we introduce DIS (Deep Inverse Shading), a unified framework for high-fidelity, relightable avatar reconstruction that incorporates generative priors into a coherent surface representation. DIS centers on a mesh-based model that serves as the target for optimizing both surface and material details. The framework fuses multi-view 2D generative surface normal predictions, rich in detail but often inconsistent, into the central mesh using a normal conversion module. This module converts generative normal outputs into per-triangle surface offsets via differentiable rasterization, enabling the capture of fine geometric details beyond sparse vertex limitations. Additionally, DIS integrates a de-shading module, informed by generative priors, to recover accurate material properties such as albedo. This module refines albedo predictions by removing baked-in shading and back-propagates reconstruction errors to further optimize the mesh geometry. Through this joint optimization of geometry and material appearance, DIS achieves physically consistent, high-quality reconstructions suitable for accurate relighting. Our experiments show that DIS delivers SOTA relighting quality, enhanced rendering efficiency, lower memory consumption, and detailed surface reconstruction.

TMLR Journal 2026 Journal Article

LoDAdaC: a unified local training-based decentralized framework with adaptive gradients and compressed communication

  • Wei Liu
  • Anweshit Panda
  • Ujwal Pandey
  • Haven Cook
  • George Slota
  • Naigang Wang
  • Jie Chen
  • Yangyang Xu

In decentralized distributed learning, achieving fast convergence and low communication cost is essential for scalability and high efficiency. Adaptive gradient methods, such as Adam, have demonstrated strong practical performance in deep learning and centralized distributed settings. However, their convergence properties remain largely unexplored in decentralized settings involving multiple local training steps, such as federated learning. To address this limitation, we propose LoDAdaC, a unified multiple Local Training (MLT) Decentralized framework with Adam-type updates and Compressed Communication (CC). LoDAdaC accommodates a broad class of optimizers for its local adaptive updates, including AMSGrad, Adam, and AdaGrad; it is compatible with standard (possibly biased) compressors such as low-bit quantization and sparsification. MLT and CC enable LoDAdaC to achieve a multiplied reduction of communication cost, while the adaptive updates enable fast convergence. We rigorously prove the combined advantage through complexity analysis. In addition, experiments on image classification and GPT-style language model training validate our theoretical findings and show that LoDAdaC significantly outperforms existing decentralized algorithms in terms of convergence speed and communication efficiency.
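The interplay of multiple local training steps and compressed communication can be sketched on a two-worker toy problem. The step sizes, the top-k compressor, and the simplified Adam update (bias correction omitted) are illustrative assumptions, not LoDAdaC itself.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries (a standard biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def local_adam_step(x, grad, m, v, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-type local update (bias correction omitted for brevity)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    return x - lr * m / (np.sqrt(v) + eps), m, v

# two workers minimizing 0.5*||x - t_i||^2 with heterogeneous targets
targets = [np.array([1.0, 0.0, 2.0, 0.0]), np.array([0.0, 3.0, 0.0, 1.0])]
xs = [np.zeros(4), np.zeros(4)]
ms = [np.zeros(4), np.zeros(4)]
vs = [np.zeros(4), np.zeros(4)]

for rnd in range(50):
    for _ in range(5):                    # multiple local steps per round
        for i in range(2):
            g = xs[i] - targets[i]
            xs[i], ms[i], vs[i] = local_adam_step(xs[i], g, ms[i], vs[i])
    avg = (xs[0] + xs[1]) / 2             # communication round:
    for i in range(2):                    # move toward consensus using only
        xs[i] = xs[i] + topk_compress(avg - xs[i], k=2)  # compressed messages

print(np.linalg.norm(xs[0] - xs[1]))
```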

AAAI Conference 2026 Conference Paper

Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

  • Wenchuan Zhang
  • Jingru Guo
  • Hengzhe Zhang
  • Penghao Zhang
  • Jie Chen
  • Shuwan Zhang
  • Zhang Zhang
  • Yuhao Yi

Although Vision Language Models (VLMs) have shown generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text–image search, enabling retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering.

AAAI Conference 2026 Conference Paper

Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

  • Wenchuan Zhang
  • Penghao Zhang
  • Jingru Guo
  • Tao Cheng
  • Jie Chen
  • Shuwan Zhang
  • Zhang Zhang
  • Yuhao Yi

Recent advances in vision-language models (VLMs) have enabled broad progress in the general medical field. However, pathology remains a particularly challenging sub-domain, with current pathology-specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image–description pairs that lack the depth and structured diagnostic paradigms employed by real-world pathologists. In this study, we leverage pathology textbooks and real-world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose Patho-CLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both Patho-CLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Questions.

AAAI Conference 2026 Conference Paper

ProAR: Probabilistic Autoregressive Modeling for Molecular Dynamics

  • Kaiwen Cheng
  • Yutian Liu
  • Zhiwei Nie
  • Mujie Lin
  • Yanzhen Hou
  • Yiheng Tao
  • Chang Liu
  • Jie Chen

Understanding the structural dynamics of biomolecules is crucial for uncovering biological functions. As molecular dynamics (MD) simulation data becomes more available, deep generative models have been developed to synthesize realistic MD trajectories. However, existing methods produce fixed-length trajectories by jointly denoising high-dimensional spatiotemporal representations, which conflicts with MD’s frame-by-frame integration process and fails to capture time-dependent conformational diversity. Inspired by MD's sequential nature, we introduce a new probabilistic autoregressive (ProAR) framework for trajectory generation. ProAR uses a dual-network system that models each frame as a multivariate Gaussian distribution and employs an anti-drifting sampling strategy to reduce cumulative errors. This approach captures conformational uncertainty and time-coupled structural changes while allowing flexible generation of trajectories of arbitrary length. Experiments on ATLAS, a large-scale protein MD dataset, demonstrate that for long trajectory generation, our model achieves a 7.5% reduction in reconstruction RMSE and an average 25.8% improvement in conformation change accuracy compared to previous state-of-the-art methods. For the conformation sampling task, it performs comparably to specialized time-independent models, providing a flexible and dependable alternative to standard MD simulations.
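The frame-by-frame Gaussian rollout with an anti-drift correction can be sketched as follows. `predict_next` is a hypothetical stand-in for the learned network, and the pull-toward-anchor rule only gestures at the paper's anti-drifting sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_next(frame):
    """Stand-in for the learned network: mean and std of a per-coordinate
    Gaussian over the next frame (here, a simple damped drift)."""
    mean = 0.95 * frame
    std = np.full_like(frame, 0.05)
    return mean, std

def generate(start, n_frames, anchor=None, pull=0.1):
    """Autoregressive rollout; an optional pull toward an anchor
    conformation damps the accumulation of sampling error."""
    traj = [start]
    for _ in range(n_frames):
        mean, std = predict_next(traj[-1])
        nxt = rng.normal(mean, std)
        if anchor is not None:
            nxt = nxt + pull * (anchor - nxt)
        traj.append(nxt)
    return np.stack(traj)

start = rng.normal(size=6)
traj = generate(start, n_frames=100, anchor=np.zeros(6))
print(traj.shape)  # (101, 6)
```

Because each frame is sampled conditionally on the last, trajectories of arbitrary length fall out of the same loop, mirroring the flexibility claimed above.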

AAAI Conference 2026 Conference Paper

SAOT: An Enhanced Locality-Aware Spectral Transformer for Solving PDEs

  • Chenhong Zhou
  • Jie Chen
  • Zaifeng Yang

Neural operators have shown great potential in solving a family of Partial Differential Equations (PDEs) by modeling the mappings between input and output functions. Fourier Neural Operator (FNO) implements global convolutions via parameterizing the integral operators in Fourier space. However, it often results in over-smoothing solutions and fails to capture local details and high-frequency components. To address these limitations, we investigate incorporating the spatial-frequency localization property of Wavelet transforms into the Transformer architecture. We propose a novel Wavelet Attention (WA) module with linear computational complexity to efficiently learn locality-aware features. Building upon WA, we further develop the Spectral Attention Operator Transformer (SAOT), a hybrid spectral Transformer framework that integrates WA’s localized focus with the global receptive field of Fourier-based Attention (FA) through a gated fusion block. Experimental results demonstrate that WA significantly mitigates the limitations of FA and outperforms existing Wavelet-based neural operators by a large margin. By integrating the locality-aware and global spectral representations, SAOT achieves state-of-the-art performance on six operator learning benchmarks and exhibits strong discretization-invariant ability.

AAAI Conference 2026 Conference Paper

SOSControl: Enhancing Human Motion Generation Through Saliency-Aware Symbolic Orientation and Timing Control

  • Ho Yin Au
  • Junkun Jiang
  • Jie Chen

Traditional text-to-motion frameworks often lack precise control, and existing approaches based on joint keyframe locations provide only positional guidance, making it challenging and unintuitive to specify body part orientations and motion timing. To address these limitations, we introduce the Salient Orientation Symbolic (SOS) script, a programmable symbolic framework for specifying body part orientations and motion timing at keyframes. We further propose an automatic SOS extraction pipeline that employs temporally-constrained agglomerative clustering for frame saliency detection and a Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts directly from motion data. Moreover, we present the SOSControl framework, which treats the available orientation symbols in the sparse SOS script as salient and prioritizes satisfying these constraints during motion generation. By incorporating SMS-based data augmentation and gradient-based iterative optimization, the framework enhances alignment with user-specified constraints. Additionally, it employs a ControlNet-based ACTOR-PAE Decoder to ensure smooth and natural motion outputs. Extensive experiments demonstrate that the SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes, while the SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.

AAAI Conference 2026 Conference Paper

UniAPO: Unified Multimodal Automated Prompt Optimization

  • Qipeng Zhu
  • Yanzhe Chen
  • Huasong Zhong
  • Jie Chen
  • Yan Li
  • Zhixin Zhang
  • Junping Zhang
  • Zhenheng Yang

Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks—such as video-language generation—introduces two core challenges: (i) visual token inflation, where long visual-token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.

AAAI Conference 2026 Conference Paper

WaveFormer: Frequency-Time Decoupled Vision Modeling with Wave Equation

  • Zishan Shu
  • Juntong Wu
  • Wei Yan
  • Xudong Liu
  • Hongyu Zhang
  • Chang Liu
  • Youdong Mao
  • Jie Chen

Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency—from low-frequency global layout to high-frequency edges and textures—is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency–time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time—far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6× higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics.
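The closed-form frequency–time decoupling can be illustrated in 1-D: evolve each Fourier mode of a signal under a damped wave equation and transform back, at O(N log N) cost via the FFT. The parameters and the clamping of degenerate low-frequency modes are simplifications for illustration, not the WPO itself.

```python
import numpy as np

def wave_propagate(u0, t=1.0, c=1.0, gamma=0.1):
    """Closed-form evolution of u_tt + 2*gamma*u_t = c^2 * u_xx with zero
    initial velocity, applied per frequency via FFT. For the underdamped
    modes: u_hat(t) = e^{-gamma t} (cos(w t) + (gamma/w) sin(w t)) u_hat(0),
    with w = sqrt(c^2 k^2 - gamma^2). Degenerate (near-critically-damped)
    low frequencies are approximated by clamping w away from zero."""
    n = u0.shape[-1]
    k = 2 * np.pi * np.fft.fftfreq(n)
    uhat = np.fft.fft(u0)
    om = np.sqrt(np.maximum((c * np.abs(k))**2 - gamma**2, 1e-12))
    amp = np.exp(-gamma * t) * (np.cos(om * t) + (gamma / om) * np.sin(om * t))
    return np.real(np.fft.ifft(amp * uhat))

x = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
y = wave_propagate(x, t=0.5)
print(y.shape)
```

Low frequencies persist (global layout) while high frequencies oscillate and decay faster, which is the frequency-dependent behavior the abstract contrasts with purely diffusive (heat-based) propagation.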

IROS Conference 2025 Conference Paper

Achieving Lift-to-Weight Ratio > 3.5 in Piezoelectric Direct-Driven Insect-Scale Flapping-Wing MAVs

  • Xiang Lu
  • Jie Chen
  • Yang Chen
  • Zixin Deng
  • Yulie Wu
  • Xuezhong Wu
  • Dingbang Xiao

Insect-scale flapping-wing micro aerial vehicles (FWMAVs) employing piezoelectric direct-drive configurations eliminate traditional kinematic chains through direct coupling of the wing and actuator. While this design approach significantly reduces structural complexity and manufacturing costs compared to transmission-dependent systems, it inherently limits wing stroke amplitude and consequent lift generation. This paper presents a novel lift-enhancement strategy for piezoelectric direct-drive FWMAVs, effectively improving payload capacity through optimized aerodynamic performance. The redesigned X-configuration prototype demonstrates outstanding metrics: 68 mm wingspan with 212 mg total mass achieves 7.47 mN maximum lift (exceeding a 3.5:1 lift-to-weight ratio) and 1.25 m/s takeoff speed. Experimental validation confirms 39% payload capacity improvement and 34% lift-to-weight ratio enhancement compared to baseline designs. This enhancement establishes our robot as the current state-of-the-art in piezoelectric direct-drive FWMAVs regarding lift-to-weight ratio.
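A quick arithmetic check confirms the headline figure is internally consistent (assuming standard gravity g = 9.81 m/s²):

```python
# Sanity check of the reported lift-to-weight ratio (g = 9.81 m/s^2 assumed).
lift_N = 7.47e-3           # 7.47 mN maximum lift
mass_kg = 212e-6           # 212 mg total mass
weight_N = mass_kg * 9.81
ratio = lift_N / weight_N
print(round(ratio, 2))     # ~3.59, consistent with the claimed > 3.5
```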

JBHI Journal 2025 Journal Article

Active-Supervised Model for Intestinal Ulcers Segmentation Using Fuzzy Labeling

  • Jie Chen
  • Yanning Lin
  • Faisal Saeed
  • Ziqian Ding
  • Muhammad Diyan
  • Jianqiang Li
  • Zhaoxia Wang

Inflammatory bowel disease (IBD) is a chronic inflammatory condition of the intestines with a rising global incidence. Colonoscopy remains the gold standard for IBD diagnosis, but traditional image-scoring methods are subjective and complex, impacting diagnostic accuracy and efficiency. To address these limitations, this paper investigates machine learning techniques for intestinal ulcer segmentation, focusing on multi-category ulcer segmentation to enhance IBD diagnosis. We identified two primary challenges in intestinal ulcer segmentation: 1) labeling noise, where inaccuracies in medical image annotation introduce ambiguity, hindering model training, and 2) performance variability across datasets, where models struggle to maintain high accuracy due to medical image diversity. To address these challenges, we propose an active ulcer segmentation algorithm based on fuzzy labeling. A collaborative training segmentation model is designed to utilize pixel-wise confidence extracted from fuzzy labels, distinguishing high- and low-confidence regions, and enhancing robustness to noisy labels through network cooperation. To mitigate performance disparities, we introduce a data adaptation strategy leveraging active learning. By selecting high-information samples based on uncertainty and diversity, the strategy enables incremental model training, improving adaptability. Extensive experiments on public and hospital datasets validate the proposed methods. Our collaborative training model and active learning strategy show significant advantages in handling noisy labels and enhancing model performance across datasets, paving the way for more precise and efficient IBD diagnosis.

AAAI Conference 2025 Conference Paper

Adversarial Learning Under Hybrid Perturbations for Robust Acute Lymphoblastic Leukemia Classification

  • Jie Chen
  • Xinyuan Liu
  • Xintong Liu
  • Jianqiang Li

Acute lymphoblastic leukemia is a childhood cancer prevalent worldwide, which can prove fatal within weeks or months. However, current diagnosis models based on machine learning and deep learning methods fail to consider device noise (pixel-level perturbations) and rotation/translation (spatial-transformed perturbations), which can undermine the model's robustness. Adversarial training is a potential solution to this issue. This paper presents a hybrid perturbation adversarial training (HPAT) strategy that leverages two types of adversarial samples: pixel-level adversarial samples and spatial adversarial samples. This work generates these hybrid adversarial samples through Projected Gradient Descent (PGD) and spatial transformation based on the Bayesian optimization (STBO) algorithm, respectively. This work introduces the Mixed Batch Normalization (MixBN) module to handle both adversarial and clean samples, alleviating the clean-accuracy degradation caused by adversarial training. The proposed hybrid adversarial training strategy is tested on the public acute lymphoblastic leukemia dataset and found to outperform existing acute lymphoblastic cell classification models.
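The PGD half of the hybrid sampling can be sketched on a plain logistic model; the model, step size, and budget below are illustrative, and the STBO spatial branch is not reproduced here.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """Projected Gradient Descent: repeatedly step in the sign of the loss
    gradient w.r.t. the input, projecting back onto the L-infinity ball of
    radius eps around the clean input."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))   # P(y=1 | x_adv)
        grad_x = (p - y) * w                          # d(BCE)/d(input)
        x_adv = x_adv + alpha * np.sign(grad_x)       # loss-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project to eps-ball
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.0
x, y = rng.normal(size=8), 1.0
x_adv = pgd_attack(x, y, w, b)
print(np.max(np.abs(x_adv - x)))  # bounded by eps = 0.1 by construction
```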

AAAI Conference 2025 Conference Paper

Aligning Instance Brownian Bridge with Texts for Open-Vocabulary Video Instance Segmentation

  • Zesen Cheng
  • Kehan Li
  • Li Hao
  • Peng Jin
  • Xiawu Zheng
  • Chang Liu
  • Jie Chen

Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because of the insufficient vocabulary of video data, previous methods leverage the image-text pretraining model for recognizing object instances by separately aligning each frame with class texts. As a result, the separation breaks the instance movement context of videos and requires a lot of inference overhead. To tackle these issues, we propose BridgeText Alignment (BTA) to link frame-level instance representations as a Brownian Bridge. On one hand, we can calculate the global descriptor of a Brownian bridge for capturing instance dynamics, which enables the text alignment to additionally consider temporal information rather than only the static information of each frame. On the other hand, according to the goal-conditioned property of the Brownian bridge, we can estimate the middle frame features via the start and the end frame features, so the global feature calculation of a Brownian bridge only needs to infer a few frames, which largely reduces inference overhead. We term our overall pipeline BriVIS. Following the training settings of previous works, BriVIS surpasses the SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary datasets (BURST, LVVIS), BriVIS achieves 5.7 and 20.9 mAP, which exhibits +2.2∼+6.7 mAP improvement compared to OV2Seg. Furthermore, after training via BTA, using only the head and the tail frames for alignment improves the speed by 32% (2.77 → 1.88 s/iter) while just decreasing the performance by 0.2 mAP (21.1 → 20.9 mAP).
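The goal-conditioned property used above follows from the Brownian bridge's closed form, which pins down any middle frame's distribution from the two endpoints alone (the endpoint features here are made-up numbers for illustration):

```python
import numpy as np

def bridge_mean_var(x0, x1, t):
    """Brownian bridge between features x0 (at t=0) and x1 (at t=1):
    E[x_t] = (1-t) x0 + t x1, and Var[x_t] = t (1-t) per dimension."""
    return (1 - t) * x0 + t * x1, t * (1 - t)

x0 = np.array([0.0, 2.0])   # start-frame instance feature
x1 = np.array([4.0, 0.0])   # end-frame instance feature
mean, var = bridge_mean_var(x0, x1, 0.5)
print(mean, var)  # [2. 1.] 0.25
```

This is why inferring only the head and tail frames suffices: every intermediate frame's expected feature is a deterministic interpolation of the endpoints, with maximal uncertainty at the midpoint.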

AAAI Conference 2025 Conference Paper

Attack-inspired Calibration Loss for Calibrating Crack Recognition

  • Zhuangzhuang Chen
  • Qiangyu Chen
  • Jiahao Zhang
  • Zhiliang Lin
  • Xingyu Feng
  • Jie Chen
  • Jianqiang Li

Deep neural networks (DNNs) have substantially achieved high predictive accuracy in many vision tasks. However, we find that they are poorly calibrated for crack recognition tasks, as these DNNs tend to produce both under-confident and over-confident predictions in such safety-critical applications, thereby limiting their practical use in real-world scenarios. To address this issue, we propose a novel attack-inspired calibration loss (AICL) that explicitly regularizes class probabilities to yield better confidence estimates. Specifically, we first propose the attack-inspired correctness estimation method (ACE) that aims to estimate the correctness degree of each sample via adversarial attacks. Then, we propose Correctness-aware Distribution Guidance, which starts from a distribution perspective that enforces the ordinal ranking of the predicted confidence referring to the estimated correctness degree. The proposed method can be conveniently implemented on top of any DNNs-based crack recognition model by serving as a plug-and-play loss function. To address the limited availability of related benchmarks, we collect a fully annotated dataset, namely, Bridge2024, which involves inconsistent cracks and noisy backgrounds in real-world bridges. Our AICL outperforms the state-of-the-art calibration methods on various benchmark datasets including CRACK2019, SDNET2018, and our Bridge2024.

NeurIPS Conference 2025 Conference Paper

Causality Meets the Table: Debiasing LLMs for Faithful TableQA via Front-Door Intervention

  • Zhen Yang
  • Ziwei Du
  • Minghan Zhang
  • Wei Du
  • Jie Chen
  • Fulan Qian
  • Shu Zhao

Table Question Answering (TableQA) combines natural language understanding and structured data reasoning, posing challenges in semantic interpretation and logical inference. Recent advances in Large Language Models (LLMs) have improved TableQA performance through Direct Prompting and Agent paradigms. However, these models often rely on spurious correlations, as they tend to overfit to token co-occurrence patterns in pretraining corpora, rather than perform genuine reasoning. To address this issue, we propose Causal Intervention TableQA (CIT), which is based on a structural causal graph and applies front-door adjustment to eliminate bias caused by token co-occurrence. CIT formalizes TableQA as a causal graph and identifies token co-occurrence patterns as confounders. By applying front-door adjustment, CIT guides question variant generation and reasoning to reduce confounding effects. Experiments on multiple benchmarks show that CIT achieves state-of-the-art performance, demonstrating its effectiveness in mitigating bias. Consistent gains across various LLMs further confirm its generalizability.
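Under the standard front-door criterion (reading the abstract as: the question/table input as treatment X, the generated question variants and reasoning as mediator M, and token co-occurrence patterns as the unobserved confounder), the adjustment takes the textbook form:

```latex
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{m} P(m \mid X) \sum_{x'} P\big(Y \mid m, x'\big)\, P(x')
```

Intuitively, averaging the mediator's effect over all possible inputs $x'$ severs the backdoor path through the confounder, which is the bias-removal mechanism the abstract describes.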

AAAI Conference 2025 Conference Paper

CLEP: A Novel Contrastive Learning Method for Evolutionary Reentrancy Vulnerability Detection

  • Jie Chen
  • Liangmin Wang
  • Huijuan Zhu
  • Victor S. Sheng

Reentrancy vulnerabilities in smart contracts have been exploited to steal enormous amounts of money, thus detecting reentrancy vulnerabilities is a hotspot issue in security research. However, a new attack is emerging in which attackers continuously release new reentrancy patterns to exploit fresh vulnerabilities and obfuscate existing ones. Existing detection methods neglect the time-series evolution of vulnerabilities across different smart contract versions, leading to a gradual decline in their effectiveness over time. We investigate the time-series correlations among vulnerabilities in various versions and refer to these as Evolutionary Reentrancy Vulnerabilities (ERVs). We summarize that ERVs detection faces two key challenges: (i) capturing the evolving pattern of ERVs along a complete evolutionary chain and (ii) detecting fresh reentrancy vulnerabilities in new versions. To address these challenges, we propose CLEP, a novel Contrastive Learning with Evolving Pairs detection method. It can effectively capture the evolving patterns by discerning similarities and differences across versions. Specifically, we first modify the sample distribution by incorporating version declarations as time-series evolution information. Then, leveraging the hierarchical similarity, we design an evolving pairs scheme to form negative and positive contract pairs across versions. Finally, we build a complete evolutionary chain by proposing a version-aware contrastive sampler. Our experimental results show that CLEP not only outperforms state-of-the-art baselines in version-specific scenarios but also shows promising performance in cross-version evolution scenarios.

TMLR Journal 2025 Journal Article

Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization

  • Wei Liu
  • Anweshit Panda
  • Ujwal Pandey
  • Christopher Brissette
  • Yikang Shen
  • George Slota
  • Naigang Wang
  • Jie Chen

In this paper, we design two compressed decentralized algorithms for solving nonconvex stochastic optimization under two different scenarios. Both algorithms adopt a momentum technique to achieve fast convergence and a message-compression technique to save communication costs. Though momentum acceleration and compressed communication have been used in literature, it is highly nontrivial to theoretically prove the effectiveness of their composition in a decentralized algorithm that can maintain the benefits of both sides, because of the need to simultaneously control the consensus error, the compression error, and the bias from the momentum gradient. For the scenario where gradients are bounded, our proposal is a compressed decentralized adaptive method. To the best of our knowledge, this is the first decentralized adaptive stochastic gradient method with compressed communication. For the scenario of data heterogeneity without bounded gradients, our proposal is a compressed decentralized heavy-ball method, which applies a gradient tracking technique to address the challenge of data heterogeneity. Notably, both methods achieve an optimal convergence rate, and they can achieve linear speedup and adopt topology-independent algorithmic parameters within a certain regime of the user-specified error tolerance. Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers.
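The heavy-ball update at the core of the second method is simple to state; the toy quadratic below is illustrative only and omits the decentralized, compressed, and gradient-tracking components that are the paper's actual contribution.

```python
import numpy as np

def heavy_ball(grad, x0, lr=0.1, beta=0.9, steps=300):
    """Polyak heavy-ball momentum:
    x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        x_new = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_new
    return x

# toy quadratic: minimize 0.5 * ||x - t||^2, whose gradient is x - t
t = np.array([1.0, -2.0, 3.0])
x_star = heavy_ball(lambda x: x - t, np.zeros(3))
print(np.round(x_star, 3))  # converges to t
```

The momentum term `beta * (x - x_prev)` reuses the previous displacement, which is what accelerates convergence but also introduces the momentum-gradient bias the analysis must control.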

AAAI Conference 2025 Conference Paper

Cross-View Graph Consistency Learning for Invariant Graph Representations

  • Jie Chen
  • Hua Mao
  • Wai Lok Woo
  • Chuanbin Liu
  • Xi Peng

Graph representation learning is fundamental for analyzing graph-structured data. Exploring invariant graph representations remains a challenge for most existing graph representation learning methods. In this paper, we propose a cross-view graph consistency learning (CGCL) method that learns invariant graph representations for link prediction. First, two complementary augmented views are derived from an incomplete graph structure through a coupled graph structure augmentation scheme. This augmentation scheme mitigates the potential information loss commonly associated with data augmentation techniques that operate on raw graph data, such as edge perturbation, node removal, and attribute masking. Second, we propose a CGCL model that can learn invariant graph representations, along with a cross-view training scheme to train it. This scheme attempts to maximize the consistency information between one augmented view and the graph structure reconstructed from the other augmented view. Furthermore, we offer a comprehensive theoretical analysis of CGCL. Extensive experiments demonstrate the effectiveness of the proposed CGCL method, which achieves competitive results on graph datasets in comparison with several state-of-the-art algorithms.

NeurIPS Conference 2025 Conference Paper

Deep Compositional Phase Diffusion for Long Motion Sequence Generation

  • Ho Yin Au
  • Jie Chen
  • Junkun Jiang
  • Jingyu Xiang

Recent research on motion generation has shown significant progress in generating semantically aligned motion with singular semantics. However, when employing these models to create composite sequences containing multiple semantically generated motion clips, they often struggle to preserve the continuity of motion dynamics at the transition boundaries between clips, resulting in awkward transitions and abrupt artifacts. To address these challenges, we present Compositional Phase Diffusion, which leverages the Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) to progressively incorporate semantic guidance and phase details from adjacent motion clips into the diffusion process. Specifically, SPDM and TPDM operate within the latent motion frequency domain established by the pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE). This allows them to learn semantically important and transition-aware phase information from variable-length motion clips during training. Experimental results demonstrate the competitive performance of our proposed framework in generating compositional motion sequences that align semantically with the input conditions, while preserving phase transitional continuity between preceding and succeeding motion clips. Additionally, the motion in-betweening task is made possible by keeping the phase parameters of the input motion sequences fixed throughout the diffusion process, showcasing the potential of extending the proposed framework to various application scenarios. Codes are available at https://github.com/asdryau/TransPhase.

AAAI Conference 2025 Conference Paper

Defense Against Model Stealing Based on Account-Aware Distribution Discrepancy

  • Jian-Ping Mei
  • Weibin Zhang
  • Jie Chen
  • Xuyun Zhang
  • Tiantian Zhu

Malicious users attempt to replicate commercial models functionally at low cost by training a clone model on query responses. It is challenging to prevent such model-stealing attacks in a timely manner while achieving strong protection and maintaining utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) that recognizes queries from malicious users by leveraging account-wise local dependency. We model each class as a multivariate normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancies. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Results of extensive experimental studies show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users, in both soft- and hard-label settings.
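A simplified stand-in for the ADD idea: fit an MVN per class in feature space and score an account by the distribution discrepancy of its queries (here, the mean min-class Mahalanobis distance). The synthetic data, feature dimension, and exact scoring rule are hypothetical:

```python
import numpy as np

def fit_class_gaussians(feats, labels):
    """Fit a mean and covariance per class in the defender's feature space."""
    stats = {}
    for c in np.unique(labels):
        X = feats[labels == c]
        stats[c] = (X.mean(axis=0),
                    np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
    return stats

def mahalanobis(x, mean, cov):
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def malicious_score(queries, stats):
    """Mean min-class Mahalanobis distance over an account's queries -- a
    simplified stand-in for ADD's weighted class-wise discrepancy."""
    return float(np.mean([min(mahalanobis(q, m, S) for m, S in stats.values())
                          for q in queries]))

rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(c, 1.0, size=(100, 4)) for c in (0.0, 5.0)])
labels = np.repeat([0, 1], 100)
stats = fit_class_gaussians(feats, labels)

benign = rng.normal(0.0, 1.0, size=(20, 4))    # queries near class 0
probing = rng.uniform(-10, 10, size=(20, 4))   # spread-out attack queries
```

Accounts whose queries spread far from every class distribution (as probing queries for model stealing tend to) receive much higher scores than benign accounts.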

AAAI Conference 2025 Conference Paper

DigitalLLaVA: Incorporating Digital Cognition Capability for Physical World Comprehension in Multimodal LLMs

  • Shiyu Li
  • Pengxu Wei
  • Pengchong Qiao
  • Chang Liu
  • Jie Chen

Multimodal Large Language Models (MLLMs) have shown remarkable cognitive capabilities in various cross-modal tasks. However, existing MLLMs struggle with tasks that require physical digital cognition, such as accurately reading an electric meter or pressure gauge. This limitation significantly reduces their effectiveness in practical applications like industrial monitoring and home energy management, where digital sensors are not feasible. For humans, physical digits are artificially defined quantities presented on specific carriers, which require training to recognize. As existing MLLMs are only pre-trained in the manner of object recognition, they fail to comprehend the relationship between digital carriers and their readings. To this end, referring to human behavior, we propose a novel DigitalLLaVA method to explicitly inject digital cognitive abilities into MLLMs in two steps. In the first step, to improve the MLLM's understanding of physical digit carriers, we propose a digit carrier mapping method. This step utilizes object-level text-image pairs to enhance the model's comprehension of objects containing physical digits. In the second step, unlike previous methods that rely on sequential digit prediction or digit regression, we propose a 32-bit floating-point simulation approach that treats digit prediction as a whole. Using digit-level text-image pairs, we train three float heads to predict 32-bit floating-point numbers via 0/1 binary classification. This step significantly reduces the search space, making the prediction process more robust and straightforward. Simple yet effective, our method can identify very precise readings (i.e., accurate to ±0.001) and provide floating-point results, showing its applicability in digital carrier domains.

NeurIPS Conference 2025 Conference Paper

Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

  • Yu Li
  • Xingyu Qiu
  • Yuqian Fu
  • Jie Chen
  • Tianwen Qian
  • Xu Zheng
  • Danda Pani Paudel
  • Yanwei Fu

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. The source code and instructions are available at https://github.com/LiYu0524/Domain-RAG.

IJCAI Conference 2025 Conference Paper

Dual-Balancing for Physics-Informed Neural Networks

  • Chenhong Zhou
  • Jie Chen
  • Zaifeng Yang
  • Ching Eng Png

Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) in the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between the PDE residual loss and the condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on the condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN significantly outperforms popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
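The two balancing steps can be sketched numerically; the gradient-norm ratio used for inter-balancing and the loss-proportional split used for intra-balancing below are simplified assumptions, not the exact DB-PINN formulas:

```python
import numpy as np

def inter_balance_weight(grad_pde, grads_cond):
    """Aggregated weight offsetting the gradient-scale gap between the PDE
    residual loss and the condition-fitting losses (simplified form)."""
    g_r = np.linalg.norm(grad_pde)
    g_c = np.mean([np.linalg.norm(g) for g in grads_cond])
    return g_r / (g_c + 1e-12)

def intra_balance(weight_agg, cond_losses):
    """Split the aggregated weight across conditions in proportion to their
    current fitting difficulty (proxied here by relative loss magnitude)."""
    losses = np.asarray(cond_losses, dtype=float)
    return weight_agg * losses / losses.sum()

# toy gradients: the PDE residual gradient is 5x the mean condition gradient
w = inter_balance_weight(np.array([10.0, 0.0]),
                         [np.array([1.0, 0.0]), np.array([0.0, 3.0])])
per_cond = intra_balance(w, [0.2, 0.8])   # the harder condition gets more weight
```

The per-condition weights sum back to the aggregated weight, so inter-balancing sets the overall scale while intra-balancing only redistributes it.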

NeurIPS Conference 2025 Conference Paper

DyMoDreamer: World Modeling with Dynamic Modulation

  • Boxuan Zhang
  • Runqing Wang
  • Wei Xiao
  • Weipu Zhang
  • Jian Sun
  • Gao Huang
  • Jie Chen
  • Gang Wang

A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high-performance agents often demands extensive environmental interactions. Model-based reinforcement learning (MBRL) mitigates this by building world models that simulate environmental dynamics and generate synthetic experience, improving sample efficiency. However, conventional world models process observations holistically, failing to decouple dynamic objects and temporal features from static backgrounds. This approach is computationally inefficient, especially for visual tasks where dynamic objects significantly influence rewards and decision-making performance. To address this, we introduce DyMoDreamer, a novel MBRL algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. DyMoDreamer employs differential observations derived from a novel inter-frame differencing mask, explicitly encoding object-level motion cues and temporal dynamics. Dynamic modulation is modeled as stochastic categorical distributions and integrated into a recurrent state-space model (RSSM), enhancing the model's focus on reward-relevant dynamics. Experiments demonstrate that DyMoDreamer sets a new state-of-the-art on the Atari 100k benchmark with a 156.6% mean human-normalized score, establishes a new record of 832 on the DeepMind Visual Control Suite, and gains a 9.5% performance improvement after 1M steps on the Crafter benchmark.
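The inter-frame differencing mask behind the differential observations can be illustrated on a toy frame pair (the threshold and the exact mask form are assumptions; DyMoDreamer's construction may differ):

```python
import numpy as np

def diff_mask(prev, curr, thresh=0.1):
    """Inter-frame differencing mask: 1 where a pixel changed, 0 elsewhere.
    (Threshold and mask form are simplified assumptions.)"""
    return (np.abs(curr.astype(float) - prev.astype(float)) > thresh).astype(np.uint8)

prev = np.zeros((4, 4))
curr = np.zeros((4, 4))
curr[1:3, 1:3] = 1.0            # a small "moving object" appears
mask = diff_mask(prev, curr)
differential = curr * mask      # static background suppressed
```

Masking out unchanged pixels leaves only the moving object, which is the kind of object-level motion cue the world model is then conditioned on.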

NeurIPS Conference 2025 Conference Paper

GMV: A Unified and Efficient Graph Multi-View Learning Framework

  • Qipeng Zhu
  • Jie Chen
  • Jian Pu
  • Junping Zhang

Graph Neural Networks (GNNs) are pivotal in graph classification but often struggle with generalization and overfitting. We introduce a unified and efficient Graph Multi-View (GMV) learning framework that integrates multi-view learning into GNNs to enhance robustness and efficiency. Leveraging the lottery ticket hypothesis, GMV activates diverse sub-networks within a single GNN through a novel training pipeline, which includes mixed-view generation, and multi-view decomposition and learning. This approach simultaneously broadens "views" from the data, model, and optimization perspectives during training to enhance the generalization capabilities of GNNs. During inference, GMV only incorporates additional prediction heads into standard GNNs, thereby achieving multi-view learning at minimal cost. Our experiments demonstrate that GMV surpasses other augmentation and ensemble techniques for GNNs and Graph Transformers across various graph classification scenarios.

ICML Conference 2025 Conference Paper

GPEN: Global Position Encoding Network for Enhanced Subgraph Representation Learning

  • Nannan Wu
  • Yuming Huang
  • Yiming Zhao
  • Jie Chen
  • Wenjun Wang 0002

Subgraph representation learning has attracted growing interest due to its wide applications in various domains. However, existing methods primarily focus on local neighborhood structures while overlooking the significant impact of global structural information, in particular the influence of multi-hop neighbors beyond immediate neighborhoods. This presents two key challenges: how to effectively capture the structural relationships between distant nodes, and how to prevent excessive aggregation of global structural information from weakening the discriminative ability of subgraph representations. To address these challenges, we propose GPEN (Global Position Encoding Network). GPEN leverages a hierarchical tree structure to encode each node's global position based on its path distance to the root node, enabling a systematic way to capture relationships between distant nodes. Furthermore, we introduce a boundary-aware convolution module that selectively integrates global structural information while maintaining the unique structural patterns of each subgraph. Extensive experiments on eight public datasets show that GPEN significantly outperforms state-of-the-art methods in subgraph representation learning.

ICLR Conference 2025 Conference Paper

Graph Neural Preconditioners for Iterative Solutions of Sparse Linear Systems

  • Jie Chen

Preconditioning is at the heart of iterative solutions of large, sparse linear systems of equations in scientific disciplines. Several algebraic approaches, which access no information beyond the matrix itself, are widely studied and used, but ill-conditioned matrices remain very challenging. We take a machine learning approach and propose using graph neural networks as a general-purpose preconditioner. They show attractive performance for many problems and can be used when the mainstream preconditioners perform poorly. Empirical evaluation on over 800 matrices suggests that the construction time of these graph neural preconditioners (GNPs) is more predictable and can be much shorter than that of other widely used ones, such as ILU and AMG, while the execution time is faster than using a Krylov method as the preconditioner, such as in inner-outer GMRES. GNPs have a strong potential for solving large-scale, challenging algebraic problems arising from not only partial differential equations, but also economics, statistics, graph, and optimization, to name a few.
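The plug-in pattern this approach relies on: any learned map r ↦ M⁻¹r can be wrapped as a `LinearOperator` and handed to GMRES as the preconditioner. Since a trained GNP is not available here, a simple Jacobi apply stands in for the network's forward pass, and the system matrix is an assumed toy tridiagonal example:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, gmres

# Build a sparse, diagonally dominant test system (tridiagonal, 1-D stencil).
n = 200
A = diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()
b = np.ones(n)

# Any learned map r -> M^{-1} r (e.g. a trained GNP forward pass) can be
# wrapped as a LinearOperator; a Jacobi apply stands in for the network here.
inv_diag = 1.0 / A.diagonal()
def precond_apply(r):
    return inv_diag * r   # placeholder for gnp_model(r)

M = LinearOperator((n, n), matvec=precond_apply)
x, info = gmres(A, b, M=M, restart=30)   # info == 0 signals convergence
```

The solver only ever calls `matvec`, which is what makes a neural network a drop-in replacement for ILU or AMG in this interface.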

JBHI Journal 2025 Journal Article

Interpretable Dynamic Directed Graph Convolutional Network for Multi-Relational Prediction of Missense Mutation and Drug Response

  • Qian Gao
  • Tao Xu
  • Xiaodi Li
  • Wanling Gao
  • Haoyuan Shi
  • Youhua Zhang
  • Jie Chen
  • Zhenyu Yue

Tumor heterogeneity presents a significant challenge in predicting drug responses, especially as missense mutations within the same gene can lead to varied outcomes such as drug resistance, enhanced sensitivity, or therapeutic ineffectiveness. These complex relationships highlight the need for advanced analytical approaches in oncology. Due to their powerful ability to handle heterogeneous data, graph convolutional networks (GCNs) represent a promising approach for predicting drug responses. However, simple bipartite graphs cannot accurately capture the complex relationships involved in missense mutation and drug response. Furthermore, deep learning models for drug response are often considered “black boxes”, and their interpretability remains a widely discussed issue. To address these challenges, we propose an Interpretable Dynamic Directed Graph Convolutional Network (IDDGCN) framework, which incorporates four key features: 1) the use of directed graphs to differentiate between sensitivity and resistance relationships, 2) the dynamic updating of node weights based on node-specific interactions, 3) the exploration of associations between different mutations within the same gene and drug response, and 4) the enhancement of model interpretability through the integration of a weighting mechanism that accounts for biological significance, alongside a ground-truth construction method to evaluate prediction transparency. The experimental results demonstrate that IDDGCN outperforms existing state-of-the-art models, exhibiting excellent predictive power. Both qualitative and quantitative evaluations of its interpretability further highlight its ability to explain predictions, offering a fresh perspective for precision oncology and targeted drug development.

NeurIPS Conference 2025 Conference Paper

Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

  • Yuanzhe Liu
  • Ryan Deng
  • Tim Kaler
  • Xuhao Chen
  • Charles Leiserson
  • Yao Ma
  • Jie Chen

Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occurs at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no single one dominates the others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation--banking--selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.

IJCAI Conference 2025 Conference Paper

MTPNet: Multi-Grained Target Perception for Unified Activity Cliff Prediction

  • Zishan Shu
  • Yufan Deng
  • Hongyu Zhang
  • Zhiwei Nie
  • Jie Chen

Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate prior knowledge of the interactions between molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. In this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, this is the first work to employ receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: https://github.com/ZishanShu/MTPNet.

NeurIPS Conference 2025 Conference Paper

SE-GUI: Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

  • Xinbin Yuan
  • Jian Zhang
  • Kaixin Li
  • Zhuoxuan Cai
  • Lujian Yao
  • Jie Chen
  • Enguang Wang
  • Qibin Hou

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised fine-tuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL)-based framework that incorporates three core strategies: (1) seed data curation to ensure high-quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self-evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.

ICML Conference 2025 Conference Paper

Teaching Language Models to Critique via Reinforcement Learning

  • Zhihui Xie 0002
  • Jie Chen
  • Liyu Chen
  • Weichao Mao
  • Jingjing Xu 0001
  • Lingpeng Kong

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.

IROS Conference 2025 Conference Paper

Three-DOF controlled flight in palm-scale micro robotic blimp driven by flapping wings

  • Jie Chen
  • Xiang Lu
  • Yulie Wu
  • Yang Chen
  • Dingbang Xiao
  • Xuezhong Wu

Micro blimps exhibit significant potential for applications in environmental monitoring and disaster rescue. Nonetheless, traditional propulsion methods for micro blimps encounter challenges such as complex mechanical structures, intricate attitude control, and large volumes. This paper presents a novel compact and lightweight bio-inspired micro robotic blimp driven by piezoelectric (PZT) flapping wings, featuring a simplified structure and achieving three-degree-of-freedom (DOF) motion control with only two flapping-wing thruster units. We present a high-voltage drive-sense-control circuit and an adaptive control strategy, enabling wireless remote control, onboard attitude sensing, and closed-loop yaw control. The proposed micro robotic blimp, powered by an onboard battery, measures 15 cm along its major axis, weighs 1.53 g, achieves a maneuvering speed of 17 cm/s and an angular velocity of 12°/s, with a yaw angle control accuracy of 0.5°. As the smallest and lightest known self-powered micro blimp capable of stable yaw control, the platform demonstrates excellent endurance and environmental stealth characteristics and advances the design of micro aerial vehicles by offering a novel and efficient approach.

NeurIPS Conference 2025 Conference Paper

Unraveling Metameric Dilemma for Spectral Reconstruction: A High-Fidelity Approach via Semi-Supervised Learning

  • Xingxing Yang
  • Jie Chen
  • Zaifeng Yang

Spectral reconstruction from RGB images often suffers from a metameric dilemma, where distinct spectral distributions map to nearly identical RGB values, making them indistinguishable to current models and leading to unreliable reconstructions. In this paper, we present Diff-Spectra, which integrates supervised physics-aware spectral estimation and unsupervised high-fidelity spectral regularization for HSI reconstruction. We first introduce an Adaptive illumiChroma Decoupling (AICD) module to decouple illumination and chrominance information, which learns intrinsic and distinctive feature distributions, thereby mitigating the metameric issue. Then, we incorporate the AICD into a learnable spectral response function (SRF) guided hyperspectral initial estimation mechanism to mimic physical image formation and thus inject physics-aware reasoning into neural networks, turning an ill-posed problem into a constrained, interpretable task. We also introduce a metameric spectra augmentation method to synthesize comprehensive hyperspectral data to pre-train a Spectral Diffusion Module (SDM), which internalizes the statistical properties of real-world HSI data, enforcing unsupervised high-fidelity regularization on the spectral transitions via inner-loop optimization during inference. Extensive experimental evaluations demonstrate that our Diff-Spectra achieves SOTA performance on both spectral reconstruction and downstream HSI classification.
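Why metamerism makes the inversion ill-posed can be shown with a toy, fixed SRF (the paper learns its SRF; the values here are made up for illustration): two spectra that differ by a null-space component of the SRF produce identical RGB values under the image-formation model RGB = SRF @ spectrum.

```python
import numpy as np

# A toy spectral response function (SRF): 3 RGB channels x 6 spectral bands.
srf = np.array([
    [0.0, 0.0, 0.1, 0.4, 0.4, 0.1],   # R
    [0.1, 0.4, 0.4, 0.1, 0.0, 0.0],   # G
    [0.4, 0.4, 0.1, 0.0, 0.0, 0.1],   # B
])

s1 = np.array([0.2, 0.1, 0.3, 0.5, 0.2, 0.4])   # one spectrum
_, _, vt = np.linalg.svd(srf)
null_dir = vt[-1]                  # direction in the null space of the SRF
s2 = s1 + 0.2 * null_dir           # a metamer: different spectrum, same RGB

rgb1, rgb2 = srf @ s1, srf @ s2    # image formation: RGB = SRF @ spectrum
```

Since the SRF maps 6 bands to 3 channels, its null space is nonempty, so RGB alone can never distinguish such metamers; that is the ambiguity the spectral prior is meant to resolve.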

JBHI Journal 2025 Journal Article

Valence-Arousal Disentangled Representation Learning for Emotion Recognition in SSVEP-Based BCIs

  • Yipeng Du
  • Jie Chen
  • Zhengwu Liu
  • Ngai Wong
  • Chi Zhang
  • Zhiwei Ding
  • Jian Liu
  • Edith C.H. Ngai

Steady state visually evoked potential (SSVEP)-based brain-computer interfaces (BCIs), which are widely used in rehabilitation and disability assistance, can benefit from real-time emotion recognition to enhance human–machine interaction. However, the learned discriminative latent representations in SSVEP-BCIs may generalize in an unintended direction, which can lead to reduced accuracy in detecting emotional states. In this paper, we introduce a Valence-Arousal Disentangled Representation Learning (VADL) method, drawing inspiration from the classical two-dimensional emotional model, to enhance the performance and generalization of emotion recognition within SSVEP-BCIs. VADL distinctly disentangles the latent variables of valence and arousal information to improve accuracy. It utilizes the structured state space duality model to thoroughly extract global emotional features. Additionally, we propose a Multisubject Gradient Blending training strategy that individually tailors the learning pace of reconstruction and discrimination tasks within VADL on-the-fly. To verify the feasibility of our method, we have developed a comprehensive database comprising 23 subjects, in which both the emotional states and SSVEPs were effectively elicited. Experimental results indicate that VADL surpasses existing state-of-the-art benchmark algorithms.

IROS Conference 2024 Conference Paper

A Point-Line Features Fusion Method for Fast and Robust Monocular Visual-Inertial Initialization

  • Guoqiang Xie
  • Jie Chen
  • Tianhang Tang
  • Zeyu Chen
  • Ling Lei
  • Yiguang Liu

Fast and robust initialization is essential for highly accurate monocular visual-inertial odometry (VIO), but at present the majority of initialization methods rely only on point features and are unstable in low-texture and blurred scenes. Therefore, we propose a novel point-line feature fusion method for monocular visual-inertial initialization, as line features are more stable and provide richer geometric information than point features: 1) a closed-form line feature initialization method is presented and combined with point features to obtain a more integrated and robust linear system; 2) a monocular depth network is adopted to provide a learned affine-invariant depth map, requiring only one prior depth map for the first frame, which improves performance under low-parallax scenarios; 3) our formulation makes it easy to apply RANSAC to reject outliers when solving the linear system. Moreover, a line feature re-projection residual is added to visual-inertial bundle adjustment (VI-BA) to obtain more accurate initial parameters. Thanks to the line features, the proposed method is more accurate and robust than state-of-the-art methods, especially under extreme low-parallax scenarios, as confirmed by extensive experiments on popular datasets: a 0.5 s initialization window on EuRoC MAV and a 0.3 s window on TUM-VI, whereas standard methods normally wait for a 2 s window.

NeurIPS Conference 2024 Conference Paper

Automated Label Unification for Multi-Dataset Semantic Segmentation with GNNs

  • Rong Ma
  • Jie Chen
  • Xiangyang Xue
  • Jian Pu

Deep supervised models possess significant capability to assimilate extensive training data, thereby presenting an opportunity to enhance model performance through training on multiple datasets. However, conflicts arising from different label spaces among datasets may adversely affect model performance. In this paper, we propose a novel approach to automatically construct a unified label space across multiple datasets using graph neural networks. This enables semantic segmentation models to be trained simultaneously on multiple datasets, resulting in performance improvements. Unlike existing methods, our approach facilitates seamless training without the need for additional manual reannotation or taxonomy reconciliation. This significantly enhances the efficiency and effectiveness of multi-dataset segmentation model training. The results demonstrate that our method significantly outperforms other multi-dataset training methods when trained on seven datasets simultaneously, and achieves state-of-the-art performance on the WildDash 2 benchmark. Our code can be found at https://github.com/Mrhonor/AutoUniSeg.

JBHI Journal 2024 Journal Article

CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation With Large Language Models

  • Yuqi Wu
  • Kaining Mao
  • Yanbo Zhang
  • Jie Chen

The global prevalence of mental health disorders is increasing, leading to a significant economic burden estimated in trillions of dollars. In automated mental health diagnosis, the scarcity and imbalance of clinical data pose considerable challenges for researchers, limiting the effectiveness of machine learning algorithms. To cope with this issue, this paper introduces a novel clinical transcript data augmentation framework that leverages large language models (CALLM). The framework follows a “patient-doctor role-playing” intuition to generate realistic synthetic data. In addition, our study introduces a unique “Textbook-Assignment-Application” (T-A-A) partitioning approach to offer a systematic means of crafting synthetic clinical interview datasets. Concurrently, we have also developed a “Response-Reason” prompt engineering paradigm to generate highly authentic and diagnostically valuable transcripts. By leveraging a fine-tuned DistilBERT model on the E-DAIC PTSD dataset, we achieved a balanced accuracy of 0.77, an F1-score of 0.70, and an AUC of 0.78 during test set evaluations, which showcase robust adaptability in both Zero-Shot Learning (ZSL) and Few-Shot Learning (FSL) scenarios. We further compare the CALLM framework with other data augmentation methods and PTSD diagnostic works and demonstrate consistent improvements. Compared to conventional data collection methods, our synthetic dataset not only demonstrates superior performance but also incurs less than 1% of the associated costs.

AAAI Conference 2024 Conference Paper

CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning

  • Qingsong Yan
  • Qiang Wang
  • Kaiyong Zhao
  • Jie Chen
  • Bo Li
  • Xiaowen Chu
  • Fei Deng

Neural Radiance Fields have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differentiable volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to handle scenes with large rotations in practice. To overcome this limitation, we propose a novel camera parameter free neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion. Given a sequence of images, CF-NeRF estimates camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset, NeRFBuster, which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to rotation and achieves state-of-the-art results without providing prior information and constraints.

JBHI Journal 2024 Journal Article

Difference-Deformable Convolution With Pseudo Scale Instance Map for Cell Localization

  • Chengyang Zhang
  • Jie Chen
  • Bo Li
  • Min Feng
  • Yongquan Yang
  • Qikui Zhu
  • Hong Bu

Cell localization still faces two unresolved challenges: 1) the dramatic variations in cell morphology, coupled with the heterogeneous intensity distribution of lightly stained cells; 2) existing cell location maps lack scale information, resulting in insufficient supervision for point maps and inaccurate supervision for density maps. 1) To address the first challenge, we introduce a novel gradient-aware and shape-adaptive Difference-Deformable Convolution (DDConv), which enhances the model's robustness to color by leveraging gradient information while adaptively adjusting the shape of the convolutional kernel to tackle the substantial variability in cell morphology. 2) To overcome the issue of unreasonable location maps, we propose the Pseudo-Scale Instance (PSI) map, which can adaptively provide the corresponding scale information for each cell to realize accurate supervision. We analyze and evaluate DDConv and the PSI map in three challenging cell localization tasks. In comparison to existing methods, our proposed approach significantly enhances localization performance, setting a new benchmark for the cell localization task.

TMLR Journal 2024 Journal Article

GLASU: A Communication-Efficient Algorithm for Federated Learning with Vertically Distributed Graph Data

  • Xinwei Zhang
  • Mingyi Hong
  • Jie Chen

Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case when samples are independent, but it rarely addresses an emerging scenario when samples are interrelated through a graph. In this work, we train a graph neural network (GNN) through VFL, where each client owns a part of the node features and a different edge set. This data scenario incurs a significant communication overhead, not only because of the handling of distributed features but also due to neighborhood aggregation in a GNN. Moreover, the training analysis is faced with a challenge caused by the biased stochastic gradients. We propose a model-splitting method that splits a backbone GNN across the clients and the server and a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip communication in neighborhood aggregation and in model updates, respectively, greatly reducing communication while enjoying convergence guarantees. We conduct extensive numerical experiments on real-world datasets, showing that GLASU effectively trains a GNN that matches the accuracy of centralized training, while using only a fraction of the time due to communication saving.
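
The lazy-aggregation idea can be illustrated in a few lines: a client communicates a fresh neighborhood aggregation only every K rounds and reuses the cached (stale) result otherwise, skipping the expensive cross-client exchange. A minimal sketch under stated assumptions (the function name, the period `K`, and the single-client view are illustrative, not taken from the paper):

```python
import numpy as np

def glasu_round(H_local, A, cache, t, K=4):
    """One communication round of a GLASU-style lazy-aggregation scheme (sketch).

    H_local : this client's node-feature block, shape (n, d)
    A       : normalized adjacency, shape (n, n)
    cache   : dict holding the last communicated aggregation
    t       : round counter; real communication happens only every K rounds
    """
    if t % K == 0 or "agg" not in cache:
        # fresh neighborhood aggregation (this is the step that requires
        # cross-client communication in the actual VFL setting)
        cache["agg"] = A @ H_local
    # lazy rounds reuse the stale aggregation, saving communication
    return cache["agg"]
```

The saving comes from the fact that between communication rounds the returned aggregation is frozen, even as local features keep changing.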

NeurIPS Conference 2024 Conference Paper

Graph Neural Flows for Unveiling Systemic Interactions Among Irregularly Sampled Time Series

  • Giangiacomo Mercatali
  • Andre Freitas
  • Jie Chen

Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.

NeurIPS Conference 2024 Conference Paper

HiCoM: Hierarchical Coherent Motion for Dynamic Streamable Scenes with 3D Gaussian Splatting

  • Qiankun Gao
  • Jiarui Meng
  • Chengxiang Wen
  • Jie Chen
  • Jian Zhang

The online reconstruction of dynamic scenes from multi-view streaming videos faces significant challenges in training, rendering and storage efficiency. Harnessing superior learning speed and real-time rendering capabilities, 3D Gaussian Splatting (3DGS) has recently demonstrated considerable potential in this field. However, 3DGS can be inefficient in terms of storage and prone to overfitting by excessively growing Gaussians, particularly with limited views. This paper proposes an efficient framework, dubbed HiCoM, with three key components. First, we construct a compact and robust initial 3DGS representation using a perturbation smoothing strategy. Next, we introduce a Hierarchical Coherent Motion mechanism that leverages the inherent non-uniform distribution and local consistency of 3D Gaussians to swiftly and accurately learn motions across frames. Finally, we continually refine the 3DGS with additional Gaussians, which are later merged into the initial 3DGS to maintain consistency with the evolving scene. To preserve a compact representation, an equivalent number of low-opacity Gaussians that minimally impact the representation are removed before processing subsequent frames. Extensive experiments conducted on two widely used datasets show that our framework improves learning efficiency of the state-of-the-art methods by about 20% and reduces the data storage by 85%, achieving competitive free-viewpoint video synthesis quality but with higher robustness and stability. Moreover, by learning multiple frames in parallel, our HiCoM decreases the average training wall time to <2 seconds per frame with negligible performance degradation, substantially boosting real-world applicability and responsiveness.

JBHI Journal 2024 Journal Article

Hybrid Bayesian Optimization-Based Graphical Discovery for Methylation Sites Prediction

  • Lingyan Gu
  • Tingbo Chen
  • Jianqiang Li
  • Yu-An Huang
  • Zhihua Du
  • Victor C.M. Leung
  • Jie Chen

Protein methylation is one of the most important reversible post-translational modifications (PTMs), playing a vital role in the regulation of gene expression. Protein methylation sites serve as biomarkers in cardiovascular and pulmonary diseases, influencing various aspects of normal cell biology and pathogenesis. Nonetheless, the majority of existing computational methods for predicting protein methylation sites (PMSP) have been constructed based on protein sequences, with few methods leveraging the topological information of proteins. To address this issue, we propose an innovative framework for predicting Methylation Sites using Graphs (GraphMethySite) that employs a graph convolution network in conjunction with Bayesian Optimization (BO) to automatically discover the graphical structure surrounding a candidate site and improve the predictive accuracy. In order to extract the optimal subgraphs associated with methylation sites, we extend GraphMethySite by coupling it with a hybrid Bayesian optimization (together named GraphMethySite$^+$) to determine and visualize the topological relevance among amino-acid residues. We evaluated our framework on two extended protein methylation datasets, and empirical results demonstrate that it outperforms existing state-of-the-art methylation prediction methods.

AAAI Conference 2024 Conference Paper

Hyperspectral Image Reconstruction via Combinatorial Embedding of Cross-Channel Spatio-Spectral Clues

  • Xingxing Yang
  • Jie Chen
  • Zaifeng Yang

Existing learning-based hyperspectral reconstruction methods show limitations in fully exploiting the information among the hyperspectral bands. As such, we propose to investigate the chromatic inter-dependencies in their respective hyperspectral embedding space. These embedded features can be fully exploited by querying the inter-channel correlations in a combinatorial manner, with the unique and complementary information efficiently fused into the final prediction. We found such independent modeling and combinatorial excavation mechanisms are extremely beneficial to uncover marginal spectral features, especially in the long wavelength bands. In addition, we have proposed a spatio-spectral attention block and a spectrum-fusion attention module, which greatly facilitates the excavation and fusion of information at both semantically long-range levels and fine-grained pixel levels across all dimensions. Extensive quantitative and qualitative experiments show that our method (dubbed CESST) achieves SOTA performance. Code for this project is at: https://github.com/AlexYangxx/CESST.

AAAI Conference 2024 Conference Paper

Parallel Vertex Diffusion for Unified Visual Grounding

  • Zesen Cheng
  • Kehan Li
  • Peng Jin
  • Siheng Li
  • Xiangyang Ji
  • Li Yuan
  • Chang Liu
  • Jie Chen

Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling of object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to error accumulation and heavy computation, especially for high-dimensional sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, training diffusion models without geometry constraints typically suffers from slow convergence. Therefore, we consummate our PVD with two critical components, i.e., a center anchor mechanism and an angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference between prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks.

NeurIPS Conference 2024 Conference Paper

Parameterized Approximation Schemes for Fair-Range Clustering

  • Zhen Zhang
  • Xiaohong Chen
  • Limei Liu
  • Jie Chen
  • Junyu Huang
  • Qilong Feng

Fair-range clustering extends classical clustering formulations by associating each data point with one or more demographic labels. It imposes lower and upper bound constraints on the number of facilities opened for each label, ensuring fair representation of all demographic groups by the selected facilities. In this paper we focus on the fair-range $k$-median and $k$-means problems in Euclidean spaces. We give $(1+\varepsilon)$-approximation algorithms with fixed-parameter tractable running times for both problems, parameterized by the numbers of opened facilities and demographic labels. For Euclidean metrics, these are the first parameterized approximation schemes for the problems, improving upon the previously known $O(1)$-approximation ratios given by Thejaswi et al. (KDD 2022).
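
The fair-range constraint itself is simple to state: for every demographic label, the number of opened facilities carrying that label must lie within a per-label [lower, upper] range. A minimal feasibility check (all names and the data layout are hypothetical, chosen only for illustration):

```python
from collections import Counter

def fair_range_feasible(open_facilities, labels, bounds):
    """Check the fair-range constraints for a candidate facility set.

    open_facilities : iterable of facility ids selected to open
    labels          : dict mapping facility id -> set of demographic labels
    bounds          : dict mapping label -> (lower, upper) count bounds
    """
    # count how many opened facilities carry each label
    counts = Counter(l for f in open_facilities for l in labels[f])
    # every label's count must fall inside its [lower, upper] range
    return all(lo <= counts.get(lbl, 0) <= hi
               for lbl, (lo, hi) in bounds.items())
```

The clustering objective ($k$-median or $k$-means cost) is then minimized over exactly the facility sets that pass this check.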

AAAI Conference 2024 Conference Paper

Practical Privacy-Preserving MLaaS: When Compressive Sensing Meets Generative Networks

  • Jia Wang
  • Wuqiang Su
  • Zushu Huang
  • Jie Chen
  • Chengwen Luo
  • Jianqiang Li

The Machine-Learning-as-a-Service (MLaaS) framework allows one to grab low-hanging fruit of machine learning techniques and data science, without either much expertise in this sophisticated sphere or the provision of specific infrastructures. However, the requirement of revealing all training data to the service provider raises new concerns in terms of privacy leakage, storage consumption, efficiency, bandwidth, etc. In this paper, we propose a lightweight privacy-preserving MLaaS framework by combining Compressive Sensing (CS) and Generative Networks. It builds on the favorable observation in recent works that general inference tasks can be fulfilled with generative networks and classifiers trained on compressed measurements, since the generator can model the data distribution and capture discriminative information useful for classification. To improve the performance of the MLaaS framework, the supervised generative models of the server are trained and optimized with prior knowledge provided by the client. In order to prevent the service provider from recovering the original data as well as identifying the queried results, a noise-addition mechanism is designed and applied in the compressed data domain. Empirical results confirm its superiority in accuracy and resource consumption against the state-of-the-art privacy-preserving MLaaS frameworks.

TMLR Journal 2024 Journal Article

SA-MLP: Distilling Graph Knowledge from GNNs into Structure-Aware MLP

  • Jie Chen
  • Mingyuan Bai
  • Shouzhen Chen
  • Junbin Gao
  • Junping Zhang
  • Jian Pu

The recursive node fetching and aggregation in message-passing cause inference latency when deploying Graph Neural Networks (GNNs) to large-scale graphs. One promising inference acceleration direction is to distill GNNs into message-passing-free student Multi-Layer Perceptrons (MLPs). However, the MLP student without graph dependency cannot fully learn the structure knowledge from GNNs, which causes inferior performance in heterophilic and online scenarios. To address this problem, we first design a simple yet effective Structure-Aware MLP (SA-MLP) as a student model. It utilizes linear layers as encoders and decoders to capture features and graph structures without message-passing among nodes. Furthermore, we introduce a novel structure-mixing knowledge distillation technique. It generates virtual samples imbued with a hybrid of structure knowledge from teacher GNNs, thereby enhancing the learning ability of MLPs for structure information. Extensive experiments on eight benchmark datasets under both transductive and online settings show that our SA-MLP can consistently achieve similar or even better results than teacher GNNs while maintaining as fast inference speed as MLPs. Our findings reveal that SA-MLP efficiently assimilates graph knowledge through distillation from GNNs in an end-to-end manner, eliminating the need for complex model architectures and preprocessing of features/structures. Our code is available at https://github.com/JC-202/SA-MLP.

AAAI Conference 2024 Conference Paper

Secure Distributed Sparse Gaussian Process Models Using Multi-Key Homomorphic Encryption

  • Adil Nawaz
  • Guopeng Chen
  • Muhammad Umair Raza
  • Zahid Iqbal
  • Jianqiang Li
  • Victor C.M. Leung
  • Jie Chen

Distributed sparse Gaussian process (dGP) models provide the ability to achieve accurate predictive performance using data from multiple devices in a time-efficient and scalable manner. The distributed computation of the model, however, risks exposure of privately owned data to public manipulation. In this paper we propose a secure solution for dGP regression models using multi-key homomorphic encryption. Experimental results show that with a small sacrifice in time complexity, we achieve a secure dGP model without deteriorating the predictive performance compared to traditional non-secure dGP models. We also present a practical implementation of the proposed model using several Nvidia Jetson Nano Developer Kit modules to simulate a real-world scenario. Thus, the secure dGP model addresses the data security issues of dGP and provides a secure and trustworthy solution for multiple devices to use privately owned data for model computation in a distributed environment, while retaining the speed, scalability and robustness of dGP.

ICML Conference 2023 Conference Paper

A Gromov-Wasserstein Geometric View of Spectrum-Preserving Graph Coarsening

  • Yifan Chen 0004
  • Rentian Yao
  • Yun Yang
  • Jie Chen

Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element on a metric space equipped with the Gromov–Wasserstein (GW) distance, and bound the difference between the distance of two graphs and their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with the proper choice of the kernel. The study includes a set of experiments to support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening.
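
Weighted kernel $K$-means, the workhorse the paper builds on, admits a compact Lloyd-style sketch that needs only the Gram matrix and the point weights: the squared distance from each point to a cluster centroid is expanded entirely through the kernel trick. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def weighted_kernel_kmeans(K, w, assign, n_iter=10):
    """Lloyd iterations of weighted kernel K-means.

    K      : Gram matrix, shape (n, n)
    w      : nonnegative point weights, shape (n,)
    assign : initial integer cluster assignment, shape (n,)
    """
    n = K.shape[0]
    for _ in range(n_iter):
        n_clusters = assign.max() + 1
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            mask = assign == c
            wc = w[mask]
            s = wc.sum()
            if s == 0:                      # empty cluster: never chosen
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - mu_c||^2 expanded via the kernel trick
            cross = K[:, mask] @ wc / s
            within = wc @ K[np.ix_(mask, mask)] @ wc / s**2
            dist[:, c] = np.diag(K) - 2 * cross + within
        new = dist.argmin(axis=1)
        if np.array_equal(new, assign):     # converged
            break
        assign = new
    return assign
```

With a linear kernel this reduces to ordinary weighted $K$-means; the spectrum- or distance-preserving behavior comes from the particular kernel chosen, which is the paper's contribution.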

JBHI Journal 2023 Journal Article

BMAnet: Boundary Mining With Adversarial Learning for Semi-Supervised 2D Myocardial Infarction Segmentation

  • Chenchu Xu
  • Yifei Wang
  • Dong Zhang
  • Longfei Han
  • Yanping Zhang
  • Jie Chen
  • Shuo Li

Automatic segmentation of myocardial infarction (MI) regions in late gadolinium-enhanced cardiac magnetic resonance images is an essential step in the computed diagnosis of myocardial infarction. Most of the current myocardial infarction region segmentation methods are based on fully supervised deep learning. However, cardiologists' annotation of myocardial infarction regions in cardiac magnetic resonance images during the diagnosis process is time-consuming and expensive. This paper proposes a semi-supervised myocardial infarction segmentation method. It consists of two models: 1) a boundary mining model and 2) an adversarial learning model. The boundary mining model can solve the boundary ambiguity problem by enlarging the gap between the foreground and background features, thus segmenting the myocardial infarction region accurately. The adversarial learning model can make the boundary mining model learn from additional unlabeled data by evaluating the segmentation performance and providing pseudo supervision, which significantly increases the robustness of the boundary mining model. We conduct extensive experiments on an in-house myocardial magnetic resonance dataset. The experimental results on six evaluation metrics demonstrate that our method achieves excellent results in myocardial infarction segmentation and outperforms the state-of-the-art semi-supervised methods.

ICML Conference 2023 Conference Paper

Compressed Decentralized Proximal Stochastic Gradient Method for Nonconvex Composite Problems with Heterogeneous Data

  • Yonggui Yan
  • Jie Chen
  • Pin-Yu Chen
  • Xiaodong Cui
  • Songtao Lu
  • Yangyang Xu

We first propose a decentralized proximal stochastic gradient tracking method (DProxSGT) for nonconvex stochastic composite problems, with data heterogeneously distributed on multiple workers in a decentralized connected network. To save communication cost, we then extend DProxSGT to a compressed method by compressing the communicated information. Both methods need only $\mathcal{O}(1)$ samples per worker for each proximal update, which is important to achieve good generalization performance on training deep neural networks. With a smoothness condition on the expected loss function (but not on each sample function), the proposed methods can achieve an optimal sample complexity result to produce a near-stationary point. Numerical experiments on training neural networks demonstrate the significantly better generalization performance of our methods over large-batch training methods and momentum variance-reduction methods, as well as the ability of the gradient tracking scheme to handle heterogeneous data.
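
The interplay of consensus mixing, gradient tracking, and a proximal map can be sketched as one synchronized step across all workers. This is an illustrative sketch only: the exact update ordering and the choice of an $\ell_1$ regularizer (whose proximal map is soft-thresholding) are assumptions, not details taken from the paper:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dproxsgt_step(X, Y, G_new, G_old, W, eta, lam):
    """One synchronized step of a DProxSGT-style update (sketch).

    X, Y          : per-worker iterates and gradient trackers, shape (n_workers, d)
    G_new, G_old  : fresh and previous stochastic gradients, same shape
    W             : doubly stochastic mixing matrix, shape (n_workers, n_workers)
    eta, lam      : step size and l1 regularization weight
    """
    X_half = W @ X - eta * Y                    # consensus mixing + tracked-gradient step
    X_next = soft_threshold(X_half, eta * lam)  # proximal map of eta*lam*||.||_1
    Y_next = W @ (Y + G_new - G_old)            # gradient tracking update
    return X_next, Y_next
```

A key invariant of gradient tracking is visible here: because W is doubly stochastic, the average of the trackers over workers follows the average of the accumulated stochastic gradients.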

NeurIPS Conference 2023 Conference Paper

Discover and Align Taxonomic Context Priors for Open-world Semi-Supervised Learning

  • Yu Wang
  • Zhun Zhong
  • Pengchong Qiao
  • Xuxin Cheng
  • Xiawu Zheng
  • Chang Liu
  • Nicu Sebe
  • Rongrong Ji

Open-world Semi-Supervised Learning (OSSL) is a realistic and challenging task, aiming to classify unlabeled samples from both seen and novel classes using partially labeled samples from the seen classes. Previous works typically explore the relationship of samples as priors on the pre-defined single-granularity labels to help novel class recognition. In fact, classes follow a taxonomy and samples can be classified at multiple levels of granularity, which contains more underlying relationships for supervision. We thus argue that learning with single-granularity labels results in sub-optimal representation learning and inaccurate pseudo labels, especially with unknown classes. In this paper, we take the initiative to explore and propose a unified framework, called Taxonomic context prIors Discovering and Aligning (TIDA), which exploits the relationship of samples at various granularities. It allows us to discover multi-granularity semantic concepts as taxonomic context priors (i.e., sub-class, target-class, and super-class), and then collaboratively leverage them to enhance representation learning and improve the quality of pseudo labels. Specifically, TIDA comprises two components: i) A taxonomic context discovery module that constructs a set of hierarchical prototypes in the latent space to discover the underlying taxonomic context priors; ii) A taxonomic context-based prediction alignment module that enforces consistency across hierarchical predictions to build reliable relationships between classes across granularities and provide additional supervision. We demonstrate that these two components are mutually beneficial for an effective OSSL framework, which is theoretically explained from the perspective of the EM algorithm. Extensive experiments on seven commonly used datasets show that TIDA can significantly improve the performance and achieve a new state of the art. The source codes are publicly available at https://github.com/rain305f/TIDA.

IROS Conference 2023 Conference Paper

FISS+: Efficient and Focused Trajectory Generation and Refinement Using Fast Iterative Search and Sampling Strategy

  • Shuo Sun
  • Jie Chen
  • Jiawei Sun 0006
  • Chengran Yuan
  • Yuanchen Li
  • Tangyike Zhang
  • Marcelo H. Ang

Trajectory planning plays a crucial role in autonomous driving systems, as it is tasked to generate feasible trajectories under highly dynamic scenarios within the time constraint. This paper proposes a novel two-stage coarse-to-fine framework for efficient sampling-based trajectory planning. The proposed method is designed to iteratively generate new trajectory samples focused on the low-cost regions in the sampling space. Two trajectory exploration algorithms are well-designed for efficient search in discretized coarse global space and continuous fine local space, respectively. Experimental results on the first-of-its-kind planning benchmark tool CommonRoad show that our method significantly outperforms the baseline methods both in optimality and computational efficiency. Overall, our approach offers a promising solution for efficient and effective trajectory planning in autonomous vehicle applications.

ICML Conference 2023 Conference Paper

GC-Flow: A Graph-Based Flow Network for Effective Clustering

  • Tianchun Wang
  • Farzaneh Mirzazadeh
  • Xiang Zhang 0001
  • Jie Chen

Graph convolutional networks (GCNs) are discriminative models that directly model the class posterior $p(y|\mathbf{x})$ for semi-supervised classification of graph data. While being effective, as a representation learning approach, the node representations extracted from a GCN often miss useful information for effective clustering, because the objectives are different. In this work, we design normalizing flows that replace GCN layers, leading to a generative model that models both the class conditional likelihood $p(\mathbf{x}|y)$ and the class prior $p(y)$. The resulting neural network, GC-Flow, retains the graph convolution operations while being equipped with a Gaussian mixture representation space. It enjoys two benefits: it not only maintains the predictive power of GCN, but also produces well-separated clusters, due to the structuring of the representation space. We demonstrate these benefits on a variety of benchmark data sets. Moreover, we show that additional parameterization, such as that on the adjacency matrix used for graph convolutions, yields additional improvement in clustering.

AAAI Conference 2023 Conference Paper

Learnable Blur Kernel for Single-Image Defocus Deblurring in the Wild

  • Jucai Zhai
  • Pengcheng Zeng
  • Chihao Ma
  • Jie Chen
  • Yong Zhao

Recent research showed that the dual-pixel sensor has made great progress in defocus map estimation and image defocus deblurring. However, extracting real-time dual-pixel views is troublesome and complex in algorithm deployment. Moreover, the deblurred image generated by the defocus deblurring network lacks high-frequency details, which is unsatisfactory in human perception. To overcome these issues, we propose a novel defocus deblurring method that uses the guidance of the defocus map to implement image deblurring. The proposed method consists of a learnable blur kernel that estimates the defocus map in an unsupervised manner and, for the first time, a single-image defocus deblurring generative adversarial network (DefocusGAN). The proposed network can learn the deblurring of different regions and recover realistic details. We propose a defocus adversarial loss to guide this training process. Competitive experimental results confirm that with a learnable blur kernel, the generated defocus map can achieve results comparable to supervised methods. In the single-image defocus deblurring task, the proposed method achieves state-of-the-art results, especially significant improvements in perceptual quality, where PSNR reaches 25.56 dB and LPIPS reaches 0.111.

AAAI Conference 2023 Conference Paper

Proximal Stochastic Recursive Momentum Methods for Nonconvex Composite Decentralized Optimization

  • Gabriel Mancino-Ball
  • Shengnan Miao
  • Yangyang Xu
  • Jie Chen

Consider a network of N decentralized computing agents collaboratively solving a nonconvex stochastic composite problem. In this work, we propose a single-loop algorithm, called DEEPSTORM, that achieves optimal sample complexity for this setting. Unlike double-loop algorithms that require a large batch size to compute the (stochastic) gradient once in a while, DEEPSTORM uses a small batch size, creating advantages in settings such as streaming data and online learning. This is the first method achieving optimal sample complexity for decentralized nonconvex stochastic composite problems, requiring O(1) batch size. We conduct convergence analysis for DEEPSTORM with both constant and diminishing step sizes. Additionally, under proper initialization and a small enough desired solution error, we show that DEEPSTORM with a constant step size achieves a network-independent sample complexity, with an additional linear speed-up with respect to N over centralized methods. All codes are made available at https://github.com/gmancino/DEEPSTORM.
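
The recursive momentum estimator at the heart of STORM-type methods such as DEEPSTORM evaluates the stochastic gradient at both the current and previous iterates on the same sample, which is what keeps the variance under control with an O(1) batch size. A scalar sketch (the function names and the toy gradient oracle are assumptions, not the paper's code):

```python
def storm_estimator(grad, x_new, x_old, d_old, beta, xi):
    """STORM-style recursive momentum gradient estimator (sketch):

        d_t = g(x_t; xi_t) + (1 - beta) * (d_{t-1} - g(x_{t-1}; xi_t))

    grad(x, xi) is a stochastic gradient oracle; note the SAME sample xi
    is used at both x_new and x_old, giving the variance reduction.
    Setting beta = 1 recovers the plain stochastic gradient.
    """
    return grad(x_new, xi) + (1.0 - beta) * (d_old - grad(x_old, xi))
```

In the decentralized setting this estimator is combined per-worker with the consensus mixing and proximal steps described in the abstract.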

IJCAI Conference 2023 Conference Paper

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

  • Peng Jin
  • Hao Li
  • Zesen Cheng
  • Jinfa Huang
  • Zhennan Wang
  • Li Yuan
  • Chang Liu
  • Jie Chen

Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with natural language descriptions. Current methods either fail to leverage the local details or are computationally expensive. What's worse, they fail to leverage the heterogeneous concepts in data. In this paper, we propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings. For disentangled conceptualization, we divide the coarse feature into multiple latent factors related to semantic concepts. For set-to-set alignment, where a set of visual concepts correspond to a set of textual concepts, we propose an adaptive pooling method to aggregate semantic concepts to address the partial matching. In particular, since we encode concepts independently in only a few dimensions, DiCoSA is superior at efficiency and granularity, ensuring fine-grained interactions using a similar computational complexity as coarse-grained alignment. Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms the existing state-of-the-art methods.

IJCAI Conference 2023 Conference Paper

TG-VQA: Ternary Game of Video Question Answering

  • Hao Li
  • Peng Jin
  • Zesen Cheng
  • Songyang Zhang
  • Kai Chen
  • Zhennan Wang
  • Chang Liu
  • Jie Chen

Video question answering aims at answering a question about the video content by reasoning the alignment semantics within them. However, since they rely heavily on human instructions, i.e., annotations or priors, current contrastive learning-based VideoQA methods still struggle to perform fine-grained visual-linguistic alignment. In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies, e.g., video, question, and answer as ternary players, to achieve fine-grained alignment for the VideoQA task. Specifically, we carefully design a VideoQA-specific interaction strategy tailored to the characteristics of VideoQA, which can mathematically generate the fine-grained visual-linguistic alignment label without label-intensive efforts. Our TG-VQA outperforms the existing state-of-the-art by a large margin (more than 5%) on long-term and short-term VideoQA datasets, verifying its effectiveness and generalization ability. Thanks to the guidance of game-theoretic interaction, our model converges well on limited data ($10^4$ videos), surpassing most of those pre-trained on large-scale data ($10^7$ videos).

IJCAI Conference 2023 Conference Paper

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation

  • Zesen Cheng
  • Peng Jin
  • Hao Li
  • Kehan Li
  • Siheng Li
  • Xiangyang Ji
  • Chang Liu
  • Jie Chen

Top-down and bottom-up methods are the two mainstream approaches to referring segmentation, and each has its own intrinsic weaknesses. Top-down methods are chiefly disturbed by Polar Negative (PN) errors owing to the lack of fine-grained cross-modal alignment. Bottom-up methods are mainly perturbed by Inferior Positive (IP) errors due to the lack of prior object information. Nevertheless, we discover that the two types of methods are highly complementary for restraining each other's weaknesses, but that directly averaging their outputs leads to harmful interference. In this context, we build Win-win Cooperation (WiCo) to exploit the complementary nature of the two types of methods on both the interaction and integration aspects, achieving a win-win improvement. For the interaction aspect, Complementary Feature Interaction (CFI) introduces prior object information to the bottom-up branch and provides fine-grained information to the top-down branch for complementary feature enhancement. For the integration aspect, Gaussian Scoring Integration (GSI) models the Gaussian performance distributions of the two branches and integrates their results with weights given by confidence scores sampled from the distributions. With WiCo, several prominent bottom-up and top-down combinations achieve remarkable improvements on three common datasets at reasonable extra cost, which justifies the effectiveness and generality of our method.

NeurIPS Conference 2022 Conference Paper

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

  • Peng Jin
  • Jinfa Huang
  • Fenglin Liu
  • Xian Wu
  • Shen Ge
  • Guoli Song
  • David Clifton
  • Jie Chen

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, in which the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets prove that EMCL can learn more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing method.
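The basis-finding idea can be illustrated with a toy EM-style alternation: soft-assign features to a small set of bases (E-step), then update each basis as a responsibility-weighted mean (M-step), so features are reconstructed in the low-rank span of the bases. This is a minimal sketch under assumed unit-norm features and a hypothetical temperature; it is not the authors' EMCL code.

```python
import numpy as np

def em_bases(X, k, iters=20, beta=5.0):
    """Toy EM: find k bases so that features are approximated by
    responsibility-weighted combinations of the bases."""
    rng = np.random.default_rng(0)
    bases = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # E-step: soft-assign each feature to the bases by similarity
        sim = X @ bases.T                        # (n, k)
        resp = np.exp(beta * sim)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each basis becomes the responsibility-weighted mean
        bases = (resp.T @ X) / resp.sum(axis=0)[:, None]
        bases /= np.linalg.norm(bases, axis=1, keepdims=True)
    return bases, resp

X = np.random.default_rng(1).normal(size=(100, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)
bases, resp = em_bases(X, k=4)
recon = resp @ bases   # features re-expressed in the rank-4 span of the bases
```

Replacing each feature by its reconstruction in the span of a few shared bases is what reduces the rank of the latent space in the abstract's sense.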

ICLR Conference 2022 Conference Paper

Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time Series

  • Enyan Dai
  • Jie Chen

Anomaly detection is a widely studied task for a broad variety of data types; among them, multiple time series appear frequently in applications, including, for example, power grids and traffic networks. Detecting anomalies for multiple time series, however, is a challenging subject, owing to the intricate interdependencies among the constituent series. We hypothesize that anomalies occur in low-density regions of a distribution and explore the use of normalizing flows for unsupervised anomaly detection because of their superior quality in density estimation. Moreover, we propose a novel flow model by imposing a Bayesian network among the constituent series. A Bayesian network is a directed acyclic graph (DAG) that models causal relationships; it factorizes the joint probability of the series into a product of easy-to-evaluate conditional probabilities. We call this graph-augmented normalizing flow approach GANF and propose joint estimation of the DAG together with the flow parameters. We conduct extensive experiments on real-world datasets and demonstrate the effectiveness of GANF for density estimation, anomaly detection, and identification of time series distribution drift.
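The DAG factorization at the heart of this approach can be illustrated with a toy example: the joint log-density decomposes into per-series conditional terms, and anomalous values land in low-density regions. Gaussian conditionals and the `cond_mean` helper are assumptions for illustration only, not the paper's normalizing-flow model.

```python
import math

def dag_log_density(x, parents, cond_mean, sigma=1.0):
    """Factorize log p(x) over a DAG: sum of log p(x_i | x_parents(i)).
    cond_mean(i, parent_values) returns the conditional mean of series i."""
    logp = 0.0
    for i, xi in enumerate(x):
        mu = cond_mean(i, [x[j] for j in parents[i]])
        logp += -0.5 * math.log(2 * math.pi * sigma ** 2) \
                - (xi - mu) ** 2 / (2 * sigma ** 2)
    return logp

# Chain DAG 0 -> 1 -> 2: each series is expected to track its parent
parents = {0: [], 1: [0], 2: [1]}
mean = lambda i, pa: pa[0] if pa else 0.0
normal = dag_log_density([0.1, 0.2, 0.1], parents, mean)
anomaly = dag_log_density([0.1, 5.0, 0.1], parents, mean)
# the anomalous reading gets a much lower joint log-density
```

In GANF the conditionals are modeled by flows and the DAG itself is estimated, but the additive structure of the log-density is the same.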

JBHI Journal 2022 Journal Article

MDAN: Mirror Difference Aware Network for Brain Stroke Lesion Segmentation

  • Qiqi Bao
  • Shiyu Mi
  • Bowen Gang
  • Wenming Yang
  • Jie Chen
  • Qingmin Liao

Brain stroke lesion segmentation is of great importance for stroke rehabilitation neuroimaging analysis. Due to the large variance of stroke lesion shapes and the similarities of tissue intensity distributions, it remains a challenging task. To help detect abnormalities, the anatomical symmetries of brain magnetic resonance (MR) images have been widely used as visual cues in clinical practice. However, most methods for brain image segmentation do not fully utilize structural symmetry information. This paper presents a novel mirror difference aware network (MDAN) for stroke lesion segmentation. The network uses an encoder-decoder architecture, aiming at holistically exploiting the symmetries of image features. Specifically, a differential feature augmentation (DFA) module is developed in the encoding path to highlight the semantically pathological asymmetries of features in abnormalities. In the DFA module, a Siamese contrastive supervised loss is designed to enhance discriminative features, and a mirror position-based difference augmentation (MDA) module is used to further magnify the discrepancy. Moreover, mirror feature fusion (MFF) modules are applied to efficiently fuse and transfer information from both the original input and the horizontally flipped features to the decoding path. Extensive experiments on the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset show that the proposed MDAN outperforms state-of-the-art methods.

JBHI Journal 2022 Journal Article

Stroke Risk Prediction With Hybrid Deep Transfer Learning Framework

  • Jie Chen
  • Yingru Chen
  • Jianqiang Li
  • Jia Wang
  • Zijie Lin
  • Asoke K. Nandi

Stroke has become a leading cause of death and long-term disability in the world, with no effective treatment. Deep-learning-based approaches have the potential to outperform existing stroke risk prediction models, but they rely on large, well-labeled data. Due to strict privacy protection policies in healthcare systems, stroke data is usually distributed among different hospitals in small pieces. In addition, the positive and negative instances of such data are extremely imbalanced. Transfer learning can address the small-data issue by exploiting knowledge from a correlated domain, especially when multiple sources of data are available. In this work, we propose a novel Hybrid Deep Transfer Learning-based Stroke Risk Prediction (HDTL-SRP) scheme to exploit the knowledge structure from multiple correlated sources (i.e., external stroke data and chronic-disease data such as hypertension and diabetes). The proposed framework has been extensively tested in synthetic and real-world scenarios, and it outperforms state-of-the-art stroke risk prediction models. It also shows the potential for real-world deployment among multiple hospitals aided by 5G/B5G infrastructures.

NeurIPS Conference 2021 Conference Paper

CentripetalText: An Efficient Text Instance Representation for Scene Text Detection

  • Tao Sheng
  • Jie Chen
  • Zhouhui Lian

Scene text detection remains a grand challenge due to the variation in text curvatures, orientations, and aspect ratios. One of the hardest problems in this task is how to represent text instances of arbitrary shapes. Although many methods have been proposed to model irregular texts in a flexible manner, most of them lose simplicity and robustness. Their complicated post-processing and regression under a Dirac delta distribution undermine the detection performance and the generalization ability. In this paper, we propose an efficient text instance representation named CentripetalText (CT), which decomposes text instances into the combination of text kernels and centripetal shifts. Specifically, we utilize the centripetal shifts to implement pixel aggregation, guiding the external text pixels to the internal text kernels. A relaxation operation is integrated into the dense regression for centripetal shifts, allowing correct prediction within a range instead of at a specific value. The convenient reconstruction of text contours and the tolerance of prediction errors in our method guarantee high detection accuracy and fast inference speed, respectively. Besides, we shrink our text detector into a proposal generation module, namely the CentripetalText Proposal Network (CPN), replacing the Segmentation Proposal Network (SPN) in Mask TextSpotter v3 and producing more accurate proposals. To validate the effectiveness of our method, we conduct experiments on several commonly used scene text benchmarks, including both curved and multi-oriented text datasets. For the task of scene text detection, our approach achieves superior or competitive performance compared to other existing methods, e.g., an F-measure of 86.3% at 40.0 FPS on Total-Text and an F-measure of 86.1% at 34.8 FPS on MSRA-TD500. For the task of end-to-end scene text recognition, our method outperforms Mask TextSpotter v3 by 1.1% in F-measure on Total-Text.

NeurIPS Conference 2021 Conference Paper

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

  • Ruchir Puri
  • David Kung
  • Geert Janssen
  • Wei Zhang
  • Giacomo Domeniconi
  • Vladimir Zolotov
  • Julian T Dolby
  • Jie Chen

Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and the code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled breakthroughs in computer vision, speech recognition, natural language processing and beyond, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of “AI for Code” has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset, CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs to machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity, and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.

UAI Conference 2021 Conference Paper

Dynamic visualization for L1 fusion convex clustering in near-linear time

  • Bingyuan Zhang
  • Jie Chen
  • Yoshikazu Terada

Convex clustering has drawn recent attention because of its competitive performance and its nice property of guaranteeing global optimality. However, convex clustering is infeasible for large-scale data sets due to its high computational cost. We propose a novel method to solve the L1 fusion convex clustering problem by dynamic programming. We develop the Convex clustering Path Algorithm In Near-linear Time (C-PAINT) to construct the solution path efficiently. The proposed C-PAINT yields the exact solution, whereas general solvers for convex problems applied to convex clustering depend on tuning parameters such as step size and threshold and usually take many iterations to converge. Apart from a sorting step that takes almost no time in practice, the main part of the algorithm takes only linear time. Thus, C-PAINT has superior scalability compared to other state-of-the-art algorithms. Moreover, C-PAINT enables path visualization of clustering solutions for large data. In particular, experiments show that our proposed method can solve convex clustering with 10^7 data points in two minutes. We demonstrate the proposed method using both synthetic data and real data. Our algorithms are implemented in the dpcc R package.

IJCAI Conference 2021 Conference Paper

Graph Universal Adversarial Attacks: A Few Bad Actors Ruin Graph Learning Models

  • Xiao Zang
  • Yi Xie
  • Jie Chen
  • Bo Yuan

Deep neural networks, while generalizing well, are known to be sensitive to small adversarial perturbations. This phenomenon poses a severe security threat and calls for in-depth investigation of the robustness of deep learning models. With the emergence of neural networks for graph-structured data, similar investigations are urged to understand their robustness. It has been found that adversarially perturbing the graph structure and/or node features may result in a significant degradation of the model performance. In this work, we show from a different angle that such fragility similarly occurs if the graph contains a few bad-actor nodes, which compromise a trained graph neural network through flipping the connections to any targeted victim. Worse, the bad actors found for one graph model severely compromise other models as well. We call the bad actors "anchor nodes" and propose an algorithm, named GUA, to identify them. Thorough empirical investigation suggests an interesting finding that the anchor nodes often belong to the same class; the results also corroborate the intuitive trade-off between the number of anchor nodes and the attack success rate. For the Cora dataset, which contains 2708 nodes, as few as six anchor nodes result in an attack success rate higher than 80% for GCN and three other models.

IJCAI Conference 2021 Conference Paper

RR-Net: Injecting Interactive Semantics in Human-Object Interaction Detection

  • Dongming Yang
  • Yuexian Zou
  • Can Zhang
  • Meng Cao
  • Jie Chen

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects. The latest end-to-end HOI detectors lack relation reasoning and are thus unable to learn HOI-specific interactive semantics for prediction. In this paper, we therefore propose novel relation reasoning for HOI detection. We first present a progressive Relation-aware Frame, which brings a new structure and parameter-sharing pattern to interaction inference. Upon this frame, an Interaction Intensifier Module and a Correlation Parsing Module are carefully designed, where (a) interactive semantics from humans can be exploited and passed to objects to intensify interactions, and (b) interactive correlations among humans, objects, and interactions are integrated to promote prediction. Based on these modules, we construct an end-to-end trainable framework named Relation Reasoning Network (RR-Net). Extensive experiments show that RR-Net sets a new state of the art on both the V-COCO and HICO-DET benchmarks, improving over the baseline by about 5.5% and 9.8% relatively, validating that this first effort to explore relation reasoning and integrate interactive semantics brings clear improvement to end-to-end HOI detection.

AAAI Conference 2021 Conference Paper

Unsupervised Learning of Graph Hierarchical Abstractions with Differentiable Coarsening and Optimal Transport

  • Tengfei Ma
  • Jie Chen

Hierarchical abstractions are a methodology for solving large-scale graph problems in various disciplines. Coarsening is one such approach: it generates a pyramid of graphs whereby the graph at the next level is a structural summary of the prior one. With a long history in scientific computing, many coarsening strategies were developed based on mathematically driven heuristics. Recently, there has been resurgent interest in deep learning to design hierarchical methods learnable through differentiable parameterization. These approaches are paired with downstream tasks for supervised learning. In practice, however, supervised signals (e.g., labels) are scarce and often laborious to obtain. In this work, we propose an unsupervised approach, coined OTCOARSENING, that makes use of optimal transport. Both the coarsening matrix and the transport cost matrix are parameterized, so that an optimal coarsening strategy can be learned and tailored to a given set of graphs without the use of labels. We demonstrate that the proposed approach produces meaningful coarse graphs and yields competitive performance compared with supervised methods for graph classification and regression.

JBHI Journal 2020 Journal Article

Automatic Medical Code Assignment via Deep Learning Approach for Intelligent Healthcare

  • Fei Teng
  • Zheng Ma
  • Jie Chen
  • Ming Xiao
  • Lufei Huang

With the development of Healthcare 4.0, there has been an explosion in the amount of data such as images, medical text, physiological signals, lab tests, etc. Among them, medical records provide a complete picture of the associated clinical events. However, processing medical texts is difficult because they are structurally free, diverse in style, and contain subjective factors. Assigning metadata codes from the International Classification of Diseases (ICD) presents a standardized way of indicating diagnoses and procedures, so it has become a mandatory process for understanding medical records to make better clinical and financial decisions. Such manual encoding is time-consuming, error-prone, and expensive. In this paper, we propose a deep learning approach and a medical topic mining method to automatically predict ICD codes from free-text medical records. The F1 score on the Medical Information Mart for Intensive Care (MIMIC-III) dataset increases by 5% over the state of the art. The approach is also suitable for multiple ICD versions and languages. For a specific disease, atrial fibrillation, the F1 score reaches 96% and 93.3% on in-house ICD-10 datasets and MIMIC-III datasets, respectively. We developed an artificial intelligence based coding system, which can greatly improve the efficiency and accuracy of human coders and accelerate the secondary use of medical records for clinical informatics.

AAAI Conference 2020 Conference Paper

CAG: A Real-Time Low-Cost Enhanced-Robustness High-Transferability Content-Aware Adversarial Attack Generator

  • Huy Phan
  • Yi Xie
  • Siyu Liao
  • Jie Chen
  • Bo Yuan

Deep neural networks (DNNs) are vulnerable to adversarial attack despite their tremendous success in many artificial intelligence fields. Adversarial attack is a method that causes the intended misclassification by adding imperceptible perturbations to legitimate inputs. To date, researchers have developed numerous types of adversarial attack methods. However, from the perspective of practical deployment, these methods suffer from several drawbacks such as long attack-generation time, high memory cost, insufficient robustness, and low transferability. To address these drawbacks, we propose a Content-aware Adversarial Attack Generator (CAG) to achieve real-time, low-cost, enhanced-robustness, and high-transferability adversarial attack. First, as a type of generative model-based attack, CAG shows significant speedup (at least 500 times) in generating adversarial examples compared to state-of-the-art attacks such as PGD and C&W. Furthermore, CAG needs only a single generative model to perform a targeted attack on any target class. Because CAG encodes the label information into a trainable embedding layer, it differs from prior generative model-based adversarial attacks that use n different copies of generative models for n different target classes. As a result, CAG significantly reduces the memory cost required for generating adversarial examples. Moreover, CAG can generate adversarial perturbations that focus on the critical areas of the input by integrating class activation map information into the training process, and hence improves the robustness of the CAG attack against state-of-the-art adversarial defenses. In addition, CAG exhibits high transferability across different DNN classifier models in the black-box attack scenario by introducing random dropout in the process of generating perturbations. Extensive experiments on different datasets and DNN models have verified the real-time, low-cost, enhanced-robustness, and high-transferability benefits of CAG.

AAAI Conference 2020 Conference Paper

Embedding Compression with Isotropic Iterative Quantization

  • Siyu Liao
  • Jie Chen
  • Yanzhi Wang
  • Qinru Qiu
  • Bo Yuan

Continuous representation of words is a standard component in deep learning-based NLP models. However, representing a large vocabulary requires significant memory, which can cause problems, particularly on resource-constrained platforms. Therefore, in this paper we propose an isotropic iterative quantization (IIQ) approach for compressing embedding vectors into binary ones, leveraging the iterative quantization technique well established for image retrieval while satisfying the desired isotropic property of PMI-based models. Experiments with pre-trained embeddings (i.e., GloVe and HDC) demonstrate a more than thirty-fold compression ratio with comparable, and sometimes even improved, performance over the original real-valued embedding vectors.
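The iterative-quantization step that IIQ builds on (originally from image retrieval) alternates between binarizing the rotated data and solving an orthogonal Procrustes problem for the rotation. The sketch below shows that classic ITQ alternation in miniature, assuming centered embeddings; it is not the paper's full isotropy-aware method.

```python
import numpy as np

def itq_rotation(V, iters=30, seed=0):
    """Toy iterative quantization: learn an orthogonal rotation R so that
    sign(V @ R) is a good binary code for the (centered) embeddings V."""
    d = V.shape[1]
    rng = np.random.default_rng(seed)
    # start from a random orthogonal rotation
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))
    for _ in range(iters):
        B = np.sign(V @ R)                  # current binary codes
        # orthogonal Procrustes: best rotation aligning V to B
        U, _, Vt = np.linalg.svd(B.T @ V)
        R = (U @ Vt).T
    return R

rng = np.random.default_rng(1)
V = rng.normal(size=(200, 8))
V -= V.mean(axis=0)                          # center the embeddings
R = itq_rotation(V)
codes = np.sign(V @ R)                       # 1 bit per dimension
```

Each real-valued dimension collapses to a single sign bit, which is where the order-of-magnitude compression in the abstract comes from.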

AAAI Conference 2020 Conference Paper

EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

  • Aldo Pareja
  • Giacomo Domeniconi
  • Jie Chen
  • Tengfei Ma
  • Toyotaro Suzumura
  • Hiroki Kanezashi
  • Tim Kaler
  • Tao Schardl

Graph representation learning resurges as a trending research subject owing to the widespread use of deep learning for Euclidean data, which inspires various creative designs of neural networks in the non-Euclidean domain, particularly for graphs. With the success of these graph neural networks (GNNs) in the static setting, we approach further practical scenarios where the graph dynamically evolves. Existing approaches typically resort to node embeddings and use a recurrent neural network (RNN, broadly speaking) to regulate the embeddings and learn the temporal dynamics. These methods require knowledge of a node over the full time span (including both training and testing) and are less applicable to frequent changes of the node set. In some extreme scenarios, the node sets at different time steps may differ completely. To resolve this challenge, we propose EvolveGCN, which adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings. The proposed approach captures the dynamism of the graph sequence by using an RNN to evolve the GCN parameters. Two architectures are considered for the parameter evolution. We evaluate the proposed approach on tasks including link prediction, edge classification, and node classification. The experimental results indicate generally higher performance of EvolveGCN compared with related approaches. The code is available at https://github.com/IBM/EvolveGCN.
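The core idea, evolving the GCN weights with a recurrent update instead of tracking node embeddings over time, can be sketched as follows. The plain tanh recurrence here is a stand-in for the GRU used in the paper, and the random snapshots are illustrative; node sets can change freely because no per-node state is carried across steps.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step with symmetric degree normalization."""
    deg = A.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    return np.tanh(D @ A @ D @ X @ W)

def evolve_weights(W, U):
    """Toy stand-in for the RNN that evolves the GCN weights over time
    (EvolveGCN uses a GRU; a plain tanh recurrence is shown here)."""
    return np.tanh(W @ U)

rng = np.random.default_rng(0)
n, f = 5, 4
W = rng.normal(size=(f, f)) * 0.5            # GCN weights = RNN state
U = rng.normal(size=(f, f)) * 0.5            # recurrence parameters
outs = []
for t in range(3):                           # a short sequence of snapshots
    A = (rng.random((n, n)) < 0.4).astype(float)
    A = np.maximum(A, A.T)
    np.fill_diagonal(A, 1.0)                 # symmetric, with self-loops
    X = rng.normal(size=(n, f))
    W = evolve_weights(W, U)                 # weights evolve, not embeddings
    outs.append(gcn_layer(A, X, W))
```

Because only the weight matrix is recurrent, a node appearing for the first time at step t is handled exactly like any other node at that step.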

NeurIPS Conference 2020 Conference Paper

Online Convex Optimization Over Erdos-Renyi Random Networks

  • Jinlong Lei
  • Peng Yi
  • Yiguang Hong
  • Jie Chen
  • Guodong Shi

This work studies how node-to-node communications over an Erdős-Rényi random network influence distributed online convex optimization, which is vital in solving large-scale machine learning problems in antagonistic or changing environments. At each step, each node (computing unit) makes a local decision, experiences a loss evaluated with a convex function, and communicates the decision with other nodes over a network. The node-to-node communications are described by the Erdős-Rényi rule, where each link independently takes place with probability $p$ over a prescribed connected graph. The objective is to minimize the system-wide loss accumulated over a finite time horizon. We consider standard distributed gradient descent with full gradients, one-point bandit feedback, and two-point bandit feedback for convex and strongly convex losses, respectively. We establish how the regret bounds scale with respect to the time horizon $T$, the network size $N$, the decision dimension $d$, and an algebraic network connectivity. The regret bound scalings with respect to $T$ match those obtained by state-of-the-art algorithms and fundamental limits in the corresponding centralized online optimization problems, e.g., $\mathcal{O}(\sqrt{T})$ and $\mathcal{O}(\ln(T))$ regrets are established for convex and strongly convex losses with full gradient feedback and two-point information, respectively. For classical Erdős-Rényi networks over all-to-all possible node communications, the regret scalings with respect to the probability $p$ are analytically established, based on which the tradeoff between communication overhead and computation accuracy is clearly demonstrated. Numerical studies have validated the theoretical findings.
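The protocol studied here, communicate over a freshly drawn random graph and then take a local gradient step, can be sketched for simple quadratic losses. The mixing-weight construction and step size below are illustrative choices, not the paper's exact algorithm or feedback models.

```python
import numpy as np

def er_mixing(n, p, rng):
    """One Erdos-Renyi communication round: build a doubly stochastic
    mixing matrix over the links that are active at this step."""
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                    # symmetric link activations
    W = A / n                      # small uniform weight per active link
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

rng = np.random.default_rng(0)
n, d, T, eta = 8, 3, 200, 0.1
x = rng.normal(size=(n, d))        # each node's current decision
targets = rng.normal(size=(n, d))  # per-node quadratic loss minimizers
for t in range(T):
    W = er_mixing(n, 0.5, rng)     # random network at this step
    grad = x - targets             # gradient of 0.5 * ||x_i - theta_i||^2
    x = W @ x - eta * grad         # mix with neighbors, then descend
# decisions approach the minimizer of the system-wide (summed) loss,
# i.e., the average of the per-node targets
```

Because each mixing matrix is doubly stochastic, the network-wide average of the decisions follows an ordinary gradient recursion toward the global minimizer, while the random links drive consensus.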

AAAI Conference 2020 Conference Paper

Online Planner Selection with Graph Neural Networks and Adaptive Scheduling

  • Tengfei Ma
  • Patrick Ferber
  • Siyu Huo
  • Jie Chen
  • Michael Katz

Automated planning is one of the foundational areas of AI. Since no single planner can work well for all tasks and domains, portfolio-based techniques have become increasingly popular in recent years. In particular, deep learning emerges as a promising methodology for online planner selection. Owing to the recent development of structural graph representations of planning tasks, we propose a graph neural network (GNN) approach to selecting candidate planners. GNNs are advantageous over a straightforward alternative, convolutional neural networks, in that they are invariant to node permutations and incorporate node labels for better inference. Additionally, for cost-optimal planning, we propose a two-stage adaptive scheduling method to further improve the likelihood that a given task is solved in time. The scheduler may switch at halftime to a different planner, conditioned on the observed performance of the first one. Experimental results validate the effectiveness of the proposed method against strong baselines, both deep learning and non-deep-learning based. The code is available at https://github.com/matenure/GNN_planner.

AAAI Conference 2020 Conference Paper

Scalable Variational Bayesian Kernel Selection for Sparse Gaussian Process Regression

  • Tong Teng
  • Jie Chen
  • Yehong Zhang
  • Bryan Kian Hsiang Low

This paper presents a variational Bayesian kernel selection (VBKS) algorithm for sparse Gaussian process regression (SGPR) models. In contrast to existing GP kernel selection algorithms that aim to select only one kernel with the highest model evidence, our VBKS algorithm considers the kernel as a random variable and learns its belief from data such that the uncertainty of the kernel can be interpreted and exploited to avoid overconfident GP predictions. To achieve this, we represent the probabilistic kernel as an additional variational variable in a variational inference (VI) framework for SGPR models, where its posterior belief is learned together with that of the other variational variables (i.e., inducing variables and kernel hyperparameters). In particular, we transform the discrete kernel belief into a continuous parametric distribution via reparameterization in order to apply VI. Though it is computationally challenging to jointly optimize a large number of hyperparameters due to many kernels being evaluated simultaneously by our VBKS algorithm, we show that the variational lower bound of the log-marginal likelihood can be decomposed into an additive form such that each additive term depends only on a disjoint subset of the variational variables and can thus be optimized independently. Stochastic optimization is then used to maximize the variational lower bound by iteratively improving the variational approximation of the exact posterior belief via stochastic gradient ascent, which incurs constant time per iteration and hence scales to big data. We empirically evaluate the performance of our VBKS algorithm on synthetic and massive real-world datasets.

AAAI Conference 2019 Conference Paper

A Sequential Set Generation Method for Predicting Set-Valued Outputs

  • Tian Gao
  • Jie Chen
  • Vijil Chenthamarakshan
  • Michael Witbrock

Consider a general machine learning setting where the output is a set of labels or sequences. This output set is unordered and its size varies with the input. Whereas multi-label classification methods seem a natural first resort, they are not readily applicable to set-valued outputs because of the growth rate of the output space; and because conventional sequence generation doesn’t reflect sets’ order-free nature. In this paper, we propose a unified framework—sequential set generation (SSG)—that can handle output sets of labels and sequences. SSG is a meta-algorithm that leverages any probabilistic learning method for label or sequence prediction, but employs a proper regularization such that a new label or sequence is generated repeatedly until the full set is produced. Though SSG is sequential in nature, it does not penalize the ordering of the appearance of the set elements and can be applied to a variety of set output problems, such as a set of classification labels or sequences. We perform experiments with both benchmark and synthetic data sets and demonstrate SSG’s strong performance over baseline methods.

NeurIPS Conference 2019 Conference Paper

Adaptively Aligned Image Captioning via Adaptive Attention Time

  • Lun Huang
  • Wenmin Wang
  • Yaxian Xia
  • Jie Chen

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming a one-to-one mapping from source image regions to target caption words, which is never possible. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words, while a caption word can also attend to an arbitrary number of image regions. AAT is deterministic and differentiable, and does not introduce any noise to the parameter gradients. In this paper, we empirically show that AAT improves over state-of-the-art methods on the task of image captioning. Code is available at https://github.com/husthuaan/AAT.

AAAI Conference 2018 Conference Paper

A Cascaded Inception of Inception Network With Attention Modulated Feature Fusion for Human Pose Estimation

  • Wentao Liu
  • Jie Chen
  • Cheng Li
  • Chen Qian
  • Xiao Chu
  • Xiaolin Hu

Accurate keypoint localization of human pose needs diversified features: the high level for contextual dependencies and the low level for detailed refinement of joints. However, the relative importance of the two varies from case to case, and how to use these features efficiently remains an open problem. Existing methods have limitations in preserving low level features, adaptively adjusting the importance of different levels of features, and modeling the human perception process. This paper presents three novel techniques step by step to efficiently utilize different levels of features for human pose estimation. Firstly, an inception of inception (IOI) block is designed to emphasize the low level features. Secondly, an attention mechanism is proposed to adjust the importance of individual levels according to the context. Thirdly, a cascaded network is proposed to sequentially localize the joints to enforce message passing from joints of stand-alone parts like head and torso to remote joints like wrist or ankle. Experimental results demonstrate that the proposed method achieves the state-of-the-art performance on both MPII and LSP benchmarks.
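Level-wise attention of this kind can be sketched as a softmax-weighted fusion of feature maps; this is a generic illustration, with the scores supplied directly rather than computed from image context inside IOI blocks as in the paper:

```python
import numpy as np

def attention_fuse(feature_maps, scores):
    """Fuse feature maps from different levels with softmax attention
    weights, one scalar weight per level. Generic sketch only.
    """
    w = np.exp(scores - np.max(scores))  # stable softmax
    w = w / w.sum()
    fused = sum(wi * f for wi, f in zip(w, feature_maps))
    return fused, w

low = np.ones((4, 4))          # detailed low-level map
high = np.full((4, 4), 3.0)    # contextual high-level map
fused, w = attention_fuse([low, high], np.array([0.0, np.log(3.0)]))
print(w)           # [0.25 0.75]
print(fused[0, 0])  # 0.25*1 + 0.75*3 = 2.5
```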

NeurIPS Conference 2018 Conference Paper

Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders

  • Tengfei Ma
  • Jie Chen
  • Cao Xiao

Deep generative models have achieved remarkable success in various data domains, including images, time series, and natural languages. There remain, however, substantial challenges for combinatorial structures, including graphs. One of the key challenges lies in the difficulty of ensuring semantic validity in context. For example, in molecular graphs, the number of bonding-electron pairs must not exceed the valence of an atom; whereas in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. Such constraints are not easy to incorporate into a generative model. In this work, we propose a regularization framework for variational autoencoders as a step toward semantic validity. We focus on the matrix representation of graphs and formulate penalty terms that regularize the output distribution of the decoder to encourage the satisfaction of validity constraints. Experimental results confirm a much higher likelihood of sampling valid graphs in our approach, compared with others reported in the literature.
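As a concrete instance of such a penalty, the valence constraint on molecular graphs can be relaxed into a differentiable term on the decoder's output distribution. The tensor shapes and the hinge form below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def valence_penalty(edge_probs, bond_orders, valence):
    """Penalty encouraging generated molecular graphs to respect valence.

    edge_probs: (n, n, b) decoder output, probability of bond type b
    between atoms i and j; bond_orders: (b,) electron pairs per type;
    valence: (n,) bonding capacity per atom. The expected weighted
    degree of each atom should not exceed its valence, so we penalize
    the positive part of (degree - valence).
    """
    expected_degree = (edge_probs * bond_orders).sum(axis=(1, 2))
    return np.maximum(expected_degree - valence, 0.0).sum()

# Two atoms, single/double bond types; atom 0 has valence 1 but an
# expected weighted degree of 2, so the penalty is positive.
probs = np.zeros((2, 2, 2))
probs[0, 1, 1] = probs[1, 0, 1] = 1.0   # certain double bond
penalty = valence_penalty(probs, np.array([1.0, 2.0]), np.array([1.0, 4.0]))
print(penalty)  # 1.0
```

Because the penalty is computed on probabilities rather than discrete samples, it stays differentiable and can be added to the VAE objective.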

JMLR Journal 2017 Journal Article

Hierarchically Compositional Kernels for Scalable Nonparametric Learning

  • Jie Chen
  • Haim Avron
  • Vikas Sindhwani

We propose a novel class of kernels to alleviate the high computational cost of large-scale nonparametric learning with kernel methods. The proposed kernel is defined based on a hierarchical partitioning of the underlying data domain, where the Nyström method (a globally low-rank approximation) is married with a locally lossless approximation in a hierarchical fashion. The kernel maintains (strict) positive-definiteness. The corresponding kernel matrix admits a recursively off-diagonal low-rank structure, which allows for fast linear algebra computations. Suppressing the factor of data dimension, the memory and arithmetic complexities for training a regression model or a classifier are reduced from $O(n^2)$ and $O(n^3)$ to $O(nr)$ and $O(nr^2)$, respectively, where $n$ is the number of training examples and $r$ is the rank on each level of the hierarchy. Although other randomized approximate kernels entail a similar complexity, empirical results show that the proposed kernel achieves a matching performance with a smaller $r$. We present comprehensive experiments demonstrating the effective use of the proposed kernel on data sizes up to the order of millions.
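The global low-rank ingredient, the Nyström method, can be sketched as follows. The RBF kernel, the landmark choice, and the jitter term are assumptions made for illustration, and the paper's recursive lossless local corrections are omitted:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_factor(X, landmarks, gamma=0.1, jitter=1e-6):
    """Rank-r Nyström factor G with K ≈ G @ G.T.

    Only the global low-rank ingredient named in the abstract; the
    paper's kernel additionally applies lossless local blocks
    recursively over a hierarchical partition of the data domain.
    """
    C = rbf_kernel(X, landmarks, gamma)          # n x r cross-kernel
    W = rbf_kernel(landmarks, landmarks, gamma)  # r x r landmark kernel
    # Symmetric inverse square root of W (jitter guards tiny eigenvalues).
    vals, vecs = np.linalg.eigh(W)
    inv_root = vecs @ np.diag(1.0 / np.sqrt(vals + jitter)) @ vecs.T
    return C @ inv_root                          # n x r factor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
G = nystrom_factor(X, X[:20])                    # 20 landmark points
K = rbf_kernel(X, X)
err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)
print(f"rank-20 relative error: {err:.3f}")
```

Products with the $n \times r$ factor $G$ cost $O(nr)$ memory and $O(nr^2)$ arithmetic, matching the complexities quoted in the abstract for the low-rank part.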

NeurIPS Conference 2017 Conference Paper

Solving Most Systems of Random Quadratic Equations

  • Gang Wang
  • Georgios Giannakis
  • Yousef Saad
  • Jie Chen

This paper deals with finding an $n$-dimensional solution $\bm{x}$ to a system of quadratic equations $y_i=|\langle\bm{a}_i, \bm{x}\rangle|^2$, $1\le i \le m$, which in general is known to be NP-hard. We put forth a novel procedure that starts with a \emph{weighted maximal correlation initialization} obtainable with a few power iterations, followed by successive refinements based on \emph{iteratively reweighted gradient-type iterations}. The novel techniques distinguish themselves from prior works by the inclusion of a fresh (re)weighting regularization. For certain random measurement models, the proposed procedure returns the true solution $\bm{x}$ with high probability in time proportional to reading the data $\{(\bm{a}_i; y_i)\}_{1\le i \le m}$, provided that the number $m$ of equations is some constant $c>0$ times the number $n$ of unknowns, that is, $m\ge cn$. Empirically, the upshots of this contribution are: i) perfect signal recovery in the high-dimensional regime given only an \emph{information-theoretic limit number} of equations; and, ii) (near-)optimal statistical accuracy in the presence of additive noise. Extensive numerical tests using both synthetic data and real images corroborate its improved signal recovery performance and computational efficiency relative to state-of-the-art approaches.
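A simplified version of the two-stage procedure (weighted initialization by power iterations, then reweighted gradient-type refinements of the amplitude loss) might look like this for real-valued Gaussian measurements; the specific weighting rule, truncation, and step size below are illustrative choices, not the paper's exact ones:

```python
import numpy as np

def solve_quadratic_system(A, y, iters=500, step=0.5, beta=5.0):
    """Recover a real x from y_i = |<a_i, x>|^2 via a weighted spectral
    initialization followed by gradient steps on the amplitude loss,
    with a reweighting that downweights equations whose current
    residual is large. Simplified sketch of the abstract's approach.
    """
    m, n = A.shape
    # Initialization: power iterations on a correlation matrix built
    # from the half of the equations with the largest y_i.
    idx = np.argsort(y)[-m // 2:]
    Y = (A[idx].T * y[idx]) @ A[idx] / m
    x = np.random.default_rng(1).normal(size=n)
    for _ in range(50):
        x = Y @ x
        x /= np.linalg.norm(x)
    x *= np.sqrt(y.mean())  # E[y_i] = ||x||^2 for Gaussian a_i
    # Iteratively reweighted gradient refinement.
    sqrt_y = np.sqrt(y)
    for _ in range(iters):
        z = A @ x
        r = np.abs(z) - sqrt_y
        w = 1.0 / (1.0 + beta * np.abs(r) / (np.abs(z) + 1e-12))
        x -= (step / m) * (A.T @ (w * r * np.sign(z)))
    return x

rng = np.random.default_rng(0)
n, m = 20, 160
x_true = rng.normal(size=n)
A = rng.normal(size=(m, n))
x_hat = solve_quadratic_system(A, (A @ x_true) ** 2)
err = min(np.linalg.norm(x_hat - x_true),
          np.linalg.norm(x_hat + x_true)) / np.linalg.norm(x_true)
print(f"relative recovery error: {err:.1e}")
```

The error is measured up to a global sign, since $\bm{x}$ and $-\bm{x}$ produce the same quadratic measurements.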

AAAI Conference 2015 Conference Paper

Parallel Gaussian Process Regression for Big Data: Low-Rank Representation Meets Markov Approximation

  • Kian Hsiang Low
  • Jiangbo Yu
  • Jie Chen
  • Patrick Jaillet

The expressive power of a Gaussian process (GP) model comes at a cost of poor scalability in the data size. To improve its scalability, this paper presents a low-rank-cum-Markov approximation (LMA) of the GP model that is novel in leveraging the dual computational advantages stemming from complementing a low-rank approximate representation of the full-rank GP, based on a support set of inputs, with a Markov approximation of the resulting residual process. The latter approximation is guaranteed to be closest in the Kullback-Leibler distance criterion subject to some constraint and is considerably more refined than that of existing sparse GP models utilizing low-rank representations, owing to its more relaxed conditional independence assumption (especially with larger data). As a result, our LMA method can trade off between the size of the support set and the order of the Markov property to (a) incur lower computational cost than such sparse GP models while achieving predictive performance comparable to them and (b) accurately represent features/patterns of any scale. Interestingly, varying the Markov order produces a spectrum of LMAs with PIC approximation and full-rank GP at the two extremes. An advantage of our LMA method is that it is amenable to parallelization on multiple machines/cores, thereby gaining greater scalability. Empirical evaluation on three real-world datasets in clusters of up to 32 computing nodes shows that our centralized and parallel LMA methods are significantly more time-efficient and scalable than state-of-the-art sparse and full-rank GP regression methods while achieving comparable predictive performance.

AAAI Conference 2014 Conference Paper

GP-Localize: Persistent Mobile Robot Localization Using Online Sparse Gaussian Process Observation Model

  • Nuo Xu
  • Kian Hsiang Low
  • Jie Chen
  • Keng Kiat Lim
  • Etkin Ozgul

Central to robot exploration and mapping is the task of persistent localization in environmental fields characterized by spatially correlated measurements. This paper presents a Gaussian process localization (GP-Localize) algorithm that, in contrast to existing works, can exploit the spatially correlated field measurements taken during a robot's exploration (instead of relying on prior training data) for efficiently and scalably learning the GP observation model online through our proposed novel online sparse GP. As a result, GP-Localize is capable of achieving constant time and memory (i.e., independent of the size of the data) per filtering step, which demonstrates the practical feasibility of using GPs for persistent robot localization and autonomy. Empirical evaluation via simulated experiments with real-world datasets and a real robot experiment shows that GP-Localize outperforms existing GP localization algorithms.

AAMAS Conference 2012 Conference Paper

Decentralized Active Robotic Exploration and Mapping for Probabilistic Field Classification in Environmental Sensing

  • Kian Hsiang Low
  • Jie Chen
  • John Dolan
  • Steve Chien
  • David Thompson

A central problem in environmental sensing and monitoring is to classify/label the hotspots in a large-scale environmental field. This paper presents a novel \emph{decentralized active robotic exploration} (DARE) strategy for probabilistic classification/labeling of hotspots in a \emph{Gaussian process} (GP)-based field. In contrast to existing state-of-the-art exploration strategies for learning environmental field maps, the time needed to solve the DARE strategy is independent of the map resolution and the number of robots, thus making it practical for in situ, real-time active sampling. Its exploration behavior exhibits an interesting formal trade-off between boundary tracking, until the hotspot region boundary can be accurately predicted, and wide-area coverage, to find new boundaries in sparsely sampled areas to be tracked. We provide a theoretical guarantee on the active exploration performance of the DARE strategy: under a reasonable conditional independence assumption, we prove that it can optimally achieve two formal cost-minimizing exploration objectives based on the misclassification and entropy criteria. Importantly, this result implies that the uncertainty of labeling the hotspots in a GP-based field is greatest at or close to the hotspot region boundaries. Empirical evaluation on real-world plankton density and temperature field data shows that, subject to limited observations, the DARE strategy achieves superior hotspot classification and time efficiency compared with state-of-the-art active exploration strategies.

IJCAI Conference 2011 Conference Paper

Learning Compact Visual Descriptor for Low Bit Rate Mobile Landmark Search

  • Rongrong Ji
  • Ling-Yu Duan
  • Jie Chen
  • Hongxun Yao
  • Tiejun Huang
  • Wen Gao

In this paper, we propose to extract a compact yet discriminative visual descriptor directly on the mobile device, which tackles the wireless query transmission latency in mobile landmark search. This descriptor is offline learnt from the location contexts of geo-tagged Web photos from both Flickr and Panoramio in two phases: first, we segment the landmark photo collections into discrete geographical regions using a Gaussian mixture model [Stauffer et al., 2000]; second, a ranking-sensitive vocabulary boosting is introduced to learn a compact codebook within each region. To tackle the locally optimal descriptor learning caused by imprecise geographical segmentation, we further iterate the above phases by feeding an entropy-based measure of descriptor compactness back into a prior distribution that constrains the Gaussian mixture modeling. Consequently, when entering a specific geographical region, the codebook in the mobile device is adapted downstream, which ensures efficient extraction of a compact descriptor, its low-bit-rate transmission, as well as promising discrimination ability. We deploy our descriptor on both HTC and iPhone mobile phones, testing landmark search in typical areas including Beijing, New York, and Barcelona, over one million images. Our learned descriptor outperforms alternative compact descriptors [Chen et al., 2009][Chen et al., 2010][Chandrasekhar et al., 2009a][Chandrasekhar et al., 2009b] by a large margin.

JMLR Journal 2009 Journal Article

Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection

  • Jie Chen
  • Haw-ren Fang
  • Yousef Saad

Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes $\Theta(dn^2)$ time for $n$ data points in the $d$-dimensional Euclidean space. We propose two divide-and-conquer methods for computing an approximate kNN graph in $\Theta(dn^t)$ time for high-dimensional data (large $d$). The exponent $t \in (1,2)$ is an increasing function of an internal parameter $\alpha$ which governs the size of the common region in the divide step. Experiments show that a high-quality graph can usually be obtained with small overlaps, that is, for small values of $t$. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide-and-conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs.
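The divide-and-conquer scheme might be sketched as follows, with a plain SVD standing in for the inexpensive Lanczos bisection, a per-point distance cache standing in for the hash table, and the refinement step omitted; parameter names and defaults are illustrative:

```python
import numpy as np
from itertools import combinations

def knn_graph(X, k, overlap=0.2, leaf_size=64):
    """Approximate kNN graph by recursive spectral bisection with
    overlapping halves and brute-force leaves. Illustrative sketch.
    """
    n = len(X)
    neighbors = {i: {} for i in range(n)}   # i -> {j: dist}, acts as cache

    def brute(idx):
        for i, j in combinations(idx, 2):
            d = np.linalg.norm(X[i] - X[j])
            neighbors[i][j] = d
            neighbors[j][i] = d

    def divide(idx):
        if len(idx) <= leaf_size:
            brute(idx)
            return
        # Split along the dominant singular direction of the centered data
        # (the paper uses a cheap Lanczos procedure; SVD stands in here).
        Y = X[idx] - X[idx].mean(0)
        u = np.linalg.svd(Y, full_matrices=False)[2][0]
        order = np.argsort(Y @ u)
        half, pad = len(idx) // 2, int(overlap * len(idx) / 2)
        divide([idx[t] for t in order[: half + pad]])   # left + common region
        divide([idx[t] for t in order[half - pad:]])    # right + common region
        # Subproblem results merge automatically via the shared cache.

    divide(list(range(n)))
    return {i: sorted(d, key=d.get)[:k] for i, d in neighbors.items()}

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
graph = knn_graph(X, k=5)
print(len(graph), len(graph[0]))  # 500 5
```

A larger `overlap` corresponds to a larger $\alpha$, improving graph quality at the cost of a larger exponent $t$.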