Author name cluster

Tao Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

68 papers

2 author rows

EAAI Journal 2026 Journal Article

A short-term water demand forecasting method integrating wavelet stepwise decomposition and spatial-temporal features

Chenlei Xie
Jie Wang
Tao Chen
Qiansheng Fang
Shanshou Li
Xuelei Yang

Accurate short-term water demand forecasting is crucial for the management and scheduling of water distribution systems. However, existing decomposition-based prediction models face two major challenges: prevalent data leakage during global decomposition, which distorts model evaluation, and the inherent shift-variance in methods designed to avoid leakage, resulting in poor forecasting accuracy. To address these issues, this paper proposes an innovative forecasting framework integrating Wavelet stepwise decomposition (WSD) with spatial-temporal features. The core contributions of this work are threefold: First, the proposed WSD method employs a fixed-length sliding window for decomposition, fundamentally eliminating data leakage. Second, correlation analysis is introduced to optimize the selection of the mother wavelet, thereby minimizing errors caused by shift-variance. Third, a hybrid prediction model is constructed, where Extreme gradient boosting (XGBoost) fits the stable trends of low-frequency subseries, and an inverted Transformer (iTransformer) captures the dynamic dependencies within multi-dimensional spatial-temporal features of high-frequency subseries, significantly enhancing their prediction accuracy. Experimental results on a real-world water distribution network (WDN) demonstrate that the proposed method outperforms benchmark models, including Long short-term memory (LSTM) and graph-based models.
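The leakage argument in this abstract can be illustrated with a minimal sketch. It assumes a one-level Haar transform and a hypothetical helper named `stepwise_decompose`; the paper's actual wavelet, its correlation-based mother-wavelet selection, and the XGBoost/iTransformer predictors are not reproduced here. The point shown is only that features computed from a fixed-length window ending at time t cannot depend on values after t.

```python
import math


def haar_step(window):
    """One-level Haar transform of an even-length window (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(window[i] + window[i + 1]) / s for i in range(0, len(window), 2)]
    detail = [(window[i] - window[i + 1]) / s for i in range(0, len(window), 2)]
    return approx, detail


def stepwise_decompose(series, window_len):
    """Decompose only the window_len most recent samples ending at each time t.

    Hypothetical helper for illustration: no sample after t is ever read
    when building the features stored under key t.
    """
    if window_len % 2 != 0:
        raise ValueError("window_len must be even for a one-level Haar step")
    features = {}
    for t in range(window_len - 1, len(series)):
        window = series[t - window_len + 1 : t + 1]
        features[t] = haar_step(window)
    return features


# Toy demand series (made-up numbers, not data from the paper).
demand = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0]
features = stepwise_decompose(demand, window_len=4)

# Tamper with the "future" (t >= 6): features at t = 5 must not change.
tampered = demand[:6] + [999.0, -999.0]
features_tampered = stepwise_decompose(tampered, window_len=4)
print(features[5] == features_tampered[5])
```

A global decomposition of the full series would fail this check, because every coefficient could be influenced by later samples.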

AAAI Conference 2026 Conference Paper

Beyond Quadratic: Linear-Time Change Detection with RWKV

Zhenyu Yang
Gensheng Pei
Tao Chen
Xia Yuan
Haofeng Zhang
Xiangbo Shu
Yazhou Yao

Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. At its core, our approach features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representations, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection.
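The two LEVIR-CD figures quoted above are linked by a standard identity. For a single class, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), so F1 = 2·IoU / (1 + IoU). The sketch below assumes both numbers were computed for the change class from the same confusion counts, which the abstract does not state explicitly; the function names are illustrative.

```python
def f1_from_iou(iou):
    """Convert single-class IoU to F1, assuming both share the same TP/FP/FN."""
    return 2.0 * iou / (1.0 + iou)


def iou_from_f1(f1):
    """Inverse of f1_from_iou under the same assumption."""
    return f1 / (2.0 - f1)


reported_iou = 0.8546  # from the abstract
reported_f1 = 0.9216   # from the abstract

# Under the stated assumption, the two reported values agree to four decimals.
print(abs(f1_from_iou(reported_iou) - reported_f1) < 5e-5)
print(abs(iou_from_f1(reported_f1) - reported_iou) < 5e-5)
```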

JBHI Journal 2026 Journal Article

FourierMask: Explain EEG-Based End-to-End Deep Learning Models in the Frequency Domain

Hanqi Wang
Jingyu Zhang
Kun Yang
Jichuan Xiong
Xuefeng Liu
Tao Chen
Liang Song

The rise of EEG-based end-to-end deep learning models has underscored the need to elucidate how these models process time-series raw EEG signals to generate predictions. The frequency domain provides a more suitable perspective for this task due to two key advantages: the strong correlation with cognitive states and the inherent capacity to model long-range temporal dependencies. However, this perspective remains underexplored in existing research. To bridge this gap, we propose FourierMask, the first mask perturbation framework specifically designed for frequency-domain explanation of EEG-based end-to-end models. Our method introduces three key innovations. First, the Fourier-based domain transformation enables direct manipulation of spectral components. Second, a learnable mask mechanism jointly models the spectral-spatial coupling relationship for EEG explanation. Third, a perturbation generator constrained by a target alignment loss ensures natural perturbations by minimizing distribution shift via cluster-aware regularization. We validate our method through experiments on an EEG benchmark dataset across EEGNet, TSCeption, and DeepConvNet models. Our method reaches a 36.0% average accuracy drop gap (vs. 8.6% for LIME and 6.6% for easyPEASI) at the group level. It also reaches a 17.8% average accuracy drop gap (vs. 8.9% for LIME and 9.9% for easyPEASI) at the instance level. Our model-agnostic framework provides a plug-and-play solution for enhancing transparency of EEG-based end-to-end deep learning models. It links model decisions to frequency biomarkers, with potential applications in neuromedicine and brain-computer interfaces.

AAAI Conference 2026 Conference Paper

Learning from Human Gaze: Human-like Robot Social Navigation in Dense Crowds

Zhecheng Yu
Yan Lyu
Chen Yang
Tao Chen
Yishuang Zhang
Bo Ling
Peng Wang
Guanyu Gao

Robot navigation in dense crowds requires understanding social cues that humans naturally use, yet existing methods struggle with real-world complexity. We investigate two questions: (1) Where do pedestrians look when navigating crowds? and (2) Can eye tracking improve robot navigation? To answer, we introduce GazeNav, an egocentric dataset collected via wearable eye trackers, featuring synchronized video, gaze, and trajectories in crowded environments. Analysis reveals that the gaze of pedestrians is closely related to the semantic presence and movement of other individuals, exhibiting distinct attention patterns across navigation behaviors. Building on this, we propose Gaze2Nav, a modular framework that first predicts human gaze to infer socially salient pedestrians, then incorporates the semantic attention into motion planning alongside visual inputs. Our method achieves 87.6% salient pedestrian prediction accuracy and reduces trajectory error by 15.4% over state-of-the-art baselines. By aligning with human gaze, our framework improves both performance and interpretability, advancing toward human-like, socially intelligent robot navigation.

TIST Journal 2026 Journal Article

LKAFormer: A Lightweight Kolmogorov-Arnold Transformer Model for Image Semantic Segmentation

Shoulin Yin
Liguo Wang
Tao Chen
Huafei Huang
Jing Gao
Jianing Zhang
Meng Liu
Peng Li

Transformer-based semantic segmentation methods have demonstrated outstanding performance by leveraging global self-attention to effectively capture long-range dependence. However, two issues remain in existing works: (1) Most of them utilize the full-rank weight matrix to support the self-attention mechanism and feed-forward network in modelling long-range dependence between patches/pixels, resulting in a high computational cost during both training and inference. (2) Most of them ignore information interactions between high-level semantics and low-level structures during image resolution recovery, which degrades performance when segmenting objects with complex boundaries. To tackle these challenges, a lightweight Kolmogorov-Arnold Transformer model (LKAFormer) is proposed for image semantic segmentation, containing a two-stream lightweight Transformer encoder and a graph feature pyramid aggregation KAN-decoder. The former constructs a hierarchical feature cross-scale fusion pipeline to obtain sufficient semantics containing comprehensive multi-scale information by setting up coarse-grained and fine-grained streams with different-size image patches. In that pipeline, feature lightweight focusing modules model complex and long-range dependence across patches/pixels to refine image semantics at lower computational cost through lightweight multi-head self-attention and lightweight feed-forward network designs. The latter leverages the learnable nonlinear transformation mechanism of the Kolmogorov-Arnold Transformer architecture to adaptively capture spatial structure dependence of distinct sub-regions of images. It then jointly performs intra-scale graph fusion and cross-scale graph fusion during image resolution recovery to enhance information interactions between high-level semantics and low-level structures, which achieves robust boundary localization and texture refinement of segmented objects. Finally, extensive experiments are conducted on three challenging datasets, and the results show that LKAFormer sets a new baseline in the image segmentation task in comparison with 11 methods.

AAAI Conference 2026 Conference Paper

Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement

Chongjun Tu
Peng Ye
Dongzhan Zhou
Tao Chen
Wanli Ouyang

Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning, characterized by overthinking on simple queries and inefficient utilization of visual information, resulting in large amounts of inefficient and ineffective computation. In this paper, we discover that Multimodal Large Language Models (MLLMs) possess inherent capabilities to distinguish between simple and difficult queries and enhance task-related visual information, which remain underutilized by existing approaches. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, our selective thinking module employs entropy-based confidence estimation to determine whether queries require detailed reasoning, preventing overthinking on simple questions. Second, our step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical solution for improving multimodal reasoning. Comprehensive experiments across eight benchmarks from diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
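The idea of entropy-based confidence gating can be sketched as follows. The abstract does not say which distribution the entropy is computed over or how the threshold is chosen, so this sketch assumes a probability distribution over candidate answers and a hypothetical threshold; `needs_detailed_reasoning` is an illustrative name, not the authors' API.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_detailed_reasoning(probs, threshold):
    """Route to step-by-step reasoning only when the model looks uncertain.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return shannon_entropy(probs) > threshold


# Made-up answer distributions for two queries.
easy_query = [0.97, 0.01, 0.01, 0.01]  # peaked: low entropy
hard_query = [0.25, 0.25, 0.25, 0.25]  # uniform: entropy = ln(4)

print(needs_detailed_reasoning(easy_query, threshold=0.5))
print(needs_detailed_reasoning(hard_query, threshold=0.5))
```

With this gate, the peaked distribution is answered directly and the uniform one is sent to detailed reasoning.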

EAAI Journal 2026 Journal Article

Multiscale wavelet-based spatial–spectral compression network for hyperspectral image

Hang Yu
Mingyang Wan
Tao Chen
Aibin Peng
Xiangfei Shen
Rulong He
Lihui Chen
Haijun Liu

Hyperspectral images (HSIs) possess high-dimensional tensor structures that present significant reconstruction challenges under ultra-low compression ratios (CR) in artificial intelligence-driven remote sensing. Conventional compression methods are unable to effectively capture inherent spatial–spectral coherence and often neglect multiscale spectral absorption-reflection dependencies, which are critical for maintaining spectral fidelity. To overcome these shortcomings, we propose a Multiscale Wavelet-based Spatial-Spectral Compression Network (MWC-Net) for HSI reconstruction. Methodologically, MWC-Net integrates a three-dimensional (3D) spatial–spectral attention encoder, which uses tri-branch attention to extract complete spatial–spectral coherence. Additionally, we develop a multiscale wavelet spatial–spectral decoder that restores scale-sensitive spectral features through multiscale super-resolution and enhances spatial–spectral resolution using wavelet decomposition. Compared to the state-of-the-art method “Hyperspectral Image Compression Sensing Network With Convolutional Neural Networks (CNN)–Transformer Mixture Architectures”, MWC-Net achieves an average decrease from 1.770 to 1.549 in the spectral angle mapper (SAM) metric. Additionally, the average peak signal-to-noise ratio (PSNR) increases from 39.81 to 40.79, while the average root mean square error (RMSE) decreases from 55.15 to 49.97, under approximately 1% CR. This enhancement highlights the superior ability of MWC-Net to balance compression efficiency and spectral fidelity in HSI reconstruction. The code is available at https://github.com/YuHang-max/MWCNet.

YNIMG Journal 2026 Journal Article

Neural correlates of autobiographical memory deficits across psychiatric disorders: A systematic review and meta-analysis

Xu-chen Yu
Wan-ting Ran
Gui-fang Chen
Tao Chen
Ji-fang Cui
Ya Wang
Raymond C.K. Chan

Dysfunction of autobiographical memory (AM) is one of the core markers of psychiatric disorders such as major depressive disorder and schizophrenia. However, it remains unclear whether there is a common neural basis underlying AM impairment across psychiatric patients. In this study, a systematic review and meta-analysis using both Seed-based d Mapping with Permutation of Subject Images (SDM-PSI) and Activation Likelihood Estimation (ALE) approaches were conducted to examine brain activation differences between psychiatric populations and healthy controls during AM. A computerized search was performed using the databases Web of Science, PubMed, APA PsycInfo and EBSCO to identify relevant studies published from inception to 31 October 2025. Twenty-four studies (1385 participants) were identified for the qualitative synthesis and 12 studies (547 participants) for the meta-analysis. The qualitative analysis revealed widespread abnormalities in psychiatric patients in both activation and functional connectivity (FC) across the default mode network, salience and attentional network, control network and visual network. Meta-analysis results indicated that patients with psychiatric disorders exhibited hyperactivations in the cingulate cortex, and subsequent meta-analytic connectivity modeling (MACM) analysis demonstrated its widespread co-activation with large-scale functional networks. These findings suggest network-level dysfunction across psychiatric disorders during AM processing and provide insights for future clinical research.

AAAI Conference 2026 Conference Paper

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

pengtao chen
Xianfang Zeng
Maosen Zhao
Mingzhu Shen
Wei Cheng
Gang Yu
Tao Chen

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. In addition, 3-6% of attention heads can even be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
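The three sparsity patterns named in this abstract can be pictured as boolean attention masks. The sketch below is a toy construction: the band widths, strides, and column choices are hypothetical parameters, and the paper's actual pattern definitions and sparse kernels may differ. `density` reports the fraction of query-key pairs kept, which under this simplified view is the fraction of attention-score work retained.

```python
def diagonal_mask(n, half_width):
    """Keep entries within half_width of the main diagonal."""
    return [[abs(i - j) <= half_width for j in range(n)] for i in range(n)]


def multi_diagonal_mask(n, stride, half_width):
    """Keep bands centered on offsets that are multiples of stride."""
    def near_multiple(offset):
        r = offset % stride
        return min(r, stride - r) <= half_width
    return [[near_multiple(abs(i - j)) for j in range(n)] for i in range(n)]


def vertical_stripe_mask(n, columns):
    """Keep a fixed set of key columns for every query row."""
    keep = set(columns)
    return [[j in keep for j in range(n)] for _ in range(n)]


def density(mask):
    """Fraction of (query, key) pairs that the mask keeps."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


print(density(diagonal_mask(4, half_width=0)))
print(density(multi_diagonal_mask(6, stride=3, half_width=0)))
print(density(vertical_stripe_mask(4, columns=[0])))
```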

AAAI Conference 2025 Conference Paper

All-in-One: Transferring Vision Foundation Models into Stereo Matching

Jingyi Zhou
Haoyu Zhang
Jiakang Yuan
Peng Ye
Tao Chen
Hao Jiang
Meiya Chen
Yangyang Zhang

As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on the mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks 1st on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.

IJCAI Conference 2025 Conference Paper

Boost Embodied AI Models with Robust Compression Boundary

Chong Yu
Tao Chen
Zhongxue Gan

The rapid improvement of deep learning models with the integration of the physical world has dramatically improved embodied AI capabilities. Meanwhile, the powerful embodied AI models and their scales place an increasing burden on deployment efficiency. The efficiency issue is more apparent on embodied AI platforms than in data centers because they have more limited computational resources and memory bandwidth. Moreover, most embodied AI scenarios, like autonomous driving and robotics, are more sensitive to response latency. Theoretically, traditional model compression techniques can help embodied AI models with more efficient computation, lower memory and energy consumption, and reduced latency. Because embodied AI models are expected to interact with the physical world, the corresponding compressed models are also expected to resist natural corruption caused by real-world events such as noise, blur, weather conditions, and even adversarial corruption. This paper explores a novel paradigm to boost the efficiency of embodied AI models and the robust compression boundary. Our method is shown to find the optimal balance between accuracy, efficiency, and robustness in real-world conditions.

NeurIPS Conference 2025 Conference Paper

Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

Xiaohui Wang
Peng Ye
Chenyu Huang
Shenghe Zheng
Bo Zhang
Lei Bai
Wanli Ouyang
Tao Chen

With the rise of the fine-tuned–pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 50× compression, (b) general NLP models (RoBERTa-base, T5-base) with up to 224× compression, (c) vision models (ViT-B/32, ViT-L/14) with up to 132× compression, and (d) multi-modal models (BEiT-3) with 18× compression, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression. Code is available at https://github.com/xiaohuiwang000/UltraDelta.

IROS Conference 2025 Conference Paper

Distributed Cooperative Target Tracking and Active Sensing of Dual-AUV Based on Flank Array Sonar Detection

Qi Qi
Tao Chen
Yiming Jiang 0022
Yanjie Pan

When tracking an underwater target, autonomous underwater vehicles (AUVs) need to estimate the target state based on the information detected by sensors and plan their own tracking paths accordingly to achieve active sensing of the target. When the sensor equipped on the AUV is a flank array sonar, the problem becomes significantly more complex due to the limited field of view (FOV) of the sonar and the fact that only bearing information is available for observation. To address this issue, this paper proposes a distributed solution for cooperative tracking and active sensing using dual-AUV systems equipped with flank array sonar for detection. Based on the analysis of underwater acoustic communication modes in dual-AUV systems, this study decomposes the problem into two aspects: cooperative estimation and planning control for active sensing. Corresponding algorithms are proposed and their effectiveness is verified.

IROS Conference 2025 Conference Paper

ETA-IK: Execution-Time-Aware Inverse Kinematics for Dual-Arm Systems

Yucheng Tang
Xi Huang
Yongzhou Zhang
Tao Chen
Ilshat Mamaev
Björn Hein

This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of the entire system, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. Unlike traditional IK methods using surrogate metrics, our approach directly optimizes execution time while implicitly considering collisions. A neural network based execution time approximator is employed to predict time-efficient joint configurations while accounting for potential collisions. Through experimental evaluation on a system composed of a UR5 and a KUKA iiwa robot, we demonstrate significant reductions in execution time. The proposed method outperforms conventional approaches, showing improved motion efficiency without sacrificing positioning accuracy.

NeurIPS Conference 2025 Conference Paper

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Chongjun Tu
Lin Zhang
pengtao chen
Peng Ye
Xianfang Zeng
Wei Cheng
Gang Yu
Tao Chen

Multimodal Large Language Models (MLLMs) have shown impressive video content understanding capabilities but struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, which comprises 1,776 videos from both ego-centric and third-person perspectives and enables assessment through both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we employ GPT-assisted evaluation and develop a novel cost-efficient LLM-free assessment method, where the latter can enhance benchmarking interpretability and accessibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on motion-related tasks across TVBench, MotionBench, and our FAVOR-Bench. Our assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools for the community to develop more powerful video understanding models.

ICML Conference 2025 Conference Paper

ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset

Yilin Wang
Peixuan Lei
Jie Song
Yuzhe Hao
Tao Chen
Yuxuan Zhang
Lei Jia
Yuanxiang Li

Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/.

EWRL Workshop 2025 Workshop Paper

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Zechu Li
Rickmer Krohn
Tao Chen
Anurag Ajay
Pulkit Agrawal
Georgia Chalvatzaki

Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models have emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns from scratch multimodal policies parameterized as diffusion models while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, also showcasing proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles.

NeurIPS Conference 2025 Conference Paper

PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

Kangcong Li
Peng Ye
Chongjun Tu
Lin Zhang
Chunfeng Song
Jiamin Wu
Tao Yang
Qihao Zheng

While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain’s working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons’ persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench’s Multi-document QA and 12.5–17.5% performance gains on ∞-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other works. Besides, it can be generalized to any model and enhance their long-context performance and interpretability without structural overhauls.

NeurIPS Conference 2025 Conference Paper

PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs

Xinzhe Zheng
Hao Du
Fanding Xu
Jinzhe Li
Zhiyuan Liu
Wenkang Wang
Tao Chen
Wanli Ouyang

Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates PRotein-protein INteraction prediction from a Graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this gold-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.

EAAI Journal 2025 Journal Article

Spatio-Temporal Channel Attention and Membrane Potential Modulation for Efficient Spiking neural network

Xingming Tang
Tao Chen
Qian Cheng
Hangchi Shen
Shukai Duan
Lidan Wang

Spiking Neural Networks (SNNs) are an energy-efficient alternative to Artificial Neural Networks (ANNs) due to their event-driven nature. However, common coding methods such as direct coding struggle to capture critical spatio-temporal dynamics. To address this, we propose a Spatio-Temporal Channel Attention (STCA) module to improve feature extraction during spike encoding. In addition, we introduce a Membrane Potential Modulator (MPM) to reduce information loss due to binary quantization. Together, STCA and MPM form the Gated Attention Coding Mechanism (GACM), which improves SNN training on both static and neuromorphic datasets. Experiments on the Canadian Institute for Advanced Research 10/100 (CIFAR10/100) and CIFAR10-Dynamic Vision Sensor (DVS) datasets show that GACM achieves higher accuracy and significantly better efficiency than direct coding. In particular, it improves accuracy by 1.73% on CIFAR100 and 0.81% on CIFAR10 in fewer steps.

NeurIPS Conference 2024 Conference Paper

Bifröst: 3D-Aware Image Compositing with Language Instructions

Lingxiao Li
Kaixiong Gong
Weihong Li
Xili Dai
Tao Chen
Xiaojun Yuan
Xiangyu Yue

This paper introduces Bifröst, a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level, which falls short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning the MLLM with a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. Then, the image-compositing model is uniquely designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that consider occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.

NeurIPS Conference 2024 Conference Paper

3DET-Mamba: Causal Sequence Modelling for End-to-End 3D Object Detection

Mingsheng Li
Jiakang Yuan
Sijin Chen
Lin Zhang
Anyu Zhu
Xin Chen
Tao Chen

Transformer-based architectures have been proven successful in detecting 3D objects from point clouds. However, the quadratic complexity of the attention mechanism struggles to encode rich information as point cloud resolution increases. Recently, state space models (SSM) such as Mamba have gained great attention due to their linear complexity and long sequence modeling ability for language understanding. To exploit the potential of Mamba on 3D scene-level perception, for the first time, we propose 3DET-Mamba, which is a novel SSM-based model designed for indoor 3D object detection. Specifically, we divide the point cloud into different patches and use a lightweight yet effective Inner Mamba to capture local geometric information. To observe the scene from a global perspective, we introduce a novel Dual Mamba module that models the point cloud in terms of spatial distribution and continuity. Additionally, we design a Query-aware Mamba module that decodes context features into object sets under the guidance of learnable queries. Extensive experiments demonstrate that 3DET-Mamba surpasses previous 3DETR on indoor 3D detection benchmarks such as ScanNet, improving AP25/AP50 from 65.0%/47.0% to 70.4%/54.4%, respectively.

EAAI Journal 2024 Journal Article

A new nonlinear ensemble framework based on dynamic-matched weights for tool remaining useful life prediction

Tingting Feng
Liang Guo
Tao Chen
Hongli Gao

In the field of remaining useful life (RUL) prediction, the most prominent task is constructing an accurate prediction model. However, it is difficult for single prediction models to satisfy multiple application situations. Therefore, a new nonlinear ensemble RUL prediction framework based on dynamic-matched weights is proposed in this paper. In the proposed framework, the neural network-based method and the stochastic process-based method are first aggregated through a nonlinear weighting formulation to mitigate data limitations and lack of a priori knowledge. Then, a novel ensemble weight dynamic matching algorithm is designed to achieve time-varying weight matching and improve the prediction accuracy. Finally, the ensemble RUL prediction result is characterized by the probability density function (PDF) of the remaining life. Two milling cutter experiments verify that the proposed nonlinear ensemble RUL prediction framework achieves better overall performance. The cumulative relative accuracy (CRA) of the prediction results is greater than 0.6, which outperforms the commonly used tool RUL prediction method.

AAAI Conference 2024 Conference Paper

Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning

Mengmeng Sheng
Zeren Sun
Zhenhuang Cai
Tao Chen
Yichao Zhou
Yazhou Yao

There has been significant attention devoted to the effectiveness of various domains, such as semi-supervised learning, contrastive learning, and meta-learning, in enhancing the performance of methods for noisy label learning (NLL) tasks. However, most existing methods still depend on prior assumptions regarding clean samples amidst different sources of noise (e.g., a pre-defined drop rate or a small subset of clean samples). In this paper, we propose a simple yet powerful idea called NPN, which revolutionizes Noisy label learning by integrating Partial label learning (PLL) and Negative learning (NL). Toward this goal, we initially decompose the given label space adaptively into the candidate and complementary labels, thereby establishing the conditions for PLL and NL. We propose two adaptive data-driven paradigms of label disambiguation for PLL: hard disambiguation and soft disambiguation. Furthermore, we generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision. To maintain label reliability during the later stage of model training, we introduce a consistency regularization term that encourages agreement between the outputs of multiple augmentations. Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods. The source code has been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/NPN.

AAAI Conference 2024 Conference Paper

Boosting Residual Networks with Group Knowledge

Shengji Tang
Peng Ye
Baopu Li
Weihao Lin
Tao Chen
Tong He
Chong Yu
Wanli Ouyang

Recent research understands the residual networks from a new perspective of the implicit ensemble model. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of the residual network by sampling and training of its subnets. However, they both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group knowledge based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise lower-level subnet group. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting performance for hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.

NeurIPS Conference 2024 Conference Paper

EMR-Merging: Tuning-Free High-Performance Model Merging

Chenyu Huang
Peng Ye
Tao Chen
Tong He
Xiangyu Yue
Wanli Ouyang

The success of the pretrain-finetune paradigm has brought about the release of numerous model weights. In this case, merging models finetuned on different tasks to enable a single model with multi-task capabilities is gaining increasing attention for its practicability. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning by additional data or training. In this paper, we rethink and analyze the existing model merging paradigm. We discover that using a single model's weights can hardly simulate all the models' performance. To tackle this issue, we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a unified model from all the model weights and then (b) generate extremely lightweight task-specific modulators, including masks and rescalers, to align the direction and magnitude between the unified model and each specific model, respectively. EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance. We find that EMR-Merging shows outstanding performance compared to existing merging methods under different classical and newly-established settings, including merging different numbers of vision models (up to 30), NLP models, PEFT models, and multi-modal models.

IJCAI Conference 2024 Conference Paper

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

Chenhui Wang
Tao Chen
Zhihao Chen
Zhizhong Huang
Taoran Jiang
Qi Wang
Hongming Shan

Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.

NeurIPS Conference 2024 Conference Paper

FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

Kun Chen
Peng Ye
Hao Chen
Kang Chen
Tao Han
Wanli Ouyang
Tao Chen
Lei Bai

Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However, existing AI-based data assimilation methods can only handle observations with a specific resolution, lacking the compatibility and generalization ability to assimilate observations with other resolutions. Considering that complex real-world observations often have different resolutions, we propose the Fourier Neural Processes (FNP) for arbitrary-resolution data assimilation in this paper. Leveraging the efficiency of the designed modules and flexible structure of neural processes, FNP achieves state-of-the-art results in assimilating observations with varying resolutions, and also exhibits increasing advantages over the counterparts as the resolution and the amount of observations increase. Moreover, our FNP trained on a fixed resolution can directly handle the assimilation of observations with out-of-distribution resolutions and the observational information reconstruction task without additional fine-tuning, demonstrating its excellent generalization ability across data resolutions as well as across tasks. Code is available at https://github.com/OpenEarthLab/FNP.

JBHI Journal 2024 Journal Article

HOPE: Hybrid-Granularity Ordinal Prototype Learning for Progression Prediction of Mild Cognitive Impairment

Chenhui Wang
Yiming Lei
Tao Chen
Junping Zhang
Yuxin Li
Hongming Shan

Mild cognitive impairment (MCI) is often at high risk of progression to Alzheimer's disease (AD). Existing works to identify the progressive MCI (pMCI) typically require MCI subtype labels, pMCI vs. stable MCI (sMCI), determined by whether or not an MCI patient will progress to AD after a long follow-up. However, prospectively acquiring MCI subtype data is time-consuming and resource-intensive; the resultant small datasets could lead to severe overfitting and difficulty in extracting discriminative information. Inspired by the observation that various longitudinal biomarkers and cognitive measurements present an ordinal pathway on AD progression, we propose a novel Hybrid-granularity Ordinal PrototypE learning (HOPE) method to characterize AD ordinal progression for MCI progression prediction. First, HOPE learns an ordinal metric space that enables progression prediction by prototype comparison. Second, HOPE leverages a novel hybrid-granularity ordinal loss to learn the ordinal nature of AD by effectively integrating instance-to-instance ordinality, instance-to-class compactness, and class-to-class separation. Third, to make the prototype learning more stable, HOPE employs an exponential moving average strategy to learn the global prototypes of NC and AD dynamically. Experimental results on the internal ADNI and the external NACC datasets demonstrate the superiority of the proposed HOPE over existing state-of-the-art methods as well as its interpretability.
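
The exponential moving average strategy named in this abstract can be illustrated with a minimal sketch. The function name `ema_update`, the momentum value, and the use of a batch mean as the update target are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def ema_update(prototype, batch_features, momentum=0.9):
    """Blend a running class prototype with the mean of a batch of features.

    prototype: list of floats (current global prototype, hypothetical layout)
    batch_features: list of feature vectors belonging to the same class
    momentum: weight kept on the old prototype (assumed value)
    """
    dim = len(prototype)
    n = len(batch_features)
    # Mean feature vector of the current batch.
    batch_mean = [sum(f[i] for f in batch_features) / n for i in range(dim)]
    # Standard EMA rule: new = m * old + (1 - m) * observation.
    return [momentum * p + (1.0 - momentum) * m
            for p, m in zip(prototype, batch_mean)]


# Usage: update one prototype over two successive batches.
proto = [0.0, 0.0]
for batch in ([[1.0, 2.0], [3.0, 4.0]], [[2.0, 3.0]]):
    proto = ema_update(proto, batch)
```

A high momentum keeps the prototype stable across noisy batches, which is the usual motivation for an EMA update of this kind.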

NeurIPS Conference 2024 Conference Paper

Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

Zechu Li
Rickmer Krohn
Tao Chen
Anurag Ajay
Pulkit Agrawal
Georgia Chalvatzaki

Deep reinforcement learning (RL) algorithms typically parameterize the policy as a deep network that outputs either a deterministic action or a stochastic one modeled as a Gaussian distribution, hence restricting learning to a single behavioral mode. Meanwhile, diffusion models emerged as a powerful framework for multimodal learning. However, the use of diffusion policies in online RL is hindered by the intractability of policy likelihood approximation, as well as the greedy objective of RL methods that can easily skew the policy to a single mode. This paper presents Deep Diffusion Policy Gradient (DDiffPG), a novel actor-critic algorithm that learns from scratch multimodal policies parameterized as diffusion models while discovering and maintaining versatile behaviors. DDiffPG explores and discovers multiple modes through off-the-shelf unsupervised clustering combined with novelty-based intrinsic motivation. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective, ensuring the improvement of the diffusion policy across all modes. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes. Empirical studies validate DDiffPG's capability to master multimodal behaviors in complex, high-dimensional continuous control tasks with sparse rewards, also showcasing proof-of-concept dynamic online replanning when navigating mazes with unseen obstacles. Our project page is available at https://supersglzc.github.io/projects/ddiffpg/.

ECAI Conference 2024 Conference Paper

Leveraging Foundation Models for Zero-Shot IoT Sensing

Dinghao Xue
Xiaoran Fan
Tao Chen
Guohao Lan
Qun Song

Deep learning models are increasingly deployed on edge Internet of Things (IoT) devices. However, these models typically operate under supervised conditions and fail to recognize unseen classes different from training. To address this, zero-shot learning (ZSL) aims to classify data of unseen classes with the help of semantic information. Foundation models (FMs) trained on web-scale data have shown impressive ZSL capability in natural language processing and visual understanding. However, leveraging FMs’ generalized knowledge for zero-shot IoT sensing using signals such as mmWave, IMU, and Wi-Fi has not been fully investigated. In this work, we align the IoT data embeddings with the semantic embeddings generated by an FM’s text encoder for zero-shot IoT sensing. To utilize the physics principles governing the generation of IoT sensor signals to derive more effective prompts for semantic embedding extraction, we propose to use cross-attention to combine a learnable soft prompt that is optimized automatically on training data and an auxiliary hard prompt that encodes domain knowledge of the IoT sensing task. To address the problem of IoT embeddings biasing to seen classes due to the lack of unseen class data during training, we propose using data augmentation to synthesize unseen class IoT data for fine-tuning the IoT feature extractor and embedding projector. We evaluate our approach on multiple IoT sensing tasks. Results show that our approach achieves superior open-set detection and generalized zero-shot learning performance compared with various baselines. Our code is available at https://github.com/schrodingho/FM_ZSL_IoT.

NeurIPS Conference 2024 Conference Paper

MeshXL: Neural Coordinate Field for Generative 3D Foundation Models

Sijin Chen
Xin Chen
Anqi Pang
Xianfang Zeng
Wei Cheng
Yijun Fu
Fukun Yin
Zhibin Wang

The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, and is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate that Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple-yet-effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models that addresses 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various down-stream applications.

AAAI Conference 2024 Conference Paper

PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation

Yiying Yang
Fukun Yin
Wen Liu
Jiayuan Fan
Xin Chen
Gang Yu
Tao Chen

Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points to obtain more global semantic information and propose a prior-rich multi-modal implicit neural representation network, PM-INR, for the outdoor unbounded large-scale scene. The core of our method is multi-modal prior extraction and cross-modal prior fusion modules. The former encodes codebooks from different modality inputs and extracts valuable priors, while the latter fuses priors to maintain view consistency and preserve unique features among multi-modal priors. Finally, feature-rich cross-modal priors are injected into the sampling regions to allow each region to perceive global information without filling the sampling space. Extensive experiments have demonstrated the effectiveness and robustness of our method for outdoor unbounded large-scale scene novel view synthesis, which outperforms state-of-the-art methods in terms of PSNR, SSIM, and LPIPS.

EAAI Journal 2024 Journal Article

Principal space approximation ensemble discriminative marginalized least-squares regression for hyperspectral image classification

Haoyu Long
Tao Chen
Huayue Chen
Xiangbing Zhou
Wu Deng

Discriminative marginalized least-squares regression (DMLSR) is unable to extract spectral-spatial joint features, and the proportion of interfering pixels it learns is high. To solve this problem, a novel principal space approximation ensemble discriminative marginalized least-squares regression, namely PSAE-DMLSR, is proposed for hyperspectral image classification. In the PSAE-DMLSR, a marginal principal component method (MP) is employed to search the optimal spectral subspace, and a principal space local marginal principal component (PSLMP) method is proposed to search the optimal representation space (ORS). In the PSLMP, a principal space representation (PSR) is designed to integrate the global spectral-spatial joint feature information of the ORS, and the PSR is used to impose approximate averaging constraints and stochastic cascade fusion on the ORS, which can further improve the representation ability of the ORS. The ORS can effectively reduce the proportion of interfering pixels in DMLSR learning. Comparative experiments with more advanced classification methods were conducted on three commonly used hyperspectral datasets. The experimental results show that the PSAE-DMLSR classification model can still obtain high classification accuracy under low hardware conditions, and that its execution efficiency is also advantageous.

NeurIPS Conference 2024 Conference Paper

S2HPruner: Soft-to-Hard Distillation Bridges the Discretization Gap in Pruning

Weihao Lin
Shengji Tang
Chong Yu
Peng Ye
Tao Chen

Recently, differentiable mask pruning methods optimize the continuous relaxation architecture (soft network) as the proxy of the pruned discrete network (hard network) for superior sub-architecture search. However, due to the agnostic impact of the discretization process, the hard network struggles to reach the same representational capacity as the soft network, a problem known as the discretization gap, which severely spoils the pruning performance. In this paper, we first investigate the discretization gap and propose a novel structural differentiable mask pruning framework named S2HPruner to bridge the discretization gap in a one-stage manner. In the training procedure, S2HPruner forwards both the soft network and its corresponding hard network, then distills the hard network under the supervision of the soft network. To optimize the mask and prevent performance degradation, we propose a decoupled bidirectional knowledge distillation. It blocks the weight updating from the hard to the soft network while maintaining the gradient corresponding to the mask. Compared with existing pruning arts, S2HPruner achieves superior pruning performance without fine-tuning on comprehensive benchmarks, including CIFAR-100, Tiny ImageNet, and ImageNet with a variety of network architectures. Besides, investigation and analysis experiments explain the effectiveness of S2HPruner. Codes will be released soon.

EAAI Journal 2024 Journal Article

Short-term wind power prediction framework using numerical weather predictions and residual convolutional long short-term memory attention network

Chenlei Xie
Xuelei Yang
Tao Chen
Qiansheng Fang
Jie Wang
Yan Shen

As a prominent global source of renewable energy, wind power generation has been experiencing rapid growth. More precise prediction of short-term wind power is essential to ensure the stable and cost-effective operation of power systems. In response, a wind power prediction framework using numerical weather predictions (NWPs) and a Residual Convolutional Long Short-Term Memory Attention (Res-ConvLSTM-Attention) network was proposed in this study. To address the significant errors in individual NWPs, a Weighted Naive Bayes (WNB) model and a Multivariate Quadratic Nonlinear Regression (NR) model were employed to fuse the wind speed and wind direction characteristics of the four NWPs, respectively, aiming to obtain more accurate weather forecast data. Given the difficulty of accurate prediction caused by the randomness of wind power, a Res-ConvLSTM-Attention network was proposed for short-term wind power prediction. The Res-ConvLSTM unit extracted deep spatiotemporal features while effectively alleviating network degradation and gradient vanishing issues caused by network deepening. The Attention unit allocated higher weights to key features, and their combination enhanced the accuracy of wind power prediction. Finally, using the data provided by Challenge Data for experimental analysis, the results showed that the mean absolute error (MAE), root mean square error (RMSE), mean arctangent absolute percentage error (MAAPE) and coefficient of determination (R2) values were 0.0758, 0.1163, 0.4364 and 0.946, respectively, affirming the effectiveness of the wind power prediction framework.
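
The four evaluation metrics reported in this abstract follow standard definitions, sketched below for reference. The function name `forecast_metrics` and the sample values are hypothetical; the handling of zero true values in MAAPE is an assumption, and the abstract does not state how the authors computed the metrics.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAAPE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean absolute error and root mean square error.
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # MAAPE: mean of arctan(|error / true value|), bounded by pi/2.
    def arctan_ape(t, e):
        if t == 0:
            # Assumed convention: 0 for an exact hit, pi/2 otherwise.
            return 0.0 if e == 0 else math.pi / 2
        return math.atan(abs(e / t))

    maape = sum(arctan_ape(t, e) for t, e in zip(y_true, errors)) / n

    # Coefficient of determination: 1 - SS_res / SS_tot
    # (requires y_true to contain at least two distinct values).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "maape": maape, "r2": r2}


# Usage with made-up normalized power values.
metrics = forecast_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

MAAPE stays finite when true values approach zero, which is one reason it is sometimes preferred over MAPE for wind power series that include near-zero output.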

IJCAI Conference 2024 Conference Paper

Spear: Evaluate the Adversarial Robustness of Compressed Neural Models

Chong Yu
Tao Chen
Zhongxue Gan
Jiayuan Fan

As Artificial Intelligence evolves, the neural models vulnerable to adversarial attacks may produce fatal results in critical applications. This paper mainly discusses the robustness of the compressed neural models facing adversarial attacks. A few studies discuss the interaction between model compression and adversarial attack. However, they focus on the robustness against the traditional attacks designed for the dense models, not the attacks intended explicitly for the compressed models, using sparsity and quantization techniques. Compressed models often have fewer parameters and smaller sizes that are more friendly to resource-limited devices than dense models, so they are widely deployed in various edge and mobile devices. However, introducing the sparsity and quantization into neural models further imposes higher attack risks. A specific adversarial attack method (Spear) is proposed to generate the particular adversarial attack samples for evaluating the robustness of the compressed models. The Spear attack finds minimal perturbations to create the attack samples to maximize the different behaviors between the compressed and dense reference models. We demonstrate the proposed Spear attack technique can generally be applied to various networks and tasks through quantitative and ablation experiments.

NeurIPS Conference 2024 Conference Paper

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Hancheng Ye
Jiakang Yuan
Renqiu Xia
Xiangchao Yan
Tao Chen
Junchi Yan
Botian Shi
Bo Zhang

Diffusion models have recently achieved great success in the synthesis of high-quality images and videos. However, the existing denoising techniques in diffusion models are commonly based on step-by-step noise predictions, which suffer from high computation cost, resulting in a prohibitive latency for interactive applications. In this paper, we propose AdaptiveDiffusion to relieve this bottleneck by adaptively reducing the noise prediction steps during the denoising process. Our method considers the potential of skipping as many noise prediction steps as possible while keeping the final denoised results identical to the original full-step ones. Specifically, the skipping strategy is guided by the third-order latent difference that indicates the stability between timesteps during the denoising process, which benefits the reusing of previous noise prediction results. Extensive experiments on image and video diffusion models demonstrate that our method can significantly speed up the denoising process while generating identical results to the original process, achieving up to an average 2-5x speedup without quality degradation. The code is available at https://github.com/UniModal4Reasoning/AdaptiveDiffusion
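
To make the "third-order latent difference" concrete, the sketch below computes the standard third-order finite difference over the four most recent latents and compares its norm with a threshold. The function names, the Euclidean norm, the threshold value, and the skip rule itself are assumptions for illustration; the abstract only states that such a difference guides the skipping strategy.

```python
import math


def third_order_difference(latents):
    """Third-order finite difference of the last four latents (oldest first).

    For latents x[t-3], x[t-2], x[t-1], x[t] this returns
    x[t] - 3*x[t-1] + 3*x[t-2] - x[t-3], element by element.
    """
    oldest, older, previous, current = latents[-4:]
    return [c - 3 * p + 3 * o - f
            for c, p, o, f in zip(current, previous, older, oldest)]


def can_skip(latents, threshold=0.01):
    """Hypothetical rule: skip the next noise prediction if the latents are stable."""
    if len(latents) < 4:
        return False  # not enough history to form a third-order difference
    diff = third_order_difference(latents)
    norm = math.sqrt(sum(v * v for v in diff))
    return norm < threshold


# Usage: a smoothly (quadratically) evolving latent has zero third-order difference.
smooth = [[0.0], [1.0], [4.0], [9.0]]
stable = can_skip(smooth)
```

When the check passes, a sampler built along these lines would reuse the previous noise prediction instead of calling the denoising network again.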

NeurIPS Conference 2023 Conference Paper

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

Jiakang Yuan
Bo Zhang
Xiangchao Yan
Botian Shi
Tao Chen
Yikang Li
Yu Qiao

It is a long-term vision for the Autonomous Driving (AD) community that perception models can learn from a large-scale point cloud dataset to obtain unified representations that achieve promising results on different tasks or benchmarks. Previous works mainly focus on the self-supervised pre-training pipeline, meaning that they perform the pre-training and fine-tuning on the same benchmark, which makes it difficult to attain performance scalability and cross-dataset application for the pre-training checkpoint. In this paper, for the first time, we are committed to building a large-scale pre-training point-cloud dataset with diverse data distribution, and meanwhile learning generalizable representations from such a diverse pre-training dataset. We formulate the point-cloud pre-training task as a semi-supervised problem, which leverages the few-shot labeled and massive unlabeled point-cloud data to generate the unified backbone representations that can be directly applied to many baseline models and benchmarks, decoupling the AD-related pre-training process and downstream fine-tuning task. During the period of backbone pre-training, by enhancing the scene- and instance-level distribution diversity and exploiting the backbone's ability to learn from unknown instances, we achieve significant performance gains on a series of downstream perception benchmarks including Waymo, nuScenes, and KITTI, under different baseline models like PV-RCNN++, SECOND, CenterPoint.

IJCAI Conference 2023 Conference Paper

Adversarial Amendment is the Only Force Capable of Transforming an Enemy into a Friend

Chong Yu
Tao Chen
Zhongxue Gan

Adversarial attack is commonly regarded as a huge threat to neural networks because of its misleading behavior. This paper presents an opposite perspective: adversarial attacks can be harnessed to improve neural models if amended correctly. Unlike traditional adversarial defense or adversarial training schemes that aim to improve the adversarial robustness, the proposed adversarial amendment (AdvAmd) method aims to improve the original accuracy level of neural models on benign samples. We thoroughly analyze the distribution mismatch between the benign and adversarial samples. This distribution mismatch, together with the mutual learning mechanism that applies the same learning ratio in prior defense strategies, is the main cause of the accuracy degradation on benign samples. The proposed AdvAmd is shown to steadily heal the accuracy degradation and even yields an accuracy boost for common neural models on benign classification, object detection, and segmentation tasks. Quantitative and ablation experiments attribute the efficacy of AdvAmd to three key components: mediate samples (to reduce the influence of distribution mismatch with a fine-grained amendment), auxiliary batch norm (to solve the mutual learning mechanism and the smoother judgment surface), and AdvAmd loss (to adjust the learning ratios according to different attack vulnerabilities).

NeurIPS Conference 2023 Conference Paper

Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-Loop Feedback

Marcel Torne Villasevil
Max Balsells I Pamies
Zihan Wang
Samedh Desai
Tao Chen
Pulkit Agrawal
Abhishek Gupta

Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision making tasks with a non-trivial element of exploration requires either specifying carefully designed reward functions or relying on indiscriminate, novelty seeking exploration bonuses. Human supervisors can provide effective guidance in the loop to direct the exploration process, but prior methods to leverage this guidance require constant synchronous high-quality human feedback, which is expensive and impractical to obtain. In this work, we propose a technique - Human Guided Exploration (HUGE), that is able to leverage low-quality feedback from non-expert users, which is infrequent, asynchronous and noisy, to guide exploration for reinforcement learning, without requiring careful reward specification. The key idea is to separate the challenges of directed exploration and policy learning - human feedback is used to direct exploration, while self-supervised policy learning is used to independently learn unbiased behaviors from the collected data. We show that this procedure can leverage noisy, asynchronous human feedback to learn tasks with no hand-crafted reward design or exploration bonuses. We show that HUGE is able to learn a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm can be scaled to learning directly on real-world robots.

JBHI Journal 2023 Journal Article

Generalizable Pancreas Segmentation Modeling in CT Imaging via Meta-Learning and Latent-Space Feature Flow Generation

Jun Li
Tao Chen
Xiaohua Qian

Accurate pancreas segmentation is highly crucial for diagnosing and treating pancreatic diseases. Although CNN has demonstrated promising outcomes, the performance on unseen data can be significantly compromised by the wide appearance-style variations induced by different imaging factors. Thus, we propose a generalizable pancreas segmentation model based on a meta-learning strategy and latent-space feature flow generation method. Our approach enhances the generalizability by systematically reducing the interference from the cluttered background and appearance-style discrepancies through a coarse-to-fine workflow. Specifically, the integrity-preserving coarse segmentation module is designed to adaptively balance the pancreas coverage and segmentation accuracy with the meta-learning strategy for filtering out background clutter. It also enhances the generalization of the coarse model to reasonably-accurate ROIs thereby promoting the stability of fine segmentation. Subsequently, the appearance-style feature flow generation method is developed to generate a series of progressively-varying style-related intermediate representations between two latent spaces. This feature flow effectively models the distribution variations caused by appearance-style discrepancies, and thus enhances the adaptability of the fine model. Our method achieves superior performance on three pancreas datasets and outperforms state-of-the-art generalization methods. Besides, it can be easily integrated into other workflows, leading to a potential paradigm for enhancing generalization performance.

JBHI Journal 2023 Journal Article

Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

Jun Li
Hongzhang Zhu
Tao Chen
Xiaohua Qian

Recently, numerous pancreas segmentation methods have achieved promising performance on local single-source datasets. However, these methods do not adequately account for generalizability issues, and hence typically show limited performance and low stability on test data from other sources. Considering the limited availability of distinct data sources, we seek to improve the generalization performance of a pancreas segmentation model trained with a single-source dataset, i.e., the single-source generalization task. In particular, we propose a dual self-supervised learning model that incorporates both global and local anatomical contexts. Our model aims to fully exploit the anatomical features of the intra-pancreatic and extra-pancreatic regions, and hence enhance the characterization of the high-uncertainty regions for more robust generalization. Specifically, we first construct a global-feature contrastive self-supervised learning module that is guided by the pancreatic spatial structure. This module obtains complete and consistent pancreatic features through promoting intra-class cohesion, and also extracts more discriminative features for differentiating between pancreatic and non-pancreatic tissues through maximizing inter-class separation. It mitigates the influence of surrounding tissue on the segmentation outcomes in high-uncertainty regions. Subsequently, a local-image-restoration self-supervised learning module is introduced to further enhance the characterization of the high-uncertainty regions. In this module, informative anatomical contexts are actually learned to recover randomly-corrupted appearance patterns in those regions. The effectiveness of our method is demonstrated with state-of-the-art performance and comprehensive ablation analysis on three pancreas datasets (467 cases). The results demonstrate a great potential in providing a stable support for the diagnosis and treatment of pancreatic diseases.

NeurIPS Conference 2023 Conference Paper

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Zibo Zhao
Wen Liu
Xin Chen
Xianfang Zeng
Rui Wang
Pei Cheng
Bin Fu
Tao Chen

We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing inconsistent results with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

NeurIPS Conference 2023 Conference Paper

MotionGPT: Human Motion as a Foreign Language

Biao Jiang
Xin Chen
Wen Liu
Jingyi Yu
Gang Yu
Tao Chen

Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multimodal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
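
The idea of turning motion into discrete tokens through vector quantization can be illustrated with a minimal sketch. It is not the MotionGPT implementation: the two-dimensional features, the three-entry codebook, and the `<motion_k>` token naming are assumptions chosen only to show nearest-codebook lookup.

```python
def quantize(frames, codebook):
    """Map each motion feature vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical codebook and per-frame motion features (2-D for readability).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

indices = quantize(frames, codebook)
# Assumed token naming: render indices as strings a language model could consume.
motion_tokens = [f"<motion_{k}>" for k in indices]
print(indices, motion_tokens)
```

Once motion is expressed as such tokens, it can be placed in the same sequence as text tokens, which is what makes a shared language-modeling objective possible.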

NeurIPS Conference 2023 Conference Paper

PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation

Yuhan Ding
Fukun Yin
Jiayuan Fan
Hui Li
Xin Chen
Wen Liu
Chongshan Lu
Gang Yu

Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space and propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, the region sampling based on Mip-NeRF 360 is employed to model the background representation. Extensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.

EAAI Journal 2023 Journal Article

PVF-DectNet: Multi-modal 3D detection network based on Perspective-Voxel fusion

Ke Wang
Tianqiang Zhou
Zhichuang Zhang
Tao Chen
Junlan Chen

The detection of small objects such as pedestrians still poses challenges to LiDAR-based 3D object detection due to the sparseness and disorder of point clouds. Conversely, images from cameras can provide rich semantic information, which makes these small-sized objects easier to detect. To make use of the advantages of both devices to achieve better 3D object detection, research on the fusion of LiDAR and camera information is now being conducted. The existing fusion methods between point clouds and images normally place more weight on the point clouds. Hence the semantic information of images is not fully utilized. We propose a new fusion method named PVFusion to try to fuse more image features. We first divide each point into a separate perspective voxel and project the voxel onto the image feature maps. Then the semantic feature of the perspective voxel is fused with the geometric feature of the point. A 3D object detection model (PVF-DectNet) is designed using PVFusion. During training we employ the ground truth paste (GT-Paste) data augmentation and solve the occlusion problem caused by newly added objects. The KITTI validation set is used to validate the PVF-DectNet, which shows a 3.6% AP improvement over the other feature fusion methods in pedestrian detection. On the KITTI test set, the PVF-DectNet outperforms the other multi-modal SOTA methods by 2.2% AP in pedestrian detection. PVFusion also shows better detection performance for sparse point clouds than PointFusion in both car and pedestrian categories. For the 32-beam LiDAR scene, there is a 4.2% AP increase in the moderate-difficulty car category and a 5.2% mAP improvement in the pedestrian category.

NeurIPS Conference 2022 Conference Paper

Coordinates Are NOT Lonely - Codebook Prior Helps Implicit Neural 3D representations

Fukun Yin
Wen Liu
Zilong Huang
Pei Cheng
Tao Chen
Gang Yu

Implicit neural 3D representation has achieved impressive results in surface or scene reconstruction and novel view synthesis, which typically uses the coordinate-based multi-layer perceptrons (MLPs) to learn a continuous scene representation. However, existing approaches, such as Neural Radiance Field (NeRF) and its variants, usually require dense input views (i.e., 50-150) to obtain decent results. To relieve the over-dependence on massive calibrated images and enrich the coordinate-based feature representation, we explore injecting the prior information into the coordinate-based network and introduce a novel coordinate-based model, CoCo-INR, for implicit neural 3D representation. The cores of our method are two attention modules: codebook attention and coordinate attention. The former extracts the useful prototypes containing rich geometry and appearance information from the prior codebook, and the latter propagates such prior information into each coordinate and enriches its feature representation for a scene or object surface. With the help of the prior information, our method can render 3D views with more photo-realistic appearance and geometries than the current methods using fewer calibrated images available. Experiments on various scene reconstruction datasets, including DTU and BlendedMVS, and the full 3D head reconstruction dataset, H3DS, demonstrate the robustness under fewer input views and fine detail-preserving capability of our proposed method.

AAAI Conference 2022 Conference Paper

Fast and Constrained Absent Keyphrase Generation by Prompt-Based Learning

Huanqin Wu
Baijiaxin Ma
Wei Liu
Tao Chen
Dan Nie

Generating absent keyphrases, which do not appear in the input document, is challenging in the keyphrase prediction task. Most previous works treat the problem as an autoregressive sequence-to-sequence generation task, which demonstrates promising results for generating grammatically correct and fluent absent keyphrases. However, such an end-to-end process in a completely data-driven manner is unconstrained, and is prone to generating keyphrases inconsistent with the input document. In addition, the existing autoregressive decoding method requires keyphrases to be generated from left to right, leading to slow speed during inference. In this paper, we propose a constrained absent keyphrase generation method in a prompt-based learning fashion. Specifically, the prompt is first created based on the keywords, which are defined as the overlapping words between the absent keyphrase and the document. Then, a mask-predict decoder is used to complete the absent keyphrase under the constraint of the prompt. Experiments on keyphrase generation benchmarks have demonstrated the effectiveness of our approach. In addition, we evaluate the performance of constrained absent keyphrase generation from an information retrieval perspective. The result shows that our approach can generate more consistent keyphrases, which can improve document retrieval performance. What’s more, with a non-autoregressive decoding manner, our model can speed up the absent keyphrase generation by 8.67× compared with the autoregressive method.
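
The keyword definition in this abstract (words shared by an absent keyphrase and the document) can be sketched as below. This is only an illustration, not the authors' pipeline: the whitespace tokenization, the `[MASK]` placeholder, and both function names are assumptions made for the example.

```python
def overlapping_keywords(keyphrase, document):
    """Return the keyphrase words that also occur in the document, in keyphrase order."""
    doc_words = set(document.lower().split())
    return [w for w in keyphrase.lower().split() if w in doc_words]


def build_prompt(keyphrase, document, mask="[MASK]"):
    """Keep overlapping words and mask the others (hypothetical prompt format)."""
    doc_words = set(document.lower().split())
    return " ".join(w if w in doc_words else mask for w in keyphrase.lower().split())


doc = "we study neural models for keyphrase prediction"
keywords = overlapping_keywords("neural keyphrase generation", doc)
prompt = build_prompt("neural keyphrase generation", doc)
print(keywords)  # ['neural', 'keyphrase']
print(prompt)    # neural keyphrase [MASK]
```

A decoder that fills the masked slots in parallel only has to predict the missing words, which is where the constraint and the speed advantage described in the abstract would come from.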

EAAI Journal 2022 Journal Article

Fire detection in video surveillances using convolutional neural networks and wavelet transform

Lida Huang
Gang Liu
Yan Wang
Hongyong Yuan
Tao Chen

Fire is one of the most frequent and common emergencies threatening public safety and social development. Recently, intelligent fire detection technologies represented by convolutional neural networks (CNNs) have attracted wide attention from academia and industry, substantially improving detection accuracy. However, CNN-based fire detection systems are still subject to the interference of false alarms and the limitation of computing power. In this paper, taking advantage of traditional spectral analysis in fire image detection technology, a novel Wavelet-CNN method is proposed, which applies the 2D Haar transform to extract spectral features of the image and input them into CNNs at different layer stages. Two classic backbone networks, ResNet50 and MobileNet v2 (MV2), are used to test our method, and experimental results on a benchmark fire dataset and a video dataset show that the method improves fire detection accuracy and reduces false alarms, especially for the light-weight MV2. Despite the low computational needs, the Wavelet-MV2 achieves accuracy that is comparable to state-of-the-art methods.
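
For readers unfamiliar with the 2D Haar transform mentioned here, a one-level decomposition can be sketched in plain Python. The orthonormal scaling, the sub-band names, and the list-of-lists image layout are assumptions for illustration; how the paper feeds the sub-bands into the CNN stages is not modeled.

```python
def haar2d_level1(img):
    """One-level orthonormal 2D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands:
    approx (local average), horiz (difference across columns),
    vert (difference across rows), diag (diagonal difference).
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)
            h_row.append((a - b + c - d) / 2)
            v_row.append((a + b - c - d) / 2)
            d_row.append((a - b - c + d) / 2)
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


approx, horiz, vert, diag = haar2d_level1([[1, 2], [3, 4]])
print(approx, horiz, vert, diag)  # [[5.0]] [[-1.0]] [[-2.0]] [[0.0]]
```

With this scaling the transform preserves energy: the squared sub-band values above sum to 30, the same as the squared pixel values.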

JBHI Journal 2022 Journal Article

Marginal Structural Models Using Calibrated Weights With SuperLearner: Application to Type II Diabetes Cohort

Sumeet Kalia
Olli Saarela
Tao Chen
Braden O'Neill
Christopher Meaney
Jessica Gronsbell
Ervin Sejdic
Michael Escobar

As different scientific disciplines begin to converge on machine learning for causal inference, we demonstrate the application of machine learning algorithms in the context of longitudinal causal estimation using electronic health records. Our aim is to formulate a marginal structural model for estimating diabetes care provisions in which we envisioned hypothetical (i.e., counterfactual) dynamic treatment regimes using a combination of drug therapies to manage diabetes: metformin, sulfonylurea and SGLT-2i. The binary outcome of diabetes care provisions was defined using a composite measure of chronic disease prevention and screening elements [27] including (i) primary care visit, (ii) blood pressure, (iii) weight, (iv) hemoglobin A1c, (v) lipid, (vi) ACR, (vii) eGFR and (viii) statin medication. We used several statistical learning algorithms to describe causal relationships between the prescription of three common classes of diabetes medications and quality of diabetes care using the electronic health records contained in National Diabetes Repository. In particular, we generated an ensemble of statistical learning algorithms using the SuperLearner framework based on the following base learners: (i) least absolute shrinkage and selection operator, (ii) ridge regression, (iii) elastic net, (iv) random forest, (v) gradient boosting machines, and (vi) neural network. Each statistical learning algorithm was fitted using the pseudo-population generated from the marginalization of the time-dependent confounding process. Covariate balance was assessed using the longitudinal (i.e., cumulative-time product) stabilized weights with calibrated restrictions. Our results indicated that the treatment drop-in cohorts (with respect to metformin, sulfonylurea and SGLT-2i) may have improved diabetes care provisions in relation to the treatment-naïve (i.e., no treatment) cohort. As a clinical utility, we hope that this article will facilitate discussions around the prevention of adverse chronic outcomes associated with type II diabetes through the improvement of diabetes care provisions in primary care.

NeurIPS Conference 2022 Conference Paper

Pre-Trained Language Models for Interactive Decision-Making

Shuang Li
Xavier Puig
Chris Paxton
Yilun Du
Clinton Wang
Linxi Fan
Tao Chen
De-An Huang

Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
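
The step of relabeling past "failed" experiences with new goals can be sketched as follows. This is a minimal illustration of goal relabeling in general, not the paper's procedure: the episode dictionary layout and the string-valued goals are assumptions.

```python
def relabel_failed_episodes(episodes):
    """Turn failed episodes into valid training examples for the goal actually reached."""
    relabeled = []
    for ep in episodes:
        if ep["achieved"] == ep["goal"]:
            continue  # already a success; nothing to relabel
        relabeled.append({
            "goal": ep["achieved"],         # new goal: the outcome that really happened
            "actions": list(ep["actions"]),
            "achieved": ep["achieved"],
        })
    return relabeled


# Hypothetical episodes: the second one missed its goal but still reached some state.
episodes = [
    {"goal": "apple in fridge", "actions": ["grab apple", "open fridge", "put apple"],
     "achieved": "apple in fridge"},
    {"goal": "cup on table", "actions": ["grab cup", "put cup on shelf"],
     "achieved": "cup on shelf"},
]

extra_data = relabel_failed_episodes(episodes)
print(extra_data[0]["goal"])  # cup on shelf
```

Every relabeled episode is a correct demonstration of how to reach its new goal, so it can be added to the training set without extra supervision.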

NeurIPS Conference 2022 Conference Paper

Stimulative Training of Residual Networks: A Social Psychology Perspective of Loafing

Peng Ye
Shengji Tang
Baopu Li
Tao Chen
Wanli Ouyang

Residual networks have shown great success and become indispensable in today’s deep models. In this work, we aim to re-investigate the training process of residual networks from a novel social psychology perspective of loafing, and further propose a new training strategy to strengthen the performance of residual networks. As residual networks can be viewed as ensembles of relatively shallow networks (i.e., the unraveled view) in prior works, we also start from such a view and consider that the final performance of a residual network is co-determined by a group of sub-networks. Inspired by the social loafing problem of social psychology, we find that residual networks invariably suffer from a similar problem, where sub-networks in a residual network are prone to exert less effort when working as part of the group compared to working alone. We define this previously overlooked problem as network loafing. As social loafing will ultimately cause low individual productivity and reduced overall performance, network loafing will also hinder the performance of a given residual network and its sub-networks. Referring to the solutions of social psychology, we propose stimulative training, which randomly samples a residual sub-network and calculates the KL-divergence loss between the sampled sub-network and the given residual network, to act as extra supervision for sub-networks and make the overall goal consistent. Comprehensive empirical results and theoretical analyses verify that stimulative training can well handle the loafing problem, and improve the performance of a residual network by improving the performance of its sub-networks. The code is available at https://github.com/Sunshine-Ye/NIPS22-ST.
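
The extra supervision described here (a KL-divergence term between a randomly sampled sub-network and the full residual network) can be sketched on a toy model. This is not the released code: the three scalar-style residual branches, the 0.5 keep probability, and the KL direction (full network as target) are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward(x, blocks, keep):
    """Residual forward pass; a dropped block is bypassed through its skip connection."""
    for block, use in zip(blocks, keep):
        if use:
            x = [xi + ri for xi, ri in zip(x, block(x))]
    return x


# Toy residual branches (hypothetical): each maps a vector to a same-sized vector.
blocks = [
    lambda v: [0.5 * a for a in v],
    lambda v: [0.1 * a * a for a in v],
    lambda v: [-0.2 * a for a in v],
]

rng = random.Random(0)
x = [1.0, -1.0, 0.5]
full_mask = [True] * len(blocks)
sub_mask = [rng.random() < 0.5 for _ in blocks]  # randomly sampled sub-network

p_full = softmax(forward(x, blocks, full_mask))
p_sub = softmax(forward(x, blocks, sub_mask))
extra_loss = kl_divergence(p_full, p_sub)  # assumed direction: full network is the target
print(sub_mask, extra_loss)
```

In training, a term like `extra_loss` would be added to the usual task loss so that each sampled sub-network is pushed toward the prediction of the whole network.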

AIJ Journal 2022 Journal Article

VoCSK: Verb-oriented commonsense knowledge mining with taxonomy-guided induction

Jingping Liu
Tao Chen
Chao Wang
Jiaqing Liang
Lihan Chen
Yanghua Xiao
Yunwen Chen
Ke Jin

Commonsense knowledge acquisition is one of the fundamental issues in realizing human-level AI. However, commonsense knowledge is difficult to obtain because it is a human consensus and rarely explicitly appears in texts or other data. In this paper, we focus on the automatic acquisition of a typical kind of implicit verb-oriented commonsense knowledge (e.g., “person eats food”), which is the concept-level knowledge of verb phrases. For this purpose, we propose a taxonomy-guided induction method to mine verb-oriented commonsense knowledge from verb phrases with the help of a probabilistic taxonomy. First, we design an entropy-based triplet filter to cope with noisy verb phrases. Then, we propose a joint model based on the minimum description length principle and a neural language model to generate verb-oriented commonsense knowledge. Besides, we introduce two strategies to accelerate the computation, including the simulated annealing-based approximate solution and the verb phrase clustering method. Finally, we conduct extensive experiments to prove that our solution is more effective than competitors in mining verb-oriented commonsense knowledge. We construct a commonsense knowledge base called VoCSK, containing 259 verbs and 18,406 pieces of verb-oriented commonsense knowledge. To verify the usefulness of VoCSK, we utilize the knowledge in this KB to improve the model performance on two downstream applications.

AAAI Conference 2021 Conference Paper

Empower Distantly Supervised Relation Extraction with Collaborative Adversarial Training

Tao Chen
Haochen Shi
Liyuan Liu
Siliang Tang
Jian Shao
Zhigang Chen
Yueting Zhuang

With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention is attracted to leverage multi-instance learning (MIL) to distill high-quality supervision from the noisy DS. Here, we go beyond label noise and identify the key bottleneck of DS-MIL to be its low data utilization: as high-quality supervision being refined by MIL, MIL abandons a large amount of training instances, which leads to a low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve the data utilization, which coordinates virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ the instance-level VAT to recycle instances abandoned by MIL. Besides, we deploy AT at the bag-level to unleash the full potential of the high-quality supervision got by MIL. Our proposed method brings consistent improvements (∼ 5 absolute AUC score) to the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.

JBHI Journal 2020 Journal Article

M$^3$Lung-Sys: A Deep Learning System for Multi-Class Lung Pneumonia Screening From CT Imaging

Xuelin Qian
Huazhu Fu
Weiya Shi
Tao Chen
Yanwei Fu
Fei Shan
Xiangyang Xue

To counter the outbreak of COVID-19, the accurate diagnosis of suspected cases plays a crucial role in timely quarantine, medical treatment, and preventing the spread of the pandemic. Considering the limited training cases and resources (e.g., time and budget), we propose a Multi-task Multi-slice Deep Learning System (M$^3$Lung-Sys) for multi-class lung pneumonia screening from CT imaging, which only consists of two 2D CNN networks, i.e., slice- and patient-level classification networks. The former aims to seek the feature representations from abundant CT slices instead of limited CT volumes, and for the overall pneumonia screening, the latter one could recover the temporal information by feature refinement and aggregation between different slices. In addition to distinguishing COVID-19 from Healthy, H1N1, and CAP cases, our M$^3$Lung-Sys is also able to locate the areas of relevant lesions, without any pixel-level annotation. To further demonstrate the effectiveness of our model, we conduct extensive experiments on a chest CT imaging dataset with a total of 734 patients (251 healthy people, 245 COVID-19 patients, 105 H1N1 patients, and 133 CAP patients). The quantitative results with plenty of metrics indicate the superiority of our proposed model on both slice- and patient-level classification tasks. More importantly, the generated lesion location maps make our system interpretable and more valuable to clinicians.

JBHI Journal 2019 Journal Article

Pattern Classification for Gastrointestinal Stromal Tumors by Integration of Radiomics and Deep Convolutional Features

Zhenyuan Ning
Jiaxiu Luo
Yong Li
Shuai Han
Qianjin Feng
Yikai Xu
Wufan Chen
Tao Chen

Predicting malignant potential is one of the most critical components of a computer-aided diagnosis system for gastrointestinal stromal tumors (GISTs). These tumors have been studied only on the basis of subjective computed tomography findings. Among various methodologies, radiomics and deep learning algorithms, specifically convolutional neural networks (CNNs), have recently been confirmed to achieve significant success by outperforming the state-of-the-art performance in medical image pattern classification and have rapidly become leading methodologies in this field. However, the existing methods generally use radiomics or deep convolutional features independently for pattern classification, which tend to take into account only global or local features, respectively. In this paper, we introduce and evaluate a hybrid structure that includes different features selected with radiomics model and CNNs and integrates these features to deal with GISTs classification. The Radiomics model and CNNs are constructed for global radiomics and local convolutional feature selection, respectively. Subsequently, we utilize distinct radiomics and deep convolutional features to perform pattern classification for GISTs. Specifically, we propose a new pooling strategy to assemble the deep convolutional features of 54 three-dimensional patches from the same case and integrate these features with the radiomics features for independent case, followed by random forest classifier. Our method can be extensively evaluated using multiple clinical datasets. The classification performance (area under the curve (AUC): 0.882; 95% confidence interval (CI): 0.816-0.947) consistently outperforms those of independent radiomics (AUC: 0.807; 95% CI: 0.724-0.892) and CNNs (AUC: 0.826; 95% CI: 0.795-0.856) approaches.

IROS Conference 2019 Conference Paper

Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors

Artem Molchanov
Tao Chen
Wolfgang Hönig
James A. Preiss
Nora Ayanian
Gaurav S. Sukhatme

Quadrotor stabilizing controllers often require careful, model-specific tuning for safe operation. We use reinforcement learning to train policies in simulation that transfer remarkably well to multiple different physical quadrotors. Our policies are low-level, i.e., we map the rotorcrafts’ state directly to the motor outputs. The trained control policies are very robust to external disturbances and can withstand harsh initial conditions such as throws. We show how different training methodologies (change of the cost function, modeling of noise, use of domain randomization) might affect flight performance. To the best of our knowledge, this is the first work that demonstrates that a simple neural network can learn a robust stabilizing low-level quadrotor controller (without the use of a stabilizing PD controller) that is shown to generalize to multiple quadrotors. The video of our experiments can be found at https://sites.google.com/view/sim-to-multi-quad.

NeurIPS Conference 2018 Conference Paper

Hardware Conditioned Policies for Multi-Robot Transfer Learning

Tao Chen
Adithyavairavan Murali
Abhinav Gupta

Deep reinforcement learning could be used to learn dexterous robotic policies but it is challenging to transfer them to new robots with vastly different hardware properties. It is also prohibitively expensive to learn a new policy from scratch for each robot hardware due to the high sample complexity of modern state-of-the-art algorithms. We propose a novel approach called Hardware Conditioned Policies where we train a universal policy conditioned on a vector representation of robot hardware. We considered robots in simulation with varied dynamics, kinematic structure, kinematic lengths and degrees-of-freedom. First, we use the kinematic structure directly as the hardware encoding and show great zero-shot transfer to completely novel robots not seen during training. For robots with lower zero-shot success rate, we also demonstrate that fine-tuning the policy network is significantly more sample-efficient than training a model from scratch. In tasks where knowing the agent dynamics is important for success, we learn an embedding for robot hardware and show that policies conditioned on the encoding of hardware tend to generalize and transfer well. Videos of experiments are available at: https://sites.google.com/view/robot-transfer-hcp.

EAAI Journal 2017 Journal Article

The artificial tree (AT) algorithm

Q.Q. Li
Kai Song
Z.C. He
Eric Li
A.G. Cheng
Tao Chen

Bionic intelligence algorithms have many advantages compared with traditional optimization algorithms. In this paper, inspired by the growth law of trees, a new bionic algorithm, named artificial tree (AT) algorithm is developed. In the proposed AT, the branch position is considered as the design variable. In addition, the branch is the solution, and the branch thickness is the indicator of the solution. The computing process of AT is achieved by simulating the transport of organic matters and the update of tree branches. The comparative analysis using thirty typical benchmark problems between AT algorithm and some well-known bionic intelligent methods is also performed. Based on numerical results, AT is found to be very effective in dealing with various problems.

AAAI Conference 2015 Conference Paper

VELDA: Relating an Image Tweet’s Text and Images

Tao Chen
Hany SalahEldeen
Xiangnan He
Min-Yen Kan
Dongyuan Lu

Image tweets are becoming a prevalent form of social media, but little is known about their content – textual and visual – and the relationship between the two mediums. Our analysis of image tweets shows that while visual elements certainly play a large role in image-text relationships, other factors, such as emotional elements, also factor into the relationship. We develop Visual-Emotional LDA (VELDA), a novel topic model to capture the image-text correlation from multiple perspectives (namely, visual and emotional). Experiments on real-world image tweets in both English and Chinese, and on other user-generated content, show that VELDA significantly outperforms existing methods on cross-modality image retrieval. Even in other domains where emotion does not factor in image choice directly, our VELDA model demonstrates good generalization ability, achieving higher fidelity modeling of such multimedia documents.

TCS Journal 2013 Journal Article

Optimal fault-tolerant routing algorithm and fault-tolerant diameter in directed double-loop networks

Yebin Chen
Ying Li
Tao Chen

This paper addresses the reliability of directed double-loop networks G(N; r, s), and studies the problems of optimal fault-tolerant routing, fault-tolerant diameter, etc., in G(N; r, s). Firstly, we study the shapes of the L-shaped tiles which are the minimum distance diagrams of directed double-loop networks; we divide them into four types, each with different parameters. Then, according to the solutions to the congruence equation, we study the distributions of the optimal equivalent nodes for different destination nodes, and present an optimal fault-tolerant routing algorithm and a formula for computing the fault-tolerant diameter. Finally, we prove that there is a lower bound for the fault-tolerant diameter, and show that there may be many double optimal directed double-loop networks in some infinite clusters of G(N; r, s). According to the proposed fault-tolerant routing algorithm, the reliability and transmission performance will be optimal when some faults occur in G(N; r, s).

AIJ Journal 2012 Journal Article

Model-based multidimensional clustering of categorical data

Tao Chen
Nevin L. Zhang
Tengfei Liu
Kin Man Poon
Yi Wang

Existing models for cluster analysis typically consist of a number of attributes that describe the objects to be partitioned and one single latent variable that represents the clusters to be identified. When one analyzes data using such a model, one is looking for one way to cluster data that is jointly defined by all the attributes. In other words, one performs unidimensional clustering. This is not always appropriate. For complex data with many attributes, it is more reasonable to consider multidimensional clustering, i.e., to partition data along multiple dimensions. In this paper, we present a method for performing multidimensional clustering on categorical data and show its superiority over unidimensional clustering.

IS Journal 2010 Journal Article

A Comparative Study of Mobile-Based Landmark Recognition Techniques

Kim-Hui Yap
Tao Chen
Zhen Li
Kui Wu

Mobile-based landmark recognition is becoming increasingly appealing due to the proliferation of mobile devices coupled with improving processing techniques, imaging capability, and networking infrastructure. This article provides a general overview of existing mobile-based and nonmobile-based landmark recognition systems and their differences. We discuss content and context analysis and compare landmark classification methods. We also present the experimental results of our own mobile landmark recognition evaluations based on content analysis, context analysis, and integrated content-context analysis.

ICML Conference 2010 Conference Paper

Variable Selection in Model-Based Clustering: To Do or To Facilitate

Leonard K. M. Poon
Nevin L. Zhang
Tao Chen
Yi Wang 0006

AIIM Journal 2008 Journal Article

Latent tree models and diagnosis in traditional Chinese medicine

Nevin L. Zhang
Shihong Yuan
Tao Chen
Yi Wang

Objective: TCM (traditional Chinese medicine) is an important avenue for disease prevention and treatment for the Chinese people and is gaining popularity among others. However, many remain skeptical and even critical of TCM because of a number of its shortcomings. One key shortcoming is the lack of objective diagnosis standards. We endeavor to alleviate this shortcoming using machine learning techniques. Method: TCM diagnosis consists of two steps, patient information gathering and syndrome differentiation. We focus on the latter. When viewed as a black box, syndrome differentiation is simply a classifier that classifies patients into different classes based on their symptoms. A fundamental question is: do those classes exist in reality? To seek an answer to the question from the machine learning perspective, one would naturally use cluster analysis. Previous clustering methods are unable to cope with the complexity of TCM. We have therefore developed a new clustering method in the form of latent tree models. We have conducted a case study where we first collected a data set about a TCM domain called kidney deficiency and then used latent tree models to analyze the data set. Results: Our analysis has found natural clusters in the data set that correspond well to TCM syndrome types. This is an important discovery because (1) it provides statistical validation to TCM syndrome types and (2) it suggests the possibility of establishing objective and quantitative diagnosis standards for syndrome differentiation. In this paper, we provide a summary of research work on latent tree models and report the aforementioned case study.

IROS Conference 2006 Conference Paper

A Criterion for Evaluating Competitive Teleoperation System

Xingbo Huang
Jingtai Liu
Lei Sun 0001
Weiwei Sun
Tao Chen

This paper proposes a criterion called degree of satisfaction (DoS), which is used to evaluate competitive teleoperation. We focus on the features of competitive teleoperation and use the criterion to analyze the system. To demonstrate that the degree of satisfaction is an effective criterion, a set of competitive teleoperation experiments is designed on TTRP (teleoperation/tele-game robot platform). Experimental results are presented to support our approach.

ICRA Conference 2005 Conference Paper

Competitive Multi-robot Teleoperation

Jingtai Liu
Lei Sun 0001
Tao Chen
Xingbo Huang
Chunying Zhao

This paper proposes a novel kind of multi-operator multi-robot (MOMR) teleoperation system: the competitive teleoperation system. The features and properties of the competitive teleoperation system are presented in comparison with the conventional collaborative MOMR teleoperation system. Furthermore, major concerns of research and development for this kind of system are discussed. Finally, telegame, a kind of Internet-based competitive teleoperation system, is built as a prototype to support future research on this topic, and some experimental results are presented to support the discussion.