Arrow Research search

Author name cluster

Tao Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

102 papers
2 author rows

Possible papers

102

AAAI Conference 2026 Conference Paper

CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution

  • Baoliang Tian
  • Yuxuan Si
  • Jilong Wang
  • LingYao Li
  • Zhongyuan Bao
  • Zineng Zhou
  • Tao Wang
  • Sixu Li

Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.

AAAI Conference 2026 Conference Paper

DiMA: Distinguishing Resident and Tourist Preferences via Multi-Modal LLM Alignment for Out-of-Town Cross-Domain Recommendation

  • Fan Zhang
  • Jinpeng Chen
  • Tao Wang
  • Huan Li
  • Senzhang Wang
  • Feifei Kou
  • Ye Ji
  • Kaimin Wei

Out-of-Town (OOT) recommendation aims to provide personalized suggestions for users in unfamiliar cities. However, OOT recommendation faces two fundamental challenges: the difficulty of reasoning across modalities, as preference signals in disparate formats such as images and text are hard to compare; and the preference deviation problem, since a user's resident and tourist preferences often diverge, rendering simple preference transfer ineffective. To address these challenges, we propose Distinguishing Resident and Tourist Preferences via Multi-Modal LLM Alignment for Out-of-Town Cross-Domain Recommendation (DiMA), a framework for re-ranking Points of Interest (POIs). To tackle the multimodal challenge, DiMA first leverages Multimodal Large Language Models and Large Language Models (LLMs) to transform heterogeneous POI data into unified semantic tags, enabling both cross-modal reasoning and efficient downstream processing. To address preference deviation, a "teacher" LLM executes a custom Chain-of-Thought (CoT) process to disentangle resident and tourist preferences from multi-city histories for re-ranking. Finally, a lightweight student model learns this CoT reasoning via Supervised Fine-Tuning and is then refined with Direct Preference Optimization to align with true user choices, with the potential to surpass the teacher. Extensive experiments on a real-world dataset demonstrate that DiMA significantly enhances the performance of baseline models in the OOT recommendation re-ranking task.

AAAI Conference 2026 Conference Paper

Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing

  • Li Yuan
  • Qingfei Huang
  • Bingshan Zhu
  • Yi Cai
  • Qingbao Huang
  • Changmeng Zheng
  • Zikun Deng
  • Tao Wang

Multimodal Knowledge Editing (MKE) extends traditional knowledge editing to settings involving both textual and visual modalities. However, existing MKE benchmarks primarily assess final answer correctness, neglecting the quality of intermediate reasoning and robustness to visually rephrased inputs. To address this limitation, we introduce MMQAKE, the first benchmark for multimodal multihop question answering with knowledge editing. MMQAKE evaluates: (1) a model’s ability to reason over 2–5-hop factual chains that span both text and images, including performance at each intermediate step; (2) robustness to visually rephrased inputs in multihop questions. Our evaluation shows that current MKE methods often struggle to consistently update and reason over multimodal reasoning chains following knowledge edits. To overcome these challenges, we propose Hybrid-DMKG, a hybrid reasoning framework built on a dynamic multimodal knowledge graph (DMKG) to enable accurate multihop reasoning over updated multimodal knowledge. Hybrid-DMKG first uses a large language model to decompose multimodal multihop questions into sequential sub-questions, then applies a multimodal retrieval model to locate updated facts by jointly encoding each sub-question with candidate entities and their associated images. For answer inference, a hybrid reasoning module operates over the DMKG via two parallel paths: (1) relation-linking prediction; (2) RAG Reasoning with large vision-language models. A background-reflective decision module then aggregates evidence from both paths to select the most credible answer. Experimental results on MMQAKE show that Hybrid-DMKG significantly outperforms existing MKE approaches, achieving higher accuracy and improved robustness to knowledge updates.

JBHI Journal 2026 Journal Article

USRMamba: Adaptive Routing-Guided State Space Model for Ultrasound Super-Resolution

  • Tao Wang
  • Zihan Zhou
  • Chufeng Jin
  • Tianyi Liu
  • Baike Shi
  • Guangquan Zhou
  • Rongjun Ge
  • Jean-Louis Coatrieux

In ultrasound (US) imaging, resolution degradation caused by the acoustic diffraction limit and transducer array density can significantly reduce image quality, which has negative impacts on clinical diagnosis. Super-resolution (SR) reconstruction is a more flexible and cost-effective measure compared to system upgrades. However, the complexity and diversity of tissue acoustic properties make it difficult to establish a unified model for US image SR reconstruction. In this context, this paper pioneers a Mamba-based single US image SR method, referred to as USRMamba. First, a simple and efficient Enhanced Transform Combine Module (ETCM) is designed for shallow feature extraction, which achieves multi-scale decoupling through Laplacian sharpening and wavelet transform to counter the interference of high-frequency information loss and speckle noise in US images. More importantly, an Adaptive Top-k Prompt Module (ATPM) is proposed, whose core is to generate semantic prompts through an adaptive routing-guided strategy to suppress the interference of fuzzy region labels caused by attenuation on detail reconstruction. In addition, a Frequency Channel Attention Module (FCAM) is developed, forming a "frequency-spatial domain reconstruction" modeling strategy in parallel with ATPM, further improving the fidelity of US image SR reconstruction. Qualitative and quantitative experiments demonstrate that USRMamba exhibits superior performance on several US datasets. In particular, at scale factor ×2, the proposed method achieves an average PSNR 1.31 dB higher than state-of-the-art (SOTA) methods.

IROS Conference 2025 Conference Paper

A Crab-Inspired Soft Gripper with Single-Finger Dexterous Grasping Capabilities

  • Yunce Zhang
  • Haobin Lv
  • Yixiang Liu
  • Zhe Min
  • Shizhao Zhou
  • Tao Wang
  • Shiqiang Zhu
  • Rui Song 0002

Soft grippers conform to the shape and surface properties of the objects to be grasped, effectively avoiding damage to soft and fragile items. Despite the variety of existing soft gripper designs, their structures lack sufficient flexibility for effectively grasping slender objects or operating in narrow spaces. To address these challenges, we propose a soft gripper with single-finger grasping capabilities, inspired by the structure of crab claws. The structural design and the fabrication method of the gripper are introduced, and the analytical bending model is derived. Experiments are conducted under typical operating conditions to validate the model, and the results indicate that the measured data are in good accordance with the predicted responses. Furthermore, a series of grasping experiments are carried out to test the single-finger grasping capabilities of the proposed soft gripper. The results indicate that the proposed soft gripper can efficiently and stably grasp slender or irregular objects with a single finger. In particular, it demonstrates suitability for operations in narrow spaces and shows potential for handling complex tasks. This innovative design effectively reduces the complexity of the system, while exhibiting promising capabilities in grasping slender or irregular objects and operating within restricted spaces.

TIST Journal 2025 Journal Article

A GPT-assisted Multi-Granularity Contrastive Learning approach for Knowledge Graph Entity Typing

  • Hongbin Zhang
  • Tao Wang
  • Zhuowei Wang
  • Nankai Lin
  • Chong Chen
  • Lianglun Cheng

Knowledge graph entity typing (KGET) is an efficient way to infer possible missing types for entities, and has become a key instrument for enhancing the construction of knowledge graphs (KGs). Existing models for KGET have mainly focused on a single granularity of information, such as distinct entity information, while other granularities, including entity-to-type-cluster, same-cluster, and interaction information, have not been fully explored, resulting in incorrect inferred types in KGs. To address this, we propose a GPT-assisted Multi-Granularity Contrastive Learning (GMGCL) approach that acquires entity-to-type-cluster, entity, type-cluster, and relation information through GPT-assisted entity-to-type-cluster clustering and entity-based, cluster-based, and relation-based contrastive learning, respectively. Our approach is evaluated on the FB15kET and YAGO43kET datasets, outperforming other baselines and obtaining at least a 1.35% average improvement in MRR.

NeurIPS Conference 2025 Conference Paper

ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

  • Lingfeng Wang
  • Hualing Lin
  • Senda Chen
  • Tao Wang
  • Changxu Cheng
  • Yangyang Zhong
  • Dong Zheng
  • Wuyue Zhao

While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive-length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM, which seamlessly integrates ALTo into an MLLM. Preferences over the trade-off between mask quality and efficiency are implemented via group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models will be released.

NeurIPS Conference 2025 Conference Paper

Association-Focused Path Aggregation for Graph Fraud Detection

  • Tian Qiu
  • Wenda Li
  • Zunlei Feng
  • Jie Lei
  • Tao Wang
  • Yi Gao
  • Mingli Song
  • Yang Gao

Fraudulent activities have caused substantial negative social impacts and are exhibiting emerging characteristics such as intelligence and industrialization, posing challenges of high-order interactions, intricate dependencies, and the sparse yet concealed nature of fraudulent entities. Existing graph fraud detectors are limited by their narrow "receptive fields", as they focus only on the relations between an entity and its neighbors while neglecting longer-range structural associations hidden between entities. To address this issue, we propose a novel fraud detector based on Graph Path Aggregation (GPA). It operates through variable-length path sampling, semantic-associated path encoding, path interaction and aggregation, and aggregation-enhanced fraud detection. To further facilitate interpretable association analysis, we synthesize G-Internet, the first benchmark dataset in the field of internet fraud detection. Extensive experiments across datasets in multiple fraud scenarios demonstrate that the proposed GPA outperforms mainstream fraud detectors by up to +15% in Average Precision (AP). Additionally, GPA exhibits enhanced robustness to noisy labels and provides excellent interpretability by uncovering implicit fraudulent patterns across broader contexts. Code is available at https://github.com/horrible-dong/GPA.

IJCAI Conference 2025 Conference Paper

Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction

  • Li Yuan
  • Yi Cai
  • Xudong Shen
  • Qing Li
  • Qingbao Huang
  • Zikun Deng
  • Tao Wang

Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance. To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model’s generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.

JBHI Journal 2025 Journal Article

CSAI: Conditional Self-Attention Imputation for Healthcare Time-series

  • Linglong Qian
  • Joseph Arul Raj
  • Hugh Logan-Ellis
  • Ao Zhang
  • Yuezhou Zhang
  • Tao Wang
  • Richard JB Dobson
  • Zina Ibrahim

We introduce the Conditional Self-Attention Imputation (CSAI) model, a novel recurrent neural network architecture designed to address imputation challenges in multivariate time series derived from hospital electronic health records (EHRs). CSAI introduces key novelties specific to EHR data: a) attention-based hidden state initialisation to capture both long- and short-range temporal dependencies, b) domain-informed temporal decay to mimic clinical recording patterns, and c) a non-uniform masking strategy that models non-random missingness. Comprehensive evaluation across four EHR benchmark datasets demonstrates CSAI's effectiveness compared to state-of-the-art architectures in data restoration and downstream tasks. CSAI is integrated into PyPOTS, an open-source Python toolbox for partially observed time series. This work significantly advances the state of neural network imputation applied to EHRs by more closely aligning algorithmic imputation with clinical realities.

JBHI Journal 2025 Journal Article

Design of a Multi-Parameter Fusion Sensor and System for Respiratory Monitoring of Mechanically Ventilated Patients in the ICU

  • Shuai Ren
  • Xiaohan Wang
  • Maolin Cai
  • Yan Shi
  • Tao Wang
  • Zujin Luo

In order to achieve precise respiratory therapy for mechanically ventilated patients, real-time monitoring of the state parameters of inhaled and exhaled gases is required. These parameters are primarily measured by ventilators, with limitations such as insufficient monitoring parameters, circuit leaks, and constraints imposed by distance and obstacles. This paper designs a low-power wireless sensor for multi-parameter monitoring near the patient, which can be used continuously for approximately 60 days. Based on this sensor, an intelligent respiratory monitoring system with a distributed architecture is proposed to achieve intelligent patient-ventilator asynchrony (PVA) perception. Experimental results show that the system can stably and accurately collect and transmit data, with measurement errors for pressure, flow, temperature, humidity, and CO₂ concentration being ±1.3%, ±2.1%, ±0.6°, ±1% RH, and ±0.3 mmHg respectively. The proposed sensor and system have the potential to enhance the efficiency and intelligence of medical care significantly.

IROS Conference 2025 Conference Paper

Enhancing the Flexibility of a Quadruped Robot with a 2-DOF Active Spine Using Nonlinear Model Predictive Control

  • Zeyi Yang
  • Zhiyong Xu
  • Haoming Rong
  • Shaolin Mo
  • Yuying Chen
  • Zujian Chen
  • Tao Wang
  • Hui Cheng

For quadrupeds, a flexible spine allows them to traverse space and make quick turns. From the perspective of mechanical design in quadruped robots, an active spine with 2 degrees of freedom (2-DOF) can achieve dynamic posture adjustment similar to biological organisms, allowing for pitch and yaw control. In this work, we present a novel approach to enhance the flexibility of a quadruped robot, Yatsen Lion II, by incorporating a 2-DOF active spine, which is mechanically designed as a linkage-driven parallelogram mechanism. To optimize its motion, we utilize nonlinear model predictive control (NMPC), which combines centroidal dynamics with full kinematics. By incorporating the two extra DOFs of the spinal joint into the generalized coordinates and velocities, we represent the robot as a hybrid dynamic system, capturing the intricate interplay between the legs and spine. Centroidal dynamics act as a crucial bridge between joint movements and the robot's overall momentum, enabling the controller to synchronize the quadruped's movements with dynamic spinal adjustments and adaptive gait patterns. We validate our approach through both simulation and real-world experiments. We compare the spined quadruped robot to its rigid-spine counterpart across key locomotion metrics, including in-place turning, straight-line speed, and turning radius. The results indicate that the spined quadrupedal robot outperforms its rigid counterpart by up to 26%, highlighting its flexibility.

NeurIPS Conference 2025 Conference Paper

Foundations of Top-$k$ Decoding for Language Models

  • Georgy Noarov
  • Soham Mallick
  • Tao Wang
  • Sunay Joshi
  • Yan Sun
  • Yangxinyu Xie
  • Mengxin Yu
  • Edgar Dobriban

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce *Bregman decoders* obtained by minimizing a separable Bregman divergence (for both the *primal* and *dual* cases) with a sparsity-inducing $\ell_0$-regularization; in particular, these decoders are *adaptive* in the sense that the sparsity parameter $k$ is chosen depending on the underlying token distribution. Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that (i) the optimal decoding strategies are greedy, and further that (ii) the objective is discretely convex in $k$, such that the optimal $k$ can be identified in logarithmic time. We note that standard top-$k$ decoding arises as a special case for the KL divergence, and construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
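The standard top-k procedure that the paper generalizes can be sketched in a few lines; the function name and toy logits below are illustrative, not taken from the paper:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Standard top-k decoding: keep the k largest next-token
    probabilities, renormalize them to sum to one, then sample."""
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]            # indices of the k largest probs
    kept = probs[top]
    kept /= kept.sum()                      # renormalize to unity
    return rng.choice(top, p=kept)

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, 0.5])
token = top_k_sample(logits, k=2, rng=rng)  # one of the two largest: 1 or 2
```

With k = 1 this reduces to greedy decoding; the adaptive Bregman decoders in the paper instead choose k per token from the underlying distribution.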

NeurIPS Conference 2025 Conference Paper

Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

  • Tao Wang
  • Mengyu Li
  • Geduo Zeng
  • Cheng Meng
  • Qiong Zhang

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantees. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% of the Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering.
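The abstract frames compaction as global Gaussian mixture reduction. A standard mixture-reduction primitive, shown here as a hedged sketch (the paper's transport-divergence update over a KD-tree partition may differ), is moment-preserving merging of the Gaussians assigned to one partition cell:

```python
import numpy as np

def merge_gaussians(weights, means, covs):
    """Merge a group of weighted Gaussians into a single component that
    matches the group's total weight, mean, and covariance (moment
    preservation). A generic mixture-reduction step, not the paper's
    exact optimal-transport update."""
    w = np.sum(weights)
    mu = np.einsum('i,ij->j', weights, means) / w
    cov = np.zeros_like(covs[0])
    for wi, mi, ci in zip(weights, means, covs):
        d = (mi - mu)[:, None]
        cov += wi * (ci + d @ d.T)   # within- plus between-component spread
    return w, mu, cov / w
```

Applied per KD-tree cell, such a merge replaces many primitives with one while preserving the cell's first two moments; appearance attributes would then be fine-tuned separately, as the abstract describes.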

JBHI Journal 2025 Journal Article

High-Frequency SSVEP-BCI With Row-Column Dual-Frequency Encoding and Decoding Strategy for Reduced Training Data

  • Yufeng Ke
  • Xiaohe Chen
  • Wei Xu
  • Tao Wang
  • Shuaishuai Shen
  • Dong Ming

Steady-state visual evoked potentials (SSVEP)-based brain-computer interfaces (BCIs) have the potential to be utilized in various fields due to their high accuracies and information transfer rates (ITR). High-frequency (HF) visual stimuli have shown promise in reducing visual fatigue and enhancing user comfort. However, these HF-SSVEP-BCIs often face limitations in the number of commands and typically require extensive individual training data to achieve high performance. In this study, we proposed a row-column dual-frequency encoding and decoding method using HF stimulation to develop a comfortable BCI system that supports multiple commands and reduces training costs. We arranged 20 targets in a matrix of five rows and four columns, with each target modulated by left-and-right field stimulation using two frequency-phase combinations. Targets in each row or column share a unique frequency-phase combination, allowing EEG data from the same row or column to be used collectively to train a row/column index decoding model for target identification. To evaluate the performance of our method, we constructed a 20-target asynchronous robotic arm control system with the adaptive window method. With only four training trials per target, the online system achieved an ITR of 105.14 ± 14.15 bits/min, a true positive rate of 98.18 ± 2.87%, a false positive rate of 7.39 ± 6.73%, and a classification accuracy of 91.88 ± 5.75%, with an average data length of 925.70 ± 45.44 ms. These results indicate that the proposed protocol can deliver accurate and rapid command outputs for a comfortable SSVEP-based BCI with minimal training data and fewer frequencies.
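ITR figures like the one reported are conventionally computed with the Wolpaw formula from the number of targets N, the classification accuracy P, and the average selection time T. A sketch under that standard definition (the paper's exact selection-time accounting, e.g. gaze-shift intervals, is not specified here, so the numbers below are not meant to reproduce its 105.14 bits/min):

```python
import math

def wolpaw_itr(n_targets, accuracy, selection_time_s):
    """Wolpaw information transfer rate in bits/min: bits per selection
    times selections per minute."""
    n, p = n_targets, accuracy
    if p >= 1.0:
        bits = math.log2(n)
    else:
        bits = (math.log2(n) + p * math.log2(p)
                + (1 - p) * math.log2((1 - p) / (n - 1)))
    return bits * 60.0 / selection_time_s
```

For example, at perfect accuracy a 20-target system carries log2(20) ≈ 4.32 bits per selection; lower accuracy and longer selection windows both reduce the ITR.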

JBHI Journal 2025 Journal Article

How Deep is Your Guess? A Fresh Perspective on Deep Learning for Medical Time-Series Imputation

  • Linglong Qian
  • Hugh Logan Ellis
  • Tao Wang
  • Jun Wang
  • Robin Mitra
  • Richard Dobson
  • Zina Ibrahim

We present a comprehensive analysis of deep learning approaches for Electronic Health Record (EHR) time-series imputation, examining how the interplay between architectural and framework design decisions gives rise to higher-level properties of a given deep imputer model and distinct biases towards complex data characteristics. Our investigation reveals the varying capabilities of deep imputers in capturing complex spatio-temporal dependencies within EHRs, and that the effectiveness of the model depends on how its combined biases align with the characteristics of the medical time series. Our experimental evaluation challenges common assumptions about model complexity, demonstrating that larger models do not necessarily improve performance. Rather, carefully designed architectures can better capture the complex patterns inherent in clinical data. The study highlights the need for imputation approaches that prioritise clinically meaningful data reconstruction over statistical accuracy. Our experiments further reveal variations of up to 20% in imputation performance based on preprocessing and implementation choices, emphasising the need for standardised benchmarking methodologies. Finally, we identify critical gaps between current deep imputation methods and medical requirements, highlighting the importance of integrating clinical insights to achieve more reliable imputation approaches for healthcare applications.

IROS Conference 2025 Conference Paper

IMM-MOT: A Novel 3D Multi-object Tracking Framework with Interacting Multiple Model Filter

  • Xiaohong Liu
  • Xulong Zhao
  • Gang Liu
  • Zili Wu
  • Tao Wang
  • Lei Meng
  • Yuhan Wang

3D Multi-Object Tracking (MOT) provides the trajectories of surrounding objects, assisting robots or vehicles in smarter path planning and obstacle avoidance. Existing 3D MOT methods based on the Tracking-by-Detection framework typically use a single motion model to track an object throughout its entire tracking process. However, objects may change their motion patterns due to variations in the surrounding environment. In this paper, we introduce the Interacting Multiple Model filter in IMM-MOT, which accurately fits the complex motion patterns of individual objects, overcoming the limitation of single-model tracking in existing approaches. In addition, we incorporate a Damping Window mechanism into the trajectory lifecycle management, leveraging the continuous association status of trajectories to control their creation and termination, reducing the occurrence of overlooked low-confidence true targets. Furthermore, we propose the Distance-Based Score Enhancement module, which enhances the differentiation between false positives and true positives by adjusting detection scores, thereby improving the effectiveness of the Score Filter. On the NuScenes Val dataset, IMM-MOT outperforms most other single-modal models using 3D point clouds, achieving an AMOTA of 73.8%. Our project is available at https://github.com/Ap01lo/IMM-MOT.
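At the core of the Interacting Multiple Model filter the paper builds on is a per-step update of model probabilities: predicted probabilities are mixed through a Markov transition matrix and then reweighted by each motion model's measurement likelihood. A minimal sketch of that update (variable names are illustrative; the full IMM also mixes the per-model state estimates, which is omitted here):

```python
import numpy as np

def imm_model_probs(mu, trans, likelihoods):
    """One IMM model-probability update.
    mu: current model probabilities, shape (m,)
    trans: Markov model-transition matrix, trans[i, j] = P(j | i)
    likelihoods: measurement likelihood under each model, shape (m,)"""
    c = trans.T @ mu                 # mixing / prediction: c_j = sum_i p_ij mu_i
    mu_new = likelihoods * c         # Bayes reweighting per model
    return mu_new / mu_new.sum()     # renormalize

# Two motion models, equal prior; model 0 explains the measurement twice as well.
probs = imm_model_probs(np.array([0.5, 0.5]), np.eye(2), np.array([2.0, 1.0]))
```

This is how the tracker shifts weight toward whichever motion model (e.g. constant velocity vs. turning) currently explains an object's measurements best.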

JBHI Journal 2025 Journal Article

Improving Patient-Ventilator Synchrony During Pressure Support Ventilation Based on Reinforcement Learning Algorithm

  • Liming Hao
  • Xiaohan Wang
  • Shuai Ren
  • Yan Shi
  • Maolin Cai
  • Tao Wang
  • Zujin Luo

Mechanical ventilation is an effective treatment for critically ill patients and those with pulmonary diseases. However, patient-ventilator asynchrony (PVA) remains a significant challenge, potentially leading to high mortality. Improving patient-ventilator synchrony poses a complex decision-making problem in clinical practice. Traditional methods rely heavily on clinicians' experience, often resulting in inefficiencies, delayed ventilator adjustments, and resource shortages. This paper proposes a novel approach using a deep reinforcement learning (RL) algorithm based on deep Q-learning (DQN) to enhance patient-ventilator synchrony during pressure support ventilation. The action space and reward function are established from clinical experience, and a pneumatic model of the mechanical ventilation system is constructed to simulate various patient conditions and types of PVAs. Clinical data are used to evaluate the RL algorithm qualitatively and quantitatively. The RL-optimized ventilation strategy reduces the proportion of breaths containing PVAs from 37.52% to 7.08%, demonstrating its effectiveness in assisting clinical decision-making, improving synchrony, and enabling intelligent ventilator control, bedside monitoring, and automatic weaning.

ICML Conference 2025 Conference Paper

Improving Value Estimation Critically Enhances Vanilla Policy Gradient

  • Tao Wang
  • Ruipeng Zhang
  • Sicun Gao

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
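The paper's central intervention, taking more value-function update steps per policy iteration, can be illustrated with a minimal linear value fit; everything below (linear features, the learning rate, the synthetic data) is an illustrative assumption, not the paper's neural-network setup:

```python
import numpy as np

def fit_value(features, returns, n_value_steps, lr=0.1):
    """The knob the paper studies: number of value-function gradient steps
    per iteration. More steps give a more accurate baseline V(s), and hence
    lower-variance advantages R - V(s) for the vanilla policy-gradient update."""
    w = np.zeros(features.shape[1])
    for _ in range(n_value_steps):
        pred = features @ w
        grad = features.T @ (pred - returns) / len(returns)  # grad of mean squared error / 2
        w -= lr * grad
    return w

# Per-iteration structure suggested by the abstract:
#   1. collect rollouts, compute Monte Carlo returns
#   2. run many value steps:      w = fit_value(feats, returns, n_value_steps)
#   3. one vanilla PG step using: advantages = returns - feats @ w
```

On a toy regression this already shows the effect: 200 value steps leave a far smaller value-estimation error than a single step per iteration.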

JBHI Journal 2025 Journal Article

Integrative Graph-Based Framework for Predicting circRNA Drug Resistance Using Disease Contextualization and Deep Learning

  • Yongtian Wang
  • Wenkai Shen
  • Yewei Shen
  • Shang Feng
  • Tao Wang
  • Xuequn Shang
  • Jiajie Peng

Circular RNAs (circRNAs) play a crucial role in gene regulation and have been implicated in the development of drug resistance in cancer, representing a significant challenge in oncological therapeutics. Despite advancements in computational models predicting RNA-drug interactions, existing frameworks often overlook the complex interplay between circRNAs, drug mechanisms, and disease contexts. This study aims to bridge this gap by introducing a novel computational model, circRDRP, that enhances prediction accuracy by integrating disease-specific contexts into the analysis of circRNA-drug interactions. It employs a hybrid graph neural network that combines features from Graph Attention Networks (GAT) and Graph Convolutional Networks (GCN) in a two-layer structure, with further enhancement through convolutional neural networks. This approach allows for sophisticated feature extraction from integrated networks of circRNAs, drugs, and diseases. Our results demonstrate that the circRDRP model outperforms existing models in predicting drug resistance, showing significant improvements in accuracy, precision, and recall. Specifically, the model shows robust predictive capability in case studies involving major anticancer drugs such as Cisplatin and Methotrexate, indicating its potential utility in precision medicine. In conclusion, circRDRP offers a powerful tool for understanding and predicting drug resistance mediated by circRNAs, with implications for designing more effective cancer therapies.

ICML Conference 2025 Conference Paper

Open-Det: An Efficient Learning Framework for Open-Ended Detection

  • Guiping Cao
  • Tao Wang
  • Wenjian Huang 0001
  • Xiangyuan Lan
  • Jianguo Zhang 0001
  • Dongmei Jiang

Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det.

JBHI Journal 2025 Journal Article

Resting-State Electroencephalographic Signatures Predict Treatment Efficacy of tACS for Refractory Auditory Hallucinations in Schizophrenic Patients

  • Xiaojuan Wang
  • Ruxin Hu
  • Tao Wang
  • Yuan Chang
  • Xiaoya Liu
  • Meijuan Li
  • Ying Gao
  • Shuang Liu

Transcranial alternating current stimulation (tACS) has been reported to treat refractory auditory hallucinations in schizophrenia. However, as with all treatments currently employed in clinical practice, tACS does not demonstrate uniform efficacy across all patients. This study aims to find biomarkers predicting individual responses to tACS, guiding treatment decisions and preventing healthcare resource wastage. We divided 17 schizophrenic patients with refractory auditory hallucinations into responsive (RE) and non-responsive (NR) groups based on their auditory hallucination symptom reduction rates after one month of tACS treatment. The pre-treatment resting-state electroencephalogram (rsEEG) was recorded, from which we computed absolute power spectral density (PSD) and Hjorth parameters (HPs: Hjorth activity (HA), Hjorth mobility (HM), and Hjorth complexity (HC)) from different frequency bands to characterize the brain oscillations. The results demonstrated that statistically significant differences were localized within the high gamma frequency bands of the right brain hemisphere. We then input the significant dissociable features into popular machine learning algorithms; the Cascade Forward Neural Network achieved the best recognition accuracy of 93.87%. These findings preliminarily imply that high gamma oscillations in the right brain hemisphere may be the main factor underlying different responses to tACS treatment, and that incorporating rsEEG signatures could improve personalized decisions for integrating tACS into clinical treatment.
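The Hjorth parameters mentioned above have standard closed-form definitions; a minimal sketch using those standard formulas (not the authors' pipeline):

```python
import numpy as np

# Hjorth parameters of a 1-D signal:
#   activity   = var(x)
#   mobility   = sqrt(var(x') / var(x))
#   complexity = mobility(x') / mobility(x)
def hjorth(x):
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

t = np.linspace(0, 1, 1000, endpoint=False)
a, m, c = hjorth(np.sin(2 * np.pi * 5 * t))  # pure 5 Hz sine
```

For a pure sinusoid the complexity is close to 1, since the derivative of a sine is a sinusoid of the same frequency; broadband signals yield larger values.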

NeurIPS Conference 2025 Conference Paper

SALS: Sparse Attention in Latent Space for KV Cache Compression

  • Junlin Mu
  • Hantao Huang
  • Jihang Zhang
  • Minghui Yu
  • Tao Wang
  • Yidong Li

Large Language Models (LLMs) capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value (KV) cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding (RoPE) mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models, LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For end-to-end throughput, we achieve 1.4-fold and 4.5-fold improvements compared to GPT-fast on 4K and 32K sequences, respectively. The source code will be publicly available in the future.
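The latent-space mechanism the abstract describes can be sketched as follows; the projection here is an arbitrary orthonormal matrix standing in for whatever projection SALS actually uses, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, k = 64, 16, 512, 32  # head dim, latent rank, cached tokens, tokens kept

# Hypothetical low-rank projection with orthonormal columns (d x r).
P = np.linalg.qr(rng.normal(size=(d, r)))[0]

K = rng.normal(size=(n, d))      # full key cache (never stored in this scheme)
K_lat = K @ P                    # compact latent cache: n x r, a d/r = 4x compression

q = rng.normal(size=(d,))
scores = K_lat @ (P.T @ q)       # RoPE-free query-key scoring done in latent space
top = np.argsort(scores)[-k:]    # sparse selection of the most relevant tokens

K_rec = K_lat[top] @ P.T         # reconstruct only k rows, not the full cache
```

The point of the sketch is the asymmetry: scoring touches only the r-dimensional latent cache, and full d-dimensional keys are rebuilt for just the k selected tokens.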

IROS Conference 2025 Conference Paper

Seamless Transition Control in Spring-Legged Quadrotors: A Hybrid Dynamics Perspective with Guaranteed Feasibility

  • Hongli Li
  • Botao Zhang
  • Rui Mao
  • Tao Wang
  • Hui Cheng

Legged aerial-terrestrial robots have garnered significant research attention in recent years due to their enhanced environmental adaptability through combined aerial and terrestrial locomotion. However, existing passive spring-legged aerial robots exhibit limited motion versatility, demonstrating single stance gait during ground impacts, which constrains their task adaptability and creates substantial challenges in hybrid trajectory optimization and switching control. To address these difficulties, this work presents a systematic solution to achieve diverse hybrid locomotion. We innovatively establish the differential flatness property for spring-legged quadrotors in both aerial and terrestrial domains, and propose a unified hybrid trajectory optimization framework that generates smooth, agile, and dynamically feasible multi-modal trajectories incorporating diverse stance gait patterns. Furthermore, a hybrid nonlinear model predictive controller with a trajectory extension strategy is developed to enhance hybrid tracking precision and mode transition execution. Compared to existing methods, we achieve a 27% reduction in tracking error during hybrid locomotion while maintaining high-precision foot placement. The source code will be released to benefit the community.

JBHI Journal 2024 Journal Article

An Accurate Non-Contact Photoplethysmography via Active Cancellation of Reflective Interference

  • Yonggang Tong
  • Zhipei Huang
  • Feng Qiu
  • Tao Wang
  • Yiquan Wang
  • Fei Qin
  • Ming Yin

Imaging Photoplethysmography (IPPG) is an emerging and efficient optical method for non-contact measurement of pulse waves using an image sensor. While the contactless way brings convenience, the inevitable distance between the sensor and the subject results in massive specular reflection interference on the skin surface, which leads to a low Signal to Interference plus Noise Ratio (SINR) of IPPG. To ease this challenge, this work proposes a novel modulation illumination approach to measure the accurate arterial pulse wave via surface reflection interference isolation from IPPG. Based on the proposed skin reflection model, a specific modulation illumination is designed to separate the surface reflections and obtain the subcutaneous diffuse reflections containing the pulse wave information. Compared with the results under ambient illumination and constant supplemental illumination, the SINR of the proposed method is improved by 4.56 and 3.74 dB, respectively.

IROS Conference 2024 Conference Paper

An Online Automatic Calibration Method for Infrastructure-Based LiDAR-Camera via Cross-modal Object Matching

  • Tao Wang
  • Yuesheng He
  • Hanyang Zhuang
  • Ming Yang 0002

In indoor environments where the Global Navigation Satellite System (GNSS) is not available, the infrastructure-based LiDAR-camera joint array can provide high-precision localization for mobile robots, such as in Autonomous Valet Parking (AVP). The primary challenge in employing the infrastructure-based LiDAR-camera joint array is the extrinsic calibration between the LiDAR and the camera. Moreover, to handle interference deviation caused by vibrations or inadequate mounting stiffness during operation, the calibration's extrinsic parameters must be automatically updated online, placing higher demands on infrastructure-based LiDAR-camera extrinsic calibration. This paper proposes an infrastructure LiDAR-camera online automatic calibration method based on prior knowledge of cross-modal target registration. The method requires no manual targets or initial pose guesses to achieve extrinsic calibration. The object-prior model, based on a lightweight object detection algorithm, can rapidly detect scenes favorable for extrinsic calibration in sub-images of camera images. This creates favorable conditions for cross-modal registration and LiDAR-camera pose optimization. Additionally, because a lightweight algorithm is used, the process does not compromise efficiency or consume excessive computational resources. Experimental results demonstrate that the proposed calibration method is suitable for calibrating infrastructure-based LiDAR-camera arrays, with comparable accuracy and the ability to perform online calibration. Comparative experiments also show that the object-prior model can indeed select better scenes for LiDAR-camera extrinsic calibration, improving the accuracy and stability of extrinsic calibration to some extent.

ICML Conference 2024 Conference Paper

Controlled Decoding from Language Models

  • Sidharth Mudgal
  • Jong Lee
  • Harish Ganapathy
  • YaGuang Li
  • Tao Wang
  • Yanping Huang
  • Zhifeng Chen
  • Heng-Tze Cheng

KL-regularized reinforcement learning (RL) is a popular alignment framework to control the language model responses towards high reward outcomes. We pose a tokenwise RL objective and propose a modular solver for it, called controlled decoding (CD). CD exerts control through a separate prefix scorer module, which is trained to learn a value function for the reward. The prefix scorer is used at inference time to control the generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that prefix scorers for multiple rewards may be combined at inference time, effectively solving a multi-objective RL problem with no additional training. We show that the benefits of applying CD transfer to an unseen base model with no further tuning as well. Finally, we show that CD can be applied in a blockwise decoding fashion at inference-time, essentially bridging the gap between the popular best-of-$K$ strategy and tokenwise control through reinforcement learning. This makes CD a promising approach for alignment of language models.
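The tokenwise control described above, a frozen base model steered by a separate prefix scorer at decode time, can be sketched schematically; the value vector and the additive blending rule here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Schematic tokenwise controlled decoding: sample from
# softmax(base_logits + alpha * V), where V[t] is a hypothetical prefix
# scorer's value for appending token t to the current prefix.
def cd_step(base_logits, values, alpha=1.0):
    z = base_logits + alpha * values
    z = z - z.max()                  # numerical stability
    p = np.exp(z)
    return p / p.sum()

base = np.array([2.0, 1.0, 0.0])
vals = np.array([0.0, 0.0, 5.0])     # scorer strongly prefers token 2
p0 = cd_step(base, vals, alpha=0.0)  # alpha=0 recovers the frozen base model
p1 = cd_step(base, vals, alpha=1.0)  # alpha>0 shifts mass toward high-value tokens
```

Because control lives entirely in the added term, scorers for multiple rewards can be summed at inference time, which is the multi-objective property the abstract highlights.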

ECAI Conference 2024 Conference Paper

CorrAdaptor: Adaptive Local Context Learning for Correspondence Pruning

  • Wei Zhu
  • Yicheng Liu
  • Yuping He
  • Tangfei Liao
  • Kang Zheng
  • Xiaoqiu Xu
  • Tao Wang
  • Tong Lu

In the fields of computer vision and robotics, accurate pixel-level correspondences are essential for enabling advanced tasks such as structure-from-motion and simultaneous localization and mapping. Recent correspondence pruning methods usually focus on learning local consistency through k-nearest neighbors, which makes it difficult to capture robust context for each correspondence. We propose CorrAdaptor, a novel architecture that introduces a dual-branch structure capable of adaptively adjusting local contexts through both explicit and implicit local graph learning. Specifically, the explicit branch uses KNN-based graphs tailored for initial neighborhood identification, while the implicit branch leverages a learnable matrix to softly assign neighbors and adaptively expand the local context scope, significantly enhancing the model’s robustness and adaptability to complex image variations. Moreover, we design a motion injection module to integrate motion consistency into the network to suppress the impact of outliers and refine local context learning, resulting in substantial performance improvements. The experimental results on extensive correspondence-based tasks indicate that our CorrAdaptor achieves state-of-the-art performance both qualitatively and quantitatively.

NeurIPS Conference 2024 Conference Paper

DFA-GNN: Forward Learning of Graph Neural Networks by Direct Feedback Alignment

  • Gongpei Zhao
  • Tao Wang
  • Congyan Lang
  • Yi Jin
  • Yidong Li
  • Haibin Ling

Graph neural networks (GNNs) are recognized for their strong performance across various applications, with the backpropagation (BP) algorithm playing a central role in the development of most GNN models. However, despite its effectiveness, BP has limitations that challenge its biological plausibility and affect the efficiency, scalability and parallelism of training neural networks for graph-based tasks. While several non-backpropagation (non-BP) training algorithms, such as direct feedback alignment (DFA), have been successfully applied to fully-connected and convolutional network components for handling Euclidean data, directly adapting these non-BP frameworks to manage non-Euclidean graph data in GNN models presents significant challenges. These challenges primarily arise from the violation of the independent and identically distributed (i.i.d.) assumption in graph data and the difficulty in accessing prediction errors for all samples (nodes) within the graph. To overcome these obstacles, in this paper we propose DFA-GNN, a novel forward learning framework tailored for GNNs with a case study of semi-supervised learning. The proposed method breaks the limitations of BP by using a dedicated forward training mechanism. Specifically, DFA-GNN extends the principles of DFA to adapt to graph data and the unique architecture of GNNs, incorporating the information of graph topology into the feedback links to accommodate the non-Euclidean characteristics of graph data. Additionally, for semi-supervised graph learning tasks, we developed a pseudo error generator that spreads residual errors from training data to create a pseudo error for each unlabeled node. These pseudo errors are then utilized to train GNNs using DFA. Extensive experiments on 10 public benchmarks reveal that our learning framework outperforms not only previous non-BP methods but also standard BP methods, and it exhibits excellent robustness against various types of noise and attacks.
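The DFA principle that the paper extends to GNNs can be illustrated on a plain two-layer network, where a fixed random matrix replaces the transpose of the forward weights in the feedback path (a schematic sketch of DFA itself, not the paper's GNN variant):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 8, 16, 4
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
W2 = rng.normal(scale=0.1, size=(n_hid, n_out))
B = rng.normal(scale=0.1, size=(n_out, n_hid))  # fixed random feedback, never trained

x = rng.normal(size=(32, n_in))
y = rng.normal(size=(32, n_out))

loss0 = np.mean((np.tanh(x @ W1) @ W2 - y) ** 2)

lr = 0.05
for _ in range(300):
    h = np.tanh(x @ W1)
    e = h @ W2 - y                  # output error
    dW2 = h.T @ e / len(x)          # local delta rule at the output layer
    dh = (e @ B) * (1 - h ** 2)     # DFA: random projection B replaces W2.T
    W1 -= lr * (x.T @ dh / len(x))
    W2 -= lr * dW2

loss_final = np.mean((np.tanh(x @ W1) @ W2 - y) ** 2)
```

No error signal ever traverses `W2` backwards, which is what makes the scheme forward-only and layer-parallel; the paper's contribution is routing such feedback through graph topology instead of a dense `B`.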

NeurIPS Conference 2024 Conference Paper

Generated and Pseudo Content guided Prototype Refinement for Few-shot Point Cloud Segmentation

  • Lili Wei
  • Congyan Lang
  • Ziyi Chen
  • Tao Wang
  • Yidong Li
  • Jun Liu

Few-shot 3D point cloud semantic segmentation aims to segment query point clouds with only a few annotated support point clouds. Existing prototype-based methods learn prototypes from the 3D support set to guide the segmentation of query point clouds. However, they encounter the challenge of low prototype quality due to constrained semantic information in the 3D support set and class information bias between support and query sets. To address these issues, in this paper, we propose a novel framework called Generated and Pseudo Content guided Prototype Refinement (GPCPR), which explicitly leverages LLM-generated content and reliable query context to enhance prototype quality. GPCPR achieves prototype refinement through two core components: LLM-driven Generated Content-guided Prototype Refinement (GCPR) and Pseudo Query Context-guided Prototype Refinement (PCPR). Specifically, GCPR integrates diverse and differentiated class descriptions generated by large language models to enrich prototypes with comprehensive semantic knowledge. PCPR further aggregates reliable class-specific pseudo-query context to mitigate class information bias and generate more suitable query-specific prototypes. Furthermore, we introduce a dual-distillation regularization term, enabling knowledge transfer between early-stage entities (prototypes or pseudo predictions) and their deeper counterparts to enhance refinement. Extensive experiments demonstrate the superiority of our method, surpassing the state-of-the-art methods by up to 12.10% and 13.75% mIoU on S3DIS and ScanNet, respectively.

ICML Conference 2024 Conference Paper

Mollification Effects of Policy Gradient Methods

  • Tao Wang
  • Sylvia L. Herbert
  • Sicun Gao

Policy gradient methods have enabled deep reinforcement learning (RL) to approach challenging continuous control problems, even when the underlying systems involve highly nonlinear dynamics that generate complex non-smooth optimization landscapes. We develop a rigorous framework for understanding how policy gradient methods mollify non-smooth optimization landscapes to enable effective policy search, as well as the downside of it: while making the objective function smoother and easier to optimize, the stochastic objective deviates further from the original problem. We demonstrate the equivalence between policy gradient methods and solving backward heat equations. Following the ill-posedness of backward heat equations from PDE theory, we present a fundamental challenge to the use of policy gradient under stochasticity. Moreover, we make the connection between this limitation and the uncertainty principle in harmonic analysis to understand the effects of exploration with stochastic policies in RL. We also provide experimental results to illustrate both the positive and negative aspects of mollification effects in practice.
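The mollification effect, that a stochastic objective is a Gaussian-smoothed (heat-kernel) version of the original, can be illustrated numerically; this is an illustration of the idea on a toy function, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# J_sigma(theta) = E[J(theta + sigma * eps)] is the Gaussian (heat-kernel)
# mollification of J. Take J = |theta|, which is non-smooth at 0.
J = np.abs

def mollified(theta, sigma, n=200000):
    eps = rng.normal(size=n)
    return np.mean(J(theta + sigma * eps))

# At theta = 0 the true objective is 0, but mollification lifts it:
# E|sigma * eps| = sigma * sqrt(2/pi), growing with sigma.
j_small = mollified(0.0, 0.1)
j_big = mollified(0.0, 1.0)
```

Larger noise makes the surrogate smoother and easier to optimize, but as the abstract notes, the smoothed objective drifts further from the original: here the deviation at the minimum grows linearly in sigma.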

NeurIPS Conference 2024 Conference Paper

PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining

  • Mishaal Kazmi
  • Hadrien Lautraite
  • Alireza Akbari
  • Qiaoyue Tang
  • Mauricio Soroco
  • Tao Wang
  • Sébastien Gambs
  • Mathias Lécuyer

We present PANORAMIA, a privacy leakage measurement framework for machine learning models that relies on membership inference attacks using generated data as non-members. By relying on generated non-member data, PANORAMIA eliminates the common dependency of privacy measurement tools on in-distribution non-member data. As a result, PANORAMIA does not modify the model, training data, or training process, and only requires access to a subset of the training data. We evaluate PANORAMIA on ML models for image and tabular data classification, as well as on large-scale language models.

AAAI Conference 2024 Conference Paper

Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation

  • Yingjie Chen
  • Jiarui Zhang
  • Tao Wang
  • Yun Liang

With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision, and that raising awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose Trend-Aware Supervision (TAS), which pursues three kinds of trend awareness: intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, achieving intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. Under trend-aware supervision, performance can be improved without extra computational or storage costs during inference.

JBHI Journal 2024 Journal Article

Using Semi-Supervised Domain Adaptation to Enhance EEG-Based Cross-Task Mental Workload Classification Performance

  • Tao Wang
  • Yufeng Ke
  • Yichao Huang
  • Feng He
  • Wenxiao Zhong
  • Shuang Liu
  • Dong Ming

Mental workload (MWL) assessment is critical for accident prevention and operator safety. However, achieving cross-task generalization of MWL classification models is a significant challenge for real-world applications. Classifiers trained on labeled samples from one task often experience a notable performance drop when directly applied to samples from other tasks, limiting their use cases. To address this issue, we propose a semi-supervised cross-task domain adaptation (SCDA) method using power spectral density (PSD) features for MWL recognition across tasks (MATB-II and n-back). Our results demonstrated that the SCDA method achieved the best cross-task classification performance on our data and the COG-BCI public dataset, with accuracies of 90.98% ± 9.36% and 96.61% ± 4.35%, respectively. Furthermore, in the cross-task classification of cross-subject scenarios, SCDA showed the highest average accuracy (75.39% ± 9.56% on our data, 90.98% ± 9.36% on the COG-BCI public dataset). The findings indicate that the semi-supervised transfer learning approach using PSD features is feasible and effective for cross-task MWL assessment.

AAAI Conference 2024 Conference Paper

VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning

  • Tangfei Liao
  • Xiaoqin Zhang
  • Li Zhao
  • Tao Wang
  • Guobao Xiao

Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences, which is a fundamental task for many applications. Finding them is challenging, given the varying inlier ratios between scenes/image pairs due to significant visual differences. However, the performance of existing methods is usually limited by the lack of visual cues (e.g., texture, illumination, structure) of scenes. In this paper, we propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately. Firstly, we obtain highly abstract visual cues of a scene with the cross attention between local features of two-view images. Then, we model these visual cues and correspondences by a joint visual-spatial fusion module, simultaneously embedding visual cues into correspondences for pruning. Additionally, to mine the consistency of correspondences, we also design a novel module that combines the KNN-based graph and the transformer, effectively capturing both local and global contexts. Extensive experiments have demonstrated that the proposed VSFormer outperforms state-of-the-art methods on outdoor and indoor benchmarks. Our code is provided at the following repository: https://github.com/sugar-fly/VSFormer.

AAAI Conference 2024 Conference Paper

Zero-Shot Aerial Object Detection with Visual Description Regularization

  • Zhengqing Zang
  • Chenyu Lin
  • Chenwei Tang
  • Tao Wang
  • Jiancheng Lv

Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-efficient object detection methods on aerial images. In this work, we propose a zero-shot method for aerial object detection named visual Description Regularization, or DescReg. Concretely, we identify the weak semantic-visual correlation of the aerial objects and aim to address the challenge with prior descriptions of their visual appearance. Instead of directly encoding the descriptions into class embedding space, which suffers from the representation gap problem, we propose to infuse the prior inter-class visual similarity conveyed in the descriptions into the embedding learning. The infusion process is accomplished with a newly designed similarity-aware triplet loss which incorporates structured regularization on the representation space. We conduct extensive experiments with three challenging aerial object detection datasets, including DIOR, xView, and DOTA. The results demonstrate that DescReg significantly outperforms the state-of-the-art ZSD methods with complex projection designs and generative frameworks, e.g., DescReg outperforms the best reported ZSD method on DIOR by 4.5 mAP on unseen classes and 8.1 in HM. We further show the generalizability of DescReg by integrating it into generative ZSD methods as well as varying the detection architecture. Codes will be released at https://github.com/zq-zang/DescReg.
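A similarity-aware triplet loss can be sketched in a minimal form where the margin for a class pair shrinks with their description-derived visual similarity; the margin rule and all names here are assumptions for illustration, not DescReg's exact loss:

```python
import numpy as np

# Triplet loss with a similarity-modulated margin: classes described as
# visually similar are allowed to sit closer in embedding space.
def triplet_loss(anchor, pos, neg, sim_an, base_margin=1.0):
    margin = base_margin * (1.0 - sim_an)  # similar pair -> smaller margin
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.5, 0.0])
n = np.array([1.0, 0.0])
l_similar = triplet_loss(a, p, n, sim_an=0.9)     # visually similar classes
l_dissimilar = triplet_loss(a, p, n, sim_an=0.1)  # dissimilar classes
```

With identical embeddings, the dissimilar pair incurs a larger penalty, so the learned space is pushed to respect the description-derived similarity structure.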

NeurIPS Conference 2023 Conference Paper

Fractal Landscapes in Policy Optimization

  • Tao Wang
  • Sylvia Herbert
  • Sicun Gao

Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains. Despite much success, it is often observed in practice that RL training with policy gradient can fail for many reasons, even on standard control problems with known solutions. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs, such that there does not exist a gradient to be estimated in the first place. We draw on techniques from chaos theory and non-smooth analysis, and analyze the maximal Lyapunov exponents and Hölder exponents of the policy optimization objectives. Moreover, we develop a practical method that can estimate the local smoothness of the objective function from samples to identify when the training process has encountered fractal landscapes. We show experiments to illustrate how some failure cases of policy optimization can be explained by such fractal landscapes.
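The sample-based smoothness estimate described above amounts to fitting a local Hölder exponent as the log-log slope of objective increments; a minimal sketch of that general idea (illustrative, not the authors' implementation):

```python
import numpy as np

# Estimate a local Holder exponent at theta: the slope of
# log|J(theta + h) - J(theta)| versus log h over shrinking perturbations h.
# Slope ~1 indicates local smoothness; slope < 1 indicates a rougher landscape.
def holder_exponent(J, theta, hs):
    diffs = np.abs([J(theta + h) - J(theta) for h in hs])
    slope, _ = np.polyfit(np.log(hs), np.log(diffs), 1)
    return slope

hs = np.logspace(-6, -2, 20)
smooth = holder_exponent(lambda t: t ** 2, 1.0, hs)  # differentiable at theta=1
rough = holder_exponent(np.sqrt, 0.0, hs)            # only Holder-1/2 at theta=0
```

In the paper's setting J(theta) is the noisy policy return rather than a closed-form function, so the increments are themselves estimated from rollouts, but the diagnostic slope is the same.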

IJCAI Conference 2023 Conference Paper

Graph Propagation Transformer for Graph Representation Learning

  • Zhe Chen
  • Hao Tan
  • Tao Wang
  • Tianrun Shen
  • Tong Lu
  • Qiuying Peng
  • Cheng Cheng
  • Yue Qi

This paper presents a novel transformer architecture for graph representation learning. The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks. Specifically, we propose a new attention mechanism called Graph Propagation Attention (GPA). It explicitly passes the information among nodes and edges in three ways, i.e., node-to-node, node-to-edge, and edge-to-node, which is essential for learning graph-structured data. On this basis, we design an effective transformer architecture named Graph Propagation Transformer (GPTrans) to further help learn graph data. We verify the performance of GPTrans in a wide range of graph learning experiments on several benchmark datasets. These results show that our method outperforms many state-of-the-art transformer-based graph models. The code will be released at https://github.com/czczup/GPTrans.

IJCAI Conference 2023 Conference Paper

Orion: Online Backdoor Sample Detection via Evolution Deviance

  • Huayang Huang
  • Qian Wang
  • Xueluan Gong
  • Tao Wang

Widely-used DNN models are vulnerable to backdoor attacks, where the backdoored model is only triggered by specific inputs but can maintain a high prediction accuracy on benign samples. Existing backdoor input detection strategies rely on the assumption that benign and poisoned samples are separable in the feature representation of the model. However, such an assumption can be broken by advanced feature-hidden backdoor attacks. In this paper, we propose a novel detection framework, dubbed Orion (online backdoor sample detection via evolution deviance). Specifically, we analyze how predictions evolve during a forward pass and find deviations between the shallow and deep outputs of the backdoor inputs. By introducing side nets to track such evolution divergence, Orion eliminates the need for the assumption of latent separability. Additionally, we put forward a scheme to restore the original label of backdoor samples, enabling more robust predictions. Extensive experiments on six attacks, three datasets, and two architectures verify the effectiveness of Orion. It is shown that Orion outperforms state-of-the-art defenses and can identify feature-hidden attacks with an F1-score of 90%, compared to 40% for other detection schemes. Orion can also achieve 80% label recovery accuracy on basic backdoor attacks.

NeurIPS Conference 2023 Conference Paper

Punctuation-level Attack: Single-shot and Single Punctuation Can Fool Text Models

  • Wenqiang Wang
  • Chongyang Du
  • Tao Wang
  • Kaihao Zhang
  • Wenhan Luo
  • Lin Ma
  • Wei Liu
  • Xiaochun Cao

The adversarial attacks have attracted increasing attention in various fields including natural language processing. The current textual attacking models primarily focus on fooling models by adding character-/word-/sentence-level perturbations, ignoring their influence on human perception. In this paper, for the first time in the community, we propose a novel mode of textual attack, punctuation-level attack. With various types of perturbations, including insertion, displacement, deletion, and replacement, the punctuation-level attack achieves promising fooling rates against SOTA models on typical textual tasks and maintains minimal influence on human perception and understanding of the text by mere perturbation of single-shot single punctuation. Furthermore, we propose a search method named Text Position Punctuation Embedding and Paraphrase (TPPEP) to accelerate the pursuit of optimal position to deploy the attack, without exhaustive search, and we present a mathematical interpretation of TPPEP. Thanks to the integrated Text Position Punctuation Embedding (TPPE), the punctuation attack can be applied at a constant cost of time. Experimental results on public datasets and SOTA models demonstrate the effectiveness of the punctuation attack and the proposed TPPE. We additionally apply the single punctuation attack to summarization, semantic-similarity-scoring, and text-to-image tasks, and achieve encouraging results.

JBHI Journal 2023 Journal Article

SemiMAR: Semi-Supervised Learning for CT Metal Artifact Reduction

  • Tao Wang
  • Hui Yu
  • Zhiwen Wang
  • Hu Chen
  • Yan Liu
  • Jingfeng Lu
  • Yi Zhang

Metal artifacts lead to CT imaging quality degradation. With the success of deep learning (DL) in medical imaging, a number of DL-based supervised methods have been developed for metal artifact reduction (MAR). Nonetheless, fully-supervised MAR methods based on simulated data do not perform well on clinical data due to the domain gap. Although this problem can be avoided in an unsupervised way to a certain degree, severe artifacts cannot be well suppressed in clinical practice. Recently, semi-supervised metal artifact reduction (MAR) methods have gained wide attention due to their ability in narrowing the domain gap and improving MAR performance in clinical data. However, these methods typically require large model sizes, posing challenges for optimization. To address this issue, we propose a novel semi-supervised MAR framework. In our framework, only the artifact-free parts are learned, and the artifacts are inferred by subtracting these clean parts from the metal-corrupted CT images. Our approach leverages a single generator to execute all complex transformations, thereby reducing the model's scale and preventing overlap between clean part and artifacts. To recover more tissue details, we distill the knowledge from the advanced dual-domain MAR network into our model in both image domain and latent feature space. The latent space constraint is achieved via contrastive learning. We also evaluate the impact of different generator architectures by investigating several mainstream deep learning-based MAR backbones. Our experiments demonstrate that the proposed method competes favorably with several state-of-the-art semi-supervised MAR techniques in both qualitative and quantitative aspects.

JBHI Journal 2023 Journal Article

Trustworthy Data and AI Environments for Clinical Prediction: Application to Crisis-Risk in People With Depression

  • Yamiko Joseph Msosa
  • Arturas Grauslys
  • Yifan Zhou
  • Tao Wang
  • Iain Buchan
  • Paul Langan
  • Steven Foster
  • Michael Walker

Depression is a common mental health condition that often occurs in association with other chronic illnesses, and varies considerably in severity. Electronic Health Records (EHRs) contain rich information about a patient's medical history and can be used to train, test and maintain predictive models to support and improve patient care. This work evaluated the feasibility of implementing an environment for predicting mental health crisis among people living with depression based on both structured and unstructured EHRs. A large EHR from a mental health provider, Mersey Care, was pseudonymised and ingested into the Natural Language Processing (NLP) platform CogStack, allowing text content in binary clinical notes to be extracted. All unstructured clinical notes and summaries were semantically annotated by the MedCAT and BioYODIE NLP services. Cases of crisis in patients with depression were then identified. Random forest models, gradient boosting trees, and Long Short-Term Memory (LSTM) networks, with varying feature arrangements, were trained to predict the occurrence of crisis. The results showed that all the prediction models can use a combination of structured and unstructured EHR information to predict crisis in patients with depression with useful accuracy. The LSTM network trained on a modified dataset with only the 1000 most important features from the random forest model with temporality showed the best performance, with a mean AUC of 0.901 (standard deviation 0.006) on the training dataset and a mean AUC of 0.810 (standard deviation 0.01) on a hold-out test dataset. Comparing the results from the technical evaluation with the views of psychiatrists shows that there are now opportunities to refine and integrate such prediction models into pragmatic point-of-care clinical decision support tools for supporting mental healthcare delivery.

AAAI Conference 2023 Conference Paper

Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method

  • Tao Wang
  • Kaihao Zhang
  • Tianrun Shen
  • Wenhan Luo
  • Bjorn Stenger
  • Tong Lu

As the quality of optical sensors improves, there is a need for processing large-scale images. In particular, the ability of devices to capture ultra-high definition (UHD) images and video places new demands on the image processing pipeline. In this paper, we consider the task of low-light image enhancement (LLIE) and introduce a large-scale database consisting of images at 4K and 8K resolution. We conduct systematic benchmarking studies and provide a comparison of current LLIE algorithms. As a second contribution, we introduce LLFormer, a transformer-based low-light enhancement method. The core components of LLFormer are the axis-based multi-head self-attention and the cross-layer attention fusion block, which significantly reduce computational complexity. Extensive experiments on the new dataset and existing public datasets show that LLFormer outperforms state-of-the-art methods. We also show that employing existing LLIE methods trained on our benchmark as a pre-processing step significantly improves the performance of downstream tasks, e.g., face detection in low-light conditions. The source code and pre-trained models are available at https://github.com/TaoWangzj/LLFormer.

AAAI Conference 2022 Conference Paper

Causal Intervention for Subject-Deconfounded Facial Action Unit Recognition

  • Yingjie Chen
  • Diqi Chen
  • Tao Wang
  • Yizhou Wang
  • Yun Liang

Subject-invariant facial action unit (AU) recognition remains challenging because the data distribution varies among subjects. In this paper, we propose a causal inference framework for subject-invariant facial action unit recognition. To illustrate the causal effect present in the AU recognition task, we formulate the causalities among facial images, subjects, latent AU semantic relations, and estimated AU occurrence probabilities via a structural causal model. By constructing such a causal diagram, we clarify the causal effect among variables and propose a plug-in causal intervention module, CIS, to deconfound the confounder Subject in the causal diagram. Extensive experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of our CIS, and the model with CIS inserted, CISNet, achieves state-of-the-art performance.

AAAI Conference 2022 Conference Paper

Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-Supervised Action Recognition

  • Tianyu Guo
  • Hong Liu
  • Zhan Chen
  • Mengyuan Liu
  • Tao Wang
  • Runwei Ding

In recent years, self-supervised representation learning for skeleton-based action recognition has developed alongside advances in contrastive learning methods. Existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns that improve the universality of the learned representations. Second, since directly using extreme augmentations may not boost performance due to drastic changes in the original identity, the Dual Distributional Divergence Minimization Loss (D3M Loss) is proposed to minimize the distribution divergence in a gentler way. Third, Nearest Neighbors Mining (NNM) is proposed to further expand positive samples and make the abundant information mining process more reasonable. Exhaustive experiments on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets verify that our AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols, with observably higher-quality action representations. Our code is available at https://github.com/Levigty/AimCLR.

IJCAI Conference 2022 Conference Paper

Discrete Listwise Personalized Ranking for Fast Top-N Recommendation with Implicit Feedback

  • Fangyuan Luo
  • Jun Wu
  • Tao Wang

We address the efficiency problem of personalized ranking from implicit feedback by hashing users and items with binary codes, so that top-N recommendation can be executed quickly in a Hamming space by bit operations. However, current hashing methods for top-N recommendation fail to align their learning objectives (such as pointwise or pairwise loss) with the benchmark metrics for ranking quality (e.g., Average Precision, AP), resulting in sub-optimal accuracy. To this end, we propose a Discrete Listwise Personalized Ranking (DLPR) model that optimizes AP under discrete constraints for fast and accurate top-N recommendation. To resolve the challenging DLPR problem, we devise an efficient algorithm that can directly learn binary codes in a relaxed continuous solution space. Specifically, theoretical analysis shows that the optimal solution to the relaxed continuous optimization problem is exactly the same as that of the original discrete DLPR problem. Through extensive experiments on two real-world datasets, we show that DLPR consistently surpasses state-of-the-art hashing methods for top-N recommendation.
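The speedup the abstract appeals to (top-N ranking in a Hamming space via bit operations) can be sketched as follows; the helper name is hypothetical and unrelated to the paper's code:

```python
def hamming_topn(user_code: int, item_codes: list, n: int) -> list:
    """Rank items for a user by Hamming distance between binary codes.

    Distance is computed with XOR plus a popcount, which is why hashed
    user/item codes make top-N recommendation fast. Illustrative sketch,
    not the DLPR learning algorithm itself.
    """
    scored = [(bin(user_code ^ c).count("1"), i) for i, c in enumerate(item_codes)]
    scored.sort()                       # smallest Hamming distance first
    return [i for _, i in scored[:n]]   # indices of the top-n items
```

Learning the codes is the hard part (that is what DLPR optimizes); once learned, scoring reduces to the cheap bit arithmetic above.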

AAAI Conference 2022 Conference Paper

FedInv: Byzantine-Robust Federated Learning by Inversing Local Model Updates

  • Bo Zhao
  • Peng Sun
  • Tao Wang
  • Keyu Jiang

Federated learning (FL) is a privacy-preserving distributed machine learning paradigm that enables multiple clients to collaboratively train statistical models without disclosing raw training data. However, the inaccessible local training data and uninspectable local training process make FL susceptible to various Byzantine attacks (e.g., data poisoning and model poisoning attacks), which aim to manipulate the FL model training process and degrade the model performance. Most of the existing Byzantine-robust FL schemes cannot effectively defend against stealthy poisoning attacks that craft poisoned models statistically similar to benign models. Things worsen when many clients are compromised or data among clients are highly non-independent and identically distributed (non-IID). In this work, to address these issues, we propose FedInv, a novel Byzantine-robust FL framework that inverses local model updates. Specifically, in each round of local model aggregation in FedInv, the parameter server first inverses the local model updates submitted by each client to generate a corresponding dummy dataset. Then, the server identifies those dummy datasets with exceptional Wasserstein distances from others and excludes the related local model updates from model aggregation. We conduct an exhaustive experimental evaluation of FedInv. The results demonstrate that FedInv significantly outperforms the existing robust FL schemes in defending against stealthy poisoning attacks under highly non-IID data partitions.
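The exclusion rule in the abstract (drop clients whose dummy datasets sit at exceptional Wasserstein distance from the rest) can be sketched for 1-D samples; the functions and the "drop the k farthest" rule below are illustrative assumptions, not FedInv's actual criterion:

```python
def w1(a, b):
    """Wasserstein-1 distance between two equally sized 1-D samples:
    the mean absolute difference of the sorted values."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def exclude_outliers(dummy_sets, k=1):
    """Return indices of clients kept after dropping the k clients whose
    dummy datasets are, on average, farthest (in W1) from all others."""
    n = len(dummy_sets)
    avg = [sum(w1(dummy_sets[i], dummy_sets[j]) for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    worst = sorted(range(n), key=lambda i: avg[i], reverse=True)[:k]
    return [i for i in range(n) if i not in worst]
```

The point of operating on dummy datasets rather than on the updates themselves is that a poisoned model can look statistically benign in parameter space while its inverted data does not.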

ICML Conference 2022 Conference Paper

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  • Nan Du 0002
  • Yanping Huang
  • Andrew M. Dai
  • Simon Tong
  • Dmitry Lepikhin
  • Yuanzhong Xu
  • Maxim Krikun
  • Yanqi Zhou

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall few-shot performance across 29 NLP tasks.

AAAI Conference 2022 Conference Paper

Pose-Guided Feature Disentangling for Occluded Person Re-identification Based on Transformer

  • Tao Wang
  • Hong Liu
  • Pinhao Song
  • Tianyu Guo
  • Wei Shi

Occluded person re-identification is a challenging task, as human body parts can be occluded by obstacles (e.g., trees, cars, and pedestrians) in certain scenes. Some existing pose-guided methods solve this problem by aligning body parts according to graph matching, but these graph-based methods are unintuitive and complicated. Therefore, we propose a transformer-based Pose-guided Feature Disentangling (PFD) method that utilizes pose information to clearly disentangle semantic components (e.g., human body or joint parts) and selectively match non-occluded parts correspondingly. First, the Vision Transformer (ViT) is used to extract patch features with its strong capability. Second, to preliminarily disentangle the pose information from patch information, a matching and distributing mechanism is leveraged in the Pose-guided Feature Aggregation (PFA) module. Third, a set of learnable semantic views is introduced in the transformer decoder to implicitly enhance the disentangled body part features. However, those semantic views are not guaranteed to be related to the body without additional supervision. Therefore, a Pose-View Matching (PVM) module is proposed to explicitly match visible body parts and automatically separate occlusion features. Fourth, to better prevent the interference of occlusions, we design a Pose-guided Push Loss to emphasize the features of visible body parts. Extensive experiments over five challenging datasets for two tasks (occluded and holistic Re-ID) demonstrate that our proposed PFD performs favorably against state-of-the-art methods. Code is available at https://github.com/WangTaoAs/PFD_Net.

AAAI Conference 2022 Conference Paper

Powerful Graph Convolutional Networks with Adaptive Propagation Mechanism for Homophily and Heterophily

  • Tao Wang
  • Di Jin
  • Rui Wang
  • Dongxiao He
  • Yuxiao Huang

Graph Convolutional Networks (GCNs) have been widely applied in various fields due to their significant power in processing graph-structured data. Typical GCNs and their variants work under a homophily assumption (i.e., nodes with the same class tend to connect to each other), while ignoring the heterophily that exists in many real-world networks (i.e., nodes with different classes tend to form edges). Existing methods deal with heterophily mainly by aggregating higher-order neighborhoods or combining the immediate representations, which introduces noise and irrelevant information into the result. These methods do not change the propagation mechanism itself, which works under the homophily assumption and is a fundamental part of GCNs. This makes it difficult to distinguish the representations of nodes from different classes. To address this problem, in this paper we design a novel propagation mechanism, which can automatically change the propagation and aggregation process according to homophily or heterophily between node pairs. To adaptively learn the propagation process, we introduce two measurements of homophily degree between node pairs, which are learned based on topological and attribute information, respectively. Then we incorporate the learnable homophily degree into the graph convolution framework, which is trained in an end-to-end schema, enabling it to go beyond the assumption of homophily. More importantly, we theoretically prove that our model can constrain the similarity of representations between nodes according to their homophily degree. Experiments on seven real-world datasets demonstrate that this new approach outperforms the state-of-the-art methods under heterophily or low homophily, and gains competitive performance under homophily.
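The adaptive mechanism described above can be caricatured for scalar node features: each edge carries a homophily degree h in [0, 1], and the neighbor's contribution flips sign as h crosses 0.5. The mapping 2h - 1 and the function itself are illustrative assumptions, not the paper's GCN layer:

```python
def adaptive_propagate(x, edges, homophily):
    """One propagation step with per-edge homophily degrees.

    x          : list of scalar node features
    edges      : list of (i, j) undirected edges
    homophily  : learned degree h in [0, 1] per edge; homophilic edges
                 (h near 1) pull neighbors together, heterophilic edges
                 (h near 0) push them apart.
    """
    out = list(x)
    for (i, j), h in zip(edges, homophily):
        w = 2.0 * h - 1.0        # map [0, 1] -> [-1, 1]
        out[i] += w * x[j]
        out[j] += w * x[i]
    return out
```

Under pure homophily (h = 1) this reduces to standard neighborhood averaging up to normalization, while h = 0 actively separates the two endpoints' representations.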

NeurIPS Conference 2022 Conference Paper

Rethinking Image Restoration for Object Detection

  • Shangquan Sun
  • Wenqi Ren
  • Tao Wang
  • Xiaochun Cao

Although image restoration has achieved significant progress, its potential to assist object detectors in adverse imaging conditions has received little attention. It has been reported that existing image restoration methods cannot improve object detector performance and sometimes even reduce it. To address this issue, we propose a targeted adversarial attack in the restoration procedure to boost object detection performance after restoration. Specifically, we present an ADAM-like adversarial attack to generate pseudo ground truth for restoration training. The resulting restored images are close to the original sharp images and, at the same time, lead to better object detection results. We conduct extensive experiments on image dehazing and low-light enhancement and show the superiority of our method over conventional training and other domain adaptation and multi-task methods. The proposed pipeline can be applied to all restoration methods and to both one- and two-stage detectors.

IJCAI Conference 2022 Conference Paper

Uncertainty-Guided Pixel Contrastive Learning for Semi-Supervised Medical Image Segmentation

  • Tao Wang
  • Jianglin Lu
  • Zhihui Lai
  • Jiajun Wen
  • Heng Kong

Recently, contrastive learning has shown great potential in medical image segmentation. Due to the lack of expert annotations, however, it is challenging to apply contrastive learning in semi-supervised scenes. To solve this problem, we propose a novel uncertainty-guided pixel contrastive learning method for semi-supervised medical image segmentation. Specifically, we construct an uncertainty map for each unlabeled image and then remove the uncertainty region in the uncertainty map to reduce the possibility of noise sampling. The uncertainty map is determined by a well-designed consistency learning mechanism, which generates comprehensive predictions for unlabeled data by encouraging consistent network outputs from two different decoders. In addition, we suggest that the effective global representations learned by an image encoder should be equivariant to different geometric transformations. To this end, we construct an equivariant contrastive loss to strengthen global representation learning ability of the encoder. Extensive experiments conducted on popular medical image benchmarks demonstrate that the proposed method achieves better segmentation performance than the state-of-the-art methods.
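The uncertainty-guided sampling idea above (exclude pixels where the two decoders disagree from contrastive learning) can be sketched per pixel; the threshold rule and helper name are illustrative assumptions, not the paper's exact criterion:

```python
def certain_pixel_mask(p1, p2, tau=0.1):
    """Keep only pixels where two decoders' foreground probabilities
    agree within `tau`; the rest are treated as uncertain and excluded
    from contrastive sampling to reduce the chance of noisy positives
    or negatives on unlabeled images.
    """
    return [abs(a - b) <= tau for a, b in zip(p1, p2)]
```

Consistency between decoder outputs thus serves double duty: it is both a training signal (the consistency loss) and a filter for where the pixel contrastive loss may be applied.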

TIST Journal 2022 Journal Article

Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

  • Lili Wei
  • Congyan Lang
  • Liqian Liang
  • Songhe Feng
  • Tao Wang
  • Shidi Chen

Recently, given the challenge of collecting large-scale explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general pipeline with two phases, i.e., a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in the video frames, most of them focus on the latter phase by introducing optical flow as temporal information to provide more supervision. However, these optical-flow-based studies are greatly affected by illumination and distortion and lack consideration of the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a corresponding temporal information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used for extracting temporal information and generating predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be plugged into the SSN flexibly. We also pose a cross-frame coherence loss (CFCL) to achieve smooth segmentation results by exploiting the coherence of the masks produced by TPN and SSN. Extensive experiments demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods on two challenging datasets, i.e., Davis-2016 and YouTube-Objects.

AAAI Conference 2021 Short Paper

An Entity-Aware Adversarial Domain Adaptation Network for Cross-Domain Named Entity Recognition (Student Abstract)

  • Qi Peng
  • Changmeng Zheng
  • Yi Cai
  • Tao Wang
  • Haoran Xie
  • Qing Li

Existing methods for named entity recognition rely critically on labeled data. To handle the situation in which the data is fully unlabeled, we propose an entity-aware adversarial domain adaptation network, which utilizes labeled source data and then adapts to the unlabeled target domain. We first apply adversarial training to reduce the distribution gap between different domains. Furthermore, we introduce an entity-aware attention mechanism to guide the adversarial process toward the alignment of entity features. The experiments show that our model outperforms state-of-the-art approaches.

IJCAI Conference 2021 Conference Paper

Deep Reinforcement Learning for Multi-contact Motion Planning of Hexapod Robots

  • Huiqiao Fu
  • Kaiqiang Tang
  • Peng Li
  • Wenqi Zhang
  • Xinpeng Wang
  • Guizhou Deng
  • Tao Wang
  • Chunlin Chen

Legged locomotion in a complex environment requires careful planning of the footholds of legged robots. In this paper, a novel Deep Reinforcement Learning (DRL) method is proposed to implement multi-contact motion planning for hexapod robots moving on uneven plum-blossom piles. First, the motion of hexapod robots is formulated as a Markov Decision Process (MDP) with a specified reward function. Second, a transition feasibility model is proposed for hexapod robots, which describes the feasibility of a state transition under the condition of satisfying kinematics and dynamics, and in turn determines the rewards. Third, the foothold and Center-of-Mass (CoM) sequences are sampled from a diagonal Gaussian distribution and optimized through learning the optimal policies using the designed DRL algorithm. Both simulation and experimental results on physical systems demonstrate the feasibility and efficiency of the proposed method. Videos are shown at https://videoviewpage.wixsite.com/mcrl.

NeurIPS Conference 2021 Conference Paper

Direct Multi-view Multi-person 3D Pose Estimation

  • Tao Wang
  • Jianfeng Zhang
  • Yujun Cai
  • Shuicheng Yan
  • Jiashi Feng

We present the Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from costly volumetric representations or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate the view-dependent camera geometry into the feature representations for augmenting the projective attention. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP25 on the challenging Panoptic dataset, improving upon the previous best approach [35] by 9.8%. MvP is general and also extendable to recovering human mesh represented by the SMPL model, and is thus useful for modeling multi-person body shapes. Code and models are available at https://github.com/sail-sg/mvp.

TIST Journal 2021 Journal Article

Fine-Grained Semantic Image Synthesis with Object-Attention Generative Adversarial Network

  • Min Wang
  • Congyan Lang
  • Liqian Liang
  • Songhe Feng
  • Tao Wang
  • Yutong Gao

Semantic image synthesis is a new rising and challenging vision problem accompanied by the recent promising advances in generative adversarial networks. The existing semantic image synthesis methods only consider the global information provided by the semantic segmentation mask, such as class label, global layout, and location, so the generative models cannot capture the rich local fine-grained information of the images (e.g., object structure, contour, and texture). To address this issue, we adopt a multi-scale feature fusion algorithm to refine the generated images by learning the fine-grained information of the local objects. We propose OA-GAN, a novel object-attention generative adversarial network that allows attention-driven, multi-fusion refinement for fine-grained semantic image synthesis. Specifically, the proposed model first generates multi-scale global image features and local object features, respectively, then the local object features are fused into the global image features to improve the correlation between the local and the global. In the process of feature fusion, the global image features and the local object features are fused through the channel-spatial-wise fusion block to learn ‘what’ and ‘where’ to attend in the channel and spatial axes, respectively. The fused features are used to construct correlation filters to obtain feature response maps to determine the locations, contours, and textures of the objects. Extensive quantitative and qualitative experiments on COCO-Stuff, ADE20K and Cityscapes datasets demonstrate that our OA-GAN significantly outperforms the state-of-the-art methods.

TIST Journal 2020 Journal Article

End-to-End Text-to-Image Synthesis with Spatial Constrains

  • Min Wang
  • Congyan Lang
  • Liqian Liang
  • Songhe Feng
  • Tao Wang
  • Yutong Gao

Although the performance of automatically generating high-resolution realistic images from text descriptions has been significantly boosted, many challenging issues in image synthesis have not been fully investigated, due to shape variations, viewpoint changes, pose changes, and the relations of multiple objects. In this article, we propose a novel end-to-end approach for text-to-image synthesis with spatial constraints by mining object spatial location and shape information. Instead of learning a hierarchical mapping from text to image, our algorithm directly generates multi-object fine-grained images through the guidance of the generated semantic layouts. By fusing text semantic and spatial information into a synthesis module and jointly fine-tuning them with the multi-scale semantic layouts generated, the proposed networks show impressive performance in text-to-image synthesis for complex scenes. We evaluate our method both on the single-object CUB dataset and the multi-object MS-COCO dataset. Comprehensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches consistently across different evaluation metrics.

AAAI Conference 2020 Conference Paper

Finding Action Tubes with a Sparse-to-Dense Framework

  • Yuxi Li
  • Weiyao Lin
  • Tao Wang
  • John See
  • Rui Qian
  • Ning Xu
  • Limin Wang
  • Shugong Xu

The task of spatial-temporal action detection has attracted increasing attention among researchers. Existing dominant methods solve this problem by relying on short-term information and dense serial-wise detection on individual frames or clips. Despite their effectiveness, these methods make inadequate use of long-term information and are prone to inefficiency. In this paper, we propose, for the first time, an efficient framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner. There are two key characteristics in this framework: (1) both long-term and short-term sampled information are explicitly utilized in our spatiotemporal network, and (2) a new dynamic feature sampling module (DTS) is designed to effectively approximate the tube output while keeping the system tractable. We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets, achieving promising results that are competitive with state-of-the-art methods. The proposed sparse-to-dense strategy renders our framework about 7.6 times more efficient than the nearest competitor.

TIST Journal 2019 Journal Article

Co-saliency Detection with Graph Matching

  • Zun Li
  • Congyan Lang
  • Jiashi Feng
  • Yidong Li
  • Tao Wang
  • Songhe Feng

Recently, co-saliency detection, which aims to automatically discover common and salient objects appeared in several relevant images, has attracted increased interest in the computer vision community. In this article, we present a novel graph-matching based model for co-saliency detection in image pairs. A solution of graph matching is proposed to integrate the visual appearance, saliency coherence, and spatial structural continuity for detecting co-saliency collaboratively. Since the saliency and the visual similarity have been seamlessly integrated, such a joint inference schema is able to produce more accurate and reliable results. More concretely, the proposed model first computes the intra-saliency for each image by aggregating multiple saliency cues. The common and salient regions across multiple images are thus discovered via a graph matching procedure. Then, a graph reconstruction scheme is proposed to refine the intra-saliency iteratively. Compared to existing co-saliency detection methods that only utilize visual appearance cues, our proposed model can effectively exploit both visual appearance and structure information to better guide co-saliency detection. Extensive experiments on several challenging image pair databases demonstrate that our model outperforms state-of-the-art baselines significantly.

ICRA Conference 2019 Conference Paper

Eagle Shoal: A new designed modular tactile sensing dexterous hand for domestic service robots

  • Tao Wang
  • Zhanxiao Geng
  • Bo Kang
  • Xiaochuan Luo

This paper introduces a newly designed modular tactile sensing dexterous hand for domestic service robots. This fully-actuated hand consists of 1 palm and 3 fingers, with embedded tactile sensors, motors and control boards. The palm and each finger have 2 degrees of freedom (DOFs). The modular design makes it easy to attach and detach the hand, even for inexperienced users. The tactile sensor unit, with its new structure, can help decrease the number of sensors while maintaining good sensing ability. A series of experiments was performed to test the sensor unit and evaluate the hand's performance on an object set. The results show that the sensor unit provides precise sensing results and perceives continuous vibration data, and that the hand has excellent grasp ability. In addition to its good performance, the hand costs $500 USD at a scale of one hundred sets, making it affordable for researchers and for domestic service robots in the consumer market. In future research, this hand will be used to promote robotic manipulation research based on visual and tactile data.

AAAI Conference 2019 Conference Paper

Partial Multi-Label Learning by Low-Rank and Sparse Decomposition

  • Lijuan Sun
  • Songhe Feng
  • Tao Wang
  • Congyan Lang
  • Yi Jin

Multi-Label Learning (MLL) aims to learn from training data where each example is represented by a single instance while associated with a set of candidate labels. Most existing MLL methods are typically designed to handle the problem of missing labels. However, in many real-world scenarios, the labeling information for multi-label data is often redundant, which cannot be solved by classical MLL methods; thus the Partial Multi-label Learning (PML) framework has been proposed to cope with this problem, i.e., to remove the noisy labels from the multi-label sets. In this paper, in order to further improve the denoising capability of the PML framework, we utilize the low-rank and sparse decomposition scheme and propose a novel Partial Multi-label Learning by Low-Rank and Sparse decomposition (PML-LRS) approach. Specifically, we first reformulate the observed label set into a label matrix, and then decompose it into a ground-truth label matrix and an irrelevant label matrix, where the former is constrained to be low-rank and the latter is assumed to be sparse. Next, we utilize a feature mapping matrix to explore the label correlations and meanwhile constrain the feature mapping matrix to be low-rank to prevent the proposed method from overfitting. Finally, we obtain the ground-truth labels by minimizing the label loss, where the Augmented Lagrange Multiplier (ALM) algorithm is incorporated to solve the optimization problem. Extensive experimental results demonstrate that PML-LRS achieves superior or competitive performance against other state-of-the-art methods.
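A standard building block of ALM-style low-rank plus sparse decompositions, like the one the abstract describes, is the element-wise shrinkage (soft-thresholding) operator used to update the sparse component. The sketch below shows that operator in isolation (hypothetical helper, not the full PML-LRS algorithm):

```python
def soft_threshold(M, lam):
    """Element-wise shrinkage: S = sign(M) * max(|M| - lam, 0).

    Entries with magnitude below `lam` are zeroed, which is what drives
    the "irrelevant label" matrix toward sparsity in each ALM iteration.
    """
    return [[(abs(v) - lam if abs(v) > lam else 0.0) * (1 if v >= 0 else -1)
             for v in row] for row in M]
```

The low-rank component is updated analogously by shrinking singular values rather than matrix entries.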

ICRA Conference 2018 Conference Paper

A Fluid-Filled Tubular Dielectric Elastomer Variable Stiffness Structure Inspired by the Hydrostatic Skeleton Principle *Research supported by the National Natural Science Foundation of China (No. 51675413)

  • Tao Wang
  • Yue Li
  • Yuanjie Li
  • Jinhua Zhang
  • Jun Hong 0002
  • Michael Yu Wang

This work presents a novel variable stiffness structure consisting of a fiber-constrained dielectric elastomer tube filled with insulating oil. The tensile stiffness of the structure can be adjusted by applied voltage, and its initial value can be customized according to the initial pre-stretch of the material. The structure has dimensions of ∼30 mm diameter × 50 mm length. A mathematical model is established to predict the initial tensile stiffness of the structure. The change in tensile stiffness under applied voltage is verified experimentally: the results show a 25% decrease in tensile stiffness at 4 kV, with the decrement also depending on the elongation of the structure. With different pre-stretches and dimensions of the dielectric elastomer, one can obtain devices with different ranges of stiffness variation.

ICRA Conference 2017 Conference Paper

Design and control of an inchworm-inspired soft robot with omega-arching locomotion

  • Huaxia Guo
  • Jinhua Zhang
  • Tao Wang
  • Yuanjie Li
  • Jun Hong 0002
  • Yue Li

This paper presents an inchworm-inspired soft robot composed of a soft body, a front foot, and a back foot. Compared with traditional inchworm-type robots consisting of rigid components, the actuation of the soft robot is simpler, and the inchworm-inspired soft robot achieves higher locomotion efficiency than other bionic soft robots. The main idea of this paper is to imitate the "Ω" motion shape of the biological inchworm using a silicone square tube with strain-limiting layers. In addition, each foot of the robot, made by 3D printing together with a metal sheet, produces different friction coefficients to achieve anchor-and-move locomotion. Under certain actuation patterns, the robot thus realizes inchworm-like locomotion. Experimental results show that the proposed robot has excellent performance.

IJCAI Conference 2017 Conference Paper

Interactive Image Segmentation via Pairwise Likelihood Learning

  • Tao Wang
  • Quansen Sun
  • Qi Ge
  • Zexuan Ji
  • Qiang Chen
  • Guiyu Xia

This paper presents an interactive image segmentation approach in which the segmentation problem is formulated as probabilistic estimation. Instead of measuring the distances between unseeded pixels and seeded pixels, we measure the similarities between pixel pairs and seed pairs to improve robustness to the seeds. The unary prior probability of each pixel belonging to the foreground F or background B can be effectively estimated from the similarities with the label pairs (F, F), (F, B), (B, F), and (B, B). A likelihood learning framework is then proposed to fuse the region and boundary information of the image by imposing a smoothing constraint on the unary potentials. Experiments on challenging data sets demonstrate that the proposed method obtains better performance than state-of-the-art methods.
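A much-simplified, scalar-intensity sketch of the pair-based idea (our own illustration, not the paper's estimator: real pixels are feature vectors, and the full method uses all four seed-pair classes within a learned likelihood):

```python
import numpy as np

def pair_unary_prior(pixels, seeds_f, seeds_b, sigma=10.0):
    """Unary prior P(F | x) for each pixel, scored by how plausibly the
    pixel forms a same-label pair with foreground vs. background seeds.
    Pixels and seeds are scalar intensities in this toy version."""
    def same_label_score(x, seeds):
        # Gaussian similarity of each (pixel, seed) pair, averaged over seeds
        d = np.abs(x[:, None] - seeds[None, :])
        return np.exp(-(d ** 2) / (2 * sigma ** 2)).mean(axis=1)
    pf = same_label_score(pixels, seeds_f)
    pb = same_label_score(pixels, seeds_b)
    return pf / (pf + pb + 1e-12)
```

Thresholding this prior at 0.5 already separates well-contrasted regions; the paper additionally smooths these unaries with boundary information.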

ICRA Conference 2016 Conference Paper

A continuous jumping robot on water mimicking water striders

  • Jihong Yan
  • Kai Yang 0008
  • Tao Wang
  • Xinbin Zhang
  • Jie Zhao 0003

Aiming at mimicking the jumping locomotion of water striders, a new continuous jumping robot on water is proposed. Compared with the horizontal rowing motion, the jumping capability of water striders is challengeable to imitate, since the impact force on water is easy to cause the sinking of the robot. In this paper, a jumping mechanism based on springs is designed to produce a large thrust for the robot to jump. The shape of supporting legs and center of gravity of the robot are carefully designed so that the robot can jump on the surface continuously and smoothly. Influences of several critical factors, including the area of supporting legs, spring stiffness and jumping angle, on jump performance are analyzed by means of dynamic simulation and experiments. The fabricated robot weighs about 10. 2 g and can continuously jump on water with the maximum leap height and length of 120 mm and 410 mm, respectively.

AAAI Conference 2016 Conference Paper

Convolutional Neural Networks over Tree Structures for Programming Language Processing

  • Lili Mou
  • Ge Li
  • Lu Zhang
  • Tao Wang
  • Zhi Jin

Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
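A toy, untrained version of the idea over Python's own ASTs (a sketch only: random weights, a single convolution layer, and a mean over children instead of the paper's continuous binary tree weighting):

```python
import ast
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
_type_vecs = {}

def type_vec(node):
    """Look up (or lazily create) a random embedding for the AST node type."""
    name = type(node).__name__
    if name not in _type_vecs:
        _type_vecs[name] = rng.standard_normal(DIM)
    return _type_vecs[name]

W_self = rng.standard_normal((DIM, DIM)) * 0.1
W_child = rng.standard_normal((DIM, DIM)) * 0.1

def conv_node(node):
    """One tree-convolution step: combine a node's embedding with its children's."""
    h = W_self @ type_vec(node)
    children = list(ast.iter_child_nodes(node))
    if children:
        h = h + W_child @ np.mean([type_vec(c) for c in children], axis=0)
    return np.maximum(h, 0.0)  # ReLU

def program_feature(src):
    """Max-pool convolved node features over the whole tree (dynamic pooling)."""
    tree = ast.parse(src)
    feats = [conv_node(n) for n in ast.walk(tree)]
    return np.max(feats, axis=0)
```

The resulting fixed-size vector could feed a classifier; in TBCNN the weights are of course learned end-to-end rather than random.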

AAAI Conference 2016 Conference Paper

Path Following with Adaptive Path Estimation for Graph Matching

  • Tao Wang
  • Haibin Ling

Graph matching plays an important role in many fields of computer vision. It is a well-known NP-hard problem that has been investigated for decades. Among the many algorithms for graph matching, those utilizing the path following strategy exhibit state-of-the-art performance. However, the main drawback of this category of algorithms lies in its high computational burden. In this paper, we propose a novel path following strategy for graph matching that improves its computational efficiency. We first propose a path estimation method to reduce the computational cost of each iteration, and then a method of adaptive step length to accelerate convergence. The proposed approach can be integrated into any algorithm that utilizes the path following strategy. To validate our approach, we compare it with several recently proposed graph matching algorithms on three benchmark image datasets. Experimental results show that our approach significantly improves the computational efficiency of the original algorithms while offering similar or better matching results.
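The strategy can be illustrated on a one-dimensional toy problem (our own construction, not the paper's graph-matching objective): interpolate from a convex relaxation `f0` to a concave one `f1`, warm-start each solve, and adapt the step in λ according to how far the minimizer moved:

```python
def f0(x):                      # convex relaxation (smooth, easy to minimize)
    return (x - 0.7) ** 2

def f1(x):                      # concave relaxation (minima at the corners {0, 1})
    return -(x - 0.5) ** 2

def minimize_interp(lam, x0, lr=0.05, steps=200):
    """Projected gradient descent on F_lam = (1-lam)*f0 + lam*f1 over [0, 1]."""
    x = x0
    for _ in range(steps):
        g = (1 - lam) * 2 * (x - 0.7) + lam * (-2) * (x - 0.5)
        x = min(1.0, max(0.0, x - lr * g))
    return x

def path_follow(step0=0.05, tol=1e-3):
    """Track the minimizer as lam goes 0 -> 1, warm-starting each solve."""
    lam, step, x = 0.0, step0, 0.5
    while lam < 1.0:
        lam = min(1.0, lam + step)
        x_new = minimize_interp(lam, x)   # warm start from the previous minimizer
        # adaptive step length: grow while the path barely moves, shrink otherwise
        step = step * 2 if abs(x_new - x) < tol else max(step0, step / 2)
        x = x_new
    return x
```

The concave end pushes the solution to a discrete corner (here x = 1), which is exactly why matching algorithms follow this path from a continuous relaxation to a combinatorial solution.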

IS Journal 2014 Journal Article

Characterizing the Evolution of Social Computing Research

  • Tao Wang
  • Zhong Liu
  • Baoxin Xiu
  • Hong Mo
  • Qingpeng Zhang

With Web 2.0 advances, social computing has become an emerging research field in the past decade. This article analyzes the characteristics of social computing research from both static and dynamic perspectives. First, the authors present the overlapping relationships of content, represented by keywords, as of 2011. Next, they show the dynamics of social computing research by analyzing keyword trends and the topological evolution of co-word networks. The article characterizes the key features and the evolution of social computing from a quantitative perspective.
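Co-word analysis of the kind described here starts from a simple co-occurrence count over each paper's keyword list (a minimal sketch; real studies typically add frequency thresholds and normalization before building the network):

```python
from collections import Counter
from itertools import combinations

def coword_edges(papers):
    """Count how often each keyword pair co-occurs in a paper's keyword list.
    `papers` is a list of keyword lists; edges are returned as sorted pairs."""
    edges = Counter()
    for kws in papers:
        for a, b in combinations(sorted(set(kws)), 2):
            edges[(a, b)] += 1
    return edges
```

The resulting weighted edge list is what topological measures (degree, clustering, components) are then computed on.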

IS Journal 2014 Journal Article

Collaboration Pattern and Topic Analysis on Intelligence and Security Informatics Research

  • Wenli Liu
  • Xiaolong Zheng
  • Tao Wang
  • Hui Wang

In this article, researcher collaboration patterns and research topics on Intelligence and Security Informatics (ISI) are investigated using social network analysis approaches. The collaboration networks exhibit scale-free property and small-world effect. From these networks, the authors obtain the key researchers, institutions, and three important topics.

ICRA Conference 2014 Conference Paper

On-board inertial-assisted visual odometer on an embedded system

  • Guyue Zhou
  • Jiaxin Ye
  • Wei Ren
  • Tao Wang
  • Zexiang Li 0001

In this paper, we propose a novel inertial-assisted visual odometry system intended for low-cost micro aerial vehicles (MAVs). The system sensor assembly consists of two downward-facing cameras and an inertial measurement unit (IMU) with three-axis accelerometers/gyroscopes. Real-time implementation of the system is enabled by a low-cost embedded system via two important features: firstly, simple pixel-level algorithms are integrated in a low-end FPGA and accelerated via pipeline and combinational logic techniques; secondly, a fast yaw-and-translation estimation algorithm works well with a novel outlier rejection scheme based on probabilistic predetermined operations rather than hypothesis testing iterations. We illustrate the performance of our system by hovering a MAV in a GPS-denied environment. Its feasibility and robustness are also demonstrated in complex outdoor environments.

ICML Conference 2013 Conference Paper

Deep learning with COTS HPC systems

  • Adam Coates 0002
  • Brody Huval
  • Tao Wang
  • David J. Wu 0001
  • Bryan Catanzaro
  • Andrew Y. Ng

Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1 billion parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using just 16 machines. As this infrastructure is much more easily marshaled by others, the approach enables much wider-spread research with extremely large neural networks.
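The data-parallel pattern behind such GPU/MPI clusters can be sketched in a few lines (a single-process simulation: the list of shards stands in for workers, and the mean over gradients stands in for an `MPI_Allreduce`; the model here is a toy least-squares fit, not the paper's networks):

```python
import numpy as np

def local_gradient(w, X, y):
    """Least-squares gradient on one worker's shard: d/dw mean((Xw - y)^2)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    """One synchronous step: each worker computes its shard gradient, then an
    allreduce-style average replaces the MPI communication, and all workers
    apply the same update."""
    grads = [local_gradient(w, X, y) for X, y in shards]
    g = np.mean(grads, axis=0)        # stand-in for MPI_Allreduce(avg)
    return w - lr * g
```

With equal shard sizes the averaged gradient equals the full-batch gradient, so the distributed iteration matches the serial one exactly.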

TIST Journal 2011 Journal Article

Automatic player labeling, tracking and field registration and trajectory mapping in broadcast soccer video

  • Xiaofeng Tong
  • Jia Liu
  • Tao Wang
  • Yimin Zhang

In this article, we present a method for automatic player trajectory mapping based on player detection, unsupervised labeling, efficient multi-object tracking, and playfield registration in broadcast soccer videos. The player detector determines the players' positions and scales by combining dominant-color-based background subtraction with a boosting detector using Haar features. We first learn the dominant color with an accumulated color histogram at the beginning of processing, then use the player detector to collect hundreds of player samples, and learn a player appearance codebook by unsupervised clustering. In a soccer game, a player can be labeled as one of four categories: one of the two teams, referee, or outlier. This learning capability enables the method to generalize well to different videos without any manual initialization. With the dominant color and the player appearance model, we can locate and label each player. After that, we perform multi-object tracking using Markov Chain Monte Carlo (MCMC) data association to generate player trajectories. Several data-driven dynamics, such as label consistency, motion consistency, and track length, are proposed to improve the Markov chain's efficiency. Finally, we extract key points, find the mapping from the image plane to the standard field model, and map the players' positions and trajectories onto the field. Extensive experimental results on FIFA World Cup 2006 videos demonstrate that this method achieves high detection and labeling precision, reliable tracking under player occlusion, moderate camera motion, and pose variation, and promising field registration results.
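The dominant-color background-subtraction step can be sketched as a histogram peak over quantized colors (a toy single-frame version; the paper accumulates the histogram over time and applies it per channel to the playfield color):

```python
import numpy as np

def dominant_color_mask(frame, n_bins=8):
    """Mask pixels whose quantized color equals the most frequent (dominant)
    one, a stand-in for the green playfield in broadcast soccer video.
    `frame` is an (H, W, 3) uint8 image."""
    q = (frame // (256 // n_bins)).astype(int)                 # quantize channels
    codes = q[..., 0] * n_bins * n_bins + q[..., 1] * n_bins + q[..., 2]
    dominant = np.bincount(codes.ravel()).argmax()             # histogram peak
    return codes == dominant                                   # True = background
```

Everything outside the mask (players, referee, ball) becomes the foreground passed to the boosting detector.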

IJCAI Conference 2007 Conference Paper

  • Jianguo Li
  • Changshui Zhang
  • Tao Wang
  • Yimin Zhang

Bayesian network classifiers (BNCs) have received considerable attention in the machine learning field. Several special-structure BNCs have been proposed and have demonstrated promising performance. However, recent work shows that structure learning in BNs may suffer from a non-negligible posterior problem, i.e., many structures may have similar posterior scores. In this paper, we propose a generalized additive Bayesian network classifier, which transforms the structure learning problem into a generalized additive model (GAM) learning problem. We first generate a series of very simple BNs and place them in the GAM framework, then adopt a gradient-based algorithm to learn the combining parameters, thus constructing a more powerful classifier. On a large suite of benchmark data sets, the proposed approach outperforms many traditional BNCs, such as naive Bayes and TAN, and achieves comparable or better performance than boosted Bayesian network classifiers.
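The combining step — fitting weights over a fixed set of simple base classifiers by gradient descent — can be sketched with a logistic loss (our own minimal stand-in: `scores` holds each base classifier's real-valued output, whereas the paper combines simple BN scores within a GAM):

```python
import numpy as np

def fit_additive(scores, y, lr=0.5, steps=500):
    """Learn combining weights alpha for base-classifier scores by gradient
    descent on the logistic loss. `scores` is (n_samples, n_classifiers),
    labels y are in {0, 1}."""
    n, k = scores.shape
    alpha = np.zeros(k)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-scores @ alpha))   # combined prediction
        alpha -= lr * scores.T @ (p - y) / n        # logistic-loss gradient
    return alpha
```

An informative base classifier ends up with a large weight while an uninformative one is driven toward zero, which is the "more powerful classifier" effect the abstract describes.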

IJCAI Conference 2007 Conference Paper

  • Daniel Lizotte
  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans

Gait optimization is a basic yet challenging problem for both quadrupedal and bipedal robots. Although techniques for automating the process exist, most involve local function optimization procedures that suffer from three key drawbacks. Local optimization techniques are naturally plagued by local optima, make no use of the expensive gait evaluations once a local step is taken, and do not explicitly model noise in gait evaluation. These drawbacks increase the need for a large number of gait evaluations, making optimization slow, data inefficient, and manually intensive. We present a Bayesian approach based on Gaussian process regression that addresses all three drawbacks. It uses a global search strategy based on a posterior model inferred from all of the individual noisy evaluations. We demonstrate the technique on a quadruped robot, using it to optimize two different criteria: speed and smoothness. We show that, in both cases, our technique requires dramatically fewer gait evaluations than state-of-the-art local gradient approaches.
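The loop can be sketched as standard GP-based Bayesian optimization (our own minimal 1-D version with a made-up noisy "speed" curve peaking near x = 0.6; the paper's kernel, acquisition details, and gait parameterization differ):

```python
import numpy as np
from math import erf

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel between 1-D point sets a and b."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls ** 2))

def gp_posterior(xs, X, y, noise=1e-4):
    """GP posterior mean and variance at query points xs given data (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 1e-12)

_ncdf = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / 2 ** 0.5)))

def expected_improvement(mu, var, best):
    """EI acquisition for maximization: explores high variance, exploits high mean."""
    sd = np.sqrt(var)
    z = (mu - best) / sd
    return (mu - best) * _ncdf(z) + sd * np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(1)

def speed(x):
    """Hypothetical noisy gait-speed objective, peak near x = 0.6."""
    return np.exp(-30.0 * (x - 0.6) ** 2) + 0.01 * rng.standard_normal()

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.1, 0.9])                   # two initial gait evaluations
y = np.array([speed(x) for x in X])
for _ in range(15):                        # far fewer evaluations than local search
    mu, var = gp_posterior(grid, X, y)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.max()))]
    X = np.append(X, x_next)
    y = np.append(y, speed(x_next))
best_x = X[np.argmax(y)]
```

Because every noisy evaluation updates the posterior, no measurement is wasted, which is precisely the data-efficiency argument in the abstract.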

NeurIPS Conference 2007 Conference Paper

Stable Dual Dynamic Programming

  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans
  • Daniel Lizotte

Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.

AAAI Conference 2006 Short Paper

Action Selection in Bayesian Reinforcement Learning

  • Tao Wang

My research attempts to address on-line action selection in reinforcement learning from a Bayesian perspective. The idea is to develop more effective action selection techniques by exploiting information in a Bayesian posterior, while also selecting actions by growing an adaptive, sparse lookahead tree. I further augment the approach by considering a new value function approximation strategy for the belief-state Markov decision processes induced by Bayesian learning.

AAAI Conference 2006 Conference Paper

Compact, Convex Upper Bound Iteration for Approximate POMDP Planning

  • Tao Wang
  • Michael Bowling

Partially observable Markov decision processes (POMDPs) are an intuitive and general way to model sequential decision making problems under uncertainty. Unfortunately, even approximate planning in POMDPs is known to be hard, and developing heuristic planners that can deliver reasonable results in practice has proved to be a significant challenge. In this paper, we present a new approach to approximate value-iteration for POMDP planning that is based on quadratic rather than piecewise linear function approximators. Specifically, we approximate the optimal value function by a convex upper bound composed of a fixed number of quadratics, and optimize it at each stage by semidefinite programming. We demonstrate that our approach can achieve competitive approximation quality to current techniques while still maintaining a bounded size representation of the function approximator. Moreover, an upper bound on the optimal value function can be preserved if required. Overall, the technique requires computation time and space that is only linear in the number of iterations (horizon time).