Arrow Research search

Author name cluster

Li Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

96 papers
2 author rows

Possible papers

96

AAAI Conference 2026 Conference Paper

Beyond Sharpness: The Role of Nonuniformity in Generalization

  • Yingcong Zhou
  • Pingfan Wu
  • Li Wang
  • Zhiguo Fu
  • Fengqin Yang

Sharpness-aware minimization (SAM) is widely recognized for enhancing the generalization performance of deep neural networks. However, recent works have challenged the claim that flatness implies generalization, demonstrating that it is insufficient as an indicator of generalization. In this paper, we reveal an insightful phenomenon: among minima of similar sharpness, stochastic optimization algorithms tend to prefer those with lower nonuniformity. We define nonuniformity by both the magnitude and structure of the gradient noise, and show that it fundamentally differs from sharpness and plays a critical role in generalization. Specifically, we first theoretically prove that the expected generalization gap of models trained via stochastic optimization algorithms is positively correlated with nonuniformity (the magnitude of the gradient noise). Empirically, we show that nonuniformity exhibits a stronger correlation with generalization than sharpness, especially in Transformer models. Furthermore, we demonstrate that nonuniformity (the structure of the gradient noise) guides the algorithm towards sparser solutions more effectively, and yields better generalization performance, than sharpness-based methods in the high-dimensional sparse regression problem. Finally, extensive experiments on various datasets and models confirm the advantages of nonuniformity for generalization: (1) optimization guided by nonuniformity achieves better generalization than optimization guided by flatness (in standard training, transfer learning, hyperparameter sensitivity, and robustness to label noise); (2) model architecture (such as depth and width) is closely related to nonuniformity.
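The magnitude component of nonuniformity can be illustrated with a quick numerical sketch: estimate how far minibatch gradients scatter around the full-batch gradient at a candidate minimum. Everything below (the least-squares toy problem, batch size, and the mean-squared-deviation estimator) is an illustrative assumption, not the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: least-squares loss whose per-minibatch gradients we can
# compare against the full-batch gradient at a candidate minimum w.
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5) + rng.normal(scale=0.5, size=1000)
w = np.ones(5)  # the generating weights, a near-minimum of the loss

def minibatch_grad(idx):
    # Gradient of 0.5 * mean((X w - y)^2) over the examples in idx.
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

full_grad = minibatch_grad(np.arange(len(y)))

# Proxy for nonuniformity (magnitude component): mean squared deviation
# of minibatch gradients from the full-batch gradient.
batches = [rng.choice(len(y), size=32, replace=False) for _ in range(200)]
noise_sq = [np.sum((minibatch_grad(b) - full_grad) ** 2) for b in batches]
nonuniformity = float(np.mean(noise_sq))
print(f"estimated gradient-noise magnitude: {nonuniformity:.4f}")
```

Two minima with identical loss curvature can still differ in this quantity, which is the distinction the abstract draws between sharpness and nonuniformity.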

AAAI Conference 2026 Conference Paper

Decoupling Understanding from Reasoning via Problem Space Mapping for Small-Scale Model Reasoning

  • Li Wang
  • Changhao Zhang
  • Zengqi Xiu
  • Kai Lu
  • Xin Yu
  • Kui Zhang
  • Wenjun Wu

Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space: a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively (1) maps natural language problems into the problem space via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.
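The decoupling idea can be illustrated with a deliberately tiny toy, in which a rule-based "mapper" canonicalizes noisy word problems and a "reasoner" only ever sees the canonical form. This is purely a sketch of the framework's division of labor; DURIT's actual mapper and reasoner are learned via reinforcement learning and self-distillation, and every function and example below is hypothetical.

```python
import re

# Toy stand-ins: the mapper strips distractor phrasing down to a minimal
# arithmetic form; the reasoner operates only on canonical input.
def mapper(problem):
    nums = [int(n) for n in re.findall(r"\d+", problem)]
    op = "+" if ("more" in problem or "gains" in problem) else "-"
    return f"{nums[0]} {op} {nums[1]}"

def reasoner(canonical):
    a, op, b = canonical.split()
    return int(a) + int(b) if op == "+" else int(a) - int(b)

problems = [
    "Despite the rain, Ana had 3 apples and later gains 4 apples.",
    "Bo, who loves hats, had 9 coins but lost 2 coins at the fair.",
]
# Distractor phrasing never reaches the reasoner.
answers = [reasoner(mapper(p)) for p in problems]
print(answers)  # → [7, 7]
```

Because both surface forms collapse to the same canonical space, the reasoner's task stays small and uniform, which is the capacity argument the abstract makes for SLMs.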

JBHI Journal 2026 Journal Article

Direct PET-to-CT Generation for Attenuation Correction: A Slice-to-Slice Continual Transformer Segmentation-Aware Network

  • Rongjun Ge
  • Hanyuan Zheng
  • Yuxin Liu
  • Liutao Yang
  • Li Wang
  • Xu Ji
  • Jingtao Shen
  • Nan Li

Direct synthetic computed tomography (CT) generation from positron emission tomography (PET) plays a crucial role in PET attenuation correction, while providing detailed structural information to compensate for functional imaging. Compared to the widely used PET/CT and indirect PET/MR-CT, the direct PET-to-CT translation method (denoted as PET-to-CT) offers several advantages: 1) The CT required for PET-to-CT is directly obtained from PET, thereby avoiding the intermediate errors generated in the inter-step processes of multimodal scanning in PET/CT and PET/MR-CT. 2) Furthermore, direct PET-to-CT eliminates the requirement for supplementary imaging equipment, thereby reducing complexity and scan duration in contrast to PET/CT and PET/MR-CT imaging. Thus, direct PET-to-CT is highly promising for clinical applications. However, it faces challenges, including spatial resolution mismatches between PET and CT, as well as voxel-wise semantic differences arising from functional and structural imaging. To address these challenges, this paper proposes a 2D hierarchical method called the S2SCT (Slice-to-Slice Continual Transformer)-SA (Segmentation-Aware) Network. It uses a slice-continual network to acquire semantic transformation knowledge from each PET slice to a CT slice, facilitating the conversion between functional and structural imaging domains. Subsequently, the segmentation-aware network is designed to further capture spatial correlations both between and within slices, resulting in improved CT spatial resolution. The experimental results demonstrate that our proposed method outperforms mainstream methods in both CT generation and attenuation correction, as evidenced by both visual results and metric values.

JBHI Journal 2026 Journal Article

GPFD-Net: A Geometry-Pose Frequency Decoupling Network for Privacy-Preserving Human Action Recognition in Healthcare

  • Xing Li
  • Jingfan Liang
  • Ge Gao
  • Li Wang
  • Haifeng Wang
  • Shihao Han

Human Action Recognition (HAR) holds significant application value in healthcare informatics, facilitating tasks such as clinical diagnosis and rehabilitation monitoring. Point cloud sequences have emerged as a pivotal modality for balancing privacy preservation with high-fidelity geometric structural representation, ensuring anonymity while retaining critical 3D behavioral information. However, existing point cloud sequence encoding methods struggle to precisely encode micro-geometric details and macro-pose contours within the spatial dimension, as well as the dynamic heterogeneity of actions within the temporal dimension. These limitations impede the realization of high-precision clinical motion analysis. To address these challenges, we propose a Geometry-Pose Frequency Decoupling Network (GPFD-Net) for human action recognition. First, we design a Geometry-Pose Parallel-Collaborative Spatial Encoder (GPCSE). This module adopts a parallel dual-stream architecture to explicitly capture and fuse complementary micro-geometric details and macro-pose contours, generating an informative geometry-enhanced pose feature sequence. Second, we introduce a Frequency-Decoupled Temporal Capturer (FDTC). This module adaptively decomposes the geometry-enhanced pose feature sequence into a smooth trend sequence and a transient detail sequence, which are subsequently processed by two parallel expert encoders via differentiated encoding to achieve robust human action recognition. Extensive experiments on four public benchmark datasets demonstrate that GPFD-Net achieves superior performance. The proposed method provides a novel paradigm for high-precision and privacy-preserving motion analysis in healthcare applications.

AAAI Conference 2026 Conference Paper

MoMoREC: A Multi-agent Motivation Generation Framework for Residual Semantic ID-Aware Recommendation

  • Yige Wang
  • Mingming Li
  • Li Wang
  • Kaichen Zhao
  • Wangming Li
  • Weipeng Jiang
  • Xueying Li

Recent advances in the field of sequential recommendation have highlighted the potential of Large Language Models (LLMs) in enhancing item embeddings and improving user understanding. However, existing approaches face three major limitations: 1) insufficient understanding of the reasons behind users' purchase decisions, 2) the high-dimensional embeddings directly produced by LLMs are poorly compatible with traditional low-dimensional ID embeddings, and 3) reliance on additional fine-tuning and high inference overhead to adapt LLMs to the recommendation task. In this paper, we propose MoMoREC, a simple yet effective user-understanding-based recommendation strategy. This method leverages the intrinsic comprehension capabilities of LLMs combined with residual semantic IDs to better understand users. Specifically, starting from common user purchasing behaviors and incorporating item characteristics, we employ a multi-agent framework to utilize LLMs in analyzing user shopping motivations and extracting high-dimensional dense embeddings. These embeddings are then transformed into low-dimensional IDs using a residual semantic ID approach via clustering and residual dimensionality reduction, which can be fed into the recommendation model. MoMoREC effectively integrates the understanding power of LLMs with the strengths of recommendation systems, preserving rich semantic language embeddings while reducing or eliminating the need for auxiliary trainable modules. As a result, it seamlessly adapts to any sequential recommendation framework. Experiments on three benchmark datasets show that MoMoREC significantly improves traditional recommendation models, demonstrating its effectiveness and flexibility.
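The residual semantic ID step can be sketched as residual quantization: run k-means on the LLM embeddings, then run k-means again on the residuals, so each item collapses to a short tuple of small integers. The codebook sizes, the two-level depth, and the plain k-means routine below are illustrative assumptions, not MoMoREC's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=15):
    # Plain k-means; returns centroids and final assignments.
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(a == j):
                C[j] = X[a == j].mean(0)
    a = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return C, a

# Stand-ins for high-dimensional motivation embeddings from an LLM.
E = rng.normal(size=(500, 64))

# Level 1: coarse code; level 2: quantize what level 1 left behind.
C1, id1 = kmeans(E, k=16)
residual = E - C1[id1]
C2, id2 = kmeans(residual, k=16)

# Each item becomes a pair of small integers -- a residual semantic ID
# that fits ordinary low-dimensional ID embedding tables.
semantic_ids = np.stack([id1, id2], axis=1)
print(semantic_ids.shape)
```

Two 16-way codes give 256 distinguishable cells while each ID stays small enough to index a standard embedding table, which is the compatibility point the abstract raises.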

JBHI Journal 2026 Journal Article

TriFuse-Net: A Tri-Branch PET/CT Fusion Pyramid Network Enhanced by Lesion-Guided Structural-Metabolic Attention for Lung Cancer Diagnosis and Prognosis

  • Yuyu Liu
  • Jieqin Lv
  • Fangfang Yang
  • Huiqin Wu
  • Xiang Pan
  • Li Wang
  • Han Bai
  • Shunfang Wang

Diagnosis and prognosis of lung cancer via PET/CT imaging have long been major clinical concerns. However, existing multimodal approaches often focus on feature aggregation rather than cross-modal interactive collaboration, failing to capture the structural-metabolic correlations and multi-scale synergy essential for characterizing complex lesions. Therefore, this study proposes TriFuse-Net, a tri-branch PET/CT fusion pyramid network (FPN) enhanced by lesion-guided structural-metabolic attention (LSMA) to improve both diagnosis and prognosis prediction tasks. The model is composed of two identical unimodal branches (PET/CT) and one pyramid branch with an interacting channel and spatial attention. The pyramid structure enables bidirectional multiscale feature extraction and fusion, capturing both local details and global semantic information of lesions. Comprehensive experiments validated the model's superiority across three clinical tasks. TriFuse-Net achieved a C-index of 0.747 for progression-free survival (PFS) prediction, showing improvements of 14.7% and 11.0% over ResNet-CT and ResNet-PET, respectively. Additionally, the clinical-integrated model (TriFuse-Net-Cli) achieved AUCs of 0.947 for differentiating lung cancer from tuberculosis and 0.937 for identifying lymph-node metastasis. Ablation studies further confirmed the essential contributions of both FPN and LSMA. In summary, the proposed framework demonstrates that integrating multi-scale structural-metabolic relationships significantly enhances diagnosis and prognosis in lung cancer.

JBHI Journal 2026 Journal Article

Whisperization and Masked CycleGAN-Based Framework for Electrolaryngeal Speech Enhancement

  • Jie Zhou
  • Li Wang
  • Fengji Li
  • Shaochuan Zhang
  • Fan Fan
  • Tao Liu
  • Xiaohong Chen
  • Haijun Niu

Electrolarynx (EL) provides an effective approach to voice rehabilitation for patients with phonation disorders. However, due to its reliance on an external mechanical source, EL speech suffers from limited acoustic cues, leading to degraded quality and restricting the potential of subsequent modeling and enhancement. This paper proposes a novel EL speech enhancement framework that combines whisperization with a Masked CycleGAN model. The whisperization step removes redundant constant excitation and mechanical noise, generating an intermediate speech form, whisper-like EL (W-EL) speech, whose acoustic and perceptual properties are closer to natural whisper. Subsequently, the Masked CycleGAN employs a frame-level masking strategy to guide the generator in reconstructing missing prosodic and linguistic features. Thus, we achieve a dual-stage enhancement of “redundancy removal” and “deficiency compensation.” Acoustic feature analysis demonstrates that the converted W-EL speech is more similar to normal speech in terms of spectrogram, fundamental frequency (F0) values, and F0 contours, while also compensating for the missing low-frequency energy below 500 Hz. Objective evaluations show significant improvements across multiple metrics. Subjective evaluations confirm that W-EL speech exhibits higher naturalness and intelligibility compared to original EL speech. Moreover, the combined “whisperization + voice conversion” framework further enhances perceptual quality. This study not only offers a novel pathway for EL speech enhancement, but may also provide valuable insights for improving other types of pathological speech.

ICML Conference 2025 Conference Paper

Cape: Context-Aware Prompt Perturbation Mechanism with Differential Privacy

  • Haoqi Wu
  • Wei Dai
  • Li Wang
  • Qiang Yan

Large Language Models (LLMs) have gained significant popularity due to their remarkable capabilities in text understanding and generation. However, despite their widespread deployment in inference services such as ChatGPT, concerns about the potential leakage of sensitive user data have arisen. Existing solutions primarily rely on privacy-enhancing technologies to mitigate such risks, facing a trade-off among efficiency, privacy, and utility. To narrow this gap, we propose Cape, a context-aware prompt perturbation mechanism based on differential privacy, to enable efficient inference with an improved privacy-utility trade-off. Concretely, we introduce a hybrid utility function that better captures token similarity. Additionally, we propose a bucketized sampling mechanism to handle the large sampling space, which might otherwise lead to long-tail phenomena. Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility trade-off compared to prior state-of-the-art works.
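A generic version of differentially private token perturbation can be sketched with the exponential mechanism: replace each token by sampling from the vocabulary with probability proportional to exp(ε·u/2Δ), where u is a similarity-based utility. This sketch uses plain cosine similarity over random embeddings; Cape's hybrid utility function and bucketized sampler are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy vocabulary of 100 tokens with random 16-dim embeddings.
V = rng.normal(size=(100, 16))
vocab_norm = V / np.linalg.norm(V, axis=1, keepdims=True)

def perturb(token_id, eps):
    # Utility: cosine similarity to the original token's embedding.
    u = vocab_norm @ vocab_norm[token_id]
    # Exponential mechanism: P(c) ∝ exp(eps * u(c) / (2 * sensitivity)).
    sensitivity = 2.0  # cosine similarity ranges over [-1, 1]
    logits = eps * u / (2 * sensitivity)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

low = [perturb(7, eps=0.1) for _ in range(200)]   # strong privacy, noisy
high = [perturb(7, eps=50.0) for _ in range(200)]  # weak privacy, faithful
```

Lower ε pushes the distribution toward uniform (the token is rarely preserved); higher ε concentrates mass on similar tokens, which is exactly the privacy-utility dial the abstract describes.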

NeurIPS Conference 2025 Conference Paper

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

  • Simin Li
  • Zihao Mao
  • Hanxiao Li
  • Zonglei Jing
  • Zhuohang bian
  • Jun Guo
  • Li Wang
  • Zhuoran Han

In cooperative Multi-Agent Reinforcement Learning (MARL), it is common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions, a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvements in cooperation, robustness, and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones.

JBHI Journal 2025 Journal Article

PKAN: Leveraging Kolmogorov–Arnold Networks and Multi-Modal Learning for Peptide Prediction With Advanced Language Models

  • Li Wang
  • Xiangzheng Fu
  • Xiucai Ye
  • Tetsuya Sakurai
  • Xiangxiang Zeng
  • Yiping Liu

Peptides can offer highly specific biological activities, serving as essential mediators of intercellular signaling, which are critical for advancing precision medicine and drug development. Their primary structure can be depicted either as an amino acid sequence or as a chemical molecule consisting of atoms and chemical bonds. Large language models (LLMs) hold the potential to thoroughly elucidate the intricate intrinsic properties of peptides. Here we present the Peptide Kolmogorov-Arnold Network (PKAN), a framework leveraging multi-modal representations inspired by advanced language models for peptide activity and functionality prediction. Comparative experiments across tasks show that PKAN outperforms state-of-the-art models while maintaining a streamlined design with superior predictive capabilities. The multi-modal feature importance scoring, anchored in global structures and the significant marginal impacts of derived features on the model, coupled with intricate symbolic regression of specific activation functions, further demonstrates the robustness and precision of the PKAN framework in identifying and elucidating key determinants of peptide functionality. This work provides scientific evidence for investigating the complex mechanisms of peptide materials and supports the progression of peptide language paradigms in biology.

JBHI Journal 2025 Journal Article

Synergistic Drug Combination Prediction via Dual-Level Feature Aggregation and Knowledge Graph-Based Deep Neural Network

  • Ying Zuo
  • Yan Zhang
  • Li Wang
  • Jianping Yu
  • Jiawei Luo
  • Qiu Xiao

Identifying synergistic drug combinations is a critical but difficult challenge in cancer treatment, owing to the sheer complexity and enormous number of possible drug combinations. However, most existing computational methods rely on a single data perspective and often overlook the complexity of interactions between different biological entities. Furthermore, they fail to fully integrate the intrinsic properties of drugs and cell lines with the broader biological relationships that play a crucial role in drug synergy. To address these challenges, we propose a novel framework called LGSyn that integrates two types of information: local features, including molecular fingerprints, descriptors, and gene expression profiles, and global features that encompass broader biological interactions, including drug-protein, protein-cell line, protein-protein, and cell line-tissue interactions. By combining these two types of features, LGSyn leverages the full spectrum of biological knowledge to predict drug synergy. In LGSyn, we developed three fusion strategies to effectively integrate local and global information and identified the most suitable strategy. The resulting fused feature vectors are then fed into a deep neural network for training and synergy prediction. Experimental results demonstrate that the proposed method outperforms current state-of-the-art models, achieving superior accuracy and stability in drug synergy prediction.

NeurIPS Conference 2025 Conference Paper

V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

  • Lei Yang
  • Xinyu Zhang
  • Jun Li
  • Chen Wang
  • Jiaqi Ma
  • Zhiying Song
  • Tong Zhao
  • Ziying Song

Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby enhancing the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged; however, these datasets primarily focus on cameras and LiDAR, neglecting 4D Radar, a sensor used in single-vehicle autonomous driving to provide robust perception in adverse weather conditions. In this paper, to bridge the gap created by the absence of 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large-scale, real-world multi-modal dataset featuring 4D Radar. The V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data encompasses sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as various typical challenging scenarios. The dataset consists of 20K LiDAR frames, 40K camera images, and 20K 4D Radar frames, including 350K annotated boxes across five categories. To support various research domains, we have established V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. Furthermore, we provide comprehensive benchmarks across these three sub-datasets.

IJCAI Conference 2024 Conference Paper

LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory

  • Zicheng Liu
  • Li Wang
  • Siyuan Li
  • Zedong Wang
  • Haitao Lin
  • Stan Z. Li

Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism's computational cost limits its practicality for long sequences. Although existing attention variants improve computational efficiency, they have a limited ability to abstract global information effectively due to their hand-crafted mixing strategies. On the other hand, state-space models (SSMs) are tailored for long sequences but cannot capture complicated local information. Therefore, combining the two as a unified token mixer is a trend in recent long-sequence models. However, linearized attention degrades performance significantly even when equipped with SSMs. To address this issue, we propose a new method called LongVQ. LongVQ uses the vector quantization (VQ) technique to compress the global abstraction into a fixed-length codebook, enabling linear-time computation of the attention matrix. This technique effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling. Our experiments on the Long Range Arena benchmark, autoregressive language modeling, and image and speech classification demonstrate the effectiveness of LongVQ. Our model achieves significant improvements over other sequence models, including variants of Transformers, Convolutions, and recent State Space Models.
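The core complexity argument can be sketched as follows: if keys are quantized onto a codebook of m entries, each query attends to m pooled slots instead of n positions, so the attention cost drops from O(n²) to O(n·m). The fixed random codebook and mean-pooled values below are simplifying assumptions rather than LongVQ's learned formulation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 2048, 32, 64  # sequence length, head dim, codebook size

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# A fixed random codebook stands in for the learned VQ codes.
codebook = rng.normal(size=(m, d))

# Assign each key to its nearest code and mean-pool values per code;
# afterwards each query touches m slots instead of n positions.
assign = ((K[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
pooled_v = np.zeros((m, d))
np.add.at(pooled_v, assign, V)
counts = np.bincount(assign, minlength=m)
pooled_v[counts > 0] /= counts[counts > 0][:, None]

# O(n * m) attention against the codebook rather than O(n^2).
logits = Q @ codebook.T / np.sqrt(d)
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
out = w @ pooled_v
print(out.shape)
```

With m fixed (here 64) the attention matrix has shape (n, m), so doubling the sequence length doubles, rather than quadruples, the attention cost.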

JMLR Journal 2024 Journal Article

Nonparametric Regression for 3D Point Cloud Learning

  • Xinyi Li
  • Shan Yu
  • Yueying Wang
  • Guannan Wang
  • Li Wang
  • Ming-Jun Lai

In recent years, there has been an exponential increase in the amount of point cloud data with irregular shapes collected in various areas. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the triangulation to extract the underlying signal and build up a 3D solid model from the point cloud. The proposed method can denoise or deblur the point cloud effectively, provide a multi-resolution reconstruction of the actual signal, and handle sparse and irregularly distributed point clouds to recover the underlying trajectory. In addition, our method provides a natural way of numerosity data reduction. We establish the theoretical guarantees of the proposed method, including the convergence rate and asymptotic normality of the estimator, and show that the convergence rate achieves the optimal nonparametric rate. We also introduce a bootstrap method to quantify the uncertainty of the estimators. Through extensive simulation studies and a real data example, we demonstrate the superiority of the proposed method over traditional smoothing methods in terms of estimation accuracy and efficiency of data reduction.

JBHI Journal 2024 Journal Article

Progressive Dual Priori Network for Generalized Breast Tumor Segmentation

  • Li Wang
  • Lihui Wang
  • Zixiang Kuai
  • Lei Tang
  • Yingfeng Ou
  • Min Wu
  • Tianliang Shi
  • Chen Ye

To promote the generalization ability of breast tumor segmentation models, as well as to improve segmentation performance for breast tumors with small size, low contrast, and irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic contrast-enhanced magnetic resonance images (DCE-MRI) acquired at different centers. PDPNet first crops tumor regions with a coarse-segmentation-based localization module, then progressively refines the breast tumor mask using weak semantic priors and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results showed that, compared with the second-best method, the DSC and HD95 of PDPNet were improved by at least 5.13% and 7.58%, respectively, on multi-center test sets. In addition, through ablations, we demonstrated that the proposed localization module can decrease the influence of normal tissues and therefore improve the generalization ability of the model. The weak semantic priors allow focusing on tumor regions to avoid missing small and low-contrast tumors. The cross-scale correlation priors are beneficial for promoting the shape-aware ability for irregular tumors. Thus, integrating them in a unified framework improves multi-center breast tumor segmentation performance.

JBHI Journal 2024 Journal Article

RClaNet: An Explainable Alzheimer's Disease Diagnosis Framework by Joint Registration and Classification

  • Liang Wu
  • Shunbo Hu
  • Duanwei Wang
  • Changchun Liu
  • Li Wang

Alzheimer's disease (AD) is an irreversible neurodegenerative disease that affects patients' ability to carry out daily activities. Unfortunately, there is currently no known cure for AD. Thus, early detection of AD plays a key role in preventing and controlling its progression. As a representative method for measuring brain atrophy, the image registration technique has been widely adopted for AD diagnosis. In this study, an AD-assisted diagnosis framework based on joint registration and classification is proposed. Specifically, to capture more local deformation information, a novel patch-based joint brain image registration and classification network (RClaNet) is proposed to estimate local dense deformation fields (DDF) and disease risk probability maps (DRM) that explain high-risk areas for AD patients. RClaNet consists of a registration network and a classification network, in which the deformation field from the registration network is fed into the classification network to enhance the prediction accuracy of the disease. Then, an exponential distance weighting method is used to obtain the global DDF and the global DRM without grid-like artifacts. Finally, the global classification network uses the global DRM for the early detection of AD. We evaluate the proposed method on the OASIS-3, AIBL, ADNI, and COVID-19 datasets, and experimental results show that the proposed RClaNet achieves superior registration performance compared with several state-of-the-art methods. Early diagnosis of AD using the global DRM also yielded competitive results. These experiments prove that the deformation information in the registration process can be used to characterize subtle changes in degenerative diseases and further assist clinicians in diagnosis.

IJCAI Conference 2024 Conference Paper

RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

  • Ziying Song
  • Guoxing Zhang
  • Lin Liu
  • Lei Yang
  • Shaoqing Xu
  • Caiyan Jia
  • Feiyang Jia
  • Li Wang

Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM for AD scenarios, named SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We further employ wavelet decomposition to denoise the depth-guided images, reducing residual noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.

ICML Conference 2024 Conference Paper

Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences

  • Zicheng Liu 0006
  • Siyuan Li 0002
  • Li Wang
  • Zedong Wang
  • Yunfan Liu 0002
  • Stan Z. Li

To mitigate the computational complexity of the self-attention mechanism on long sequences, linear attention utilizes computation tricks to achieve linear complexity, while state space models (SSMs) popularize a favourable practice of using a non-data-dependent memory pattern, i.e., emphasizing the near and neglecting the distant, to process sequences. Recent studies have shown the benefits of combining the two. However, the efficiency of linear attention remains only at the theoretical level in a causal setting, and SSMs require various designed constraints to operate effectively on specific data. Therefore, in order to unveil the true power of the hybrid design, two issues need to be addressed: (1) hardware-efficient implementation of linear attention and (2) stabilization of SSMs. To achieve this, we leverage the ideas of tiling and hierarchy to propose CHELA (short-long Convolutions with Hardware-Efficient Linear Attention), which replaces SSMs with short-long convolutions and implements linear attention in a divide-and-conquer manner. This approach enjoys global abstraction and data-dependent selection from stable SSMs and linear attention while maintaining real linear complexity. Our comprehensive experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
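The short-long convolution idea can be sketched in one dimension: a small kernel captures local patterns, while a sequence-length kernel applied via FFT provides global, SSM-like mixing in O(n log n). The decaying global filter and the plain sum of the two branches are assumptions for illustration, not CHELA's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1024
x = rng.normal(size=n)

# Short branch: a small kernel capturing local patterns.
short_k = rng.normal(size=7)
short_out = np.convolve(x, short_k, mode="same")

# Long branch: a sequence-length decaying filter applied via FFT,
# i.e., global mixing in O(n log n) instead of O(n^2).
long_k = np.exp(-0.01 * np.arange(n))
fft_len = 2 * n  # zero-pad so circular convolution equals linear
long_out = np.fft.irfft(
    np.fft.rfft(x, fft_len) * np.fft.rfft(long_k, fft_len), fft_len
)[:n]

# Combine the branches (a plain sum stands in for the hierarchical,
# gated combination described in the paper).
y = short_out + long_out
print(y.shape)
```

The zero-padding to 2n is what makes the FFT product equal a linear (not circular) convolution over the first n outputs, the standard trick behind long-convolution sequence models.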

NeurIPS Conference 2024 Conference Paper

Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs

  • Qinpeng Cui
  • Yixuan Liu
  • Xinyi Zhang
  • Qiqi Bao
  • Qingmin Liao
  • Li Wang
  • Tian Lu
  • Zicheng Liu

Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast, customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion-prior-based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency.

AAAI Conference 2023 Conference Paper

BERT-ERC: Fine-Tuning BERT Is Enough for Emotion Recognition in Conversation

  • Xiangyu Qin
  • Zhiyu Wu
  • Tingting Zhang
  • Yanran Li
  • Jian Luan
  • Bin Wang
  • Li Wang
  • Jinshi Cui

Previous works on emotion recognition in conversation (ERC) follow a two-step paradigm, which can be summarized as first producing context-independent features via fine-tuning pretrained language models (PLMs) and then analyzing contextual information and dialogue structure information among the extracted features. However, we discover that this paradigm has several limitations. Accordingly, we propose a novel paradigm, i.e., exploring contextual information and dialogue structure information in the fine-tuning step, and adapting the PLM to the ERC task in terms of input text, classification structure, and training strategy. Furthermore, we develop our model BERT-ERC according to the proposed paradigm, which improves ERC performance in three aspects, namely suggestive text, fine-grained classification module, and two-stage training. Compared to existing methods, BERT-ERC achieves substantial improvement on four datasets, indicating its effectiveness and generalization capability. Besides, we also set up a limited-resources scenario and an online prediction scenario to approximate real-world settings. Extensive experiments demonstrate that the proposed paradigm significantly outperforms the previous one and can be adapted to various scenarios.

AAAI Conference 2023 Conference Paper

The Implicit Regularization of Momentum Gradient Descent in Overparametrized Models

  • Li Wang
  • Zhiguo Fu
  • Yingcong Zhou
  • Zili Yan

The study of the implicit regularization induced by gradient-based optimization in deep learning is a long-standing pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) in the continuous-time view, the so-called momentum gradient flow (MGF). We show that the components of the weight vector of a deep linear neural network are learned at different evolution rates, and that this evolution gap increases with the depth. Firstly, we show that if the depth equals one, the evolution gap between the weight vector components is linear, which is consistent with the behavior of ridge regression. In particular, we establish a tight coupling between MGF and ridge for least squares regression. In detail, we show that when the regularization parameter of ridge is inversely proportional to the square of the time parameter of MGF, the risk of MGF is no more than 1.54 times that of ridge, and their relative Bayesian risks are almost indistinguishable. Secondly, if the model becomes deeper, i.e., the depth is greater than or equal to 2, the evolution gap becomes more significant, which implies an implicit bias towards sparse solutions. Numerical experiments strongly support our theoretical results.
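The continuous-time view and the ridge coupling stated in the abstract can be summarized as follows (the momentum coefficient α and loss L are assumed notation, not given in the abstract):

```latex
% Momentum gradient flow (MGF): the continuous-time limit of MGD.
\ddot{\theta}(t) + \alpha\,\dot{\theta}(t) = -\nabla L\bigl(\theta(t)\bigr)

% For least squares regression, coupling MGF at time t with ridge whose
% penalty decays as \lambda(t) \propto 1/t^{2} yields the stated bound:
\mathrm{Risk}_{\mathrm{MGF}}(t) \;\le\; 1.54\,\mathrm{Risk}_{\mathrm{ridge}}\bigl(\lambda(t)\bigr)
```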

AAAI Conference 2023 Conference Paper

Transfer Learning Enhanced DeepONet for Long-Time Prediction of Evolution Equations

  • Wuzhe Xu
  • Yulong Lu
  • Li Wang

Deep operator network (DeepONet) has demonstrated great success in various learning tasks, including learning solution operators of partial differential equations. In particular, it provides an efficient approach to predicting evolution equations over a finite time horizon. Nevertheless, the vanilla DeepONet suffers from stability degradation in long-time prediction. This paper proposes a transfer-learning-aided DeepONet to enhance stability. Our idea is to use transfer learning to sequentially update the DeepONets as the surrogates for propagators learned in different time frames. The evolving DeepONets can better track the varying complexities of the evolution equations, while only needing to be updated by efficient training of a tiny fraction of the operator networks. Through systematic experiments, we show that the proposed method not only improves the long-time accuracy of DeepONet while maintaining similar computational cost but also substantially reduces the sample size of the training set.

AAAI Conference 2023 Conference Paper

Video-Audio Domain Generalization via Confounder Disentanglement

  • Shengyu Zhang
  • Xusheng Feng
  • Wenyan Fan
  • Wenjing Fang
  • Fuli Feng
  • Wei Ji
  • Shuo Li
  • Li Wang

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, i.e., multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.

AAAI Conference 2022 Conference Paper

Cross-Dataset Collaborative Learning for Semantic Segmentation in Autonomous Driving

  • Li Wang
  • Dong Li
  • Han Liu
  • Jinzhang Peng
  • Lu Tian
  • Yi Shan

Semantic segmentation is an important task for scene understanding in self-driving cars and robotics, which aims to assign dense labels to all pixels in the image. Existing work typically improves semantic segmentation performance by exploring different network architectures on a target dataset. Little attention has been paid to building a unified system by simultaneously learning from multiple datasets, due to the inherent distribution shift across different datasets. In this paper, we propose a simple, flexible, and general method for semantic segmentation, termed Cross-Dataset Collaborative Learning (CDCL). Our goal is to train a unified model that improves performance on each dataset by leveraging information from all the datasets. Specifically, we first introduce a family of Dataset-Aware Blocks (DAB) as the fundamental computing units of the network, which help capture homogeneous convolutional representations and heterogeneous statistics across different datasets. Second, we present a Dataset Alternation Training (DAT) mechanism to facilitate the collaborative optimization procedure. We conduct extensive evaluations on diverse semantic segmentation datasets for autonomous driving. Experiments demonstrate that our method consistently achieves notable improvements over prior single-dataset and cross-dataset training methods without introducing extra FLOPs. In particular, with the same PSPNet (ResNet-18) architecture, our method outperforms the single-dataset baseline by 5.65%, 6.57%, and 5.79% mIoU on the validation sets of Cityscapes, BDD100K, and CamVid, respectively. We also apply CDCL to point cloud 3D semantic segmentation and achieve improved performance, which further validates the superiority and generality of our method. Code and models will be released.
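One plausible reading of the "homogeneous representations / heterogeneous statistics" split is shared weights with per-dataset normalization parameters; the sketch below illustrates that idea only, and the class name, shapes, and normalization choice are assumptions, not the paper's actual DAB design.

```python
import numpy as np

# Hedged sketch of a dataset-aware layer: weights are shared across
# datasets, while affine normalization parameters are kept per dataset.
class DatasetAwareLinear:
    def __init__(self, d_in, d_out, n_datasets, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out)) * 0.1  # shared weights
        # per-dataset affine normalization parameters (heterogeneous part)
        self.gamma = np.ones((n_datasets, d_out))
        self.beta = np.zeros((n_datasets, d_out))

    def __call__(self, x, dataset_id):
        h = x @ self.W                                     # shared compute
        mu = h.mean(axis=0, keepdims=True)                 # batch statistics
        sd = h.std(axis=0, keepdims=True) + 1e-5
        hn = (h - mu) / sd                                 # normalize
        return self.gamma[dataset_id] * hn + self.beta[dataset_id]

layer = DatasetAwareLinear(d_in=8, d_out=4, n_datasets=3)
x = np.random.default_rng(1).standard_normal((16, 8))
y0 = layer(x, dataset_id=0)
layer.gamma[1] = 2.0          # pretend dataset 1 learned different statistics
y1 = layer(x, dataset_id=1)   # same input, dataset-specific output
```

During alternation training, each mini-batch would come from one dataset and update the shared weights plus only that dataset's normalization parameters.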

NeurIPS Conference 2022 Conference Paper

DARE: Disentanglement-Augmented Rationale Extraction

  • Linan Yue
  • Qi Liu
  • Yichao Du
  • Yanqing An
  • Li Wang
  • Enhong Chen

Rationale extraction can be considered a straightforward method of improving model explainability, where rationales are a subsequence of the original inputs that can be extracted to support the prediction results. Existing methods mainly cascade a selector, which extracts the rationale tokens, with a predictor, which makes the prediction based on the selected tokens. However, previous works fail to fully exploit the original input, since the information of non-selected tokens is ignored. In this paper, we therefore propose a Disentanglement-Augmented Rationale Extraction (DARE) method, which encapsulates more information from the input to extract rationales. Specifically, it first disentangles the input into rationale representations and non-rationale ones, and then learns more comprehensive rationale representations for extraction by minimizing the mutual information (MI) between the two disentangled representations. Besides, to improve the performance of MI minimization, we develop a new MI estimator by exploring existing MI estimation methods. Extensive experimental results on three real-world datasets and simulation studies clearly validate the effectiveness of our proposed method. Code is released at https://github.com/yuelinan/DARE.

NeurIPS Conference 2022 Conference Paper

HSDF: Hybrid Sign and Distance Field for Modeling Surfaces with Arbitrary Topologies

  • Li Wang
  • Jie Yang
  • Weikai Chen
  • Xiaoxu Meng
  • Bo Yang
  • Jintao Li
  • Lin Gao

Neural implicit functions based on the signed distance field (SDF) have achieved impressive progress in reconstructing 3D models with high fidelity. However, such approaches can only represent closed shapes. Recent works based on the unsigned distance function (UDF) have been proposed to handle both watertight and open surfaces. Nonetheless, as UDF is signless, its direct output is limited to point clouds, which imposes an additional challenge on extracting high-quality meshes from discrete points. To address this issue, we present a new learnable implicit representation, termed HSDF, that connects the good ends of SDF and UDF. In particular, HSDF is able to represent arbitrary topologies containing both closed and open surfaces while being compatible with existing iso-surface extraction techniques for easy field-to-mesh conversion. In addition to predicting a UDF, we propose to learn an additional sign field via a simple classifier. Unlike traditional SDF, HSDF is able to locate the surface of interest before level surface extraction by generating surface points following NDF (Chibane et al., 2020). We are then able to obtain open surfaces via an adaptive meshing approach that only instantiates regions containing the surface into a polygon mesh. We also propose HSDF-Net, a dedicated learning framework that factorizes the learning of HSDF into two easier problems. Experiments on multiple datasets show that HSDF outperforms state-of-the-art techniques both qualitatively and quantitatively.
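The core combination the abstract describes, pairing a UDF with a separately predicted sign field so that standard zero-level extraction applies, can be illustrated in one dimension; the toy surface location and the hard sign field below are assumptions standing in for learned networks.

```python
import numpy as np

# Hedged 1-D illustration of the HSDF idea: multiply an unsigned distance
# field by a predicted sign field to obtain a signed-like field whose
# zero crossings can be extracted like an ordinary SDF iso-surface.
# Toy "surface" at x = 0.3; the sign field stands in for a classifier.
x = np.linspace(0.0, 1.0, 100)
udf = np.abs(x - 0.3)                  # unsigned distance to the surface
sign = np.where(x < 0.3, -1.0, 1.0)   # stand-in for the learned sign field
hsdf = sign * udf                      # hybrid field, SDF-compatible

# Extract the surface as sign changes of the hybrid field.
crossings = np.nonzero(np.diff(np.sign(hsdf)) != 0)[0]
surface_x = 0.5 * (x[crossings] + x[crossings + 1])
```

In 3D the same composition feeds directly into marching-cubes-style extraction, which is exactly the compatibility the abstract claims UDF alone lacks.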

ICLR Conference 2022 Conference Paper

HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation

  • Boyan Li
  • Hongyao Tang
  • Yan Zheng 0002
  • Jianye Hao
  • Pengyi Li 0001
  • Zhen Wang 0004
  • Zhaopeng Meng
  • Li Wang

Discrete-continuous hybrid action space is a natural setting in many practical problems, such as robot control and game AI. However, most previous Reinforcement Learning (RL) works only demonstrate success in controlling with either discrete or continuous action space, while seldom taking the hybrid action space into account. One naive way to address hybrid action RL is to convert the hybrid action space into a unified homogeneous action space by discretization or continualization, so that conventional RL algorithms can be applied. However, this ignores the underlying structure of the hybrid action space and also induces scalability issues and additional approximation difficulties, thus leading to degenerated results. In this paper, we propose Hybrid Action Representation (HyAR) to learn a compact and decodable latent representation space for the original hybrid action space. HyAR constructs the latent space and embeds the dependence between discrete action and continuous parameter via an embedding table and a conditional Variational Auto-Encoder (VAE). To further improve the effectiveness, the action representation is trained to be semantically smooth through unsupervised environmental dynamics prediction. Finally, the agent learns its policy with conventional DRL algorithms in the learned representation space and interacts with the environment by decoding the hybrid action embeddings to the original action space. We evaluate HyAR in a variety of environments with discrete-continuous action space. The results demonstrate the superiority of HyAR when compared with previous baselines, especially for high-dimensional action spaces.

ICML Conference 2022 Conference Paper

Individual Reward Assisted Multi-Agent Reinforcement Learning

  • Li Wang
  • Yupeng Zhang
  • Yujing Hu
  • Weixun Wang
  • Chongjie Zhang
  • Yang Gao 0001
  • Jianye Hao
  • Tangjie Lv

In many real-world multi-agent systems, the sparsity of team rewards often makes it difficult for an algorithm to successfully learn a cooperative team policy. At present, the common way for solving this problem is to design some dense individual rewards for the agents to guide the cooperation. However, most existing works utilize individual rewards in ways that do not always promote teamwork and sometimes are even counterproductive. In this paper, we propose Individual Reward Assisted Team Policy Learning (IRAT), which learns two policies for each agent from the dense individual reward and the sparse team reward with discrepancy constraints for updating the two policies mutually. Experimental results in different scenarios, such as the Multi-Agent Particle Environment and the Google Research Football Environment, show that IRAT significantly outperforms the baseline methods and can greatly promote team policy learning without deviating from the original team objective, even when the individual rewards are misleading or conflict with the team rewards.

AAAI Conference 2022 Conference Paper

Privacy-Preserving Face Recognition in the Frequency Domain

  • Yinggui Wang
  • Jian Liu
  • Man Luo
  • Le Yang
  • Li Wang

Some applications require performing face recognition (FR) on third-party servers, which could be accessed by attackers with malicious intents to compromise the privacy of users’ face information. This paper advocates a practical privacy-preserving frequency-domain FR scheme without key management. The new scheme first collects the components with the same frequency from different blocks of a face image to form component channels. Only part of the channels are retained and fed into the analysis network, which performs an interpretable privacy-accuracy trade-off analysis to identify channels important for face image visualization but not crucial for maintaining high FR accuracy. For this purpose, the loss function of the analysis network consists of the empirical FR error loss and a face visualization penalty term, and the network is trained in an end-to-end manner. We find that with the developed analysis network, more than 94% of the image energy can be dropped while the face recognition accuracy stays almost undegraded. In order to further protect the remaining frequency components, we propose a fast masking method. The effectiveness of the new scheme in removing the visual information of face images while maintaining their distinguishability is validated over several large face datasets. Results show that the proposed scheme achieves recognition performance and inference time comparable to ArcFace operating directly on original face images.
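The channel-forming step the abstract describes, gathering the same-frequency component from every image block into one channel, is a pure reindexing and can be sketched directly; the block transform itself (e.g. a DCT) and the learned channel selection are omitted, and the stand-in coefficient array is an assumption.

```python
import numpy as np

# Hedged sketch of forming component channels: collect component (u, v)
# from every 8x8 block of a (transformed) face image into one channel,
# giving 64 channels of shape (H/8, W/8). The block transform and the
# learned analysis network that picks channels to drop are not shown.
def to_frequency_channels(coeffs, b=8):
    H, W = coeffs.shape
    blocks = coeffs.reshape(H // b, b, W // b, b)   # (rows, u, cols, v)
    # channels[u*b + v] holds component (u, v) from every block
    return blocks.transpose(1, 3, 0, 2).reshape(b * b, H // b, W // b)

coeffs = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in coefficients
channels = to_frequency_channels(coeffs)   # (64, 4, 4)
kept = channels[:6]   # mimic retaining only a subset of channels
```

Retention then amounts to slicing this channel axis, which is what makes the privacy-accuracy trade-off a per-channel decision.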

TIST Journal 2022 Journal Article

Toward Scalable and Privacy-preserving Deep Neural Network via Algorithmic-Cryptographic Co-design

  • Jun Zhou
  • Longfei Zheng
  • Chaochao Chen
  • Yan Wang
  • Xiaolin Zheng
  • Bingzhe Wu
  • Cen Chen
  • Li Wang

Deep Neural Networks (DNNs) have achieved remarkable progress in various real-world applications, especially when abundant training data are provided. However, data isolation has become a serious problem. Existing works build privacy-preserving DNN models from either an algorithmic or a cryptographic perspective. The former mainly splits the DNN computation graph between data holders or between data holders and server, which demonstrates good scalability but suffers from accuracy loss and potential privacy risks. In contrast, the latter leverages time-consuming cryptographic techniques, which have strong privacy guarantees but poor scalability. In this article, we propose SPNN, a Scalable and Privacy-preserving deep Neural Network learning framework designed from an algorithmic-cryptographic co-perspective. From the algorithmic perspective, we split the computation graph of DNN models into two parts, i.e., the private-data-related computations performed by data holders and the remaining heavy computations delegated to a semi-honest server with high computational capacity. From the cryptographic perspective, we propose using two types of cryptographic techniques, i.e., secret sharing and homomorphic encryption, for the isolated data holders to conduct private-data-related computations privately and cooperatively. Furthermore, we implement SPNN in a decentralized setting and introduce user-friendly APIs. Experimental results on real-world datasets demonstrate the superiority of the proposed SPNN.
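Of the two cryptographic tools the abstract names, additive secret sharing is the simpler to illustrate; the field size, party count, and protocol shape below are generic illustrations, not SPNN's actual protocol.

```python
import numpy as np

# Hedged sketch of additive secret sharing over a prime field: each data
# holder splits its private value into random shares, and sums can be
# computed sharewise without revealing any individual input.
P = 2_147_483_647                      # a Mersenne prime as the field modulus
rng = np.random.default_rng(42)

def share(x, n_parties=2):
    """Split integer x into n additive shares mod P."""
    shares = rng.integers(0, P, size=n_parties - 1)
    last = (x - shares.sum()) % P
    return list(shares) + [int(last)]

def reconstruct(shares):
    return int(sum(shares) % P)

# Two holders secretly share their private inputs ...
a_shares, b_shares = share(1234), share(5678)
# ... and any party can add sharewise without seeing 1234 or 5678.
sum_shares = [(sa + sb) % P for sa, sb in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 1234 + 5678
```

Linear operations commute with sharing in this way, which is why the private-data-related (typically linear) layers can be computed cooperatively while the heavy nonlinear work is delegated to the server in the clear.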

AAAI Conference 2022 Conference Paper

Two-Stage Octave Residual Network for End-to-End Image Compression

  • Fangdong Chen
  • Yumeng Xu
  • Li Wang

Octave Convolution (OctConv) is a generic convolutional unit that has already achieved good performance in many computer vision tasks. Recent studies have also shown the potential of applying OctConv in end-to-end image compression. However, considering the characteristics of the image compression task, current applications of OctConv may limit the performance of the image compression network due to the loss of spatial information caused by the sampling operations of inter-frequency communication. Besides, the correlation between multi-frequency latents produced by OctConv is not utilized in current architectures. In this paper, to address these problems, we propose a novel Two-stage Octave Residual (ToRes) block which strips the sampling operation from OctConv to strengthen the capability of preserving useful information. Moreover, to capture the redundancy between the multi-frequency latents, a context transfer module is designed. The results show that both the ToRes block and the incorporation of the context transfer module help to improve the rate-distortion performance, and the combination of these two strategies enables our model to achieve state-of-the-art performance and outperform the latest compression standard Versatile Video Coding (VVC) in terms of both PSNR and MS-SSIM.

IJCAI Conference 2022 Conference Paper

Vertically Federated Graph Neural Network for Privacy-Preserving Node Classification

  • Chaochao Chen
  • Jun Zhou
  • Longfei Zheng
  • Huiwen Wu
  • Lingjuan Lyu
  • Jia Wu
  • Bingzhe Wu
  • Ziqi Liu

Recently, Graph Neural Networks (GNNs) have achieved remarkable progress in various real-world tasks on graph data, which consist of node features and adjacency information between different nodes. High-performance GNN models always depend on both rich features and complete edge information in the graph. However, such information could possibly be isolated by different data holders in practice, which is the so-called data isolation problem. To solve this problem, in this paper, we propose VFGNN, a federated GNN learning paradigm for the privacy-preserving node classification task under the vertically partitioned data setting, which can be generalized to existing GNN models. Specifically, we split the computation graph into two parts. We leave the private-data-related computations (i.e., involving features, edges, and labels) on the data holders, and delegate the rest of the computations to a semi-honest server. We also propose to apply differential privacy to prevent potential information leakage from the server. We conduct experiments on three benchmarks and the results demonstrate the effectiveness of VFGNN.

NeurIPS Conference 2022 Conference Paper

Weighted Mutual Learning with Diversity-Driven Model Compression

  • Miao Zhang
  • Li Wang
  • David Campos
  • Wei Huang
  • Chenjuan Guo
  • Bin Yang

Online distillation attracts attention from the community as it simplifies the traditional two-stage knowledge distillation process into a single stage. Online distillation collaboratively trains a group of peer models, which are treated as students, and all students gain extra knowledge from each other. However, memory consumption and diversity among peers are two key challenges to the scalability and quality of online distillation. To address these two challenges, this paper presents a framework called Weighted Mutual Learning with Diversity-Driven Model Compression (WML) for online distillation. First, building on a hierarchical structure in which peers share different parts, we leverage structured network pruning to generate diversified peer models and reduce the memory requirements. Second, rather than taking the average of peers, this paper, for the first time, leverages a bi-level formulation to estimate the relative importance of peers in closed form, to further boost the effectiveness of the distillation from each other. Extensive experiments show the generalization of the proposed framework, which outperforms existing online distillation methods on a variety of deep neural networks. More interestingly, as a byproduct, WML produces a series of pruned models under different model sizes in a single run, which also achieve competitive results compared with existing channel pruning methods.

AAAI Conference 2022 Conference Paper

What about Inputting Policy in Value Function: Policy Representation and Policy-Extended Value Function Approximator

  • Hongyao Tang
  • Zhaopeng Meng
  • Jianye Hao
  • Chen Chen
  • Daniel Graves
  • Dong Li
  • Changmin Yu
  • Hangyu Mao

We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing characteristic, i.e., value generalization among policies. We formally analyze the value generalization under Generalized Policy Iteration (GPI). Through theoretical and empirical lenses, we show that generalized value estimates offered by PeVFA may have lower initial approximation error to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on the above clues, we introduce a new form of GPI with PeVFA which leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of the value generalization offered by PeVFA and policy representation learning in several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement over its vanilla counterpart in most environments.

IJCAI Conference 2021 Conference Paper

Preference-Adaptive Meta-Learning for Cold-Start Recommendation

  • Li Wang
  • Binbin Jin
  • Zhenya Huang
  • Hongke Zhao
  • Defu Lian
  • Qi Liu
  • Enhong Chen

In recommender systems, the cold-start problem is a critical issue. To alleviate this problem, an emerging direction adopts meta-learning frameworks and achieves success. Most existing works aim to learn globally shared prior knowledge across all users so that it can be quickly adapted to a new user with sparse interactions. However, globally shared prior knowledge may be inadequate to discern users’ complicated behaviors and causes poor generalization. Therefore, we argue that prior knowledge should be locally shared by users with similar preferences who can be recognized by social relations. To this end, in this paper, we propose a Preference-Adaptive Meta-Learning approach (PAML) to improve existing meta-learning frameworks with better generalization capacity. Specifically, to address two challenges imposed by social relations, we first identify reliable implicit friends to strengthen a user’s social relations based on our defined palindrome paths. Then, a coarse-fine preference modeling method is proposed to leverage social relations and capture the preference. Afterwards, a novel preference-specific adapter is designed to adapt the globally shared prior knowledge to the preference-specific knowledge so that users who have similar tastes share similar knowledge. We conduct extensive experiments on two publicly available datasets. Experimental results validate the power of social relations and the effectiveness of PAML.

NeurIPS Conference 2021 Conference Paper

Progressive Coordinate Transforms for Monocular 3D Object Detection

  • Li Wang
  • Li Zhang
  • Yi Zhu
  • Zhi Zhang
  • Tong He
  • Mu Li
  • Xiangyang Xue

Recognizing and localizing objects in 3D space is a crucial ability for an AI agent to perceive its surrounding environment. While significant progress has been achieved with expensive LiDAR point clouds, 3D object detection given only a monocular image poses a great challenge. While there exist different alternatives for tackling this problem, they are either equipped with heavy networks to fuse RGB and depth information or empirically ineffective at processing millions of pseudo-LiDAR points. With in-depth examination, we realize that these limitations are rooted in inaccurate object localization. In this paper, we propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations. Specifically, a localization boosting mechanism with confidence-aware loss is introduced to progressively refine the localization prediction. In addition, semantic image representation is also exploited to compensate for the usage of patch proposals. Despite being lightweight and simple, our strategy allows us to establish a new state of the art among the monocular 3D detectors on the competitive KITTI benchmark. At the same time, our proposed PCT shows great generalization to most coordinate-based 3D detection frameworks.

JBHI Journal 2020 Journal Article

Adaptive-Guided-Coupling-Probability Level Set for Retinal Layer Segmentation

  • Yue Sun
  • Sijie Niu
  • Xizhan Gao
  • Jie Su
  • Jiwen Dong
  • Yuehui Chen
  • Li Wang

Quantitative assessment of retinal layer thickness in spectral domain-optical coherence tomography (SD-OCT) images is vital for clinicians to determine the degree of ophthalmic lesions. However, due to complex retinal tissues, high-level speckle noise and low intensity contrast, accurately recognizing the retinal layer structure remains a challenge. To overcome this problem, this paper proposes an adaptive-guided-coupling-probability level set method for retinal layer segmentation in SD-OCT images. Specifically, based on Bayes' theorem, each voxel probability representation is composed of two probability terms in our method. The first term is constructed as a neighborhood Gaussian fitting distribution to characterize intensity information for each intra-retinal layer. The second one is a boundary probability map generated by combining anatomical priors and adaptive thickness information to ensure surfaces evolve within a proper range. Then, the voxel probability representation is introduced into the proposed segmentation framework based on coupling-probability level sets to detect layer boundaries. A total of 1792 retinal B-scan images from 4 SD-OCT cubes of healthy eyes, 5 cubes of abnormal eyes with central serous chorioretinopathy and 5 SD-OCT cubes of abnormal eyes with age-related macular disease are used to evaluate the proposed method. The experiments demonstrate that the segmentation results obtained by the proposed method have good consistency with ground truth, and the proposed method outperforms six methods in the layer segmentation of uneven retinal SD-OCT images.

AAAI Conference 2020 Conference Paper

Characterizing Membership Privacy in Stochastic Gradient Langevin Dynamics

  • Bingzhe Wu
  • Chaochao Chen
  • Shiwan Zhao
  • Cen Chen
  • Yuan Yao
  • Guangyu Sun
  • Li Wang
  • Xiaolu Zhang

Bayesian deep learning is recently regarded as an intrinsic way to characterize the weight uncertainty of deep neural networks (DNNs). Stochastic Gradient Langevin Dynamics (SGLD) is an effective method to enable Bayesian deep learning on large-scale datasets. Previous theoretical studies have shown various appealing properties of SGLD, ranging from convergence properties to generalization bounds. In this paper, we study the properties of SGLD from a novel perspective of membership privacy protection (i.e., preventing the membership attack). The membership attack, which aims to determine whether a specific sample was used for training a given DNN model, has emerged as a common threat against deep learning algorithms. To this end, we build a theoretical framework to analyze the information leakage (w.r.t. the training dataset) of a model trained using SGLD. Based on this framework, we demonstrate that SGLD can prevent the information leakage of the training dataset to a certain extent. Moreover, our theoretical analysis can be naturally extended to other types of Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods. Empirical results on different datasets and models verify our theoretical findings and suggest that the SGLD algorithm can not only reduce the information leakage but also improve the generalization ability of DNN models in real-world applications.
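The algorithm the abstract analyzes is the standard SGLD update (gradient step plus injected Gaussian noise); the toy target, step size, and chain length below are assumptions chosen so the sketch runs on a 1-D problem rather than a DNN.

```python
import numpy as np

# Hedged sketch of the standard SGLD update:
#   theta <- theta - (eta / 2) * grad U(theta) + N(0, eta)
# The injected noise turns gradient descent into an (approximate)
# posterior sampler. Toy target: negative log-density U = theta^2 / 2,
# i.e. a standard normal posterior.
rng = np.random.default_rng(0)
eta = 0.01                        # step size
theta = 3.0                       # far-from-mode initialization
samples = []
for _ in range(20_000):
    grad = theta                  # d/dtheta of theta^2 / 2
    theta += -0.5 * eta * grad + np.sqrt(eta) * rng.standard_normal()
    samples.append(theta)
samples = np.array(samples[5_000:])   # discard burn-in
# The empirical mean and variance should roughly match N(0, 1).
```

The same noise term that makes the chain a sampler is what blurs the dependence of the final weights on any single training example, which is the intuition behind the membership-privacy analysis.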

AAAI Conference 2020 Conference Paper

Feature Variance Regularization: A Simple Way to Improve the Generalizability of Neural Networks

  • Ranran Huang
  • Hanbo Sun
  • Ji Liu
  • Lu Tian
  • Li Wang
  • Yi Shan
  • Yu Wang

To improve the generalization ability of neural networks, we propose a novel regularization method that regularizes the empirical risk using a penalty on the empirical variance of the features. Intuitively, our approach introduces confusion into feature extraction and prevents the models from learning features that may relate to specific training samples. According to our theoretical analysis, our method encourages models to generate closer feature distributions for the training set and the unobservable true data, and to minimize the expected risk as well, which allows the model to adapt to new samples better. We provide a thorough empirical justification of our approach, which achieves a greater improvement than other regularization methods. The experimental results show the effectiveness of our method on multiple visual tasks, including classification (CIFAR100, ImageNet, fine-grained datasets) and semantic segmentation (Cityscapes).
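The penalty term the abstract describes, a scaled empirical variance of the extracted features across a batch added to the task loss, could look roughly like this (function name and toy batches are illustrative assumptions):

```python
import numpy as np

def variance_penalty(features):
    """Empirical variance of each feature dimension across the batch,
    averaged over dimensions; added to the task loss with a weight lambda."""
    return np.var(features, axis=0).mean()

batch = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])  # identical features
assert variance_penalty(batch) == 0.0                   # no penalty
spread = np.array([[0.0, 0.0], [2.0, 2.0]])             # variance 1.0 per dim
assert variance_penalty(spread) == 1.0
```

Minimizing this term pushes the features of different training samples toward a common distribution, matching the "confusion into feature extraction" intuition above.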

TIST Journal 2020 Journal Article

Practical Privacy Preserving POI Recommendation

  • Chaochao Chen
  • Jun Zhou
  • Bingzhe Wu
  • Wenjing Fang
  • Li Wang
  • Yuan Qi
  • Xiaolin Zheng

Point-of-Interest (POI) recommendation has been extensively studied and successfully applied in industry recently. However, most existing approaches build centralized models on the basis of collecting users' data. Both private data and models are held by the recommender, which causes serious privacy concerns. In this article, we propose a novel Privacy preserving POI Recommendation (PriRec) framework. First, to protect data privacy, users' private data (features and actions) are kept on their own side, e.g., a cellphone or tablet. Meanwhile, the public data that need to be accessed by all users are kept by the recommender to reduce the storage costs of users' devices. These public data include: (1) static data related only to the status of a POI, such as POI categories, and (2) dynamic data dependent on user-POI actions, such as visit counts. The dynamic data could be sensitive, and we develop local differential privacy techniques to release such data to the public with privacy guarantees. Second, PriRec follows the representation of the Factorization Machine (FM), which consists of a linear model and a feature interaction model. To protect model privacy, the linear models are saved on the users' side, and we propose a secure decentralized gradient descent protocol for users to learn them collaboratively. The feature interaction model is kept by the recommender since it carries no privacy risk, and we adopt a secure aggregation strategy in a federated learning paradigm to learn it. In this way, PriRec keeps users' private raw data and models in users' own hands, and protects user privacy to a large extent. We apply PriRec to real-world datasets, and comprehensive experiments demonstrate that, compared with FM, PriRec achieves comparable or even better recommendation accuracy.
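The abstract does not specify which local differential privacy mechanism PriRec uses; a classical building block for releasing sensitive per-user action data is randomized response, sketched here on a binary visit indicator purely for illustration (function name and epsilon value are assumptions):

```python
import math
import random

def randomized_response(bit, epsilon):
    """epsilon-LDP release of a binary visit indicator: report the true bit
    with probability e^eps / (e^eps + 1), otherwise report its flip."""
    p_true = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_true else 1 - bit

random.seed(0)
reports = [randomized_response(1, epsilon=2.0) for _ in range(10000)]
# with eps = 2, the true bit is kept with probability ~0.88
assert 0.85 < sum(reports) / len(reports) < 0.91
```

An aggregator can debias the noisy reports to recover accurate population-level counts while no individual report reveals the true bit with certainty.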

AAAI Conference 2020 Conference Paper

Show, Recall, and Tell: Image Captioning with Recall Mechanism

  • Li Wang
  • Zechen Bai
  • Yonghua Zhang
  • Hongtao Lu

Generating natural and accurate descriptions in image captioning has always been a challenge. In this paper, we propose a novel recall mechanism to imitate the way humans conduct captioning. There are three parts in our recall mechanism: a recall unit, a semantic guide (SG), and a recalled-word slot (RWS). The recall unit is a text-retrieval module designed to retrieve recalled words for images. SG and RWS are designed to make the best use of the recalled words. The SG branch generates a recalled context, which can guide the process of generating the caption. The RWS branch is responsible for copying recalled words into the caption. Inspired by the pointing mechanism in text summarization, we adopt a soft switch to balance the generated-word probabilities between SG and RWS. In the CIDEr optimization step, we also introduce an individual recalled-word reward (WR) to boost training. Our proposed method (SG+RWS+WR) achieves BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on the MSCOCO Karpathy test split, surpassing the results of other state-of-the-art methods.
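The soft switch described above mixes two word distributions per decoding step, as in pointer-generator models; a minimal sketch (function name, gate value, and three-word toy vocabulary are illustrative assumptions):

```python
import numpy as np

def mixed_word_probs(p_generate, p_recall, gate):
    """Soft switch between the SG (generate) distribution over the vocabulary
    and the RWS (copy recalled words) distribution; gate in [0, 1]."""
    return gate * np.asarray(p_generate) + (1.0 - gate) * np.asarray(p_recall)

p_gen = np.array([0.7, 0.2, 0.1])    # generator favors word 0
p_copy = np.array([0.0, 0.0, 1.0])   # recalled-word slot points at word 2
mixed = mixed_word_probs(p_gen, p_copy, gate=0.4)
assert abs(mixed.sum() - 1.0) < 1e-9  # still a valid distribution
assert mixed.argmax() == 2            # copy branch dominates at a low gate
```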

AAAI Conference 2020 Conference Paper

Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement

  • Jianing Deng
  • Li Wang
  • Shiliang Pu
  • Cheng Zhuo

Recent years have witnessed the remarkable success of deep learning methods in quality enhancement for compressed video. To better exploit temporal information, existing methods usually estimate optical flow for temporal motion compensation. However, since compressed video can be seriously distorted by various compression artifacts, the estimated optical flow tends to be inaccurate and unreliable, resulting in ineffective quality enhancement. In addition, optical flow estimation for consecutive frames is generally conducted in a pairwise manner, which is computationally expensive and inefficient. In this paper, we propose a fast yet effective method for compressed video quality enhancement by incorporating a novel Spatio-Temporal Deformable Fusion (STDF) scheme to aggregate temporal information. Specifically, the proposed STDF takes a target frame along with its neighboring reference frames as input to jointly predict an offset field that deforms the spatio-temporal sampling positions of convolution. As a result, complementary information from both target and reference frames can be fused within a single Spatio-Temporal Deformable Convolution (STDC) operation. Extensive experiments show that our method achieves state-of-the-art performance in compressed video quality enhancement in terms of both accuracy and efficiency.
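As a loose intuition only, here is a toy 1-D temporal analogue of deformable sampling: each pixel samples the frame stack at a predicted fractional offset with linear interpolation. The real STDC predicts dense 2-D spatio-temporal offset fields inside a convolution; everything here (names, shapes, values) is an illustrative assumption.

```python
import numpy as np

def deform_sample(frames, t_offsets):
    """Sample a (T, P) frame stack per pixel at fractional temporal positions,
    linearly interpolating between the two nearest frames."""
    T = frames.shape[0]
    t = np.clip(t_offsets, 0, T - 1)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = t - lo
    idx = np.arange(frames.shape[1])
    return (1 - frac) * frames[lo, idx] + frac * frames[hi, idx]

frames = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]])  # 3 frames, 2 pixels
out = deform_sample(frames, np.array([0.5, 2.0]))
assert out[0] == 1.0   # halfway between frame 0 and frame 1 at pixel 0
assert out[1] == 40.0  # exactly frame 2 at pixel 1
```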

AAAI Conference 2019 Short Paper

An Optimal Rewiring Strategy for Cooperative Multiagent Social Learning

  • Hongyao Tang
  • Jianye Hao
  • Li Wang
  • Tim Baarslag
  • Zan Wang

Multiagent coordination in cooperative multiagent systems (MASs) has been widely studied in both the fixed-agent repeated interaction setting and the static social learning framework. However, two aspects of the dynamics in real-world MASs are currently missing. First, the network topologies can change dynamically during the course of interaction. Second, the interaction utilities between each pair of agents may not be identical and are not known a priori. Both issues increase the difficulty of coordination. In this paper, we consider multiagent social learning in a dynamic environment in which agents can alter their connections and interact with randomly chosen neighbors whose utilities are unknown beforehand. We propose an optimal rewiring strategy to select the most beneficial peers to maximize the accumulated payoffs in long-run interactions. We empirically demonstrate the effectiveness of our approach in large-scale MASs.

AAMAS Conference 2019 Conference Paper

An Optimal Rewiring Strategy for Cooperative Multiagent Social Learning

  • Hongyao Tang
  • Jianye Hao
  • Li Wang
  • Zan Wang
  • Tim Baarslag

Multiagent coordination is a key problem in cooperative multiagent systems (MASs). It has been widely studied in both the fixed-agent repeated interaction setting and the static social learning framework. However, two aspects of the dynamics in real-world MASs are currently neglected. First, the network topologies can change dynamically during the course of interaction. Second, the interaction utilities can differ between each pair of agents and are usually unknown before interaction. Both issues increase the difficulty of coordination. In this paper, we consider multiagent social learning in a dynamic environment in which agents can alter their connections and interact with randomly chosen neighbors whose utilities are unknown beforehand. We propose an optimal rewiring strategy to select the most beneficial peers to maximize the accumulated payoffs in long-run interactions. We empirically demonstrate the effectiveness of our approach in a variety of large-scale MASs.
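The rewiring idea, replacing a poorly performing neighbor with the candidate whose estimated long-run payoff is highest, can be sketched as a greedy simplification. The paper derives an optimal strategy accounting for long-run accumulated payoffs; this toy version (names and values are assumptions) only illustrates the connection-swapping step:

```python
def rewire(neighbors, payoff_estimates, candidates):
    """Greedy sketch: drop the worst-estimated current neighbor and connect
    to the best-estimated candidate, if that candidate is an improvement."""
    worst = min(neighbors, key=lambda a: payoff_estimates.get(a, 0.0))
    best = max(candidates, key=lambda a: payoff_estimates.get(a, 0.0))
    if payoff_estimates.get(best, 0.0) > payoff_estimates.get(worst, 0.0):
        neighbors = [a for a in neighbors if a != worst] + [best]
    return neighbors

est = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}
# "a" (0.2) is replaced by the best candidate "b" (0.9)
assert sorted(rewire(["a", "c"], est, ["b", "d"])) == ["b", "c"]
```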

AAAI Conference 2019 Conference Paper

Difficulty-Aware Attention Network with Confidence Learning for Medical Image Segmentation

  • Dong Nie
  • Li Wang
  • Lei Xiang
  • Sihang Zhou
  • Ehsan Adeli
  • Dinggang Shen

Medical image segmentation is a key step for various applications, such as image-guided radiation therapy and diagnosis. Recently, deep neural networks have provided promising solutions for automatic image segmentation; however, they often perform well only on regular samples (i.e., easy-to-segment samples), since datasets are dominated by easy and regular samples. For medical images, due to huge inter-subject variations or disease-specific effects on subjects, there exist several difficult-to-segment cases that are often overlooked by previous works. To address this challenge, we propose a difficulty-aware deep segmentation network with confidence learning for end-to-end segmentation. The proposed framework makes two main contributions: 1) Besides the segmentation network, we also propose a fully convolutional adversarial network for confidence learning to provide voxel-wise and region-wise confidence information for the segmentation network. We relax adversarial learning to confidence learning by decreasing the priority of adversarial learning, so that we can avoid the training imbalance between generator and discriminator. 2) We propose a difficulty-aware attention mechanism to properly handle hard samples or hard regions considering structural information, which may overcome the shortcomings of focal loss. We further propose a fusion module to selectively fuse the concatenated feature maps in encoder-decoder architectures. Experimental results on clinical and challenge datasets show that our proposed network achieves state-of-the-art segmentation accuracy. Further analysis also indicates that each individual component of the proposed network contributes to the overall performance improvement.

NeurIPS Conference 2019 Conference Paper

Generalization in Generative Adversarial Networks: A Novel Perspective from Privacy Protection

  • Bingzhe Wu
  • Shiwan Zhao
  • Chaochao Chen
  • Haoyang Xu
  • Li Wang
  • Xiaolu Zhang
  • Guangyu Sun
  • Jun Zhou

In this paper, we aim to understand the generalization properties of generative adversarial networks (GANs) from the new perspective of privacy protection. Theoretically, we prove that a differentially private learning algorithm used for training a GAN does not overfit beyond a certain degree, i.e., the generalization gap can be bounded. Moreover, some recent works, such as the Bayesian GAN, can be re-interpreted based on our theoretical insight from privacy protection. Quantitatively, to evaluate the information leakage of well-trained GAN models, we perform various membership attacks on these models. The results show that previous Lipschitz regularization techniques are effective not only in reducing the generalization gap but also in alleviating the information leakage of the training dataset.

IJCAI Conference 2018 Conference Paper

A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization

  • Li Wang
  • Junlin Yao
  • Yunzhe Tao
  • Li Zhong
  • Wei Liu
  • Qiang Du

In this paper, we propose a deep learning approach to tackle automatic summarization tasks by incorporating topic information into the convolutional sequence-to-sequence (ConvS2S) model and using self-critical sequence training (SCST) for optimization. Through jointly attending to topics and word-level alignment, our approach can improve the coherence, diversity, and informativeness of generated summaries via a biased probability generation mechanism. On the other hand, reinforcement training such as SCST directly optimizes the proposed model with respect to the non-differentiable metric ROUGE, which also avoids exposure bias during inference. We carry out experimental evaluations against state-of-the-art methods on the Gigaword, DUC-2004, and LCSTS datasets. The empirical results demonstrate the superiority of our proposed method in abstractive summarization.
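The SCST objective mentioned above uses the greedy decode's reward as a baseline, so only sampled summaries that beat it are reinforced; this is how the non-differentiable ROUGE metric enters training. A minimal surrogate-loss sketch (function name and toy numbers are illustrative assumptions):

```python
import math

def scst_loss(logp_sampled, reward_sampled, reward_greedy):
    """Self-critical surrogate loss: -(r_sample - r_greedy) * sum log p.
    Minimizing it raises the likelihood of samples that beat the baseline."""
    advantage = reward_sampled - reward_greedy
    return -advantage * sum(logp_sampled)

# sampled summary scores higher ROUGE than the greedy decode
loss = scst_loss([math.log(0.5), math.log(0.25)],
                 reward_sampled=0.4, reward_greedy=0.3)
assert 0.15 < loss < 0.25  # positive loss -> gradient raises sample likelihood
```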

AAAI Conference 2018 Conference Paper

Efficient Test-Time Predictor Learning With Group-Based Budget

  • Li Wang
  • Dajiang Zhu
  • Yujie Chi

Learning a test-time efficient predictor is becoming important for many real-world applications in which accessing the necessary features of a test point is costly. In this paper, we propose a novel approach to learn a linear predictor by introducing binary indicator variables for selecting feature groups and imposing an explicit budget constraint to upper-bound the total cost of the selected groups. We solve the convex relaxation of the resulting problem, with most elements of the optimal solution proved to be integral at the optimum, independent of the specific form of the loss function used. We propose a general and efficient algorithm to solve the relaxed problem by leveraging existing SVM solvers with various loss functions. For certain loss functions, the proposed algorithm can further take advantage of SVM solvers in the primal to tackle large-scale and high-dimensional data. Experiments on various datasets demonstrate the effectiveness and efficiency of the proposed method in comparison with various baselines.
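The budgeted group selection being modeled above can be illustrated with a greedy utility-per-cost heuristic. Note the paper instead solves a convex relaxation with provably near-integral solutions; this sketch (names and values are assumptions) only shows the shape of the constraint:

```python
def select_groups(groups, budget):
    """Greedy sketch of budgeted group selection: groups maps a name to a
    (utility, cost) pair; pick by utility/cost ratio until the budget is spent."""
    chosen, spent = [], 0.0
    ranked = sorted(groups.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (utility, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

groups = {"g1": (3.0, 1.0), "g2": (4.0, 4.0), "g3": (2.0, 1.0)}
# g2 has high utility but cannot fit within the budget of 2
assert select_groups(groups, budget=2.0) == ["g1", "g3"]
```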

AAAI Conference 2017 Conference Paper

Latent Smooth Skeleton Embedding

  • Li Wang
  • Qi Mao
  • Ivor Tsang

Learning a smooth skeleton in a low-dimensional space from noisy data becomes important in computer vision and computational biology. Existing methods assume that the manifold constructed from the data is smooth, but they lack the ability to model skeleton structures from noisy data. To overcome this issue, we propose a novel probabilistic structured learning model to learn the density of latent embedding given high-dimensional data and its neighborhood graph. The embedded points that form a smooth skeleton structure are obtained by maximum a posteriori (MAP) estimation. Our analysis shows that the resulting similarity matrix is sparse and unique, and its associated kernel has eigenvalues that follow a power law distribution, which leads to the embeddings of a smooth skeleton. The model is extended to learn a sparse similarity matrix when the graph structure is unknown. Extensive experiments demonstrate the effectiveness of the proposed methods on various datasets by comparing them with existing methods.

AAAI Conference 2016 Conference Paper

Learning Sparse Confidence-Weighted Classifier on Very High Dimensional Data

  • Mingkui Tan
  • Yan Yan
  • Li Wang
  • Anton van den Hengel
  • Ivor W. Tsang
  • Qinfeng (Javen) Shi

Confidence-weighted (CW) learning is a successful online learning paradigm which maintains a Gaussian distribution over classifier weights and adopts a covariance matrix to represent the uncertainties of the weight vectors. However, there are two deficiencies in existing full CW learning paradigms, these being the sensitivity to irrelevant features, and the poor scalability to high dimensional data due to the maintenance of the covariance structure. In this paper, we begin by presenting an online-batch CW learning scheme, and then present a novel paradigm to learn sparse CW classifiers. The proposed paradigm essentially identifies feature groups and naturally builds a block diagonal covariance structure, making it very suitable for CW learning over very high-dimensional data. Extensive experimental results demonstrate the superior performance of the proposed methods over state-of-the-art counterparts on classification and feature selection tasks.

JMLR Journal 2014 Journal Article

Towards Ultrahigh Dimensional Feature Selection for Big Data

  • Mingkui Tan
  • Ivor W. Tsang
  • Li Wang

In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Different from traditional gradient-based approaches that conduct optimization on all input features, the proposed paradigm iteratively activates a group of features and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. Owing to this optimization scheme, some efficient cache techniques are also developed. The feature generating paradigm is guaranteed to converge globally under mild conditions and can achieve a lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world datasets with tens of millions of data points and $O(10^{14})$ features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of both generalization performance and training efficiency.

AAAI Conference 2012 Conference Paper

Convex Matching Pursuit for Large-Scale Sparse Coding and Subset Selection

  • Mingkui Tan
  • Ivor Tsang
  • Li Wang
  • Xinming Zhang

In this paper, a new convex matching pursuit scheme is proposed for tackling large-scale sparse coding and subset selection problems. In contrast with current matching pursuit algorithms such as subspace pursuit (SP), the proposed algorithm has a convex formulation and guarantees that the objective value can be monotonically decreased. Moreover, theoretical analysis and experimental results show that the proposed method achieves better scalability while maintaining similar or better decoding ability compared with state-of-the-art methods on large-scale problems.