Arrow Research search

Author name cluster

Lin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

55 papers
2 author rows

Possible papers

55

AAAI Conference 2026 Conference Paper

Beyond Single-Point Perturbation: A Hierarchical, Manifold-Aware Approach to Diffusion Attacks

  • Zhijie Wang
  • Lin Wang
  • Zhenyu Wen
  • Cong Wang

Latent Diffusion Models have become a powerful tool for generating high-fidelity unrestricted adversarial examples. However, existing methods typically perturb only the initial latent or rely on prompt engineering, which is ill-suited to the iterative nature of the diffusion process and suffers from optimization instability caused by external text prompts, as well as cumulative drift that pushes adversarial images off the data manifold. In this paper, we propose a hierarchical attack framework that operates in alignment with the model's generative manifold and leverages intermediate denoising states to maximize attack transferability and visual fidelity. Extensive experiments show that the proposed attack improves adversarial transferability by 10-20% against a diverse set of normally trained models and achieves an over 10.5% higher success rate against adversarially defended models, while simultaneously enhancing visual quality, with a 1.0-1.2 FID reduction and a 16.7% LPIPS improvement.

TMLR Journal 2026 Journal Article

BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis

  • Lutao Jiang
  • Xu Zheng
  • Yuanhuiyi Lyu
  • Jiazhou Zhou
  • Lin Wang

Text-to-3D synthesis has recently seen intriguing advances by combining text-to-image priors with 3D representation methods, e.g., 3D Gaussian Splatting (3D GS), via Score Distillation Sampling (SDS). However, a hurdle of existing methods is their low efficiency: per-prompt optimization for a single 3D object. Therefore, a paradigm shift from per-prompt optimization to feed-forward generation for arbitrary unseen text prompts is imperative, yet it remains challenging. One obstacle is how to directly generate the millions of 3D Gaussians needed to represent a 3D object. This paper presents BrightDreamer, an end-to-end feed-forward approach that achieves generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating a 3D deformation from an anchor shape with predefined positions. To this end, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, which serve as the centers (one attribute) of the 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian enables us to transform the spatial feature into the four attributes. The generated 3D Gaussians can finally be rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. Also, BrightDreamer possesses strong semantic understanding capability even for complex text prompts. The project code is available in the supplementary materials.

AAAI Conference 2026 Conference Paper

Ev-iCRF: Self-supervised Event-guided iCRF Estimation for HDR Image Reconstruction

  • Xucheng Guo
  • Bing Li
  • Lin Wang
  • Yiran Shen

In this paper, we present Ev-iCRF, a novel self-supervised pipeline for high dynamic range (HDR) image reconstruction from a single-exposure low dynamic range (LDR) image, guided by asynchronous event streams generated by a bio-inspired event camera. The highlight of Ev-iCRF lies in its formulation of the inverse camera response function (iCRF) based on Event-LDR Correspondence. By leveraging the HDR properties of event data, the method enables direct iCRF estimation, offering a new perspective for event-guided HDR imaging. The pipeline is trained in a self-supervised manner using formulation-driven iCRF estimation loss and refinement loss, without the need for synchronized HDR supervision. Ev-iCRF adopts a two-stage coarse-to-fine reconstruction pipeline, allowing effective fusion of features from both LDR image and event data. The event information is used to optimize the iCRF, enabling accurate HDR reconstruction from LDR inputs. We evaluate Ev-iCRF on real-world datasets, and results show that it outperforms state-of-the-art methods in HDR reconstruction accuracy. Moreover, the reconstructed images demonstrate improved texture fidelity and structural detail.
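
The quantity at the heart of this abstract, the iCRF, maps LDR pixel values back to linear irradiance. A minimal sketch of that linearization step, assuming a simple gamma-type response as a stand-in for the learned, event-estimated iCRF of the paper:

```python
import numpy as np

def apply_icrf(ldr, gamma=2.2):
    """Map LDR pixel values in [0, 1] back to linear irradiance via a
    gamma-type inverse CRF. Ev-iCRF estimates the real iCRF from
    event-LDR correspondence; gamma = 2.2 is only an assumed stand-in."""
    ldr = np.clip(ldr, 0.0, 1.0)
    return ldr ** gamma  # undo the camera's gamma compression

# Round trip: the forward CRF (x ** (1/gamma)) followed by the iCRF
# recovers the linear signal.
linear = np.linspace(0.0, 1.0, 5)
ldr = linear ** (1 / 2.2)    # forward camera response
recovered = apply_icrf(ldr)  # back to linear irradiance
```

In the actual pipeline, the iCRF-linearized signal would then be fused with event features for HDR reconstruction.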

AAAI Conference 2026 Conference Paper

EvDiff3D: Event-Aware Diffusion Repair for High-Fidelity Event-Based 3D Reconstruction

  • Kanghao Chen
  • Zixin Zhang
  • Hangyu Li
  • Lin Wang
  • Zeyu Wang

Event cameras are bio-inspired sensors that capture visual information through asynchronous brightness changes, offering distinct advantages including high temporal resolution and wide dynamic range. While prior research has investigated event-based 3D reconstruction for extreme scenarios, existing methods face inherent limitations and fail to fully exploit the unique characteristics of event data. In this paper, we present EvDiff3D, a novel two-stage 3D reconstruction framework that integrates event-based geometric constraints with an event-aware diffusion prior for appearance refinement. Our key insight lies in bridging the gap between physically grounded event-based reconstruction and data-driven appearance repair through a unified cyclical pipeline. In the first stage, we reconstruct a coarse 3D scene under supervision from event loss and event-based monocular depth constraints to preserve structural fidelity. The second stage fine-tunes an event-aware diffusion model based on a pretrained video diffusion model as a repair prior to enhance the appearance in under-constrained regions. Based on the diffusion model, our pipeline operates within a reconstruction-generation cycle that progressively refines both geometry and appearance using only event data. Extensive experiments on synthetic and real-world datasets demonstrate that EvDiff3D significantly outperforms existing methods in perceptual quality and structural consistency.
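
For context on the sensor itself, the asynchronous brightness changes mentioned above follow the standard idealized event-generation model: an event fires when a pixel's log intensity drifts from its reference value by more than a contrast threshold. A background sketch of that model (not EvDiff3D's method):

```python
import numpy as np

def simulate_events(frames, threshold=0.2, eps=1e-6):
    """Idealized event-camera model: emit (t, x, y, polarity) whenever a
    pixel's log intensity moves >= threshold away from its last reference
    value; the reference resets wherever an event fires."""
    log_ref = np.log(frames[0] + eps)
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        diff = np.log(frame + eps) - log_ref
        fired = np.abs(diff) >= threshold
        for y, x in zip(*np.nonzero(fired)):
            events.append((t, int(x), int(y), int(np.sign(diff[y, x]))))
        log_ref[fired] += diff[fired]  # reset reference at fired pixels
    return events
```

This is why event streams carry edge and motion information with high temporal resolution but little absolute appearance, the gap the paper's diffusion prior repairs.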

AAAI Conference 2026 Conference Paper

PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

  • Qi Guo
  • Xiaojun Jia
  • Shanmin Pang
  • Simeng Qin
  • Lin Wang
  • Ju Jia
  • Yang Liu
  • Qing Guo

Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks—particularly adversarial patch attacks—which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models. Due to the more complex architectures and strong reasoning capabilities of MLLMs, these approaches perform poorly when transferred to MLLM-based systems. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms state-of-the-art (SOTA) methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

AAAI Conference 2026 Conference Paper

Temporal and Spatial Representation Learning for Multimodal Low-Beam 3D Object Detection

  • Lin Wang
  • Shiliang Sun
  • Jing Zhao

To facilitate the large-scale deployment of autonomous driving in real-world scenarios, developing low-cost and high-performance 3D object detection systems has become a critical technical challenge. Although high-beam LiDARs provide denser point cloud data, their prohibitive hardware cost and high power consumption limit their practicality. In contrast, low-beam LiDARs offer advantages in terms of affordability and energy efficiency, but often suffer from inadequate perception accuracy due to their sparser point cloud data. This paper focuses on the task of multimodal 3D object detection with low-beam LiDARs, and proposes a novel approach that integrates temporal and spatial representation learning to enhance detection accuracy under sparser sensor conditions. Specifically, our approach comprises: (1) a Temporal Feature Prediction Learning (TFPL) module, which predicts the current BEV representation based on a sequence of historical BEV features; (2) a Spatial Feature Observation Learning (SFOL) module, which aligns BEV (Bird's-Eye-View) features from high-beam and low-beam LiDAR to enforce the low-beam features to approximate high-beam representations; (3) an Uncertainty-Aware Fusion (UAF) strategy, which performs feature-wise weighting between the predicted and observed BEV features by leveraging channel-wise variances, effectively mitigating perturbations in the learned BEV representations. Extensive experiments on the KITTI and nuScenes 3D object detection datasets demonstrate that the proposed approach significantly improves detection performance under low-beam LiDAR configurations.

ICRA Conference 2025 Conference Paper

DAP-LED: Learning Degradation-Aware Priors with Clip for Joint Low-Light Enhancement and Deblurring

  • Ling Wang
  • Chen Wu
  • Lin Wang

Autonomous vehicles and robots often struggle with reliable visual perception at night due to low illumination and the motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (e.g., color distortions) in over-exposed regions or fail to learn the motion cues of dark regions. In this paper, we find that vision-language models, e.g., Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels of images at night. This subtly enables learning rich semantic information and visual representations for optimization of the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For a demo and more results, please check the project page: https://vlislab22.github.io/dap-led/.

ICRA Conference 2025 Conference Paper

Foresee and Act Ahead: Task Prediction and Pre-Scheduling Enabled Efficient Robotic Warehousing

  • Bo Cao
  • Zhe Liu 0022
  • Xingyao Han
  • Shunbo Zhou
  • Heng Zhang
  • Lijun Han
  • Lin Wang
  • Hesheng Wang 0001

In warehousing systems, to enhance efficiency amid surging demand volumes, much attention has been paid to how to allocate delivery tasks to robots sensibly. However, robot labor is still inevitably wasted to some extent. In this paper, we propose a pre-scheduling enhanced warehousing framework that aims to foresee and act in advance, consisting of task flow prediction and hybrid task allocation. For task prediction, we design spatio-temporal representations of the task flow and introduce a periodicity-decoupled mechanism tailored to the generation patterns of aggregated orders, and then further extract spatial features of the task distribution with a novel combination of graph structures. In hybrid task allocation, we consider known tasks and predicted future tasks simultaneously to optimize the allocation. In addition, we account for factors such as predicted-task uncertainty and sector-level efficiency to realize more balanced and rational allocations. We validate our task prediction model on datasets derived from factories, achieving SOTA performance. Furthermore, we implement our system in a real-world robotic warehouse, demonstrating more than 30% improvement in efficiency.
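
As a reference point for the allocation stage, task-to-robot assignment in its simplest form can be sketched as a greedy nearest-robot rule; the paper's hybrid allocator goes further by also weighing predicted future tasks, their uncertainty, and sector-level efficiency (the function below is a hypothetical toy, not the paper's algorithm):

```python
def greedy_assign(robots, tasks):
    """Assign each task to the closest currently free robot by Manhattan
    distance. robots and tasks map names to (x, y) grid positions.
    A toy baseline only; real warehouse allocators optimize globally."""
    free = dict(robots)  # robots still available for assignment
    plan = {}
    for tname, (tx, ty) in tasks.items():
        if not free:
            break  # more tasks than robots: leave the rest unassigned
        best = min(free, key=lambda r: abs(free[r][0] - tx) + abs(free[r][1] - ty))
        plan[tname] = best
        del free[best]
    return plan
```

A pre-scheduling framework would feed both known and predicted tasks into such an allocator instead of reacting to orders one at a time.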

JBHI Journal 2025 Journal Article

Frozen Large-Scale Pretrained Vision-Language Models are the Effective Foundational Backbone for Multimodal Breast Cancer Prediction

  • Hung Q. Vo
  • Lin Wang
  • Kelvin K. Wong
  • Chika F. Ezeana
  • Xiaohui Yu
  • Wei Yang
  • Jenny Chang
  • Hien V. Nguyen

Breast cancer is a pervasive global health concern among women. Leveraging multimodal data from enterprise patient databases—including Picture Archiving and Communication Systems (PACS) and Electronic Health Records (EHRs)—holds promise for improving prediction. This study introduces a multimodal deep-learning model leveraging mammogram datasets to evaluate breast cancer prediction. Our approach integrates frozen large-scale pretrained vision-language models, showcasing superior performance and stability compared to traditional image-tabular models across two public breast cancer datasets. The model consistently outperforms conventional full fine-tuning methods by using frozen pretrained vision-language models alongside a lightweight trainable classifier. The observed improvements are significant. In the CBIS-DDSM dataset, the Area Under the Curve (AUC) increases from 0.867 to 0.902 during validation and from 0.803 to 0.830 on the official test set. Within the EMBED dataset, AUC improves from 0.780 to 0.805 during validation. In scenarios with limited data, using Breast Imaging-Reporting and Data System category three (BI-RADS 3) cases, AUC improves from 0.91 to 0.96 on the official CBIS-DDSM test set and from 0.79 to 0.83 on a challenging validation set. This study underscores the benefits of vision-language models in jointly training diverse image-clinical datasets from multiple healthcare institutions, effectively addressing challenges related to non-aligned tabular features. Combining training data enhances breast cancer prediction on the EMBED dataset, outperforming all other experiments. In summary, our research emphasizes the efficacy of frozen large-scale pretrained vision-language models in multimodal breast cancer prediction, offering superior performance and stability over conventional methods and reinforcing their potential for breast cancer prediction.

NeurIPS Conference 2025 Conference Paper

Leader360V: A Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment

  • Weiming Zhang
  • Dingwen Xiao
  • Aobotao DAI
  • Yexin Liu
  • Tianbo Pan
  • Shiqi Wen
  • Lei Chen
  • Lin Wang

360 video captures the complete surrounding scene with an ultra-large field of view of 360°×180°. This makes 360 scene understanding tasks, e.g., segmentation and tracking, crucial for applications such as autonomous driving and robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labelled real-world datasets. This is caused by the inherent spherical properties, e.g., severe distortion in polar regions, and content discontinuities, rendering the annotation costly yet complex. This paper introduces Leader360V, the first large-scale (10K+), labeled real-world 360 video dataset for instance segmentation and tracking. Our dataset enjoys high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models (LLMs) to facilitate the labeling. The pipeline operates in three novel stages. Specifically, in the Initial Annotation Phase, we introduce a Semantic- and Distortion-aware Refinement (SDR) module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. In the Auto-Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or by resolving the discontinuities near the horizontal borders. The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding.
We release our dataset and code at https://leader360v.github.io/Leader360V_HomePage/ for better understanding.

JBHI Journal 2025 Journal Article

MOSAIC: A Multi-Granularity Cross-Modal Framework for Predicting Synergistic Drug Combinations in Personalized Healthcare

  • Licai Zhang
  • Xiao Kang
  • Xinxing Yang
  • Lin Wang
  • Genke Yang
  • Jian Chu

The personalization of cancer treatment through drug combinations is critical for improving healthcare outcomes, increasing effectiveness, and reducing side effects. Computational methods have become increasingly important for prioritizing synergistic drug pairs because of the vast search space of possible chemicals. However, existing approaches typically rely solely on global molecular structures, neglecting information exchange between different modality representations and interactions between molecules and fine-grained fragments, leading to a limited understanding of drug synergy mechanisms for personalized treatment. To address these limitations, we propose MOSAIC (Multi-granularity crOSs-modAl method for synergIstic drug combinations prediCtion), an AI-driven multi-granularity cross-modal method for personalized synergistic drug combination prediction that considers both molecular and fragment-level features. MOSAIC employs a dual-layer representation system, decomposing molecules into chemically meaningful fragments using the BRICS algorithm, facilitating information exchange between graph and SMILES representations through a bidirectional cross-attention mechanism, and ensuring semantic consistency of different modal representations of the same molecular fragment through a contrastive learning framework. Additionally, we design a bilinear attention network to capture interactions between fragments of different drugs and dynamically integrate multi-granularity feature relationships through a multi-head attention mechanism. Through extensive experiments on multiple real-world datasets, MOSAIC demonstrates superior performance over state-of-the-art methods. Literature validation confirms that its predicted novel drug combinations align with existing clinical evidence, while visualization analyses elucidate its capability to pinpoint key molecular fragments critical for drug synergy, providing valuable insights for personalized treatment planning and remote patient monitoring.

NeurIPS Conference 2025 Conference Paper

PASS: Path-selective State Space Model for Event-based Recognition

  • Jiazhou Zhou
  • Kanghao Chen
  • Lei Zhang
  • Lin Wang

Event cameras are bio-inspired sensors that capture intensity changes asynchronously, with distinct advantages such as high temporal resolution. Existing methods for event-based object/action recognition predominantly sample and convert event representations at a fixed temporal interval (or frequency). However, they are constrained to processing a limited range of event lengths and show poor frequency generalization, thus not fully leveraging the event camera's high temporal resolution. In this paper, we present the PASS framework, which exhibits superior capacity for spatiotemporal modeling of longer event streams and generalization across varying inference temporal frequencies. Our key insight is to learn adaptively encoded event features via state space models (SSMs), whose linear complexity and generalization over input frequency make them ideal for processing high-temporal-resolution events. Specifically, we propose a Path-selective Event Aggregation and Scan (PEAS) module to encode events into features with fixed dimensions by adaptively scanning and selecting the aggregated event representation. On top of it, we introduce a novel Multi-faceted Selection Guiding (MSG) loss to minimize the randomness and redundancy of the encoded features during the PEAS selection process. Our method outperforms prior methods on five public datasets and shows strong generalization across varying inference frequencies with a smaller accuracy drop (-8.62% for ours vs. -20.69% for the baseline). Moreover, our model exhibits strong long-range spatiotemporal modeling over a broad distribution of event lengths (1-10^9), precise temporal perception, and effective generalization to real-world scenarios. Code and checkpoints will be released upon acceptance.
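
The linear-complexity property that makes SSMs attractive for long event streams can be seen in the plain discrete state-space recurrence x_t = A x_{t-1} + B u_t, y_t = C x_t; a generic sketch of that recurrence (not PASS's path-selective variant):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Run a discrete linear state space model over an input sequence:
    x_t = A x_{t-1} + B u_t, y_t = C x_t. Cost is linear in sequence
    length, which is what motivates SSMs for high-rate event data."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t       # state update
        ys.append(float(C @ x))   # readout
    return ys
```

With A = 0.5, B = C = 1, an impulse input decays geometrically through the state, illustrating how the recurrence summarizes arbitrarily long history in a fixed-size state.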

NeurIPS Conference 2025 Conference Paper

PolypSense3D: A Multi-Source Benchmark Dataset for Depth-Aware Polyp Size Measurement in Endoscopy

  • Ruyu Liu
  • Lin Wang
  • Zhou Mingming
  • Jianhua Zhang
  • ZHANG HAOYU
  • Xiufeng Liu
  • Xu Cheng
  • Sixian Chan

Accurate polyp sizing during endoscopy is crucial for cancer risk assessment but is hindered by subjective methods and inadequate datasets lacking integrated 2D appearance, 3D structure, and real-world size information. We introduce PolypSense3D, the first multi-source benchmark dataset specifically targeting depth-aware polyp size measurement. It uniquely integrates over 43,000 frames from virtual simulations, physical phantoms, and clinical sequences, providing synchronized RGB, dense/sparse depth, segmentation masks, camera parameters, and millimeter-scale size labels derived via a novel forceps-assisted in-vivo annotation technique. To establish its value, we benchmark state-of-the-art segmentation and depth estimation models. Results quantify significant domain gaps between simulated/phantom and clinical data and reveal substantial error propagation from perception stages to final size estimation, with the best fully automated pipelines achieving an average Mean Absolute Error (MAE) of 0.95 mm on the clinical data subset. Publicly released under CC BY-SA 4.0 with code and evaluation protocols, PolypSense3D offers a standardized platform to accelerate research in robust, clinically relevant quantitative endoscopic vision. The benchmark dataset and code are available at: https://github.com/HNUicda/PolypSense3D and https://doi.org/10.7910/DVN/K13H89.
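
The step from perception to metric size presumably rests on a pinhole-camera relation: an object of width w at depth Z spans w·f/Z pixels. A simplified sketch of the conversion (an assumed model, much reduced from the benchmark's full pipeline):

```python
def pixel_extent_to_mm(width_px, depth_mm, focal_px):
    """Pinhole-camera conversion of an image-plane extent to metric size.
    An object of width w at depth Z projects to w * f / Z pixels, so
    w = width_px * Z / f. Depth errors therefore scale size errors
    linearly, one reason perception errors propagate to sizing."""
    return width_px * depth_mm / focal_px

# e.g. a polyp spanning 60 px at 50 mm depth with a 600 px focal length
size_mm = pixel_extent_to_mm(60, 50.0, 600.0)  # -> 5.0 mm
```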

NeurIPS Conference 2025 Conference Paper

ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility

  • Yihang Zhou
  • Chen Wei
  • Minghao Sun
  • Jin Song
  • Yang Li
  • Lin Wang
  • Yang Zhang

Understanding the conformational landscape of proteins is essential for elucidating protein function and facilitating drug design. However, existing protein conformation benchmarks fail to capture the full energy landscape, limiting their ability to evaluate the diversity and physical plausibility of AI-generated structures. We introduce ProteinConformers, a large-scale benchmark dataset comprising over 381,000 physically realistic conformations for 87 CASP targets. These were derived from more than 40,000 structural decoys via extensive all-atom molecular dynamics simulations totaling over 6 million CPU hours. Using this dataset, we propose novel metrics to evaluate conformational diversity and plausibility, and systematically benchmark six protein conformation generative models. Our results highlight that leveraging large-scale protein sequence data can enhance a model's ability to explore conformational space, potentially reducing reliance on MD-derived data. Additionally, we find that PDB and MD datasets influence model performance differently: current models perform well on inter-atomic distance prediction but struggle with inter-residue orientation generation. Overall, our dataset, evaluation metrics, and benchmarking results provide the first comprehensive foundation for assessing generative models in protein conformational modeling. The dataset and instructions are available at https://huggingface.co/datasets/Jim990908/ProteinConformers/tree/main. Code is stored at https://github.com/auroua/ProteinConformers. An interactive website is available at https://zhanggroup.org/ProteinConformers.
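
One building block such diversity metrics plausibly rest on is a coordinate-level deviation between two conformations; a minimal centered RMSD sketch (no rotational Kabsch alignment, which a full metric would add, and not necessarily the paper's exact formula):

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two conformations given as
    (N, 3) coordinate arrays, after centering each at its centroid.
    Translation-invariant but not rotation-invariant; a Kabsch
    superposition would remove the rotational component too."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))
```

Pairwise RMSD over a conformation set is one common way to quantify how much of the conformational space a generative model covers.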

ICRA Conference 2025 Conference Paper

Robo-GS: A Physics Consistent Spatial-Temporal Model for Robotic Arm with Hybrid Representation

  • Haozhe Lou
  • Yurong Liu
  • Yike Pan
  • Yiran Geng
  • Jianteng Chen
  • Wenlong Ma 0006
  • Chenglong Li
  • Lin Wang

The Real2Sim2Real (R2S2R) paradigm is critical for advancing robotic learning. Existing methods lack a comprehensive solution to accurately reconstruct real-world objects with both spatial representations and their associated physics attributes in the Real2Sim stage. We propose a Real2Sim pipeline to generate digital assets enabling high-fidelity simulation. We design a hybrid representation model that integrates mesh geometry, 3D Gaussian kernels, and physics attributes to enhance the representation of robotic arms in digital assets. This hybrid representation is implemented through a Gaussian-Mesh-Pixel binding technique, which establishes an isomorphic mapping between mesh vertices and the Gaussian model. This enables a fully differentiable rendering pipeline that can be optimized through numerical solvers, achieves high-fidelity rendering via Gaussian Splatting, and facilitates physically plausible simulation of the robotic arm's interaction with its environment through mesh geometry. With the digital assets, we propose a fully manipulable Real2Sim pipeline that standardizes coordinate systems and scales, ensuring the seamless integration of multiple components. To demonstrate its effectiveness, we include datasets covering various robotic manipulation tasks with their mesh reconstructions. Our model achieves state-of-the-art results in realistic rendering and mesh reconstruction quality for robotic applications. Our code and datasets will be made publicly available at robostudioapp.com.

NeurIPS Conference 2024 Conference Paper

LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction

  • Kanghao Chen
  • Hangyu Li
  • Jiazhou Zhou
  • Zeyu Wang
  • Lin Wang

Event cameras harness advantages such as low latency, high temporal resolution, and high dynamic range (HDR) compared to standard cameras. Due to this distinct imaging paradigm shift, a dominant line of research focuses on event-to-video (E2V) reconstruction to bridge event-based and standard computer vision. However, this task remains challenging due to its inherently ill-posed nature: event cameras only detect edge and motion information locally. Consequently, the reconstructed videos are often plagued by artifacts and regional blur, primarily caused by the ambiguous semantics of event data. In this paper, we find that language naturally conveys abundant semantic information, rendering it strikingly well suited to ensuring semantic consistency for E2V reconstruction. Accordingly, we propose a novel framework, called LaSe-E2V, that can achieve semantic-aware, high-quality E2V reconstruction from a language-guided perspective, buttressed by text-conditional diffusion models. However, due to diffusion models' inherent diversity and randomness, directly applying them cannot achieve the spatial and temporal consistency required for E2V reconstruction. Thus, we first propose an Event-guided Spatiotemporal Attention (ESA) module to effectively condition the denoising pipeline on the event data. We then introduce an event-aware mask loss to ensure temporal coherence and a noise initialization strategy to enhance spatial consistency. Given the absence of event-text-video paired data, we aggregate existing E2V datasets and generate textual descriptions using tagging models for training and evaluation. Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method. Demo videos of the results are attached to the project page.

NeurIPS Conference 2024 Conference Paper

LinNet: Linear Network for Efficient Point Cloud Representation Learning

  • Hao Deng
  • Kunlei Jing
  • Shengmei Cheng
  • Cheng Liu
  • Jiawei Ru
  • Jiang Bo
  • Lin Wang

Point-based methods have made significant progress, but improving their scalability in large-scale 3D scenes remains a challenging problem. In this paper, we delve into the point-based method and develop a simpler, faster, stronger variant model, dubbed LinNet. In particular, we first propose the disassembled set abstraction (DSA) module, which is more effective than the previous version of set abstraction. It achieves more efficient local aggregation by leveraging spatial anisotropy and channel anisotropy separately. Additionally, by mapping 3D point clouds onto 1D space-filling curves, we enable parallelization of downsampling and neighborhood queries on GPUs with linear complexity. LinNet, as a purely point-based method, outperforms most previous methods in both indoor and outdoor scenes without any extra attention or sparse convolution, relying merely on simple MLPs. It achieves mIoU of 73.7%, 81.4%, and 69.1% on the S3DIS Area 5, nuScenes, and SemanticKITTI validation benchmarks, respectively, while speeding up almost 10x over PointNeXt. Our work further reveals both the efficacy and efficiency potential of vanilla point-based models in large-scale representation learning. Our code will be available upon publication.
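
The mapping onto a 1D space-filling curve can be made concrete with a Morton (Z-order) key over quantized coordinates; this is one common curve choice, assumed here for illustration rather than taken from the paper:

```python
def part1by2(n):
    """Spread the bits of a 10-bit integer so they occupy every third bit
    (standard bit-interleaving magic numbers)."""
    n &= 0x3FF
    n = (n | (n << 16)) & 0xFF0000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3d(x, y, z):
    """Interleave the bits of quantized (x, y, z) into a Z-order key.
    Sorting points by this key serializes a 3D cloud along a
    space-filling curve, so neighborhood queries and downsampling
    become 1D operations that parallelize well on GPUs."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)
```

Points close on the curve tend to be close in 3D, which is what makes the 1D ordering a useful proxy for spatial locality.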

ICLR Conference 2024 Conference Paper

Rethinking CNN's Generalization to Backdoor Attack from Frequency Domain

  • Quanrui Rao
  • Lin Wang
  • Wuying Liu

Convolutional neural networks (CNNs) are easily affected by backdoor injections, where models perform normally on clean samples but produce specific outputs on poisoned ones. Most existing studies have focused on how trigger feature changes in poisoned samples affect model generalization in the spatial domain. We focus instead on the mechanism by which CNNs memorize poisoned samples in the frequency domain, and find that CNNs generalize to poisoned samples by memorizing the frequency-domain distribution of trigger changes. We also explore the influence of trigger perturbations in different frequency-domain components on the generalization of poisoned models under visible and invisible backdoor attacks, and show that high-frequency components are more susceptible to perturbations than low-frequency components. Based on these findings, we propose a universal invisibility strategy for visible triggers, which achieves trigger invisibility while maintaining the original attack performance. We also design a novel frequency-domain backdoor attack method based on low-frequency semantic information, which achieves 100% attack accuracy on multiple models and datasets and can bypass multiple defenses.
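
The low- vs. high-frequency components the abstract refers to can be separated with an ideal circular mask in the 2D FFT domain; a generic decomposition sketch (the paper's trigger construction is its own):

```python
import numpy as np

def split_frequency_bands(img, radius=4):
    """Decompose a grayscale image into low- and high-frequency parts
    using an ideal circular mask around the DC component of the shifted
    2D FFT. The two parts sum back to the original image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~low_mask)).real
    return low, high
```

Perturbing only one of the two bands before re-synthesis is the kind of frequency-restricted trigger manipulation the study analyzes.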

NeurIPS Conference 2023 Conference Paper

DELTA: Diverse Client Sampling for Fasting Federated Learning

  • Lin Wang
  • Yongxin Guo
  • Tao Lin
  • Xiaoying Tang

Partial client participation has been widely adopted in Federated Learning (FL) to reduce the communication burden efficiently. However, an inadequate client sampling scheme can lead to the selection of unrepresentative subsets, resulting in significant variance in model updates and slowed convergence. Existing sampling methods are either biased or can be further optimized for faster convergence. In this paper, we present DELTA, an unbiased sampling scheme designed to alleviate these issues. DELTA characterizes the effects of client diversity and local variance, and samples representative clients that carry valuable information for global model updates. In addition, DELTA is provably an optimal unbiased sampling scheme: it minimizes the variance caused by partial client participation and outperforms other unbiased sampling schemes in terms of convergence. Furthermore, to remove the dependence on full-client gradients, we provide a practical version of DELTA that relies only on the available clients' information, and we also analyze its convergence. Our results are validated through experiments on both synthetic and real-world datasets.
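The inverse-probability weighting that keeps a partial-participation update unbiased can be sketched as follows; the scalar updates and the per-client "diversity score" here are illustrative stand-ins, not DELTA's actual quantities:

```python
import random

def sampling_probs(scores):
    """Sampling probabilities proportional to each client's diversity score."""
    total = sum(scores)
    return [s / total for s in scores]

def sample_clients(probs, m, seed=0):
    """Draw m client indices (with replacement) according to probs."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=m)

def unbiased_estimate(updates, chosen, probs):
    """Inverse-probability-weighted average of the sampled client updates.

    Weighting each sampled update by 1/(n * probs[i]) makes the estimate
    unbiased for the full-participation average (1/n) * sum(updates).
    """
    n = len(updates)
    return sum(updates[i] / (n * probs[i]) for i in chosen) / len(chosen)
```

Any positive sampling distribution yields an unbiased estimate under this weighting; what a scheme like DELTA optimizes is which distribution minimizes the estimate's variance.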

ICRA Conference 2023 Conference Paper

Improved Event-Based Dense Depth Estimation via Optical Flow Compensation

  • Dianxi Shi
  • Luoxi Jing
  • Ruihao Li 0001
  • Zhe Liu 0029
  • Lin Wang
  • Huachi Xu
  • Yi Zhang

Event cameras have the potential to overcome the limitations of classical computer vision in real-world applications. Depth estimation is a crucial step for high-level robotics tasks and has attracted much attention from the community. In this paper, we propose an event-based dense depth estimation architecture, Mixed-EF2DNet, which first predicts inter-grid optical flow to compensate for lost temporal information and then estimates multiple contextual depth maps that are fused to generate a robust depth estimation map. To supervise the network training, we further design a smoothing loss function that smooths local depth estimates and facilitates estimating reasonable depths for pixels without events. In addition, we introduce SE-resblocks in the depth network to enhance the network representation by selecting feature channels. Experimental evaluations on both real-world and synthetic datasets show that our method outperforms state-of-the-art algorithms in accuracy, especially in scene detail estimation. Besides, our method demonstrates excellent generalization in cross-dataset tasks.

JBHI Journal 2023 Journal Article

Interpretable Inference and Classification of Tissue Types in Histological Colorectal Cancer Slides Based on Ensembles Adaptive Boosting Prototype Tree

  • Meiyan Liang
  • Ru Wang
  • Jianan Liang
  • Lin Wang
  • Bo Li
  • Xiaojun Jia
  • Yu Zhang
  • Qinghui Chen

Digital pathology images are treated as the “gold standard” for the diagnosis of colorectal lesions, especially colon cancer. Real-time, objective, and accurate inspection results will assist clinicians in choosing symptomatic treatment in a timely manner, which is of great significance in clinical medicine. However, manual methods suffer from long inspection cycles and serious reliance on subjective interpretation. It is also a challenging task for existing computer-aided diagnosis methods to obtain models that are both accurate and interpretable: models that exhibit high accuracy are usually more complex and opaque, while interpretable models may lack the necessary accuracy. Therefore, the framework of an ensemble adaptive boosting prototype tree is proposed to predict colorectal pathology images and provide interpretable inference by visualizing the decision-making process in each base learner. The results showed that the proposed method could effectively address the “accuracy-interpretability trade-off” issue by ensembling m adaptive boosting neural prototype trees. The superior performance of the framework provides a novel paradigm for interpretable inference and high-precision prediction of pathology image patches in computational pathology.

NeurIPS Conference 2023 Conference Paper

NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity Understanding

  • Ming Hu
  • Lin Wang
  • Siyuan Yan
  • Don Ma
  • Qingli Ren
  • Peng Xia
  • Wei Feng
  • Peibo Duan

The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. By utilizing this technique, we can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hindered by the scarcity of appropriately labeled datasets. Existing video datasets pose several limitations: 1) they are too small in scale to support comprehensive investigations of nursing activity; 2) they primarily focus on single procedures, lacking expert-level annotations for various nursing procedures and action steps; and 3) they lack temporally localized annotations, which prevents the effective localization of targeted actions within longer video sequences. To mitigate these limitations, we propose NurViD, a large video dataset with expert-level annotation for nursing procedure activity understanding. NurViD consists of over 1.5k videos totaling 144 hours, making it approximately four times longer than the existing largest nursing activity datasets. Notably, it encompasses 51 distinct nursing procedures and 177 action steps, providing much more comprehensive coverage compared to existing datasets that primarily focus on limited procedures. To evaluate the efficacy of current deep learning methods on nursing activity understanding, we establish three benchmarks on NurViD: procedure recognition on untrimmed videos, procedure and action recognition on trimmed videos, and action detection. Our benchmark and code will be available at https://github.com/minghu0830/NurViD-benchmark.

AAAI Conference 2023 Conference Paper

OPT-GAN: A Broad-Spectrum Global Optimizer for Black-Box Problems by Learning Distribution

  • Minfang Lu
  • Shuai Ning
  • Shuangrong Liu
  • Fengyang Sun
  • Bo Zhang
  • Bo Yang
  • Lin Wang

Black-box optimization (BBO) algorithms are concerned with finding the best solutions for problems with missing analytical details. Most classical methods for such problems are based on strong, fixed a priori assumptions, such as Gaussianity. However, complex real-world problems, especially when the global optimum is desired, can be very far from these a priori assumptions because of their diversity, causing unexpected obstacles. In this study, we propose a generative adversarial net-based broad-spectrum global optimizer (OPT-GAN) that gradually estimates the distribution of the optimum, with strategies to balance the exploration-exploitation trade-off. It has the potential to adapt better to the regularity and structure of diversified landscapes than methods with a fixed prior, e.g., a Gaussian assumption or separability. Experiments on diverse BBO benchmarks and high-dimensional real-world applications show that OPT-GAN outperforms other traditional and neural net-based BBO algorithms. The code and Appendix are available at https://github.com/NBICLAB/OPT-GAN.

AAAI Conference 2023 Conference Paper

Pixel Is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object Detection

  • Zhenyu Wu
  • Lin Wang
  • Wei Wang
  • Qing Xia
  • Chenglizhao Chen
  • Aimin Hao
  • Shuo Li

Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotations) can achieve performance equivalent to its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there is a point-labeled dataset on which saliency models can achieve performance equivalent to training on the densely annotated dataset. To prove this conjecture, we propose a novel yet effective adversarial trajectory-ensemble active learning method (ATAL). Our contributions are three-fold: 1) our proposed adversarial attack triggering uncertainty can conquer the overconfidence of existing active learning methods and accurately locate uncertain pixels; 2) our proposed trajectory-ensemble uncertainty estimation method maintains the advantages of ensemble networks while significantly reducing the computational cost; 3) our proposed relationship-aware diversity sampling algorithm can conquer oversampling while boosting performance. Experimental results show that our ATAL can find such a point-labeled dataset, where a saliency model trained on it obtains 97%-99% of the performance of its fully-supervised version with only 10 annotated points per image.

AAAI Conference 2023 Conference Paper

SEPT: Towards Scalable and Efficient Visual Pre-training

  • Yiqi Lin
  • Huabin Zheng
  • Huaping Zhong
  • Jinjing Zhu
  • Weijia Li
  • Conghui He
  • Lin Wang

Recently, the self-supervised pre-training paradigm has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance. However, increasing the scale of unlabeled pre-training data in real-world scenarios requires prohibitive computational costs and faces the challenge of uncurated samples. To address these issues, we build a task-specific self-supervised pre-training framework from a data selection perspective, based on a simple hypothesis: pre-training on the unlabeled samples with a distribution similar to the target task can bring substantial performance gains. Buttressed by this hypothesis, we propose the first such framework, Scalable and Efficient visual Pre-Training (SEPT), which introduces a retrieval pipeline for data selection. SEPT first leverages a self-supervised pre-trained model to extract the features of the entire unlabeled dataset to initialize the retrieval pipeline. Then, for a specific target task, SEPT retrieves the most similar samples from the unlabeled dataset based on feature similarity for each target instance for pre-training. Finally, SEPT pre-trains the target model with the selected unlabeled samples in a self-supervised manner before finetuning on the target data. By decoupling the scale of pre-training from the available upstream data for a target task, SEPT achieves high scalability of the upstream dataset and high efficiency of pre-training, resulting in high model architecture flexibility. Results on various downstream tasks demonstrate that SEPT can achieve competitive or even better performance compared with ImageNet pre-training while reducing the size of training samples by one order of magnitude without resorting to any extra annotations.
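The retrieval step can be sketched as cosine-similarity scoring of pooled unlabeled features against target-task features; scoring each pool sample by its maximum similarity to any target instance is an illustrative assumption here, not necessarily SEPT's exact criterion:

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(target_feats, pool_feats, k):
    """Indices of the k unlabeled samples most similar to the target set."""
    scored = []
    for idx, f in enumerate(pool_feats):
        # score each pool sample by its best match among target instances
        scored.append((max(cosine(f, t) for t in target_feats), idx))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```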

JMLR Journal 2023 Journal Article

SQLFlow: An Extensible Toolkit Integrating DB and AI

  • Jun Zhou
  • Ke Zhang
  • Lin Wang
  • Hua Wu
  • Yi Wang
  • Chaochao Chen

Integrating AI algorithms into databases is an ongoing effort in both academia and industry. We introduce SQLFlow, a toolkit seamlessly combining data manipulation and AI operations that can be run locally or remotely. SQLFlow extends SQL syntax to support typical AI tasks including model training, inference, interpretation, and mathematical optimization. It is compatible with a variety of database management systems (DBMS) and AI engines, including MySQL, TiDB, MaxCompute, and Hive, as well as TensorFlow, scikit-learn, and XGBoost. Documentation and case studies are available at https://sqlflow.org. The source code and additional details can be found at https://github.com/sql-machine-learning/sqlflow.

IJCAI Conference 2023 Conference Paper

STS-GAN: Can We Synthesize Solid Texture with High Fidelity from Arbitrary 2D Exemplar?

  • Xin Zhao
  • Jifeng Guo
  • Lin Wang
  • Fanqi Li
  • Jiahao Li
  • Junteng Zheng
  • Bo Yang

Solid texture synthesis (STS), an effective way to extend a 2D exemplar to a 3D solid volume, exhibits advantages in computational photography. However, existing methods generally fail to accurately learn arbitrary textures, which may result in failure to synthesize solid textures with high fidelity. In this paper, we propose a novel generative adversarial nets-based framework (STS-GAN) to extend a given 2D exemplar to arbitrary 3D solid textures. In STS-GAN, multi-scale 2D texture discriminators evaluate the similarity between the given 2D exemplar and slices from the generated 3D texture, promoting the 3D texture generator to synthesize realistic solid textures. Finally, experiments demonstrate that the proposed method can generate high-fidelity solid textures with visual characteristics similar to the 2D exemplar.

AAAI Conference 2023 Conference Paper

Unsupervised Domain Adaptation for Medical Image Segmentation by Selective Entropy Constraints and Adaptive Semantic Alignment

  • Wei Feng
  • Lie Ju
  • Lin Wang
  • Kaimin Song
  • Xin Zhao
  • Zongyuan Ge

Generalizing a deep learning model to new domains is crucial for computer-aided medical diagnosis systems. Most existing unsupervised domain adaptation methods have made significant progress in reducing the domain distribution gap through adversarial training. However, these methods may still produce overconfident but erroneous results on unseen target images. This paper proposes a new unsupervised domain adaptation framework for cross-modality medical image segmentation. Specifically, we first introduce two data augmentation approaches to generate two sets of semantics-preserving augmented images. Based on the model's predictive consistency on these two sets of augmented images, we identify reliable and unreliable pixels. We then perform a selective entropy constraint: we minimize the entropy of reliable pixels to increase their confidence while maximizing the entropy of unreliable pixels to reduce their confidence. Based on the identified reliable and unreliable pixels, we further propose an adaptive semantic alignment module that performs class-level distribution adaptation by minimizing the distance between same-class prototypes across domains, where unreliable pixels are removed to derive more accurate prototypes. We have conducted extensive experiments on the cross-modality cardiac structure segmentation task. The experimental results show that the proposed method significantly outperforms state-of-the-art comparison algorithms. Our code and data are available at https://github.com/fengweie/SE_ASA.
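The selective entropy constraint described above can be sketched as a per-pixel loss that pushes reliable pixels toward low entropy and unreliable ones toward high entropy; the simple sign-flip formulation below is an illustrative assumption, not the paper's exact loss:

```python
import math

def entropy(p):
    """Shannon entropy of one pixel's class-probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def selective_entropy_loss(probs, reliable):
    """Sum entropy over reliable pixels and subtract it over unreliable ones.

    Minimizing this loss decreases entropy (raises confidence) where the
    model is consistent and increases entropy (lowers confidence) elsewhere.
    """
    loss = 0.0
    for p, ok in zip(probs, reliable):
        h = entropy(p)
        loss += h if ok else -h
    return loss
```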

JBHI Journal 2022 Journal Article

A Time-Series Feature-Based Recursive Classification Model to Optimize Treatment Strategies for Improving Outcomes and Resource Allocations of COVID-19 Patients

  • Lin Wang
  • Zheng Yin
  • Mamta Puppala
  • Chika Ezeana
  • Kelvin Wong
  • Tiancheng He
  • Deepa Gotur
  • Stephen Wong

This paper presents a novel Lasso Logistic Regression model based on feature-based time series data to determine disease severity and when to administer drugs or escalate intervention procedures in patients with coronavirus disease 2019 (COVID-19). Advanced features were extracted from highly enriched time series vital sign data of hospitalized COVID-19 patients, including oxygen saturation readings, and combined with patient demographic and comorbidity information as inputs into the dynamic feature-based classification model. Such dynamic combinations brought deep insights to guide clinical decision-making in complex COVID-19 cases, including prognosis prediction, timing of drug administration, admission to intensive care units, and application of intervention procedures such as ventilation and intubation. The COVID-19 patient classification model was developed utilizing 900 hospitalized COVID-19 patients in a leading multi-hospital system in Texas, United States. By providing mortality prediction based on time-series physiologic data, demographics, and clinical records of individual COVID-19 patients, the dynamic feature-based classification model can be used to improve the efficacy of COVID-19 patient treatment, prioritize medical resources, and reduce casualties. The uniqueness of our model is that it is based on just the first 24 hours of vital sign data, such that clinical interventions can be decided early and applied effectively. Such a strategy could be extended to prioritize resource allocations and drug treatment for future pandemic events.

AAAI Conference 2022 Conference Paper

Deconvolutional Density Network: Modeling Free-Form Conditional Distributions

  • Bing Chen
  • Mazharul Islam
  • Jisuo Gao
  • Lin Wang

Conditional density estimation (CDE) is the task of estimating the probability of an event conditioned on some inputs. A neural network (NN) can also be used to compute the output distribution over a continuous domain, which can be viewed as an extension of the regression task. Nevertheless, it is difficult to explicitly approximate a distribution without knowing its general form a priori. In order to fit an arbitrary conditional distribution, discretizing the continuous domain into bins is an effective strategy, as long as we have sufficiently narrow bins and very large data. However, the amount of data collected often falls far short of that ideal, especially in multivariate CDE, due to the curse of dimensionality. In this paper, we demonstrate the benefits of modeling free-form conditional distributions using a deconvolution-based neural net framework, which copes with the data deficiency problems of discretization. It has the advantage of being flexible while also exploiting the hierarchical smoothness offered by the deconvolution layers. We compare our method to a number of other density-estimation approaches and show that our Deconvolutional Density Network (DDN) outperforms the competing methods on many univariate and multivariate tasks. The code of DDN is available at https://github.com/NBICLAB/DDN.
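The bin-discretization strategy the abstract starts from amounts to a normalized histogram over the output domain; a minimal one-dimensional sketch of that baseline (DDN itself replaces raw bin counts with a smoothed, deconvolution-based estimate):

```python
def discretize_density(ys, lo, hi, n_bins):
    """Histogram estimate of p(y): counts normalized by sample count and bin width."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for y in ys:
        # clamp the top edge so y == hi falls into the last bin
        idx = min(int((y - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(ys)
    return [c / (total * width) for c in counts]
```

Normalizing by both the sample count and the bin width makes the values integrate to one, so the output is a proper (piecewise-constant) density estimate; with few samples and many bins, the estimate becomes noisy, which is exactly the data-deficiency problem the abstract targets.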

JBHI Journal 2022 Journal Article

Guest Editorial Sensing Psychological Parameters and AI-Enabled Emotion Care for Human Wellness

  • Min Chen
  • Hamid Gharavi
  • Lin Wang
  • Victor C. M. Leung
  • Zhongchun Liu
  • Iztok Humar

The papers in this special section focus on the use of artificial intelligence (AI)-enabled technologies to address human wellness. As the COVID-19 pandemic took hold over the last several years, there was an urgent demand to pay more attention to psychological health for human wellness by providing methods and means of sensing psychological parameters, emotional care, and mental disorder patient monitoring, especially during these difficult times. With the aid of wearable computing technology and artificial intelligence, emotion and mental disorder detection is available through sensing and analyzing psychological parameters. The section also discusses AI-based patient monitoring and the ability to monitor human wellness via remote sensing technologies. The papers in this issue provide a snapshot of some of the latest research advances on Small Things and Big Data, knowledge discovery, and knowledge representation, and on combining these toward biomedical and health informatics.

AAMAS Conference 2021 Conference Paper

Modeling Replicator Dynamics in Stochastic Games Using Markov Chain Method

  • Chuang Deng
  • Zhihai Rong
  • Lin Wang
  • Xiaofan Wang

In stochastic games, individuals need to make decisions in multiple states, and transitions between states significantly influence the dynamics of strategies. In this work, by describing the dynamic process in a stochastic game as a Markov chain and utilizing the transition matrix, we introduce a new method, named state-transition replicator dynamics, to obtain the replicator dynamics of a stochastic game. Based on our proposed model, we can gain qualitative and detailed insights into the influence of transition probabilities on the dynamics of strategies. We illustrate that a set of unbalanced transition probabilities can help players overcome social dilemmas and lead to mutual cooperation in a cooperation-backed state, even if the stochastic game has the same social dilemma in each state. Moreover, we also show that a set of specifically designed transition probabilities can fix the expected payoffs of one player and make him lose the motivation to update his strategies in the stochastic game.
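The transition-matrix machinery underlying this method can be illustrated by computing a chain's stationary state distribution with power iteration; this is an illustrative building block for reasoning about long-run state occupancy, not the paper's full derivation:

```python
def stationary(P, iters=500):
    """Stationary distribution of a row-stochastic transition matrix P.

    Power iteration: repeatedly push a distribution through the chain
    until it stops changing (converges for ergodic chains).
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

For a two-state chain with P = [[0.9, 0.1], [0.5, 0.5]], solving pi = pi P gives pi = (5/6, 1/6): the chain spends five-sixths of its time in the first state, which is the kind of quantity that determines how much each state's dilemma weighs on the overall dynamics.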

ICML Conference 2018 Conference Paper

Adversarial Attack on Graph Structured Data

  • Hanjun Dai
  • Hui Li
  • Tian Tian 0001
  • Xin Huang
  • Lin Wang
  • Jun Zhu 0001
  • Le Song

Deep learning on graph structures has shown exciting results in various applications. However, little attention has been paid to the robustness of such models, in contrast to the numerous research works on adversarial attack and defense for images or text. In this paper, we focus on adversarial attacks that fool deep learning models by modifying the combinatorial structure of the data. We first propose a reinforcement learning based attack method that learns a generalizable attack policy while requiring only prediction labels from the target classifier. We further propose attack methods based on genetic algorithms and gradient descent for the scenario where additional prediction confidence or gradients are available. We use both synthetic and real-world data to show that a family of Graph Neural Network models is vulnerable to these attacks, in both graph-level and node-level classification tasks. We also show that such attacks can be used to diagnose the learned classifiers.

AAAI Conference 2016 Conference Paper

Toward a Better Understanding of Deep Neural Network Based Acoustic Modelling: An Empirical Investigation

  • Xingfu Wang
  • Lin Wang
  • Jing Chen
  • Litao Wu

Recently, deep neural networks (DNNs) have outperformed traditional acoustic models on a variety of speech recognition benchmarks. However, due to system differences across research groups, and although a tremendous breadth and depth of related work has been established, it is still not easy to assess the performance improvement of a particular architectural variant from the literature when building DNN acoustic models. Our work aims to uncover which variations among baseline systems are most relevant to automatic speech recognition (ASR) performance via a series of systematic tests on the limits of the major architectural choices. By holding all other components fixed, we are able to explore the design and training decisions without being confounded by other influencing factors. Our experimental results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, not only help build a better understanding of why DNN acoustic models perform well and how they might be improved, but also help establish a set of best practices for new speech corpora and language understanding task variants.