Author name cluster

Wei Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

209 papers

2 author rows

EAAI Journal 2026 Journal Article

A dual-stream ensemble learning model for front vehicle lane-changing maneuver identification

Hongjia Zhang
Wei Li
Xia Zhao
Rui Fu
Yingshi Guo

The Lane-Changing Maneuver (LCM) behavior of the front vehicle affects the host vehicle's safety. Each year, a large number of traffic accidents are attributed to lane-changing and cut-in, posing serious challenges to traffic safety. To address this issue, this paper proposes a dual-stream ensemble Convolutional Neural Network-Vision Transformer (CNN-ViT) model based on computer vision for identifying the LCM of front vehicles. Firstly, 6800 sets of natural driving samples that capture the LCM of the front vehicle are collected using the data collection platform. Secondly, the temporal stream features are extracted from the videos using optical flow theory, and they are concatenated with the spatial stream features extracted from the videos to create the model input. Finally, inspired by the Bagging theory, an ensemble learning model is proposed to identify the front vehicles' LCM. The CNN and ViT algorithms are fused to form the base-classifier, and a voting strategy is applied to fuse this base-classifier to get the ensemble CNN-ViT. The results show that the ensemble CNN-ViT model proposed in this paper has excellent identification performance. At 0.4 s, 0.8 s, and 1.2 s after the LCM occurred, the identification accuracy of the model reaches 84.58%, 91.52%, and 95.41%, respectively, which is 9.17%, 4.70%, and 4.29% higher than that of the CNN-Gate Recurrent Unit model that is commonly deployed in such problems. To sum up, this study contributes to enhancing early switching in adaptive cruise control systems, thereby improving safety and comfort.
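
The voting step described in this abstract can be illustrated with a minimal sketch. It assumes plain majority voting with ties going to the earliest vote; the abstract does not say which voting rule the authors use. The three lambda classifiers and the `lateral_flow` feature are hypothetical stand-ins for trained CNN-ViT base classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label that was voted first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(base_classifiers, sample):
    """Collect one vote per base classifier and fuse the votes by majority."""
    votes = [clf(sample) for clf in base_classifiers]
    return majority_vote(votes)


# Hypothetical base classifiers (assumption): each thresholds a made-up
# lateral optical-flow feature instead of running a real CNN-ViT model.
base_classifiers = [
    lambda x: "left" if x["lateral_flow"] < -0.1 else "keep",
    lambda x: "left" if x["lateral_flow"] < -0.3 else "keep",
    lambda x: "left" if x["lateral_flow"] < -0.2 else "keep",
]

result = ensemble_predict(base_classifiers, {"lateral_flow": -0.25})
print(result)
```

Two of the three stand-in classifiers vote "left" for this sample, so the fused prediction is "left".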

EAAI Journal 2026 Journal Article

A multi-modal multi-task learning network for intelligent parameter measurement in gas–liquid two-phase flow

Hanqing Chen
Zhiqiang Zhao
Bang Zhou
Ruiqi Wang
Mengyu Li
Wei Li
Jun Liu
Weidong Cao

Accurate identification of flow patterns and reliable measurement of phase fraction are fundamental for monitoring and control in gas–liquid two-phase flow systems. Conventional sensing and modeling approaches, however, are often constrained by limited spatial resolution and adaptability to dynamic operating conditions. A multi-modal multi-task learning network (MMLNet) is proposed, which integrates spatially distributed conductance time-series signals acquired from a custom-designed sensor with synchronized high-speed flow images. The network adopts a dual-branch architecture, where modality-specific backbones are constructed using multi-scale depthwise separable convolutions, followed by attention-driven cross-modal interaction and a per-token sample gate for adaptive fusion. Under a unified multi-task objective, MMLNet jointly optimizes flow pattern classification and gas volume fraction (GVF) regression, thereby exploiting the inherent correlation between the two tasks to improve accuracy and generalization. Experimental results show that MMLNet achieves 99.88% accuracy in flow pattern classification, with a mean absolute error (MAE) of 0.63%, and a mean absolute percentage error (MAPE) of 2.23% for GVF prediction, outperforming state-of-the-art baselines. These results highlight the potential of MMLNet as a scalable soft-sensing solution for multiphase flow monitoring.
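
For readers unfamiliar with the two regression metrics quoted above, the sketch below computes MAE and MAPE from their standard definitions. The GVF values are invented for illustration and are not taken from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the true value, in percent.

    Undefined when any true value is zero.
    """
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up gas volume fractions in percent (assumption, illustration only).
gvf_true = [20.0, 40.0, 50.0]
gvf_pred = [21.0, 38.0, 50.0]

mae = mean_absolute_error(gvf_true, gvf_pred)
mape = mean_absolute_percentage_error(gvf_true, gvf_pred)
print(mae, mape)
```

Because GVF is itself expressed in percent, an MAE reported as a percentage is an absolute difference in percentage points, while MAPE is relative to the true value.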

EAAI Journal 2026 Journal Article

An interpretable multimodal transformer for medical report generation via hierarchical semantics and clinical labeling

Jia Sheng Yang
Chenbo Xia
Wei Li
Xu Xiao

Radiology report generation remains challenging in artificial intelligence due to the semantic gap between medical images and diagnostic narratives and the lack of clinical interpretability in current models. To address these issues, we propose a novel cognitive-driven multimodal framework that integrates hierarchical visual-semantic alignment with a radiologist-inspired attention mechanism. Specifically, we extract pixel, region, and organ level features using a Vision Transformer (ViT) and align them with Radiology Lexicon (RadLex)-based medical concepts to enhance anatomical grounding. Our model incorporates a knowledge-enhanced multi-label classification module to jointly predict diagnostic labels during report generation. This component is designed using a novel Multi-path Interaction Expansion Multilayer Perceptron (MIX-MLP) structure, which improves robustness to noisy or incomplete labels by enabling residual feature refinement across classification stages. Beyond serving as an auxiliary loss, this module provides interpretable intermediate supervision. These predicted labels are embedded and fused with visual and textual features through a Radiological Cognitive Triangular Attention (RCTA) module, which simulates the iterative reasoning process of radiologists. Experimental results on two benchmarks demonstrate the superiority of our method. On the Indiana University Chest X-ray Collection (IU-Xray), our model achieves a Consensus-based Image Description Evaluation (CIDEr) score of 0.654 (25.0%), while on Medical Information Mart for Intensive Care Chest X-Ray (MIMIC-CXR), it reaches a CIDEr score of 0.397 (47.5%), outperforming state-of-the-art baselines. Furthermore, the generated reports exhibit improved semantic accuracy, clinical consistency, and label-text alignment. These results demonstrate that our method effectively bridges the semantic gap in medical report generation and holds promise for real-world clinical applications.
Our code is available at https://github.com/jiasyang/Hi-CliTR.

TAAS Journal 2026 Journal Article

Auto-Follower: A Person-Following System for Urban Ackermann Human–Machine Collaborative Robotics

Zhijian Li
Dongliang Kou
Yizhao Wang
Wei Li
Zhiyan Dong
Lihua Zhang

Industry 5.0 is emerging as the next phase of industrial evolution, emphasizing human-centric manufacturing through close human–robot collaboration and the deployment of intelligent autonomous systems. As a representative example of such autonomy, person-following robots are typically implemented on differential-drive or omnidirectional mobile bases. However, certain tasks require Ackermann-steered robots, which face unique challenges due to limited maneuverability and the complexity of urban environments, often leading to target loss or navigation into non-drivable areas. To address these issues, we propose Auto-Follower, a person-following framework with enhanced perception and navigation capabilities. Auto-Follower integrates a vision–LiDAR servo tracker that fuses camera images with LiDAR points from a motorized rotating sensor, enabling 360° target perception. Instead of relying on a global map, the system employs real-time LiDAR-based local mapping for efficient path planning. In addition, an Iterative Radius Points Search (IRPS) method is developed to identify obstacle-free navigation goals when the target enters non-drivable regions, ensuring safe and continuous following. The framework has been validated extensively in both laboratory and urban environments and demonstrates robust, reliable performance, with strong potential for adaptation to diverse real-world person-following applications.
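
The abstract does not spell out how the Iterative Radius Points Search works, so the following is only a loose illustration of the general idea of widening a search radius until an obstacle-free goal is found. The occupancy-grid representation, the square rings, and the function name `find_drivable_goal` are all assumptions, not the authors' method.

```python
import math


def find_drivable_goal(grid, target, max_radius):
    """Search square rings of growing radius around `target`.

    grid[r][c] is True where the cell is drivable. Returns the drivable cell
    closest to the target (Euclidean distance) on the first ring that contains
    one, or None if no ring up to `max_radius` does.
    """
    rows, cols = len(grid), len(grid[0])
    tr, tc = target
    for radius in range(max_radius + 1):
        candidates = []
        for r in range(tr - radius, tr + radius + 1):
            for c in range(tc - radius, tc + radius + 1):
                if max(abs(r - tr), abs(c - tc)) != radius:
                    continue  # keep only cells lying exactly on this ring
                if 0 <= r < rows and 0 <= c < cols and grid[r][c]:
                    candidates.append((r, c))
        if candidates:
            return min(candidates, key=lambda p: math.hypot(p[0] - tr, p[1] - tc))
    return None


# Toy map (assumption): only the top row is drivable, the person stands at (2, 2).
grid = [
    [True, True, True, True],
    [False, False, False, False],
    [False, False, False, False],
    [False, False, False, False],
]
goal = find_drivable_goal(grid, (2, 2), max_radius=3)
print(goal)
```

When the target itself is drivable, the radius-0 ring returns it unchanged, so the fallback only activates once the person steps into a non-drivable region.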

AAAI Conference 2026 Conference Paper

Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

Ziwei Liu
Borui Kang
Wei Li
Hangjie Yuan
Yanbing Yang
Wenbin Li
Yifan Zhu
Tao Feng

Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies is enabling these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially in the limited exploration subspace within PEFT. To overcome this challenge, this paper pioneers a systematic exploration of adopting Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL due to optimization process instability. We then investigate the application of ZO optimization from a modality branch-wise to a fine-grained layer-wise granularity across various training units to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during the ZO optimization process, and we propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. Benefiting from the adoption of ZO optimization, PEFT-based VLCL gains a better ability to escape local minima during the optimization process. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
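
As background for the zeroth-order idea mentioned above, here is a minimal sketch of one gradient-free update. It assumes a generic two-point estimator with sign normalization; the paper's exact estimator, its modality-aware perturbation constraint, and its hyperparameters are not given in the abstract. The function name `zo_sign_step` and the values of `lr` and `mu` are illustrative.

```python
import random


def zo_sign_step(loss_fn, params, lr=0.05, mu=1e-3, rng=random):
    """One zeroth-order update that uses only loss evaluations, no backpropagation.

    The directional slope is estimated from two evaluations along a random
    Gaussian direction, and only the sign of the resulting gradient estimate
    is kept (sign normalization).
    """
    direction = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + mu * d for p, d in zip(params, direction)]
    minus = [p - mu * d for p, d in zip(params, direction)]
    slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)

    def sign(x):
        return (x > 0) - (x < 0)

    # Gradient estimate per coordinate is slope * d; step against its sign.
    return [p - lr * sign(slope * d) for p, d in zip(params, direction)]


def quadratic(w):
    """Toy loss with its minimum at the origin."""
    return sum(x * x for x in w)


updated = zo_sign_step(quadratic, [1.0], lr=0.1, rng=random.Random(0))
print(updated)
```

In one dimension the sign of the estimate always matches the sign of the true derivative, so the parameter moves from 1.0 toward the minimum by exactly `lr`. In higher dimensions a single step is noisy and is not guaranteed to reduce the loss.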

JBHI Journal 2026 Journal Article

CAMM: Confidence-Aligned Multiview Multimodal Fusion for Brain Disorders Prediction With Imaging Transcriptomics

Haoran Luo
Zhoujie Fan
Wei Li
Hong Liang
Chen Jason Zhang
Xiaoyong Wei
Zheng Wang
Shan Cong

Brain disorder prediction can be enhanced by models that capture not only imaging phenotypes but also their underlying molecular context. Neuroimaging provides detailed structural and functional information, yet it offers limited insight into the gene-regulated processes driving these alterations. Transcriptomic atlases offer such molecular insights but are rarely available at the subject level due to invasive sampling. To address this gap, we propose CAMM, a confidence-aware multi-modal framework that integrates transcriptomic priors with imaging features to embed molecular context before fusion. CAMM further introduces a unified confidence calibration–regularization strategy that adapts modality contributions at the sample level, ensuring that information from high-confidence samples is leveraged to improve predictions for low-confidence samples, thereby enhancing robustness. Applied to large neuroimaging cohorts, CAMM consistently surpasses state-of-the-art baselines and identifies biologically meaningful biomarkers, demonstrating how transcriptomic priors can bridge molecular mechanisms and imaging for interpretable precision modeling of brain disorders.

AAAI Conference 2026 Conference Paper

CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Dongchen Li
Jitao Liang
Wei Li
Xiaoyu Wang
Longbing Cao
Kun Yu

Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and fragmented nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these settings, CliCARE significantly outperforms baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by oncologists.

YNIMG Journal 2026 Journal Article

Corrigendum to “Opposite changes in morphometric similarity of medial reward and lateral non-reward orbitofrontal cortex circuits in obesity” [NeuroImage, 290 (2024) 120574]

Debo Dong
Ximei Chen
Wei Li
Xiao Gao
Yulin Wang
Feng Zhou
Simon B. Eickhoff
Hong Chen

IS Journal 2026 Journal Article

CRLNet: Cascaded Resolution Learning Network for Natural Scenes Segmentation

Wei Li
Shishun Tian
Guoguang Hua
Muxin Liao
Yuhang Zhang
Wenbin Zou

The natural environment presents a multitude of scenes with diverse content, posing challenges for satisfactory segmentation results using existing segmentation networks. In response, we propose a cascaded resolution learning network (CRLNet) to enhance segmentation performance through global textual embedding and multiresolution feature learning. The CRLNet constructs a multipath segmentation system that integrates multiresolution feature data from different paths, thereby progressively enhancing local feature learning. Two key modules, the partition-fusion channel attention module (PFCAM) and features learning module (FLM), are pivotal components of CRLNet. The PFCAM serves as a computationally efficient channel attention module to mitigate segmentation confusion stemming from similar objects. Meanwhile, the FLM is tailored to learn resolution feature maps from different paths, facilitating the refinement of object representation and enhancing segmentation performance. Extensive experiments that were conducted on real natural scene datasets demonstrate the superior accuracy of CRLNet over existing efficient segmentation methods.

AAAI Conference 2026 Conference Paper

DarkFarseer: Robust Spatio-Temporal Kriging Under Graph Sparsity and Noise

Zhuoxuan Liang
Wei Li
Dalin Zhang
Ziyu Jia
Yidan Chen
Zhihong Wang
Xiangping Zheng
Moustafa Youssef

The rapid expansion of the Internet of Things (IoT) has created a growing demand for large-scale sensor deployment. However, the high cost of physical sensors limits the scalability and coverage of sensor networks, making fine-grained sensing difficult. Inductive Spatio-Temporal Kriging (ISK) addresses this challenge by introducing virtual sensors that infer measurements from physical sensors, typically using graph neural networks (GNNs) to model their relationships. Despite its promise, current ISK methods often rely on standard message-passing and generic architectures that fail to effectively capture spatio-temporal features or represent virtual nodes accurately. Additionally, existing graph construction techniques suffer from sparse and noisy connections, further hindering performance. To address these limitations, we propose DarkFarseer, a novel ISK framework with three key innovations. First, the Style-enhanced Temporal-Spatial architecture adopts a temporal-then-spatial processing scheme with a temporal style transfer mechanism to enhance virtual node representations. Second, Regional-semantic Contrastive Learning improves representation learning by aligning virtual nodes with regional component patterns. Third, the Similarity-Based Graph Denoising Strategy mitigates the influence of noisy edges by leveraging temporal similarity and regional structure. Extensive experiments on real-world datasets demonstrate that DarkFarseer significantly outperforms state-of-the-art ISK methods.

AAAI Conference 2026 Conference Paper

FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization

Rong Zhang
Jinxiao Li
Jingnan Wang
Zhiwen Zuo
Jianfeng Dong
Wei Li
Chi Wang
Weiwei Xu

Garment-centric fashion image generation aims to synthesize realistic and controllable human models dressing a given garment, which has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model's appearance. Existing methods typically require performing garment deformation in the generation process, which often leads to garment texture distortions. Also, they fail to control the fine-grained attributes of the generated models, due to the lack of specifically designed mechanisms. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate the need for performing garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the generated regions for each fine-grained text attribute and enforces the text attribute to focus on the predicted regions by a chained mask injection strategy, significantly enhancing the visual fidelity and the controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.

AAAI Conference 2026 Conference Paper

From Semantics to Spectrum: A New Lens on Graph Augmentation Strategy

Xiangping Zheng
Xiuxin Hao
Bo Wu
Wei Li
Bin Ren
Bin Tang
Yuhui Guo
Xun Liang

Graph augmentation is a cornerstone of effective graph contrastive learning, yet existing methods often rely on randomly designed perturbations, which may distort latent semantics and impair representation quality. In this work, we argue that semantic consistency can be effectively approximated by low-frequency components in the spectral domain, offering a principled proxy for guiding augmentation. Based on this insight, we propose Frequency-Aware Graph Contrastive Learning (FA-GCL), a novel framework that explicitly preserves low-frequency signals while selectively perturbing high-frequency components. By aligning augmentation with frequency-aware decomposition, FA-GCL generates diverse yet semantically coherent views, mitigating semantic drift and enhancing representational discrimination. Extensive experiments across multiple benchmarks demonstrate that FA-GCL consistently outperforms state-of-the-art baselines with statistically significant gains, validating its exclusive merits.

AAAI Conference 2026 Conference Paper

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and a Comprehensive Multimodal Dataset Towards General Medical AI

Tianbin Li
Yanzhou Su
Wei Li
Bin Fu
Zhe Chen
Ziyan Huang
Guoan Wang
Chenglong Ma

Despite significant advancements in general AI, its effectiveness in the medical domain is limited by the lack of specialized medical knowledge. To address this, we formulate GMAI-VL-5.5M, a multimodal medical dataset created by converting hundreds of specialized medical datasets with various annotations into high-quality image-text pairs. This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. This approach significantly improves the model's ability to process multimodal data, supporting accurate diagnoses and clinical decision-making. Experiments show that GMAI-VL achieves state-of-the-art performance across various multimodal medical tasks, including visual question answering and medical image diagnosis.

AAAI Conference 2026 Conference Paper

GRIP: Latent Field-Guided Graph Policy for Budget-Constrained Multi-Agent Routing

Yujiao Hu
Zuyu Chen
MengJie Lee
Jinchao Chen
Meng Shen
Hailun Zhang
Wei Li
Yan Pan

Subset selection under budget constraints is critical in applications like multi-robot patrolling, crime deterrence, and targeted marketing, where multiple agents must jointly select targets and plan feasible routes. We formalize this challenge as Multi-Subset Selection with Budget-Constrained Routing (MSS-BCR), involving complex, non-additive cost structures that defy traditional methods. We propose GRIP, a graph-based framework integrating spatial reward fields and policy learning to enable coordinated, budget-aware target selection and routing. GRIP uses attention-based embeddings and constraint-triggered pruning with utility recovery to produce high-quality, feasible solutions. Experiments based on multiple synthetic and real-world datasets show GRIP outperforms baselines in reward efficiency and scalability across varied scenarios.

EAAI Journal 2026 Journal Article

Hypergraph topic neural network with cross-modal fusion for latent treatment pattern recommendation

Xin Min
Wei Li
Weidong Xie
Pengfei Zhang
Chuanbiao Wen
Weiping Ding

Electronic Medical Records (EMR) provide valuable data for intelligent treatment recommendation systems. However, existing methods face challenges in modeling complex high-order relationships and integrating heterogeneous medical features effectively. Traditional topic models struggle with shallow semantic representations. Conversely, hypergraph methods lack interpretability because they insufficiently integrate textual features. This study proposes a Hypergraph Topic neural network with cross-modal fusion for treatment pattern recommendation (HGTCNRec), integrating hypergraph neural networks with topic modeling and cross-modal feature fusion. The framework integrates three key components: hypergraph neural topic modeling for capturing high-order treatment relationships, auxiliary textual feature extraction using Transformer encoders, and cross-modal fusion for heterogeneous feature integration. The hypergraph structure captures complex co-occurrence patterns among treatment behaviors through hyperedge representations. The topic modeling component discovers clinically meaningful patterns while maintaining interpretability. We propose a novel score-level cross-modal architecture that fundamentally differs from existing fusion methods through key architectural innovations. Comprehensive experiments on three real-world datasets demonstrate superior performance compared to baseline methods. Results show significant improvements in recommendation accuracy metrics and topic modeling quality measures. This method has been clinically validated by experts to offer superior interpretability and holds promise for supporting personalized treatment plans. However, its practical clinical value requires further confirmation through in-depth evaluation by clinicians and prospective studies.

AAAI Conference 2026 Conference Paper

I2CD: An Invertible Causal Framework for Compositional Zero-Shot Learning via Disentangle-Compose-Disentangle

Zhaoquan Yuan
Zining Wang
Yuankang Pan
Ao Luo
Wei Li
Xiao Wu
Changsheng Xu

Compositional Zero-Shot Learning (CZSL) addresses the challenge of recognizing unseen attribute-object compositions in images, representing a fundamental challenge in artificial intelligence. Current approaches, which primarily focus on semantic alignment or distribution independence of primitives, have not achieved effective state-object decoupling and causal interventional invariance, limiting their performance on unseen compositions. To tackle this challenge, this study introduces I2CD (Invertible Causal framework via Disentangle-Compose-Disentangle), a novel framework that integrates invertible neural networks with causal intervention techniques to achieve state-object disentanglement. The framework employs a disentangle-compose-disentangle mechanism for counterfactual generation within the disentangled representation space, ensuring that modifications to one primitive (attribute or object) maintain independence from the other, thus enabling robust causal disentanglement. Representational consistency is maintained through semantic alignment between initial disentangled representations and their recomposed-then-disentangled counterparts with corresponding textual concepts. Comprehensive evaluations on three benchmark datasets—MIT-States, UT-Zappos, and C-GQA—demonstrate the framework's effectiveness in achieving both disentanglement and compositional generalization in CZSL tasks.

AAAI Conference 2026 Conference Paper

InfoCom: Kilobyte-Scale Communication-Efficient Collaborative Perception with Information Bottleneck

Quanmin Wei
Penglin Dai
Wei Li
Bingyi Liu
Xiao Wu

Precise environmental perception is critical for the reliability of autonomous driving systems. While collaborative perception mitigates the limitations of single-agent perception through information sharing, it encounters a fundamental communication-performance trade-off. Existing communication-efficient approaches typically assume MB-level data transmission per collaboration, which may fail due to practical network constraints. To address these issues, we propose InfoCom, an information-aware framework establishing the pioneering theoretical foundation for communication-efficient collaborative perception via extended Information Bottleneck principles. Departing from mainstream feature manipulation, InfoCom introduces a novel information purification paradigm that theoretically optimizes the extraction of minimal sufficient task-critical information under Information Bottleneck constraints. Its core innovations include: i) An Information-Aware Encoding condensing features into minimal messages while preserving perception-relevant information; ii) A Sparse Mask Generation identifying spatial cues with negligible communication cost; and iii) A Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction. Comprehensive experiments across multiple datasets demonstrate that InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP, respectively.

EAAI Journal 2026 Journal Article

Large language models for explainable fault diagnosis of machines

Hamzah A.A.M. Qaid
Bo Zhang
Shuai Su
Dan Li
See-Kiong Ng
Wei Li

Large Language Models (LLMs) have demonstrated remarkable capabilities in capturing complex conceptual representations from textual data for a wide range of real-world applications. However, in Intelligent Fault Diagnosis (IFD), leveraging sensor data such as vibration signals is essential but remains a challenge due to the modality gap between time series and LLMs’ inputs. Existing efforts to bridge this gap often treat LLMs merely as classifiers, overlooking their potential for understanding and reasoning over vibration-based data. In this paper, we propose a novel LLM-based fault diagnosis framework (FD-LLM) that aligns vibration signals with LLMs by encoding the signals into textual representations. FD-LLM introduces a classification-oriented approach, which formulates fault diagnosis as a multi-class classification task for benchmarking LLMs’ performance, and a context-aware spectrum language modeling approach that enables explainable, reasoning-driven fault analysis. We evaluate four open-source LLMs using FD-LLM across multiple datasets and noise conditions, assessing their validity, adaptability, and robustness. The results demonstrate that models such as LLaMA models achieve robust diagnostic performance, strong zero-shot adaptability across operating conditions, and effective generalization in cross-dataset scenarios with few-shot learning. The results further indicate that explainable fault diagnosis can be achieved in LLMs.

JBHI Journal 2026 Journal Article

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

Ziyan Huang
Haoyu Wang
Jin Ye
Yuanfeng Ji
Xiaowei Hu
Lihao Liu
Zhikai Yang
Wei Li

Medical image segmentation is vital for clinical diagnosis and treatment; however, current solutions face three major limitations: (1) the lack of a universal framework capable of handling diverse modalities and anatomical targets, (2) the limited scalability to adapt to evolving clinical needs and new datasets, and (3) the lack of instructive interfaces that make models usable for non-expert users. To address these challenges, this paper presents MedSegAgent, a universal and scalable multi-agent system for instructive medical image segmentation. Specifically, MedSegAgent comprises five agents: one query parsing agent that processes natural language requests, three coarse-to-fine filtering agents (modality filtering, anatomical filtering, and label selection) for identifying relevant datasets and label values, and one execution agent responsible for model inference and result integration. Based on this framework, MedSegAgent utilizes 23 diverse datasets and pre-trained models to perform 343 types of segmentation across various modalities and anatomical targets. Experimental results demonstrate that MedSegAgent simplifies model selection while maintaining high performance, accurately identifying matching datasets and labels in 94.27% of queries and locating at least one suitable match in 99.03% of queries. MedSegAgent offers a universal and scalable solution for diverse medical image segmentation tasks, bridging the gap between user-friendly queries and the complexities of model selection and deployment. Our code is publicly available at https://github.com/uni-medical/MedSegAgent.

AAAI Conference 2026 Conference Paper

Monocular Vehicle Pose and Shape Reconstruction via Dynamic Context Adaptation and Progressive Geometry Refinement

Wei Li
Long Ji
Ying Wang
Xiao Wu
Zhaoquan Yuan
Penglin Dai

Accurate reconstruction of 3D vehicle pose and shape from monocular images is challenging, particularly for distant objects in autonomous driving. Existing methods often suffer from geometric ambiguity in depth estimation and structural hollowness in shape recovery, primarily due to inadequate multi-scale feature aggregation and inflexible prior modeling. To overcome these limitations, MonoVPR is proposed, a novel framework integrating dynamic context adaptation and progressive geometry refinement. Specifically, a Hierarchical Dual-Context Attention (HDCA) module is introduced to resolve scale-dependent degradation through gated cross-attention across multi-resolution feature maps, dynamically fusing object-centric geometric cues with scene-centric semantics. For shape refinement, the Bounded Iterative Mesh Refiner (BIMR) progressively optimizes template-guided deformations via multi-head attention and a tanh-bounded correction loop, ensuring physically plausible reconstructions. Extensive experiments on the ApolloCar3D benchmark demonstrate that MonoVPR achieves state-of-the-art performance, showing exceptional capability in reconstructing geometrically consistent shapes and precise poses for challenging long-range scenarios.
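
To make the phrase "tanh-bounded correction loop" concrete, the sketch below shows one plausible reading: each iteration adds an offset squashed by tanh, so no vertex can move more than a fixed step per pass. This is an assumption about the general mechanism, not the BIMR module itself; `predict_offset` is a hypothetical stand-in for the attention-based offset predictor, and `max_step` is an illustrative value.

```python
import math


def refine_vertices(template, predict_offset, iterations=3, max_step=0.05):
    """Iteratively deform template vertices with bounded corrections.

    Each raw offset is passed through tanh and scaled by `max_step`, so every
    coordinate moves by at most `max_step` per iteration and by at most
    `iterations * max_step` in total.
    """
    vertices = [list(v) for v in template]
    for _ in range(iterations):
        for v in vertices:
            raw = predict_offset(v)
            for axis in range(len(v)):
                v[axis] += max_step * math.tanh(raw[axis])
    return vertices


# Hypothetical predictor (assumption) that outputs wildly large raw offsets.
def unstable_predictor(vertex):
    return [1000.0, -1000.0, 0.0]


refined = refine_vertices([[0.0, 0.0, 0.0]], unstable_predictor)
print(refined)
```

Even with raw offsets of magnitude 1000, the vertex ends up only about 0.15 away from the template along each affected axis, which is how a bound of this kind keeps deformations near the shape prior.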

YNIMG Journal 2026 Journal Article

Multidimensional characterization of structure aberrations for biotypes of major depressive disorder

Jiang Zhang
Heng Zhang
Hui Sun
Tianwei Qin
Jun Pan
Jin Chen
Wei Li
Meiling Chen

BACKGROUND: Major depressive disorder (MDD) is a heterogeneous clinical syndrome associated with brain structural abnormalities, yet the neurobiological heterogeneity and consistent neuroimaging findings underlying these alterations remain unclear. Multilevel and multidimensional analyses are therefore needed to identify reliable structural signatures of MDD biotypes. METHODS: K-means clustering was applied to identify biotypes in 387 drug-naive MDD patients, with gray matter volume (GMV) compared to 1104 healthy controls. Causal structural covariance network (CaSCN), individual differential structural covariance network (IDSCN), and graph theory-based single-subject morphological network analyses were performed to characterize subtype-specific causal influences, individual-level covariance, and network topology. Transcriptomic and neurotransmitter association analyses were further conducted to probe the biological mechanisms underlying each subtype. RESULTS: Subtype 1 showed predominant GMV alterations in the visual network, subtype 2 in somatomotor, default mode, and limbic networks, and subtype 3 in cerebellar-limbic regions. CaSCN revealed subtype-specific directed influences, indicating differential propagation of structural abnormalities. IDSCN identified distinct altered covariance patterns, highlighting subtype-dependent thalamo-cerebellar changes and selective links to depressive severity. Graph theory showed divergent global topology, with subtype 1 exhibiting higher network integration, whereas subtypes 2 and 3 showed reduced integration and efficiency. Each biotype showed distinct neurobiological profiles, with subtype 1 enriched in cellular functions, subtype 2 in metabolic regulation, and subtype 3 in neurodevelopmental genes, alongside distinct neurotransmitter associations. 
CONCLUSIONS: These findings advance the understanding of structural and individual-level network alterations underlying MDD biotypes and provide novel insights into the neurobiological mechanisms of MDD heterogeneity.

AAAI Conference 2026 Conference Paper

MvP-ECR: Multi-Perspective Emotion-Cause Reasoning for Empathetic Dialogue

Yuanyuan He
Guotai Huang
Wei Li
Jiali You
Jiawen Deng
Fuji Ren

Empathetic dialogue systems aim to recognize user emotions and generate appropriate empathetic responses. However, existing approaches predominantly rely on dialogue history, contextual descriptions, and emotion category labels, failing to model the causal relationship between emotions and their underlying triggers. This limitation leads to generated responses that lack grounding, exhibit weak relevance, and suffer from poor interpretability in emotional expression. To address this, we propose MvP-ECR, a multi-perspective emotion-cause reasoning framework that explicitly constructs emotion-cause structures to help models focus on the core emotional drivers. Additionally, we introduce an emotion-cause consistency evaluation metric to quantitatively assess a model’s ability to identify causal relationships. Experiments across multiple large language models (LLMs) demonstrate that the MvP-ECR framework can serve as a plug-and-play tool that helps a model correctly infer emotions and causes in empathetic conversations and produce more immersive empathetic responses. All code and data will be publicly released to promote the development of empathetic dialogue research.

AAAI Conference 2026 Conference Paper

Outlier Matters: Efficient Long-to-Short Reasoning via Outlier-Guided Model Merging

Qiyuan Zhu
Dezhi Li
Lujun Li
Xiaoyu Qin
Wei Li
Hao Gu
Hua Xu
Sirui Han

Large Reasoning Language Models (LRMs) have recently shown remarkable performance in complex reasoning tasks, but their extensive reasoning chains incur substantial computational overhead. To address this challenge, we propose Outlier-aware Reasoning Conciseness Adaptive Merge (ORCA), a novel plug-and-play model merging framework that leverages outlier activation patterns to fuse base models with reasoning models. Our ORCA introduces three key innovations: (1) adaptive alignment that reduces conflicts between disparate activation patterns during merging, (2) outlier-guided allocation that assigns merging coefficients proportional to each layer's reasoning importance as indicated by outlier concentrations, and (3) dynamic probe-based adjustment that adapts merging coefficients during inference based on input-specific activation characteristics. These strategies allow seamless integration into existing merging pipelines while creating unified models that maintain reasoning accuracy with significantly reduced response verbosity. Comprehensive evaluation across six benchmarks using Qwen and LLaMA models shows ORCA reduces average response length by 55% while improving accuracy by 2.4∼5.7% over existing methods.
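The outlier-guided allocation idea, in which layers with more activation outliers receive larger merging coefficients, can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the outlier rule (magnitude above `k` times the layer's mean magnitude), the normalization by the largest fraction, and the linear interpolation in `merge_layer` are not taken from the ORCA paper.

```python
def outlier_fraction(acts, k=3.0):
    """Fraction of activations whose magnitude exceeds k times the mean magnitude."""
    mean_abs = sum(abs(a) for a in acts) / len(acts)
    return sum(abs(a) > k * mean_abs for a in acts) / len(acts)


def merge_coefficients(layer_acts, k=3.0):
    """Per-layer coefficients proportional to outlier concentration, scaled to [0, 1]."""
    fracs = [outlier_fraction(a, k) for a in layer_acts]
    top = max(fracs)
    return [f / top if top > 0 else 0.0 for f in fracs]


def merge_layer(base_w, reasoning_w, alpha):
    """Interpolate one layer's weights: alpha = 0 keeps base, alpha = 1 keeps reasoning."""
    return [(1 - alpha) * b + alpha * r for b, r in zip(base_w, reasoning_w)]


# Toy calibration activations for three layers (made-up numbers).
layers = [
    [1.0] * 9 + [100.0],         # one outlier in ten
    [1.0] * 10,                  # no outliers
    [1.0] * 8 + [100.0, 100.0],  # two outliers in ten
]
alphas = merge_coefficients(layers)
merged = merge_layer([0.0, 2.0], [1.0, 4.0], alphas[0])
print(alphas, merged)
```

In this toy setup the layer with no outliers keeps the base model's weights entirely, while the layer with the highest outlier concentration takes the reasoning model's weights.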

AAAI Conference 2026 Conference Paper

Real Garment Benchmark (RGBench): A Comprehensive Benchmark for Robotic Garment Manipulation Featuring a High-Fidelity Scalable Simulator

Wenkang Hu
Xincheng Tang
Yanzhi E
Yitong Li
Zhengjie Shu
Wei Li
Huamin Wang
Ruigang Yang

While there has been significant progress in using simulated data to learn robotic manipulation of rigid objects, extending this success to deformable objects has been hindered by the lack of both deformable object models and realistic non-rigid body simulators. In this paper, we present Real Garment Benchmark (RGBench), a comprehensive benchmark for robotic manipulation of garments. It features a diverse set of over 6000 garment mesh models, a new high-performance simulator, and a comprehensive protocol to evaluate garment simulation quality with carefully measured real garment dynamics. Our experiments demonstrate that our simulator outperforms currently available cloth simulators by a large margin, reducing simulation error by 20% while running 3 times faster. We will publicly release RGBench to accelerate future research in robotic garment manipulation.

EAAI Journal 2026 Journal Article

Research on transparent working face based on dynamic interpretation technology of three-dimensional seismic data

Wei Li
Lei Zhao
Zaibin Liu
Wenming Liu
Bo Li
Junsheng Yan
Huahui Wang

As coal mining extends to greater depths, accurately detecting coal seam floor undulations, identifying coal thickness variations, and recognizing complex geological features such as collapse columns has become increasingly essential. These challenges raise higher demands for safety and efficiency in mining operations. This study proposes a dynamic interpretation method for transparent mining faces based on three-dimensional (3D) seismic data to enhance the accuracy of detecting coal seam geological structures. The method comprehensively applies target processing and dynamic interpretation to conduct an accurate analysis of the coal seam floor elevation and average velocity field. The data model is dynamically updated by integrating surface drilling and underground roadway data, which significantly enhances model accuracy and reliability. The study shows that the improved data correction method significantly enhances model accuracy and reliability. The accuracy of channel wave detection in structural prediction reaches 75%, while maintaining the maximum absolute error for floor profile prediction within 1.33 m. The random forest model, a machine learning approach, was improved by combining gray correlation and Particle Swarm Optimization (PSO) algorithms, further revealing the complex relationship between coal seam floor elevation and Two-Way Travel Time (TWTT). The method proposed not only enhances the precision and efficiency of transparent face detection using Artificial Intelligence (AI) techniques but also offers reliable geological support for safe coal mining operations. As geological data are continuously and dynamically updated, the method enables real-time optimization of mining decisions and reduces the risk of geological hazards.

AAAI Conference 2026 Conference Paper

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li
Renshan Zhang
Rui Shao
Zhijian Fang
Kaiwen Zhou
Zhuotao Tian
Liqiang Nie

Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: **1) perceptual redundancy**, where irrelevant visual inputs are processed inefficiently, and **2) superficial instruction-vision alignment**, which hampers semantic grounding of actions. In this paper, we propose **SemanticVLA**, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: **1)** To sparsify redundant perception while preserving semantic alignment, **Semantic-guided Dual Visual Pruner (SD-Pruner)** performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. **2)** To exploit sparsified features and integrate semantics with spatial geometry, **Semantic-complementary Hierarchical Fuser (SH-Fuser)** fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. **3)** To enhance the transformation from perception to action, **Semantic-conditioned Action Coupler (SA-Coupler)** replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by **21.1%** in success rate, while reducing training cost and inference latency by **3.0×** and **2.7×**.

AAAI Conference 2026 Conference Paper

SPSC: Sparse and Scalable Multi-Modal 3D Occupancy Prediction for Autonomous Driving

Qingju Guo
Shuang Li
Binhui Xie
Jing Geng
Wei Li

3D semantic occupancy prediction offers a nuanced representation of the surrounding environment, which is crucial for ensuring the safety of autonomous driving. However, fine-grained scene representations inevitably result in cubic growth in data scale, which imposes substantial demands on model architecture and computational complexity, especially in high-resolution scenarios. Existing approaches for handling high-resolution scenes typically obtain fine-grained features by grid sampling on low-resolution feature maps, resulting in limited sparsity and insufficient feature interaction. This paper presents a framework, called SPSC, that leverages SParse representation and SCalable feature interaction to address these challenges. Specifically, we maintain sparsity by progressively pruning unoccupied queries during the coarse-to-fine process, thereby reducing the scale of data that the model needs to handle. Subsequently, we introduce query serialization, which transforms queries into an ordered sequence while preserving their spatial structure. This enables fine-grained feature interaction while maintaining linear computational complexity and a larger receptive field. Without complex architectural designs, SPSC significantly outperforms SOTA approaches, achieving relative mIoU improvements of 12.0%, 11.0% and 4.8% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR and camera settings, respectively.
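Pruning followed by serialization can be illustrated with a small sketch. The abstract does not say which ordering SPSC uses; a Morton (Z-order) code is assumed here only because it is a common way to turn 3D voxel indices into a sequence that keeps spatially close voxels near each other. The query layout, the `threshold` value, and the function names are likewise hypothetical.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of non-negative voxel indices (x, y, z) into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def prune_and_serialize(queries, threshold=0.5):
    """Drop queries predicted as unoccupied, then order the rest along the Z-order curve."""
    kept = [q for q in queries if q["occupancy"] >= threshold]
    return sorted(kept, key=lambda q: morton3d(*q["voxel"]))


# Toy queries: voxel index plus a predicted occupancy score (made-up values).
queries = [
    {"voxel": (1, 1, 1), "occupancy": 0.9},
    {"voxel": (0, 0, 1), "occupancy": 0.2},  # pruned as unoccupied
    {"voxel": (1, 0, 0), "occupancy": 0.8},
    {"voxel": (0, 1, 0), "occupancy": 0.7},
]
ordered = [q["voxel"] for q in prune_and_serialize(queries)]
print(ordered)
```

Once the surviving queries form a one-dimensional sequence, sequence models whose cost grows linearly with length can be applied to them, which matches the linear-complexity claim in the abstract at a high level.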

AAAI Conference 2026 Conference Paper

Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

Lujun Li
Qiyuan Zhu
Jiacheng Wang
Xiaoyu Qin
Wei Li
Hao Gu
Sirui Han
Yike Guo

Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods aim to achieve greater efficiency by consolidating several experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared U-matrices while enabling effective merging of the expert-specific V components. Specifically, Sub-MoE consists of two innovative stages: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first performs Experts Union Decomposition to derive the shared U-matrix across experts in the same group, then applies frequency-based merging for individual V-matrices, and completes expert reconstruction using the merged V-matrix. In this way, we align and fuse experts in a shared subspace. Additionally, the framework can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5/3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%/86% of original performance with 25%/50% expert reduction on Mixtral-8×7B in zero-shot benchmarks.
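One way to read the Subspace Expert Merging step is as the factorization below. This is a sketch under assumptions the abstract does not confirm: that the weight matrices W_i of the k experts in one cluster are concatenated horizontally, and that frequency-based merging weights each V_i by a routing frequency f_i. The symbols W_i, \Sigma, f_i and \bar{V} are introduced here for illustration only.

```latex
\begin{aligned}
[\,W_1 \;\; W_2 \;\; \cdots \;\; W_k\,]
  &= U \Sigma \,[\,V_1^{\top} \;\; V_2^{\top} \;\; \cdots \;\; V_k^{\top}\,], \\
\bar{V} &= \frac{\sum_{i=1}^{k} f_i \, V_i}{\sum_{i=1}^{k} f_i}, \\
W_{\text{merged}} &= U \Sigma \, \bar{V}^{\top}.
\end{aligned}
```

Under this reading each expert satisfies W_i = U \Sigma V_i^{\top}, so all experts in the group share the same U and differ only in V_i. Without rank truncation the merged matrix equals the frequency-weighted average of the W_i; any further compression would come from truncating U and \Sigma, a detail the abstract leaves unspecified.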

EAAI Journal 2026 Journal Article

T-S fuzzy non-fragile robust slip ratio control incorporating multi-source uncertainty and energy recovery for electro-mechanical brake system

Linfeng Lv
Wanzhong Zhao
Chunyan Wang
Wei Li
Zhiyang Shi

To address the issues of inaccurate slip ratio tracking in the Electro-mechanical Brake (EMB) system due to uncertainties in road adhesion coefficient and controllers, as well as the limited energy recovery in traditional braking torque distribution methods, this study proposes a non-fragile robust slip ratio control method for the EMB composite braking system based on maximum braking energy recovery power. The type-2 fuzzy membership function is employed to model uncertainties related to vehicle speed and road adhesion coefficient, improving the modeling accuracy of the EMB Anti-lock Braking System. An interval type-2 Takagi-Sugeno fuzzy non-fragile robust H∞ slip ratio controller is designed to compute the required braking torque for each wheel. By constructing a fuzzy Lyapunov function and introducing relaxation variables, the conservatism of the slip ratio tracking system under uncertain conditions is reduced, thus improving slip ratio tracking accuracy. Additionally, a composite brake torque distribution strategy is developed to increase energy recovery during the slip ratio tracking process. Simulation and experimental results show that the proposed method achieves higher slip ratio tracking accuracy under multi-source uncertainties on different roads. Furthermore, energy recovery is improved by 9.92% on a high-adhesion coefficient road, 8.37% on a medium-adhesion coefficient road, and 49.61% on a low-adhesion coefficient road. These results confirm the effectiveness and superiority of the proposed control method in terms of both slip ratio tracking and energy recovery under varying road conditions and system uncertainties.

EAAI Journal 2026 Journal Article

Towards explainable visual question answering via cross-modal causal reasoning

Wei Li
Fuyun Deng
Zhixin Li

Explainable Visual Question Answering (EVQA) aims to not only predict accurate answers to visual questions but also generate human-friendly multimodal explanations that reveal the underlying reasoning process. Despite significant progress, existing EVQA methods suffer from two critical limitations: (1) they often rely on spurious cross-modal correlations (e.g., linguistic biases or visual shortcuts) rather than genuine causal relations, leading to unreliable reasoning; (2) the consistency between predicted answers and generated explanations is compromised due to the lack of explicit modeling of their causal dependencies. To address these issues, we propose a Cross-Modal Causal Reasoning (CMCR) framework that integrates causal inference with multimodal learning to disentangle causal effects from spurious correlations and enforce answer-explanation consistency. Specifically, CMCR incorporates three key innovations: (1) Causal Intervention, which employs backdoor adjustment to eliminate linguistic biases and frontdoor adjustment to mitigate visual shortcut biases; (2) a Neural-Symbolic Explanation Generator designed to translate symbolic reasoning processes into natural language explanations, thereby enhancing process explainability; and (3) Variational Causal Inference, which enforces causal consistency between answers and explanations. Experiments on benchmark datasets demonstrate that CMCR outperforms state-of-the-art methods, achieving a 1.19% higher accuracy, a 1.05% higher grounding for explanation quality, and a 0.42% higher answer-explanation consistency.
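For readers unfamiliar with backdoor adjustment, the standard formula is P(Y | do(X = x)) = sum over z of P(Y | X = x, Z = z) P(Z = z), where Z is an observed confounder. The sketch below evaluates it on a toy discrete distribution. The probabilities and the function name `backdoor_adjust` are made up for illustration and have no connection to how CMCR applies the adjustment inside a neural model.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


# Toy distribution with a binary confounder Z (made-up numbers).
p_z = {0: 0.5, 1: 0.5}
p_y_given_xz = {
    (0, 0): 0.2, (0, 1): 0.6,  # P(Y=1 | X=0, Z=z)
    (1, 0): 0.3, (1, 1): 0.7,  # P(Y=1 | X=1, Z=z)
}

# Average causal effect of switching X from 0 to 1.
effect = backdoor_adjust(p_y_given_xz, p_z, 1) - backdoor_adjust(p_y_given_xz, p_z, 0)
print(effect)
```

The adjustment averages over the marginal distribution of Z rather than the distribution of Z given X, which is what removes the confounded path.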

YNIMG Journal 2026 Journal Article

Uniformity in happiness and uniqueness in sadness: Naturalistic emotional representation in major depression

Qingjin Liu
Xi Zhang
Jinpeng Niu
Kangjia Chen
Jie Xia
Yaohui He
Shuo Xu
Wei Li

Humans develop shared concepts of others' emotions to support adaptive social functioning, yet how these concepts are dynamically represented in major depressive disorder (MDD) during naturalistic movie viewing is not yet fully established. Using functional MRI, we examined patients with MDD (n = 55) and healthy controls (HCs; n = 62) as they freely viewed movie clips depicting happy and sad emotions. Neural similarity was quantified with inter-subject correlation at whole-brain, network, and regional levels, and its association with emotional traits was assessed using inter-subject representational similarity analysis. Compared with HCs, patients with MDD showed significantly reduced whole-brain similarity, particularly during sad contexts. Network analyses revealed that HCs exhibited increased similarity in the limbic network during sadness, reflecting a shared "sadness resonance," whereas patients with higher depressive severity showed widespread disruptions across visual, limbic, dorsal attention, and default mode networks. At the regional level, similarity in the inferior temporal gyrus and lateral occipital cortex was closely linked to individual differences in emotional awareness, with pronounced context- and region-specificity. These findings highlight neural decoupling and heterogeneity as core features of MDD and provide new evidence for potential biomarkers to inform risk assessment and personalized interventions.
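Inter-subject correlation can be computed in several ways; the sketch below uses the pairwise variant, averaging the Pearson correlation over all subject pairs for one region's time series. The abstract does not state whether a pairwise or leave-one-out scheme was used, so this choice, the function names, and the toy data are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant time series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def pairwise_isc(series):
    """Mean Pearson correlation over all pairs of subjects."""
    pairs = list(combinations(series, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy time series for three subjects in one region (made-up values).
subjects = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first subject
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with both
]
isc = pairwise_isc(subjects)
print(isc)
```

Higher values indicate that subjects' responses to the same stimulus rise and fall together; the reduced similarity reported for the MDD group corresponds to lower values of a statistic of this kind.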

AAAI Conference 2025 Conference Paper

AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction

Pufan Zou
Shijia Zhao
Weijie Huang
Qiming Xia
Chenglu Wen
Wei Li
Cheng Wang

Recently, Visual Foundation Models (VFMs) have shown a remarkable generalization performance in 3D perception tasks. However, their effectiveness in large-scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), providing cross-modal supervision with the formidable interpretive capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise Corrector (ANC), updating and adjusting the noisy samples within this supervision iteratively during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample's sensitivity to noisy supervision, preventing potential underfitting issues associated with robust loss. Our proposed AdaCo can effectively mitigate the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.

IROS Conference 2025 Conference Paper

Adaptive Sliding Window Optimization for Multi-Modal LiDAR Inertial Odometry and Mapping

Guodong Han
Wei Li
Yu Hu

Fixed-lag smoothing is widely employed as a backend in localization tasks. Generally, increasing the window length leads to better accuracy but demands more computational resources. Therefore, determining an appropriate window length and whether a fixed length should be maintained throughout the localization process are worth studying. Assuming independent and identically distributed noise based on the distance-independent characteristic of LiDAR ranging errors, we propose an uncertainty-based adaptive sliding window (ASW) strategy. Through mathematical derivation, we show that the reference uncertainty is affected by the LiDAR feature distribution of each frame. Consequently, we develop a multimodal LiDAR inertial odometry and mapping framework based on ASW, which integrates mechanical and solid-state LiDAR to enhance odometry accuracy and mapping density. By designing a joint matching module, our approach leverages the strengths of distinct scanning patterns. Additionally, we incorporate loop closure detection in the mapping process to minimize cumulative drift. Extensive experiments conducted on both public and self-collected datasets demonstrate the effectiveness of our method. Compared to the state-of-the-art method, our approach improves the average accuracy by 10.3%. We also provide an open-source implementation for further studies at https://github.com/wowhhhhgd/ASW-LIOM.

EAAI Journal 2025 Journal Article

An interpretable and physics-informed adaptive multi-branch deep learning framework for intelligent fault diagnosis of large-scale multi-row tapered roller bearings

Chao Ma
Jianliang Sun
Shuilin Lin
Wei Li
Yunfei Liu

Deep learning in heavy-industry condition monitoring is often constrained by limited interpretability, parameter redundancy, and poor robustness under variable operating conditions. To address these limitations, we propose a novel Physics-Informed Interpretable Multi-Branch Deep Learning Network (PI-MBDNet), which couples an adaptive frequency-band partitioning and weighting strategy with a vibration-mechanism-based channel-attention module. First, an adaptive scale–space spectral front end is constructed: a trainable parameter θ enables data-driven spectral partitioning, and empirical wavelet functions perform multiscale reconstruction. Next, a channel-attention module informed by vibration mechanics, together with a multi-branch convolutional network, models and extracts features from horizontal, vertical, and axial vibration signals separately. Furthermore, a label-aware dynamic classifier is introduced; during training, a sensitivity matrix provides fine-grained loss weighting according to each channel's responsiveness to specific fault types, thereby improving both diagnostic accuracy and engineering interpretability. By decoupling training and inference, the framework also achieves a substantial gain in inference speed after deployment, meeting real-time on-site requirements. PI-MBDNet has been stably deployed on a large steel production line, where real-time health monitoring of multi-row bearings has reduced unnecessary maintenance and downtime, improving product quality consistency and maintenance efficiency. By bridging physical interpretability with engineering practicality, this work provides a deployable pathway to trustworthy, real-time fault diagnosis under complex operating conditions.

ICLR Conference 2025 Conference Paper

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
Marybeth Fair
Alice Li
William E. Bishop

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device’s system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at https://github.com/google-research/android_world.

YNICL Journal 2025 Journal Article

Association of valproate use and hippocampal atrophy in idiopathic generalized epilepsy

Xiang Huang
Yingying Zhang
Qiuxing Lin
Kailing Huang
Yuming Li
Peiwen Liu
Danyang Cao
Wenhao Li

OBJECTIVE: Recent studies revealed the effect of valproate (VPA) on brain structural changes in idiopathic generalized epilepsy (IGE). We aimed to investigate the volume of the entire hippocampus and its subfields in patients with IGE, and to explore their associations with VPA use. METHODS: A total of 211 patients with IGE and 97 healthy controls (HCs) were enrolled in this study. All participants underwent T1-weighted imaging. Each hippocampus was segmented into seven subfields using HippUnfold. The volumes of bilateral hippocampi and each hippocampal subfield were evaluated. Spearman correlation analyses were performed to identify VPA-use-related abnormalities in IGE. Subgroup analyses for juvenile myoclonic epilepsy (JME), epilepsy with generalized tonic-clonic seizures alone (GTCA), and absence epilepsy (AE) were conducted. RESULTS: The volumes of bilateral hippocampi were reduced in IGE compared with HCs. Subgroup analysis showed significant volume reductions in the right hippocampus and its subfields in GTCA. Additionally, significant volume reductions were detected in bilateral hippocampal volumes and subfields in IGE patients currently taking VPA compared with HCs. A negative correlation was observed between the left CA2 volume and the age of onset. CONCLUSIONS: Our study revealed volume reductions in bilateral hippocampi in IGE, as well as in the right hippocampus and its subfields in GTCA. Abnormalities in both subfields and the whole hippocampus were associated with VPA use. These findings suggest that VPA may have more extensive neuroanatomical effects in IGE, potentially accounting for the heterogeneity observed in neuroimaging studies.

AAAI Conference 2025 Conference Paper

Breaking Information Isolation: Accelerating MRI via Inter-sequence Mapping and Progressive Masking

Jianwei Zheng
Xiaomin Yao
Guojiang Shen
Wei Li
Jiawei Jiang

Deep unfolding network (DUN) has shed new light on multi-sequence MRI reconstruction, providing both high interpretability and acceptable performance. However, current approaches still suffer from the plight of information isolation, i.e., learning the features of multiple sequences individually and leaving the mask detached from model updating. In this work, we propose a new unfolding solution, namely Information-coupled MRI Acceleration (IMA), to address the isolation issue. Concretely, two specific mechanisms are presented. On the one hand, the latent connections across different sequences are explicitly modeled via two auxiliary matrices. While the first matrix is meticulously engineered to assemble the spatial details, the second one aims at capturing the depth information conditioned on the enriched channels. On the other hand, following a deep analysis of the non-uniform distribution in low- and high-frequency components of the given mask, we elaborate a new unfolding flow using a progressive masking scheme, featuring a dilation-contraction mechanism during forward propagation of successive stages. Extensive experiments are conducted under various sampling patterns and acceleration rates, whose results demonstrate that, without any sophisticated architectures, our IMA outperforms the current cutting-edge methods both visually and numerically.

IROS Conference 2025 Conference Paper

CA2Point: Learning Keypoint Detection and Description with Context Aggregation and Cross Augmentation

Xuebin Meng
Wei Li
Yu Hu
Yinhe Han

Keypoint detection and description are fundamental tasks for a variety of computer vision applications. Due to the limited receptive field of convolutional neural networks, most existing methods based on deep learning mainly focus on local features instead of taking into account the global context of the entire image. The purpose of this work is to enhance the detection and description of keypoints by leveraging global information obtained from a Transformer, and to boost the consistency between keypoints and descriptors through their interaction. Specifically, these two improvements are implemented through the Local & Global Context Aggregation (LGCA) Module and the Point & Descriptor Cross Augmentation (PDCA) Module, respectively. The LGCA module, which can model long-range context, is inserted into a Feature Pyramid Network (FPN) to extract features with diverse scales and different receptive fields. Moreover, the PDCA module enhances descriptors with the geometry information of detected keypoints, while enhancing the keypoint detection process with the position coordinates of correctly matched descriptors. Finally, we design a lightweight model to improve the running efficiency. Extensive experiments on various tasks demonstrate that our method achieves a substantial performance improvement over current feature extraction methods. Code is available at: https://github.com/meng152634/CA2Point.

AAAI Conference 2025 Conference Paper

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Junxian Li
Di Zhang
Xunzhi Wang
Zeying Hao
Jingdi Lei
Qian Tan
Cai Zhou
Wei Liu

Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an open-source chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks.

JBHI Journal 2025 Journal Article

Closed-Loop Respiratory Intervention Enhances Sleep Ventilation and Oxygen Saturation in Healthy Participants With Rapid High-Altitude Exposure

Yilin Yang
Wei Li
Hanyu Chen
Xiaochen Wang
Linhong Ji
Boda Zhou
Chong Li

Individuals who are rapidly exposed to high-altitude environments are at risk of developing acute mountain sickness, which can inhibit the respiratory center or cause upper airway obstruction, leading to sleep apnea (SA). SA reduces oxygen saturation (SPO2) during sleep, which not only impairs sleep quality but also affects cognitive and memory function. Positive airway pressure ventilation helps alleviate SA, but existing devices are prone to failure at high altitudes and are unable to realize real-time intervention based on the user's physiological parameters. In this paper, we propose a respiratory ventilation system which addresses the issue of equipment failure at high altitudes through the implementation of an atmospheric pressure compensation algorithm. Additionally, we have developed a closed-loop algorithm that adjusts the inhalation and exhalation pressure based on the user's SPO2 during sleep. Experimental evaluations were conducted at an altitude of 3650 m, where participants were randomly assigned to receive closed-loop respiratory intervention, bi-level positive airway pressure (Bi-PAP) ventilation, and sham stimulation on three days. Heart rate (HR), SPO2, tidal volume (VT), respiratory rate (Rf) and sleep parameters were collected, and sleep quality was assessed. Experimental results showed that participants experienced a 26.3% increase in ventilation (p = 0.004, 0.002, 0.003, respectively), an 8% increase in SPO2 (p < 0.001 on three days), a reduction in apnea events, and an enhancement in deep sleep duration and sleep stability. These findings demonstrate that the incorporation of the closed-loop algorithm has significantly enhanced the system's effectiveness, offering a novel solution for addressing sleep apnea in high-altitude environments.

NeurIPS Conference 2025 Conference Paper

CogVLA: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification

Wei Li
Renshan Zhang
Rui Shao
Jie He
Liqiang Nie

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. Existing sparsification strategies, such as Mixture-of-Depths, layer skipping, and early exit, fall short by neglecting the semantic coupling across vision-language-action modalities, and focusing narrowly on intra-LLM computation while overlooking end-to-end coherence from perception to control. To address these challenges, we propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V‑L‑A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5× and decreasing inference latency by 2.8× compared to OpenVLA.

ICML Conference 2025 Conference Paper

Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation

Zhan Zhuang
Xiequn Wang
Wei Li
Yulong Zhang 0005
Qiushi Huang
Shuhao Chen
Xuehao Wang
Yanbin Wei

Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters’ activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter’s marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at https://github.com/zwebzone/coto.
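
The idea of raising an adapter activation probability during fine-tuning can be illustrated with a small sketch. This is not the authors' implementation: the linear ramp, the start and end probabilities, and the function names `activation_probability` and `sample_active_adapters` are all assumptions made for illustration, since the abstract only states that the probability increases gradually.

```python
import random


def activation_probability(step, total_steps, p_start=0.0, p_end=1.0):
    """Hypothetical linear ramp of the adapter activation probability.

    The abstract does not specify the schedule shape; a linear ramp from
    p_start to p_end is assumed here purely for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so out-of-range steps stay valid.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def sample_active_adapters(num_adapters, step, total_steps, rng):
    """Draw one independent on/off decision per adapter for this step."""
    p = activation_probability(step, total_steps)
    return [rng.random() < p for _ in range(num_adapters)]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 750, 1000):
        mask = sample_active_adapters(8, step, 1000, rng)
        print(step, activation_probability(step, 1000), sum(mask))
```

Under these assumptions, early steps train with most adapters switched off and the final steps train with all of them active, which matches the progressive behavior the abstract describes.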

NeurIPS Conference 2025 Conference Paper

Controllable Human-centric Keyframe Interpolation with Generative Prior

Zujin Guo
Size Wu
Zhongang Cai
Wei Li
Chen Change Loy

Existing interpolation methods use pre‑trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D‑informed control model, features a novel SMPL‑X encoder that encodes and aggregates 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL‑X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

AAAI Conference 2025 Conference Paper

CoPEFT: Fast Adaptation Framework for Multi-Agent Collaborative Perception with Parameter-Efficient Fine-Tuning

Quanmin Wei
Penglin Dai
Wei Li
Bingyi Liu
Xiao Wu

Multi-agent collaborative perception is expected to significantly improve perception performance by overcoming the limitations of single-agent perception through exchanging complementary information. However, training a robust collaborative perception model requires collecting sufficient training data that covers all possible collaboration scenarios, which is impractical due to intolerable deployment costs. Hence, the trained model is not robust against new traffic scenarios with inconsistent data distribution, which fundamentally restricts its real-world applicability. Further, existing methods, such as domain adaptation, have mitigated this issue by exposing the deployment data during the training stage but incur a high training cost, which is infeasible for resource-constrained agents. In this paper, we propose a Parameter-Efficient Fine-Tuning-based lightweight framework, CoPEFT, for quickly adapting a trained collaborative perception model to new deployment environments under low-cost conditions. CoPEFT develops a Collaboration Adapter and Agent Prompt to perform macro-level and micro-level adaptations separately. Specifically, the Collaboration Adapter utilizes the inherent knowledge from training data and limited deployment data to adapt the feature map to the new data distribution. The Agent Prompt further enhances the Collaboration Adapter by inserting fine-grained contextual information about the environment. Extensive experiments demonstrate that our CoPEFT surpasses existing methods with less than 1% trainable parameters, proving the effectiveness and efficiency of our proposed method.

AAAI Conference 2025 Conference Paper

Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration

Yunshuai Zhou
Junbo Qiao
Jincheng Liao
Wei Li
Simiao Li
Jiao Xie
Yunhang Shen
Jie Hu

Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during the distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type loss struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space using contrastive learning. Additionally, we also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms the state-of-the-art KD methods across various image restoration tasks and backbones.

JBHI Journal 2025 Journal Article

Enhancing and Shaping Closed-Loop Co-Adaptive Myoelectric Interfaces With Scenario-Guided Adaptive Incremental Learning

Wei Li
Jiang Shao
Ping Shi
Sujiao Li
Hongliu Yu

Virtual environments have been employed in the myoelectric prosthetics field as effective training and assessment tools to enhance intrinsic motivation, thereby encouraging sustained engagement in neuromuscular rehabilitation. However, motivating amputees to maintain consistent participation and perseverance in long-term training remains a critical challenge. To address this, we propose a scenario-guided adaptive incremental learning strategy that leverages contextual information in unknown environments to improve pseudo-label prediction accuracy. This strategy integrates two core components: an Augmented Reality (AR) environment and a Multimodal Progressive Domain Adversarial Neural Network (MPDANN). AR enables amputees to perform virtual prosthesis control and holographic object manipulation tasks in realistic, interactive scenarios, bridging the gap between laboratory training and daily-life usability. MPDANN employs dual-domain classifiers through domain adversarial training, utilizing surface electromyography (sEMG) and inertial measurement unit (IMU) data to facilitate knowledge transfer across multi-source domains and achieve robust adaptation to unseen environments. A total of 16 non-disabled subjects and 2 amputee subjects completed a 5-day assessment protocol involving 10 holographic object manipulation tasks under 8 limb position conditions, using either a convolutional neural network (CNN) or MPDANN. Experimental results showed that non-disabled subjects using MPDANN achieved a 10% relative increase in average completion rate compared to the CNN baseline, reaching over 80% proficiency. While amputee subjects exhibited lower average completion rates than non-disabled subjects on the final day, the MPDANN strategy still demonstrated consistent performance improvements across both groups. This study substantiates the efficacy of integrating real-time visual feedback with a closed-loop domain adaptation algorithm, thereby enhancing sEMG recognition performance in untrained environments.

EAAI Journal 2025 Journal Article

Erratum to “Underwater target pose Recognition: A deep learning approach based on sonar signals” [Eng. Appl. Artif. Intell. (156), Part B, 15 September 2025, 111309]

Jikai Yang
Ziyan Gu
Peijun Li
Zihan Li
Wei Li

EAAI Journal 2025 Journal Article

Explainable time series features for hard disk drive failure prediction

Wei Li
Haozhou Zhou
Srinivasan Radhakrishnan
Sagar Kamarthi

Reliable data storage is crucial for industry digitalization and cloud infrastructure. To prevent data loss and improve maintenance efficiency in data centers, timely replacement of Hard Disk Drives (HDDs) is critical. HDDs are equipped with Self-Monitoring, Analysis, and Reporting Technology (SMART) to track key performance indicators with empirically established thresholds. In recent years, various machine learning models have utilized SMART time series data for early HDD failure prediction. However, decision-makers need greater transparency and explainability to trust and implement these data-driven models. In this work, we proposed a framework that extracts explainable features from the SMART time series and visualizes the feature impact on short-term HDD failure prediction using SHapley Additive exPlanations (SHAP) analysis. We trained an eXtreme Gradient Boosting (XGBoost) model with information-rich features and evaluated the failure detection rate and false alarm rate. We demonstrated the effectiveness of the proposed approach on Backblaze data for Quarter 1 and Quarter 2 of 2022. The model provided a 74.7% failure detection rate with only a 0.73% false alarm rate on the test data from Quarter 3 of 2022, outperforming an existing explainable model benchmark of a 54.68% failure detection rate and an 11.85% false alarm rate. In addition, the sensitivity analysis optimizes the signal length and the lead time to improve prediction accuracy and inform predictive maintenance policies. The results demonstrate the potential of the proposed framework for effective HDD failure prediction with explainable features. The proposed framework is also applicable to other sensor-based industrial equipment monitoring applications.
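
The two reported metrics can be made concrete with a short sketch. The abstract does not define them, so the definitions below are assumptions based on common usage: failure detection rate as the share of truly failed drives that were flagged, and false alarm rate as the share of healthy drives that were flagged. The function name and the toy labels are hypothetical.

```python
def detection_and_false_alarm_rates(y_true, y_pred):
    """Return (failure detection rate, false alarm rate).

    Assumed definitions (not taken from the paper):
      failure detection rate = TP / (TP + FN)
      false alarm rate       = FP / (FP + TN)
    Labels use 1 for a failed drive and 0 for a healthy drive.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Guard against empty classes so the function never divides by zero.
    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far


if __name__ == "__main__":
    # Toy data: 4 failed drives (3 caught), 10 healthy drives (1 false alarm).
    y_true = [1, 1, 1, 1] + [0] * 10
    y_pred = [1, 1, 1, 0] + [1] + [0] * 9
    print(detection_and_false_alarm_rates(y_true, y_pred))
```

Because healthy drives vastly outnumber failed ones in data-center logs, reporting the two rates separately is more informative than a single accuracy figure.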

AAAI Conference 2025 Conference Paper

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Yirui Chen
Xudong Huang
Quan Zhang
Wei Li
Mingjian Zhu
Qiangyu Yan
Simiao Li
Hanting Chen

The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation makes the IMDL task unattainable. In this paper, we build a local manipulation data generation pipeline that integrates the powerful capabilities of SAM, LLM, and generative models. Upon this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale: GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content: GIM encompasses a broad range of image classes. 3) Diverse generative manipulation: the images are manipulated with state-of-the-art generators across various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce the GIM benchmark with two settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on the GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.

AAAI Conference 2025 Conference Paper

GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expressions

Ziqi Zhou
Weize Quan
Hailin Shi
Wei Li
Lili Wang
Dong-Ming Yan

Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

ICRA Conference 2025 Conference Paper

Hierarchical Visual Policy Learning for Long-Horizon Robot Manipulation in Densely Cluttered Scenes

Hecheng Wang
Lizhe Qi
Ziheng Wang
Jiankun Ren
Wei Li
Yunquan Sun

In this work, we focus on addressing long-horizon packing tasks in densely cluttered scenes. Such tasks require policies to effectively manage severe occlusions among objects and continually produce precise actions based on visual observations. We propose a vision-based Hierarchical policy for Cluttered-scene Long-horizon Manipulation (HCLM). It employs a high-level policy and three options to select and instantiate three parameterized action primitives: push, pick, and place. We first train the two-stream pick and place options by behavior cloning (BC). Subsequently, we use hierarchical reinforcement learning (HRL) to train the high-level policy and push option. During HRL, we propose a Spatially Extended Q-update (SEQ) to augment the updates for the push option and a Two-Stage Update Scheme (TSUS) to alleviate the non-stationary transition problem in updating the high-level policy. We demonstrate that HCLM significantly outperforms baseline methods in terms of success rate and efficiency in diverse tasks, both in simulation and in the real world. The ablation studies also validate the key roles of SEQ and TSUS in HRL.

EAAI Journal 2025 Journal Article

HK-MOEA/D: A historical knowledge-guided resource allocation for decomposition multiobjective optimization

Wei Li
Xiaolong Zeng
Ying Huang
Yiu-ming Cheung

Decomposition-based multiobjective evolutionary algorithms are one of the prevailing algorithmic frameworks for multiobjective optimization. This framework distributes the same amount of evolutionary computing resources to each subproblem, but it ignores the variable contributions of different subproblems to the population during the evolution. Resource allocation (RA) strategies have been proposed to dynamically allocate appropriate evolutionary computational resources to different subproblems, with the aim of addressing this limitation. However, the majority of RA strategies result in inefficiencies and mistakes when performing subproblem assessment, thus generating unsuitable algorithmic results. To address this problem, this paper proposes a decomposition-based multiobjective evolutionary algorithm (HK-MOEA/D). The HK-MOEA/D algorithm uses a historical knowledge-guided RA strategy to evaluate the subproblem’s evolvability, allocate evolutionary computational resources based on the evaluation value, and adaptively select genetic operators based on the evaluation value to either help the subproblem converge or move away from a local optimum. Additionally, the density-first individual selection mechanism of the external archive is utilized to improve the diversity of the algorithm. An external archive update mechanism based on θ-dominance is also used to store solutions that are truly worth keeping to guide the evaluation of subproblem evolvability. The efficacy of the proposed algorithm is evaluated by comparing it with seven state-of-the-art algorithms on three types of benchmark functions and three types of real-world application problems. The experimental results show that HK-MOEA/D accurately evaluates the evolvability of the subproblems and displays reliable performance in a variety of complex Pareto front optimization problems.

NeurIPS Conference 2025 Conference Paper

Hybrid Boundary Physics-Informed Neural Networks for Solving Navier-Stokes Equations with Complex Boundary

ChuYu Zhou
Tianyu Li
Chenxi Lan
Rongyu Du
Guoguo Xin
Wei Li
Guoqing Wang
Xun Liu

Physics-informed neural networks (PINN) have achieved notable success in solving partial differential equations (PDE), yet solving the Navier-Stokes equations (NSE) with complex boundary conditions remains a challenging task. In this paper, we introduce a novel Hybrid Boundary PINN (HB-PINN) method that combines a pretrained network for efficient initialization with a boundary-constrained mechanism. The HB-PINN method features a primary network focused on inner domain points and a distance metric network that enhances predictions at the boundaries, ensuring accurate solutions for both boundary and interior regions. Comprehensive experiments have been conducted on the NSE under complex boundary conditions, including the 2D cylinder wake flow and the 2D blocked cavity flow with a segmented inlet. The proposed method achieves state-of-the-art (SOTA) performance on these benchmark scenarios, demonstrating significantly improved accuracy over existing PINN-based approaches.

NeurIPS Conference 2025 Conference Paper

Hyper-Modality Enhancement for Multimodal Sentiment Analysis with Missing Modalities

Yan Zhuang
Minhao Liu
Wei Bai
Yanru Zhang
Wei Li
Jiawen Deng
Fuji Ren

Multimodal Sentiment Analysis (MSA) aims to infer human emotions by integrating complementary signals from diverse modalities. However, in real-world scenarios, missing modalities are common due to data corruption, sensor failure, or privacy concerns, which can significantly degrade model performance. To tackle this challenge, we propose Hyper-Modality Enhancement (HME), a novel framework that avoids explicit modality reconstruction by enriching each observed modality with semantically relevant cues retrieved from other samples. This cross-sample enhancement reduces reliance on fully observed data during training, making the method better suited to scenarios with inherently incomplete inputs. In addition, we introduce an uncertainty-aware fusion mechanism that adaptively balances original and enriched representations to improve robustness. Extensive experiments on three public benchmarks show that HME consistently outperforms state-of-the-art methods under various missing modality conditions, demonstrating its practicality in real-world MSA applications.

JBHI Journal 2025 Journal Article

Innovative Dual-Decoupling CNN With Layer-Wise Temporal-Spatial Attention for Sensor-Based Human Activity Recognition

Qi Teng
Wei Li
Guangwei Hu
Yuanyuan Shu
Yun Liu

Human Activity Recognition (HAR) is essential for monitoring and analyzing human behavior, particularly in health applications such as fall detection and chronic disease management. Traditional methods, even those incorporating attention mechanisms, often oversimplify the complex temporal and spatial dependencies in sensor data by processing features uniformly, leading to inadequate modeling of high-dimensional interactions. To address these limitations, we propose a novel framework: the Temporal-Spatial Feature Decoupling Unit with Layer-wise Training Convolutional Neural Network (CNN-TSFDU-LW). Our model enhances HAR accuracy by decoupling temporal and spatial dependencies, facilitating more precise feature extraction and reducing computational overhead. The TSFDU mechanism enables parallel processing of temporal and spatial features, thereby enriching the learned representations. Furthermore, layer-wise training with a local error function allows for independent updates of each CNN layer, reducing the number of parameters and improving memory efficiency without compromising performance. Experiments on four benchmark datasets (UCI-HAR, PAMAP2, UNIMIB-SHAR, and USC-HAD) demonstrate accuracy improvements ranging from 0.9% to 4.19% over state-of-the-art methods while simultaneously reducing computational complexity. Specifically, our framework achieves accuracy rates of 97.90% on UCI-HAR, 94.34% on PAMAP2, 78.90% on UNIMIB-SHAR, and 94.71% on USC-HAD, underscoring its effectiveness in complex HAR tasks. In conclusion, the CNN-TSFDU-LW framework represents a significant advancement in sensor-based HAR, delivering both improved accuracy and computational efficiency, with promising potential for enhancing health monitoring applications.

AAAI Conference 2025 Conference Paper

ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency

Yang Ren
Hai Jiang
Menglong Yang
Wei Li
Shuaicheng Liu

RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually.

EAAI Journal 2025 Journal Article

Key region Semantic information Augmented Transformer for Image Captioning

Fuyun Deng
Wei Li
Zhixin Li

Existing image captioning models often face difficulties in capturing inter-object relationships and generating descriptions that reflect a comprehensive understanding of the entire image content, either relying on object detectors that overlook contextual information or depending on grid features that fail to adequately model spatial interactions. This paper proposes two solutions to these challenges. The first is the introduction of a module for mining semantic information from key regions. Based on the spatial proximity and high co-occurrence between objects, this module identifies the public region covered by these objects as a key region, mines their semantic information, and incorporates it into the modeling process, which compensates for the limitations of grid features. Second, we improve the standard Transformer decoder’s architecture by introducing an adaptive gating mechanism that dynamically adjusts the alignment between textual and visual features, enhancing the model’s overall comprehension of the image. To validate our approach, we applied these modules to the Transformer framework and proposed a novel method for image captioning, called Key region Semantic information Augmented Transformer (KSAT) for Image Captioning. Extensive experiments on benchmark datasets show that the proposed method outperforms many models. Specifically, our method achieves a score of 139.6% on the offline test, and 138.4% on the official online test server on the Consensus-based Image Description Evaluation (CIDEr) metric. In qualitative evaluation, our method also outperforms other methods at generating captions for complex scenes. Overall, these results confirm the validity of our method and advance the field of artificial intelligence.

NeurIPS Conference 2025 Conference Paper

Learning to Control Free-Form Soft Swimmers

Changyu Hu
Yanke Qu
Qiuan Yang
Xiaoyu Xiong
Kui Wu
Wei Li
Tao Du

Swimming in nature achieves remarkable performance through diverse morphological adaptations and intricate solid-fluid interaction, yet exploring this capability in artificial soft swimmers remains challenging due to the high-dimensional control complexity and the computational cost of resolving hydrodynamic details. Traditional approaches often rely on morphology-dependent heuristics and simplified fluid models, which constrain exploration and preclude advanced strategies like vortex exploitation. To address this, we propose an automated framework that combines a unified, reduced-mode control space with a high-fidelity GPU-accelerated simulator. Our control space naturally captures deformation patterns for diverse morphologies, minimizing manual design, while our simulator efficiently resolves the crucial fluid-structure interactions required for learning. We evaluate our method on a wide range of morphologies, from bio-inspired to unconventional. From this general framework, high-performance swimming patterns emerge that qualitatively reproduce canonical gaits observed in nature without requiring domain-specific priors, where state-of-the-art baselines often fail, particularly on complex topologies like a torus. Our work lays a foundation for future opportunities in automated co-design of soft robots in complex hydrodynamic environments. The code is available at https://github.com/changyu-hu/FreeFlow.

TMLR Journal 2025 Journal Article

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang
Jinming Wu
Wei Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

JBHI Journal 2025 Journal Article

Medical Hyperspectral Image Feature Selection Framework Using Functional Data Analysis: Application to Membranous Nephropathy Pathologic Diagnosis

Meng Lv
Shiyu Liu
Xiaoying Ma
Yue Yang
Haihao Zhang
Wei Li

To address the core issue of high-dimensional data processing in hyperspectral pathological diagnosis, we develop a new feature selection framework using functional data analysis (FSFDA). The framework models pixel spectra as continuous functions to preserve spectral continuity, overcoming the limitations of traditional discrete representations. Based on functional features, an innovative adaptive spectral segmentation strategy driven by functional change rate is developed to achieve optimal segmentation in the feature space. Additionally, a multi-criteria scoring mechanism including supervised (FSFDA-S) and unsupervised (FSFDA-U) paradigms is developed to enhance feature diagnostic discriminability while maintaining sparsity. Experimental results on the pathological hyperspectral image dataset of membranous nephropathy validate that the proposed method achieves over 99% classification accuracy while reducing feature dimensions by 94.5%. For cross-modal data involving in-vivo human brain and white blood cells, FSFDA effectively identifies diagnostic bands aligned with histopathological signatures, verifying its adaptive feature selection ability and cross-sample generalization performance.

NeurIPS Conference 2025 Conference Paper

MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Zonglin Yang
Wanhao Liu
Ben Gao
Yujie Liu
Wei Li
Tong Xie
Lidong Bing
Wanli Ouyang

Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the new task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring, thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent literature show that our method consistently outperforms strong baselines.

AAAI Conference 2025 Conference Paper

NaFV-Net: An Adversarial Four-view Network for Mammogram Classification

Feng Lu
Yuxiang Hou
Wei Li
Xiangying Yang
Haibo Zheng
Wenxi Luo
Leqing Chen
Yuyang Cao

Breast cancer remains a leading cause of mortality among women, with millions of new cases diagnosed annually. Early detection through screening is crucial. Using neural networks to improve the accuracy of breast cancer screening has become increasingly important. In accordance with radiologists' practices, we proposed using images from the unaffected side to create adversarial samples with critical medical implications in our adversarial learning process. By introducing beneficial perturbations, this method aims to reduce overconfidence and improve the precision and robustness of breast cancer classification. Our proposed framework is an adversarial quadruple-view classification network (NaFV-Net) incorporating images from both affected and unaffected perspectives. By comprehensively capturing local and global information and implementing adversarial learning from four mammography views, this framework allows for the fusion of features and the integration of medical principles and radiologist evaluation techniques, thus facilitating the accurate identification and characterization of breast tissues. Extensive experiments have shown the high effectiveness of our model in accurately distinguishing between benign and malignant findings, demonstrating state-of-the-art classification performance on both internal and public datasets.

EAAI Journal 2025 Journal Article

Scenario-based improved bayesian network model for construction safety assessment

Zelong Lin
Dewei Kong
Wei Li
W.M. Edmund Loh
C.J. Wong
Zhijian Sun
Wei He

Assessing the safety of construction projects amidst potential accidents is crucial, especially with the rapid growth of modern infrastructure, which increases the risk of incidents. Existing safety assessment methods often fall short due to uncertainties and a lack of reliable analyses. To address this, a new method based on Scenario Theory (ST) and an improved Bayesian Network (BN) is proposed. This approach uses multi-source data to create a construction accident ontology and safety assessment paradigm. It refines the traditional BN algorithm to establish a construction accident BN and develops task and capability assessment functions. The method's effectiveness is validated through simulations and a real construction project in Tianjin, China. It improves safety by assessing engineering projects' emergency management capabilities in the face of accidents, benefiting construction firms and government agencies.

NeurIPS Conference 2025 Conference Paper

Solving Partial Differential Equations via Radon Neural Operator

Wenbin Lu
Yihan Chen
Junnan Xu
Wei Li
Junwei Zhu
Jianwei Zheng

Neural operator is considered a popular data-driven alternative to traditional partial differential equation (PDE) solvers. However, most current solutions, whether fulfilling computations in frequency, Laplacian, and wavelet domains, all deviate far from the intrinsic PDE space. While with meticulous network architecture elaborated, the deviation often leads to biased accuracy. To address the issue, we open a new avenue that pioneers leveraging Radon transform to decompose the input space, finalizing a novel Radon neural operator (RNO) to solve PDEs in infinite-dimensional function space. Distinct from previous solutions, we project the input data into the sinogram domain, shrinking the multi-dimensional transformations to a reduced-dimensional counterpart and fitting compactly with the PDE space. Theoretically, we prove that RNO obeys a property of bilipschitz strongly monotonicity under diffeomorphism, providing deeper insights to guarantee the desired accuracy than typical discrete invariance or continuous-discrete equivalence. Within the sinogram domain, we further evidence that different angles contribute unequally to the overall space, thus engineering a reweighting technique to enable more effective PDE solutions. On that basis, a sinogram-domain convolutional layer is crafted, which operates on a fixed $\theta$-grid that is decoupled from the PDE space, further enjoying a natural guarantee of discrete invariance. Extensive experiments demonstrate that RNO sets new state-of-the-art (SOTA) scores across massive standard benchmarks, with superior generalization performance enjoyed. Code is available at.

EAAI Journal 2025 Journal Article

Stochastic reliability optimization of a controlled memristor-based Van der Pol circuit using a new intelligent algorithm

Wei Li
Mingzhi Lin
Junfeng Zhao
Drazan Kozak

Enhancing the reliability of structural dynamical systems under random loads via controllers is essential for maintaining system stability, resilience, performance, and safety. The challenge lies in optimally designing the controller while determining reliability probability. To address this issue, our study develops a new intelligent algorithm for optimal reliability analysis on a controlled memristor-based Van der Pol circuit under stochastic excitation. This algorithm integrates Gaussian Radial Basis Function Neural Network with Genetic Algorithm, the reliability function with control parameters serves as objective function, while the reliability probability equation acts as constraint conditions, meanwhile the fitness function is derived from the neural network’s solution to the governing equation. Our algorithm achieves optimal reliability results and overcomes the challenge of simultaneously optimizing an implicit objective and solving the reliability governing equation. The sensitivity of key parameters such as population size, maximum iteration to reliability performance are discussed, respectively. The effectiveness of the proposed algorithm is numerically compared with Monte-Carlo simulation and finite difference method. The control strategy in this work establishes a theoretical foundation for systems exhibiting random vibration and probabilistic fatigue in structural engineering, and holds promising potential for real-world application.

YNIMG Journal 2025 Journal Article

The effective neural connections in food inhibitory control and their relationship with daily eating behavior in individuals with overweight/obesity or normal-weight

Yong Liu
Mingyue Xiao
Yatong Guo
Pan Shi
Yazhi Pang
Wei Li
Ximei Chen
Jia Zhao

This study investigates the differences in effective neural connections during food inhibitory control between individuals with overweight/obesity (OW/OB) and those with normal weight (NW), and examines how these neural differences relate to daily eating behaviors. Fifty-one female participants were classified into OW/OB (BMI ≥ 25 kg/m²) or NW (BMI 18-22 kg/m²) groups. Participants completed a modified food-specific go/no-go task with working memory load during fMRI scanning. Neural connectivity was analyzed using dynamic causal modelling (DCM). Ecological momentary assessment (EMA) was used to collect real-time data on eating behaviors over one week. The OW/OB group showed lower accuracy in responding to low-calorie food cues and greater activation in the left hippocampus during no-go trials with high-calorie foods. DCM revealed stronger excitatory connectivity from the right inferior frontal gyrus (IFG) to the medial prefrontal cortex (mPFC), and stronger inhibitory connectivity from the mPFC to the dorsal caudate, as well as from the dorsal caudate to the left hippocampus in the OW/OB group. EMA results indicated that the OW/OB group was more likely to succumb to food desires between 13:00 and 17:00. Mediation analysis confirmed that effective connectivity mediated the relationship between task performance and daily eating behaviors. These findings elucidate the neural mechanisms underlying food inhibitory control in OW/OB individuals, highlighting the role of the hippocampus and the IFG-mPFC circuit. The study provides theoretical advances within the dual-system framework and suggests that targeting these neural pathways may improve dietary control in obesity.

IJCAI Conference 2025 Conference Paper

The Role of Video Generation in Enhancing Data-Limited Action Understanding

Wei Li
Dezhao Luo
Dongbao Yang
Zhenhang Li
Weiping Wang
Yu Zhou

Video action understanding tasks in real-world scenarios often suffer from data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that leverages a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We proposed the Information Enhancement Strategy and the Uncertainty-Based Soft Target tailored to generated-sample training. Through quantitative and qualitative analyses, we discovered that real samples generally contain a richer level of information compared to generated samples. Based on this observation, the information enhancement strategy was designed to enhance the informational content of the generated samples from two perspectives: the environment and the character. Furthermore, we observed that a portion of low-quality generated samples might negatively affect model training. To address this, we devised an uncertainty-based label-smoothing strategy to increase the smoothing of these low-quality samples, thereby reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets and five tasks, and achieve state-of-the-art performance for zero-shot action recognition.
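The uncertainty-based label smoothing described in this abstract can be illustrated with a minimal sketch. The abstract does not give the mapping from uncertainty to smoothing strength, so the linear rule below, the `max_smoothing` parameter, and the function name `soft_target` are all assumptions made for illustration, not the authors' formulation.

```python
def soft_target(label, num_classes, uncertainty, max_smoothing=0.5):
    """Build a soft target whose smoothing grows with sample uncertainty.

    Assumption: uncertainty lies in [0, 1] and maps linearly to the
    smoothing strength eps. The paper may use a different rule.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    eps = max_smoothing * uncertainty
    # Spread eps uniformly over all classes, keep 1 - eps on the labeled class.
    target = [eps / num_classes] * num_classes
    target[label] += 1.0 - eps
    return target


# A confident sample keeps a one-hot target; an uncertain one is smoothed.
confident = soft_target(label=1, num_classes=4, uncertainty=0.0)
uncertain = soft_target(label=1, num_classes=4, uncertainty=1.0)
```

Under this rule a low-quality (high-uncertainty) generated sample contributes a flatter target, which weakens its pull on the classifier during training.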

NeurIPS Conference 2025 Conference Paper

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei
Wei Li
Feifan Song
Wen Luo
Tianyi Zhuang
Haochen Tan
Zhijiang Guo
Houfeng Wang

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TimE, designed for temporal reasoning in real-world scenarios. TimE consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TimE-Wiki, TimE-News, and TimE-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TimE-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.

EAAI Journal 2025 Journal Article

Underwater target pose Recognition: A deep learning approach based on sonar signals

Jikai Yang
Ziyan Gu
Peijun Li
Zihan Li
Wei Li

Underwater target pose recognition has become a significant research focus in ocean exploration, resource investigation, and military applications. Traditional methods based on physical models and rule-based matching struggle with noise interference and dynamic underwater conditions. In this study, we propose an artificial intelligence-based approach, employing a multi-task deep learning model to enhance underwater target pose estimation. A synthetic sonar frequency response dataset was generated by simulating the backscattering characteristics of ellipsoidal targets under various incident angles. A multi-layer neural network was designed to simultaneously perform ellipsoid ratio classification and incidence angle estimation, utilizing a shared feature extraction framework for joint classification and regression learning. Experimental results demonstrate that the proposed model achieves a classification accuracy of 100 % under standard conditions and a mean absolute error (MAE) of 0.0595° in angle estimation. Even under significant noise interference (10 % noise added), the model maintains a classification accuracy of 99.5 % and an MAE of 0.3805°. In extreme conditions with high noise and strong signal attenuation, the model achieves 99 % classification accuracy and an MAE of 0.4328°, demonstrating its robustness and adaptability to complex underwater environments. These findings demonstrate that deep learning serves as a robust alternative to traditional physics-based modeling, significantly enhancing the precision and reliability of underwater target recognition. Future research will integrate real-world sonar data and explore advanced AI architectures such as convolutional neural networks (CNNs) and Transformers for enhanced feature extraction and generalization.

NeurIPS Conference 2025 Conference Paper

Uni-LoRA: One Vector is All You Need

Kaiyang Li
Shaobo Han
Qing Su
Wei Li
Zhipeng Cai
Shihao Ji

Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space R^D, can be reconstructed through a projection from a subspace R^d, with d << D. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, P ∈ R^{D×d}. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM -- making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.
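To make the projection view in this abstract concrete, the sketch below reconstructs a flattened LoRA parameter vector in R^D from a single trainable vector in R^d through an isometric projection P. The particular construction (each row of P has one nonzero entry, scaled so that the columns are orthonormal) is one simple isometric choice assumed here for illustration; the abstract does not specify the paper's matrix, and all function names are hypothetical.

```python
import math
import random


def make_projection(D, d, seed=0):
    """Return (cols, scale) encoding a sparse D x d matrix P.

    Row i of P has a single nonzero entry scale[i] in column cols[i].
    Scaling by 1/sqrt(rows per column) makes the columns orthonormal,
    so P^T P = I and the map theta_d -> P theta_d preserves norms.
    """
    if D < d:
        raise ValueError("need D >= d")
    rng = random.Random(seed)
    # Guarantee every column is used at least once, then fill the rest at random.
    cols = list(range(d)) + [rng.randrange(d) for _ in range(D - d)]
    rng.shuffle(cols)
    counts = [0] * d
    for c in cols:
        counts[c] += 1
    scale = [1.0 / math.sqrt(counts[c]) for c in cols]
    return cols, scale


def project(theta_d, cols, scale):
    """Compute P @ theta_d without materializing P."""
    return [s * theta_d[c] for c, s in zip(cols, scale)]


def split_into_matrices(flat, shapes):
    """Cut the flat R^D vector into row-major matrices of the given shapes."""
    mats, offset = [], 0
    for n_rows, n_cols in shapes:
        block = flat[offset:offset + n_rows * n_cols]
        mats.append([block[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)])
        offset += n_rows * n_cols
    return mats


def norm(v):
    return math.sqrt(sum(x * x for x in v))


# Toy setup: one LoRA pair (B: 4x2, A: 2x4), so D = 16, driven by d = 3 values.
shapes = [(4, 2), (2, 4)]
cols, scale = make_projection(D=16, d=3)
theta_d = [0.5, -1.0, 2.0]
B, A = split_into_matrices(project(theta_d, cols, scale), shapes)
```

Because every entry of the large vector is tied to one coordinate of the small vector regardless of which layer it belongs to, parameters are shared globally across layers, and the projection costs O(D) time with no dense matrix to store.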

IROS Conference 2025 Conference Paper

Vibration-Aware Trajectory Optimization for Mobile Robots in Wild Environments via Physics-Informed Neural Network

Aochun Xu
Andong Yang
Wei Li
Yu Hu

The suspension system, through effective damping of vibrations and shocks, can enhance the stability of wheeled robots traversing challenging terrain. Because the suspension system decouples the rigid correspondence between terrain changes and robot vibrations, considering suspension modeling in trajectory planning offers the advantage of more accurate prediction of the robot’s response to terrain. This improved predictive capability facilitates the planning of safer trajectories and may reduce tracking errors in the subsequent control process. In this work, inspired by the structure of Physics-Informed Neural Network (PINN), we propose a physics-informed planning method that considers the vibrational effects of complex nonlinear suspension systems. In addition, we design a two-stage process to accelerate training. By incorporating PINN, our method can better guarantee the physical feasibility of the planned trajectories. The proposed approach has been evaluated on a real robot platform. Compared to state-of-the-art baseline methods, our proposed approach achieves a 15.38% reduction in hazardous planning for mobile robots in wild environments.

NeurIPS Conference 2025 Conference Paper

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang
Junyi Li
Xin Lai
Jinming Wu
Wei Li
Zejun Ma
Bei Yu
Hengshuang Zhao

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. All our code and data are open-sourced.
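The inference-time control flow described in this abstract (answer from a downsampled image first, escalate to full resolution only when the model asks) can be sketched as follows. This is a minimal illustration, not the released implementation: the `model` and `downsample` callables and the `<request_high_res>` token string are hypothetical placeholders, and the reinforcement learning used to train the decision is not shown.

```python
def answer_with_adaptive_resolution(model, image, question, downsample,
                                    request_token="<request_high_res>"):
    """Try the cheap low-resolution pass first; escalate only on request.

    model(image, question) -> str and downsample(image) -> image are
    caller-supplied callables (assumptions for this sketch).
    Returns (answer, used_high_res).
    """
    low_res = downsample(image)
    reply = model(low_res, question)
    if request_token in reply:
        # The model judged the downsampled image insufficient.
        return model(image, question), True
    return reply, False


# Stand-in model: pretends OCR-like questions need the full image.
def toy_model(image, question):
    if image == "low-res" and "read" in question:
        return "<request_high_res>"
    return f"answer from {image}"


easy = answer_with_adaptive_resolution(
    toy_model, "full-res", "what animal is this?", lambda img: "low-res")
hard = answer_with_adaptive_resolution(
    toy_model, "full-res", "read the sign", lambda img: "low-res")
```

The second element of the returned pair records whether the expensive pass ran, which is the quantity a resize-call penalty would act on during training.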

ICML Conference 2025 Conference Paper

ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think

Tao Feng 0014
Wei Li
Didi Zhu
Hangjie Yuan
Wendi Zheng
Dan Zhang
Jie Tang 0001

Backpropagation provides a generalized configuration for overcoming catastrophic forgetting. Optimizers such as SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. However, access to gradient information is not always feasible in practice due to black-box APIs, hardware constraints, or non-differentiable systems, a challenge we refer to as the gradient bans. To bridge this gap, we introduce ZeroFlow, the first benchmark designed to evaluate gradient-free optimization algorithms for overcoming forgetting. ZeroFlow examines a suite of forward pass-based methods across various algorithms, forgetting scenarios, and datasets. Our results show that forward passes alone can be sufficient to mitigate forgetting. We uncover novel optimization principles that highlight the potential of forward pass-based methods in mitigating forgetting, managing task conflicts, and reducing memory demands. Additionally, we propose new enhancements that further improve forgetting resistance using only forward passes. This work provides essential tools and insights to advance the development of forward-pass-based methods for continual learning.

NeurIPS Conference 2024 Conference Paper

Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment

Wei Li
Lujun Li
Mark Lee
Shengjie Sun

Large Language Models (LLMs) have revolutionized the field of natural language processing with their impressive capabilities. However, their enormous size presents challenges for deploying them in real-world applications. Traditional compression techniques, like pruning, often lead to suboptimal performance due to their uniform pruning ratios and lack of consideration for the varying importance of features across different layers. To address these limitations, we present a novel Adaptive Layer Sparsity (ALS) approach to optimize LLMs. Our approach consists of two key steps. Firstly, we estimate the correlation matrix between intermediate layers by leveraging the concept of information orthogonality. This novel perspective allows for a precise measurement of the importance of each layer across the model. Secondly, we employ a linear optimization algorithm to develop an adaptive sparse allocation strategy based on evaluating the correlation matrix. This strategy enables us to selectively prune features in intermediate layers, achieving fine-grained optimization of the LLM model. Considering the varying importance across different layers, we can significantly reduce the model size without sacrificing performance. We conduct extensive experiments on publicly available language processing datasets, including the LLaMA-V1|V2|V3 family and OPT, covering various benchmarks. Our experimental results validate the effectiveness of our ALS method, showcasing its superiority over previous approaches. The performance gains demonstrate its potential for enhancing LLMs' efficiency and resource utilization. Notably, our approach surpasses the state-of-the-art models Wanda and SparseGPT, showcasing its ability to excel even under high sparsity levels. Codes at: https://github.com/lliai/ALS.

NeurIPS Conference 2024 Conference Paper

Alias-Free Mamba Neural Operator

Jianwei Zheng
Wei Li
Ni Xu
Junwei Zhu
Xiaoxu Lin
Xiaoqin Zhang

Benefiting from the booming deep learning techniques, neural operators (NO) are considered as an ideal alternative to break the traditions of solving Partial Differential Equations (PDE) with expensive cost. Yet with the remarkable progress, current solutions concern little on the holistic function features (both global and local information) during the process of solving PDEs. Besides, a meticulously designed kernel integration to meet desirable performance often suffers from a severe computational burden, such as GNO with $O(N(N-1))$, FNO with $O(N \log N)$, and Transformer-based NO with $O(N^2)$. To counteract the dilemma, we propose a mamba neural operator with $O(N)$ computational complexity, namely MambaNO. Functionally, MambaNO achieves a clever balance between global integration, facilitated by state space model of Mamba that scans the entire function, and local integration, engaged with an alias-free architecture. We prove a property of continuous-discrete equivalence to show the capability of MambaNO in approximating operators arising from universal PDEs to desired accuracy. MambaNOs are evaluated on a diverse set of benchmarks with possibly multi-scale solutions and set new state-of-the-art scores, yet with fewer parameters and better efficiency.

JBHI Journal 2024 Journal Article

BTSSPro: Prompt-Guided Multimodal Co-Learning for Breast Cancer Tumor Segmentation and Survival Prediction

Wei Li
Tianyu Liu
Feiyan Feng
Shengpeng Yu
Hong Wang
Yanshen Sun

Early detection significantly enhances patients' survival rates by identifying tumors in their initial stages through medical imaging. However, prevailing methodologies encounter challenges in extracting comprehensive information from diverse modalities, thereby exacerbating semantic disparities and overlooking critical task correlations, consequently compromising the accuracy of prognosis predictions. Moreover, clinical insights emphasize the advantageous sharing of parameters between tumor segmentation and survival prediction for enhanced prognostic accuracy. This paper proposes a novel model, BTSSPro, designed to concurrently address Breast cancer Tumor Segmentation and Survival prediction through a Prompt-guided multi-modal co-learning framework. Technologically, our approach involves the extraction of tumor-specific discriminative features utilizing shared dual attention (SDA) blocks, which amalgamate spatial and channel information from breast MR images. Subsequently, we employ a guided fusion module (GFM) to seamlessly integrate the Electronic Health Record (EHR) vector into the extracted tumor-related discriminative feature representations. This integration prompts the model's feature selection to align more closely with real-world scenarios. Finally, a feature harmonic unit (FHU) is introduced to coordinate the transformer encoder and CNN decoder, thus reducing semantic differences. Remarkably, BTSSPro achieved a C-index of 0.968 and Dice score of 0.715 on the Breast MRI-NACT-Pilot dataset and a C-index of 0.807 and Dice score of 0.791 on the ISPY1 dataset, surpassing the previous state-of-the-art methods.

JBHI Journal 2024 Journal Article

CellT-Net: A Composite Transformer Method for 2-D Cell Instance Segmentation

Zhijiang Wan
Manyu Li
Zihan Wang
Hai Tan
Wei Li
Lisu Yu
Dinesh Jackson Samuel

Cell instance segmentation (CIS) via light microscopy and artificial intelligence (AI) is essential to cell and gene therapy-based health care management, which offers the hope of revolutionary health care. An effective CIS method can help clinicians to diagnose neurological disorders and quantify how well these deadly disorders respond to treatment. To address the CIS task challenged by dataset characteristics such as irregular morphology, variation in sizes, cell adhesion, and obscure contours, we propose a novel deep learning model named CellT-Net to actualize effective cell instance segmentation. In particular, the Swin transformer (Swin-T) is used as the basic model to construct the CellT-Net backbone, as the self-attention mechanism can adaptively focus on useful image regions while suppressing irrelevant background information. Moreover, CellT-Net incorporating Swin-T constructs a hierarchical representation and generates multi-scale feature maps that are suitable for detecting and segmenting cells at different scales. A novel composite style named cross-level composition (CLC) is proposed to build composite connections between identical Swin-T models in the CellT-Net backbone and generate more representational features. The earth mover's distance (EMD) loss and binary cross entropy loss are used to train CellT-Net and actualize the precise segmentation of overlapped cells. The LiveCELL and Sartorius datasets are utilized to validate the model effectiveness, and the results demonstrate that CellT-Net can achieve better model performance for dealing with the challenges arising from the characteristics of cell datasets than state-of-the-art models.

AAAI Conference 2024 Conference Paper

CGS-Mask: Making Time Series Predictions Intuitive for All

Feng Lu
Wei Li
Yifei Sun
Cheng Song
Yufei Ren
Albert Y. Zomaya

Artificial intelligence (AI) has immense potential in time series prediction, but most explainable tools have limited capabilities in providing a systematic understanding of important features over time. These tools typically rely on evaluating a single time point, overlook the time ordering of inputs, and neglect the time-sensitive nature of time series applications. These factors make it difficult for users, particularly those without domain knowledge, to comprehend AI model decisions and obtain meaningful explanations. We propose CGS-Mask, a post-hoc and model-agnostic cellular genetic strip mask-based saliency approach to address these challenges. CGS-Mask uses consecutive time steps as a cohesive entity to evaluate the impact of features on the final prediction, providing binary and sustained feature importance scores over time. Our algorithm optimizes the mask population iteratively to obtain the optimal mask in a reasonable time. We evaluated CGS-Mask on synthetic and real-world datasets, and it outperformed state-of-the-art methods in elucidating the importance of features over time. According to our pilot user study via a questionnaire survey, CGS-Mask is the most effective approach in presenting easily understandable time series prediction results, enabling users to comprehend the decision-making process of AI models with ease.

AAAI Conference 2024 Conference Paper

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

Xiang Li
Junbo Yin
Wei Li
Chengzhong Xu
Ruigang Yang
Jianbing Shen

Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles, and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the utilization of different LiDAR sensors of each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, that aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate DI-V2X achieves remarkable performance, outperforming all the previous V2X models. Code is available at https://github.com/Serenos/DI-V2X.

IS Journal 2024 Journal Article

Exploring alterations of brain networks of AD patients using WTC method

Li Yapeng
Yuanyuan Qin
Xi Chen
Wei Li

Objective: To explore the influences of different frequency bands on preprocessing of resting-state fMRI datasets used by the Wavelet Transform Coherence (WTC) method, and to study changes in the functional brain networks of AD patients. Method: Resting-state fMRI datasets of 10 AD patients and 11 healthy controls were collected in this study and time series of 90 brain regions defined by AAL (Automated Anatomical Labeling) were extracted after preprocessing. Wavelet transformation was performed for each time series, and functional brain networks were established at different frequencies (0.125 Hz, 0.0625 Hz) using the WTC method. The topology parameters of the networks, comprising global efficiency, clustering coefficient, average shortest path length and small-world property, were calculated and averaged within each group. Result: The results imply that there are significant differences in topology parameters between networks of different frequencies. Likewise, statistical analysis of topology parameters of AD and HC (Healthy Controls) shows that global efficiency, clustering coefficient and small-world properties of AD all decreased by varying degrees, while the shortest path length of AD remained longer. Conclusion: Our research provides a theoretical basis for the choice of filter bands for data preprocessing in functional magnetic resonance imaging. The findings may serve as indicators for early diagnosis of AD patients.
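Two of the topology parameters named in this abstract, global efficiency and clustering coefficient, can be computed on a small graph as sketched below. The sketch assumes the coherence matrix has already been thresholded into an unweighted, undirected graph stored as a dict of neighbor sets; the abstract does not state how the networks were binarized, so that step is an assumption and is not shown.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adj):
    """Mean of 1/d(u, v) over ordered node pairs; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        total += sum(1.0 / d for v, d in dist.items() if v != u)
    return total / (n * (n - 1))


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbor pairs that are linked."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nb = list(nbrs)
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)


# Toy graphs: a fully connected triangle and a three-node chain.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
chain = {0: {1}, 1: {0, 2}, 2: {1}}
```

The triangle is maximally integrated and clustered, while the chain loses efficiency through its two-hop pair and has no closed triangles, which mirrors the direction of change the abstract reports for the AD group.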

NeurIPS Conference 2024 Conference Paper

Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation

Mingjia Li
Shuang Li
Tongrui Su
Longhui Yuan
Jian Liang
Wei Li

Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at https://github.com/BIT-DA/DUSA.

EAAI Journal 2024 Journal Article

FCT-Net: A dual-encoding-path network fusing atrous spatial pyramid pooling and transformer for pavement crack detection

Bing Xiong
Rong Hong
Rui Liu
Jing Wang
Jin Zhang
Wei Li
Songtao Lv
Dongdong Ge

Cracks are a typical form of road damage, and accurate detection of cracks is of great significance for road maintenance work and ensuring traffic safety. Recently, computer vision has gradually been applied in the field of crack segmentation. However, there are still some extremely challenging problems in crack segmentation, such as complex backgrounds, information loss caused by pooling and convolution operations, and insufficient fusion of global and local semantic information. In response to the above problems, this paper proposes a dual-encoding-path network with U-Net architecture called FCT-Net, which fuses channel atrous spatial pyramid pooling (CASPP) and a transformer. Specifically, CASPP obtains multi-scale receptive fields by incorporating spatial and channel attention, while refining and extracting local features. Meanwhile, we introduce long-short distance attention to construct a novel transformer characterized by interaction between local and global attention features. In addition, a residual convolution module is designed to enhance the local features of the transformer. Furthermore, we devise a multi-scale attention weight cross fusion module to aggregate the features of the dual encoding branches, reducing information loss during downsampling and suppressing background information. Finally, we evaluate the performance of FCT-Net through experiments on three public datasets. Extensive experimental results show that FCT-Net achieves higher F1-score and mean intersection over union (mIoU) than state-of-the-art segmentation networks on the DeepCrack537 and CrackLS315 datasets. Meanwhile, it has excellent segmentation performance for cracks in complex scenes, with the highest recall, F1-score, and mIoU of 85.64%, 81.67%, and 84.05%, respectively, on the CrackTree260 dataset.
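
For readers unfamiliar with the metrics quoted above, the following minimal sketch shows one common way to compute recall, F1-score, and mIoU for a binary crack mask. It assumes flattened 0/1 label lists and a two-class mIoU (crack and background averaged); the function name and the toy masks are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def binary_segmentation_metrics(pred, target):
    """Recall, F1, and two-class mIoU for flat lists of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)

    def safe_div(num, den):
        # Return 0 when a class is absent instead of dividing by zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    iou_crack = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    miou = (iou_crack + iou_background) / 2
    return {"recall": recall, "f1": f1, "miou": miou}


# Toy example (hypothetical): 4 pixels, one false positive.
metrics = binary_segmentation_metrics(pred=[1, 1, 0, 0], target=[1, 0, 0, 0])
```

In the toy example the single crack pixel is found (recall 1.0), but the extra predicted pixel halves the precision, which pulls F1 down to 2/3 and the mIoU to 7/12.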

JBHI Journal 2024 Journal Article

FD-Net: Feature Distillation Network for Oral Squamous Cell Carcinoma Lymph Node Segmentation in Hyperspectral Imagery

Xueyu Zhang
Qingxiang Li
Wei Li
Yuxing Guo
Jianyun Zhang
Chuanbin Guo
Kan Chang
Nigel H. Lovell

Oral squamous cell carcinoma (OSCC) is characterized by early regional lymph node metastasis. OSCC patients often have poor prognoses and low survival rates due to cervical lymph metastases. Therefore, a reasonable screening method is needed to quickly judge the cervical lymph metastatic condition of OSCC patients and develop appropriate treatment plans. In this study, the widely used pathological sections with hematoxylin-eosin (H&E) staining are taken as the target, and, combined with the advantages of hyperspectral imaging technology, a novel diagnostic method for identifying OSCC lymph node metastases is proposed. The method consists of a learning stage and a decision-making stage, focusing on cancer and non-cancer nuclei, gradually completing the lesions' segmentation from coarse to fine, and achieving high accuracy. In the learning stage, the proposed feature distillation network (FD-Net) is developed to segment the cancerous and non-cancerous nuclei. In the decision-making stage, the segmentation results are post-processed, and the lesions are effectively distinguished based on the prior. Experimental results demonstrate that the proposed FD-Net is very competitive in the OSCC hyperspectral medical image segmentation task. The proposed FD-Net method performs best on the seven segmentation evaluation indicators: MIoU, OA, AA, SE, CSI, GDR, and DICE. On these seven indicators, the proposed FD-Net method is 1.75%, 1.27%, 0.35%, 1.9%, 0.88%, 4.45%, and 1.98% higher, respectively, than the DeepLab V3 method, which ranks second in performance. In addition, the proposed diagnosis method of OSCC lymph node metastasis can effectively assist pathologists in disease screening and reduce their workload.

NeurIPS Conference 2024 Conference Paper

Generalizable Implicit Motion Modeling for Video Frame Interpolation

Zujin Guo
Wei Li
Chen Change Loy

Motion modeling is critical in flow-based Video Frame Interpolation (VFI). Existing paradigms either consider linear combinations of bidirectional flows or directly predict bilateral flows for given timestamps without exploring favorable motion priors, thus lacking the capability of effectively modeling spatiotemporal dynamics in real-world videos. To address this limitation, in this study, we introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI. Specifically, to enable GIMM as an effective motion modeling paradigm, we design a motion encoding pipeline to model spatiotemporal motion latent from bidirectional flows extracted from pre-trained flow estimators, effectively representing input-specific motion priors. Then, we implicitly predict arbitrary-timestep optical flows within two adjacent input frames via an adaptive coordinate-based neural network, with spatiotemporal coordinates and motion latent as inputs. Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion. We show that GIMM performs better than the current state of the art on standard VFI benchmarks.

NeurIPS Conference 2024 Conference Paper

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Pengcheng Chen
Jin Ye
Guoan Wang
Yanjun Li
Zhongying Deng
Wei Li
Tianbin Li
Haodong Duan

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focusing on a single domain, and lacking varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed the GMAI-MMBench, the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 53.96%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

IJCAI Conference 2024 Conference Paper

HeterGCL: Graph Contrastive Learning Framework on Heterophilic Graph

Chenhao Wang
Yong Liu
Yan Yang
Wei Li

Graph Contrastive Learning (GCL) has attracted significant research attention due to its self-supervised ability to learn robust node representations. Unfortunately, most methods primarily focus on homophilic graphs, rendering them less effective for heterophilic graphs. In addition, the complexity of node interactions in heterophilic graphs poses considerable challenges to augmentation schemes, coding architectures, and contrastive designs for traditional GCL. In this work, we propose HeterGCL, a novel graph contrastive learning framework with structural and semantic learning to explore the true potential of GCL on heterophilic graphs. Specifically, we abandon the random augmentation scheme that leads to the destruction of the graph structure, and instead introduce an adaptive neighbor aggregation strategy (ANA) to extract topology-supervised signals from neighboring nodes at different distances and explore the structural information with an adaptive local-to-global contrastive loss. In the semantic learning module, we jointly consider the original nodes' features and the similarity between nodes in the latent feature space to explore hidden associations between nodes. Experimental results on homophilic and heterophilic graphs demonstrate that HeterGCL outperforms existing self-supervised and semi-supervised baselines across various downstream tasks.

ICML Conference 2024 Conference Paper

Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

Wei Li
Hehe Fan
Yongkang Wong
Yi Yang 0001
Mohan S. Kankanhalli

Previous efforts using frozen Large Language Models (LLMs) for visual understanding, via image captioning or image-text retrieval tasks, face challenges when dealing with complex multimodal scenarios. In order to enhance the capabilities of Multimodal Large Language Models (MLLM) in comprehending the context of vision and language, we introduce Multimodal Composition Learning (MCL) for the purpose of mapping or aligning the vision and language input. In particular, we introduce two tasks: Multimodal-Context Captioning (MC-Cap) and Multimodal-Context Retrieval (MC-Ret) to guide a frozen LLM in comprehending the vision and language context. These specialized tasks are crafted to improve the LLM’s capacity for efficient processing and utilization of multimodal inputs, thereby enhancing its proficiency in generating more accurate text or visual representations. Extensive experiments on both retrieval tasks (i.e., zero-shot composed image retrieval, visual storytelling image retrieval and visual dialog image retrieval) and text generation tasks (i.e., visual question answering) demonstrate the effectiveness of the proposed method. The code is available at: https://github.com/dhg-wei/MCL.

NeurIPS Conference 2024 Conference Paper

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
Linke Ouyang
Songyang Zhang
Haodong Duan

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 × 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 × 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 × 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.

AAAI Conference 2024 Short Paper

Knowledge Transfer via Compact Model in Federated Learning (Student Abstract)

Jiaming Pei
Wei Li
Lukun Wang

Communication overhead remains a significant challenge in federated learning due to frequent global model updates. Essentially, the update of the global model can be viewed as knowledge transfer. We aim to transfer more knowledge through a compact model while reducing communication overhead. In our study, we introduce a federated learning framework where clients pre-train large models locally and the server initializes a compact model for communication. This compact model should be light in size but still have enough knowledge to refine the global model effectively. We facilitate the knowledge transfer from local to global models based on pre-training outcomes. Our experiments show that our approach significantly reduces communication overhead without sacrificing accuracy.

JBHI Journal 2024 Journal Article

Locality Cross-domain Discriminant Analysis for Membranous Nephropathy Recognition Using Microscopic Hyperspectral Imaging

Jinxin Zhang
Wei Li
Mingfeng Ge
Ruoqian Gao
Wenfei Dong

Cross-domain methods have been proposed to learn domain-invariant knowledge that can be transferred from the source domain to the target domain. Existing cross-domain methods attempt to minimize the distribution discrepancy between the domains. However, these methods fail to explore the domain-invariant subspace because samples of different classes from the two domains may overlap in the new subspace. They also consider features of the original data space that may be unnecessary or irrelevant to the final classification, and neglect to preserve the local manifold structure between the two domains. To solve these problems, a novel feature extraction method called Locality Cross-domain Discriminant Analysis (LCDA) is proposed. LCDA first aligns the distributions and avoids overlap between the two domains. Then, LCDA exploits the local manifold structure to maintain the discriminative capability of the low-dimensional projection matrices. Finally, a robust constraint is utilized to preserve the robustness of the projection matrices. The proposed LCDA not only avoids overlap between different classes but also explores the local manifold information. Experimental results on the medical membranous nephropathy hyperspectral dataset demonstrate that the proposed LCDA performs better than other relevant feature extraction methods.

NeurIPS Conference 2024 Conference Paper

Make Continual Learning Stronger via C-Flat

Ang Bian
Wei Li
Hangjie Yuan
Chengrong Yu
Mang Wang
Zixiang Zhao
Aojun Lu
Pengliang Ji

Balancing 'sensitivity-stability' between learning new tasks and preserving memory is critical for resolving catastrophic forgetting in continual learning (CL). Improving model generalization ability within each learning phase is one solution to help CL overcome the gap in the joint knowledge space. Zeroth-order loss landscape sharpness-aware minimization is a strong training regime that improves model generalization in transfer learning compared with optimizers such as SGD. It has also been introduced into CL to improve memory representation or learning efficiency. However, zeroth-order sharpness alone could favor sharper over flatter minima in certain scenarios, leading to a rather sensitive minimum rather than a global optimum. To further enhance learning stability, we propose a Continual Flatness (C-Flat) method featuring a flatter loss landscape tailored for CL. C-Flat can be invoked with only one line of code and is plug-and-play with any CL method. A general framework of C-Flat applied to all CL categories and a thorough comparison with loss minima optimizers and flat-minima-based CL approaches are presented in this paper, showing that our method can boost CL performance in almost all cases. Code is available at https://github.com/WanNaa/C-Flat.

AAAI Conference 2024 Conference Paper

Multi-Modal Disordered Representation Learning Network for Description-Based Person Search

Fan Yang
Wei Li
Menglong Yang
Binbin Liang
Jianwei Zhang

Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representation from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the visual and textual representations. Specifically, we design a Cross-modality Global Feature Learning Architecture to learn the global features from the two modalities and meet the demand of the task. Based on our global network, we introduce a Disorder Local Learning Module to explore local features by a disordered reorganization strategy from both visual and textual aspects and enhance the robustness of the whole network. Besides, we introduce a Cross-modality Interaction Module to guide the two streams to extract visual or textual representations considering the correlation between modalities. Extensive experiments are conducted on two public datasets, and the results show that our method outperforms the state-of-the-art methods on CUHK-PEDES and ICFG-PEDES datasets and achieves superior performance.

EAAI Journal 2024 Journal Article

Nearshore optical video object detector based on temporal branch and spatial feature enhancement

Yuanlin Zhao
Wei Li
Jiangang Ding
Yansong Wang
Lili Pei
Aojia Tian

The computing power of nearshore and ship-borne devices is limited, posing significant challenges for accurately detecting objects in real time on such devices. We propose a nearshore video object detector (NVID) to tackle these challenges. Considering the abundance of dynamic entities in the nearshore environment, we have developed you can look more (YCLM) to perceive the temporal characteristics of these objects. Furthermore, to improve the network's ability to detect objects of different sizes, we designed parallel deformable attention (PDA) based on the spatial features of objects. More importantly, we developed fast re-parameterization convolution (FREConv) and faster conv (FConv). Building on these innovations, we proposed a fast re-parameterization network (FRENet) specifically tailored to produce low-parameter, multi-scale feature outputs. With end-to-end training, our pipeline outperforms other state-of-the-art (SOTA) methods on the nearshore objects (NearshoreObjects) dataset (90.4 average precision (AP) 50 (+4.7), 9.3 parameters (Params) (−1.0M), 24.8 frames per second (FPS) (Jetson Nano) (+0.6)). In addition, NVID also achieved excellent results on the on board (OnBoard) dataset (90.3 AP50 (+2.8), 9.3 params (−1.0M), 26.5 FPS (Jetson Nano) (+0.8)). The source code can be accessed at https://github.com/Yuanlin-Zhao/NVID.

NeurIPS Conference 2024 Conference Paper

NeuMA: Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics

Junyi Cao
Shanyan Guan
Yanhao Ge
Wei Li
Xiaokang Yang
Chao Ma

While humans effortlessly discern intrinsic dynamics and adapt to new scenarios, modern AI systems often struggle. Current methods for visual grounding of dynamics either use pure neural-network-based simulators (black box), which may violate physical laws, or traditional physical simulators (white box), which rely on expert-defined equations that may not fully capture actual dynamics. We propose the Neural Material Adaptor (NeuMA), which integrates existing physical laws with learned corrections, facilitating accurate learning of actual dynamics while maintaining the generalizability and interpretability of physical priors. Additionally, we propose Particle-GS, a particle-driven 3D Gaussian Splatting variant that bridges simulation and observed images, allowing image gradients to be back-propagated to optimize the simulator. Comprehensive experiments on various dynamics in terms of grounded particle accuracy, dynamic rendering quality, and generalization ability demonstrate that NeuMA can accurately capture intrinsic dynamics. Project Page: https://xjay18.github.io/projects/neuma.html.

NeurIPS Conference 2024 Conference Paper

On the Effects of Data Scale on UI Control Agents

Wei Li
William Bishop
Alice Li
Chris Rawles
Folawiyo Campbell-Ajala
Divya Tyamagundlu
Oriana Riva

Autonomous agents that control user interfaces to accomplish human tasks are emerging. Leveraging LLMs to power such agents has been of special interest, but unless fine-tuned on human-collected task demonstrations, performance is still relatively low. In this work we study whether fine-tuning alone is a viable approach for building real-world UI control agents. To this end we collect and release a new dataset, AndroidControl, consisting of 15,283 demonstrations of everyday tasks with Android apps. Compared to existing datasets, each AndroidControl task instance includes both high- and low-level human-generated instructions, allowing us to explore the level of task complexity an agent can handle. Moreover, AndroidControl is the most diverse computer control dataset to date, including 14,548 unique tasks over 833 Android apps, thus allowing us to conduct in-depth analysis of the model performance in and out of the domain of the training data. Using the dataset, we find that when tested in domain, fine-tuned models outperform zero- and few-shot baselines and scale in such a way that robust performance might feasibly be obtained simply by collecting more data. Out of domain, performance scales significantly more slowly, which suggests that, in particular for high-level tasks, fine-tuning on more data alone may be insufficient for achieving robust out-of-domain performance.

YNIMG Journal 2024 Journal Article

Opposite changes in morphometric similarity of medial reward and lateral non-reward orbitofrontal cortex circuits in obesity

Debo Dong
Ximei Chen
Wei Li
Xiao Gao
Yulin Wang
Feng Zhou
Simon B. Eickhoff
Hong Chen

Obesity has a profound impact on metabolic health thereby adversely affecting brain structure and function. However, the majority of previous studies used a single structural index to investigate the link between brain structure and body mass index (BMI), which hinders our understanding of structural covariance between regions in obesity. This study aimed to examine the relationship between macroscale cortical organization and BMI using novel morphometric similarity networks (MSNs). The individual MSNs were first constructed from individual eight multimodal cortical morphometric features between brain regions. Then the relationship between BMI and MSNs within the discovery sample of 434 participants was assessed. The key findings were further validated in an independent sample of 192 participants. We observed that the lateral non-reward orbitofrontal cortex (lOFC) exhibited decoupling (i.e., reduction in integration) in obesity, which was mainly manifested by its decoupling with the cognitive systems (i.e., DMN and FPN) while the medial reward orbitofrontal cortex (mOFC) showed de-differentiation (i.e., decrease in distinctiveness) in obesity, which was mainly represented by its de-differentiation with the cognitive and attention systems (i.e., DMN and VAN). Additionally, the lOFC showed de-differentiation with the visual system in obesity, while the mOFC showed decoupling with the visual system and hyper-coupling with the sensory-motor system in obesity. As an important first step in revealing the role of underlying structural covariance in body mass variability, the present study presents a novel mechanism that underlies the reward-control interaction imbalance in obesity, thus can inform future weight-management approaches.

NeurIPS Conference 2024 Conference Paper

PEACE: A Dataset of Pharmaceutical Care for Cancer Pain Analgesia Evaluation and Medication Decision

Yutao Dou
Huimin Yu
Wei Li
Jingyang Li
Fei Xia
Jian Xiao

Over half of cancer patients experience long-term pain management challenges. Recently, interest has grown in systems for cancer pain treatment effectiveness assessment (TEA) and medication recommendation (MR) to optimize pharmacological care. These systems aim to improve treatment effectiveness by recommending personalized medication plans based on comprehensive patient information. Despite progress, current systems lack multidisciplinary treatment (MDT) team assessments of treatment and the patient's perception of medication, both crucial for effective cancer pain management. Moreover, managing cancer pain medication requires multiple adjustments to the treatment plan based on the patient's evolving condition, a detail often missing in existing datasets. To tackle these issues, we designed the PEACE dataset specifically for cancer pain medication research. It includes detailed pharmacological care records for over 38,000 patients, covering demographics, clinical examination, treatment outcomes, medication plans, and patient self-perceptions. Unlike existing datasets, PEACE not only records long-term, multiple follow-ups both inside and outside hospitals but also includes patients' self-assessments of medication effects and the impact on their lives. We conducted a proof-of-concept study with 13 machine learning algorithms on the PEACE dataset for the TEA (classification task) and MR (regression task). These experiments provide valuable insights into the potential of the PEACE dataset for advancing personalized cancer pain management. The dataset is accessible at: https://github.com/YTYTYD/PEACE.

NeurIPS Conference 2024 Conference Paper

Physics-Constrained Comprehensive Optical Neural Networks

Yanbing Liu
Jianwei Qin
Yan Liu
Xi Yue
Xun Liu
Guoqing Wang
Tianyu Li
Fangwei Ye

With the advantages of low latency, low power consumption, and high parallelism, optical neural networks (ONN) offer a promising solution for time-sensitive and resource-limited artificial intelligence applications. However, the performance of the ONN model is often diminished by the gap between the ideal simulated system and the actual physical system. To bridge the gap, this work conducts extensive experiments to investigate systematic errors in the optical physical system within the context of image classification tasks. Our investigation shows that two quantifiable errors, light source instability and exposure time mismatches, significantly impact the prediction performance of ONN. To address these systematic errors, a physics-constrained ONN learning framework is constructed, including a well-designed loss function to mitigate the effect of light fluctuations, a CCD adjustment strategy to alleviate the effects of exposure time mismatches, and a 'physics-prior based' error compensation network to manage other systematic errors, ensuring consistent light intensity across experimental results and simulations. In our experiments, the proposed method achieved a test classification accuracy of 96.5% on the MNIST dataset, a substantial improvement over the 61.6% achieved with the original ONN. For the more challenging QuickDraw16 and Fashion MNIST datasets, experimental accuracy improved from 63.0% to 85.7% and from 56.2% to 77.5%, respectively. Moreover, the comparison results further demonstrate the effectiveness of the proposed physics-constrained ONN learning framework over state-of-the-art ONN approaches. This lays the groundwork for more robust and precise optical computing applications.

IJCAI Conference 2024 Conference Paper

Provable Acceleration of Nesterov’s Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Xin Liu
Wei Tao
Wei Li
Dazhi Zhan
Jun Wang
Zhisong Pan

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum when training over-parameterized neural networks, where the number of parameters is significantly larger than that of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov's accelerated gradient (NAG) method, are the workhorses of first-order gradient methods owing to their accelerated convergence. In practice, NAG often performs better than HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of the two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our result not only establishes tighter upper bounds of the convergence rate for both HB and NAG, but also provides the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.
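
The two momentum methods compared in this abstract differ only in where the gradient is evaluated. The sketch below shows the textbook update rules on a toy diagonal quadratic; the objective, step size, momentum value, and function names are illustrative assumptions, not the paper's two-layer ReLU setting or its analysis.

```python
import math


def grad(x, curvatures):
    """Gradient of f(x) = 0.5 * sum(c_i * x_i^2), a toy quadratic objective."""
    return [c * xi for c, xi in zip(curvatures, x)]


def heavy_ball(x0, curvatures, lr, beta, steps):
    """HB: gradient taken at the current iterate, plus a momentum term."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        g = grad(x, curvatures)
        x_next = [xi - lr * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def nesterov(x0, curvatures, lr, beta, steps):
    """NAG: gradient taken at the look-ahead point y = x + beta * (x - x_prev)."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        y = [xi + beta * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y, curvatures)
        x_next = [yi - lr * gi for yi, gi in zip(y, g)]
        x_prev, x = x, x_next
    return x


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


# Hypothetical settings chosen so that both methods are stable on this problem.
curvatures = [1.0, 10.0]
x_hb = heavy_ball([1.0, 1.0], curvatures, lr=0.1, beta=0.9, steps=500)
x_nag = nesterov([1.0, 1.0], curvatures, lr=0.1, beta=0.9, steps=500)
```

Both iterates end up close to the minimizer at the origin here; the sketch only contrasts the update rules and makes no claim about the convergence rates established in the paper.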

TCS Journal 2024 Journal Article

Resistance distances in directed graphs: Definitions, properties, and applications

Mingzhe Zhu
Liwang Zhu
Huan Li
Wei Li
Zhongzhi Zhang

Resistance distance has been studied extensively in the past years, with the majority of previous studies devoted to undirected networks, in spite of the fact that various realistic networks are directed. Although several generalizations of resistance distance on directed graphs have been proposed, they either have no physical interpretation or are not a metric. In this paper, we first extend the definition of resistance distance to strongly connected directed graphs based on random walks and show that the two-node resistance distance on directed graphs is a metric. Then, we introduce the Laplacian matrix for directed graphs that subsumes the Laplacian matrix of undirected graphs as a particular case, and use its pseudoinverse to express the two-node resistance distance and many other relevant quantities derived from resistance distances. Moreover, we define the resistance distance between a vertex and a vertex group on directed graphs and further define a problem of optimally selecting a group with a fixed number of nodes, such that their resistance distance is minimized. Since this combinatorial optimization problem is NP-hard, we present a greedy algorithm with a proven approximation ratio, and conduct experiments on model and realistic networks to validate the performance of this approximation algorithm.
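
As background for this abstract, the classical resistance distance on an undirected graph can be computed by treating every edge as a unit resistor, injecting one unit of current at one node, and grounding the other. The sketch below does this with a small Gaussian elimination; it covers only the well-known undirected case, not the directed definition introduced in the paper, and the function name and toy graphs are hypothetical.

```python
def resistance_distance(n, edges, s, t):
    """Effective resistance between s and t in a connected, undirected,
    unit-weight graph with nodes 0..n-1 (classical undirected case only)."""
    if s == t:
        return 0.0
    # Build the graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Ground node t (potential 0) and inject unit current at s:
    # solve L_reduced * phi = e_s, stored as an augmented matrix.
    keep = [i for i in range(n) if i != t]
    aug = [[lap[i][j] for j in keep] + [1.0 if i == s else 0.0] for i in keep]
    m = len(keep)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [value / p for value in aug[col]]
        for row in range(m):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    # The potential at s equals the effective resistance between s and t.
    return aug[keep.index(s)][m]


# Toy graphs (hypothetical): a 3-node path and a triangle.
r_path = resistance_distance(3, [(0, 1), (1, 2)], 0, 2)
r_triangle = resistance_distance(3, [(0, 1), (1, 2), (0, 2)], 0, 1)
```

The path gives two unit resistors in series (resistance 2), while the triangle gives one unit resistor in parallel with two in series (resistance 2/3), which matches the usual circuit intuition for this quantity.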

EAAI Journal 2024 Journal Article

Strip and asymmetric aggregation network for unstructured terrain segmentation in wild environments

Wei Li
Shishun Tian
Yuhang Zhang
Muxin Liao
Guoguang Hua
Wenbin Zou

Recent studies have demonstrated the significant importance of two factors: global context and multi-level semantics, for semantic segmentation models. However, obtaining features that extract these two factors typically results in high computational cost, which poses challenges for unstructured terrain segmentation. In this paper, we propose a Strip and Asymmetric Aggregation Network (SAANet) to collect global context and multi-level semantics while ensuring considerable segmentation accuracy to distinguish secure and navigable areas in wild environments. The SAANet network mainly consists of two essential modules: Global Strip Module (GSM) and Asymmetric Fusion Module (AFM). GSM utilizes four stripe convolutions to capture long-range contextual information while maintaining lower computational complexity. AFM consists of two units: Asymmetric Fusion Unit (AFU) and Residual Aggregation Unit (RAU). AFU based on asymmetric structure leverages attention mechanisms to fuse discriminative semantic clues from different scales, aiming to enhance the recognition of objects with high inter-class similarity and obtain an effective multi-level semantic representation. RAU is used to enhance the significant semantic features from AFU to achieve impressive terrain segmentation results. Extensive experimental results on the Robot Unstructured Ground Driving (RUGD) and Rellis Campus of Texas A&M University (RELLIS) datasets demonstrate that SAANet outperforms other state-of-the-art methods in recognizing outdoor unstructured safe drivable terrain.

NeurIPS Conference 2024 Conference Paper

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

Wei Li
Hehe Fan
Yongkang Wong
Mohan Kankanhalli
Yi Yang

Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representation. Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, Egoschema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5 based video agents.

AAAI Conference 2024 Conference Paper

Unraveling Pain Levels: A Data-Uncertainty Guided Approach for Effective Pain Assessment

Xinwei Ji
Xiaomin Chang
Wei Li
Albert Y. Zomaya

Pain, a primary reason for seeking medical help, requires essential pain assessment for effective management. Studies have recognized electrodermal activity (EDA) signaling's potential for automated pain assessment, but traditional algorithms often ignore the noise and uncertainty inherent in pain data. To address this, we propose a learning framework predicated on data uncertainty, introducing two forms: a) subject-level stimulation-reaction drift; b) ambiguity in self-reporting scores. We formulate an uncertainty assessment using Heart Rate Variability (HRV) features to guide the selection of responsive pain profiles and reweight subtask importance based on the vagueness of self-reported data. These methods are integrated within an end-to-end neural network learning paradigm, focusing the detector on more accurate insights within the uncertainty domain. Extensive experimentation on both the publicly available biovid dataset and the proprietary Apon dataset demonstrates our approach's effectiveness. In the biovid dataset, we achieved a 6% enhancement over the state-of-the-art methodology, and on the Apon dataset, our method outperformed baseline approaches by over 20%.

AAAI Conference 2024 Conference Paper

VIGC: Visual Instruction Generation and Correction

Bin Wang
Fan Wu
Xiao Han
Jiahui Peng
Huaping Zhong
Pan Zhang
Xiaoyi Dong
Weijia Li

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes, suffering from understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models to generate instruction data for vision-language tasks. However, it's worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. As a solution for addressing the current issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC

AAAI Conference 2023 Conference Paper

A Composite Multi-Attention Framework for Intraoperative Hypotension Early Warning

Feng Lu
Wei Li
Zhiqiang Zhou
Cheng Song
Yifei Sun
Yuwei Zhang
Yufei Ren
Xiaofei Liao

Early warning of intraoperative hypotension (IOH) events plays a crucial role in preventing postoperative complications, such as postoperative delirium and mortality. Despite significant efforts, two fundamental problems limit its wide clinical use. The well-established IOH event warning systems are often built on proprietary medical devices that may not be available in all hospitals. The warnings are also triggered mainly through a predefined IOH event that might not be suitable for all patients. This work proposes a composite multi-attention (CMA) framework to tackle these problems by conducting short-term predictions on user-definable IOH events using vital signals at a low sampling rate together with demographic characteristics. Our framework leverages a multi-modal fusion network that takes four vital signals and three demographic characteristics as input modalities. For each modality, a multi-attention mechanism is used for feature extraction for better model training. Experiments on two large-scale real-world data sets show that our method can achieve up to 94.1% accuracy on IOH event early warning while the signal sampling rate is reduced by 3000 times. Our proposed CMA can achieve a mean absolute error of 4.50 mm Hg in the most challenging 15-minute mean arterial pressure prediction task, reducing the error by 42.9% compared to existing solutions.

YNICL Journal 2023 Journal Article

Aberrant structural rich club organization in temporal lobe epilepsy with focal to bilateral tonic–clonic seizures

Qiuxing Lin
Wei Li
Yuming Li
Peiwen Liu
Yingying Zhang
Qiyong Gong
Dong Zhou
Dongmei An

OBJECTIVE: The purpose of this study was to assess the differences in topological characteristics and rich club organization between temporal lobe epilepsy (TLE) patients with focal seizure (FS) only and those with focal to bilateral tonic-clonic seizures (FBTCS). METHODS: We recruited 130 unilateral TLE patients, of whom 57 had FS only and 73 had both FS and FBTCS, and 68 age- and gender-matched healthy controls (HC). Whole-brain networks were constructed based on diffusion weighted imaging data. Graph theory was applied to quantify the topological network metrics and rich club organization. Network-based statistic (NBS) analysis was administered to investigate the difference in edge-wise connectivity strength. The non-parametric permutation test was applied to evaluate the differences between groups. Benjamini-Hochberg FDR at the alpha of 5% was carried out for multiple comparisons. RESULTS: In comparison with HC, both the FS and FBTCS groups displayed a significant reduction in whole-brain connectivity strength and global efficiency. The FBTCS group showed lower connectivity strength both in the rich club and feeder connections compared to HC. The FS group had lower connectivity strength in the feeder and local connections compared to HC. NBS analysis revealed a wider range of decreased connectivity strength in the FBTCS group, involving 90% of the rich club regions, mainly affecting temporal-subcortical, frontal-parietal, and frontal-temporal lobe; the majority of the decreased connections were between temporal lobe and stratum. In contrast, the decreased connectivity strength in the FS group was relatively local, involving 50% of rich club regions, mainly concentrated on the temporal-subcortical lobe. CONCLUSIONS: Network integration was reduced in TLE.
TLE with FBTCS selectively disrupted the rich club regions, while TLE with FS only were more likely to affect the non-rich club regions, emphasizing the contribution of rich club organization to seizure generalization.
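
The abstract mentions Benjamini-Hochberg FDR correction at an alpha of 5%. As a reminder of how that step-up procedure works, here is a minimal sketch in plain Python. The function name and the example p-values are made up for illustration and are not taken from the study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    Sort the m p-values, find the largest rank k (1-based) such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])

    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, listed in their original (unsorted) order.
decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], alpha=0.05)
```

Because the procedure is step-up, a p-value can be rejected even if it misses its own rank threshold, as long as a larger-ranked p-value passes.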

TCS Journal 2023 Journal Article

Analysis on methods to effectively improve transfer learning performance

Honghui Xu
Wei Li
Zhipeng Cai

Transfer learning has become a prevailing machine learning technique thanks to its superiority in learning knowledge from limited training data for prediction. In the existing works, collection and collaboration are two major approaches to improving transfer learning performance. Even though the effectiveness of these approaches has been validated in extensive experiments, theoretical analysis to support them is lacking. Consequently, how to enhance transfer learning effectively is an open problem. In light of this, in this paper, we thoroughly study the methods of improving transfer learning performance in order to provide guidelines for applying transfer learning in real applications. Through our proof process, critical conclusions are drawn to help understand the motivation for implementing collection and collaboration, the performance gap between collection and collaboration, and the impacts of data sharing strategies on transfer learning in collaboration. These conclusions can further build a theoretical foundation for future research on transfer learning.

AAAI Conference 2023 Short Paper

AsT: An Asymmetric-Sensitive Transformer for Osteonecrosis of the Femoral Head Detection (Student Abstract)

Haoyang Chen
Shuai Liu
Feng Lu
Wei Li
Bin Sheng
Mi Li
Hai Jin
Albert Y. Zomaya

Early diagnosis of osteonecrosis of the femoral head (ONFH) can inhibit the progression and improve femoral head preservation. The radiograph difference between early ONFH and healthy ones is not apparent to the naked eye. It is also hard to produce a large dataset to train the classification model. In this paper, we propose Asymmetric-Sensitive Transformer (AsT) to capture the uneven development of the bilateral femoral head to enable robust ONFH detection. Our ONFH detection is realized using the self-attention mechanism to femoral head regions while conferring sensitivity to the uneven development by the attention-shared transformer. The real-world experiment studies show that AsT achieves the best performance of AUC 0.9313 in the early diagnosis of ONFH and can find out misdiagnosis cases firmly.

EAAI Journal 2023 Journal Article

Causal-ViT: Robust Vision Transformer by causal intervention

Wei Li
Zhixin Li
Xiwei Yang
Huifang Ma

Artificial intelligence based on deep learning is better at improving the representation ability of models from data. However, due to the limitation of a fixed receptive field, these agents are not able to provide a correct response outside the fixed receptive field. To address this problem, this paper provides a new perspective on improving image recognition tasks. This study first constructs two extended receptive fields using a structural causal model. Then, an approximate intervention method is proposed that changes the traditional likelihood prediction to predict the result of causal intervention. Finally, this study formulates the objective function to adapt the proxy training, which makes the whole model work well. Building on these components, a new Vision Transformer variant named Causal-ViT is proposed. Furthermore, rich experimental results on different tasks are reported. These results show that the proposed perspective makes a significant improvement in image recognition tasks. By simply plugging Causal-ViT into different sub-tasks, each of them sets a new benchmark in its field, which shows that our method is flexible.

NeurIPS Conference 2023 Conference Paper

Cross-Domain Policy Adaptation via Value-Guided Data Filtering

Kang Xu
Chenjia Bai
Xiaoteng Ma
Dong Wang
Bin Zhao
Zhen Wang
Xuelong Li
Wei Li

Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning. For example, a robot learns the policy in a simulator, but when it is deployed in the real world, the dynamics of the environment may be different. Given the source and target domain with dynamics mismatch, we consider the online dynamics adaptation problem, in which case the agent can access sufficient source domain data while online interactions with the target domain are limited. Existing research has attempted to solve the problem from the dynamics discrepancy perspective. In this work, we reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on the value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches.

ICLR Conference 2023 Conference Paper

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Wei Li
Linchao Zhu
Longyin Wen
Yi Yang 0001

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks, e.g., image classification. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior work approaches zero-shot captioning by either utilizing the existing large language models (e.g., GPT-2) or pre-training the encoder-decoder network in an end-to-end manner. However, the large language models may not generate sensible descriptions due to the task discrepancy between captioning and language modeling, while the end-to-end pre-training requires paired data and extensive computational resources. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computation-efficient: 1) it only requires text data for training, easing the burden on the collection of paired data; 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix embedding. The challenge is that the decoder is trained on the text corpus but at the inference stage, it needs to generate captions based on visual inputs. Though the CLIP text embedding and the visual embedding are correlated, the modality gap issue is widely observed in multi-modal contrastive models, which prevents us from directly taking the visual embedding as the prefix embedding. We propose a training-free mechanism to reduce the modality gap. We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input. Taking the projected embedding as the prefix embedding, the decoder generates high-quality descriptions that match the visual input.
The experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods by a large margin on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps. We apply DeCap to video captioning and achieve state-of-the-art zero-shot performance on MSR-VTT and ActivityNet-Captions. The code is available at https://github.com/dhg-wei/DeCap.

EAAI Journal 2023 Journal Article

DeepCrackAT: An effective crack segmentation framework based on learning multi-scale crack features

Qinghua Lin
Wei Li
Xiangpan Zheng
Haoyi Fan
Zuoyong Li

The detection of cracks is essential for assessing and maintaining building and road safety. However, the large appearance variations and the complex topological structures of cracks bring challenges to automatic crack detection. To alleviate the above challenges, we propose a deep multi-scale crack feature learning model called DeepCrackAT for crack segmentation, which is based on an encoder–decoder network with feature tokenization mechanism and attention mechanism. Specifically, we use hybrid dilated convolutions in the first three layers of the encoder–decoder to increase the network’s receptive field and capture more crack information. Then, we introduce a tokenized multilayer perceptron (Tok-MLP) in the last two layers of the encoder–decoder to tokenize and project high-dimensional crack features into low-dimensional space. This helps to reduce parameters and enhance the network’s ability of noise resistance. Next, we concatenate the features corresponding to the encoder–decoder layers and introduce the convolutional block attention module (CBAM) to enhance the network’s perception of the critical crack region. Finally, the five-layer features are fused to generate a binary segmentation map of the crack image. We conducted extensive experiments and ablation studies on two real-world crack datasets, and DeepCrackAT achieved 97.41% and 97.25% accuracy on these datasets, respectively. The experimental results show that the proposed method outperforms the current state-of-the-art methods.

AAAI Conference 2023 Short Paper

ES-Mask: Evolutionary Strip Mask for Explaining Time Series Prediction (Student Abstract)

Yifei Sun
Cheng Song
Feng Lu
Wei Li
Hai Jin
Albert Y. Zomaya

Machine learning models are increasingly used in time series prediction with promising results. The model explanation of time series prediction falls behind the model development and makes less sense to users in understanding model decisions. This paper proposes ES-Mask, a post-hoc and model-agnostic evolutionary strip mask-based saliency approach for time series applications. ES-Mask designs the mask consisting of strips with the same salient value in consecutive time steps to produce binary and sustained feature importance scores over time for easy understanding and interpretation of time series. ES-Mask uses an evolutionary algorithm to search for the optimal mask by manipulating strips in rounds, thus is agnostic to models by involving no internal model states in the search. The initial experiments on MIMIC-III data set show that ES-Mask outperforms state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Yu Qi
Fan Yang
Yousong Zhu
Yufei Liu
Liwei Wu
Rui Zhao
Wei Li

Autoregressive language modeling (ALM) has been successfully used in self-supervised pre-training in natural language processing (NLP). However, this paradigm has not achieved comparable results with other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we try to find the reason why autoregressive modeling does not work well on vision tasks. To tackle this problem, we fully analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) with two simple designs. First, we serialize the image into patches. Second, we employ the stochastic permutation strategy to generate an effective and robust image context, which is critical for vision tasks. To realize this task, we create a parallel encoder-decoder training process in which the encoder serves a similar role to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position so that the encoder and decoder can reinforce each other. Our method significantly improves the performance of autoregressive image modeling and achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also shows that our model achieves competitive performance. Code is available at https://github.com/qiy20/SAIM.

JAIR Journal 2023 Journal Article

FactGen: Faithful Text Generation by Factuality-aware Pre-training and Contrastive Ranking Fine-tuning

ZhiBin Lan
Wei Li
Jinsong Su
Xinyan Xiao
Jiachen Liu
Wenhao Wu
Yajuan Lyu

Conditional text generation is supposed to generate a fluent and coherent target text that is faithful to the source text. Although pre-trained models have achieved promising results, they still suffer from the crucial factuality problem. To deal with this issue, we propose a factuality-aware pretraining-finetuning framework named FactGen, which fully considers factuality during two training stages. Specifically, at the pre-training stage, we utilize a natural language inference model to construct target texts that are entailed by the source texts, resulting in a more factually consistent pre-training objective. Then, during the fine-tuning stage, we further introduce a contrastive ranking loss to encourage the model to generate factually consistent text with higher probability. Extensive experiments on three conditional text generation tasks demonstrate the effectiveness and generality of our training framework.

AAAI Conference 2023 Conference Paper

Federated Generative Model on Multi-Source Heterogeneous Data in IoT

Zuobin Xiong
Wei Li
Zhipeng Cai

The study of generative models is a promising branch of deep learning techniques, which has been successfully applied to different scenarios, such as Artificial Intelligence and the Internet of Things. However, in most of the existing works, the generative models are realized as a centralized structure, raising security and privacy threats and imposing excessive communication costs. Few efforts have been devoted to investigating distributed generative models, especially when the training data comes from multiple heterogeneous sources under realistic IoT settings. In this paper, to handle this challenging problem, we design a federated generative model framework that can learn a powerful generator for hierarchical IoT systems. In particular, our generative model framework can solve the problem of distributed data generation on multi-source heterogeneous data in two scenarios, i.e., the feature-related scenario and the label-related scenario. In addition, in our federated generative models, we develop synchronous and asynchronous updating methods to satisfy different application requirements. Extensive experiments on a simulated dataset and multiple real datasets are conducted to evaluate the data generation performance of our proposed generative models through comparison with the state of the art.

NeurIPS Conference 2023 Conference Paper

GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image

Mingjian Zhu
Hanting Chen
Qiangyu Yan
Xudong Huang
Guanyu Lin
Wei Li
Zhijun Tu
Hailin Hu

The extraordinary ability of generative models to generate photographic images has intensified concerns about the spread of disinformation, thereby leading to the demand for detectors capable of distinguishing between AI-generated fake images and real images. However, the lack of large datasets containing images from the most advanced image generators poses an obstacle to the development of such detectors. In this paper, we introduce the GenImage dataset, which has the following advantages: 1) Plenty of Images, including over one million pairs of AI-generated fake images and collected real images. 2) Rich Image Content, encompassing a broad range of image classes. 3) State-of-the-art Generators, synthesizing images with advanced diffusion models and GANs. The aforementioned advantages allow the detectors trained on GenImage to undergo a thorough evaluation and demonstrate strong applicability to diverse images. We conduct a comprehensive analysis of the dataset and propose two tasks for evaluating the detection method in resembling real-world scenarios. The cross-generator image classification task measures the performance of a detector trained on one generator when tested on the others. The degraded image classification task assesses the capability of the detectors in handling degraded images such as low-resolution, blurred, and compressed images. With the GenImage dataset, researchers can effectively expedite the development and evaluation of superior AI-generated image detectors in comparison to prevailing methodologies.

I&C Journal 2023 Journal Article

Learnability and positive equivalence relations

David Belanger
Ziyuan Gao
Sanjay Jain
Wei Li
Frank Stephan

TMLR Journal 2023 Journal Article

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Weicheng Kuo
AJ Piergiovanni
Dahun Kim
xiyang luo
Benjamin Caine
Wei Li
Abhijit Ogale
Luowei Zhou

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.

EAAI Journal 2023 Journal Article

MEFNET: Multi-expert fusion network for RGB-Thermal semantic segmentation

Wenjie Lai
Fanyu Zeng
Xiao Hu
Wei Li
Shaowei He
Ziji Liu
Yadong Jiang

Semantic segmentation using RGB and thermal images is crucial in a variety of applications, including autonomous driving and video surveillance. However, the validity of information differs between modalities, which is typically addressed by weighting image features using complex and inefficient networks. To address this issue, we propose a Multi-Expert Fusion Network (MEFNet) that decouples the three-dimensional attention matrix of image features into a two-dimensional modal weight matrix and a channel attention vector. This approach focuses more on modal and channel differences while excluding interferences from other factors. Specifically, MEFNet multiplies RGB and thermal features by their respective modal weights, and then uses channel attention to select important feature channels. Comprehensive experiments demonstrate that MEFNet is competitive with state-of-the-art methods, achieving 62.6% mIoU on the IR SEG dataset.

JBHI Journal 2023 Journal Article

Microscopic Hyperspectral Image Classification Based on Fusion Transformer With Parallel CNN

Weijia Zeng
Wei Li
Mengmeng Zhang
Hao Wang
Meng Lv
Yue Yang
Ran Tao

Microscopic hyperspectral image (MHSI) has received considerable attention in the medical field. The wealthy spectral information provides potentially powerful identification ability when combining with advanced convolutional neural network (CNN). However, for high-dimensional MHSI, the local connection of CNN makes it difficult to extract the long-range dependencies of spectral bands. Transformer overcomes this problem well because of its self-attention mechanism. Nevertheless, transformer is inferior to CNN in extracting spatial detailed features. Therefore, a classification framework integrating transformer and CNN in parallel, named as Fusion Transformer (FUST), is proposed for MHSI classification tasks. Specifically, the transformer branch is employed to extract the overall semantics and capture the long-range dependencies of spectral bands to highlight the key spectral information. The parallel CNN branch is designed to extract significant multiscale spatial features. Furthermore, the feature fusion module is developed to effectively fuse and process the features extracted by the two branches. Experimental results on three MHSI datasets demonstrate that the proposed FUST achieves superior performance when compared with state-of-the-art methods.

AAMAS Conference 2023 Conference Paper

Mitigating Imminent Collision for Multi-robot Navigation: A TTC-force Reward Shaping Approach

Jinlin Chen
Jiannong Cao
Zhiqin Cheng
Wei Li

We study the distributed multi-robot navigation problem, which refers to a group of mobile robots avoiding collision with each other while navigating from their start positions to the goal positions. Existing works still suffer from two limitations: they struggle to 1) accurately quantify the risk of collisions for heterogeneous robots and 2) effectively capture the state representation under dynamic environments. These limitations make the heterogeneous robots prone to collisions in high-density and dynamic environments. This work proposes a new time-to-collision force (TTC-force) reward shaping approach, termed Tfresh, incorporating reinforcement learning to learn a policy that adaptively chooses the optimal actions to mitigate the imminent collision. Specifically, we use TTC-force to quantify the risk exerted on each robot by its neighbors and shape the reward signal with TTC-force when applying the reinforcement learning scheme. Meanwhile, we design a spatial attention mechanism involving the dynamic adjacency matrix to capture the state representation effectively. We evaluate the learned policy in numerous simulated scenarios in which groups of mobile robots perform navigation tasks. The experimental results demonstrate that our approach outperforms the state-of-the-art methods regarding success rate, travel distance, and travel time.

TCS Journal 2023 Journal Article

Modeling spatial networks by contact graphs of disk packings

Mingzhe Zhu
Haoxin Sun
Wei Li
Zhongzhi Zhang

Spatial networks are ubiquitous in the real world, with examples including the Internet, power grids, and neural networks. In this paper, based on a proposed disk packing, we present a contact graph as an exactly solvable model for spatial networks. We then derive analytically some critical structural properties of the model, including degree distribution, clustering coefficient, and diameter, and show that it displays simultaneously the striking scale-free small-world phenomena observed in most real networks. We also determine all the eigenvalues and their multiplicities of the transition probability matrix for the contact graph, and exploit the eigenvalues to derive closed-form expressions for the Kemeny constant of random walks and the number of spanning trees on the contact graph. Finally, we derive closed-form expressions for some interesting quantities about resistance distances, including the Kirchhoff index, additive degree-Kirchhoff index, multiplicative degree-Kirchhoff index, as well as average resistance distance. As an application of resistance distances, we study the coherence of the first-order consensus dynamics, which tends to a small constant as the graph grows, implying that the graph is resistant to noise.

JBHI Journal 2023 Journal Article

MS-FRAN: A Novel Multi-Source Domain Adaptation Method for EEG-Based Emotion Recognition

Wei Li
Wei Huan
Shitong Shao
Bowen Hou
Aiguo Song

Electroencephalogram (EEG)-based emotion recognition has gradually become a research hotspot. However, the large distribution differences of EEG signals across subjects have left current research stuck in a dilemma. To resolve this problem, in this article, we propose a novel and effective method, Multi-Source Feature Representation and Alignment Network (MS-FRAN). The effectiveness of the proposed method mainly comes from three new modules: Wide Feature Extractor (WFE) for feature learning, Random Matching Operation (RMO) for model training, and Top-$h$ ranked domain classifier selection (TOP) for emotion classification. MS-FRAN is not only effective in aligning the distributions of each pair of source and target domains, but also capable of reducing the distributional differences among the multiple source domains. Experimental results on the public benchmark datasets SEED and DEAP have demonstrated the advantage of our method over the related competitive approaches for cross-subject EEG-based emotion recognition.

AAAI Conference 2023 Conference Paper

Open-Ended Diverse Solution Discovery with Regulated Behavior Patterns for Cross-Domain Adaptation

Kang Xu
Yan Ma
Bingsheng Wei
Wei Li

While Reinforcement Learning can achieve impressive results for complex tasks, the learned policies are generally prone to fail in downstream tasks with even minor model mismatch or unexpected perturbations. Recent works have demonstrated that a policy population with diverse behavior characteristics can generalize to downstream environments with various discrepancies. However, such policies might result in catastrophic damage during deployment in practical scenarios like real-world systems due to the unrestricted behaviors of trained policies. Furthermore, training diverse policies without regulation of the behavior can result in too few feasible policies for extrapolating to a wide range of test conditions with dynamics shifts. In this work, we aim to train diverse policies under the regularization of the behavior patterns. We motivate our paradigm by observing the inverse dynamics in the environment with partial state information and propose Diversity in Regulation (DiR), which trains diverse policies with regulated behaviors to discover desired patterns that benefit generalization. Extensive empirical results on many variations of different environments indicate that our method attains improvements over other diversity-driven counterparts.

TCS Journal 2023 Journal Article

Optimization on the smallest eigenvalue of grounded Laplacian matrix via edge addition

Xiaotian Zhou
Haoxin Sun
Wei Li
Zhongzhi Zhang

The grounded Laplacian matrix $L_{-S}$ of a graph $G = (V, E)$ with $n = |V|$ nodes and $m = |E|$ edges is an $(n - s) \times (n - s)$ submatrix of its Laplacian matrix $L$, obtained from $L$ by deleting rows and columns corresponding to $s = |S| \ll n$ ground nodes forming set $S \subset V$. The smallest eigenvalue of $L_{-S}$ plays an important role in various practical scenarios, such as characterizing the convergence rate of leader-follower opinion dynamics, with a larger eigenvalue indicating faster convergence of opinion. In this paper, we study the problem of adding $k \ll n$ edges among all the nonexistent edges forming the candidate edge set $Q = (V \times V) \setminus E$, in order to maximize the smallest eigenvalue of the grounded Laplacian matrix. We show that the objective function of the combinatorial optimization problem is monotone but non-submodular. To solve the problem, we first simplify the problem by restricting the candidate edge set $Q$ to be $(S \times (V \setminus S)) \setminus E$, and prove that it has the same optimal solution as the original problem, although the size of set $Q$ is reduced from $O(n^2)$ to $O(n)$. Then, we propose two greedy approximation algorithms. One is a simple greedy algorithm with an approximation ratio $(1 - e^{-\alpha\gamma})/\alpha$ and time complexity $O(kn^4)$, where $\gamma$ and $\alpha$ are, respectively, submodularity ratio and curvature, whose bounds are provided for some particular cases. The other is a fast greedy algorithm without approximation guarantee, which has a running time $\tilde{O}(km)$, where $\tilde{O}(\cdot)$ suppresses the $\mathrm{poly}(\log n)$ factors. Numerous experiments on various real networks are performed to validate the superiority of our algorithms, in terms of effectiveness and efficiency.
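
To make the objective concrete, the sketch below builds a grounded Laplacian for a tiny graph and greedily adds edges between ground and non-ground nodes (the restricted candidate set described in the abstract) so as to raise the smallest eigenvalue. It is a naive illustration, not the paper's algorithm: the function names are hypothetical, and the eigenvalue is estimated with a shifted power iteration chosen only to keep the example dependency-free.

```python
import math


def grounded_laplacian(n, edges, grounded):
    """Laplacian of an undirected graph with the grounded rows/columns removed."""
    keep = [v for v in range(n) if v not in grounded]
    index = {v: i for i, v in enumerate(keep)}
    mat = [[0.0] * len(keep) for _ in keep]
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a in index:
                mat[index[a]][index[a]] += 1.0          # degree counts every edge
                if b in index:
                    mat[index[a]][index[b]] -= 1.0      # off-diagonal only if kept
    return mat


def smallest_eigenvalue(mat, iters=2000):
    """Estimate the smallest eigenvalue via power iteration on (shift*I - mat)."""
    size = len(mat)
    shift = max(sum(abs(x) for x in row) for row in mat)  # Gershgorin upper bound
    vec = [1.0] * size
    for _ in range(iters):
        nxt = [shift * vec[i] - sum(mat[i][j] * vec[j] for j in range(size))
               for i in range(size)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return shift
        vec = [x / norm for x in nxt]
    mv = [sum(mat[i][j] * vec[j] for j in range(size)) for i in range(size)]
    return sum(vec[i] * mv[i] for i in range(size))      # Rayleigh quotient


def greedy_add_edges(n, edges, grounded, k):
    """Add up to k ground-to-non-ground edges, each maximizing the eigenvalue."""
    current = {tuple(sorted(e)) for e in edges}
    chosen = []
    for _ in range(k):
        candidates = [tuple(sorted((s, v)))
                      for s in sorted(grounded)
                      for v in range(n)
                      if v not in grounded and tuple(sorted((s, v))) not in current]
        if not candidates:
            break
        best = max(candidates, key=lambda e: smallest_eigenvalue(
            grounded_laplacian(n, current | {e}, grounded)))
        current.add(best)
        chosen.append(best)
    return chosen, smallest_eigenvalue(grounded_laplacian(n, current, grounded))
```

On the path 0-1-2 with node 0 grounded, the smallest eigenvalue starts at (3 - sqrt(5))/2 and rises to 1 once the edge (0, 2) is added. Re-evaluating every candidate from scratch, as done here, is only practical for very small graphs.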

IJCAI Conference 2023 Conference Paper

PasCore: A Chinese Overlapping Relation Extraction Model Based on Global Pointer Annotation Strategy

Peng Wang
Jiafeng Xie
Xiye Chen
Guozheng Li
Wei Li

Recent work on extracting relations from texts has achieved excellent performance. However, existing studies mainly focus on simple relation extraction, and these methods do not perform well on the overlapping triple problem because the tags of shared entities would conflict with each other. In particular, overlapping entities are common and indispensable in Chinese. To address this issue, this paper proposes PasCore, which utilizes a global pointer annotation strategy for overlapping relation extraction in Chinese. PasCore first obtains the sentence vector via a general pre-trained model encoder, and uses a classifier to predict relations. Subsequently, it uses the global pointer annotation strategy for head entity annotation, which uses global tags to label the start and end positions of the entities. Finally, PasCore integrates the relation, head entity and its type to mark the tail entity. Furthermore, PasCore performs conditional layer normalization to fuse features, which connects all stages and greatly enriches the association between relations and entities. Experimental results on both Chinese and English real-world datasets demonstrate that PasCore outperforms strong baselines on relation extraction and, especially, shows superior performance on overlapping relation extraction.

EAAI Journal 2023 Journal Article

Progressive cross-domain knowledge distillation for efficient unsupervised domain adaptive object detection

Wei Li
Lingqiao Li
Huihua Yang

Unsupervised domain adaptation (UDA) is a technique for relieving domain shifts by transferring relevant domain knowledge from a fully labeled source domain to an unlabeled target domain. While tremendous advances have been witnessed recently, the adoption of deep CNN-based UDA methods in real-world scenarios is still constrained by low-resource computers. Most prior strategies either handle domain shift problems via UDA or compress CNNs using knowledge distillation (KD). In contrast, we seek to implement the model on resource-constrained devices to learn domain adaptive knowledge without sacrificing accuracy. In this paper, we propose a three-step Progressive Cross-domain Knowledge Distillation (PCdKD) paradigm for efficient unsupervised adaptive object detection, since directly alleviating the significant discrepancy across domains could result in unstable training procedures and suboptimal performance. First, we apply pixel-level alignment via image-to-image translation to reduce the appearance discrepancy between different domains. Then, a focal multi-domain discriminator is utilized to train the teacher–student peer networks for gradually distilling domain adaptive knowledge in a cooperative manner. Finally, reliable pseudo labels obtained by the adapted teacher detector are further utilized to retrain the teacher–student models. Our proposed method can boost the transferability of the teacher model as well as enhance the student model to meet the demand of real-time applications. Comprehensive experiments on four different cross-domain datasets show that our PCdKD outperforms most existing state-of-the-art approaches.

EAAI Journal 2023 Journal Article

SAR ship localization method with denoising and feature refinement

Cheng Zha
Weidong Min
Qing Han
Wei Li
Xin Xiong
Qi Wang
Meng Zhu

Synthetic Aperture Radar (SAR) ship detection is greatly important to marine transportation monitoring and fishery resource management. To improve the detection accuracy of small ships, an SAR ship localization method with Denoising and Feature Refinement (DFR) is proposed in this paper. It consists of three parts. The first part is the denoising module, which uses non-local means to suppress the speckle noise of the SAR image. The second part is the Hierarchical Feature Fusion (HFF) module. It can integrate more low-level features by adding skip connections. This prevents the low-level spatial position information of the fused features from being diluted by high-level semantic information, and is therefore beneficial to the detection of small ships. The third part is a center-based ship predictor with Feature Refinement (FR). The FR module is proposed to refine the features and reduce the background interference, which is conducive to locating ships more accurately. Extensive experiments are conducted. The experimental results show that after adding the denoising and FR modules, the value of AP0.5 is increased by 1.7% and 2.3%, respectively, which proves the effectiveness of these two modules. In inshore and offshore scenarios, the AP0.5 values of DFR are 0.884 and 0.966, respectively, achieving the best results. The proposed method can also be generalized to mark lesion locations in medical images and detect offshore oil production platforms.

AAAI Conference 2023 Conference Paper

SKIER: A Symbolic Knowledge Integrated Model for Conversational Emotion Recognition

Wei Li
Luyao Zhu
Rui Mao
Erik Cambria

Emotion recognition in conversation (ERC) has received increasing attention from the research community. However, the ERC task is challenging, largely due to the complex and unstructured properties of multi-party conversations. Besides, the majority of daily dialogues take place in a specific context or circumstance, which requires rich external knowledge to understand the background of a certain dialogue. In this paper, we address these challenges by explicitly modeling the discourse relations between utterances and incorporating symbolic knowledge into multi-party conversations. We first introduce a dialogue parsing algorithm into ERC and further improve the algorithm through a transfer learning method. Moreover, we leverage different symbolic knowledge graph relations to learn knowledge-enhanced features for the ERC task. Extensive experiments on three benchmarks demonstrate that both dialogue structure graphs and symbolic knowledge are beneficial to the model performance on the task. Additionally, experimental results indicate that the proposed model surpasses baseline models on several indices.

AAAI Conference 2023 Conference Paper

SSDA3D: Semi-supervised Domain Adaptation for 3D Object Detection from Point Cloud

Yan Wang
Junbo Yin
Wei Li
Pascal Frossard
Ruigang Yang
Jianbing Shen

LiDAR-based 3D object detection is an indispensable task in advanced autonomous driving systems. Though impressive detection results have been achieved by superior 3D detectors, they suffer from significant performance degeneration when facing unseen domains, such as different LiDAR configurations, different cities, and weather conditions. The mainstream approaches tend to solve these challenges by leveraging unsupervised domain adaptation (UDA) techniques. However, these UDA solutions yield unsatisfactory 3D detection results when there is a severe domain shift, e.g., from Waymo (64-beam) to nuScenes (32-beam). To address this, we present a novel Semi-Supervised Domain Adaptation method for 3D object detection (SSDA3D), where only a small amount of labeled target data is available, yet it can significantly improve the adaptation performance. In particular, our SSDA3D includes an Inter-domain Adaptation stage and an Intra-domain Generalization stage. In the first stage, an Inter-domain Point-CutMix module is presented to efficiently align the point cloud distribution across domains. The Point-CutMix generates mixed samples of an intermediate domain, thus encouraging the model to learn domain-invariant knowledge. Then, in the second stage, we further enhance the model for better generalization on the unlabeled target set. This is achieved by exploring Intra-domain Point-MixUp in semi-supervised learning, which essentially regularizes the pseudo label distribution. Experiments from Waymo to nuScenes show that, with only 10% labeled target data, our SSDA3D can surpass the fully-supervised oracle model with 100% target labels. Our code is available at https://github.com/yinjunbo/SSDA3D.

JBHI Journal 2023 Journal Article

The Contrastive Network With Convolution and Self-Attention Mechanisms for Unsupervised Cell Segmentation

Yuhang Zhao
Xianhao Shao
Cai Chen
Junlin Song
Chongxuan Tian
Wei Li

Deep learning for cell instance segmentation is a significant research direction in biomedical image analysis. Traditional supervised learning methods rely on pixel-wise annotation of object images to train the models, which is often time-consuming and labor-intensive. Various modified segmentation methods, based on weakly supervised or semi-supervised learning, have been proposed to recognize cell regions by using only rough annotations of cell positions. However, most approaches still fall short of being fully unsupervised, because using a few annotations for training remains inevitable. In this article, we propose an end-to-end unsupervised model that can segment individual cell regions on hematoxylin and eosin (H&E) stained slides without any annotation. Compared with weakly or semi-supervised methods, the input of our model is in the form of raw data without any identifiers and there is no need to generate pseudo-labels during training. We demonstrate that the performance of our model is satisfactory and that it also has a great generalization ability on various validation sets compared with supervised models. The ablation experiment shows that our backbone is better at capturing object edge and context information than a pure CNN or transformer under our unsupervised method.

AAAI Conference 2023 Conference Paper

Transformation-Equivariant 3D Object Detection for Autonomous Driving

Hai Wu
Chenglu Wen
Wei Li
Xin Li
Ruigang Yang
Cheng Wang

3D object detection has received increasing attention in autonomous driving recently. Objects in 3D scenes are distributed with diverse orientations. Ordinary detectors do not explicitly model the variations of rotation and reflection transformations. Consequently, large networks and extensive data augmentation are required for robust detection. Recent equivariant networks explicitly model the transformation variations by applying shared networks on multiple transformed point clouds, showing great potential in object geometry modeling. However, it is difficult to apply such networks to 3D object detection in autonomous driving due to their large computation cost and slow inference speed. In this work, we present TED, an efficient Transformation-Equivariant 3D Detector to overcome the computation cost and speed issues. TED first applies a sparse convolution backbone to extract multi-channel transformation-equivariant voxel features, and then aligns and aggregates these equivariant features into lightweight and compact representations for high-performance 3D object detection. On the highly competitive KITTI 3D car detection leaderboard, TED ranked 1st among all submissions with competitive efficiency. Code is available at https://github.com/hailanyi/TED.

NeurIPS Conference 2023 Conference Paper

TransHP: Image Classification with Hierarchical Prompting

Wenhao Wang
Yifan Sun
Wei Li
Yi Yang

This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits the descendant-class discrimination. We think it well imitates human visual recognition, i.e., humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We model this prompting mechanism into a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) on-the-fly predicting the coarse class of the input image at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on relatively subtle differences among the descendant classes. Extensive experiments show that TransHP improves image classification on accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP exploits the hierarchical information well. The code is available at: https://github.com/WangWenhao0716/TransHP.

NeurIPS Conference 2023 Conference Paper

“Why Not Looking backward?” A Robust Two-Step Method to Automatically Terminate Bayesian Optimization

Shuang Li
Ke Li
Wei Li

Bayesian Optimization (BO) is a powerful method for tackling expensive black-box optimization problems. As a sequential model-based optimization strategy, BO iteratively explores promising solutions until a predetermined budget, either iterations or time, is exhausted. The decision on when to terminate BO significantly influences both the quality of solutions and its computational efficiency. In this paper, we propose a simple, yet theoretically grounded, two-step method for automatically terminating BO. Our core concept is to proactively identify if the search is within a convex region by examining previously observed samples. BO is halted once the local regret within this convex region falls below a predetermined threshold. To enhance numerical stability, we propose an approximation method for calculating the termination indicator by solving a bilevel optimization problem. We conduct extensive empirical studies on diverse benchmark problems, including synthetic functions, reinforcement learning, and hyperparameter optimization. Experimental results demonstrate that our proposed method saves up to $\approx 80\%$ of the computational budget while incurring an order of magnitude smaller performance degradation than the other peer methods. In addition, our proposed termination method is robust in terms of the setting of its termination criterion.

EAAI Journal 2022 Journal Article

A multi-task learning for cavitation detection and cavitation intensity recognition of valve acoustic signals

Yu Sha
Johannes Faber
Shuiping Gou
Bo Liu
Wei Li
Stefan Schramm
Horst Stoecker
Thomas Steckenreiter

With the rapid development of smart manufacturing, data-driven machinery health management has received growing attention. As one of the most popular methods in machinery health management, deep learning (DL) has achieved remarkable successes. However, limited samples and poor separability of the different cavitation states of acoustic signals greatly hinder the eventual performance of DL models for cavitation intensity recognition and cavitation detection. In addition, the different tasks have conventionally been performed separately. In this work, a novel multi-task learning framework for simultaneous cavitation detection and cavitation intensity recognition using 1-D double hierarchical residual networks (1-D DHRN) is proposed for analyzing valve acoustic signals. Firstly, a data augmentation method based on a sliding window with fast Fourier transform (Swin-FFT) is developed to alleviate the small-sample issue confronted in this study. Secondly, a 1-D double hierarchical residual block (1-D DHRB) is constructed to capture sensitive features from the frequency-domain acoustic signals of the valve. Then, a new structure of 1-D DHRN is proposed. Finally, the devised 1-D DHRN is evaluated on two datasets of valve acoustic signals without noise (Dataset 1 and Dataset 2) and one dataset of valve acoustic signals with realistic surrounding noise (Dataset 3) provided by SAMSON AG (Frankfurt). Our method has achieved state-of-the-art results. The prediction accuracies of 1-D DHRN for cavitation intensity recognition are as high as 93.75%, 94.31% and 100%, which indicates that 1-D DHRN outperforms other DL models and conventional methods. At the same time, the testing accuracies of 1-D DHRN for cavitation detection are as high as 97.02%, 97.64% and 100%. In addition, 1-D DHRN has also been tested at different sampling frequencies and shows excellent results at sampling frequencies that mobile phones can accommodate.

JBHI Journal 2022 Journal Article

A Transferable Deep Learning Prognosis Model for Predicting Stroke Patients' Recovery in Different Rehabilitation Trainings

Ping-Ju Lin
Xiaoxue Zhai
Wei Li
Tianyi Li
Dandan Cheng
Chong Li
Yu Pan
Linhong Ji

Since the underlying mechanisms of neurorehabilitation are not fully understood, the prognosis of stroke recovery faces significant difficulties. Recovery outcomes can vary when undergoing different treatments; however, few models have been developed to predict patient outcomes toward multiple treatments. In this study, we aimed to investigate the potential of predicting a treatment's outcome using a deep learning prognosis model developed for another treatment. A total of 15 stroke survivors were recruited in this study, and their clinical and physiological data were measured before and after the treatment (clinical measurement, biomechanical measurement, and electroencephalography (EEG) measurement). Multiple biomarkers and clinical scale scores of patients who had completed manual stretching rehabilitation training were analyzed. Data were used to train deep learning prognosis models, yielding an 87.50% prognosis accuracy. Pre-trained prognosis models were then applied to patients who completed robotic-assisted stretching training, yielding a prognosis accuracy of 91.84%. Interpretation of the deep learning models revealed several key factors influencing patients' recoveries, including the plantar-flexor active range of movement (r = 0.930, P = 0.02), dorsiflexor strength (r = 0.932, P = 0.002), plantar-flexor strength (r = 0.930, P = 0.002), EEG power spectrum density and EEG functional connectivities in the occipital, central parietal, and parietal areas. Our results suggest (i) that deep learning can be a promising method for accurate prediction of the recovery potential of stroke patients in clinical scenarios and (ii) that it can be successfully applied to different rehabilitation trainings with explainable factors.

NeurIPS Conference 2022 Conference Paper

DeepInteraction: 3D Object Detection via Modality Interaction

Zeyu Yang
Jiaqi Chen
Zhenwei Miao
Wei Li
Xiatian Zhu
Li Zhang

Existing top-performance 3D object detectors typically rely on the multi-modal fusion strategy. This design is, however, fundamentally restricted because it overlooks useful modality-specific information, which ultimately hampers model performance. To address this limitation, in this work we introduce a novel modality interaction strategy where individual per-modality representations are learned and maintained throughout, enabling their unique characteristics to be exploited during object detection. To realize this proposed strategy, we design a DeepInteraction architecture characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments on the large-scale nuScenes dataset show that our proposed method surpasses all prior art, often by a large margin. Crucially, our method ranks first on the highly competitive nuScenes object detection leaderboard.

NeurIPS Conference 2022 Conference Paper

Delving into Out-of-Distribution Detection with Vision-Language Representations

Yifei Ming
Ziyang Cai
Jiuxiang Gu
Yiyou Sun
Wei Li
Yixuan Li

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot OOD detection method based on aligning visual features with textual concepts. We contribute in-depth analysis and theoretical insights to understand the effectiveness of MCM. Extensive experiments demonstrate that MCM achieves superior performance on a wide variety of real-world tasks. MCM with vision-language features outperforms a common baseline with pure visual features on a hard OOD task with semantically similar classes by 13.1% (AUROC). Code is available at https://github.com/deeplearning-wisc/MCM.
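
The idea of matching an image feature against textual concept features can be sketched as a scoring function. The abstract does not spell out the score, so the form below (softmax over cosine similarities, then the maximum probability) is an assumption for illustration, as are the function names and the temperature parameter; real features would come from a vision-language encoder rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def concept_matching_score(image_feat, concept_feats, temperature=1.0):
    """Assumed score: largest softmax probability over concept similarities.

    A high score means the image aligns clearly with one known concept;
    a low score suggests the image may be out-of-distribution.
    """
    sims = [cosine(image_feat, c) for c in concept_feats]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]  # stable softmax
    return max(exps) / sum(exps)
```

An input would then be flagged as OOD when its score falls below a threshold chosen on held-out data; the threshold itself is not part of this sketch.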

JBHI Journal 2022 Journal Article

Improving the Robustness and Adaptability of sEMG-Based Pattern Recognition Using Deep Domain Adaptation

Ping Shi
Xinran Zhang
Wei Li
Hongliu Yu

Pattern recognition (PR) based on surface electromyography (sEMG) could improve the quality of daily life of amputees. However, the lack of robustness and adaptability hinders its practical application. To realize long-term reliability and user adaptability simultaneously, a novel multi-task dual-stream supervised domain adaptation (MDSDA) network based on a convolutional neural network (CNN) was proposed. A long-term multi-subject sEMG signal acquisition was conducted to validate the performance of MDSDA, recruiting 12 able-bodied subjects. A total of thirty gestures were used for the acquisition, including one set of static gestures and two sets of dynamic gestures. The long-term multi-subject sEMG dataset is publicly available at the website. Four train-test estimations were designed to evaluate the robustness and adaptability of MDSDA. The results showed that MDSDA outperformed CNN and fine-tuning. Furthermore, we studied the separability between static and dynamic gestures that performed similar actions. The outcomes demonstrated that there existed high separability between them. This may be helpful to reduce the signal collection burden. Experimental results proved that MDSDA has the potential to provide a robust and generalized PR system for clinical applications.

NeurIPS Conference 2022 Conference Paper

Learning from Future: A Novel Self-Training Framework for Semantic Segmentation

Ye Du
Yujun Shen
Haochen Wang
Jingjing Fei
Wei Li
Liwei Wu
Rui Zhao
Zehua Fu

Self-training has shown great potential in semi-supervised learning. Its core idea is to use the model learned on labeled data to generate pseudo-labels for unlabeled samples, and in turn teach itself. To obtain valid supervision, active attempts typically employ a momentum teacher for pseudo-label prediction yet observe the confirmation bias issue, where the incorrect predictions may provide wrong supervision signals and get accumulated in the training process. The primary cause of such a drawback is that the prevailing self-training framework acts as guiding the current state with previous knowledge because the teacher is updated with the past student only. To alleviate this problem, we propose a novel self-training strategy, which allows the model to learn from the future. Concretely, at each training step, we first virtually optimize the student (i.e., caching the gradients without applying them to the model weights), then update the teacher with the virtual future student, and finally ask the teacher to produce pseudo-labels for the current student as the guidance. In this way, we manage to improve the quality of pseudo-labels and thus boost the performance. We also develop two variants of our future-self-training (FST) framework through peeping at the future both deeply (FST-D) and widely (FST-W). Taking the tasks of unsupervised domain adaptive semantic segmentation and semi-supervised semantic segmentation as the instances, we experimentally demonstrate the effectiveness and superiority of our approach under a wide range of settings. Code is available at https://github.com/usr922/FST.
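
The three-step update described above (virtual student step, teacher update from the virtual student, pseudo-labels for the current student) can be illustrated on a toy one-parameter model. Everything concrete here is an assumption for illustration: the scalar linear model, the squared-error loss, the momentum (EMA) teacher update, and the choice to keep the look-ahead teacher after the step are not taken from the paper.

```python
def fst_step(student, teacher, x, lr=0.1, momentum=0.9):
    """One toy look-ahead self-training step on the scalar model y = w * x.

    Returns the updated (student, teacher) weights.
    """
    # Gradient of (student*x - pseudo)^2 with the current teacher's pseudo-label.
    pseudo = teacher * x
    grad = 2.0 * (student * x - pseudo) * x

    # 1) Virtual optimization: compute the future student without committing it.
    virtual_student = student - lr * grad

    # 2) Update the teacher with the virtual future student (momentum average).
    future_teacher = momentum * teacher + (1.0 - momentum) * virtual_student

    # 3) The updated teacher produces the pseudo-label that guides the
    #    current (not the virtual) student.
    new_pseudo = future_teacher * x
    new_grad = 2.0 * (student * x - new_pseudo) * x
    new_student = student - lr * new_grad
    return new_student, future_teacher
```

Starting from a student weight of 0 and a teacher weight of 1 with x = 1, one step moves the teacher to 0.92 and the student to 0.184. In a real segmentation setup the gradient would also include the supervised loss on labeled data, which this sketch omits.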

IJCAI Conference 2022 Conference Paper

Learning Graph-based Residual Aggregation Network for Group Activity Recognition

Wei Li
Tianzhao Yang
Xiao Wu
Zhaoquan Yuan

Group activity recognition aims to understand the overall behavior performed by a group of people. Recently, some graph-based methods have made progress by learning the relation graphs among multiple persons. However, the differences between an individual and others play an important role in identifying confusable group activities, and these differences have not been thoroughly explored by previous methods. In this paper, a novel Graph-based Residual AggregatIon Network (GRAIN), which is end-to-end trainable, is proposed to model the differences among all persons of the whole group. Specifically, a new local residual relation module is explicitly proposed to capture the local spatiotemporal differences of relevant persons, which is further combined with the multi-graph relation networks. Moreover, a weighted aggregation strategy is devised to adaptively select multi-level spatiotemporal features, from appearance-level information to high-level relations. Finally, our model is capable of extracting a comprehensive representation and inferring the group activity in an end-to-end manner. The experimental results on two popular benchmarks for group activity recognition clearly demonstrate the superior performance of our method in comparison with the state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks

Zhiyang Chen
Yousong Zhu
Zhaowen Li
Fan Yang
Wei Li
Haixin Wang
Chaoyang Zhao
Liwei Wu

Visual tasks vary a lot in their output formats and the content they are concerned with, so it is hard to process them with an identical structure. One main obstacle lies in the high-dimensional outputs in object-level visual tasks. In this paper, we propose an object-centric vision framework, Obj2Seq. Obj2Seq takes objects as basic units, and regards most object-level visual tasks as sequence generation problems of objects. Therefore, these visual tasks can be decoupled into two steps: first recognize objects of given categories, and then generate a sequence for each of these objects. The definition of the output sequences varies for different tasks, and the model is supervised by matching these sequences with ground-truth targets. Obj2Seq is able to flexibly determine input categories to satisfy customized requirements, and can be easily extended to different visual tasks. When experimenting on MS COCO, Obj2Seq achieves 45.7% AP on object detection, 89.0% AP on multi-label classification and 65.0% AP on human pose estimation. These results demonstrate its potential to be generally applied to different visual tasks. Code has been made available at: https://github.com/CASIA-IVA-Lab/Obj2Seq.

AAAI Conference 2022 Conference Paper

OoDHDR-Codec: Out-of-Distribution Generalization for HDR Image Compression

Linfeng Cao
Aofan Jiang
Wei Li
Huaying Wu
Nanyang Ye

Recently, deep learning has been proven to be a promising approach in standard dynamic range (SDR) image compression. However, due to the wide luminance distribution of high dynamic range (HDR) images and the lack of large standard datasets, developing a deep model for HDR image compression is much more challenging. To tackle this issue, we view HDR data as distributional shifts of SDR data, so that HDR image compression can be modeled as an out-of-distribution (OoD) generalization problem. Herein, we propose a novel out-of-distribution (OoD) HDR image compression framework (OoDHDR-codec). It learns the general representation across HDR and SDR environments, and allows the model to be trained effectively using a large set of SDR datasets supplemented with much fewer HDR samples. Specifically, OoDHDR-codec consists of two branches to process the data from two environments. The SDR branch is a standard black-box network. For the HDR branch, we develop a hybrid system that models luminance masking and tone mapping with white-box modules and performs content compression with black-box neural networks. To improve the generalization from SDR training data on HDR data, we introduce an invariance regularization term to learn the common representation for both SDR and HDR compression. Extensive experimental results show that the OoDHDR codec achieves strongly competitive in-distribution performance and state-of-the-art OoD performance. To the best of our knowledge, our proposed approach is the first work to model HDR compression as OoD generalization problems and our OoD generalization algorithmic framework can be applied to any deep compression model in addition to the network architectural choice demonstrated in the paper. Code available at https://github.com/caolinfeng/OoDHDR-codec.

AAAI Conference 2022 Short Paper

Transformer-Based Unsupervised Learning for Early Detection of Sepsis (Student Abstract)

Yutao Dou
Wei Li
Albert Y. Zomaya

A 6-hour early detection of sepsis leads to a significant increase in the chance of surviving it. Previous sepsis early detection studies have focused on improving the performance of supervised learning algorithms while ignoring the potential correlation in data mining, and there was no reliable method to deal with the problem of incomplete data. In this paper, we proposed the Denoising Transformer AutoEncoder (DTAE) for the first time combining transformer and unsupervised learning. DTAE can learn the correlation of the features required for early detection of sepsis without the label. This method can effectively solve the problems of data sparsity and noise and discover the potential correlation of features by adding DTAE enhancement module without modifying the existing algorithms. Finally, the experimental results show that the proposed method improves the existing algorithms and achieves the best results of early detection.

EAAI Journal 2022 Journal Article

UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning

Wei Li
YaJun Du
XianYong Li
XiaoLiang Chen
Chunzhi Xie
Hui Li
Xiaolei Li

With the rapid growth of Internet penetration, more and more people choose the Internet to express their views on topics of interest. In recent years, named entity recognition (NER) has become a popular task for obtaining structured information from public opinion text. At present, NER models with good results, such as deep learning models, need a lot of labeled data for training. However, this gives rise to a problem: labeling a large amount of data requires a lot of human resources, which is thankless in some areas. Therefore, in this paper, we propose a NER model combining active learning and deep learning methods. Firstly, the active learning method can solve the above problem. The strategy combines uncertainty-based sampling and diversity-based sampling to estimate the information of data. We use highly informative data as the initial training dataset. Secondly, this paper uses a deep learning model combining bidirectional encoder representations from Transformers, bidirectional long–short-term memory and conditional random field (BERT-BiLSTM-CRF). BERT extracts the semantic features of data, and BiLSTM predicts the probability distribution of entity labels. We use the CRF for decoding the probability distribution into corresponding entity labels. Finally, we use the initial training dataset for training BERT-BiLSTM-CRF. This model predicts the entity labels of the unlabeled data. Then, we judge if the machine-labeled data is highly reliable and add the highly reliable data to the initial training dataset. The updated dataset retrains the NER model, so that the trained model has higher precision than the previous model. The results show that our model performs well without a large number of labeled datasets. The model achieves a precision value of 70.31%, recall rate of 74.93% and F1 score of 72.55% in the named entity recognition task, which proves the effectiveness of our model. Besides, the F1 score of BERT-BiLSTM-CRF with uncertainty-based sampling and diversity-based sampling (UD_BBC) is higher than the BiLSTM-CRF based on maximum normalized log-probability (MNLP_BiLSTM-CRF) by 9.00%, when recognizing overall entity categories. It provides a solution to the problem of named entity recognition in educational public opinion.
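
The three scores quoted in this abstract are mutually consistent, since F1 is the harmonic mean of precision and recall. A minimal Python check, using only the percentages reported above (the function name `f1_score` is illustrative and not taken from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
reported_precision = 0.7031
reported_recall = 0.7493

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.2f}%")  # agrees with the reported 72.55% after rounding
```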

JBHI Journal 2021 Journal Article

Discriminant Tensor-Based Manifold Embedding for Medical Hyperspectral Imagery

Meng Lv
Wei Li
Tianhong Chen
Jun Zhou
Ran Tao

Medical hyperspectral imagery has recently attracted considerable attention. However, for identification tasks, the high dimensionality of hyperspectral images usually leads to poor performance. Thus, dimensionality reduction (DR) is crucial in hyperspectral image analysis. Motivated by exploiting the underlying structure information of medical hyperspectral images and enhancing the discriminant ability of features, a discriminant tensor-based manifold embedding (DTME) is proposed for discriminant analysis of medical hyperspectral images. Based on the idea of manifold learning, a new discriminant similarity metric is designed, which takes into account the tensor representation, sparsity, low-rank and distribution characteristics. Then, an inter-class tensor graph and an intra-class tensor graph are constructed using the new similarity metric to reveal the intrinsic manifold of hyperspectral data. Dimensionality reduction is achieved by embedding these supervised tensor graphs into the low-dimensional tensor subspace. Experimental results on membranous nephropathy and white blood cell identification tasks demonstrate the potential clinical value of the proposed DTME.

AAAI Conference 2021 Conference Paper

Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex Optimization

Wei Tao
Wei Li
Zhisong Pan
Qing Tao

Averaging scheme has attracted extensive attention in deep learning as well as traditional machine learning. It achieves theoretically optimal convergence and also improves the empirical model performance. However, there is still a lack of sufficient convergence analysis for strongly convex optimization. Typically, the convergence about the last iterate of gradient descent methods, which is referred to as individual convergence, fails to attain its optimality due to the existence of logarithmic factor. In order to remove this factor, we first develop gradient descent averaging (GDA), which is a general projection-based dual averaging algorithm in the strongly convex setting. We further present primal-dual averaging for strongly convex cases (SC-PDA), where primal and dual averaging schemes are simultaneously utilized. We prove that GDA yields the optimal convergence rate in terms of output averaging, while SC-PDA derives the optimal individual convergence. Several experiments on SVMs and deep learning models validate the correctness of theoretical analysis and effectiveness of algorithms.
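
The distinction the abstract draws between output averaging and individual (last-iterate) convergence can be illustrated with a toy experiment. The sketch below is plain stochastic gradient descent with a 1/(λt) step size on a one-dimensional strongly convex quadratic; it is not the paper's GDA or SC-PDA algorithm, and the objective, noise level, and function names are assumptions made purely for illustration.

```python
import random


def sgd_strongly_convex(grad, x0, lam, steps):
    """Run SGD with step size 1/(lam * t).

    Returns (last iterate, uniform average of all iterates).
    """
    x = x0
    running_sum = 0.0
    for t in range(1, steps + 1):
        x = x - grad(x) / (lam * t)
        running_sum += x
    return x, running_sum / steps


random.seed(0)
TARGET = 3.0  # minimizer of the toy objective f(x) = 0.5 * (x - TARGET)^2


def noisy_grad(x):
    """Exact gradient of the toy objective plus unit-variance Gaussian noise."""
    return (x - TARGET) + random.gauss(0.0, 1.0)


last, avg = sgd_strongly_convex(noisy_grad, x0=0.0, lam=1.0, steps=2000)
print(f"last iterate: {last:.3f}, averaged iterate: {avg:.3f}")
```

Both outputs should land near the minimizer at 3.0; comparing how quickly each approaches it as `steps` grows is the kind of question the convergence analysis in the abstract addresses.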

YNICL Journal 2021 Journal Article

Hippocampal subfield and anterior-posterior segment volumes in patients with sporadic amyotrophic lateral sclerosis

Shuangwu Liu
Qingguo Ren
Gaolang Gong
Yuan Sun
Bing Zhao
Xiaotian Ma
Na Zhang
Suyu Zhong

Neuroimaging studies of hippocampal volumes in patients with amyotrophic lateral sclerosis (ALS) have reported inconsistent results. Our aims were to demonstrate that such discrepancies are largely due to atrophy of different regions of the hippocampus that emerge in different disease stages of ALS and to explore the existence of co-pathology in ALS patients. We used the well-validated King's clinical staging system for ALS to classify patients into different disease stages. We investigated in vivo hippocampal atrophy patterns across subfields and anterior-posterior segments in different King's stages using structural MRI in 76 ALS patients and 94 healthy controls (HCs). The thalamus, corticostriatal tract and perforant path were used as structural controls to compare the sequence of alterations between these structures and the hippocampal subfields. Compared with HCs, ALS patients at King's stage 1 had lower volumes in the bilateral posterior subiculum and presubiculum; ALS patients at King's stage 2 exhibited lower volumes in the bilateral posterior subiculum, left anterior presubiculum and left global hippocampus; ALS patients at King's stage 3 showed significantly lower volumes in the bilateral posterior subiculum, dentate gyrus and global hippocampus. Thalamic atrophy emerged at King's stage 3. White matter tracts remained normal in a subset of ALS patients. Our study demonstrated that the pattern of hippocampal atrophy in ALS patients varies greatly across King's stages. Future studies in ALS patients that focus on the hippocampus may help to further clarify possible co-pathologies in ALS.

IJCAI Conference 2021 Conference Paper

Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling

Liang Zhao
Wei Li
Ruihan Bao
Keiko Harimoto
Yunfang Wu
Xu Sun

Trading volume movement prediction is key in a variety of financial applications. Despite its importance, there is little research on this topic because it requires a comprehensive understanding of information from different sources. For instance, the relation between multiple stocks, recent transaction data and suddenly released events are all essential for understanding the trading market. However, most of the previous methods only take the fluctuation information of the past few weeks into consideration, thus yielding poor performance. To handle this issue, we propose a graph-based approach that can incorporate multi-view information, i.e., long-term stock trend, short-term fluctuation and sudden events information jointly into a temporal heterogeneous graph. Besides, our method is equipped with deep canonical analysis to highlight the correlations between different perspectives of fluctuation for better prediction. Experiment results show that our method outperforms strong baselines by a large margin.

NeurIPS Conference 2021 Conference Paper

MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li
Zhiyang Chen
Fan Yang
Wei Li
Yousong Zhu
Chaoyang Zhao
Rui Deng
Liwei Wu

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to the downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation, which outperforms supervised methods with the same epoch by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation only with 100-epoch pre-training.

AIIM Journal 2021 Journal Article

NIA-Network: Towards improving lung CT infection detection for COVID-19 diagnosis

Wei Li
Jinlin Chen
Ping Chen
Lequan Yu
Xiaohui Cui
Yiwei Li
Fang Cheng
Wen Ouyang

During pandemics (e.g., COVID-19), physicians have to focus on diagnosing and treating patients, so often only a limited number of labeled CT images is available. Although recent semi-supervised learning algorithms may alleviate the problem of annotation scarcity, limited real-world CT images still cause those algorithms to produce inaccurate detection results, especially in real-world COVID-19 cases. Existing models often cannot detect the small infected regions in COVID-19 CT images. As a result, many patients with minor symptoms are misdiagnosed and develop more severe symptoms, leading to higher mortality. In this paper, we propose a new method to address this challenge. We detect not only severe cases but also minor symptoms using real-world COVID-19 CT images, where the source domain includes only limited labeled CT images but the target domain has many unlabeled CT images. Specifically, we adopt Network-in-Network and Instance Normalization to build a new module (we term it NI module) and extract discriminative representations from CT images from both source and target domains. A domain classifier is utilized to implement infected region adaptation from source domain to target domain in an Adversarial Learning manner, and learns a domain-invariant region proposal network (RPN) in the Faster R-CNN model. We call our model NIA-Network (Network-in-Network, Instance Normalization and Adversarial Learning), and conduct extensive experiments on two COVID-19 datasets to validate our approach. The experimental results show that our model can effectively detect infected regions with different sizes and achieve the highest diagnostic accuracy compared with existing SOTA methods.

JBHI Journal 2021 Journal Article

Spatial-Spectral Density Peaks-Based Discriminant Analysis for Membranous Nephropathy Classification Using Microscopic Hyperspectral Images

Meng Lv
Wei Li
Ran Tao
Nigel H. Lovell
Yue Yang
Tianqi Tu
Wenge Li

The traditional differential diagnosis of membranous nephropathy (MN) mainly relies on clinical symptoms, serological examination and optical renal biopsy. However, there is a probability of false positives in the optical inspection results, and it is unable to detect the change of biochemical components, which poses an obstacle to pathogenic mechanism analysis. Microscopic hyperspectral imaging can reveal detailed component information of immune complexes, but the high dimensionality of microscopic hyperspectral images (MHSI) brings difficulties and challenges to image processing and disease diagnosis. In this paper, a novel classification framework, including spatial-spectral density peaks-based discriminant analysis (SSDP), is proposed for intelligent diagnosis of MN using a microscopic hyperspectral pathological dataset. SSDP constructs a set of graphs describing the intrinsic structure of MHSI in both spatial and spectral domains by employing density peak clustering. In the process of graph embedding, low-dimensional features with important diagnostic information in the immune complex are obtained by compacting the spatial-spectral local intra-class pixels while separating the spectral inter-class pixels. For the MN recognition task, a support vector machine (SVM) is used to classify pixels in the low-dimensional space. Experimental validation data employ two types of MN that are difficult to distinguish with an optical microscope, including primary MN and hepatitis B virus-associated MN. Experimental results show that the proposed SSDP achieves a sensitivity of 99.36%, which has potential clinical value for automatic diagnosis of MN.

YNICL Journal 2021 Journal Article

Structural and functional reorganization of contralateral hippocampus after temporal lobe epilepsy surgery

Wei Li
Yuchao Jiang
Yingjie Qin
Baiwan Zhou
Du Lei
Heng Zhang
Ding Lei
Dezhong Yao

OBJECTIVE: To explore the structural and functional reorganization of contralateral hippocampus in patients with unilateral mesial temporal lobe epilepsy (mTLE) who achieved seizure-freedom after anterior temporal lobectomy (ATL). METHODS: We obtained high-resolution structural MRI and resting-state functional MRI data in 28 unilateral mTLE patients and 29 healthy controls. Patients were scanned before and three and 24 months after surgery while controls were scanned only once. Hippocampal gray matter volume (GMV) and functional connectivity (FC) were assessed. RESULTS: No obvious GMV changes were observed in contralateral hippocampus before and after successful surgery. Before surgery, ipsilateral hippocampus showed increased FC with ipsilateral insula (INS) and temporoparietal junction (TPJ), but decreased FC with widespread bilateral regions, as well as contralateral hippocampus. After successful ATL, contralateral hippocampus showed: (1) decreased FC with ipsilateral INS at three months follow-up, without further changes; (2) decreased FC with ipsilateral TPJ, postcentral gyrus and rolandic operculum at three months, with an obvious increase at 24 months follow-up; (3) increased FC with bilateral medial prefrontal cortex (MPFC) and superior frontal gyrus (SFG) at three months follow-up, without further changes. CONCLUSIONS: Successful ATL may not lead to an obvious structural reorganization in contralateral hippocampus. Surgical manipulation may lead to a transient FC reduction of contralateral hippocampus. Increased FC between contralateral hippocampus and bilateral MPFC and SFG may be related to postoperative functional remodeling.

EAAI Journal 2021 Journal Article

Virtual generation of pavement crack images based on improved deep convolutional generative adversarial network

Lili Pei
Zhaoyun Sun
Liyang Xiao
Wei Li
Jing Sun
He Zhang

To solve the problems associated with a small sample size during intelligent road detection, a virtual image set generation method for asphalt pavement cracks is proposed based on improved deep convolutional generative adversarial networks (DCGANs). First, a small set of sample crack images is collected and used as the basic image set to perform filtering, gamma transformation, and other processes, whereby crack feature recognition is enhanced. Second, a variational autoencoder (VAE) is used to encode real crack images. The latent variable values obtained from the VAE are provided as input to the DCGAN model generator, and the model hyperparameters are optimized. Subsequently, the adaptive moment estimation (Adam) optimizer is used to reoptimize the model and thereby improve the model convergence speed and generalization ability. The proposed method has the advantages of both VAE and DCGAN. Finally, a pavement crack classification detection model based on faster region convolutional neural network (Faster R-CNN) is used to evaluate the reliability of the generated crack images. The results show that the augmented dataset of the proposed method with the detection model has an average precision of 90.32%, which is higher than that of the conventional method evaluated using the same test dataset. The proposed method generates virtual crack images that are moderately identical to real ones, thereby solving the problem of insufficient image datasets of cracks in specific road sections. The method also provides data assurance for the intelligentization of pavement crack detection and the reduction of pavement maintenance costs.

JBHI Journal 2020 Journal Article

A Hierarchical Neural Network for Sleep Stage Classification Based on Comprehensive Feature Learning and Multi-Flow Sequence Learning

Chenglu Sun
Chen Chen
Wei Li
Jiahao Fan
Wei Chen

Automatic sleep staging methods usually extract hand-crafted features or network trained features from signals recorded by polysomnography (PSG), and then estimate the stages by various classifiers. In this study, we propose a classification approach based on a hierarchical neural network to process multi-channel PSG signals for improving the performance of automatic five-class sleep staging. The proposed hierarchical network contains two stages: comprehensive feature learning stage and sequence learning stage. The first stage is used to obtain the feature matrix by fusing the hand-crafted features and network trained features. A multi-flow recurrent neural network (RNN) as the second stage is utilized to fully learn temporal information between sleep epochs and fine-tune the parameters in the first stage. The proposed model was evaluated by 147 full night recordings in a public sleep database, the Montreal Archive of Sleep Studies (MASS). The proposed approach can achieve the overall accuracy of 0.878, and the F1-score is 0.818. The results show that the approach can achieve better performance compared to the state-of-the-art methods. Ablation experiment and model analysis proved the effectiveness of different components of the proposed model. The proposed approach allows automatic sleep stage classification by multi-channel PSG signals with different criteria standards, signal characteristics, and epoch divisions, and it has the potential to exploit sleep information comprehensively.

AAAI Conference 2020 Conference Paper

AutoRemover: Automatic Object Removal for Autonomous Driving Videos

Rong Zhang
Wei Li
Peng Wang
Chenye Guan
Jin Fang
Yuhang Song
Jinhui Yu
Baoquan Chen

Motivated by the need for photo-realistic simulation in autonomous driving, in this paper we present a video inpainting algorithm AutoRemover, designed specifically for generating street-view videos without any moving objects. In our setup we have two challenges: the first is shadows, which are usually unlabeled but tightly coupled with the moving objects; the second is the large ego-motion in the videos. To deal with shadows, we build up an autonomous driving shadow dataset and design a deep neural network to detect shadows automatically. To deal with large ego-motion, we take advantage of the multi-source data, in particular the 3D data, in autonomous driving. More specifically, the geometric relationship between frames is incorporated into an inpainting deep neural network to produce high-quality structurally consistent video output. Experiments show that our method outperforms other state-of-the-art (SOTA) object removal algorithms, reducing the RMSE by over 19%.

JBHI Journal 2020 Journal Article

Blood Cell Classification Based on Hyperspectral Imaging With Modulated Gabor and CNN

Qian Huang
Wei Li
Baochang Zhang
Qingli Li
Ran Tao
Nigel H. Lovell

Cell classification, especially that of white blood cells, plays a very important role in the field of diagnosis and control of major diseases. Compared to traditional optical microscopic imaging, hyperspectral imagery, which combines spatial and spectral information, provides richer information for recognizing cells. In this paper, a novel blood cell classification framework, which combines a modulated Gabor wavelet and deep convolutional neural network (CNN) kernels, named MGCNN, is proposed based on medical hyperspectral imaging. For each convolutional layer, the dot product of multi-scale, multi-orientation Gabor operators with the initial CNN kernels is taken. The essence is to transform the convolutional kernels into the frequency domain to learn features. By combining characteristics of Gabor wavelets, the features learned by modulated kernels at different frequencies and orientations are more representative and discriminative. Experimental results demonstrate that the proposed model can achieve better classification performance than traditional CNNs and widely used support vector machine approaches, especially when training samples are scarce.

JBHI Journal 2020 Journal Article

Differential Diagnosis of Atypical Hepatocellular Carcinoma in Contrast-Enhanced Ultrasound Using Spatio-Temporal Diagnostic Semantics

Qinghua Huang
Fengxin Pan
Wei Li
Feiniu Yuan
Hangtong Hu
Jinhua Huang
Jie Yu
Wei Wang

Atypical Hepatocellular Carcinoma (HCC) is very hard to distinguish from Focal Nodular Hyperplasia (FNH) in routine imaging. However, little attention has been paid to this problem. This paper proposes a novel liver tumor Computer-Aided Diagnostic (CAD) approach extracting spatio-temporal semantics for atypical HCC. With respect to useful diagnostic semantics, our model automatically calculates three types of semantic features with equally down-sampled frames based on Contrast-Enhanced Ultrasound (CEUS). Thereafter, a Support Vector Machine (SVM) classifier is trained to make the final diagnosis. Compared with traditional methods for diagnosing HCC, the proposed model has the advantage of lower computational complexity and is able to handle atypical HCC cases. The experimental results show that our method performed well and outperformed two traditional methods. According to the results, the average accuracy reaches 94.40%, recall rate 94.76%, F1-score value 94.62%, specificity 93.62% and sensitivity 94.76%, indicating good merit for automatically diagnosing atypical HCC cases.
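
The abstract reports recall and sensitivity as separate figures with the same value. That is expected, since both names denote the same ratio, TP / (TP + FN). The sketch below computes the listed metrics from a binary confusion matrix; the counts are hypothetical and chosen only to exercise the formulas, not taken from the paper.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common diagnostic metrics from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "recall": sensitivity,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=90, fp=6, tn=94, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```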

JMLR Journal 2020 Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel
Noam Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

AAAI Conference 2020 Conference Paper

FET-GAN: Font and Effect Transfer via K-shot Adaptive Instance Normalization

Wei Li
Yongxing He
Yanwei Qi
Zejian Li
Yongchuan Tang

Text effect transfer aims at learning the mapping between text visual effects while maintaining the text content. While remarkably successful, existing methods have limited robustness in font transfer and weak generalization ability to unseen effects. To address these problems, we propose FET-GAN, a novel end-to-end framework to implement visual effects transfer with font variation among multiple text effects domains. Our model achieves remarkable results both on arbitrary effect transfer between texts and effect translation from text to graphic objects. By a few-shot fine-tuning strategy, FET-GAN can generalize the transfer of the pre-trained model to the new effect. Through extensive experimental validation and comparison, our model advances the state-of-the-art in the text effect transfer task. Besides, we have collected a font dataset including 100 fonts of more than 800 Chinese and English characters. Based on this dataset, we demonstrated the generalization ability of our model by the application that complements the font library automatically by few-shot samples. This application is significant in reducing the labor cost for the font designer.

AAAI Conference 2020 Conference Paper

Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View

Deli Chen
Yankai Lin
Wei Li
Peng Li
Jie Zhou
Xu Sun

Graph Neural Networks (GNNs) have achieved promising performance on a wide range of graph-based tasks. Despite their success, one severe limitation of GNNs is the over-smoothing issue (indistinguishable representations of nodes in different classes). In this work, we present a systematic and quantitative study on the over-smoothing issue of GNNs. First, we introduce two quantitative metrics, MAD and MADGap, to measure the smoothness and oversmoothness of the graph nodes representations, respectively. Then, we verify that smoothing is the nature of GNNs and the critical factor leading to over-smoothness is the low information-to-noise ratio of the message received by the nodes, which is partially determined by the graph topology. Finally, we propose two methods to alleviate the oversmoothing issue from the topological view: (1) MADReg which adds a MADGap-based regularizer to the training objective; (2) AdaEdge which optimizes the graph topology based on the model predictions. Extensive experiments on 7 widely-used graph datasets with 10 typical GNN models show that the two proposed methods are effective for relieving the over-smoothing issue, thus improving the performance of various GNN models.
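
The abstract names the MAD and MADGap metrics without defining them. The sketch below is a simplified reading, assuming MAD is the mean of per-node average cosine distances over the node pairs selected by a mask, and MADGap is the MAD over remote pairs minus the MAD over neighboring pairs. The function names, toy representations, and masks are hypothetical, and details such as how zero distances are handled may differ from the paper.

```python
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mad(reps, mask):
    """Mean over nodes of the average cosine distance to the nodes selected by mask."""
    row_means = []
    for i, row in enumerate(mask):
        dists = [
            cosine_distance(reps[i], reps[j])
            for j, keep in enumerate(row)
            if keep and j != i
        ]
        if dists:  # skip nodes with no selected partner
            row_means.append(sum(dists) / len(dists))
    return sum(row_means) / len(row_means) if row_means else 0.0


def mad_gap(reps, neighbor_mask, remote_mask):
    """MAD over remote pairs minus MAD over neighboring pairs."""
    return mad(reps, remote_mask) - mad(reps, neighbor_mask)


# Toy example: nodes 0-1 and nodes 2-3 form two well-separated groups.
reps = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
neighbor_mask = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
remote_mask = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
print(f"MADGap = {mad_gap(reps, neighbor_mask, remote_mask):.3f}")
```

Under this reading, a gap near zero means remote nodes look as similar as neighbors, which is the over-smoothed regime the abstract describes.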

IJCAI Conference 2020 Conference Paper

Modeling the Stock Relation with Graph Network for Overnight Stock Movement Prediction

Wei Li
Ruihan Bao
Keiko Harimoto
Deli Chen
Jingjing Xu
Qi Su

Stock movement prediction is a hot topic in the Fintech area. Previous works usually predict the price movement on a daily basis, although the market impact of news can be absorbed in a much shorter time, and the exact time is hard to estimate. In this work, we propose a more practical objective to predict the overnight stock movement between the previous close price and the open price. As no trading operation occurs after market close, the market impact of overnight news will be reflected by the overnight movement. One big obstacle for such a task is the lack of data; in this work, we collect and publish the overnight stock price movement dataset of Reuters Financial News. Another challenge is that the stocks in the market are not independent, which is omitted by previous works. To make use of the connection among stocks, we propose an LSTM Relational Graph Convolutional Network (LSTM-RGCN) model, which models the connection among stocks with their correlation matrix. Extensive experiment results show that our model outperforms the baseline models. Further analysis shows that the introduction of the graph enables our model to predict the movement of stocks that are not directly associated with news as well as the whole market, which is not available in most previous methods.
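
Two quantities mentioned in this abstract are easy to make concrete: the overnight movement between the previous close and the next open, and a correlation matrix across stocks. The sketch below assumes the overnight movement is measured as a relative return and that correlations are Pearson correlations of those returns; the abstract specifies neither, and the tickers and prices are hypothetical.

```python
from math import sqrt


def overnight_returns(closes, opens):
    """Return (open_t - close_{t-1}) / close_{t-1} for each day t >= 1."""
    return [(opens[t] - closes[t - 1]) / closes[t - 1] for t in range(1, len(opens))]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Pairwise Pearson correlations as a nested dict keyed by series name."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Hypothetical daily close and open prices for two made-up tickers.
prices = {
    "AAA": {"close": [100.0, 102.0, 101.0, 105.0], "open": [99.0, 101.0, 103.0, 104.0]},
    "BBB": {"close": [50.0, 51.0, 50.5, 52.0], "open": [49.5, 50.2, 51.5, 51.0]},
}
returns = {k: overnight_returns(v["close"], v["open"]) for k, v in prices.items()}
corr = correlation_matrix(returns)
print(f"corr(AAA, BBB) = {corr['AAA']['BBB']:.3f}")
```

A matrix like `corr` could then serve as edge weights in a stock graph, which is one plausible way to read "models the connection among stocks with their correlation matrix".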

AAAI Conference 2020 Conference Paper

Neural Graph Embedding for Neural Architecture Search

Wei Li
Shaogang Gong
Xiatian Zhu

Existing neural architecture search (NAS) methods often operate in discrete or continuous spaces directly, which ignores the graphical topology knowledge of neural networks. This leads to suboptimal search performance and efficiency, given the fact that neural networks are essentially directed acyclic graphs (DAG). In this work, we address this limitation by introducing a novel idea of neural graph embedding (NGE). Specifically, we represent the building block (i.e., the cell) of neural networks with a neural DAG, and learn it by leveraging a Graph Convolutional Network to propagate and model the intrinsic topology information of network architectures. This results in a generic neural network representation integrable with different existing NAS frameworks. Extensive experiments show the superiority of NGE over the state-of-the-art methods on image classification and semantic segmentation.

YNIMG Journal 2020 Journal Article

Neural mechanisms of AVPR1A RS3-RS1 haplotypes that impact verbal learning and memory

Yan Zhang
Dan Zhu
Peng Zhang
Wei Li
Wen Qin
Feng Liu
Jiayuan Xu
Qiang Xu

Converging evidence from both human and animal studies has highlighted the pervasive role of the neuropeptide arginine vasopressin (AVP), which is mediated by arginine vasopressin receptor 1A (AVPR1A), in both social and nonsocial learning and memory. However, the effect of genetic variants in AVPR1A on verbal learning and memory is unknown. The hippocampus is a heterogeneous structure that consists of several anatomically and functionally distinct subfields, and it is the principal target structure for the memory-enhancing effect of AVP. We tested the hypothesis that genetic variants in the RS3 and RS1 repeat polymorphisms may influence verbal learning and memory performance evaluated by the California Verbal Learning Test-II (CVLT-II) by modulating the gray matter volume (GMV) and resting-state functional connectivity (rsFC) of whole hippocampus and its subfields in a large cohort of young healthy subjects (n = 1001). Using a short/long classification scheme for the repeat length of RS3 and RS1, we found that the individuals carrying more short alleles of RS3-RS1 haplotypes had poorer learning and memory performance compared to that of those carrying more long alleles. We also revealed that individuals carrying more short alleles exhibited a significantly smaller GMV in the left cornu ammonis (CA)2/3 and weaker rsFC of the left CA2/3-bilateral thalamic (primarily in medial prefrontal subfields) compared to those carrying more long alleles. Furthermore, multiple mediation analysis confirmed that these two hippocampal imaging measures jointly and fully mediated the relationship between the genetic variants in AVPR1A RS3-RS1 haplotypes and the individual differences in verbal learning and memory performance. Our results suggest that genetic variants in AVPR1A RS3-RS1 haplotypes may affect verbal learning and memory performance in part by modulating the left hippocampal CA2/3 structure and its rsFC with the thalamus.

AAAI Conference 2020 Conference Paper

Semi-Supervised Learning under Class Distribution Mismatch

Yanbei Chen
Xiatian Zhu
Wei Li
Shaogang Gong

Semi-supervised learning (SSL) aims to avoid the need for collecting prohibitively expensive labelled training data. Whilst demonstrating impressive performance boost, existing SSL methods artificially assume that small labelled data and large unlabelled data are drawn from the same class distribution. In a more realistic scenario with class distribution mismatch between the two sets, they often suffer severe performance degradation due to error propagation introduced by irrelevant unlabelled samples. Our work addresses this under-studied and realistic SSL problem by a novel algorithm named Uncertainty-Aware Self-Distillation (UASD). Specifically, UASD produces soft targets that avoid catastrophic error propagation, and empower learning effectively from unconstrained unlabelled data with out-of-distribution (OOD) samples. This is based on joint Self-Distillation and OOD filtering in a unified formulation. Without bells and whistles, UASD significantly outperforms six state-of-the-art methods in more realistic SSL under class distribution mismatch on three popular image classification datasets: CIFAR10, CIFAR100, and TinyImageNet.

EAAI Journal 2020 Journal Article

Three-dimensional pavement crack detection based on primary surface profile innovation optimized dual-phase computing

Ju Huyan
Wei Li
Susan Tighe
Liyang Xiao
Zhaoyun Sun
Nana Shao

Accurate pavement crack detection has long been a challenging task, causing significant difficulties for pavement management sectors in managerial decision making. The high complexity of crack characteristics and the limited effectiveness of crack analysis tools are the two crucial aspects to be accounted for. Recently, three-dimensional (3D) technology based high precision crack detection methodologies have undergone extensive development. Nevertheless, none of those methods has taken the errors caused by the data collection systems into consideration, resulting in less satisfying performance. Hence, the primary objective of this research is to outline the Primary Surface Profile (PSP) optimized dual-phase computing 3D crack detection methodology. Two years ago, variations caused by the automatic 3D data collection systems were observed, so researchers proposed a PSP based data filtering algorithm. Therefore, this research is the upgrade solution of the previous innovation regarding unbiased 3D pavement crack detection. Firstly, the dual-phase computing approach is proposed for dealing with the non-variance 3D data. Then, the self-adaptive 3D PSP generation method is introduced. Finally, PSP is embedded in the dual-phase computing method for performance optimization. For performance assessment, both the precision and recall of the proposed approach are compared with those of the conventional method for transverse, longitudinal, and map crack detection. Comparable crack detection precisions are found for both methods, all higher than 0.9. However, the recalls of the proposed method (transverse cracks: 0.973, longitudinal cracks: 0.981, map cracks: 0.940) are significantly higher than those of the non-optimized dual-phase computing method (transverse cracks: 0.682, longitudinal cracks: 0.789, map cracks: 0.811).

JBHI Journal 2019 Journal Article

Detecting Alzheimer's Disease on Small Dataset: A Knowledge Transfer Perspective

Wei Li
Yifei Zhao
Xi Chen
Yang Xiao
Yuanyuan Qin

Computer-aided diagnosis (CAD) is an attractive topic in Alzheimer's disease (AD) research. Many algorithms are based on a relatively large training dataset. However, small hospitals are usually unable to collect sufficient training samples for robust classification. Although data sharing is expanding in scientific research, it is unclear whether a model based on one dataset is well suited for other data sources. Using a small dataset from a local hospital and a large shared dataset from the AD neuroimaging initiative, we conducted a heterogeneity analysis and found that different functional magnetic resonance imaging data sources show different sample distributions in feature space. In addition, we proposed an effective knowledge transfer method to diminish the disparity among different datasets and improve the classification accuracy on datasets with insufficient training samples. The accuracy increased by approximately 20% compared with that of a model based only on the original small dataset. The results demonstrated that the proposed approach is a novel and effective method for CAD in hospitals with only small training datasets. It addressed the challenge of limited sample size in the detection of AD, a common issue that has received inadequate attention. Furthermore, this paper sheds new light on effective use of multi-source data for neurological disease diagnosis.

YNICL Journal 2019 Journal Article

Different patterns of white matter changes after successful surgery of mesial temporal lobe epilepsy

Wei Li
Dongmei An
Xin Tong
Wenyu Liu
Fenglai Xiao
Jiechuan Ren
Running Niu
Yingying Tang

OBJECTIVES: To explore the dynamic changes of white matter following anterior temporal lobectomy (ATL) in mesial temporal lobe epilepsy (MTLE) patients who were seizure-free at two-year follow-up. METHODS: Diffusion tensor imaging (DTI) was obtained in ten MTLE patients at five serial time points: before surgery, three months, six months, 12 months and 24 months after surgery, as well as in 11 age- and sex-matched healthy controls at one time point. Regions with significant postoperative fractional anisotropy (FA) changes and their dynamic changes were confirmed by comparing all preoperative and postoperative data using Tract-Based Spatial Statistics (TBSS). RESULTS: After successful ATL, significant FA changes were found in widespread ipsilateral and contralateral white matter regions (P < .05, FWE correction). Ipsilateral external capsule, cingulum, superior corona radiata, body of corpus callosum, inferior longitudinal fasciculus, optic radiation and contralateral inferior cerebellar peduncle, inferior longitudinal fasciculus showed significant FA decrease at three months after surgery, without further changes. Ipsilateral superior cerebellar peduncle and contralateral corpus callosum, anterior corona radiata, external capsule, optic radiation showed significant FA decrease at three months follow up but increase later. Ipsilateral cerebral peduncle and contralateral middle cerebellar peduncle showed significant FA decrease at three months follow up, with further decrease after that. In contrast, ipsilateral posterior limb of internal capsule, retrolenticular part of internal capsule and contralateral posterior corona radiata showed significant FA increase after surgery. CONCLUSIONS: FA changes after successful ATL presented as four distinct patterns, reflecting different structural adaptations following epilepsy surgery. Some FA increases indicated the reversibility of preoperative diffusion abnormalities and the possibility of structural reorganization, especially in the contralateral hemisphere.

AAAI Conference 2019 Conference Paper

Learning Disentangled Representation with Pairwise Independence

Zejian Li
Yongchuan Tang
Wei Li
Yongxing He

Unsupervised disentangled representation learning is one of the foundational methods to learn interpretable factors in the data. Existing learning methods are based on the assumption that disentangled factors are mutually independent and incorporate this assumption with the evidence lower bound. However, our experiment reveals that factors in real-world data tend to be pairwise independent. Accordingly, we propose a new method based on a pairwise independence assumption to learn the disentangled representation. The evidence lower bound implicitly encourages mutual independence of latent codes so it is too strong for our assumption. Therefore, we introduce another lower bound in our method. Extensive experiments show that our proposed method gives competitive performances as compared with other state-of-the-art methods.

IJCAI Conference 2019 Conference Paper

Multiple Policy Value Monte Carlo Tree Search

Li-Cheng Lan
Wei Li
Ting-Han Wei
I-Chen Wu

Many of the strongest game playing programs use a combination of Monte Carlo tree search (MCTS) and deep neural networks (DNN), where the DNNs are used as policy or value evaluators. Given a limited budget, such as online playing or during the self-play phase of AlphaZero (AZ) training, a balance needs to be reached between accurate state estimation and more MCTS simulations, both of which are critical for a strong game playing agent. Typically, larger DNNs are better at generalization and accurate evaluation, while smaller DNNs are less costly, and therefore can lead to more MCTS simulations and bigger search trees with the same budget. This paper introduces a new method called the multiple policy value MCTS (MPV-MCTS), which combines multiple policy value neural networks (PV-NNs) of various sizes to retain advantages of each network, where two PV-NNs f_S and f_L are used in this paper. We show through experiments on the game NoGo that a combined f_S and f_L MPV-MCTS outperforms single PV-NN with policy value MCTS, called PV-MCTS. Additionally, MPV-MCTS also outperforms PV-MCTS for AZ training.

JBHI Journal 2019 Journal Article

Unified Fine-Grained Access Control for Personal Health Records in Cloud Computing

Wei Li
Bonnie M. Liu
Dongxi Liu
Ren Ping Liu
Peishun Wang
Shoushan Luo
Wei Ni

Attribute-based encryption has been a promising encryption technology to secure personal health records (PHRs) sharing in cloud computing. PHRs consist of the patient data often collected from various sources including hospitals and general practice centres. Different patients' access policies have a common access sub-policy. In this paper, we propose a novel attribute-based encryption scheme for fine-grained and flexible access control to PHRs data in cloud computing. The scheme generates shared information by the common access sub-policy, which is based on different patients' access policies. Then, the scheme combines the encryption of PHRs from different patients. Therefore, both time consumption of encryption and decryption can be reduced. Medical staff require varying levels of access to PHRs. The proposed scheme can also support multi-privilege access control so that medical staff can access the required level of information while maximizing patient privacy. Through implementation and simulation, we demonstrate that the proposed scheme is efficient in terms of time. Moreover, we prove the security of the proposed scheme based on security of the ciphertext-policy attribute-based encryption scheme.

AAAI Conference 2018 Conference Paper

A Unified Model for Document-Based Question Answering Based on Human-Like Reading Strategy

Weikang Li
Wei Li
Yunfang Wu

Document-based Question Answering (DBQA) in Natural Language Processing (NLP) is important but difficult because of the long document and the complex question. Most previous deep learning methods mainly focus on the similarity computation between two sentences. However, DBQA stems to some degree from reading comprehension, which was originally used to train and test people’s ability of reading and logical thinking. Inspired by the strategy of doing reading comprehension tests, we propose a unified model based on the human-like reading strategy. The unified model contains three major encoding layers that are consistent with different steps of the reading strategy, including the basic encoder, combined encoder and hierarchical encoder. We conduct extensive experiments on both the English WikiQA dataset and the Chinese dataset, and the experimental results show that our unified model is effective and yields state-of-the-art results on WikiQA dataset.

NeurIPS Conference 2018 Conference Paper

Improved Expressivity Through Dendritic Neural Networks

Xundong Wu
Xiangwen Liu
Wei Li
Qing Wu

A typical biological neuron, such as a pyramidal neuron of the neocortex, receives thousands of afferent synaptic inputs on its dendrite tree and sends the efferent axonal output downstream. In typical artificial neural networks, dendrite trees are modeled as linear structures that funnel weighted synaptic inputs to the cell bodies. However, numerous experimental and theoretical studies have shown that dendritic arbors are far more than simple linear accumulators. That is, synaptic inputs can actively modulate their neighboring synaptic activities; therefore, the dendritic structures are highly nonlinear. In this study, we model such local nonlinearity of dendritic trees with our dendritic neural network (DENN) structure and apply this structure to typical machine learning tasks. Equipped with localized nonlinearities, DENNs can attain greater model expressivity than regular neural networks while maintaining efficient network inference. Such strength is evidenced by the increased fitting power when we train DENNs with supervised machine learning tasks. We also empirically show that the locality structure can improve the generalization performance of DENNs, as exemplified by DENNs outranking naive deep neural network architectures when tested on 121 classification tasks from the UCI machine learning repository.

AAMAS Conference 2017 Conference Paper

Dynamic Generalization Kanerva Coding in Reinforcement Learning for TCP Congestion Control Design

Wei Li
Fan Zhou
Waleed Meleis
Kaushik Chowdhury

Traditional reinforcement learning (RL) techniques often encounter limitations when solving large or continuous state-action spaces. Training times needed to explore the very large space are impractically long, and it can be difficult to generalize learned knowledge. A compact representation of the state space is usually generated to solve both problems. However, simple state abstraction often cannot achieve the desired learning quality, while expert state representations usually involve costly hand-crafted strategies. We propose a new technique, generalization-based Kanerva coding, that automatically generates and optimizes state abstractions for learning. When applied to adapting the congestion window of the highly complex TCP congestion control protocol, a standard Internet protocol, this technique outperforms the current standard-TCP New Reno by 59.5% in throughput and 6.5% in delay. Our technique also achieves a 35.2% improvement in throughput over the best previously proposed Kanerva coding technique when applied in the same context.

NeurIPS Conference 2017 Conference Paper

Generalizing GANs: A Turing Perspective

Roderich Gross
Yue Gu
Wei Li
Melvin Gauci

Recently, a new class of machine learning algorithms has emerged, where models and discriminators are generated in a competitive setting. The most prominent example is Generative Adversarial Networks (GANs). In this paper we examine how these algorithms relate to the Turing test, and derive what - from a Turing perspective - can be considered their defining features. Based on these features, we outline directions for generalizing GANs - resulting in the family of algorithms referred to as Turing Learning. One such direction is to allow the discriminators to interact with the processes from which the data samples are obtained, making them "interrogators", as in the Turing test. We validate this idea using two case studies. In the first case study, a computer infers the behavior of an agent while controlling its environment. In the second case study, a robot infers its own sensor configuration while controlling its movements. The results confirm that by allowing discriminators to interrogate, the accuracy of models is improved.

IJCAI Conference 2017 Conference Paper

Person Re-Identification by Deep Joint Learning of Multi-Loss Classification

Wei Li
Xiatian Zhu
Shaogang Gong

Existing person re-identification (re-id) methods rely mostly on either localised or global feature representation. This ignores their joint benefit and mutual complementary effects. In this work, we show the advantages of jointly learning local and global features in a Convolutional Neural Network (CNN) by aiming to discover correlated local and global features in different context. Specifically, we formulate a method for joint learning of local and global feature selection losses designed to optimise person re-id when using generic matching metrics such as the L2 distance. We design a novel CNN architecture for Jointly Learning Multi-Loss (JLML) of local and global discriminative feature optimisation subject concurrently to the same re-id labelled information. Extensive comparative evaluations demonstrate the advantages of this new JLML model for person re-id over a wide range of state-of-the-art re-id methods on five benchmarks (VIPeR, GRID, CUHK01, CUHK03, Market-1501).

AIIM Journal 2017 Journal Article

Prediction of synergistic anti-cancer drug combinations based on drug target network and drug induced gene expression profiles

Xiangyi Li
Yingjie Xu
Hui Cui
Tao Huang
Disong Wang
Baofeng Lian
Wei Li
Guangrong Qin

Objective: Synergistic drug combinations are promising therapies for cancer treatment. However, effective prediction of synergistic drug combinations is quite challenging as mechanisms of drug synergism are still unclear. Various features such as drug response, and target networks may contribute to prediction of synergistic drug combinations. In this study, we aimed to construct a computational model to predict synergistic drug combinations. Methods: We designed drug physicochemical features and network features, including drug chemical structure similarity, target distance in protein–protein network and targeted pathway similarity. At the same time, we designed fifteen pharmacogenomics features using drug treated gene expression profiles based on the background of cancer-related biology network. Based on these eighteen features, we built a prediction model for Synergistic Drug combination using Random forest algorithm (SyDRa). Results: Our model achieved a quite good performance with AUC value of 0.89 and Out-of-bag estimate error rate of 0.15 in training dataset. Using the random anti-cancer drug combinations which have transcriptional profile data in the Connectivity Map dataset as the testing dataset, we identified 28 potentially synergistic drug combinations, three out of which had been reported to be effective drug combinations by literatures. Conclusions: We studied eighteen features for drug combinations and built a computational model using random forest algorithm. The model was evaluated using an independent test dataset. Our model provides an efficient strategy to identify potentially synergistic drug combinations for cancer and may help reduce the search space for high-throughput synergistic drug combinations screening.

EAAI Journal 2017 Journal Article

Sprinkled semantic diffusion kernel for word sense disambiguation

Tinghua Wang
Wei Li
Fulai Liu
Jialin Hua

Word sense disambiguation (WSD), the task of identifying the intended meanings (senses) of words in context, has been a long-standing research objective for natural language processing (NLP). In this paper, we are concerned with kernel methods for automatic WSD. Under this framework, the main difficulty is to design an appropriate kernel function to represent the sense distinction knowledge. Semantic diffusion kernel, which models semantic similarity by means of a diffusion process on a graph defined by lexicon and co-occurrence information to smooth the typical “Bag of Words” (BOW) representation, has been successfully applied to WSD. However, the diffusion is an unsupervised process, which fails to exploit the class information in a supervised classification scenario. To address the limitation, we present a sprinkled semantic diffusion kernel to make use of the class knowledge of training documents in addition to the co-occurrence knowledge. The basic idea is to construct an augmented term-document matrix by encoding class information as additional terms and appending them to training documents. Diffusion is then performed on the augmented term-document matrix. In this way, the words belonging to the same class are indirectly drawn closer to each other, hence the class-specific word correlations are strengthened. We evaluate our method on several Senseval/Semeval benchmark examples with support vector machine (SVM), and show that the proposed kernel can significantly improve the disambiguation performance over semantic diffusion kernel in terms of different measures and yield a competitive result with the state-of-the-art kernel methods for WSD.

YNIMG Journal 2016 Journal Article

Imaging whole-brain cytoarchitecture of mouse with MRI-based quantitative susceptibility mapping

Hongjiang Wei
Luke Xie
Russell Dibb
Wei Li
Kyle Decker
Yuyao Zhang
G. Allan Johnson
Chunlei Liu

The proper microstructural arrangement of complex neural structures is essential for establishing the functional circuitry of the brain. We present an MRI method to resolve tissue microstructure and infer brain cytoarchitecture by mapping the magnetic susceptibility in the brain at high resolution. This is possible because of the heterogeneous magnetic susceptibility created by varying concentrations of lipids, proteins and irons from the cell membrane to cytoplasm. We demonstrate magnetic susceptibility maps at a nominal resolution of 10-μm isotropic, approaching the average cell size of a mouse brain. The maps reveal many detailed structures including the retina cell layers, olfactory sensory neurons, barrel cortex, cortical layers, axonal fibers in white and gray matter. Olfactory glomerulus density is calculated and structural connectivity is traced in the optic nerve, striatal neurons, and brainstem nerves. The method is robust and can be readily applied on MRI scanners at or above 7T.

YNIMG Journal 2016 Journal Article

Magnetic susceptibility of brain iron is associated with childhood spatial IQ

Kimberly L.H. Carpenter
Wei Li
Hongjiang Wei
Bing Wu
Xue Xiao
Chunlei Liu
Gordon Worley
Helen Link Egger

Iron is an essential micronutrient for healthy brain function and development. Because of the importance of iron in the brain, iron deficiency results in widespread and lasting effects on behavior and cognition. We measured iron in the basal ganglia of young children using a novel MRI method, quantitative susceptibility mapping, and examined the association of brain iron with age and cognitive performance. Participants were a community sample of 39 young children recruited from pediatric primary care who were participating in a 5-year longitudinal study of child brain development and anxiety disorders. The children were ages 7 to 11 years old (mean age: 9.5 years old) at the time of the quantitative susceptibility mapping scan. The differential abilities scale was administered when the children were 6 years old to provide a measure of general intelligence and verbal (receptive and expressive), non-verbal, and spatial performance. Magnetic susceptibility values, which are linearly related to iron concentration in iron-rich areas, were extracted from regions of interest within iron-rich deep gray matter nuclei from the basal ganglia, including the caudate, putamen, substantia nigra, globus pallidus, and thalamus. Controlling for scan age, there was a significant positive association between iron in the basal ganglia and spatial IQ, with this effect being driven by iron in the right caudate. We also replicated previous findings of a significant positive association between iron in the bilateral basal ganglia and age. Our finding of a positive association between spatial IQ and mean iron in the basal ganglia, and in the caudate specifically, suggests that iron content in specific regions of the iron-rich deep nuclei of the basal ganglia influences spatial intelligence. This provides a potential neurobiological mechanism linking deficits in spatial abilities reported in children who were severely iron deficient as infants to decreased iron within the caudate.

YNIMG Journal 2015 Journal Article

A method for estimating and removing streaking artifacts in quantitative susceptibility mapping

Wei Li
Nian Wang
Fang Yu
Hui Han
Wei Cao
Rebecca Romero
Bundhit Tantiwongkosi
Timothy Q. Duong

Quantitative susceptibility mapping (QSM) is a novel MRI method for quantifying tissue magnetic property. In the brain, it reflects the molecular composition and microstructure of the local tissue. However, susceptibility maps reconstructed from single-orientation data still suffer from streaking artifacts which obscure structural details and small lesions. We propose and have developed a general method for estimating streaking artifacts and subtracting them from susceptibility maps. Specifically, this method uses a sparse linear equation and least-squares (LSQR)-algorithm-based method to derive an initial estimation of magnetic susceptibility, a fast quantitative susceptibility mapping method to estimate the susceptibility boundaries, and an iterative approach to estimate the susceptibility artifact from ill-conditioned k-space regions only. With a fixed set of parameters for the initial susceptibility estimation and subsequent streaking artifact estimation and removal, the method provides an unbiased estimate of tissue susceptibility with negligible streaking artifacts, as compared to multi-orientation QSM reconstruction. This method allows for improved delineation of white matter lesions in patients with multiple sclerosis and small structures of the human brain with excellent anatomical details. The proposed methodology can be extended to other existing QSM algorithms.

YNIMG Journal 2015 Journal Article

Association between increased magnetic susceptibility of deep gray matter nuclei and decreased motor function in healthy adults

Wei Li
Christian Langkammer
Ying-hui Chou
Katja Petrovic
Reinhold Schmidt
Allen W. Song
David J. Madden
Stefan Ropele

In the human brain, iron is more prevalent in gray matter than in white matter, and deep gray matter structures, particularly the globus pallidus, putamen, caudate nucleus, substantia nigra, red nucleus, and dentate nucleus, exhibit especially high iron content. Abnormally elevated iron levels have been found in various neurodegenerative diseases. Additionally, iron overload and related neurodegeneration may also occur during aging, but the functional consequences are not clear. In this study, we explored the correlation between magnetic susceptibility, a surrogate marker of brain iron, of these gray matter structures with behavioral measures of motor and cognitive abilities, in 132 healthy adults aged 40–83 years. Latent variables corresponding to manual dexterity and executive functions were obtained using factor analysis. The factor scores for manual dexterity declined significantly with increasing age. Independent of gender, age, and global cognitive function, increasing magnetic susceptibility in the globus pallidus and red nuclei was associated with decreasing manual dexterity. This finding suggests the potential value of magnetic susceptibility, a non-invasive quantitative imaging marker of iron, for the study of iron-related brain function changes.

YNIMG Journal 2015 Journal Article

Cerebral angiography, blood flow and vascular reactivity in progressive hypertension

Yunxia Li
Qiang Shen
Shiliang Huang
Wei Li
Eric R. Muir
Justin A. Long
Timothy Q. Duong

Chronic hypertension alters cerebral vascular morphology, cerebral blood flow (CBF), cerebrovascular reactivity, and increases susceptibility to neurological disorders. This study evaluated: i) the lumen diameters of major cerebral and downstream arteries using magnetic resonance angiography, ii) basal CBF, and iii) cerebrovascular reactivity to hypercapnia of multiple brain regions using arterial-spin-labeling technique in spontaneously hypertensive rats (SHR) at different stages. Comparisons were made with age-matched normotensive Wistar Kyoto (WKY) rats. In 10-week SHR, lumen diameter started to reduce, basal CBF, and hypercapnic CBF response were higher from elevated arterial blood pressure, but there was no evidence of stenosis, compared to age-matched WKY. In 20-week SHR, lumen diameter remained reduced, CBF returned toward normal from vasoconstriction, hypercapnic CBF response reversed and became smaller, but without apparent stenosis. In 40-week SHR, lumen diameter remained reduced and basal CBF further decreased, resulting in larger differences compared to WKY. There was significant stenosis in main supplying cerebral vessels. Hypercapnic CBF response further decreased, with some animals showing negative hypercapnic CBF responses in some brain regions, indicative of compromised cerebrovascular reserve. The territory with negative hypercapnia CBF responses corresponded with the severity of stenosis in arteries that supplied those territories. We also found enlargement of downstream vessels and formation of collateral vessels as compensatory responses to stenosis of upstream vessels. The middle cerebral and azygos arteries were amongst the most susceptible to hypertension-induced changes. Multimodal MRI provides clinically relevant data that might be useful to characterize disease pathogenesis, stage disease progression, and monitor treatment effects in hypertension.

IJCAI Conference 2015 Conference Paper

Multi-Modality Tracker Aggregation: From Generative to Discriminative

Xiaoqin Zhang
Wei Li
Mingyu Fan
Di Wang
Xiuzi Ye

Visual tracking is an important research topic in the computer vision community. Although there are numerous tracking algorithms in the literature, no one performs better than the others under all circumstances, and the best algorithm for a particular dataset may not be known a priori. This motivates a fundamental problem: the necessity of an ensemble learning of different tracking algorithms to overcome their drawbacks and to increase the generalization ability. This paper proposes a multimodality ranking aggregation framework for fusion of multiple tracking algorithms. In our work, each tracker is viewed as a ‘ranker’ which outputs a rank list of the candidate image patches based on its own appearance model in a particular modality. Then the proposed algorithm aggregates the rankings of different rankers to produce a joint ranking. Moreover, the level of expertise for each ‘ranker’ based on the historical ranking results is also effectively used in our model. The proposed model not only provides a general framework for fusing multiple tracking algorithms on multiple modalities, but also provides a natural way to combine the advantages of the generative model based trackers and the discriminative model based trackers. It does not need to directly compare the output results obtained by different trackers, and such a comparison is usually heuristic. Extensive experiments demonstrate the effectiveness of our work.

AAAI Conference 2015 Conference Paper

What Is Hot in CHI

Wei Li

IS Journal 2014 Journal Article

A Network Evolution Model for Chinese Traditional Acquaintance Networks

Xi Chen
Lan Zhang
Wei Li

The evolution model of Chinese traditional acquaintance relationship networks described in this article emphasizes individual heterogeneity and social culture. The model incorporates three distinct mechanisms that affect acquaintance network evolution and formation: heredity linking, variation linking, and similarity-based disconnection. The authors found that the degree distribution of Chinese traditional acquaintance networks is manifested in a piecewise approximation that combines a power-law form with an exponential cutoff and exponential distribution. Numerical results indicate that individuals maintaining a medium amount of connections far outweigh others, reflecting the characteristics of Guanxi-centered society. The formation of acquaintance relationship networks is greatly affected by the special Chinese kinship culture. The authors' findings are supported by sociological statistical conclusions and offer a rational explanation for the nature of Chinese kinship networks. Their work provides an adequate framework for further research on dynamic human complex behaviors such as epidemic spreading and rumor propagation.

YNIMG Journal 2014 Journal Article

Prenatal alcohol exposure reduces magnetic susceptibility contrast and anisotropy in the white matter of mouse brains

Wei Cao
Wei Li
Hui Han
Shonagh K. O'Leary-Moore
Kathleen K. Sulik
G. Allan Johnson
Chunlei Liu

Prenatal alcohol exposure can result in long-term cognitive and behavioral deficits. Fetal alcohol spectrum disorder (FASD) refers to a range of permanent birth defects caused by prenatal alcohol exposure, and is the most common neurodevelopmental disorder in the US. Studies by autopsy and conventional structural MRI indicate that the midline structures of the brain are particularly vulnerable to prenatal alcohol exposure. Diffusion tensor imaging (DTI) has shown that abnormalities in brain white matter especially the corpus callosum are very common in FASD. Quantitative susceptibility mapping (QSM) is a novel technique that measures tissue's magnetic property. Such magnetic property is affected by tissue microstructure and molecular composition including that of myelin in the white matter. In this work, we studied three major white matter fiber bundles of a mouse model of FASD and compared it to control mice using both QSM and DTI. QSM revealed clear and significant abnormalities in anterior commissure, corpus callosum, and hippocampal commissure, which were likely due to reduced myelination. Our data also suggested that QSM may be even more sensitive than DTI for examining changes due to prenatal alcohol exposure. Although this is a preclinical study, the technique of QSM is readily translatable to human brain.

YNIMG Journal 2014 Journal Article

Quantitative magnetic susceptibility of the developing mouse brain reveals microstructural changes in the white matter

Ioannis Argyridis
Wei Li
G. Allan Johnson
Chunlei Liu

Cerebral development involves a complex cascade of events which are difficult to visualize and quantify in vivo. In this study we combine information from Diffusion Tensor Imaging (DTI) and Quantitative Susceptibility Mapping (QSM) to analyze developing mouse brains at five stages up to 56 days postnatal. Susceptibility maps were calculated using frequency shifts in gradient echo MR images acquired at 9.4T. The mean apparent magnetic susceptibility and magnetic susceptibility anisotropy of major white matter tracts were evaluated as a function of age. During the first two weeks, susceptibility of white matter appeared paramagnetic relative to surrounding gray matter; it then gradually became more diamagnetic. While diffusion anisotropy was already apparent and high at postnatal day 2, susceptibility anisotropy only became significant during the third week. This mismatch indicated different microstructural underpinnings for diffusion anisotropy and susceptibility anisotropy. Histological exams were also performed to evaluate myelin and iron content. It is confirmed that the main source of susceptibility contrast in WM is the myelin content. The ability to quantify the magnetic properties of white matter will provide valuable information on the architecture of the brain during development and potentially a more specific indicator for myelin degenerative diseases.

ICRA Conference 2014 Conference Paper

Robot learning based on Partial Observable Markov Decision Process in unstructured environment

Hongtai Cheng
Heping Chen
Lina Hao
Wei Li

Robot teaching is necessary for the current industrial robot applications. Because work stations have to be stopped to perform teaching processes, the manufacturing efficiency is decreased. In this paper we propose to utilize an uncalibrated vision system mounted on a mobile robot (“Adult” robot) with learning capability to supervise a group of fixed robots (“Child” robots) to accomplish a robot teaching task automatically without stopping work stations. To increase the system flexibility, hand-eye calibration and calibration between the robots are eliminated. A Partial Observable Markov Decision Process (POMDP) is formulated and solved using the Successive Approximation of the Reachable Space under Optimal Policies (SARSOP) algorithm to enable the teaching process using image features with uncertainties. The proposed algorithm was tested using the “Adult” robot to teach a “Child” robot to perform a high accuracy peg-in-hole assembly process. The experimental results verify the effectiveness of the proposed approach. The proposed method can also be used in other areas to enable robot teaching.

YNIMG Journal 2013 Journal Article

Automatic hippocampus segmentation of 7.0Tesla MR images by combining multiple atlases and auto-context models

Minjeong Kim
Guorong Wu
Wei Li
Li Wang
Young-Don Son
Zang-Hee Cho
Dinggang Shen

In many neuroscience and clinical studies, accurate measurement of hippocampus is very important to reveal the inter-subject anatomical differences or the subtle intra-subject longitudinal changes due to aging or dementia. Although many automatic segmentation methods have been developed, their performances are still challenged by the poor image contrast of hippocampus in the MR images acquired especially from 1.5 or 3.0 Tesla (T) scanners. With the recent advance of imaging technology, 7.0T scanner provides much higher image contrast and resolution for hippocampus study. However, the previous methods developed for segmentation of hippocampus from 1.5T or 3.0T images do not work for the 7.0T images, due to different levels of imaging contrast and texture information. In this paper, we present a learning-based algorithm for automatic segmentation of hippocampi from 7.0T images, by taking advantages of the state-of-the-art multi-atlas framework and also the auto-context model (ACM). Specifically, ACM is performed in each atlas domain to iteratively construct sequences of location-adaptive classifiers by integrating both image appearance and local context features. Due to the plenty texture information in 7.0T images, more advanced texture features are also extracted and incorporated into the ACM during the training stage. Then, under the multi-atlas segmentation framework, multiple sequences of ACM-based classifiers are trained for all atlases to incorporate the anatomical variability. In the application stage, for a new image, its hippocampus segmentation can be achieved by fusing the labeling results from all atlases, each of which is obtained by applying the atlas-specific ACM-based classifiers. Experimental results on twenty 7.0T images with the voxel size of 0.35×0.35×0.35 mm³ show very promising hippocampus segmentations (in terms of Dice overlap ratio 89.1±0.020), indicating high applicability for the future clinical and neuroscience studies.

IJCAI Conference 2013 Conference Paper

Dimensionality Reduction with Generalized Linear Models

Mo Chen
Wei Li
Wei Zhang
Xiaogang Wang

In this paper, we propose a general dimensionality reduction method for data generated from a very broad family of distributions and nonlinear functions based on the generalized linear model, called Generalized Linear Principal Component Analysis (GLPCA). Data of different domains often have very different structures. These data can be modeled by different distributions and reconstruction functions. For example, real valued data can be modeled by the Gaussian distribution with a linear reconstruction function, whereas binary valued data may be more appropriately modeled by the Bernoulli distribution with a logit or probit function. Based on general linear models, we propose a unified framework for extracting features from data of different domains. A general optimization algorithm based on natural gradient ascent on distribution manifold is proposed for obtaining the maximum likelihood solutions. We also present some specific algorithms derived from this framework to deal with specific data modeling problems such as document modeling. Experimental results of these algorithms on several data sets are shown for the validation of GLPCA.


YNIMG Journal 2013 Journal Article

Imaging neural architecture of the brain based on its multipole magnetic response

Chunlei Liu
Wei Li

Although magnetic fields interact weakly with biological tissues, at high fields, this interaction is sufficiently strong to cause measurable shifts in the Larmor frequency among various tissue types. While measuring frequency shift and its anisotropy has enabled NMR spectroscopy to determine structures of large molecules, MRI has not been able to fully utilize the vast information existing in the frequency to elucidate tissue microstructure. Using a multipole analysis of the complex MRI signal in the Fourier spectral space, we developed a fast and high-resolution method that enables the quantification of tissue's magnetic response with a set of magnetic susceptibility tensors of various ranks. The Fourier spectral space, termed p-space, can be generated by applying field gradients or equivalently by shifting the k-space data in various directions. Measuring these tensors allows the visualization and quantification of tissue architecture. We performed 3D whole-brain multipole susceptibility tensor imaging in simulation, on intact mouse brains ex vivo and on human brains in vivo. We showed that these multipole susceptibility tensors can be used to image orientations of ordered white matter fibers. These experiments demonstrate that multipole tensor analysis may enable practical mapping of tissue microstructure in vivo without rotating subject or magnetic field.

JMLR Journal 2013 Journal Article

One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Jun Wan
Qiuqi Ruan
Wei Li
Shuang Deng

For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2).

YNIMG Journal 2013 Journal Article

Subregions of the human superior frontal gyrus and their connections

Wei Li
Wen Qin
Huaigui Liu
Lingzhong Fan
Jiaojian Wang
Tianzi Jiang
Chunshui Yu

The superior frontal gyrus (SFG) is located at the superior part of the prefrontal cortex and is involved in a variety of functions, suggesting the existence of functional subregions. However, parcellation schemes of the human SFG and the connection patterns of each subregion remain unclear. We firstly parcellated the human SFG into the anteromedial (SFGam), dorsolateral (SFGdl), and posterior (SFGp) subregions based on diffusion tensor tractography. The SFGam was anatomically connected with the anterior and mid-cingulate cortices, which are critical nodes of the cognitive control network and the default mode network (DMN). The SFGdl was connected with the middle and inferior frontal gyri, which are involved in the cognitive execution network. The SFGp was connected with the precentral gyrus, caudate, thalamus, and frontal operculum, which are nodes of the motor control network. Resting-state functional connectivity analysis further revealed that the SFGam was mainly correlated with the cognitive control network and the DMN; the SFGdl was correlated with the cognitive execution network and the DMN; and the SFGp was correlated with the sensorimotor-related brain regions. The SFGam and SFGdl were further parcellated into three and two subclusters that are well corresponding to Brodmann areas. These findings suggest that the human SFG consists of multiple dissociable subregions that have distinct connection patterns and that these subregions are involved in different functional networks and serve different functions. These results may improve our understanding on the functional complexity of the SFG and provide us an approach to investigate the SFG at the subregional level.

IJCAI Conference 2013 Conference Paper

TutorialPlan: Automated Tutorial Generation from CAD Drawings

Wei Li
Yuanlin Zhang
George Fitzmaurice

Authoring tutorials for complex software applications is a time consuming process. It also highly depends on the tutorial designer’s skill level and experience. This paper introduces an approach which automatically generates software tutorials using the digital artifacts produced by the users of a software program. We model this process as an optimal planning problem using software produced artifacts, software specifications and the human-computer interaction Keystroke-Level Model (KLM). We present TutorialPlan, an automated tutorial generator, which creates step-by-step text and image instructions from CAD drawings and helps users learn AutoCAD, a complex design and drafting software. In our tutorial generator, the optimal planning problem is represented and solved using DLV, a general Answer Set Programming (ASP) system. DLV offers a natural representation of both the problem and the heuristics needed to solve it efficiently. A user study shows that the tutorials generated by our system are comparable to those generated by experienced AutoCAD users.


YNIMG Journal 2012 Journal Article

3D fiber tractography with susceptibility tensor imaging

Chunlei Liu
Wei Li
Bing Wu
Yi Jiang
G. Allan Johnson

Gradient-echo MRI has revealed anisotropic magnetic susceptibility in the brain white matter. This magnetic susceptibility anisotropy can be measured and characterized with susceptibility tensor imaging (STI). In this study, a method of fiber tractography based on STI is proposed and demonstrated in the mouse brain. STI experiments of perfusion-fixed mouse brains were conducted at 7.0T. The magnetic susceptibility tensor was calculated for each voxel with regularization and decomposed into its eigensystem. The major eigenvector is found to be aligned with the underlying fiber orientation. Following the orientation of the major eigenvector, we are able to map distinctive fiber pathways in 3D. As a comparison, diffusion tensor imaging (DTI) and DTI fiber tractography were also conducted on the same specimens. The relationship between STI and DTI fiber tracts was explored with similarities and differences identified. It is anticipated that the proposed method of STI tractography may provide a new way to study white matter fiber architecture. As STI tractography is based on physical principles that are fundamentally different from DTI, it may also be valuable for the ongoing validation of DTI tractography.

YNIMG Journal 2012 Journal Article

Fast and tissue-optimized mapping of magnetic susceptibility and T2* with multi-echo and multi-shot spirals

Bing Wu
Wei Li
Alexandru Vlad Avram
Sung-Min Gho
Chunlei Liu

Gradient-echo MRI of resonance-frequency shift and T2* values exhibit unique tissue contrast and offer relevant physiological information. However, acquiring 3D-phase images and T2* maps with the standard spoiled gradient echo (SPGR) sequence is lengthy for routine imaging at high-spatial resolution and whole-brain coverage. In addition, with the standard SPGR sequence, optimal signal-to-noise ratio (SNR) cannot be achieved for every tissue type given their distributed resonance frequency and T2* value. To address these two issues, a SNR optimized multi-echo sequence with a stack-of-spiral acquisition is proposed and implemented for achieving fast and simultaneous acquisition of image phase and T2* maps. The analytical behavior of the phase SNR is derived as a function of resonance frequency, T2* and echo time. This relationship is utilized to achieve tissue optimized SNR by combining phase images with different echo times. Simulations and in vivo experiments were designed to verify the theoretical predictions. Using the multi-echo spiral acquisition, whole-brain coverage with 1 mm isotropic resolution can be achieved within 2.5 min, shortening the scan time by a factor of 8. The resulting multi-echo phase map shows similar SNR to that of the standard SPGR. The acquisition can be further accelerated with non-Cartesian parallel imaging. The technique can be readily extended to other multi-shot readout trajectories besides spiral. It may provide a practical acquisition strategy for high resolution and simultaneous 3D mapping of magnetic susceptibility and T2*.

YNIMG Journal 2012 Journal Article

Magnetic susceptibility anisotropy of human brain in vivo and its molecular underpinnings

Wei Li
Bing Wu
Alexandru V. Avram
Chunlei Liu

Frequency shift of gradient-echo MRI provides valuable information for assessing brain tissues. Recent studies suggest that the frequency and susceptibility contrast depend on white matter fiber orientation. However, the molecular underpinning of the orientation dependence is unclear. In this study, we investigated the orientation dependence of susceptibility of human brain in vivo and mouse brains ex vivo. The source of susceptibility anisotropy in white matter is likely to be myelin as evidenced by the loss of anisotropy in the dysmyelinating shiverer mouse brain. A biophysical model is developed to investigate the effect of the molecular susceptibility anisotropy of myelin components, especially myelin lipids, on the bulk anisotropy observed by MRI. This model provides a consistent interpretation of the orientation dependence of macroscopic magnetic susceptibility in normal mouse brain ex vivo and human brain in vivo and the microscopic origin of anisotropic susceptibility. It is predicted by the theoretical model and illustrated by the experimental data that the magnetic susceptibility of the white matter is least diamagnetic along the fiber direction. This relationship allows an efficient extraction of fiber orientation using susceptibility tensor imaging. These results suggest that anisotropy on the molecular level can be observed on the macroscopic level when the molecules are aligned in a highly ordered manner. Similar to the utilization of magnetic susceptibility anisotropy in elucidating molecular structures, imaging magnetic susceptibility anisotropy may also provide a useful tool for elucidating the microstructure of ordered biological tissues.

YNIMG Journal 2011 Journal Article

High-field (9.4T) MRI of brain dysmyelination by quantitative mapping of magnetic susceptibility

Chunlei Liu
Wei Li
G. Allan Johnson
Bing Wu

The multilayered myelin sheath wrapping around nerve axons is essential for proper functioning of the central nervous system. Abnormal myelination leads to a wide range of neurological diseases and developmental disorders. Non-invasive imaging of myelin content is of great clinical importance. The present work demonstrated that loss of myelin in the central nervous system of the shiverer mouse results in a dramatic reduction of magnetic susceptibility in white matter axons. The reduction resulted in a near extinction of susceptibility contrast between gray and white matter. Quantitative magnetic susceptibility imaging and diffusion tensor imaging were conducted on a group of control and shiverer mice at 9.4T. We measured the resonance frequency distribution of the whole brain for each mouse. Magnetic susceptibility maps were computed and compared between the two groups. It was shown that the susceptibility contrast between gray and white matter was reduced by 96% in the shiverer compared to the controls. Diffusion measurements further confirmed intact fiber pathways in the shiverer mice, ruling out the possibility of axonal injury and its potential contribution to the altered susceptibility. As an autosomal recessive mutation, shiverer is characterized by an almost total lack of central nervous system myelin. Our data provide new evidences indicating that myelin is the predominant source of susceptibility differences between deep gray and white matter observed in magnetic resonance imaging. More importantly, the present study suggests that quantitative magnetic susceptibility is a potential endogenous biomarker for myelination.

YNIMG Journal 2011 Journal Article

Quantitative susceptibility mapping of human brain reflects spatial variation in tissue composition

Wei Li
Bing Wu
Chunlei Liu

Image phase from gradient echo MRI provides a unique contrast that reflects brain tissue composition variations, such as iron and myelin distribution. Phase imaging is emerging as a powerful tool for the investigation of functional brain anatomy and disease diagnosis. However, the quantitative value of phase is compromised by its nonlocal and orientation dependent properties. There is an increasing need for reliable quantification of magnetic susceptibility, the intrinsic property of tissue. In this study, we developed a novel and accurate susceptibility mapping method that is also phase-wrap insensitive. The proposed susceptibility mapping method utilized two complementary equations: (1) the Fourier relationship of phase and magnetic susceptibility; and (2) the first-order partial derivative of the first equation in the spatial frequency domain. In numerical simulation, this method reconstructed the susceptibility map almost free of streaking artifact. Further, the iterative implementation of this method allowed for high quality reconstruction of susceptibility maps of human brain in vivo. The reconstructed susceptibility map provided excellent contrast of iron-rich deep nuclei and white matter bundles from surrounding tissues. Further, it also revealed anisotropic magnetic susceptibility in brain white matter. Hence, the proposed susceptibility mapping method may provide a powerful tool for the study of brain physiology and pathophysiology. Further elucidation of anisotropic magnetic susceptibility in vivo may allow us to gain more insight into the white matter micro-architectures.

IS Journal 2010 Journal Article

Social Learning

Qiang Yang
Zhi-Hua Zhou
Wenji Mao
Wei Li
Nathan Nan Liu

In recent years, social behavioral data have been exponentially expanding due to the tremendous success of various outlets on the social Web (aka Web 2.0) such as Facebook, Digg, Twitter, Wikipedia, and Delicious. As a result, there's a need for social learning to support the discovery, analysis, and modeling of human social behavioral data. The goal is to discover social intelligence, which encompasses a spectrum of knowledge that characterizes human interaction, communication, and collaborations. The social Web has thus become a fertile ground for machine learning and data mining research. This special issue gathers the state-of-the-art research in social learning and is devoted to exhibiting some of the best representative works in this area.

AAAI Conference 2008 Conference Paper

Exploiting Causal Independence Using Weighted Model Counting

Wei Li

Previous studies have demonstrated that encoding a Bayesian network into a SAT-CNF formula and then performing weighted model counting using a backtracking search algorithm can be an effective method for exact inference in Bayesian networks. In this paper, we present techniques for improving this approach for Bayesian networks with noisy-OR and noisy-MAX relations, two relations which are widely used in practice as they can dramatically reduce the number of probabilities one needs to specify. In particular, we present two space efficient CNF encodings for noisy-OR/MAX and explore alternative search ordering heuristics. We experimentally evaluated our techniques on large-scale real and randomly generated Bayesian networks. On these benchmarks, our techniques gave speedups of up to two orders of magnitude over the best previous approaches and scaled up to networks with larger numbers of random variables.

TCS Journal 2006 Journal Article

Many hard examples in exact phase transitions

Ke Xu
Wei Li

This paper analyzes the resolution complexity of two random constraint satisfaction problem (CSP) models (i.e. Model RB/RD) for which we can establish the existence of phase transitions and identify the threshold points exactly. By encoding CSPs into CNF formulas, it is proved that almost all instances of Model RB/RD have no tree-like resolution proofs of less than exponential size. Thus, we not only introduce new families of CSPs and CNF formulas hard to solve, which can be useful in the experimental evaluation of CSP and SAT algorithms, but also propose models with both many hard instances and exact phase transitions. Finally, conclusions are presented, as well as a detailed comparison of Model RB/RD with the Hamiltonian cycle problem and random 3-SAT, which, respectively, exhibit three different kinds of phase transition behavior in NP-complete problems.

AAAI Conference 2006 Conference Paper

Performing Incremental Bayesian Inference by Dynamic Model Counting

Wei Li

The ability to update the structure of a Bayesian network when new data becomes available is crucial for building adaptive systems. Recent work by Sang, Beame, and Kautz (AAAI 2005) demonstrates that the well-known Davis-Putnam procedure combined with a dynamic decomposition and caching technique is an effective method for exact inference in Bayesian networks with high density and width. In this paper, we define dynamic model counting and extend the dynamic decomposition and caching technique to multiple runs on a series of problems with similar structure. This allows us to perform Bayesian inference incrementally as the structure of the network changes. Experimental results show that our approach yields significant improvements over the previous model counting approaches on multiple challenging Bayesian network instances.

AAAI Conference 2005 Conference Paper

Semi-Supervised Sequence Modeling with Syntactic Topic Models

Wei Li

Although there has been significant previous work on semi-supervised learning for classification, there has been relatively little in sequence modeling. This paper presents an approach that leverages recent work in manifold-learning on sequences to discover word clusters from language data, including both syntactic classes and semantic topics. From unlabeled data we form a smooth, low-dimensional feature space, where each word token is projected based on its underlying role as a function or content word. We then use this projection as additional input features to a linear-chain conditional random field trained on limited labeled training data. On standard part-of-speech tagging and Chinese word segmentation data sets we show as much as 14% error reduction due to the unlabeled data, and also statistically-significant improvements over a related semi-supervised sequence tagging method due to Miller et al.

YNIMG Journal 2003 Journal Article

Neural development of selective attention and response inhibition

James R Booth
Douglas D Burman
Joel R Meyer
Zhang Lei
Barbara L Trommer
Nicholas D Davenport
Wei Li
Todd B Parrish

Brain activation differences between 12 children (9- to 12-year-olds) and 12 adults (20- to 30-year-olds) were examined on two cognitive tasks during functional magnetic resonance imaging (fMRI). Spatial selective attention was measured with the visual search for a conjunction target (red triangle) in a field of distracters and response inhibition was measured with a go no-go task. There were small developmental differences in the selective attention task, with children showing greater activation than adults in the anterior cingulate and thalamus. There were large developmental differences in the response inhibition task, with children showing greater activation than adults in a fronto-striatal network including middle cingulate, medial frontal gyrus, medial aspects of bilateral superior frontal gyrus, and the caudate nucleus on the left. Children also showed greater bilateral activation for the response inhibition task in posterior cingulate, thalamus and the hippocampo-amygdaloid region. The extensive developmental differences on the response inhibition task are consistent with the prolonged maturation of the fronto-striatal network.

EAAI Journal 1994 Journal Article

Fuzzy-logic-based reactive behavior control of an autonomous mobile system in unknown environments

Wei Li

This paper presents a method for fuzzy-logic-based reactive behavior control of an autonomous mobile system in unknown environments. Difficulties in behavior-based control arise mainly from the quantitative formulation of reactive behavior as well as from the need for efficient coordination of conflicts and competition among multiple types of reactive behavior. The main idea of the present study is to incorporate fuzzy logic control with behavior-based control such that types of reactive behavior are formulated by fuzzy sets and fuzzy rules, and conflicts and competition among different types are coordinated by fuzzy reasoning. The inputs to the control scheme consist of a heading angle between the robot and a specified target and the distances between the robot and the obstacles to the left, front, and right locations, acquired by an array of ultrasonic sensors. The outputs from the control scheme are commands for the speed control unit of two rear wheels of a mobile robot. Simulation results show that the proposed method can be applied to efficient robot navigation in complex and unknown environments by weighting different varieties of reactive behavior, such as avoiding obstacles, following edges, moving towards a target, and so forth. In addition, this method is suitable for robot navigation by multisensor fusion and integration.

IJCAI Conference 1993 Conference Paper

W: A Logic System Based on the Shared Common Knowledge Views

Xianchang Wang
Huowang Chen
Quingping Zhao
Wei Li

In this paper, we give a logic system, W, based on the view of shared common knowledge, and prove some properties of W. By W, we effectively describe and solve the Conway Paradox, the typical multi-agent problem involving common knowledge.