Arrow Research search

Author name cluster

Shuo Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

82 papers
2 author rows

Possible papers

82

AAAI Conference 2026 Conference Paper

A Brain-Inspired Saliency Prediction Framework for Human-AI Cognitive Consistency in AIGC Content via Multi-Region Liquid Neurons

  • Shibo Wang
  • Yan Zhao
  • Shigang Wang
  • Jian Wei
  • Shuo Li

In recent years, human-AI cognitive consistency has emerged as a crucial perspective for evaluating the perceptual quality and interpretability of AIGC (Artificial Intelligence Generated Content). This paper proposes a biologically inspired saliency prediction framework that models six core regions of the human visual system—namely V1, V2, V4, MT, LIP, and FEF—using liquid neurons to capture the dynamic saliency features aligned with human gaze behavior. To enable effective alignment between AIGC models and human cognitive mechanisms, we introduce a cross-domain dual-teacher distillation strategy and construct a large-scale multimodal dataset comprising natural images, eye-tracking data, AIGC-generated images, and their corresponding cross-attention maps. Furthermore, we propose HAMCI (Human-AI Mutual Cognitive Index), a novel metric designed to quantitatively assess the spatial and semantic alignment between predicted saliency maps and model attention distributions. The proposed method demonstrates promising performance across various saliency prediction and cognitive alignment tasks, with results comparable to or surpassing recent state-of-the-art methods in several benchmarks. The code and dataset will be released upon acceptance to facilitate future research on cognitively aligned AIGC evaluation.

AAAI Conference 2026 Conference Paper

Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation

  • Fanding Li
  • Xiangyu Li
  • Xianghe Su
  • Xingyu Qiu
  • Suyu Dong
  • Wei Wang
  • Kuanquan Wang
  • Gongning Luo

Simultaneously enhancing the accuracy and diversity of predictions remains a challenge in ambiguous medical image segmentation (AMIS) due to the inherent trade-off between them. While truncated diffusion probabilistic models (TDPMs) hold strong potential through their optimized paradigm, existing TDPMs entangle the accuracy and diversity of predictions and suffer from insufficient fidelity and plausibility. To address these challenges, we propose Ambiguity-aware Truncated Flow Matching (ATFM), which introduces a novel inference paradigm and dedicated model components. First, we propose Data-Hierarchical Inference, a redefinition of the AMIS-specific inference paradigm that enhances accuracy at the data-distribution level and diversity at the data-sample level, for an effective disentanglement. Second, Gaussian Truncation Representation (GTR) is introduced to enhance both the fidelity of predictions and the reliability of the truncation distribution by explicitly modeling it as a Gaussian distribution at T_trunc instead of using sampling-based approximations. Third, Segmentation Flow Matching (SFM) is proposed to enhance the plausibility of diverse predictions by extending semantic-aware flow transformation in Flow Matching (FM). Comprehensive evaluations on the LIDC and ISIC3 datasets demonstrate that ATFM outperforms SOTA methods while achieving more efficient inference, improving GED and HM-IoU by up to 12% and 7.3% over advanced methods.
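As a rough illustration of the truncated flow-matching idea, a training pair can be drawn on a linear interpolation path restricted to [t_trunc, 1], with the truncation state treated as Gaussian in the spirit of GTR. This is a minimal sketch under our own assumptions; the function name, path choice, and variable names are ours, not the paper's.

```python
import torch

def fm_training_pair(x1, t_trunc=0.5):
    """Sketch of a truncated flow-matching training pair (illustrative,
    not the authors' code): sample times only on [t_trunc, 1], treat the
    state at t_trunc as Gaussian, and return the noisy state x_t together
    with the constant target velocity the model would regress."""
    x0 = torch.randn_like(x1)                         # Gaussian truncation state
    t = t_trunc + (1 - t_trunc) * torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1                       # linear probability path
    v_target = x1 - x0                                # target velocity field
    return x_t, v_target
```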

JBHI Journal 2026 Journal Article

CalDiff: Calibrating Uncertainty and Accessing Reliability of Diffusion Models for Trustworthy Lesion Segmentation

  • Xinxin Wang
  • Mingrui Yang
  • Sercan Tosun
  • Kunio Nakamura
  • Shuo Li
  • Xiaojuan Li

Low reliability has consistently been a challenge in the application of deep learning models for high-risk decision-making scenarios. In medical image segmentation, multiple expert annotations can be consulted to reduce subjective bias and reach a consensus, thereby enhancing segmentation accuracy and reliability. To develop a reliable lesion segmentation model, we propose CalDiff, a novel framework that can leverage the uncertainty from multiple annotations, capture real-world diagnostic variability, and provide more informative predictions. To harness the superior generative ability of diffusion models, a dual step-wise and sequence-aware calibration mechanism is proposed on the basis of the sequential nature of diffusion models. We evaluate the calibrated model through a comprehensive quantitative and visual analysis, addressing the previously overlooked challenge of assessing uncertainty calibration and model reliability in scenarios with multiple annotations and multiple predictions. Experimental results on two lesion segmentation datasets demonstrate that CalDiff produces uncertainty maps that reflect low-confidence areas, further indicating false predictions made by the model. By calibrating the uncertainty in the training phase, the uncertain areas produced by our model are closely correlated with areas where the model makes errors at inference. In summary, the uncertainty captured by CalDiff can serve as a powerful indicator, which can help mitigate the risks of adopting the model's outputs, allowing clinicians to prioritize reviewing areas or slices with higher uncertainty and enhancing the model's reliability and trustworthiness in clinical practice.
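The multiple-prediction uncertainty idea above can be sketched generically: sample several segmentation maps and use per-pixel variance as the uncertainty map. This is our simplification only; CalDiff's actual mechanism is a step-wise, sequence-aware calibration that this toy omits.

```python
import numpy as np

def consensus_and_uncertainty(samples):
    """Given N segmentation probability maps sampled from a generative model,
    return (per-pixel mean, per-pixel variance). High-variance regions flag
    likely errors for clinician review. Illustrative sketch, not CalDiff."""
    s = np.stack(samples)                  # (N, H, W)
    return s.mean(axis=0), s.var(axis=0)   # consensus map, uncertainty map
```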

JBHI Journal 2026 Journal Article

Causality-Adjusted Data Augmentation for Domain Continual Medical Image Segmentation

  • Zhanshi Zhu
  • Qing Dong
  • Gongning Luo
  • Wei Wang
  • Suyu Dong
  • Kuanquan Wang
  • Ye Tian
  • Guohua Wang

In domain continual medical image segmentation, distillation-based methods mitigate catastrophic forgetting by continuously reviewing old knowledge. However, these approaches often exhibit biases towards both new and old knowledge simultaneously due to confounding factors, which can undermine segmentation performance. To address these biases, we propose the Causality-Adjusted Data Augmentation (CauAug) framework, introducing a novel causal intervention strategy called the Texture-Domain Adjustment Hybrid-Scheme (TDAHS) alongside two causality-targeted data augmentation approaches: the Cross Kernel Network (CKNet) and the Fourier Transformer Generator (FTGen). (1) TDAHS establishes a domain-continual causal model that accounts for two types of knowledge biases by identifying irrelevant local textures (L) and domain-specific features (D) as confounders. It introduces a hybrid causal intervention that combines traditional confounder elimination with a proposed replacement approach to better adapt to domain shifts, thereby promoting causal segmentation. (2) CKNet eliminates confounder L to reduce biases in new knowledge absorption. It decreases reliance on local textures in input images, forcing the model to focus on relevant anatomical structures and thus improving generalization. (3) FTGen causally intervenes on confounder D by selectively replacing it to alleviate biases that impact old knowledge retention. It restores domain-specific features in images, aiding in the comprehensive distillation of old knowledge. Our experiments show that CauAug significantly mitigates catastrophic forgetting and surpasses existing methods in various medical image segmentation tasks.
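FTGen's exact Fourier-based restoration of domain-specific features is not detailed here; a well-known related technique, Fourier amplitude swapping, gives the flavor of replacing domain appearance while keeping structure (phase). Illustrative only; the `beta` band fraction and layout are our assumptions.

```python
import numpy as np

def fourier_amplitude_swap(x_new, x_old, beta=0.1):
    """Swap the centered low-frequency amplitude band of a new-domain image
    with that of an old-domain image, restoring old domain-specific appearance
    while preserving the new image's structure (phase). Sketch of a generic
    Fourier-domain replacement, not FTGen itself."""
    F_new, F_old = np.fft.fft2(x_new), np.fft.fft2(x_old)
    amp, pha = np.abs(F_new), np.angle(F_new)
    h, w = x_new.shape
    bh, bw = int(h * beta), int(w * beta)
    amp_s, amp_os = np.fft.fftshift(amp), np.fft.fftshift(np.abs(F_old))
    ch, cw = h // 2, w // 2
    amp_s[ch - bh:ch + bh, cw - bw:cw + bw] = amp_os[ch - bh:ch + bh, cw - bw:cw + bw]
    amp = np.fft.ifftshift(amp_s)
    return np.real(np.fft.ifft2(amp * np.exp(1j * pha)))
```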

AAAI Conference 2026 Conference Paper

DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

  • Daojun Liang
  • Jing Chen
  • Xiao Wang
  • Yinglong Wang
  • Shuo Li

Time-Series (TS) exhibits pronounced non-stationarity. Consequently, most forecasting methods display compromised robustness to concept drift, despite the prevalent application of instance normalization. We tackle this challenge by first analysing concept drift through a bias-variance lens and proving that weighted ensemble reduces variance without increasing bias. These insights motivate DeepBooTS, a novel end-to-end dual-stream residual-decreasing boosting method that progressively reconstructs the intrinsic signal. In our design, each block of a deep model becomes an ensemble of learners with an auxiliary output branch forming a highway to the final prediction. The block-wise outputs correct the residuals of previous blocks, leading to a learning-driven decomposition of both inputs and targets. This method enhances versatility and interpretability while substantially improving robustness to concept drift. Extensive experiments, including those on large-scale datasets, show that the proposed method outperforms existing methods by a large margin, yielding an average performance improvement of 15.8% across various datasets, establishing a new benchmark for TS forecasting.
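The block-wise residual-boosting design described above can be sketched in a few lines: each block refines features, emits an auxiliary forecast, and passes a residual forward, so the final prediction is the sum of per-block outputs. A toy sketch with our own layer choices, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ResidualBoostBlock(nn.Module):
    """One block: refines features and emits an auxiliary forecast."""
    def __init__(self, dim, horizon):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, horizon)  # auxiliary branch ("highway")

    def forward(self, x):
        h = self.body(x)
        return x - h, self.head(h)  # residual forward, block-wise forecast

class BoostedForecaster(nn.Module):
    """Stack of blocks; the prediction is the sum of block-wise outputs, so each
    block learns to correct the residual left by its predecessors."""
    def __init__(self, dim, horizon, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBoostBlock(dim, horizon) for _ in range(n_blocks))

    def forward(self, x):
        total = 0
        for blk in self.blocks:
            x, y = blk(x)
            total = total + y
        return total
```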

JBHI Journal 2026 Journal Article

Direct PET-to-CT Generation for Attenuation Correction: A Slice-to-Slice Continual Transformer Segmentation-Aware Network

  • Rongjun Ge
  • Hanyuan Zheng
  • Yuxin Liu
  • Liutao Yang
  • Li Wang
  • Xu Ji
  • Jingtao Shen
  • Nan Li

Direct synthetic computed tomography (CT) generation from positron emission tomography (PET) plays a crucial role in PET attenuation correction, while also providing detailed structural information to complement functional imaging. Compared to the widely used PET/CT and indirect PET/MR-CT, the direct PET-to-CT translation method (denoted as PET-to-CT) offers several advantages: 1) The CT required for PET-to-CT is obtained directly from PET, avoiding the intermediate errors introduced by the multimodal scanning steps of PET/CT and PET/MR-CT. 2) Direct PET-to-CT eliminates the need for supplementary imaging equipment, reducing complexity and scan duration compared with PET/CT and PET/MR-CT imaging. Direct PET-to-CT is therefore highly promising for clinical applications. However, it faces challenges, including spatial resolution mismatches between PET and CT, as well as voxel-wise semantic differences arising from functional versus structural imaging. To address these challenges, this paper proposes a 2D hierarchical method called the S2SCT (Slice-to-Slice Continual Transformer)-SA (Segmentation-Aware) Network. It uses a slice-continual network to acquire semantic transformation knowledge from each PET slice to a CT slice, facilitating the conversion between the functional and structural imaging domains. Subsequently, the segmentation-aware network is designed to further capture spatial correlations both between and within slices, resulting in improved CT spatial resolution. Experimental results demonstrate that our proposed method outperforms mainstream methods in both CT generation and attenuation correction, as evidenced by both visual results and metric values.

JBHI Journal 2026 Journal Article

EEG-VLM: A Hierarchical Vision-Language Model With Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction

  • Xihe Qiu
  • Gengchen Ma
  • Haoyu Wang
  • Chen Zhan
  • Xiaoyu Tan
  • Shuo Li

Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time–frequency patterns and achieve clinical interpretability. Recently, vision–language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision–language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.

AAAI Conference 2026 Conference Paper

HTTrack: Learning to Perceive Targets via Historical Trajectories in Satellite Video Tracking

  • Jiahao Wang
  • Fang Liu
  • Licheng Jiao
  • Hao Wang
  • Shuo Li
  • Xinyi Wang
  • Lingling Li
  • Puhua Chen

In recent years, the rapid progress of deep learning has driven notable advancements in satellite video tracking, a critical task for applications such as environmental monitoring, disaster management, and defense. Despite these strides, existing approaches remain constrained by their inability to handle dynamic challenges, such as target appearance variations, complex motion patterns, and occlusions. Traditional methods often suffer from static template matching or overly complex update mechanisms, compromising their robustness and practicality in real-world scenarios. To address these limitations, we propose a paradigm shift in satellite video tracking by integrating historical trajectory knowledge with visual features. This fusion enhances the tracker's perceptual understanding of targets over time, enabling more adaptive and resilient tracking. By aligning spatial, temporal, and cross-modal information, our approach effectively bridges the gap between fragmented observations and coherent tracking performance, even under challenging conditions like small target detection and cluttered backgrounds. Extensive experiments conducted on multiple satellite video tracking benchmarks demonstrate the superiority of our method, with HTTrack achieving success rates of 51.5% on SV248S, 52.9% on SatSOT, and 32.6% on VISO, significantly outperforming state-of-the-art trackers and marking a step forward in achieving robust, accurate, and scalable satellite video tracking.

JBHI Journal 2026 Journal Article

Radiomics-Driven Diffusion Model and Monte Carlo Compression Sampling for Reliable Medical Image Synthesis

  • Jianfeng Zhao
  • Shuo Li

Reliable medical image synthesis is crucial for clinical applications and downstream tasks, where high-quality anatomical structure and predictive confidence are essential. Existing studies have made significant progress by embedding prior conditional knowledge, such as conditional images or textual information, to synthesize natural images. However, medical image synthesis remains a challenging task due to: 1) Data scarcity: high-quality medical text prompts are extremely rare and require specialized expertise. 2) Insufficient uncertainty estimation: uncertainty estimation is critical for evaluating the confidence of reliable medical image synthesis. This paper presents a novel approach for medical image synthesis, driven by radiomics prompts and combined with Monte Carlo Compression Sampling (MCCS) to ensure reliability. For the first time, our method leverages clinically focused radiomics prompts to condition the generation process, guiding the model to produce reliable medical images. Furthermore, the innovative MCCS algorithm employs Monte Carlo methods to randomly select and compress sampling steps within denoising diffusion implicit models (DDIM), enabling efficient uncertainty quantification. Additionally, we introduce a MambaTrans architecture to model long-range dependencies in medical images and embed prior conditions (e.g., radiomics prompts). Extensive experiments on benchmark medical imaging datasets demonstrate that our approach significantly improves image quality and reliability, outperforming SoTA methods in both qualitative and quantitative evaluations.
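The step-compression idea behind MCCS can be sketched as random subsampling of the DDIM timestep schedule: repeated random draws yield an ensemble of denoising trajectories whose sample variance serves as an uncertainty estimate. This reflects our reading of the abstract; the actual selection and compression rules may differ.

```python
import random

def mccs_schedule(T=1000, k=50, seed=None):
    """Sketch of Monte Carlo step compression: randomly select a compressed
    subset of k denoising steps out of T, sorted descending so a DDIM-style
    sampler can jump between them. Repeated draws with different seeds give
    an ensemble of trajectories for uncertainty quantification."""
    rng = random.Random(seed)
    steps = rng.sample(range(T), k)
    return sorted(steps, reverse=True)

# Ensemble of compressed schedules -> ensemble of samples -> variance map
schedules = [mccs_schedule(seed=s) for s in range(8)]
```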

AAAI Conference 2026 Conference Paper

Semantic Feature Purification for Adversarially-Aware RGB-T Tracking

  • Jiahao Wang
  • Fang Liu
  • Hao Wang
  • Shuo Li
  • Xinyi Wang
  • Puhua Chen

RGB-T tracking is increasingly deployed in safety-critical applications such as autonomous driving, surveillance, and rescue robotics, where tracking reliability is essential under adverse conditions. Although the fusion of RGB and thermal infrared (TIR) modalities offers improved robustness in low-light and occluded scenes, recent findings show that RGB-T trackers remain highly susceptible to subtle input perturbations, human-imperceptible modifications that exploit cross-modal inconsistencies to mislead tracking outputs. In real-world scenarios, such perturbations can arise from sensor spoofing, infrared camouflage, or physical-world attacks, posing serious risks to operational safety. To address this, we propose SFPT, a Semantic Feature Purification framework that enhances RGB-T tracking at the representation level. Rather than filtering corrupted inputs at the pixel level, SFPT introduces task-specific semantic anchors into the feature space to reinforce perturbation-invariant cues. These anchors, derived from descriptive language, interact with visual features to purify representations. To further suppress modality-specific interference, we design an Adaptive Perturbation-Guided Cross-Modal Fusion (APG-CMF) module, which leverages language and visual signals to estimate reliability and dynamically reweight cross-modal features, ensuring robust fusion under perturbation conditions. Extensive experiments under diverse perturbation conditions validate the effectiveness of our approach. Notably, SFPT maintains performance comparable to clean settings even when subjected to perturbations of strength 1/255 and 4/255, demonstrating strong resilience to real-world interference.

JBHI Journal 2026 Journal Article

Source-Resilient Joint Learning Framework for Preserving Stable Generalization on Diverse Ultrasonic Source Scenarios

  • Bin Huang
  • Zhong Liu
  • Ziyue Xu
  • Shing-Chow Chan
  • Huiying Wen
  • Chao HOU
  • Qicai Huang
  • Meiqin Jiang

Joint learning on diverse ultrasonic source scenarios presents a challenge in preserving stable generalization, owing to the heterogeneity of different sources combined with the inconsistency of joint learning features. Previous joint learning studies, which are not source-resilient frameworks, may not preserve stable generalization when trained on diverse source scenarios. Furthermore, the limited variation in single-source data and the interference inherent to ultrasound imaging, both common in ultrasonic source scenarios, further degrade generalization. To address these problems, we propose a source-resilient joint learning framework consisting of three stages: 1) Source transforming, where our 1-to-N transformation unifies diverse source scenarios for source resiliency. 2) Feature enhancement modules that model the source-resilient joint learning network, including a manifold-constraint normalization module (MCNM) that addresses heterogeneity by minimizing a manifold-based loss, a task-consistent attention module (TCAM) that shares multi-scale features via self-attention to address inconsistency, and an adaptive feature-shifting module (AFSM) that performs feature-level augmentation to compensate for limited single-source data. 3) Ultrasound-hybrid linear mapping (USmapping), which cascades speckle randomization and mask-guided Monge-Kantorovich linear mapping to achieve ultrasonic style randomization, addressing the interference of ultrasonic data. Our framework was evaluated on eight ultrasound datasets from various scanners at multiple centers and surpassed previous comparable studies in both segmentation (weighted-average DSC of 75.7%) and classification (weighted-average AUROC of 68.8%) tasks. Our framework has the potential to serve as a general framework for enhancing the performance of joint learning under diverse ultrasonic source scenarios.

JBHI Journal 2026 Journal Article

TKRL: Targeted Knowledge Rectification Learning Against Teacher-Originated Defects in Domain Continual Segmentation

  • Zhanshi Zhu
  • Wenjian Gu
  • Xiangyu Li
  • Qince Li
  • Yongfeng Yuan
  • Wei Wang
  • Kuanquan Wang
  • Suyu Dong

Knowledge distillation can mitigate catastrophic forgetting in domain continual segmentation by transferring knowledge from the older model to the newer model. However, existing distillation-based methods primarily emphasize knowledge retention while overlooking inherent defects in the older teacher models. As a result, these teacher-originated defects, such as knowledge gaps or biases, are propagated and exacerbate forgetting. To address this challenge, we propose a Targeted Knowledge Rectification Learning framework (TKRL) to probe and correct teacher-originated defects. TKRL consists of two modules: (1) Probe-augmented Class Distillation, which generates gradient-driven “probes” to uncover underrepresented features in the older model, thereby bridging knowledge gaps by distilling hidden information into the new model; (2) Variance-guided Masked Autoencoder, which selectively masks and reconstructs critical high-uncertainty patches across multi-level semantic regions, thereby correcting biases inherited from the older model. Our experimental results show that TKRL effectively rectifies knowledge gaps and biases, thereby mitigating catastrophic forgetting and enhancing performance in domain continual segmentation. The implementation code is publicly available at: https://github.com/PerceptionComputingLab/TKRL_DCMIS.

AAAI Conference 2026 Conference Paper

Towards Long-window Anchoring in Vision-Language Model Distillation

  • Haoyi Zhou
  • Shuo Li
  • Tianyu Chen
  • Qi Song
  • Chonghan Gao
  • Jianxin Li

While large vision-language models (VLMs) demonstrate impressive long-context understanding, their prevalent small counterparts fail at language-image alignment due to limited window size. We discover that knowledge distillation improves the student's capability as a complement to Rotary Position Embeddings (RoPE) at certain window sizes (anchored from large models). Building on this insight, we propose LAid, which explicitly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.

AAAI Conference 2026 Conference Paper

What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

  • Xiaoran Fan
  • Zhichao Sun
  • Yangfan Gao
  • Jingfei Xiong
  • Hang Yan
  • Yifei Cao
  • Jiajun Sun
  • Shuo Li

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
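The multi-token prediction (MTP) idea above, where each hidden state decodes several consecutive speech tokens, can be sketched with k parallel output heads. A minimal sketch; the paper's actual head design and vocabulary are not specified here.

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Sketch of multi-token prediction: k parallel linear heads decode k
    consecutive speech tokens from each hidden state, so the LM needs roughly
    k-times fewer forward steps to emit the same amount of audio tokens."""
    def __init__(self, d_model, vocab, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))

    def forward(self, h):  # h: (batch, seq, d_model)
        # -> (batch, seq, k, vocab): k token predictions per position
        return torch.stack([head(h) for head in self.heads], dim=2)
```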

AAAI Conference 2025 Conference Paper

A Trusted Lesion-assessment Network for Interpretable Diagnosis of Coronary Artery Disease in Coronary CT Angiography

  • Xinghua Ma
  • Xinyan Fang
  • Mingye Zou
  • Gongning Luo
  • Wei Wang
  • Kuanquan Wang
  • Zhaowen Qiu
  • Xin Gao

Coronary Artery Disease (CAD) poses a significant threat to cardiovascular patients worldwide, underscoring the critical importance of automated CAD diagnostic technologies in clinical practice. Previous technologies for lesion assessment in Coronary CT Angiography (CCTA) images have been insufficient in terms of interpretability, resulting in solutions that lack clinical reliability in both network architecture and prediction outcomes, even when diagnoses are accurate. To address the limitation of interpretability, we introduce the Trusted Lesion-Assessment Network (TLA-Net), which provides a clinically reliable solution for multi-view CAD diagnosis: (1) The causality-informed evidence collection constructs a causal graph for the diagnostic process and implements causal interventions, preventing confounders' interference and enhancing the transparency of the network architecture. (2) The clinically-aligned uncertainty integration hierarchically combines Dirichlet distributions from various views based on clinical priors, offering confidence coefficients for prediction outcomes that align with physicians' image analysis procedures. Experimental results on a dataset of 2,618 lesions demonstrate that TLA-Net, supported by its interpretable methodological design, exhibits superior performance with outstanding generalization, domain adaptability, and robustness.

NeurIPS Conference 2025 Conference Paper

Alignment of Large Language Models with Constrained Learning

  • Botong Zhang
  • Shuo Li
  • Ignacio Hounie
  • Osbert Bastani
  • Dongsheng Ding
  • Alejandro Ribeiro

We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF and Anthropic HH-RLHF datasets.
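The alternation described above, Lagrangian maximization over the policy followed by dual descent on the multiplier, can be sketched as a generic projected dual-descent loop. The helpers `lagrangian_max`, `utility`, and the constraint level `b` are hypothetical placeholders for illustration, not the paper's API.

```python
def dual_align(lagrangian_max, utility, b, lam0=0.0, eta=0.1, iters=50):
    """Sketch of iterative dual-based alignment for a constraint utility >= b.
    lagrangian_max(lam) returns the policy maximizing reward + lam * utility
    (the primal step); the dual variable is then updated by projected descent
    on the dual gradient (utility - b), keeping lam >= 0."""
    lam = lam0
    policy = lagrangian_max(lam)
    for _ in range(iters):
        policy = lagrangian_max(lam)                        # primal step
        lam = max(0.0, lam - eta * (utility(policy) - b))   # dual descent
    return policy, lam
```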

AAAI Conference 2025 Conference Paper

An LLM-Empowered Adaptive Evolutionary Algorithm for Multi-Component Deep Learning Systems

  • Haoxiang Tian
  • Xingshuo Han
  • Guoquan Wu
  • An Guo
  • Yuan Zhou
  • Jie Zhang
  • Shuo Li
  • Jun Wei

Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing search efficiency while maintaining diversity. To address these challenges, this paper proposes the first LLM-empowered adaptive evolutionary search algorithm for detecting safety violations in MCDL systems. Inspired by the context-understanding ability of Large Language Models (LLMs), our approach prompts the LLM to comprehend the optimization problem and generate an initial population tailored to the evolutionary objectives. Subsequently, it employs adaptive selection and variation to iteratively produce offspring, balancing evolutionary efficiency and diversity. During the evolutionary process, to navigate away from local optima, our approach feeds the evolutionary experience back into the LLM. This harnesses the LLM's quantitative reasoning prowess to generate differential seeds, breaking away from current optimal solutions. We evaluate our approach on finding safety violations of MCDL systems and compare its performance with state-of-the-art MOEA methods. Experimental results show that our approach significantly improves the efficiency and diversity of the evolutionary search.

JBHI Journal 2025 Journal Article

Common-Unique Decomposition Driven Diffusion Model for Contrast-Enhanced Liver MR Images Multi-Phase Interconversion

  • Chenchu Xu
  • Shijie Tian
  • Boyan Wang
  • Jie Zhang
  • Kemal Polat
  • Adi Alhudhaif
  • Shuo Li

All three contrast-enhanced (CE) phases (e.g., Arterial, Portal Venous, and Delay) are crucial for diagnosing liver tumors. However, acquiring all three phases is constrained by contrast agent (CA) risks, long imaging times, and strict imaging criteria. In this paper, we propose a novel Common-Unique Decomposition Driven Diffusion Model (CUDD-DM) capable of converting any two of the three input phases into the remaining one, thereby reducing patient wait time, conserving medical resources, and reducing the use of CAs. 1) The Common-Unique Feature Decomposition Module uses spectral decomposition to capture both common and unique features among different inputs; it not only learns correlations in highly similar areas between the two input phases but also learns differences in differing areas, laying a foundation for synthesizing the remaining phase. 2) The Multi-scale Temporal Reset Gates Module bidirectionally compares lesions in the current and multiple historical slices, maximizing reliance on previous slices when no lesions are present and minimizing it when lesions are present, thereby preventing interference between consecutive slices. 3) The Diffusion Model-Driven Lesion Detail Synthesis Module employs a continuous and progressive generation process to accurately capture detailed features between data distributions, avoiding the loss of detail caused by traditional methods (e.g., GANs) that overfocus on global distributions. Extensive experiments on a generalized CE liver tumor dataset demonstrate that CUDD-DM achieves state-of-the-art performance, improving SSIM by at least 2.2% (5.3% in lesion areas) compared with the seven leading methods. These results demonstrate that CUDD-DM advances CE liver tumor imaging technology.

ICLR Conference 2025 Conference Paper

Conformal Structured Prediction

  • Botong Zhang
  • Shuo Li
  • Osbert Bastani

Conformal prediction has recently emerged as a promising strategy for quantifying the uncertainty of a predictive model; these algorithms modify the model to output sets of labels that are guaranteed to contain the true label with high probability. However, existing conformal prediction algorithms have largely targeted classification and regression settings, where the prediction set has the simple form of a level set of the scoring function. For complex structured outputs such as text generation, these prediction sets might include a large number of labels and therefore be hard for users to interpret. In this paper, we propose a general framework for conformal prediction in the structured prediction setting that modifies existing conformal prediction algorithms to output structured prediction sets implicitly representing sets of labels. In addition, we demonstrate how our approach can be applied in domains where the prediction sets can be represented as a set of nodes in a directed acyclic graph; for instance, for hierarchical labels such as image classification, a prediction set might be a small subset of coarse labels implicitly representing the prediction set of all their finer-grained descendants. We demonstrate how our algorithm can be used to construct prediction sets that satisfy a desired coverage guarantee in several domains.
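For context, the classical split conformal procedure that this work generalizes computes a finite-sample-corrected quantile of calibration nonconformity scores; the prediction set for a new input is then every label whose score falls below that threshold. This is the standard construction, not the paper's structured variant.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: return the threshold qhat such that
    prediction sets {y : score(x, y) <= qhat} contain the true label with
    probability >= 1 - alpha, using the (n+1)(1-alpha)/n quantile correction."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")
```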

TMLR Journal 2025 Journal Article

Conformalized Credal Regions for Classification with Ambiguous Ground Truth

  • Michele Caprio
  • David Stutz
  • Shuo Li
  • Arnaud Doucet

An open question in Imprecise Probabilistic Machine Learning is how to empirically derive a credal region (i.e., a closed and convex family of probabilities on the output space) from the available data, without any prior knowledge or assumption. In classification problems, credal regions are a tool that can provide provable guarantees under realistic assumptions by characterizing the uncertainty about the distribution of the labels. Building on previous work, we show that credal regions can be directly constructed using conformal methods. This allows us to provide a novel extension of classical conformal prediction to problems with ambiguous ground truth, that is, when the true labels for given inputs are not exactly known. The resulting construction enjoys desirable practical and theoretical properties: (i) conformal coverage guarantees, (ii) smaller prediction sets (compared to classical conformal prediction regions), and (iii) disentanglement of uncertainty sources (epistemic, aleatoric). We empirically verify our findings on both synthetic and real datasets.

ICRA Conference 2025 Conference Paper

Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

  • Xian Wang
  • Jin Zhou
  • Yuanli Feng
  • Jiahao Mei
  • Jiming Chen 0001
  • Shuo Li

Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations, and enhanced maneuverability in multi-drone systems by applying optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi-agent reinforcement learning for time-optimal multi-drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision-free mechanism inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance tradeoffs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with a low collision rate. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation [video: https://youtu.be/KACuFMtGGpo] [code: https://github.com/KafuuChikai/Dashing-for-the-Golden-Snitch-Multi-Drone-RL].

AAAI Conference 2025 Conference Paper

DuSSS: Dual Semantic Similarity-Supervised Vision-Language Model for Semi-Supervised Medical Image Segmentation

  • Qingtao Pan
  • Wenhao Qiao
  • Jingjiao Lou
  • Bing Ji
  • Shuo Li

Semi-supervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which alleviates the burden of pixel-wise manual annotations. However, it often suffers from error supervision from low-quality pseudo labels. Vision-Language Model (VLM) has great potential to enhance pseudo labels by introducing text-prompt-guided multimodal supervision information. It nevertheless faces the cross-modal problem: the obtained messages tend to correspond to multiple targets. To address the aforementioned problems, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) a Dual Contrastive Learning (DCL) is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision strategy (SSS) is proposed and injected into each contrastive learning process in DCL, supervising semantic similarity via distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo labels. It utilizes the pretrained VLM to generate text-prompt-guided supervision information, refining the pseudo label for better consistency regularization. Experimental results demonstrate that our DuSSS achieves outstanding performance with Dice of 82.52%, 74.61% and 78.03% on three public datasets (QaTa-COV19, BM-Seg and MoNuSeg).

ICRA Conference 2025 Conference Paper

Gate-Aware Online Planning for Two-Player Autonomous Drone Racing

  • Fangguo Zhao
  • Jiahao Mei
  • Jin Zhou
  • Yuanyi Chen
  • Jiming Chen
  • Shuo Li

The flying speed of autonomous quadrotors has increased significantly in the field of autonomous drone racing. However, most research primarily focuses on the aggressive flight of a single quadrotor, simplifying the racing gate traversal problem to a waypoint passing problem that neglects the orientations of the racing gates or implicitly considers the waypoint direction during path planning. In this paper, we propose a systematic method called Pairwise Model Predictive Control (PMPC) that can guide two quadrotors online to navigate racing gates with minimal time and without collisions. The flight task is initially simplified as a point-mass model waypoint passing problem to provide a time-optimal reference through an efficient two-step velocity search method. Subsequently, we utilize the spatial configuration of the racing track to compute the optimal heading at each gate, maximizing the visibility of subsequent gates for the quadrotors. To address varying gate orientations, we introduce a novel Magnetic Induction Line-based spatial curve to guide the quadrotors through racing gates of different orientations. Furthermore, we formulate a nonlinear optimization problem that uses the point-mass trajectory as initial values and references to enhance solving efficiency. The feasibility of the proposed method is validated through both simulation and real-world experiments. In real-world tests, the two quadrotors achieved a top speed of 6.1 m/s on a 7-waypoint racing track within a compact flying arena of 5 m × 4 m × 2 m.

ICLR Conference 2025 Conference Paper

Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs

  • Shuo Li
  • Tao Ji
  • Xiaoran Fan
  • Linsheng Lu
  • Leyi Yang
  • Yuming Yang 0001
  • Zhiheng Xi
  • Rui Zheng

In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users' opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.

IJCAI Conference 2025 Conference Paper

Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration

  • Peng Wang
  • Shuo Li
  • Ruoxi Zhou
  • Qiguang Chen
  • Xiao Xu
  • Hao Fei
  • Dagang Li
  • Wanxiang Che

Consistency identification in task-oriented dialogue (CI-ToD) typically consists of three sub-tasks: User Query Inconsistency (QI) identification, Dialogue History Inconsistency (HI) identification, and Knowledge Base Inconsistency (KBI) identification, which aim to determine inconsistent relationships between the system response and the user query, dialogue history, and knowledge base. Previous approaches focus on exploring deep learning models for CI-ToD. While these models achieve remarkable progress, they still rely on large amounts of labeled data, which is hard to obtain in real-world scenarios. Motivated by this, in this paper, we aim to explore large language models for CI-ToD, which do not require any training data. In addition, we further introduce a multi-agent collaboration framework (MAC-CIToD) to model the interaction across the three sub-tasks in CI-ToD, including (1) a Full Connection paradigm, (2) a Cycle Connection paradigm, and (3) a Central Connection paradigm, which effectively build interaction across QI, HI, and KBI. Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models trained on the CI-ToD dataset.

ICRA Conference 2025 Conference Paper

Learning Time-Optimal Online Replanning for Distributed Model Predictive Contouring Control of Quadrotors

  • Xin Guan
  • Fangguo Zhao
  • Shunxin Tian
  • Shuo Li

Achieving time-optimal flight in real time for multi-drone systems presents significant challenges, particularly in scenarios requiring rapid responses or aggressive maneuvers. This paper introduces a novel framework that bridges the gap between time-optimal polynomial trajectory generation and optimal control, facilitating efficient online replanning (100 Hz onboard) for multiple quadrotors. Specifically, the proposed method leverages a neural network to learn optimal time allocations for polynomial trajectories, which are then integrated with Model Predictive Contouring Control to fully exploit the dynamics of quadrotors. We further extend this approach to multi-drone systems, enabling collaborative high-speed flight with reciprocal collision avoidance. We benchmark the time-optimal performance and computational efficiency of our method in a drone racing scenario and demonstrate its effectiveness in agile cooperative flight within more constrained simulation and real-world environments. The results demonstrate that the proposed method achieves agile waypoint traversal at speeds of up to 19 m/s in simulation and up to 9 m/s in a two-drone real-world scenario. [video: https://www.youtube.com/watch?v=KE97sKwYpAs]

JBHI Journal 2025 Journal Article

MedFILIP: Medical Fine-Grained Language-Image Pre-Training

  • Xinjie Liang
  • Xiangyu Li
  • Fanding Li
  • Jie Jiang
  • Qing Dong
  • Wei Wang
  • Kuanquan Wang
  • Suyu Dong

Medical vision-language pretraining (VLP) that leverages naturally paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels and thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy improved by up to 6.69%.

JBHI Journal 2025 Journal Article

MedKAFormer: When Kolmogorov–Arnold Theorem Meets Vision Transformer for Medical Image Representation

  • Guoli Wang
  • Qikui Zhu
  • Chaoda Song
  • Benzheng Wei
  • Shuo Li

Vision Transformers (ViTs) suffer from high parameter complexity because they rely on Multi-layer Perceptrons (MLPs) for nonlinear representation. This issue is particularly challenging in medical image analysis, where labeled data is limited, leading to inadequate feature representation. Existing methods have attempted to optimize either the patch embedding stage or the non-embedding stage of ViTs. Still, they have struggled to balance effective modeling, parameter complexity, and data availability. Recently, the Kolmogorov–Arnold Network (KAN) was introduced as an alternative to MLPs, offering a potential solution to the large parameter issue in ViTs. However, KAN cannot be directly integrated into ViT due to challenges such as handling 2D structured data and dimensionality catastrophe. To solve this problem, we propose MedKAFormer, the first ViT model to incorporate the Kolmogorov–Arnold (KA) theorem for medical image representation. It includes a Dynamic Kolmogorov–Arnold Convolution (DKAC) layer for flexible nonlinear modeling in the patch embedding stage. Additionally, it introduces a Nonlinear Sparse Token Mixer (NSTM) and a Nonlinear Dynamic Filter (NDF) in the non-embedding stage. These components provide comprehensive nonlinear representation while reducing model overfitting. MedKAFormer reduces parameter complexity by 85.61% compared to ViT-Base and achieves competitive results on 14 medical datasets across various imaging modalities and structures.

IROS Conference 2025 Conference Paper

Online Motion Planning for Quadrotor Multi-Point Navigation Using Efficient Imitation Learning-Based Strategy

  • Jin Zhou
  • Jiahao Mei
  • Fangguo Zhao
  • Jiming Chen 0001
  • Shuo Li

Over the past decade, there has been a remarkable surge in utilizing quadrotors for various purposes due to their simple structure and aggressive maneuverability. One of the key challenges is online time-optimal trajectory generation and control technique. This paper proposes an imitation learning-based online solution to efficiently navigate the quadrotor through multiple waypoints with near-time-optimal performance. The neural networks (WN&CNets) are trained to learn the control law from the dataset generated by the time-consuming CPC algorithm and then deployed to generate the optimal control commands online to guide the quadrotors. To address the challenge of limited training data and the hover maneuver at the final waypoint, we propose a transition phase strategy that utilizes MINCO trajectories to help the quadrotor 'jump over' the stop-and-go maneuver when switching waypoints. Our method is demonstrated in both simulation and real-world experiments, achieving a maximum speed of 5.6 m/s while navigating through 7 waypoints in a confined space of 5.5 m × 5.5 m × 2.0 m [video]. The results show that with a slight loss in optimality, the WN&CNets significantly reduce the processing time and enable online control for multi-point flight tasks.

ICRA Conference 2025 Conference Paper

Safety-Critical Online Quadrotor Trajectory Planner for Agile Flights in Unknown Environments

  • Jiazhe Yuan
  • Dongcheng Cao
  • Jiahao Mei
  • Jiming Chen
  • Shuo Li

Autonomous high-speed flight in unknown, cluttered environments is essential for a variety of quadrotor applications, such as inspection, search, and rescue. In this study, we propose a novel trajectory planner designed to achieve efficient, high-speed, collision-free flights in such environments. The proposed approach begins by generating a safe flight corridor based on the path found by Lazy Theta*, representing the safe regions with polytopic sets. These sets are then used to define discrete-time control barrier functions (DCBF), ensuring the quadrotor stays within safe bounds during flight. By selecting a single waypoint ahead of the quadrotor on the path as the next waypoint, the trajectory is optimized by considering both the total flight time and safety constraints. Extensive simulations and real-world experiments have confirmed our method's feasibility, demonstrating its capability for high-speed performance and reliable obstacle avoidance. [video: https://www.youtube.com/playlist?list=PLJFduoH7QICOhcIX3JFsZwB4IgS4_-sPt]
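A discrete-time control barrier function enforces h(x_{k+1}) >= (1 - gamma) * h(x_k) for a safety function h >= 0, so the safety margin can shrink at most geometrically per step and never becomes negative. A minimal 1-D sketch of this idea (a hypothetical scalar toy with a single wall constraint, not the paper's polytopic corridor formulation):

```python
def dcbf_filter(x, u_des, x_wall=1.0, gamma=0.3, dt=0.1):
    """Clip a desired velocity command so the DCBF condition holds for
    h(x) = x_wall - x (distance to a wall at x_wall).
    Requiring h(x + u*dt) >= (1 - gamma) * h(x) rearranges to the
    input bound u <= gamma * h(x) / dt."""
    h = x_wall - x                      # current safety margin
    return min(u_des, gamma * h / dt)   # safe commands pass unchanged
```

Each step the margin h decays by at most the factor (1 - gamma), so even under a persistently aggressive command the state approaches the wall asymptotically but never crosses it.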

JBHI Journal 2025 Journal Article

VLD-Net: Localization and Detection of the Vertebrae From X-Ray Images by Reinforcement Learning With Adaptive Exploration Mechanism and Spine Anatomy Information

  • Shun Xiang
  • Lei Zhang
  • Yuanquan Wang
  • Shoujun Zhou
  • Xing Zhao
  • Tao Zhang
  • Shuo Li

Accurate and efficient vertebrae localization and detection in X-ray images are essential for diagnosing and treating spinal diseases. However, most existing methods struggle with the complexity of spine X-ray images, yielding inaccurate results due to insufficient utilization of spinal anatomy information and neglect of individual vertebra characteristics. In this paper, we propose an innovative Vertebrae Localization and Detection Network (VLD-Net) to accurately assist physicians in diagnosing spine-related diseases from X-ray images. Our VLD-Net, for the first time, defines vertebrae localization as a top-bottom sequential decision-making process, employing deep reinforcement learning (DRL) to fully leverage the anatomical information of the spine. Simultaneously, it also prioritizes the distinct characteristics of each vertebra for accurate detection. Specifically, VLD-Net combines three key components: 1) An advanced vertebrae localization module based on DRL is proposed, effectively leveraging anatomical information of the spine. 2) A novel adaptive exploration mechanism is coined to understand the behavior of the DRL agent during training, pinpointing how to effectively achieve the trade-off between exploration and exploitation. 3) An innovative vertebra-focused module is proposed to accurately detect vertebral landmarks, using the attention region of each vertebra as input to enhance focus on the target and reduce interference from surrounding tissue. Extensive experiments on two public spine datasets demonstrate that the VLD-Net outperforms the state-of-the-art methods in accuracy and robustness.

IROS Conference 2024 Conference Paper

An Observability Constrained Downward-Facing Optical-Flow-Aided Visual-Inertial Odometry

  • Dandi Liu
  • Jiahao Mei
  • Jin Zhou
  • Shuo Li

Visual-Inertial Odometry (VIO) has been widely used by autonomous drones as an onboard navigation method. However, it suffers from drift, especially in environments with few texture features, such as an empty room with solid-color walls. Optical flow sensors are another type of onboard sensor used by drones; they face downward and measure velocity by detecting changes in pixels between consecutive images, which does not introduce accumulated error. In this work, we present an efficient tightly-coupled estimator that improves the accuracy of VIO by consistently fusing the measurements of a downward-facing optical flow sensor into the VIO framework. We further analyze the observability of the estimator and prove that there are four unobservable directions in the ideal case; we then utilize OC-EKF to maintain the consistency of the estimator. Furthermore, we extend an adaptive weighting algorithm to the proposed method, which better adapts to scenes where feature tracking is less accurate. Finally, both simulation and real-world experiments demonstrate the feasibility of the proposed method.

JBHI Journal 2024 Journal Article

Automatic Delineation of the 3D Left Atrium From LGE-MRI: Actor-Critic Based Detection and Semi-Supervised Segmentation

  • Shun Xiang
  • Nana Li
  • Yuanquan Wang
  • Shoujun Zhou
  • Jin Wei
  • Shuo Li

Accurate and automatic delineation of the left atrium (LA) is crucial for computer-aided diagnosis of atrial fibrillation-related diseases. However, effective model training typically requires a large amount of labeled data, which is time-consuming and labor-intensive. In this study, we propose a novel LA delineation framework. The region of LA is first detected using an actor-critic based deep reinforcement learning method with a shape-adaptive detection strategy using only box-level annotations, bypassing the need for voxel-level labeling. With the effectively detected LA, the impacts of class-imbalance and interference from surrounding tissues are significantly reduced. Subsequently, a semi-supervised segmentation scheme is coined to precisely delineate the contour of LA in 3D volume. The scheme integrates two independent networks with distinct structures, enabling implicit consistency regularization, capturing more spatial features, and avoiding the error accumulation present in current mainstream semi-supervised frameworks. Specifically, one network is combined with Transformer to capture latent spatial features, while the other network is based on pure CNN to capture local features. The difference prediction between these two sub-networks is exploited to mutually provide high-quality pseudo-labels and correct the cognitive bias. Experimental results on two public datasets demonstrate that our proposed strategy outperforms several state-of-the-art methods in terms of accuracy and clinical convenience.

IJCAI Conference 2024 Conference Paper

Class-consistent Contrastive Learning Driven Cross-dimensional Transformer for 3D Medical Image Classification

  • Qikui Zhu
  • Chuan Fu
  • Shuo Li

Transformer emerges as an active research topic in medical image analysis. Yet, three substantial challenges limit the effectiveness of both 2D and 3D Transformers in 3D medical image classification: 1) difficulty in capturing spatial structure correlation due to the unreasonable flattening operation; 2) high computational complexity and memory consumption, which grow quadratically with 3D medical data; 3) difficulty in discriminative representation learning due to data sensitivity. To address the above challenges, a novel Cross-dimensional Transformer (CdTransformer) and a creative Class-consistent Contrastive Learning (CcCL) are proposed. Specifically, CdTransformer consists of two novel modules: 1) Cross-dimensional Attention Module (CAM), which breaks the limitation that Transformer cannot reasonably establish spatial structure correlation for 3D medical data, while reducing computational complexity and memory consumption. 2) Inter-dimensional Feed-forward Network (IdFN), which addresses the challenge that traditional feed-forward networks cannot learn the depth-dimension information unique to 3D medical data. CcCL innovatively takes full advantage of the inter-class and intra-class features from slice-distorted samples to boost Transformer in learning feature representations. CdTransformer and CcCL are validated on six 3D medical image classification tasks. Extensive experimental results demonstrate that CdTransformer outperforms state-of-the-art CNNs and Transformers on 3D medical image classification, and CcCL significantly improves Transformer in discriminative representation learning.

IROS Conference 2024 Conference Paper

FlowTrack: Point-level Flow Network for 3D Single Object Tracking

  • Shuo Li
  • Yubo Cui
  • Zhiheng Li 0003
  • Zheng Fang 0001

3D single object tracking (SOT) is a crucial task in the fields of mobile robotics and autonomous driving. Traditional motion-based approaches achieve target tracking by estimating the relative movement of the target between two consecutive frames. However, they usually overlook local motion information of the target and fail to exploit historical frame information effectively. To overcome the above limitations, we propose a point-level flow method with multi-frame information for the 3D SOT task, called FlowTrack. Specifically, by estimating the flow for each point in the target, our method can capture the local motion details of the target, thereby improving tracking performance. Meanwhile, to handle scenes with sparse points, we present a learnable target feature as a bridge to efficiently integrate target information from past frames. Moreover, we design an Instance Flow Head to transform dense point-level flow into instance-level motion, effectively aggregating local motion information to obtain global target motion. Finally, our method achieves competitive performance with improvements of 5.9% on KITTI and 2.9% on NuScenes, compared to the next-best method.

ICML Conference 2024 Conference Paper

MOMENT: A Family of Open Time-series Foundation Models

  • Mononito Goswami
  • Konrad Szafer
  • Arjun Choudhry
  • Yifu Cai
  • Shuo Li
  • Artur Dubrawski

We introduce MOMENT, a family of open-source foundation models for general-purpose time series analysis. Pre-training large models on time series data is challenging due to (1) the absence of a large and cohesive public time series repository, and (2) diverse time series characteristics which make multi-dataset training onerous. Additionally, (3) experimental benchmarks to evaluate these models, especially in scenarios with limited resources, time, and supervision, are still in their nascent stages. To address these challenges, we compile a large and diverse collection of public time series, called the Time series Pile, and systematically tackle time series-specific challenges to unlock large-scale multi-dataset pre-training. We then build on recent work to design a benchmark to evaluate time series foundation models on diverse tasks and datasets in limited supervision settings. Experiments on this benchmark demonstrate the effectiveness of our pre-trained models with minimal data and task-specific fine-tuning. Finally, we present several interesting empirical observations about large pre-trained time series models. Pre-trained models (AutonLab/MOMENT-1-large) and Time Series Pile (AutonLab/Timeseries-PILE) are available on Huggingface.

NeurIPS Conference 2024 Conference Paper

One-Shot Safety Alignment for Large Language Models via Optimal Dualization

  • Xinmeng Huang
  • Shuo Li
  • Edgar Dobriban
  • Osbert Bastani
  • Hamed Hassani
  • Dongsheng Ding

The growing safety concerns surrounding large language models raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness and merits of our algorithms.

IROS Conference 2024 Conference Paper

Priority-Based Deadlock Recovery for Distributed Swarm Obstacle Avoidance in Cluttered Environments

  • Jiacheng He
  • Fangguo Zhao
  • Shaohao Zhu
  • Shuo Li
  • Jinming Xu 0002

We propose a novel hierarchical priority mechanism for deadlock recovery of distributed swarms via on-demand collision avoidance in cluttered dynamic environments. The proposed priority mechanism dynamically assigns a priority and an optimized detour point to each agent based on its spatial context to avoid deadlocks, which are predicted by properly designed deadlock conditions; as a byproduct, this priority mechanism allows us to effectively resolve livelocks as well. The resulting optimization problem is then solved by polar reformulation and alternating minimization methods. Simulation results demonstrate that, in both static and dynamic environments, our method (termed PriDRAM) outperforms the baseline Alternating Minimization Swarm (AMSwarm) method, which does not explicitly account for deadlock recovery, with a 10.5% improvement in average smoothness and a 4.8% reduction in flight time. Moreover, for narrow passages, our method shows superior performance against the Distributed Linear Safe Corridor (DLSC) method, with a more reasonable passing order and up to 40% reduction in flight path length. Finally, we verify the efficacy of our proposed method with a Crazyflie 2.1 quadrotor swarm.

JBHI Journal 2024 Journal Article

STANet: Spatio-Temporal Adaptive Network and Clinical Prior Embedding Learning for 3D+T CMR Segmentation

  • Xiaoming Qi
  • Yuting He
  • Yaolei Qi
  • Youyong Kong
  • Guanyu Yang
  • Shuo Li

The segmentation of cardiac structures in magnetic resonance images (CMR) is paramount in diagnosing and managing cardiovascular illnesses, given its 3D+Time (3D+T) sequence. Existing deep learning methods are constrained in their ability to perform 3D+T CMR segmentation due to: (1) Limited motion perception. The complexity of the beating heart makes motion perception in 3D+T CMR difficult, including long-range and cross-slice motions. The existing methods' local perception and slice-fixed perception directly limit the performance of 3D+T CMR perception. (2) Lack of labels. Due to the expensive labeling cost of the 3D+T CMR sequence, the labels of 3D+T CMR only contain the end-diastolic and end-systolic frames. This incomplete labeling scheme causes inefficient supervision. Hence, we propose a novel spatio-temporal adaptive network with clinical prior embedding learning (STANet) to ensure efficient spatio-temporal perception and optimization for 3D+T CMR segmentation. (1) A spatio-temporal adaptive convolution (STAC) treats the 3D+T CMR sequence as a whole for perception. The long-distance motion correlation is embedded into the structural perception by learnable weight regularization to balance long-range motion perception. The structural similarity is measured by cross-attention to adaptively correlate the cross-slice motion. (2) A clinical prior embedding learning strategy (CPE) is proposed to dynamically optimize the partially labeled 3D+T CMR segmentation by embedding clinical priors into the optimization. STANet achieves outstanding performance with Dice of 0.917 and 0.94 on two public datasets (ACDC and STACOM), which indicates that STANet has the potential to be incorporated into computer-aided diagnosis tools for clinical application.

AAAI Conference 2024 Conference Paper

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

  • Hao Wang
  • Fang Liu
  • Licheng Jiao
  • Jiahao Wang
  • Zehua Hao
  • Shuo Li
  • Lingling Li
  • Puhua Chen

Pre-trained vision-language (V-L) models such as CLIP have demonstrated impressive zero-shot performance in many downstream tasks. Since adopting contrastive video-text pair methods like CLIP for video tasks is limited by high cost and scale, recent approaches focus on efficiently transferring the image-based CLIP to the video domain. A major finding is that fine-tuning the pre-trained model to achieve strong fully supervised performance leads to low zero-shot, few-shot, and base-to-novel generalization. Conversely, freezing the backbone network to maintain generalization ability weakens fully supervised performance. Moreover, no single prompt-tuning branch consistently performs optimally. In this work, we propose a multimodal prompt learning scheme that balances supervised and generalized performance. Our prompting approach contains three sections: 1) Independent prompts on both the vision and text branches to learn the language and visual contexts. 2) Inter-modal prompt mapping to ensure mutual synergy. 3) Reducing the discrepancy between the hand-crafted prompt ("a video of a person doing [CLS]") and the learnable prompt, to alleviate forgetting of essential video scenarios. Extensive validation under fully supervised, zero-shot, few-shot, and base-to-novel generalization settings for video recognition indicates that the proposed approach achieves competitive performance with less compute cost.

JBHI Journal 2023 Journal Article

A Knowledge-Guided Framework for Fine-Grained Classification of Liver Lesions Based on Multi-Phase CT Images

  • Xingxin Xu
  • Qikui Zhu
  • Hanning Ying
  • Jiongcheng Li
  • Xiujun Cai
  • Shuo Li
  • Xiaoqing Liu
  • Yizhou Yu

Automatic and accurate differentiation of liver lesions from multi-phase computed tomography imaging is critical for the early detection of liver cancer. Multi-phase data can provide more diagnostic information than single-phase data, and its effective use can significantly improve diagnostic accuracy. Current fusion methods usually fuse multi-phase information at the image level or feature level, ignoring the specificity of each modality; their information integration capacity is therefore limited. In this paper, we propose a knowledge-guided framework, named MCCNet, which adaptively integrates multi-phase liver lesion information at three different stages to fully utilize and fuse multi-phase liver information. Specifically, 1) a multi-phase self-attention module adaptively combines and integrates complementary information from three phases using multi-level phase features; 2) a cross-feature interaction module further integrates multi-phase fine-grained features from a global perspective; 3) a cross-lesion correlation module, proposed for the first time, imitates the clinical diagnosis process by exploiting inter-lesion correlation within the same patient. By integrating these three modules into a 3D backbone, we construct a lesion classification network, which is validated on an in-house dataset containing 3,683 lesions from 2,333 patients in 9 hospitals. Extensive experimental results and evaluations in real-world clinical applications demonstrate the effectiveness of the proposed modules in exploiting and fusing multi-phase information.

JBHI Journal 2023 Journal Article

Adaptive Frequency Learning Network With Anti-Aliasing Complex Convolutions for Colon Diseases Subtypes

  • Kaini Wang
  • Shuaishuai Zhuang
  • Juzheng Miao
  • Yang Chen
  • Jie Hua
  • Guang-Quan Zhou
  • Xiaopu He
  • Shuo Li

The automatic and dependable identification of colonic disease subtypes from colonoscopy is crucial; once achieved, it will facilitate clinically deeper disease staging analysis and the formulation of more tailored treatment plans. However, inter-class confusion and brightness imbalance are major obstacles to colon disease subtyping. Notably, the Fourier-based image spectrum, with its distinctive frequency features and brightness insensitivity, offers a potential solution. To leverage these advantages effectively and address the existing challenges, this article proposes a framework capable of thorough learning in the frequency domain based on four core designs: a position consistency module, a high-frequency self-supervised module, a complex number arithmetic model, and a feature anti-aliasing module. The position consistency module enables the generation of spectra that preserve local and positional information while compressing the spectral data range to improve training stability. Through band masking and supervision, the high-frequency self-supervised module guides the network to learn useful frequency features selectively. The proposed complex number arithmetic model allows direct spectral training while avoiding the loss of phase information caused by current general-purpose real-valued operations. The feature anti-aliasing module embeds filters in the model to prevent the spectral aliasing caused by down-sampling and improve performance. Experiments are performed on a collected five-class dataset of 4591 colorectal endoscopic images. The outcomes show that our proposed method produces state-of-the-art results with an accuracy of 89.82%.

IROS Conference 2023 Conference Paper

Aggressive Trajectory Generation for a Swarm of Autonomous Racing Drones

  • Yuyang Shen
  • Jin Zhou
  • Danzhe Xu
  • Fangguo Zhao
  • Jinming Xu 0002
  • Jiming Chen 0001
  • Shuo Li

Autonomous drone racing is becoming an excellent platform for challenging quadrotors' autonomy techniques, including planning, navigation, and control. However, most research on this topic focuses on single-drone scenarios. In this paper, we describe a novel method for generating time-optimal trajectories that let a swarm of quadrotors fly through pre-defined waypoints at maximum maneuverability without collision. We verify the method in Gazebo simulations, where a swarm of 5 quadrotors flies through a complex 6-waypoint racing track in a $35m\times 35m$ space with a top speed of 14 m/s. Flight tests are performed with two quadrotors passing through 3 waypoints in a $4m\times 2m$ flight arena to demonstrate the feasibility of the proposed method in the real world. Both simulations and real-world flight tests show that the proposed method can generate optimal aggressive trajectories for a swarm of autonomous racing drones, and the method can easily be transferred to other types of robot swarms.

JBHI Journal 2023 Journal Article

ARR-GCN: Anatomy-Relation Reasoning Graph Convolutional Network for Automatic Fine-Grained Segmentation of Organ's Surgical Anatomy

  • Yinli Tian
  • Wenjian Qin
  • Fei Xue
  • Ricardo Lambo
  • Meiyan Yue
  • Songhui Diao
  • Lequan Yu
  • Yaoqin Xie

Anatomical resection (AR) based on anatomical sub-regions is a promising method of precise surgical resection, which has been proven to improve long-term survival by reducing local recurrence. The fine-grained segmentation of an organ's surgical anatomy (FGS-OSA), i.e., segmenting an organ into multiple anatomic regions, is critical for localizing tumors in AR surgical planning. However, automatic FGS-OSA in computer-aided methods faces the challenges of appearance ambiguities among sub-regions (i.e., inter-sub-region appearance ambiguities) caused by similar HU distributions across the sub-regions of an organ's surgical anatomy, invisible boundaries, and similarities between anatomical landmarks and other anatomical information. In this paper, we propose a novel fine-grained segmentation framework termed the "anatomic relation reasoning graph convolutional network" (ARR-GCN), which incorporates prior anatomic relations into framework learning. In ARR-GCN, a graph is constructed from the sub-regions to model the classes and their relations. Further, to obtain discriminative initial node representations in graph space, a sub-region center module is designed. Most importantly, to explicitly learn the anatomic relations, the prior anatomic relations among the sub-regions are encoded in the form of an adjacency matrix and embedded into the intermediate node representations to guide framework learning. ARR-GCN was validated on two FGS-OSA tasks: i) liver segment segmentation and ii) lung lobe segmentation. On both tasks it outperformed other state-of-the-art segmentation methods and showed promising suppression of ambiguities among sub-regions.

JBHI Journal 2023 Journal Article

BMAnet: Boundary Mining With Adversarial Learning for Semi-Supervised 2D Myocardial Infarction Segmentation

  • Chenchu Xu
  • Yifei Wang
  • Dong Zhang
  • Longfei Han
  • Yanping Zhang
  • Jie Chen
  • Shuo Li

Automatic segmentation of myocardial infarction (MI) regions in late gadolinium-enhanced cardiac magnetic resonance images is an essential step in the computer-aided diagnosis of myocardial infarction. Most current MI region segmentation methods are based on fully supervised deep learning. However, cardiologists' annotation of MI regions in cardiac magnetic resonance images during the diagnosis process is time-consuming and expensive. This paper proposes a semi-supervised myocardial infarction segmentation framework consisting of two models: 1) a boundary mining model and 2) an adversarial learning model. The boundary mining model solves the boundary ambiguity problem by enlarging the gap between foreground and background features, thus segmenting the MI region accurately. The adversarial learning model lets the boundary mining model learn from additional unlabeled data by evaluating segmentation performance and providing pseudo supervision, which significantly increases the robustness of the boundary mining model. We conduct extensive experiments on an in-house myocardial magnetic resonance dataset. Experimental results on six evaluation metrics demonstrate that our method achieves excellent myocardial infarction segmentation and outperforms state-of-the-art semi-supervised methods.

ICRA Conference 2023 Conference Paper

Efficient View Path Planning for Autonomous Implicit Reconstruction

  • Jing Zeng
  • Yanxu Li
  • Yunlong Ran
  • Shuo Li
  • Fei Gao
  • Lincheng Li
  • Shibo He
  • Jiming Chen 0001

Implicit neural representations have shown promising potential for 3D scene reconstruction. Recent work applies them to autonomous 3D reconstruction by learning information gain for view path planning. Effective as this is, computing the information gain is expensive, and collision checking for a 3D point using the implicit representation is much slower than with volumetric representations. In this paper, we propose to 1) leverage a neural network as an implicit function approximator for the information gain field and 2) combine the fine-grained implicit representation with coarse volumetric representations to improve efficiency. Building on the improved efficiency, we further propose a novel informative path planning method based on a graph-based planner. Our method demonstrates significant improvements in reconstruction quality and planning efficiency compared with autonomous reconstruction using implicit or explicit representations alone. We deploy the method on a real UAV, and the results show that it can plan informative views and reconstruct a scene with high quality.

JBHI Journal 2023 Journal Article

Multi-Task Learning for Pulmonary Arterial Hypertension Prognosis Prediction Via Memory Drift and Prior Prompt Learning on 3D Chest CT

  • Guanyu Yang
  • Yuting He
  • Yang Lv
  • Yang Chen
  • Jean-Louis Coatrieux
  • Xiaoxuan Sun
  • Qiang Wang
  • Yongyue Wei

Pulmonary arterial hypertension (PAH) prognosis prediction on 3D non-contrast CT images is one of the most important tasks in PAH treatment. It helps clinicians stratify patients into different groups for early diagnosis and timely intervention by automatically extracting potential biomarkers of PAH to predict mortality. However, it remains greatly challenging due to the large volume and low-contrast regions of interest in 3D chest CT images. In this paper, we propose the first multi-task learning-based PAH prognosis prediction framework, P$^{2}$-Net, which effectively optimizes the model and powerfully represents task-dependent features via our Memory Drift (MD) and Prior Prompt Learning (PPL) strategies. 1) Our MD maintains a large memory bank to provide a dense sampling of the deep biomarkers' distribution. Therefore, although the batch size is very small due to the large volume, a reliable (negative log partial) likelihood loss can still be calculated on a representative probability distribution for robust optimization. 2) Our PPL simultaneously learns an additional manual-biomarker prediction task to embed clinical prior knowledge into our deep prognosis prediction task in hidden and explicit ways, prompting the prediction of deep biomarkers and improving the perception of task-dependent features in low-contrast regions. Our P$^{2}$-Net achieves a high prognostic correlation and great generalization, with the highest C-index of 70.19% and an HR of 2.14. Extensive experiments with promising results on PAH prognosis prediction reveal powerful prognosis performance and great clinical significance in PAH treatment. All of our code will be made publicly available online.

AAAI Conference 2023 Conference Paper

Pixel Is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object Detection

  • Zhenyu Wu
  • Lin Wang
  • Wei Wang
  • Qing Xia
  • Chenglizhao Chen
  • Aimin Hao
  • Shuo Li

Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotation) can achieve the performance of its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there exists a point-labeled dataset on which saliency models can achieve performance equivalent to training on a densely annotated dataset. To prove this conjecture, we propose a novel yet effective adversarial trajectory-ensemble active learning method (ATAL). Our contributions are three-fold: 1) our adversarial-attack-triggered uncertainty can conquer the overconfidence of existing active learning methods and accurately locate uncertain pixels; 2) our trajectory-ensemble uncertainty estimation method maintains the advantages of ensemble networks while significantly reducing the computational cost; 3) our relationship-aware diversity sampling algorithm can conquer oversampling while boosting performance. Experimental results show that ATAL can find such a point-labeled dataset: a saliency model trained on it obtains 97%-99% of the performance of its fully-supervised version with only 10 annotated points per image.

JBHI Journal 2023 Journal Article

Trajectory-Aware Adaptive Imaging Clue Analysis for Guidewire Artifact Removal in Intravascular Optical Coherence Tomography

  • Gongning Luo
  • Xinghua Ma
  • Jinwen Guo
  • Mingye Zou
  • Wei Wang
  • Yang Cao
  • Kuanquan Wang
  • Shuo Li

Guidewire Artifact Removal (GAR) involves restoring missing imaging signals in areas of IntraVascular Optical Coherence Tomography (IVOCT) videos affected by guidewire artifacts. GAR helps overcome imaging defects and minimizes the impact of missing signals on the diagnosis of CardioVascular Diseases (CVDs). To restore the actual vascular and lesion information within the artifact area, we propose a reliable Trajectory-aware Adaptive imaging Clue analysis Network (TAC-Net) that includes two innovative designs: (i) adaptive clue aggregation, which considers both texture-focused original (ORI) videos and structure-focused relative total variation (RTV) videos, and suppresses texture-structure imbalance with an active weight-adaptation mechanism; (ii) a trajectory-aware Transformer, which uses a novel attention calculation to perceive the attention distribution of artifact trajectories and avoid the interference of irregular and non-uniform artifacts. We provide a detailed formulation of the procedure and evaluation of the GAR task and conduct comprehensive quantitative and qualitative experiments. The experimental results demonstrate that TAC-Net reliably restores the texture and structure of guidewire artifact areas as expected by experienced physicians (e.g., SSIM: 97.23%). We also discuss the value and potential of the GAR task for clinical applications and computer-aided diagnosis of CVDs.

AAAI Conference 2023 Conference Paper

Video-Audio Domain Generalization via Confounder Disentanglement

  • Shengyu Zhang
  • Xusheng Feng
  • Wenyan Fan
  • Wenjing Fang
  • Fuli Feng
  • Wei Ji
  • Shuo Li
  • Li Wang

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, i.e., multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.

JBHI Journal 2022 Journal Article

Few-Shot Learning for Deformable Medical Image Registration With Perception-Correspondence Decoupling and Reverse Teaching

  • Yuting He
  • Tiantian Li
  • Rongjun Ge
  • Jian Yang
  • Youyong Kong
  • Jian Zhu
  • Huazhong Shu
  • Guanyu Yang

Deformable medical image registration estimates the deformation that aligns the regions of interest (ROIs) of two images to the same spatial coordinate system. However, recent unsupervised registration models only have correspondence ability without perception, causing misalignment on blurred anatomies and distortion of task-unconcerned backgrounds. Label-constrained (LC) registration models embed perception ability via labels, but the lack of texture constraints in labels and the expensive labeling cost cause distortion within ROIs and overfitted perception. We propose the first few-shot deformable medical image registration framework, Perception-Correspondence Registration (PC-Reg), which embeds perception ability into registration models with only a few labels, greatly improving registration accuracy and reducing distortion. 1) We propose Perception-Correspondence Decoupling, which decouples the perception and correspondence actions of registration into two CNNs. Independent optimizations and feature representations thus become available, avoiding interference with the correspondence due to the lack of texture constraints. 2) For few-shot learning, we propose Reverse Teaching, which aligns labeled and unlabeled images to each other to provide supervision on the structure and style knowledge in unlabeled images, thus generating additional training data. These data reversely teach our perception CNN more style and structure knowledge, improving its generalization ability. Our experiments on three datasets with only five labels demonstrate that PC-Reg has competitive registration accuracy and effective distortion-reducing ability. Compared with LC-VoxelMorph ($\lambda=1$), we achieve 12.5%, 6.3% and 1.0% Reg-DSC improvements on the three datasets, revealing our framework's great potential in clinical application.

JBHI Journal 2022 Journal Article

Guest Editorial Artificial Intelligence in Pre-DICOM

  • Tao Tan
  • Ravi Soni
  • Jungong Han
  • Shuo Li

The papers in this special section focus on artificial intelligence in pre-DICOM medical imaging. AI for medical imaging is applied in three domains: pre-DICOM, pre-processing, and clinical applications. Clinical applications mainly cover topics such as disease detection, classification, segmentation, and registration. Pre-processing components are mainly designed to facilitate applications through image transformations such as image normalization, noise reduction, and bias correction in MR. AI in the pre-DICOM domain is expected to improve imaging workflow, image protocol selection, imaging quality, and scanning time before images are converted into DICOM format for radiologists to review. The trend in AI publications on medical imaging has gradually extended from clinical applications to pre-processing and then to pre-DICOM. The papers in this special section present and highlight the latest developments in applying advanced deep learning techniques in the pre-DICOM space.

JBHI Journal 2022 Journal Article

Guest Editorial Generative Adversarial Networks in Biomedical Image Computing

  • Huazhu Fu
  • Tao Zhou
  • Shuo Li
  • Alejandro F. Frangi

The papers in this special section focus on generative adversarial networks in biomedical image computing. The field of biomedical imaging has made great progress from Roentgen's original discovery of the X-ray to current imaging tools, including Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Computed Tomography (CT), and Ultrasound (US). These non-invasive imaging technologies assess the current condition of an organ or tissue and can be used to monitor a patient over time for accurate and timely diagnosis and treatment. With the development of imaging technologies, advanced artificial intelligence algorithms for automated image analysis have shown the potential to change many aspects of clinical applications within the next decade. Meanwhile, these advanced technologies have also brought new issues and challenges. Thus, there has been a growing demand for biomedical image computing as a component of clinical trials and device improvement. Generative adversarial networks (GANs) have attracted growing interest in the computer vision community due to their capability for data generation and translation. GAN-based models learn from a set of training data and generate new data with the same characteristics as the training data, and have proven to be the state of the art for generating sharp and realistic images. More importantly, GANs have rapidly been applied to many traditional and novel applications in the medical domain, such as image reconstruction, segmentation, diagnosis, and synthesis. Despite GANs' substantial progress in these areas, their application to medical image computing still faces challenges, and unsolved problems remain.

JBHI Journal 2022 Journal Article

Hematoma Expansion Context Guided Intracranial Hemorrhage Segmentation and Uncertainty Estimation

  • Xiangyu Li
  • Gongning Luo
  • Wei Wang
  • Kuanquan Wang
  • Yue Gao
  • Shuo Li

Accurate segmentation of Intracranial Hemorrhage (ICH) in non-contrast CT images is significant for computer-aided diagnosis. Although existing methods have achieved remarkable results, none of them incorporated ICH's prior information. In this work, for the first time, we propose a novel SLice EXpansion Network (SLEX-Net), which incorporates hematoma expansion into the segmentation architecture by directly modeling the hematoma variation among adjacent slices. Firstly, a new module named the Slice Expansion Module (SEM) was built, which effectively transfers contextual information between two adjacent slices by mapping predictions from one slice to another. Secondly, to perceive contextual information from both upper and lower slices, we designed two information transmission paths, forward and backward slice expansion, and aggregated the results from these paths with a novel weighting strategy. By further exploiting intra-slice and inter-slice context through these information paths, the network significantly improves the accuracy and continuity of segmentation results. Moreover, SLEX-Net enables uncertainty estimation with one-time inference, which is much more efficient than existing methods. We evaluated SLEX-Net against state-of-the-art methods: experimental results demonstrate that our method makes significant improvements on all segmentation metrics and outperforms existing uncertainty estimation methods on several metrics. The code will be available from https://github.com/JohnleeHIT/SLEX-Net.

IJCAI Conference 2022 Conference Paper

MNet: Rethinking 2D/3D Networks for Anisotropic Medical Image Segmentation

  • Zhangfu Dong
  • Yuting He
  • Xiaoming Qi
  • Yang Chen
  • Huazhong Shu
  • Jean-Louis Coatrieux
  • Guanyu Yang
  • Shuo Li

The nature of thick-slice scanning causes severe inter-slice discontinuities in 3D medical images, and vanilla 2D/3D convolutional neural networks (CNNs) fail to represent sparse inter-slice information and dense intra-slice information in a balanced way, leading to severe underfitting of inter-slice features (for vanilla 2D CNNs) and overfitting to noise from long-range slices (for vanilla 3D CNNs). In this work, a novel mesh network (MNet) is proposed to balance the spatial representation across axes via learning. 1) Our MNet latently fuses many representation processes by embedding multi-dimensional convolutions deeply into its basic modules, making the selection of representation processes flexible and thus adaptively balancing the representation of sparse inter-slice information and dense intra-slice information. 2) Our MNet latently fuses multi-dimensional features inside each basic module, simultaneously taking the advantages of 2D (high segmentation accuracy for easily recognized regions in the 2D view) and 3D (high smoothness of the 3D organ contour) representations, thus modeling target regions more accurately. Comprehensive experiments are performed on four public datasets (CT & MR); the results consistently demonstrate that the proposed MNet outperforms the other methods. The code and datasets are available at: https://github.com/zfdong-code/MNet

JBHI Journal 2022 Journal Article

MVSGAN: Spatial-Aware Multi-View CMR Fusion for Accurate 3D Left Ventricular Myocardium Segmentation

  • Xiaoming Qi
  • Yuting He
  • Guanyu Yang
  • Yang Chen
  • Jian Yang
  • Wangyag Liu
  • Yinsu Zhu
  • Yi Xu

Accurate 3D left ventricular (LV) myocardium segmentation in the short-axis (SAX) view of cardiac magnetic resonance (CMR) is challenged by the sparse spatial structure of CMR. Multi-view CMR fusion can provide fine-grained spatial structure for accurate segmentation. However, this strategy is limited by the large information misalignment and the lack of dense 3D CMR as a fusion target in multi-view CMR fusion, as well as the different spatial resolutions of the fusion result and the ground truth in segmentation. In this study, we propose a multi-view spatial-aware adversarial network (MVSGAN). It studies the perception of fine-grained cardiac structure for accurate segmentation via spatial-aware multi-view CMR fusion and consists of three modules: (1) a residual adversarial fusion (RAF) module takes inter-slice deep correlation and anatomical priors to refine the spatial structures by residual supplement and adversarial optimization; (2) a structural perception-aggregation (SPA) module establishes the spatial correlation between the dense cardiac model and the sparse label for accurate CMR LV myocardium segmentation; (3) a joint training strategy utilizes the dense SAX volume as an explicit and implicit goal to jointly optimize the framework. Experiments on a public dataset and a clinical dataset evaluate the performance of MVSGAN. The average Dice and Jaccard scores of LV myocardium segmentation obtained by MVSGAN are the highest among seven existing state-of-the-art methods, reaching 0.92 and 0.75, respectively. It is concluded that spatial-aware multi-view CMR fusion can provide meaningful spatial correlation for accurate LV myocardium segmentation.

JBHI Journal 2022 Journal Article

Regional Cardiac Motion Scoring With Multi-Scale Motion-Based Spatial Attention

  • Wufeng Xue
  • Zejian Chen
  • Tianfu Wang
  • Shuo Li
  • Dong Ni

Regional cardiac motion scoring aims to classify the motion status of each myocardium segment into one of four categories (normal, hypokinetic, akinetic, and dyskinetic) from multiple short-axis MR sequences. It is essential for prognosis and early diagnosis of various cardiac diseases. However, the complex motion of the myocardium and the invisible pattern differences pose great challenges, leading to low performance for automatic methods. Most existing works mitigate the task by differentiating normal motion patterns from abnormal ones, without fine-grained motion scoring. We propose an effective method for cardiac motion scoring by connecting a bottom-up branch and a top-down branch with a novel motion-based spatial attention module in multi-scale space. Specifically, we use convolution blocks for low-level feature extraction as the bottom-up mechanism, and the task of optical flow for explicit motion extraction as the top-down mechanism for high-level allocation of spatial attention. To this end, a newly designed Multi-scale Motion-based Spatial Attention (MMSA) module serves as the pivot connecting the bottom-up and top-down parts, adaptively weighting the low-level features according to the motion information. Experimental results on a newly constructed dataset of 1440 myocardium segments from 90 subjects demonstrate that the proposed MMSA can accurately analyze regional myocardium motion, with accuracies of 79.3% for 4-way motion scoring and 89.0% for abnormality detection, and a correlation of 0.943 for estimation of the motion score index. This work has great potential for practical assessment of cardiac motion function.

JBHI Journal 2022 Journal Article

SC2Net: A Novel Segmentation-Based Classification Network for Detection of COVID-19 in Chest X-Ray Images

  • Huimin Zhao
  • Zhenyu Fang
  • Jinchang Ren
  • Calum MacLellan
  • Yong Xia
  • Shuo Li
  • Meijun Sun
  • Kevin Ren

The pandemic of COVID-19 has become a global crisis in public health, which has led to a massive number of deaths and severe economic degradation. To suppress the spread of COVID-19, accurate diagnosis at an early stage is crucial. As the popularly used real-time reverse transcriptase polymerase chain reaction (RT-PCR) swab test can be lengthy and inaccurate, chest screening with radiography imaging is still preferred. However, due to limited image data and the difficulty of early-stage diagnosis, existing models suffer from ineffective feature extraction and poor network convergence and optimisation. To tackle these issues, a segmentation-based COVID-19 classification network, namely SC2Net, is proposed for effective detection of COVID-19 from chest X-ray (CXR) images. SC2Net consists of two subnets: a COVID-19 lung segmentation network (CLSeg) and a spatial attention network (SANet). To suppress interference from the background, CLSeg is first applied to segment the lung region from the CXR. The segmented lung region is then fed to SANet for classification and diagnosis of COVID-19. As a shallow yet effective classifier, SANet takes ResNet-18 as the feature extractor and enhances high-level features via the proposed spatial attention module. For performance evaluation, the COVIDGR 1.0 dataset is used, a high-quality dataset covering various severity levels of COVID-19. Experimental results show that SC2Net achieves an average accuracy of 84.23% and an average F1 score of 81.31% in detection of COVID-19, outperforming several state-of-the-art approaches.

AAAI Conference 2022 Conference Paper

Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised Video Anomaly Detection

  • Shuo Li
  • Fang Liu
  • Licheng Jiao

Weakly supervised Video Anomaly Detection (VAD) using Multi-Instance Learning (MIL) is usually based on the fact that the anomaly score of an abnormal snippet is higher than that of a normal snippet. At the beginning of training, due to the limited accuracy of the model, it is easy to select the wrong abnormal snippet. To reduce the probability of selection errors, we first propose a Multi-Sequence Learning (MSL) method and a hinge-based MSL ranking loss that uses a sequence composed of multiple snippets as the optimization unit. We then design a Transformer-based MSL network to learn both video-level anomaly probability and snippet-level anomaly scores. In the inference stage, we propose to use the video-level anomaly probability to suppress fluctuations in the snippet-level anomaly scores. Finally, since VAD needs to predict snippet-level anomaly scores, we propose a self-training strategy that gradually refines the anomaly scores by progressively reducing the length of the selected sequence. Experimental results show that our method achieves significant improvements on ShanghaiTech, UCF-Crime, and XD-Violence.
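The ranking idea in this abstract, that the top abnormal snippet should outscore the top normal snippet, is the standard MIL hinge loss; MSL replaces the single-snippet selection unit with a sequence of consecutive snippets. As a hedged sketch (not the authors' released code; function names and the mean-over-`k`-snippets sequence scoring are illustrative assumptions):

```python
import numpy as np

def mil_hinge_ranking_loss(abnormal_scores, normal_scores, margin=1.0):
    """Classic MIL hinge ranking loss: the highest-scoring snippet in
    the abnormal bag should exceed the highest-scoring snippet in the
    normal bag by at least `margin`."""
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))

def msl_hinge_ranking_loss(abnormal_scores, normal_scores, k=2, margin=1.0):
    """Sequence-level variant in the spirit of MSL: rank the best
    window of k consecutive snippets (scored by its mean) per bag,
    making the selection more robust to a single mis-scored snippet."""
    def best_seq(scores):
        # Mean score of every length-k sliding window, then take the max.
        return max(np.convolve(scores, np.ones(k) / k, mode="valid"))
    return max(0.0, margin - best_seq(abnormal_scores) + best_seq(normal_scores))

print(round(mil_hinge_ranking_loss([0.2, 0.9], [0.1, 0.3]), 3))  # → 0.4
```

Averaging over a window penalizes isolated outlier snippets, which is why shrinking the sequence length over self-training rounds can progressively sharpen snippet-level scores.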

JBHI Journal 2022 Journal Article

TRSA-Net: Task Relation Spatial Co-Attention for Joint Segmentation, Quantification and Uncertainty Estimation on Paired 2D Echocardiography

  • Xiaoxiao Cui
  • Yankun Cao
  • Zhi Liu
  • Xiaoyu Sui
  • Jia Mi
  • Yuezhong Zhang
  • Lizhen Cui
  • Shuo Li

Clinical workflow of cardiac assessment on 2D echocardiography requires both accurate segmentation and quantification of the Left Ventricle (LV) from paired apical 4-chamber and 2-chamber views. Moreover, uncertainty estimation is significant for clinically understanding the performance of a model. However, current research on 2D echocardiography ignores this vital task when jointly performing segmentation and quantification, motivating the need for a unified optimization method. In this paper, we propose a multitask model with Task Relation Spatial co-Attention (referred to as TRSA-Net) for joint segmentation, quantification, and uncertainty estimation on paired 2D echocardiography. TRSA-Net achieves multitask joint learning by exploring the spatial correlation between tasks. The task relation spatial co-attention learns the spatial mapping among task-specific features via non-local and co-excitation operations, which jointly embeds spatial information into the segmentation and quantification paths. The Boundary-aware Structure Consistency (BSC) and Joint Indices Constraint (JIC) terms are integrated into the multitask learning objective to guide the learning of the segmentation and quantification paths. BSC promotes structural similarity of predictions, and JIC explores the internal relationship between three quantitative indices. We validate the efficacy of TRSA-Net on the public CAMUS dataset. Extensive comparison and ablation experiments show that our approach achieves competitive segmentation performance and highly accurate quantification results.

JBHI Journal 2021 Journal Article

Left Ventricle Quantification Challenge: A Comprehensive Comparison and Evaluation of Segmentation and Regression for Mid-Ventricular Short-Axis Cardiac MR Data

  • Wufeng Xue
  • Jiahui Li
  • Zhiqiang Hu
  • Eric Kerfoot
  • James Clough
  • Ilkay Oksuz
  • Hao Xu
  • Vicente Grau

Automatic quantification of the left ventricle (LV) from cardiac magnetic resonance (CMR) images plays an important role in making the diagnosis procedure efficient and reliable, and in alleviating the laborious reading work for physicians. Considerable efforts have been devoted to LV quantification using different strategies that include segmentation-based (SG) methods and the recent direct regression (DR) methods. Although both SG and DR methods have obtained great success for the task, a systematic platform to benchmark them remains absent because of differences in label information during model learning. In this paper, we conducted an unbiased evaluation and comparison of cardiac LV quantification methods that were submitted to the Left Ventricle Quantification (LVQuan) challenge, which was held in conjunction with the Statistical Atlases and Computational Modeling of the Heart (STACOM) workshop at MICCAI 2018. The challenge was targeted at the quantification of 1) areas of the LV cavity and myocardium, 2) dimensions of the LV cavity, 3) regional wall thicknesses (RWT), and 4) the cardiac phase, from mid-ventricle short-axis CMR images. First, we constructed a public quantification dataset Cardiac-DIG with ground truth labels for both the myocardium mask and these quantification targets across the entire cardiac cycle. Then, the key techniques employed by each submission were described. Next, quantitative validation of these submissions was conducted with the constructed dataset. The evaluation results revealed that both SG and DR methods can offer good LV quantification performance, even though DR methods do not require densely labeled masks for supervision. Among the 12 submissions, the DR method LDAMT offered the best performance, with a mean estimation error of 301 mm$^2$ for the two areas, 2.15 mm for the cavity dimensions, 2.03 mm for RWTs, and a 9.5% error rate for the cardiac phase classification. Three of the SG methods also delivered comparable performances. Finally, we discussed the advantages and disadvantages of SG and DR methods, as well as the unsolved problems in automatic cardiac quantification for clinical practice applications.

ICLR Conference 2021 Conference Paper

PAC Confidence Predictions for Deep Neural Network Classifiers

  • Sangdon Park 0001
  • Shuo Li
  • Insup Lee 0001
  • Osbert Bastani

A key challenge for deploying deep neural networks (DNNs) in safety-critical settings is the need to provide rigorous ways to quantify their uncertainty. In this paper, we propose a novel algorithm for constructing predicted classification confidences for DNNs that comes with provable correctness guarantees. Our approach uses Clopper-Pearson confidence intervals for the Binomial distribution in conjunction with the histogram binning approach to calibrated prediction. In addition, we demonstrate how our predicted confidences can be used to enable downstream guarantees in two settings: (i) fast DNN inference, where we demonstrate how to compose a fast but inaccurate DNN with an accurate but slow DNN in a rigorous way to improve performance without sacrificing accuracy, and (ii) safe planning, where we guarantee safety when using a DNN to predict whether a given action is safe based on visual observations. In our experiments, we demonstrate that our approach can be used to provide guarantees for state-of-the-art DNNs.
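The calibration guarantee above rests on Clopper-Pearson binomial confidence intervals. A stdlib-only sketch of the interval itself, obtained by binary-searching the inversion of the exact binomial CDF, might look like the following (the bisection inversion is an implementation choice for illustration, not the paper's code):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, tol=1e-9):
    """Exact (1 - alpha) Clopper-Pearson interval for a binomial
    proportion, given k successes in n trials."""
    def sup_p(cond):
        # largest p in [0, 1] satisfying the monotone condition `cond`
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if cond(mid):
                lo = mid
            else:
                hi = mid
        return lo
    # lower bound: largest p with P(X >= k | p) <= alpha/2
    lower = 0.0 if k == 0 else sup_p(lambda p: 1 - binom_cdf(k - 1, n, p) <= alpha / 2)
    # upper bound: largest p with P(X <= k | p) >= alpha/2
    upper = 1.0 if k == n else sup_p(lambda p: binom_cdf(k, n, p) >= alpha / 2)
    return lower, upper
```

For example, observing 0 successes in 10 trials yields the familiar "rule-of-three"-style one-sided bound of roughly 0.31 at the 95% level.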

JBHI Journal 2021 Journal Article

Quantifying Axial Spine Images Using Object-Specific Bi-Path Network

  • Liyan Lin
  • Xi Tao
  • Wei Yang
  • Shumao Pang
  • Zhihai Su
  • Hai Lu
  • Shuo Li
  • Qianjin Feng

Automatic estimation of indices from medical images is the main goal of computer-aided quantification (CADq), which speeds up diagnosis and lightens the workload of radiologists. Deep learning techniques are a good choice for implementing CADq. Usually, to acquire high-accuracy quantification, a specific network architecture needs to be designed for a given CADq task. In this study, considering that the target organs are the intervertebral disc and the dural sac, we propose an object-specific bi-path network (OSBP-Net) for axial spine image quantification. Each path of the OSBP-Net comprises a shallow feature extraction layer (SFE) and a deep feature extraction sub-network (DFE). The SFEs use different convolution strides because the two target organs have different anatomical sizes. The DFEs use average pooling for downsampling based on the observation that the target organs have lower intensity than the background. In addition, an inter-path dissimilarity constraint is proposed and applied to the output of the SFEs, taking into account that the activated regions in the feature maps of the two paths should theoretically be different. An inter-index correlation regularization is introduced and applied to the output of the DFEs based on the observation that the diameter and area of the same object express an approximately linear relation. The prediction results of OSBP-Net are compared to several state-of-the-art machine learning-based CADq methods. The comparison reveals that the proposed method substantially outperforms the competing methods, indicating its great potential for spine CADq.

ICRA Conference 2020 Conference Paper

Aggressive Online Control of a Quadrotor via Deep Network Representations of Optimality Principles

  • Shuo Li
  • Ekin Öztürk
  • Christophe De Wagter
  • Guido C. H. E. de Croon
  • Dario Izzo

Optimal control holds great potential to improve a variety of robotic applications. The application of optimal control on board limited platforms has been severely hindered by the large computational requirements of current state-of-the-art implementations. In this work, we make use of a deep neural network to directly map the robot states to control actions. The network is trained offline to imitate the optimal control computed by a time-consuming direct nonlinear method. A mixture of time optimality and power optimality is considered, with a continuation parameter used to select the predominance of each objective. We apply our networks (termed G&CNets) to aggressive quadrotor control, first in simulation and then in the real world. We give insight into the factors that influence the `reality gap' between the quadrotor model used by the offline optimal control method and the real quadrotor. Furthermore, we explain how we set up the model and the control structure on board the real quadrotor to successfully close this gap and perform time-optimal maneuvers in the real world. Finally, the performance of G&CNets is compared to state-of-the-art differential-flatness-based optimal control methods. We show, in the experiments, that G&CNets lead to significantly faster trajectory execution due, in part, to the less restrictive nature of the allowed state-to-input mappings.

JBHI Journal 2020 Journal Article

Direct Cup-to-Disc Ratio Estimation for Glaucoma Screening via Semi-Supervised Learning

  • Rongchang Zhao
  • Xuanlin Chen
  • Xiyao Liu
  • Zailiang Chen
  • Fan Guo
  • Shuo Li

Glaucoma is a chronic eye disease that leads to irreversible vision loss. The Cup-to-Disc Ratio (CDR) serves as the most important indicator for glaucoma screening and plays a significant role in clinical screening and early diagnosis of glaucoma. In general, obtaining the CDR requires measurement on manually or automatically segmented optic disc and cup. Although great efforts have been devoted to this problem, obtaining CDR values automatically with high accuracy and robustness remains a great challenge due to the heavy overlap between the optic cup and neuroretinal rim regions. In this paper, a direct CDR estimation method is proposed based on a well-designed semi-supervised learning scheme, in which CDR estimation is formulated as a general regression problem and optic disc/cup segmentation is bypassed. The method directly regresses the CDR value from a deep feature representation of the optic nerve head, without intermediate segmentation. The scheme is a two-stage cascaded approach comprised of two phases: unsupervised feature representation of the fundus image with a convolutional neural network (MFPPNet), and CDR value regression by a random forest regressor. The proposed scheme is validated on the challenging glaucoma dataset Direct-CSU and the public ORIGA dataset, and the experimental results demonstrate that our method achieves a low average CDR error of 0.0563 and a correlation of around 0.726 with measurements obtained from manual segmentation of the optic disc/cup by human experts. Our estimated CDR values are also tested for glaucoma screening, achieving an area under the curve of 0.905 on a dataset of 421 fundus images. The experiments show that the proposed method is capable of state-of-the-art CDR estimation and satisfactory glaucoma screening with the calculated CDR value.

JBHI Journal 2020 Journal Article

MRLN: Multi-Task Relational Learning Network for MRI Vertebral Localization, Identification, and Segmentation

  • Ranran Zhang
  • Xiaoyan Xiao
  • Zhi Liu
  • Yujun Li
  • Shuo Li

Magnetic resonance imaging (MRI) vertebral localization, identification, and segmentation are important steps in the automatic analysis of spines. Due to the similar appearances of vertebrae, their accurate segmentation, localization, and identification remain challenging. Previous methods solved the three tasks independently, ignoring the intrinsic correlation among them. In this paper, we propose a multi-task relational learning network (MRLN) that utilizes both the relationships between vertebrae and the relevance of the three tasks. A dilation convolution group is used to expand the receptive field, and an LSTM (Long Short-Term Memory) to learn prior knowledge of the order relationship between the vertebral bodies. We introduce a co-attention module to learn the correlation information, localization-guided segmentation attention (LGSA) and segmentation-guided localization attention (SGLA), in the decoder stage of the segmentation and localization tasks. Learning two tasks simultaneously, as well as the correlation between them, can not only avoid the overfitting of a single task but also allow the tasks to correct each other. To avoid the cumbersome weight adjustment among the loss functions of different tasks, we formulate a novel XOR loss that provides a direct evaluation criterion for the localization relationship between semantic location regression and semantic segmentation. The method was evaluated on a dataset that includes multiple MRI modalities (T1 and T2) and various fields of view. Experimental results demonstrate that both the co-attention module and the XOR loss contribute to performance that outperforms the most recent state of the art.

JBHI Journal 2020 Journal Article

Multiple Axial Spine Indices Estimation via Dense Enhancing Network With Cross-Space Distance-Preserving Regularization

  • Liyan Lin
  • Xi Tao
  • Shumao Pang
  • Zhihai Su
  • Hai Lu
  • Shuo Li
  • Qianjin Feng
  • Bo Chen

Automatic estimation of axial spine indices is clinically desired for various spine computer aided procedures, such as disease diagnosis, therapeutic evaluation, pathophysiological understanding, risk assessment, and biomechanical modeling. Currently, the spine indices are manually measured by physicians, which is time-consuming and laborious. Even worse, the tedious manual procedure might result in inaccurate measurement. To deal with this problem, in this paper we aim at developing an automatic method to estimate multiple indices from axial spine images. Inspired by the success of deep learning for regression problems and of the densely connected network for image classification, we propose a dense enhancing network (DE-Net) which uses dense enhancing blocks (DEBs) as its main body, where a feature enhancing layer is added to each bypass in a dense block. The DEB is designed to enhance discriminative feature embedding from the intervertebral disc and the dural sac areas. In addition, the cross-space distance-preserving regularization (CSDPR), which enforces consistent inter-sample distances between the output and the label spaces, is proposed to regularize the loss function of the DE-Net. To train and validate the proposed method, we collected 895 axial spine MRI images from 143 subjects and manually measured the indices as the ground truth. The results show that all deep learning models obtain very small prediction errors, and the proposed DE-Net with CSDPR acquires the smallest error among all methods, indicating that our method has great potential for spine computer aided procedures.
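The cross-space distance-preserving idea admits a compact sketch: compare pairwise inter-sample distances computed in the prediction space and in the label space, and penalize their mismatch. The mean-squared-difference form below is an assumption for illustration; the paper's exact regularizer may differ.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance between every pair of rows of A, shape (n, n)."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def csdpr_penalty(pred, target):
    """Cross-space distance-preserving regularization (sketch):
    the inter-sample distance structure of the predicted indices
    should match that of the ground-truth indices."""
    return np.mean((pairwise_dist(pred) - pairwise_dist(target)) ** 2)
```

Note that the penalty constrains only relative geometry: shifting all predictions by a constant leaves every pairwise distance, and hence the penalty, unchanged, so it complements rather than replaces a pointwise regression loss.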

AAAI Conference 2020 Conference Paper

OF-MSRN: Optical Flow-Auxiliary Multi-Task Regression Network for Direct Quantitative Measurement, Segmentation and Motion Estimation

  • Chengqian Zhao
  • Cheng Feng
  • Dengwang Li
  • Shuo Li

Comprehensively analyzing the carotid artery is critically significant to diagnosing and treating cardiovascular diseases. The objective of this work is to simultaneously achieve direct quantitative measurement and automated segmentation of the lumen diameter and intima-media thickness, as well as motion estimation of the carotid wall. No previous work has achieved such a comprehensive analysis of the carotid artery, due to three intractable challenges: 1) the tiny intima-media is challenging to measure and segment; 2) artifacts generated by radial motion restrict the accuracy of measurement and segmentation; 3) occlusions on diseased carotid walls generate dynamic complexity and indeterminacy. In this paper, we propose a novel optical flow-auxiliary multi-task regression network named OF-MSRN to overcome these challenges. We concatenate multi-scale features to a regression network to simultaneously achieve measurement and segmentation, which makes full use of the potential correlation between the two tasks. More importantly, we explore an optical flow auxiliary module that exploits the co-promotion of segmentation and motion estimation to overcome the restrictions of radial motion. Besides, we evaluate consistency between forward and backward optical flow to improve the accuracy of motion estimation on the diseased carotid wall. Extensive experiments on US sequences of 101 patients demonstrate the superior performance of OF-MSRN on the comprehensive analysis of the carotid artery, utilizing the dual optimization of the optical flow auxiliary module.

ICRA Conference 2020 Conference Paper

Robust Model Predictive Shielding for Safe Reinforcement Learning with Stochastic Dynamics

  • Shuo Li
  • Osbert Bastani

We propose a framework for safe reinforcement learning that can handle stochastic nonlinear dynamical systems. We focus on the setting where the nominal dynamics are known, and are subject to additive stochastic disturbances with known distribution. Our goal is to ensure the safety of a control policy trained using reinforcement learning, e.g., in a simulated environment. We build on the idea of model predictive shielding (MPS), where a backup controller is used to override the learned policy as needed to ensure safety. The key challenge is how to compute a backup policy in the context of stochastic dynamics. We propose to use a tube-based robust nonlinear model predictive controller (NMPC) as the backup controller. We estimate the tubes using sampled trajectories, leveraging ideas from statistical learning theory to obtain high-probability guarantees. We empirically demonstrate that our approach can ensure safety in stochastic systems, including cart-pole and a non-holonomic particle with random obstacles.
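The tube-estimation step can be illustrated with a toy sketch: sample disturbance realizations, roll out the dynamics, and take the worst observed deviation from the nominal trajectory as an empirical tube radius. The scalar linear system and the max-over-samples estimate below are assumptions for illustration only; the paper uses a tube-based nonlinear MPC with statistical-learning-theory guarantees.

```python
import numpy as np

def estimate_tube_radius(a=0.9, noise_std=0.05, horizon=20,
                         n_samples=200, seed=0):
    """Empirical tube radius for x_{t+1} = a*x_t + w_t around the
    nominal (w = 0) trajectory, from sampled disturbance rollouts."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        dev = 0.0                       # deviation from the nominal state
        for _ in range(horizon):
            dev = a * dev + rng.normal(0.0, noise_std)
            worst = max(worst, abs(dev))
    return worst
```

A state that stays within this radius of the nominal trajectory across the sampled rollouts is, with high empirical confidence, inside the tube; turning this into a formal high-probability guarantee is where the statistical learning theory in the paper comes in.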

JBHI Journal 2019 Journal Article

Direct Segmentation-Based Full Quantification for Left Ventricle via Deep Multi-Task Regression Learning Network

  • Xiuquan Du
  • Renjun Tang
  • Susu Yin
  • Yanping Zhang
  • Shuo Li

Quantitative analysis of the heart is extremely necessary and significant for detecting and diagnosing heart disease, yet there are still some challenges. In this study, we propose a new end-to-end segmentation-based deep multi-task regression learning model (Indices-JSQ) to make a holonomic quantitative analysis of the left ventricle (LV), which contains a segmentation network (Img2Contour) and a multi-task regression network (Contour2Indices). First, Img2Contour, which contains a deep convolutional encoder-decoder module, is designed to obtain the LV contour. Then, the predicted contour is fed as input to Contour2Indices for full quantification. On the whole, we take into account the relationship between different tasks, which can serve as a complementary advantage. Meanwhile, instead of using images directly from the original dataset, we use the segmented contour of the original image to estimate the cardiac indices, achieving better and more accurate results. We conduct experiments on MR sequences of 145 subjects and obtain results of 157 mm$^2$, 2.43 mm, and 1.29 mm on areas, dimensions, and regional wall thicknesses, respectively, and a Dice metric of 0.87. These results show that the proposed method outperforms the other state-of-the-art methods and demonstrate that our method has great potential in cardiac MR image segmentation, comprehensive clinical assessment, and diagnosis.

IROS Conference 2019 Conference Paper

Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

  • Arbaaz Khan
  • Chi Zhang
  • Shuo Li
  • Jiayue Wu
  • Brent Schlotfeldt
  • Sarah Y. Tang
  • Alejandro Ribeiro
  • Osbert Bastani

In this paper, we present a learning approach to goal assignment and trajectory planning for unlabeled robots operating in 2D, obstacle-filled workspaces. More specifically, we tackle the unlabeled multi-robot motion planning problem with motion constraints as a multi-agent reinforcement learning problem with some sparse global reward. In contrast with previous works, which formulate an entirely new hand-crafted optimization cost or trajectory generation algorithm for a different robot dynamic model, our framework is a general approach that is applicable to arbitrary robot models. Further, by using the velocity obstacle, we devise a smooth projection that guarantees collision free trajectories for all robots with respect to their neighbors and obstacles. The efficacy of our algorithm is demonstrated through varied simulations. A video describing our method and results can be found here.

AAAI Conference 2019 Conference Paper

Weakly-Supervised Simultaneous Evidence Identification and Segmentation for Automated Glaucoma Diagnosis

  • Rongchang Zhao
  • Wangmin Liao
  • Beiji Zou
  • Zailiang Chen
  • Shuo Li

Evidence identification, optic disc segmentation, and automated glaucoma diagnosis are the most clinically significant tasks for clinicians to assess fundus images. However, delivering the three tasks simultaneously is extremely challenging due to the high variability of fundus structure and the lack of datasets with complete annotations. In this paper, we propose an innovative Weakly-Supervised Multi-Task Learning method (WSMTL) for accurate evidence identification, optic disc segmentation, and automated glaucoma diagnosis. The WSMTL method only uses weak-label data with binary diagnostic labels (normal/glaucoma) for training, while producing pixel-level segmentation masks and diagnoses for testing. The WSMTL consists of a skip and densely connected CNN to capture multi-scale discriminative representations of fundus structure; a well-designed pyramid integration structure to generate a high-resolution evidence map for evidence identification, in which pixels with higher values represent higher confidence to highlight abnormalities; a constrained clustering branch for optic disc segmentation; and a fully-connected discriminator for automated glaucoma diagnosis. Experimental results show that our proposed WSMTL effectively and simultaneously delivers evidence identification, optic disc segmentation (89.6% TP Dice), and accurate glaucoma diagnosis (92.4% AUC). This endows our WSMTL with great potential for the effective clinical assessment of glaucoma.

JBHI Journal 2018 Journal Article

Robust Segmentation of Intima–Media Borders With Different Morphologies and Dynamics During the Cardiac Cycle

  • Shen Zhao
  • Zhifan Gao
  • Heye Zhang
  • Yaoqin Xie
  • Jianwen Luo
  • Dhanjoo Ghista
  • Zhanghong Wei
  • Xiaojun Bi

Segmentation of carotid intima-media (IM) borders from ultrasound sequences is challenging because of unknown image noise and varying IM border morphologies and/or dynamics. In this paper, we have developed a state-space framework to sequentially segment the carotid IM borders in each image throughout the cardiac cycle. In this framework, an H∞ filter is used to solve the state-space equations, and a grayscale-derivative constraint snake is used to provide accurate measurements for the H∞ filter. We have evaluated the performance of our approach by comparing our segmentation results to the manually traced contours of ultrasound image sequences of three synthetic models and 156 real subjects from four medical centers. The results show that our method has a small segmentation error (lumen intima, LI: 53 ± 67 μm; media-adventitia, MA: 57 ± 63 μm) for synthetic and real sequences of different image characteristics, and also agrees well with the manual segmentation (LI: bias = 1.44 μm; MA: bias = -3.38 μm). Our approach can robustly segment the carotid ultrasound sequences with various IM border morphologies, dynamics, and unknown image noise. These results indicate the potential of our framework to segment IM borders for clinical diagnosis.

IJCAI Conference 2015 Conference Paper

Bi-Parameter Space Partition for Cost-Sensitive SVM

  • Bin Gu
  • Victor S. Sheng
  • Shuo Li

Model selection is an important problem for cost-sensitive SVM (CS-SVM). Although using the solution path to find globally optimal parameters is a powerful method for model selection, it is a challenge to extend the framework to solve for the two regularization parameters of CS-SVM simultaneously. To overcome this challenge, we take three main steps in this paper. (i) A critical-regions-based bi-parameter space partition algorithm is proposed to present all piecewise linearities of CS-SVM. (ii) An invariant-regions-based bi-parameter space partition algorithm is further proposed to compute empirical errors for all parameter pairs. (iii) The globally optimal solutions for K-fold cross validation are computed by superposing K invariant-regions-based bi-parameter space partitions into one. The three steps constitute the model selection of CS-SVM, which can find globally optimal parameter pairs in K-fold cross validation. Experimental results on seven normal datasets and four imbalanced datasets show that our proposed method has better generalization ability than various kinds of grid search methods, with less running time.
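The two regularization parameters in question weight misclassification costs per class. A minimal linear CS-SVM trained by subgradient descent, shown below as a simple stand-in for the solution-path machinery (the objective form and optimizer are assumptions for illustration, not the paper's method), makes the role of the parameter pair concrete:

```python
import numpy as np

def train_cs_svm(X, y, c_pos=1.0, c_neg=1.0, lr=0.01, epochs=200):
    """Linear cost-sensitive SVM via subgradient descent on
    0.5*||w||^2 + sum_i C_{y_i} * max(0, 1 - y_i*(w.x_i + b)),
    where C_{y_i} is c_pos or c_neg depending on the class of x_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            cost = c_pos if yi > 0 else c_neg
            if yi * (xi @ w + b) < 1:    # margin violated: hinge subgradient
                w -= lr * (w - cost * yi * xi)
                b += lr * cost * yi
            else:                        # only the regularizer contributes
                w -= lr * w
    return w, b
```

Raising `c_pos` relative to `c_neg` shifts the decision boundary to penalize errors on the positive class more heavily, which is exactly the trade-off the bi-parameter space partition explores exhaustively.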

JBHI Journal 2014 Journal Article

Models and Methods for Quantitative Analysis of Surface-Enhanced Raman Spectra

  • Shuo Li
  • James O. Nyagilo
  • Digant P. Dave
  • Jean Gao

Quantitative analysis of surface-enhanced Raman spectra using scattering nanoparticles has shown potential and promising applications in in vivo molecular imaging. Diverse approaches have been used for quantitative analysis of Raman spectral information, which can be categorized as direct classical least squares models, full-spectrum multivariate calibration models, selected multivariate calibration models, and latent variable regression (LVR) models. However, the working principle of these methods in the Raman spectra application remains poorly understood, and a clear picture of the overall performance of each model is missing. Based on the characteristics of Raman spectra, in this paper we first provide the theoretical foundation of the aforementioned commonly used models and show why the LVR models are more suitable for quantitative analysis of Raman spectra. Then, we demonstrate the fundamental connections and differences between different LVR methods, such as principal component regression, reduced-rank regression, partial least squares regression (PLSR), canonical correlation regression, and robust canonical analysis, by comparing their objective functions and constraints. We further show that PLSR is literally a blend of multivariate calibration and feature extraction that relates concentrations of nanotags to spectrum intensity. Its features (a.k.a. latent variables) serve two purposes: the best representation of the predictor matrix and correlation with the response matrix. These illustrations give a new understanding of traditional PLSR and explain why PLSR outperforms other methods in quantitative analysis of the Raman spectra problem. In the end, all the methods are tested on Raman spectra datasets with different evaluation criteria to evaluate their performance.
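The dual role of the latent variables described above (representing the predictor matrix while covarying with the response) is visible directly in the NIPALS recursion for single-response PLS. Below is a minimal PLS1 sketch on synthetic, noiseless data; the mean-centering, deflation scheme, and test data are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

def pls1(X, y, n_components=2):
    """Minimal PLS1 via NIPALS deflation. Returns a predict function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # weight: direction of max covariance with y
        t = Xc @ w                       # scores (the latent variable)
        p = Xc.T @ t / (t @ t)           # X loadings: t's representation of X
        c = (yc @ t) / (t @ t)           # y loading: t's correlation with y
        Xc = Xc - np.outer(t, p)         # deflate X ...
        yc = yc - c * t                  # ... and y, then extract the next component
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # regression coefficients in the original space
    return lambda Xnew: y_mean + (Xnew - x_mean) @ B
```

Each weight vector `w` maximizes covariance with the response, while the loadings `p` reconstruct the predictors, which is precisely the blend of calibration and feature extraction the abstract attributes to PLSR.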