Arrow Research search

Author name cluster

Si Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

JBHI Journal 2026 Journal Article

Adversarial and Correlation-Aware Data Augmentation Framework for Multi-Label Chest X-Ray Image Classification

  • Zhanbo Liang
  • Si Li
  • Jian Zhu

Deep learning-based methods have shown promising results in multi-label chest X-ray (CXR) image classification. However, most existing methods rely on large-scale fully-annotated datasets, which are costly and laborious to obtain. Therefore, training a high-performance model with limited annotation remains a significant challenge in practice. To address this issue, we propose an Adversarial and Correlation-Aware Data Augmentation (ACAA) framework for multi-label CXR image classification. First, we generate pseudo labels based on the model predictions on weakly-augmented images. Next, we propose a Generalized Nesterov Iterative Fast Gradient Sign Method (GNI-FGSM) to generate effective adversarial examples as image-level strongly-augmented data. Then, we introduce an image-level adversarial-augmentation-based consistency regularization to supervise model predictions on the above image-level adversarial examples. To explore more diverse data transformations, we further perform adversarial augmentation in the feature space by imposing perturbations on the extracted features of each image and introduce a feature-level adversarial-augmentation-based consistency regularization. Furthermore, we propose a Batch-level Mamba module (Batch-Mamba) coupled with a batch-level Mamba-based correlation regularization to explore inter-sample correlations along the batch dimension. Extensive experiments on two large CXR datasets (CheXpert and MIMIC-CXR) demonstrate the effectiveness of the proposed ACAA framework for multi-label CXR image classification under limited-annotation scenarios.
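The abstract builds on the iterative FGSM family of adversarial attacks with Nesterov momentum. As a rough orientation, here is a minimal sketch of a standard Nesterov-momentum iterative FGSM on a toy logistic model; the paper's GNI-FGSM generalizes this idea, and the exact formulation, model, and hyperparameters here are assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(x, w, y):
    # Gradient of binary cross-entropy w.r.t. the input x
    # for a fixed linear model with weights w.
    p = sigmoid(w @ x)
    return (p - y) * w

def ni_fgsm(x, w, y, eps=0.1, steps=10, mu=0.9):
    """Iterative FGSM with a Nesterov look-ahead and momentum accumulation."""
    alpha = eps / steps                       # per-step budget
    g = np.zeros_like(x)                      # accumulated momentum
    x_adv = x.copy()
    for _ in range(steps):
        x_la = x_adv + mu * alpha * g         # Nesterov look-ahead point
        grad = loss_grad(x_la, w, y)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)  # L1-normalized momentum
        x_adv = x_adv + alpha * np.sign(g)    # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
    return x_adv

x = np.array([0.5, -1.2, 0.3])
w = np.array([1.0, 2.0, -0.5])
x_adv = ni_fgsm(x, w, y=1)
```

In the paper's setting the perturbed samples serve as strongly-augmented views for consistency regularization rather than as attacks.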

JBHI Journal 2026 Journal Article

PASAformer: Cerebrovascular Disease Classification with Medical Prior-Guided Adapter and Pathology-Aware Sparse Attention

  • Baiming Chen
  • Xin Gao
  • Weiguo Zhang
  • Sue Cao
  • Si Li
  • Linhai Yan

Cerebrovascular diseases (CVDs) such as aneurysms, arteriovenous malformations, stenosis, and Moyamoya disease are major public health concerns. Accurate classification of these conditions is essential for timely intervention, yet current computer-aided methods often exhibit limited representational capacity, feature redundancy, and insufficient interpretability, restricting clinical applicability. We propose PASAformer, a Swin-Transformer-based framework for cerebrovascular disease classification on Digital Subtraction Angiography (DSA). PASAformer incorporates a Pathology-Aware Sparse Attention (PASA) module that emphasizes lesion-related regions while suppressing background redundancy. Inserted into the Swin backbone, PASA replaces dense window self-attention, improving computational efficiency while preserving the hierarchical architecture. We further employ the MiAMix data augmenter to increase sample diversity, and incorporate a CombinedAdapter encoder that injects anatomical priors from the frozen Medical Segment Anything Model (MED-SAM) into early-stage representations, strengthening discriminative power under limited supervision. To support research in this underexplored area, we curate CDSA-NEO, a proprietary DSA dataset comprising more than 1,700 static images across four major cerebrovascular disease categories, constituting the first large-scale benchmark of its kind. Furthermore, an external cohort of angiographic runs with sequential, unselected frames is used to assess robustness in realistic temporal workflows. Extensive experiments on CDSA-NEO and public vascular datasets demonstrate that PASAformer achieves competitive precision and balanced accuracy compared to representative state-of-the-art models, while providing more focused visual explanations. These results suggest that PASAformer can support automated cerebrovascular disease classification on angiography, and that CDSA-NEO provides a benchmark for future method development and evaluation.

NeurIPS Conference 2025 Conference Paper

Audio-Sync Video Generation with Multi-Stream Temporal Control

  • Shuchen Weng
  • Haojie Zheng
  • Zheng Chang
  • Si Li
  • Boxin Shi
  • Xinlong Wang

Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively, resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapping subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment.

NeurIPS Conference 2025 Conference Paper

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

  • Haolong Yan
  • Yeqing Shen
  • Xin Huang
  • Jia Wang
  • Kaijun Tan
  • Zhixuan Liang
  • Hongxin Li
  • Zheng Ge

With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios, and offering practical guidance for building more capable and generalizable GUI agents.

ECAI Conference 2025 Conference Paper

Wavelet-Based Sinogram Inner-Structure Aware Residual Diffusion Network for Low-Dose SPECT Reconstruction

  • Guihao Wen
  • Yu Luo 0004
  • Si Li

Despite the effectiveness of single-photon emission computed tomography (SPECT) imaging in clinics, the ionizing radiation induced by its radiotracer poses a potential hazard to human health. Clinically, a lower radiation dose can be achieved by reducing the activity of the administered radiotracer, which inevitably leads to increased Poisson noise, severe artifacts, and degraded spatial resolution in the sinogram domain. Although existing sinogram restoration methods for the low-dose scenario have made significant progress in noise suppression, they still fail to effectively recover the detailed sinusoidal features and intrinsic contrast within the sinogram. In addition, existing methods seldom explore the sinogram inner-structure, which may hinder further improvement of reconstructed image quality. To address these issues, we propose a residual framework based on the diffusion model that leverages the frequency characteristics of sinograms. The proposed framework consists of two stages. The first stage employs the Residual Denoising Diffusion Model (RDDM) to denoise the low-dose sinogram, thereby producing a noise-suppressed coarse output. In the second stage, we develop a novel Wavelet-based Sinogram Structure Interaction network (WSSI-net) to explicitly and selectively process high- and low-frequency features of the coarse output. In particular, we propose a High-Frequency Restoration Module (HFRM) to further enhance high-frequency features, as well as a low-frequency Graph Convolution block (GC block) to effectively exploit the inherent inner-structure within low-frequency components. Moreover, we further introduce a Cross-Frequency Interaction Module (CFIM) to achieve correlation learning between high- and low-frequency features. Extensive experiments demonstrate that the proposed framework achieves superior reconstruction performance compared to state-of-the-art methods.

NeurIPS Conference 2024 Conference Paper

SfPUEL: Shape from Polarization under Unknown Environment Light

  • Youwei Lyu
  • Heng Guo
  • Kailong Zhang
  • Si Li
  • Boxin Shi

Shape from polarization (SfP) benefits from advancements like polarization cameras for single-shot normal estimation, but its performance heavily relies on light conditions. This paper proposes SfPUEL, an end-to-end SfP method to jointly estimate surface normal and material under unknown environment light. To handle this challenging light condition, we design a transformer-based framework for enhancing the perception of global context features. We further propose to integrate photometric stereo (PS) priors from pretrained models to enrich extracted features for high-quality normal predictions. As metallic and dielectric materials exhibit different BRDFs, SfPUEL additionally predicts dielectric and metallic material segmentation to further boost performance. Experimental results on synthetic and our collected real-world dataset demonstrate that SfPUEL significantly outperforms existing SfP and single-shot normal estimation methods. The code and dataset are available at https://github.com/YouweiLyu/SfPUEL.

JBHI Journal 2023 Journal Article

An Effective Co-Support Guided Analysis Model for Multi-Contrast MRI Reconstruction

  • Yu Luo
  • Manting Wei
  • Si Li
  • Jie Ling
  • Guobo Xie
  • Shun Yao

Multi-contrast magnetic resonance imaging (MRI) is widely used in clinical diagnosis. However, it is time-consuming to obtain MR data of multiple contrasts, and the long scanning time may introduce unexpected physiological motion artifacts. To obtain MR images of higher quality within limited acquisition time, we propose an effective model to reconstruct images from under-sampled k-space data of one contrast by utilizing another fully-sampled contrast of the same anatomy. Specifically, multiple contrasts from the same anatomical section exhibit similar structures. Enlightened by the fact that the co-support of an image provides an appropriate characterization of morphological structures, we develop a similarity regularization of the co-supports across multiple contrasts. In this case, the guided MRI reconstruction problem is naturally formulated as a mixed integer optimization model consisting of three terms: a k-space data fidelity term, a smoothness-enforcing regularization, and a co-support regularization. An effective algorithm is developed to solve this minimization model alternately. In the numerical experiments, T2-weighted images are used as the guidance to reconstruct T1-weighted/T2-weighted-Fluid-Attenuated Inversion Recovery (T2-FLAIR) images, and PD-weighted images are used as the guidance to reconstruct PDFS-weighted images, respectively, from their under-sampled k-space data. The experimental results demonstrate that the proposed model outperforms other state-of-the-art multi-contrast MRI reconstruction methods in terms of both quantitative metrics and visual performance at various sampling ratios.

NeurIPS Conference 2023 Conference Paper

L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors

  • Zheng Chang
  • Shuchen Weng
  • Peixuan Zhang
  • Yu Li
  • Si Li
  • Boxin Shi

Language-based colorization produces plausible and visually pleasing colors under the guidance of user-friendly natural language descriptions. Previous methods implicitly assume that users provide comprehensive color descriptions for most of the objects in the image, which leads to suboptimal performance. In this paper, we propose a unified model to perform language-based colorization with any-level descriptions. We leverage the pretrained cross-modality generative model for its robust language understanding and rich color priors to handle the inherent ambiguity of any-level descriptions. We further design modules to align with input conditions to preserve local spatial structures and prevent the ghosting effect. With the proposed novel sampling strategy, our model achieves instance-aware colorization in diverse and complex scenarios. Extensive experimental results demonstrate our advantages of effectively handling any-level descriptions and outperforming both language-based and automatic colorization methods. The code and pretrained models are available at https://github.com/changzheng123/L-CAD.

AAAI Conference 2023 Conference Paper

Polarization-Aware Low-Light Image Enhancement

  • Chu Zhou
  • Minggui Teng
  • Youwei Lyu
  • Si Li
  • Chao Xu
  • Boxin Shi

Polarization-based vision algorithms have found uses in various applications since polarization provides additional physical constraints. However, in low-light conditions, their performance would be severely degraded since the captured polarized images could be noisy, leading to noticeable degradation in the degree of polarization (DoP) and the angle of polarization (AoP). Existing low-light image enhancement methods cannot handle polarized images well since they operate in the intensity domain, without effectively exploiting the information provided by polarization. In this paper, we propose a Stokes-domain enhancement pipeline along with a dual-branch neural network to handle the problem in a polarization-aware manner. Two application scenarios (reflection removal and shape from polarization) are presented to show how our enhancement can improve their results.
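For context on the Stokes-domain quantities the abstract mentions, the sketch below computes the linear Stokes parameters, DoP, and AoP from intensities captured at polarizer angles 0, 45, 90, and 135 degrees. This is the standard textbook relation, not the paper's code; variable names are illustrative.

```python
import numpy as np

def stokes_from_polarized(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle captures."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                        # horizontal vs vertical component
    s2 = i45 - i135                      # diagonal components
    return s0, s1, s2

def dop_aop(s0, s1, s2, eps=1e-12):
    """Degree and angle of linear polarization from Stokes parameters."""
    dop = np.sqrt(s1**2 + s2**2) / (s0 + eps)  # degree of polarization
    aop = 0.5 * np.arctan2(s2, s1)             # angle of polarization (radians)
    return dop, aop

# Fully polarized light aligned with 0 degrees: bright at 0, dark at 90.
s0, s1, s2 = stokes_from_polarized(1.0, 0.5, 0.0, 0.5)
dop, aop = dop_aop(s0, s1, s2)
```

Because DoP and AoP are ratios and differences of noisy intensities, low-light noise corrupts them disproportionately, which is the motivation for enhancing in the Stokes domain rather than the intensity domain.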

AAAI Conference 2022 Conference Paper

L-CoDe: Language-Based Colorization Using Color-Object Decoupled Conditions

  • Shuchen Weng
  • Hao Wu
  • Zheng Chang
  • Jiajun Tang
  • Si Li
  • Boxin Shi

Colorizing a grayscale image is inherently an ill-posed problem with multi-modal uncertainty. Language-based colorization offers a natural way of interaction to reduce such uncertainty via a user-provided caption. However, the color-object coupling and mismatch issues make the mapping from word to color difficult. In this paper, we propose L-CoDe, a Language-based Colorization network using color-object Decoupled conditions. A predictor for the object-color corresponding matrix (OCCM) and a novel attention transfer module (ATM) are introduced to solve the color-object coupling problem. To deal with color-object mismatch that results in incorrect color-object correspondence, we adopt a soft-gated injection module (SIM). We further present a new dataset containing annotated color-object pairs to provide supervisory signals for resolving the coupling problem. Experimental results show that our approach outperforms state-of-the-art methods conditioned on captions.

NeurIPS Conference 2019 Conference Paper

Reflection Separation using a Pair of Unpolarized and Polarized Images

  • Youwei Lyu
  • Zhaopeng Cui
  • Si Li
  • Marc Pollefeys
  • Boxin Shi

When we take photos through glass windows or doors, the transmitted background scene is often blended with undesirable reflection. Separating the two layers to enhance the image quality is of vital importance for both human and machine perception. In this paper, we propose to exploit physical constraints from a pair of unpolarized and polarized images to separate reflection and transmission layers. Due to the simplified capturing setup, the system becomes more underdetermined compared with existing polarization-based solutions that take three or more images as input. We propose to solve semireflector orientation estimation first to make the physical image formation well-posed, and then learn to reliably separate the two layers using a refinement network with gradient loss. Quantitative and qualitative experimental results show our approach performs favorably over existing polarization and single-image based solutions.
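To see why orientation estimation makes the two-image setup well-posed, consider a deliberately simplified per-pixel mixing model (illustrative only, not the paper's exact image formation): the unpolarized capture sums transmission T and reflection R, while the polarized capture passes half of the unpolarized transmission and an orientation-dependent fraction a of the reflection. Once a is known, the layers follow from a linear solve.

```python
# Toy two-layer separation under an assumed linear mixing model:
#   I_un  = T + R
#   I_pol = 0.5 * T + a * R
# where a depends on the semireflector orientation. With a unknown the
# system is underdetermined; with a estimated, it is a 2x2 solve per pixel.

def separate(i_un, i_pol, a):
    """Recover transmission and reflection given the mixing fraction a."""
    r = (i_pol - 0.5 * i_un) / (a - 0.5)  # reflection layer
    t = i_un - r                           # transmission layer
    return t, r

# Synthetic check: mix known layers, then recover them.
t_true, r_true, a = 0.7, 0.3, 0.2
i_un = t_true + r_true
i_pol = 0.5 * t_true + a * r_true
t, r = separate(i_un, i_pol, a)
```

The degenerate case a = 0.5 (reflection behaving like unpolarized light) is exactly where the two captures carry no extra information, which mirrors why the physical setup needs a favorable semireflector orientation.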

IJCAI Conference 2015 Conference Paper

Sketch the Storyline with CHARCOAL: A Non-Parametric Approach

  • Siliang Tang
  • Fei Wu
  • Si Li
  • Weiming Lu
  • Zhongfei Zhang
  • Yueting Zhuang

Generating a coherent synopsis and revealing the development threads for news stories from the increasing amounts of news content remains a formidable challenge. In this paper, we propose a hddCRP (hybrid distance-dependent Chinese Restaurant Process) based HierARChical tOpic model for news Article cLustering, abbreviated as CHARCOAL. Given a collection of news articles, the outcome of CHARCOAL is threefold: 1) it aggregates relevant news articles into clusters (i.e., stories); 2) it disentangles the chain links (i.e., storylines) between articles within each story; 3) it discerns the topics that each story is assigned (e.g., the Malaysia Airlines Flight 370 story belongs to the aircraft accident topic and U.S. presidential election stories belong to the politics topic). CHARCOAL completes this task by utilizing a hddCRP as prior, and the entities (e.g., names of persons, organizations, or locations) that appear in news articles as clues. Moreover, the non-parametric nature of CHARCOAL enables the model to adaptively learn the appropriate number of stories and topics from the news corpus. The experimental analysis and results demonstrate both the interpretability and superiority of the proposed approach.