Arrow Research

Author name cluster

Wenbo Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers (32)

AAAI Conference 2026 Conference Paper

FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training

  • Fuhan Cai
  • Yong Guo
  • Jie Li
  • Wenbo Li
  • Jian Chen
  • Xiangzhong Fang

Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned.
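
To illustrate the block-wise replacement idea described above, here is a minimal PyTorch sketch: a block's residual branch is swapped for a single linear layer while the shortcut connection is kept. The class name, dimensions, and the transformer layers used as stand-ins are illustrative assumptions, not the FastFLUX/BRLL implementation.

```python
import torch
import torch.nn as nn

class LinearReplacementBlock(nn.Module):
    """Hypothetical sketch of block-wise replacement: the original residual
    branch of a block is swapped for a single linear layer, while the
    shortcut (identity) connection is preserved for stability."""

    def __init__(self, dim: int):
        super().__init__()
        # Lightweight stand-in for the pruned residual branch.
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut preserved: output = identity + linear approximation.
        return x + self.linear(x)

# Usage: replace one block in a (hypothetical) stack of transformer blocks.
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(4)]
)
blocks[2] = LinearReplacementBlock(64)   # prune a structurally complex block
x = torch.randn(1, 16, 64)
for blk in blocks:
    x = blk(x)
print(x.shape)  # torch.Size([1, 16, 64])
```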

AAAI Conference 2026 Conference Paper

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

  • Shilong Zhang
  • Wenbo Li
  • Shoufa Chen
  • Chongjian Ge
  • Peize Sun
  • Yifu Zhang
  • Yi Jiang
  • Zehuan Yuan

DiT models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage achieves a nearly straight ODE trajectory between low and high resolutions via flow matching, effectively generating fine details and fixing artifacts with minimal NFEs. To ensure a seamless connection between the two independently trained stages during inference, we carefully design degradation strategies during the second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.

AAAI Conference 2026 Conference Paper

QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

  • Bowen Chai
  • Zheng Chen
  • Libo Zhu
  • Wenbo Li
  • Yong Guo
  • Yulun Zhang

Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods.
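
A minimal sketch of the complexity-aware rank-allocation idea, assuming per-layer spatial and temporal complexity scores measured on a calibration set; the scoring, budget, and proportional rule are illustrative assumptions rather than QuantVSR's actual STCA mechanism.

```python
import numpy as np

def allocate_ranks(spatial_c, temporal_c, total_rank_budget, r_min=2):
    """Hypothetical sketch: combine each layer's spatial and temporal
    complexity scores and give its full-precision low-rank auxiliary branch
    a rank proportional to that combined score."""
    score = np.asarray(spatial_c) + np.asarray(temporal_c)
    share = score / score.sum()
    ranks = np.maximum(r_min, np.round(share * total_rank_budget)).astype(int)
    return ranks

# Example: 4 layers; layers with higher complexity receive larger ranks.
print(allocate_ranks([0.2, 0.9, 0.5, 0.4], [0.1, 0.8, 0.3, 0.2], total_rank_budget=64))
```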

AAAI Conference 2026 Conference Paper

SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal

  • Tingyu Yang
  • Jue Gong
  • Jinpei Guo
  • Wenbo Li
  • Yong Guo
  • Yulun Zhang

JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics.
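
A minimal sketch of the quality-factor-aware start-step idea: heavier JPEG compression (lower QF) maps to a later (larger) denoising start timestep. The linear rule and the timestep range are illustrative assumptions; the paper learns this predictor implicitly rather than using a fixed mapping.

```python
def qf_to_start_timestep(predicted_qf: float, t_max: int = 999, t_min: int = 199) -> int:
    """Hypothetical mapping from an (estimated) JPEG quality factor to a
    denoising start timestep: stronger compression -> larger timestep."""
    qf = min(max(predicted_qf, 1.0), 100.0)     # clamp to the JPEG QF range
    frac = (100.0 - qf) / 99.0                  # 0 for QF=100, 1 for QF=1
    return int(round(t_min + frac * (t_max - t_min)))

print(qf_to_start_timestep(10))   # strong compression -> late (large) start step
print(qf_to_start_timestep(90))   # mild compression  -> early (small) start step
```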

AAAI Conference 2026 Conference Paper

Test-Time Preference Optimization for Image Restoration

  • Bingchen Li
  • Xin Li
  • Jiaqi Xu
  • Jiaming Guo
  • Wenbo Li
  • Renjing Pei
  • Zhibo Chen

Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.

JBHI Journal 2025 Journal Article

Automatic Brain Segmentation for PET/MR Dual-Modal Images Through a Cross-Fusion Mechanism

  • Hongyan Tang
  • Zhenxing Huang
  • Wenbo Li
  • Yaping Wu
  • Jianmin Yuan
  • Yang Yang
  • Yan Zhang
  • Jing Qin

The precise segmentation of different brain regions and tissues is usually a prerequisite for the detection and diagnosis of various neurological disorders in neuroscience. Considering the abundance of functional and structural dual-modality information for positron emission tomography/magnetic resonance (PET/MR) images, we propose a novel 3D whole-brain segmentation network with a cross-fusion mechanism introduced to obtain 45 brain regions. Specifically, the network processes PET and MR images simultaneously, employing UX-Net and a cross-fusion block for feature extraction and fusion in the encoder. We test our method by comparing it with other deep learning-based methods, including 3DUXNET, SwinUNETR, UNETR, nnFormer, UNet3D, NestedUNet, ResUNet, and VNet. The experimental results demonstrate that the proposed method achieves better segmentation performance in terms of both visual and quantitative evaluation metrics and achieves more precise segmentation in three views while preserving fine details. In particular, the proposed method achieves superior quantitative results, with a Dice coefficient of 85.73% $\pm$ 0.01%, a Jaccard index of 76.68% $\pm$ 0.02%, a sensitivity of 85.00% $\pm$ 0.01%, a precision of 83.26% $\pm$ 0.03% and a Hausdorff distance (HD) of 4.4885 $\pm$ 14.85%. Moreover, the distribution and correlation of the SUV in the volume of interest (VOI) are also evaluated (PCC > 0.9), indicating consistency with the ground truth and the superiority of the proposed method. In future work, we will utilize our whole-brain segmentation method in clinical practice to assist doctors in accurately diagnosing and treating brain diseases.

NeurIPS Conference 2025 Conference Paper

CamEdit: Continuous Camera Parameter Control for Photorealistic Image Editing

  • Xinran Qin
  • Zhixin Wang
  • Fan Li
  • Haoyu Chen
  • Renjing Pei
  • Wenbo Li
  • Xiaochun Cao

Recent advances in diffusion models have substantially improved text-driven image editing. However, existing frameworks based on discrete textual tokens struggle to support continuous control over camera parameters and smooth transitions in visual effects. These limitations hinder their applications to realistic, camera-aware, and fine-grained editing tasks. In this paper, we present CamEdit, a diffusion-based framework for photorealistic image editing that enables continuous and semantically meaningful manipulation of common camera parameters such as aperture and shutter speed. CamEdit incorporates a continuous parameter prompting mechanism and a parameter-aware modulation module that guides the model in smoothly adjusting focal plane, aperture, and shutter speed, reflecting the effects of varying camera settings within the diffusion process. To support supervised learning in this setting, we introduce CamEdit50K, a dataset specifically designed for photorealistic image editing with continuous camera parameter settings. It contains over 50k image pairs combining real and synthetic data with dense camera parameter variations across diverse scenes. Extensive experiments demonstrate that CamEdit enables flexible, consistent, and high-fidelity image editing, achieving state-of-the-art performance in camera-aware visual manipulation and fine-grained photographic control.

IJCAI Conference 2025 Conference Paper

Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration

  • Long Peng
  • Xin Di
  • ZhanFeng Feng
  • Wenbo Li
  • Renjing Pei
  • Yang Wang
  • Xueyang Fu
  • Yang Cao

Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.

IROS Conference 2025 Conference Paper

Dual-Mode Motion Control of Multi-Stimulus Deformable Miniature Robots with Adaptive Orientation Compensation in Unstructured Environments

  • Shihao Zhong
  • Wenbo Li
  • Haotian Yang
  • Zhenyang Niu
  • Yaozhen Hou
  • Qiang Huang 0002
  • Huaping Wang

Miniature robots hold great promise for performing micromanipulation tasks within hard-to-reach confined spaces. However, effectively maneuvering across complex and unstructured terrain, achieving adaptive morphogenesis, and developing adaptive multimodal locomotion strategies remain challenges for these robotic systems. Here, we develop a multi-stimulus-responsive deformable miniature robot integrated with an adaptive multimodal motion control method. Sodium alginate hydrogel and graphene-coated magnetic elastomer are integrated into the sheet-shaped robot to enable responsiveness to temperature, humidity, and magnetic fields. A kinematic gait model is designed to control oscillatory motion in the semi-contracted state and rotational motion in the fully contracted state of the miniature robot. To automatically mitigate angular deviation between the robot's motion direction and the intended path, an adaptive orientation compensation control algorithm based on Support Vector Regression (SVR) is proposed. Experimental results demonstrate that the proposed robot exhibits capabilities for flexible and accurate navigation within unstructured environments (e.g., rock piles and stomach models), and is further shown to be capable of cargo transport. The proposed adaptive morphogenesis robots, enabled by dual-mode motion control, hold significant potential for targeted delivery and other micromanipulation applications in complex, unstructured, and confined environments.
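
A minimal scikit-learn sketch of SVR-based orientation compensation: a regressor is fit on observed angular deviations and its prediction is subtracted from the next heading command. The features, targets, and synthetic data here are illustrative assumptions, not the paper's controller.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical sketch: learn a compensation offset from observed angular
# deviation between the commanded heading and the robot's actual motion
# direction, then correct the next command.
rng = np.random.default_rng(0)
commanded = rng.uniform(-np.pi, np.pi, size=200)                      # commanded heading (rad)
deviation = 0.15 * np.sin(commanded) + 0.02 * rng.normal(size=200)    # observed drift (rad)

model = SVR(kernel="rbf", C=10.0, epsilon=0.005)
model.fit(commanded.reshape(-1, 1), deviation)

new_cmd = np.array([[0.8]])
corrected = new_cmd.ravel() - model.predict(new_cmd)   # subtract predicted drift
print(corrected)
```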

NeurIPS Conference 2025 Conference Paper

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

  • Yunlong Lin
  • Zixu Lin
  • Kunjie Lin
  • Jinbin Bai
  • Panwang Pan
  • Chenxin Li
  • Haoyu Chen
  • Zhongdao Wang

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities.

NeurIPS Conference 2025 Conference Paper

NopeRoomGS: Indoor 3D Gaussian Splatting Optimization without Camera Pose Input

  • Wenbo Li
  • Yan Xu
  • Mingde Yao
  • Fengjie Liang
  • Jiankai Sun
  • Menglu Wang
  • Guofeng Zhang
  • Linjiang Huang

Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time, high-fidelity view synthesis, but remain critically dependent on camera poses estimated by Structure-from-Motion (SfM), which is notoriously unreliable in textureless indoor environments. To eliminate this dependency, recent pose-free variants have been proposed, yet they often fail under abrupt camera motion due to unstable initialization and purely photometric objectives. In this work, we introduce Nope-RoomGS, an optimization framework with no need for camera pose inputs, which effectively addresses the textureless regions and abrupt camera motion in indoor room environments through a local-to-global optimization paradigm for 3DGS reconstruction. In the local stage, we propose a lightweight local neural geometric representation to bootstrap a set of reliable local 3D Gaussians for separated short video clips, regularized by multi-frame tracking constraints and foundation model depth priors. This enables reliable initialization even in textureless regions or under abrupt camera motions. In the global stage, we fuse local 3D Gaussians into a unified 3DGS representation through an alternating optimization strategy that jointly refines camera poses and Gaussian parameters, effectively mitigating gradient interference between them. Furthermore, we decompose camera pose optimization based on a piecewise planarity assumption, further enhancing robustness under abrupt camera motion. Extensive experiments on Replica, ScanNet and Tanks & Temples demonstrate the state-of-the-art performance of our method in both camera pose estimation and novel view synthesis.

NeurIPS Conference 2025 Conference Paper

OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

  • Jinpei Guo
  • Yifei Ji
  • Zheng Chen
  • Kai Liu
  • Min Liu
  • Wang Rao
  • Wenbo Li
  • Yong Guo

Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models are available at https://github.com/jp-guo/OSCAR/.
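
A minimal sketch of the bit-rate-to-pseudo-timestep mapping idea: lower bit-rates imply noisier compressed latents and therefore a larger pseudo diffusion timestep. The log-linear rule and the constants are illustrative assumptions, not OSCAR's learned conditioning.

```python
import math

def bitrate_to_pseudo_timestep(bpp: float, bpp_min: float = 0.05,
                               bpp_max: float = 1.0, t_max: int = 999) -> int:
    """Hypothetical mapping: a lower bit-rate (more distorted latent) is
    treated as a later point on the diffusion trajectory."""
    bpp = min(max(bpp, bpp_min), bpp_max)
    frac = (math.log(bpp_max) - math.log(bpp)) / (math.log(bpp_max) - math.log(bpp_min))
    return int(round(frac * t_max))

print(bitrate_to_pseudo_timestep(0.1))   # low bit-rate  -> large pseudo timestep
print(bitrate_to_pseudo_timestep(0.8))   # high bit-rate -> small pseudo timestep
```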

NeurIPS Conference 2025 Conference Paper

PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement

  • ZhanFeng Feng
  • Long Peng
  • Xin Di
  • Yong Guo
  • Wenbo Li
  • Yulun Zhang
  • Renjing Pei
  • Yang Wang

Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models' capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and benchmarks. The code will be made publicly available.
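
A minimal PyTorch sketch of percentile-based clipping-bound initialization for quantization, the kind of starting point that an iterative search with pruning and backtracking (as described for BMFQ) would refine. The percentiles, bit-width, and fake-quantization routine are illustrative assumptions.

```python
import torch

def percentile_clip_bounds(activations: torch.Tensor, lo_pct=0.5, hi_pct=99.5):
    """Hypothetical sketch: initialize clipping bounds from robust percentiles
    of calibration activations instead of raw min/max."""
    flat = activations.flatten().float()
    lo = torch.quantile(flat, lo_pct / 100.0)
    hi = torch.quantile(flat, hi_pct / 100.0)
    return lo.item(), hi.item()

def fake_quantize(x, lo, hi, n_bits=4):
    """Uniform fake-quantization within [lo, hi] for inspection."""
    levels = 2 ** n_bits - 1
    x = x.clamp(lo, hi)
    scale = (hi - lo) / levels
    return torch.round((x - lo) / scale) * scale + lo

calib = torch.randn(8, 64, 32, 32)            # pretend calibration activations
lo, hi = percentile_clip_bounds(calib)
print(lo, hi)
print(fake_quantize(calib, lo, hi).unique().numel())  # at most 2**4 levels
```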

NeurIPS Conference 2025 Conference Paper

PocketSR: The Super-Resolution Expert in Your Pocket Mobiles

  • Haoze Sun
  • Linfeng Jiang
  • Fan Li
  • Renjing Pei
  • Zhixin Wang
  • Yong Guo
  • Jiaqi Xu
  • Haoyu Chen

Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. While existing methods leveraging large generative models demonstrate impressive results, the high computational cost and latency make them impractical for edge deployment. In this paper, we introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity. To achieve this, we design LiteED, a highly efficient alternative to the original computationally intensive VAE in SD, reducing parameters by 97.5% while preserving high-quality encoding and decoding. Additionally, we propose online annealing pruning for the U-Net, which progressively shifts generative priors from heavy modules to lightweight counterparts, ensuring effective knowledge transfer and further optimizing efficiency. To mitigate the loss of prior knowledge during pruning, we incorporate a multi-layer feature distillation loss. Through an in-depth analysis of each design component, we provide valuable insights for future research. PocketSR, with a model size of 146M parameters, processes 4K images in just 0.8 seconds, achieving a remarkable speedup over previous methods. Notably, it delivers performance on par with state-of-the-art single-step and even multi-step RealSR models, making it a highly practical solution for edge-device applications.
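
A minimal PyTorch sketch of a multi-layer feature distillation loss of the kind mentioned above: intermediate student features are matched to frozen teacher features with an L2 penalty averaged over layers. The layer selection and plain MSE form are illustrative assumptions, not PocketSR's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_layer_feature_distillation(student_feats, teacher_feats):
    """Hypothetical sketch: match intermediate features of the pruned
    (student) model to those of the unpruned (teacher) model, averaged
    over the selected layers."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_feats)

student = [torch.randn(2, 64, 32, 32, requires_grad=True) for _ in range(3)]
teacher = [torch.randn(2, 64, 32, 32) for _ in range(3)]
print(multi_layer_feature_distillation(student, teacher))
```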

AAAI Conference 2025 Conference Paper

ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

  • Shuwei Shi
  • Wenbo Li
  • Yuechen Zhang
  • Jingwen He
  • Biao Gong
  • Yinqiang Zheng

Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent structure, ResMaster meticulously aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model are incorporated. This approach could significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation.
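
A minimal PyTorch sketch of the low-frequency alignment idea: the lowest spatial frequencies of a high-resolution patch are replaced with those of the upsampled low-resolution reference, keeping the patch's own high frequencies. The hard circular cutoff is an illustrative simplification of the per-step alignment described above.

```python
import torch
import torch.nn.functional as F

def align_low_frequencies(patch, reference, cutoff=0.1):
    """Hypothetical sketch: swap in the reference's low-frequency content
    while keeping the patch's high-frequency details."""
    ref = F.interpolate(reference, size=patch.shape[-2:], mode="bilinear", align_corners=False)
    P = torch.fft.fftshift(torch.fft.fft2(patch), dim=(-2, -1))
    R = torch.fft.fftshift(torch.fft.fft2(ref), dim=(-2, -1))

    _, _, h, w = patch.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(patch.dtype)   # centered low-pass mask

    mixed = R * low_mask + P * (1.0 - low_mask)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

patch = torch.randn(1, 3, 128, 128)      # high-resolution patch (current estimate)
reference = torch.randn(1, 3, 32, 32)    # low-resolution reference crop
print(align_low_frequencies(patch, reference).shape)
```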

AAAI Conference 2025 Conference Paper

Restabilizing Diffusion Models with Predictive Noise Fusion Strategy for Image Super-Resolution

  • Luoqian Jiang
  • Yong Guo
  • Bingna Xu
  • Haolin Pan
  • Jiezhang Cao
  • Wenbo Li
  • Jian Chen

Diffusion models are prominent in image generation for producing detailed and realistic images from Gaussian noises. However, they often encounter instability issues in image restoration tasks, e.g., super-resolution. Existing methods typically rely on multiple runs to find an initial noise that produces a reasonably restored image. Unfortunately, these methods are computationally expensive and time-consuming without guaranteeing stable and consistent performance. To address these challenges, we propose a novel Predictive Noise Fusion Strategy (PNFS) that predicts pixel-wise errors in the restored image and combines different noises to generate a more effective noise. Extensive experiments show that PNFS significantly improves the stability and performance of diffusion models in super-resolution, both quantitatively and qualitatively. Furthermore, PNFS can be flexibly integrated into various diffusion models to enhance their stability.
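
A minimal PyTorch sketch of the noise-fusion idea: given two candidate initial noises and predicted per-pixel error maps for their restorations, the noises are blended pixel-wise in favour of the lower predicted error. The softmax weighting is an illustrative assumption, not the paper's fusion rule.

```python
import torch

def fuse_noises(noise_a, noise_b, err_a, err_b, temperature=1.0):
    """Hypothetical sketch: pixel-wise blend of two candidate noises,
    weighted toward the one whose restoration has smaller predicted error."""
    errs = torch.stack([err_a, err_b], dim=0)              # (2, B, C, H, W)
    weights = torch.softmax(-errs / temperature, dim=0)    # lower error -> higher weight
    return weights[0] * noise_a + weights[1] * noise_b

noise_a, noise_b = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
err_a, err_b = torch.rand(1, 4, 64, 64), torch.rand(1, 4, 64, 64)
print(fuse_noises(noise_a, noise_b, err_a, err_b).shape)
```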

AAAI Conference 2025 Conference Paper

Unsupervised Diffusion-Based Degradation Modeling for Real-World Super-Resolution

  • Yuying Chen
  • Mingde Yao
  • Wenbo Li
  • Renjing Pei
  • Jinjing Zhao
  • Wenqi Ren

Single image super-resolution (SR) aims to restore a high-resolution (HR) image from a degraded low-resolution (LR) image. However, existing SR models still face a significant domain gap between synthetic and real-world datasets due to the mismatched degradation distributions, hindering SR models from achieving optimal results. In this paper, we propose an unsupervised diffusion-based degradation modeling framework (UDDM) to effectively capture real-world degradation distributions. Specifically, given unpaired LR and HR images, a diffusion-based degradation module (DDM) first models the degradation distribution by diffusing real-world LR images to downsampled LR images, which does not require HR images. It then applies reverse diffusion to generate real-world LR images from extremely downsampled HR images. This approach allows DDM to model and generate real-world degradation distributions without requiring paired data, by using extreme downsampling to link unpaired LR and HR images. Additionally, we introduce a physics-based dynamic degradation module (P-DDM) that adaptively models content-aware degradation, ensuring both content and structural accuracy. Finally, the LR images generated by DDM and P-DDM are adaptively weighted to produce the final LR images, which are paired with the given HR images for training the SR network. Extensive experiments across multiple real-world datasets demonstrate that our framework achieves state-of-the-art performance in both qualitative and quantitative comparisons.

JBHI Journal 2024 Journal Article

Accurate Whole-Brain Image Enhancement for Low-Dose Integrated PET/MR Imaging Through Spatial Brain Transformation

  • Zhenxing Huang
  • Wenbo Li
  • Yaping Wu
  • Lin Yang
  • Yun Dong
  • Yongfeng Yang
  • Hairong Zheng
  • Dong Liang

Positron emission tomography/magnetic resonance imaging (PET/MRI) systems can provide precise anatomical and functional information with exceptional sensitivity and accuracy for neurological disorder detection. Nevertheless, the radiation exposure risks and economic costs of radiopharmaceuticals may pose significant burdens on patients. To mitigate image quality degradation during low-dose PET imaging, we proposed a novel 3D network equipped with a spatial brain transform (SBF) module for low-dose whole-brain PET and MR images to synthesize high-quality PET images. The FreeSurfer toolkit was applied to derive the spatial brain anatomical alignment information, which was then fused with low-dose PET and MR features through the SBF module. Moreover, several deep learning methods were employed as comparison measures to evaluate the model performance, with the peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and Pearson correlation coefficient (PCC) serving as quantitative metrics. Both the visual results and quantitative results illustrated the effectiveness of our approach. The obtained PSNR and SSIM were $41.96 \pm 4.91$ dB and $0.9654 \pm 0.0215$ (p < 0.01), which achieved a 19% and 20% improvement, respectively, compared to the original low-dose brain PET images. The volume of interest (VOI) analysis of brain regions such as the left thalamus (PCC = 0.959) also showed that the proposed method could achieve a more accurate standardized uptake value (SUV) distribution while preserving the details of brain structures. In future works, we hope to apply our method to other multimodal systems, such as PET/CT, to assist clinical brain disease diagnosis and treatment.

AAAI Conference 2024 Conference Paper

Adaptive Meta-Learning Probabilistic Inference Framework for Long Sequence Prediction

  • Jianping Zhu
  • Xin Guo
  • Yang Chen
  • Yao Yang
  • Wenbo Li
  • Bo Jin
  • Fei Wu

Long sequence prediction has broad and significant application value in fields such as finance, wind power, and weather. However, the complex long-term dependencies of long sequence data and the potential domain shift problems limit the effectiveness of traditional models in practical scenarios. To this end, we propose an Adaptive Meta-Learning Probabilistic Inference Framework (AMPIF) based on sequence decomposition, which can effectively enhance the long sequence prediction ability of various basic models. Specifically, first, we decouple complex sequences into seasonal and trend components through a frequency domain decomposition module. Then, we design an adaptive meta-learning task construction strategy, which divides the seasonal and trend components into different tasks through a clustering-matching approach. Finally, we design a dual-stream amortized network (ST-DAN) to capture shared information between seasonal-trend tasks and use the support set to generate task-specific parameters for rapid generalization learning on the query set. We conducted extensive experiments on six datasets, including wind power and finance scenarios, and the results show that our method significantly outperforms baseline methods in prediction accuracy, interpretability, and algorithm stability and can effectively enhance the long sequence prediction capabilities of base models. The source code is publicly available at https://github.com/Zhu-JP/AMPIF.
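
A minimal NumPy sketch of a frequency-domain seasonal/trend split: the lowest FFT bins are kept as the trend and the residual is treated as the seasonal component. The number of retained bins is an illustrative assumption, not the paper's decomposition module.

```python
import numpy as np

def frequency_decompose(series, trend_bins=3):
    """Hypothetical sketch: keep the lowest-frequency FFT bins as the trend
    component and the remainder as the seasonal component."""
    spectrum = np.fft.rfft(series)
    trend_spec = np.zeros_like(spectrum)
    trend_spec[:trend_bins] = spectrum[:trend_bins]
    trend = np.fft.irfft(trend_spec, n=len(series))
    seasonal = series - trend
    return trend, seasonal

t = np.arange(256)
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(256)
trend, seasonal = frequency_decompose(series)
print(trend.shape, seasonal.shape)
```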

JBHI Journal 2024 Journal Article

MMCA-NET: A Multimodal Cross Attention Transformer Network for Nasopharyngeal Carcinoma Tumor Segmentation Based on a Total-Body PET/CT System

  • Wenjie Zhao
  • Zhenxing Huang
  • Si Tang
  • Wenbo Li
  • Yunlong Gao
  • Yingying Hu
  • Wei Fan
  • Chuanli Cheng

Nasopharyngeal carcinoma (NPC) is a malignant tumor primarily treated by radiotherapy. Accurate delineation of the target tumor is essential for improving the effectiveness of radiotherapy. However, the segmentation performance of current models is unsatisfactory due to poor boundaries, large-scale tumor volume variation, and the labor-intensive nature of manual delineation for radiotherapy. In this paper, MMCA-Net, a novel segmentation network for NPC using PET/CT images that incorporates an innovative multimodal cross attention transformer (MCA-Transformer) and a modified U-Net architecture, is introduced to enhance modal fusion by leveraging cross-attention mechanisms between CT and PET data. Our method, tested against ten algorithms via fivefold cross-validation on samples from Sun Yat-sen University Cancer Center and the public HECKTOR dataset, consistently topped all four evaluation metrics with average Dice similarity coefficients of 0.815 and 0.7944, respectively. Furthermore, ablation experiments were conducted to demonstrate the superiority of our method over multiple baseline and variant techniques. The proposed method has promising potential for application in other tasks.
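
A minimal PyTorch sketch of cross-attention fusion between two imaging modalities: PET tokens attend to CT tokens and vice versa, and the two attended streams are projected back to a single representation. The single-layer design and token shapes are illustrative simplifications, not the MCA-Transformer itself.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Hypothetical sketch of two-way cross-attention fusion for PET/CT
    feature tokens."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.pet_to_ct = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ct_to_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, pet_tokens, ct_tokens):
        pet_enh, _ = self.pet_to_ct(query=pet_tokens, key=ct_tokens, value=ct_tokens)
        ct_enh, _ = self.ct_to_pet(query=ct_tokens, key=pet_tokens, value=pet_tokens)
        return self.proj(torch.cat([pet_enh, ct_enh], dim=-1))

fusion = CrossModalAttention()
pet = torch.randn(2, 196, 64)   # (batch, tokens, channels)
ct = torch.randn(2, 196, 64)
print(fusion(pet, ct).shape)    # torch.Size([2, 196, 64])
```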

NeurIPS Conference 2024 Conference Paper

RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

  • Haoyu Chen
  • Wenbo Li
  • Jinjin Gu
  • Jingjing Ren
  • Sixiang Chen
  • Tian Ye
  • Renjing Pei
  • Kaiwen Zhou

Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited range and often produce overly smooth, low-fidelity outcomes due to their broad data distribution fitting. To address these challenges, we first define a new pipeline for restoring images with multiple degradations, and then introduce RestoreAgent, an intelligent image restoration system leveraging multimodal large language models. RestoreAgent autonomously assesses the type and extent of degradation in input images and performs restoration through (1) determining the appropriate restoration tasks, (2) optimizing the task sequence, (3) selecting the most suitable models, and (4) executing the restoration. Experimental results demonstrate the superior performance of RestoreAgent in handling complex degradation, surpassing human experts. Furthermore, the system’s modular design facilitates the fast integration of new tasks and models.

NeurIPS Conference 2024 Conference Paper

UltraPixel: Advancing Ultra High-Resolution Image Synthesis to New Peaks

  • Jingjing Ren
  • Wenbo Li
  • Haoyu Chen
  • Renjing Pei
  • Bin Shao
  • Yong Guo
  • Long Peng
  • Fenglong Song

Ultra-high-resolution image generation poses great challenges, such as increased semantic planning complexity and detail synthesis difficulties, alongside substantial training resource demands. We present UltraPixel, a novel architecture utilizing cascade diffusion models to generate high-quality images at multiple resolutions (e.g., 1K, 2K, and 4K) within a single model, while maintaining computational efficiency. UltraPixel leverages semantics-rich representations of lower-resolution images in a later denoising stage to guide the whole generation of highly detailed high-resolution images, significantly reducing complexity. Specifically, we introduce implicit neural representations for continuous upsampling and scale-aware normalization layers adaptable to various resolutions. Notably, both low- and high-resolution processes are performed in the most compact space, sharing the majority of parameters with less than 3% additional parameters for high-resolution outputs, largely enhancing training and inference efficiency. Our model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images and demonstrating state-of-the-art performance in extensive experiments.

NeurIPS Conference 2024 Conference Paper

Unleashing Multispectral Video's Potential in Semantic Segmentation: A Semi-supervised Viewpoint and New UAV-View Benchmark

  • Wei Ji
  • Jingjing Li
  • Wenbo Li
  • Yilin Shen
  • Li Cheng
  • Hongxia Jin

Thanks to the rapid progress in RGB & thermal imaging, also known as multispectral imaging, the task of multispectral video semantic segmentation, or MVSS in short, has recently drawn significant attention. Notably, it offers new opportunities for improving segmentation performance under unfavorable visual conditions such as poor light or overexposure. Unfortunately, there are currently very few datasets available, including, for example, the MVSeg dataset, which focuses purely on eye-level views and features sparse annotations due to the intensive demands of the labeling process. To address these key challenges of the MVSS task, this paper presents two major contributions: the introduction of MVUAV, a new MVSS benchmark dataset, and the development of a dedicated semi-supervised MVSS baseline - SemiMV. Our MVUAV dataset is captured via Unmanned Aerial Vehicles (UAV), which offers a unique oblique bird’s-eye view complementary to the existing MVSS datasets; it also encompasses a broad range of day/night lighting conditions and over 30 semantic categories. In the meantime, to better leverage the sparse annotations and extra unlabeled RGB-Thermal videos, a semi-supervised learning baseline, SemiMV, is proposed to enforce consistency regularization through a dedicated Cross-collaborative Consistency Learning (C3L) module and a denoised temporal aggregation strategy. Comprehensive empirical evaluations on both MVSeg and MVUAV benchmark datasets have showcased the efficacy of our SemiMV baseline.
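
A minimal PyTorch sketch of consistency regularization on unlabeled frames: confident pseudo-labels from one view supervise the prediction on another view. The confidence threshold and cross-entropy form are illustrative assumptions, not the paper's C3L module or its denoised temporal aggregation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student_logits, teacher_logits, conf_threshold=0.9):
    """Hypothetical sketch: pseudo-labels from the teacher view supervise the
    student view, keeping only pixels predicted with high confidence."""
    probs = torch.softmax(teacher_logits.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)                       # (B, H, W)
    mask = (conf >= conf_threshold).float()
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

student = torch.randn(2, 30, 64, 64, requires_grad=True)  # 30 semantic classes
teacher = torch.randn(2, 30, 64, 64)
print(consistency_loss(student, teacher))
```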

NeurIPS Conference 2023 Conference Paper

DVSOD: RGB-D Video Salient Object Detection

  • Jingjing Li
  • Wei Ji
  • Size Wang
  • Wenbo Li
  • Li Cheng

Salient object detection (SOD) aims to identify standout elements in a scene, with recent advancements primarily focused on integrating depth data (RGB-D) or temporal data from videos to enhance SOD in complex scenes. However, the unison of two types of crucial information remains largely underexplored due to data constraints. To bridge this gap, we in this work introduce the DViSal dataset, fueling further research in the emerging field of RGB-D video salient object detection (DVSOD). Our dataset features 237 diverse RGB-D videos alongside comprehensive annotations, including object and instance-level markings, as well as bounding boxes and scribbles. These resources enable a broad scope for potential research directions. We also conduct benchmarking experiments using various SOD models, affirming the efficacy of multimodal video input for salient object detection. Lastly, we highlight some intriguing findings and promising future research avenues. To foster growth in this field, our dataset and benchmark results are publicly accessible at: https://dvsod.github.io/.

IJCAI Conference 2023 Conference Paper

On Efficient Transformer-Based Image Pre-training for Low-Level Vision

  • Wenbo Li
  • Xin Lu
  • Shengju Qian
  • Jiangbo Lu

Pre-training has produced numerous state-of-the-art results in high-level computer vision, while few attempts have been made to investigate how pre-training acts in image processing systems. In this paper, we tailor transformer-based pre-training regimes that boost various low-level tasks. To comprehensively diagnose the influence of pre-training, we design a whole set of principled evaluation tools that uncover its effects on internal representations. The observations demonstrate that pre-training plays strikingly different roles in low-level tasks. For example, pre-training introduces more local information to intermediate layers in super-resolution (SR), yielding significant performance gains, while pre-training hardly affects internal feature representations in denoising, resulting in limited gains. Further, we explore different methods of pre-training, revealing that multi-related-task pre-training is more effective and data-efficient than other alternatives. Finally, we extend our study to varying data scales and model sizes, as well as comparisons between transformers and CNNs. Based on the study, we successfully develop state-of-the-art models for multiple low-level tasks.

AAAI Conference 2022 Conference Paper

Best-Buddy GANs for Highly Detailed Image Super-resolution

  • Wenbo Li
  • Kun Zhou
  • Lu Qi
  • Liying Lu
  • Jiangbo Lu

We consider the single image super-resolution (SISR) problem, where a high-resolution (HR) image is generated based on a low-resolution (LR) input. Recently, generative adversarial networks (GANs) have become popular for hallucinating details. Most methods along this line rely on a predefined single-LR-single-HR mapping, which is not flexible enough for the ill-posed SISR task. Also, GAN-generated fake details may often undermine the realism of the whole image. We address these issues by proposing best-buddy GANs (Beby-GAN) for rich-detail SISR. Relaxing the rigid one-to-one constraint, we allow the estimated patches to dynamically seek trustworthy surrogates of supervision during training, which is beneficial to producing more reasonable details. Besides, we propose a region-aware adversarial learning strategy that directs our model to focus on generating details for textured areas adaptively. Extensive experiments justify the effectiveness of our method. An ultra-high-resolution 4K dataset is also constructed to facilitate future super-resolution research.

IS Journal 2022 Journal Article

Xsickness in Intelligent Mobile Spaces and Metaverses

  • Ruichen Tan
  • Ruiyang Gao
  • Wenbo Li
  • Kai Cao
  • Ying Li
  • Chen Lv
  • Fei-Yue Wang
  • Dongpu Cao

Motion sickness is known to be a common problem that influences the comfort and work efficiency of human beings during their daily lives. With the proliferation of increasingly intelligent systems, the detection and mitigation of motion sickness will face more opportunities along with bigger challenges. On the one hand, the technology for integrated sensors in the intelligent system will provide more accurate and efficient methods for motion sickness detection. On the other hand, as cyber-physical systems have attracted increasing attention over the past two decades, cyber-physical-social systems introduce and augment the social characteristics of such systems. The interactions between physical space and cyber space increase the chance of sensory conflicts when people use intelligent systems, such as traveling in intelligent cockpits or using metaverse-related virtual reality devices. The multimodal interaction methods and larger screens will cause more sensory conflicts. The symptoms will be more severe compared to traditional motion sickness. In this article, classifications are first introduced based on the causes of motion sickness. A new type of multifactorial motion sickness (Xsickness) is discussed, which is expected to become common as intelligent systems develop. Then, the current state-of-the-art detection methods for motion sickness and cybersickness are summarized, and theoretical methods for Xsickness detection are discussed. Finally, mitigation methods based on motion reduction and four means of human perception are discussed, and innovative mitigation methods based on the intelligent system are also introduced.

AAAI Conference 2020 Conference Paper

3D Single-Person Concurrent Activity Detection Using Stacked Relation Network

  • Yi Wei
  • Wenbo Li
  • Yanbo Fan
  • Linghan Xu
  • Ming-Ching Chang
  • Siwei Lyu

We aim to detect real-world concurrent activities performed by a single person from a streaming 3D skeleton sequence. Different from most existing works that deal with concurrent activities performed by multiple persons that are seldom correlated, we focus on concurrent activities that are spatiotemporally or causally correlated and performed by a single person. For the sake of generalization, we propose an approach based on a decompositional design to learn a dedicated feature representation for each activity class. To address the scalability issue, we further extend the class-level decompositional design to the postural-primitive level, such that each class-wise representation does not need to be extracted by independent backbones, but through a dedicated weighted aggregation of a shared pool of postural primitives. There are multiple interdependent instances deriving from each decomposition. Thus, we propose Stacked Relation Networks (SRN), with a specialized relation network for each decomposition, so as to enhance the expressiveness of instance-wise representations via the inter-instance relationship modeling. SRN achieves state-of-the-art performance on a public dataset and a newly collected dataset. The relation weights within SRN are interpretable among the activity contexts. The new dataset and code are available at https://github.com/weiyi1991/UA Concurrent/

NeurIPS Conference 2020 Conference Paper

LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single Image Super-resolution and Beyond

  • Wenbo Li
  • Kun Zhou
  • Lu Qi
  • Nianjuan Jiang
  • Jiangbo Lu
  • Jiaya Jia

Single image super-resolution (SISR) deals with a fundamental problem of upsampling a low-resolution (LR) image to its high-resolution (HR) version. Last few years have witnessed impressive progress propelled by deep learning methods. However, one critical challenge faced by existing methods is to strike a sweet spot of deep model complexity and resulting SISR quality. This paper addresses this pain point by proposing a linearly-assembled pixel-adaptive regression network (LAPAR), which casts the direct LR to HR mapping learning into a linear coefficient regression task over a dictionary of multiple predefined filter bases. Such a parametric representation renders our model highly lightweight and easy to optimize while achieving state-of-the-art results on SISR benchmarks. Moreover, based on the same idea, LAPAR is extended to tackle other restoration tasks, e. g. , image denoising and JPEG image deblocking, and again, yields strong performance.
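
A minimal PyTorch sketch of the linearly-assembled idea: a fixed bank of predefined filters is applied to the upsampled LR image and combined per pixel with regressed coefficients. The random filter bank and coefficients here stand in for the paper's learned dictionary and regression network.

```python
import torch
import torch.nn.functional as F

def lapar_style_combine(upsampled_lr, coeffs, filter_bank):
    """Hypothetical sketch: apply K predefined filters to the upsampled LR
    image and take a per-pixel weighted sum using K regressed coefficients."""
    b, c, h, w = upsampled_lr.shape
    k = filter_bank.shape[0]
    # Apply each 5x5 filter to every channel (depthwise via grouped conv).
    filters = filter_bank.repeat(c, 1, 1).unsqueeze(1)        # (c*k, 1, 5, 5)
    responses = F.conv2d(upsampled_lr, filters, padding=2, groups=c)
    responses = responses.view(b, c, k, h, w)
    weights = coeffs.unsqueeze(1)                             # (b, 1, k, h, w)
    return (responses * weights).sum(dim=2)                   # (b, c, h, w)

x = torch.randn(1, 3, 64, 64)                               # upsampled LR image
coeffs = torch.softmax(torch.randn(1, 8, 64, 64), dim=1)    # per-pixel coefficients
bank = torch.randn(8, 5, 5)                                 # 8 predefined 5x5 filter bases
print(lapar_style_combine(x, coeffs, bank).shape)           # torch.Size([1, 3, 64, 64])
```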