Arrow Research search

Author name cluster

Haoqian Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers (26)

AAAI Conference 2026 Conference Paper

Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition

  • Zhengxian Wu
  • Chuanrui Zhang
  • Shen'Ao Jiang
  • Hangrui Xu
  • Zirui Liao
  • Luyuan Zhang
  • Li Huaqiu
  • Peng Jiao

Gait recognition is emerging as a promising technology and an innovative field within computer vision, with a wide range of applications in remote human identification. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion regions, such as the arms and legs. This bottleneck is particularly challenging in the presence of intra-class variation, where gait features of the same individual under different environmental conditions are significantly distant in the feature space. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LMGait. To the best of our knowledge, LMGait is the first method to introduce natural language descriptions as explicit semantic priors into the gait recognition task. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences. To improve cross-modal alignment, we propose the Motion Awareness Module (MAM), which refines the language features by adaptively adjusting various levels of semantic information to ensure better alignment with the visual representations. Furthermore, we introduce the Motion Temporal Capture Module (MTCM) to enhance the discriminative capability of gait features and improve the model's motion tracking ability. We conducted extensive experiments across multiple datasets, and the results demonstrate the significant advantages of our proposed network. Specifically, our model achieved accuracies of 88.5%, 97.1%, and 97.5% on the CCPG, SUSTech1K, and CASIA-B* datasets, respectively, achieving state-of-the-art performance.

AAAI Conference 2026 Conference Paper

SAME: Spatial-Aware Multimodal Egocentric Human Pose Estimation

  • Yurong Fu
  • Peng Dai
  • Yu Zhang
  • Feng Yiqiang
  • Yang Zhang
  • Haoqian Wang

Egocentric human pose estimation (HPE) plays a crucial role in immersive applications such as virtual and augmented reality. However, existing methods relying on either visual or sparse inertial data alone often suffer from occlusion or ill-posed problems. In this work, we propose SAME, a novel spatial-aware multimodal fusion framework combining the complementary signals from the stereo images and sparse IMUs for accurate and robust egocentric HPE. It adopts a two-stage network based on a dual coordinate frame to mitigate the coordinate inconsistencies among the stereo cameras and the IMUs. In the first stage, the IMU signals are transformed into the local frame and iteratively fused with the stereo images for estimating 3D poses in the local frame. In the second stage, the local poses are transformed into the global frame with the 6DOF head poses provided by the SLAM algorithm of the head-mounted display (HMD) and then temporally aggregated via a temporal Transformer network. Meanwhile, to achieve geometric and semantic alignment among multi-modal features, we present a depth-guided spatial-aware deformable stereo attention network and a modality-aware Transformer decoder for cross-view and cross-modal feature fusion. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on the public EMHI multi-modal egocentric pose estimation benchmark.
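
As a rough illustration of the second-stage coordinate change described above (not the paper's actual code), the sketch below applies a 6DOF head pose, i.e. a rotation and translation such as an HMD SLAM system would provide, to move per-joint positions from the head-relative local frame into the global frame; the array shapes and function name are illustrative assumptions.

```python
import numpy as np

def local_to_global(joints_local: np.ndarray,
                    head_rotation: np.ndarray,
                    head_translation: np.ndarray) -> np.ndarray:
    """Map (J, 3) joint positions from a head-relative local frame to the
    global frame using a 6DOF head pose (hypothetical interface).

    joints_local     : (J, 3) joint positions in the local frame
    head_rotation    : (3, 3) rotation matrix of the HMD in the global frame
    head_translation : (3,)   position of the HMD in the global frame
    """
    # Rigid transform: x_global = R @ x_local + t
    return joints_local @ head_rotation.T + head_translation

# Toy usage: identity rotation, head 1.6 m above the origin.
joints = np.zeros((22, 3))
R = np.eye(3)
t = np.array([0.0, 1.6, 0.0])
print(local_to_global(joints, R, t)[:2])
```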

ICLR Conference 2025 Conference Paper

DiffPC: Diffusion-based High Perceptual Fidelity Image Compression with Semantic Refinement

  • Yichong Xia
  • Yimin Zhou 0011
  • Jinpeng Wang 0002
  • Baoyi An
  • Haoqian Wang
  • Yaowei Wang 0001
  • Bin Chen 0011

Reconstructing high-quality images under low-bitrate conditions presents a challenge, and previous methods have made this task feasible by leveraging the priors of diffusion models. However, the effective exploration of pre-trained latent diffusion models and semantic information integration in image compression tasks still needs further study. To address this issue, we introduce Diffusion-based High Perceptual Fidelity Image Compression with Semantic Refinement (DiffPC), a two-stage image compression framework based on Stable Diffusion. DiffPC efficiently encodes low-level image information, enabling the highly realistic reconstruction of the original image by leveraging high-level semantic features and the prior knowledge inherent in diffusion models. Specifically, DiffPC utilizes a multi-feature compressor to represent crucial low-level information with minimal bitrates and employs pre-embedding to acquire more robust hybrid semantics, thereby providing additional context for the decoding end. Furthermore, we have devised a control module tailored for image compression tasks, ensuring structural and textural consistency in reconstruction even at low bitrates and preventing decoding collapses induced by condition leakage. Extensive experiments demonstrate that our method achieves state-of-the-art perceptual fidelity and surpasses previous perceptual image compression methods by a significant margin in statistical fidelity.

ICLR Conference 2025 Conference Paper

Interpretable Unsupervised Joint Denoising and Enhancement for Real-World Low-Light Scenarios

  • Huaqiu Li
  • Xiaowan Hu
  • Haoqian Wang

Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and Retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a retinal decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at https://github.com/huaqlili/unsupervised-light-enhance-ICLR2025.
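
As a loose illustration of the DCT-based frequency decomposition mentioned above (not the paper's actual design), the snippet below splits an sRGB image into low- and high-frequency components with a 2D DCT applied per channel; the cutoff fraction and the per-channel treatment are arbitrary assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_decompose(image: np.ndarray, low_frac: float = 0.125):
    """Split an (H, W, 3) sRGB image into low- and high-frequency parts
    per channel via a 2D DCT (illustrative only; cutoff is arbitrary)."""
    low = np.zeros_like(image, dtype=np.float64)
    for c in range(image.shape[2]):
        coeffs = dctn(image[..., c].astype(np.float64), norm="ortho")
        mask = np.zeros_like(coeffs)
        h = max(1, int(coeffs.shape[0] * low_frac))
        w = max(1, int(coeffs.shape[1] * low_frac))
        mask[:h, :w] = 1.0                      # keep only the lowest frequencies
        low[..., c] = idctn(coeffs * mask, norm="ortho")
    high = image.astype(np.float64) - low       # residual carries the high frequencies
    return low, high

# Toy usage on random data standing in for an sRGB image.
img = np.random.rand(64, 64, 3)
low, high = dct_decompose(img)
print(low.shape, high.shape)
```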

NeurIPS Conference 2025 Conference Paper

MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds

  • Bingquan Dai
  • Luo Li
  • Qihong Tang
  • Jie Wang
  • Xinyu Lian
  • Hao Xu
  • Minghan Qin
  • Xudong XU

Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.

AAAI Conference 2025 Conference Paper

MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

  • Weitao Wang
  • Haoran Xu
  • Yuxiao Yang
  • Zhifang Liu
  • Jun Meng
  • Haoqian Wang

Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL·E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a more fair and transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and MVP consistently enhances the alignment of multi-view diffusion models with human preferences.

AAAI Conference 2025 Conference Paper

Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single Image Denoising

  • Huaqiu Li
  • Wang Zhang
  • Xiaowan Hu
  • Tao Jiang
  • Zikang Chen
  • Haoqian Wang

Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pair sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes the preservation of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap between images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID.

NeurIPS Conference 2025 Conference Paper

Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting

  • Kangjie Chen
  • Yingji Zhong
  • Zhihao Li
  • Jiaqi Lin
  • Youyu Chen
  • Minghan Qin
  • Haoqian Wang

3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random Gaussian dropout; (2) multiplicative noise injection to the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
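
A minimal sketch of a co-adaptation score in the spirit described above: render the same viewpoint several times while keeping only a random subset of Gaussians each time, then average the per-pixel variance across those renderings. The `render_fn` interface and the subset ratio are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def co_adaptation_score(render_fn, num_gaussians: int,
                        num_subsets: int = 8, keep_ratio: float = 0.8,
                        seed: int = 0) -> float:
    """Illustrative co-adaptation score for a single viewpoint.

    render_fn : callable taking a boolean mask of shape (num_gaussians,) and
                returning an (H, W, 3) rendering that uses only the kept
                Gaussians (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    renders = []
    for _ in range(num_subsets):
        keep = rng.random(num_gaussians) < keep_ratio   # random Gaussian subset
        renders.append(render_fn(keep))
    stack = np.stack(renders, axis=0)                   # (S, H, W, 3)
    pixel_var = stack.var(axis=0)                       # variance across subsets
    return float(pixel_var.mean())                      # average over pixels/channels

# Toy usage with a dummy renderer that just averages a per-Gaussian color.
colors = np.random.rand(1000, 3)
dummy_render = lambda keep: np.ones((32, 32, 3)) * colors[keep].mean(axis=0)
print(co_adaptation_score(dummy_render, num_gaussians=1000))
```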

AAAI Conference 2025 Conference Paper

Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising

  • Zikang Chen
  • Tao Jiang
  • Xiaowan Hu
  • Wang Zhang
  • Huaqiu Li
  • Haoqian Wang

Self-supervised video denoising aims to remove noise from videos without relying on ground truth data, leveraging the video itself to recover clean frames. Existing methods often rely on simplistic feature stacking or apply optical flow without thorough analysis. This results in suboptimal utilization of both inter-frame and intra-frame information, and it also neglects the potential of optical flow alignment under self-supervised conditions, leading to biased and insufficient denoising outcomes. To this end, we first explore the practicality of optical flow in the self-supervised setting and introduce a SpatioTemporal Blind-spot Network (STBN) for global frame feature utilization. In the temporal domain, we utilize bidirectional blind-spot feature propagation through the proposed blind-spot alignment block to ensure accurate temporal alignment and effectively capture long-range dependencies. In the spatial domain, we introduce the spatial receptive field expansion module, which enhances the receptive field and improves global perception capabilities. Additionally, to reduce the sensitivity of optical flow estimation to noise, we propose an unsupervised optical flow distillation mechanism that refines fine-grained inter-frame interactions during optical flow alignment. Our method demonstrates superior performance across both synthetic and real-world video denoising datasets.

AAAI Conference 2025 Conference Paper

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

  • Chuanrui Zhang
  • Yingshuang Zou
  • Zhuoling Li
  • Minmin Yi
  • Haoqian Wang

Compared with previous 3D reconstruction methods like NeRF, recent Generalizable 3D Gaussian Splatting (G-3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for scenes that have many non-overlapping areas between views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as a prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability.

NeurIPS Conference 2025 Conference Paper

VaporTok: RL-Driven Adaptive Video Tokenizer with Prior & Task Awareness

  • Minghao Yang
  • Zechen Bai
  • Jing Lin
  • Haoqian Wang
  • Alex Jinpeng Wang

Recent advances in visual tokenizers have demonstrated their effectiveness for multimodal large language models and autoregressive generative models. However, most existing visual tokenizers rely on a fixed downsampling rate at a given visual resolution, and consequently produce a constant number of visual tokens, ignoring the fact that visual information of varying complexity warrants different token budgets. Motivated by this observation, we propose an adaptive video tokenizer "VaporTok" with two core contributions. (1) Probabilistic Taildrop: we introduce a novel taildrop mechanism that learns a truncation index sampling distribution conditioned on the visual complexity of the video. During both training and inference, the decoder reconstructs videos at adaptive token lengths, allocating more tokens to complex videos and fewer to simpler ones. (2) Parallel Sample GRPO with Vapor Reward: by leveraging the probability distribution produced by probabilistic taildrop, we reformulate the visual tokenization pipeline as a sequential decision process. To optimize this process, we propose a variant of GRPO and a composite reward encompassing token efficiency, reconstruction fidelity, and generative quality, thus enabling metrics-aware adaptive tokenization across diverse objectives. Extensive experiments on standard video generation benchmarks confirm our analysis, showing that our adaptive approach matches or outperforms fixed-rate baselines and naive taildrop while using fewer tokens.
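
A rough sketch of what a probabilistic taildrop could look like under the description above: a small head predicts a distribution over truncation indices from a per-video complexity feature, a length is sampled, and the token sequence is cut at that point. The module names, shapes, and conditioning signal are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class ProbabilisticTaildrop(nn.Module):
    """Illustrative taildrop head: sample how many leading tokens to keep,
    conditioned on a per-video complexity feature (hypothetical design)."""

    def __init__(self, feat_dim: int, max_tokens: int):
        super().__init__()
        self.logits_head = nn.Linear(feat_dim, max_tokens)  # one logit per truncation index

    def forward(self, tokens: torch.Tensor, complexity_feat: torch.Tensor):
        # tokens: (B, N, D), complexity_feat: (B, feat_dim), with N == max_tokens
        logits = self.logits_head(complexity_feat)             # (B, N)
        dist = torch.distributions.Categorical(logits=logits)
        keep = dist.sample() + 1                               # keep at least one token
        log_prob = dist.log_prob(keep - 1)                     # usable as a policy-gradient term
        # Mask out the dropped tail tokens instead of slicing, so batching stays simple.
        idx = torch.arange(tokens.shape[1], device=tokens.device)
        mask = (idx.unsqueeze(0) < keep.unsqueeze(1)).unsqueeze(-1)
        return tokens * mask, keep, log_prob

# Toy usage.
head = ProbabilisticTaildrop(feat_dim=16, max_tokens=32)
toks = torch.randn(2, 32, 64)
feat = torch.randn(2, 16)
kept, lengths, logp = head(toks, feat)
print(lengths, logp.shape)
```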

AAAI Conference 2024 Conference Paper

High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field

  • Minghan Qin
  • Yifan Liu
  • Yuelang Xu
  • Xiaochen Zhao
  • Yebin Liu
  • Haoqian Wang

One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from rich and diverse information of the SVE at different positions, the proposed SVE-conditioned NeRF can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets. Code and data can be found at https://github.com/minghanqin/AvatarSVE.

NeurIPS Conference 2023 Conference Paper

Binarized Spectral Compressive Imaging

  • Yuanhao Cai
  • Yuxin Zheng
  • Jing Lin
  • Xin Yuan
  • Yulun Zhang
  • Haoqian Wang

Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited mobile devices. In this paper, we propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet), for efficient and practical HSI restoration from compressed measurement in snapshot compressive imaging (SCI) systems. Firstly, we redesign a compact and easy-to-deploy base model to be binarized. Then we present the basic unit, Binarized Spectral-Redistribution Convolution (BiSR-Conv). BiSR-Conv can adaptively redistribute the HSI representations before binarizing activation and uses a scalable hyperbolic tangent function to more closely approximate the Sign function in backpropagation. Based on our BiSR-Conv, we customize four binarized convolutional modules to address the dimension mismatch and propagate full-precision information throughout the whole network. Finally, our BiSRNet is derived by using the proposed techniques to binarize the base model. Comprehensive quantitative and qualitative experiments manifest that our proposed BiSRNet outperforms state-of-the-art binarization algorithms. Code and models are publicly available at https://github.com/caiyuanhao1998/BiSCI
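
To make the binarization trick above concrete, here is a minimal sketch of a sign activation whose backward pass uses the derivative of a scaled hyperbolic tangent as a surrogate gradient; the scale parameter k and the autograd wiring are illustrative assumptions, not the exact BiSR-Conv design.

```python
import torch

class TanhApproxSign(torch.autograd.Function):
    """Forward: Sign(x). Backward: gradient of tanh(k*x) as a smooth surrogate."""

    @staticmethod
    def forward(ctx, x, k: float = 2.0):
        ctx.save_for_backward(x)
        ctx.k = k
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        k = ctx.k
        # d/dx tanh(k*x) = k * (1 - tanh(k*x)^2); larger k hugs the Sign function more closely.
        surrogate = k * (1.0 - torch.tanh(k * x) ** 2)
        return grad_output * surrogate, None

# Toy usage: binarize activations while still receiving usable gradients.
x = torch.randn(4, requires_grad=True)
y = TanhApproxSign.apply(x, 4.0)
y.sum().backward()
print(y, x.grad)
```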

AAAI Conference 2023 Conference Paper

Calibrated Teacher for Sparsely Annotated Object Detection

  • Haohan Wang
  • Liang Liu
  • Boshen Zhang
  • Jiangning Zhang
  • Wuhao Zhang
  • Zhenye Gan
  • Yabiao Wang
  • Chengjie Wang

Fully supervised object detection requires training images in which all instances are annotated. This is actually impractical due to the high labor and time costs and the unavoidable missing annotations. As a result, the incomplete annotation in each image could provide misleading supervision and harm the training. Recent works on sparsely annotated object detection alleviate this problem by generating pseudo labels for the missing annotations. Such a mechanism is sensitive to the threshold of the pseudo label score. However, the effective threshold is different in different training stages and among different object detectors. Therefore, the current methods with fixed thresholds have sub-optimal performance and are difficult to apply to other detectors. In order to resolve this obstacle, we propose Calibrated Teacher, in which the confidence estimation of the prediction is well calibrated to match its real precision. In this way, different detectors in different training stages would share a similar distribution of the output confidence, so that multiple detectors could share the same fixed threshold and achieve better performance. Furthermore, we present a simple but effective Focal IoU Weight (FIoU) for the classification loss. FIoU aims at reducing the loss weight of false negative samples caused by the missing annotations, and thus works as a complement to the teacher-student paradigm. Extensive experiments show that our methods set a new state-of-the-art under all different sparse settings on COCO. Code will be available at https://github.com/Whileherham/CalibratedTeacher.

JBHI Journal 2023 Journal Article

dMIL-Transformer: Multiple Instance Learning Via Integrating Morphological and Spatial Information for Lymph Node Metastasis Classification

  • Yang Chen
  • Zhuchen Shao
  • Hao Bian
  • Zijie Fang
  • Yifeng Wang
  • Yuanhao Cai
  • Haoqian Wang
  • Guojun Liu

Automated classification of lymph node metastasis (LNM) plays an important role in diagnosis and prognosis. However, it is very challenging to achieve satisfactory performance in LNM classification, because both the morphology and spatial distribution of tumor regions should be taken into account. To address this problem, this article proposes a two-stage dMIL-Transformer framework, which integrates both the morphological and spatial information of the tumor regions based on the theory of multiple instance learning (MIL). In the first stage, a double Max-Min MIL (dMIL) strategy is devised to select the suspected top-K positive instances from each input histopathology image, which contains tens of thousands of patches (primarily negative). The dMIL strategy enables a better decision boundary for selecting the critical instances compared with other methods. In the second stage, a Transformer-based MIL aggregator is designed to integrate all the morphological and spatial information of the selected instances from the first stage. The self-attention mechanism is further employed to characterize the correlation between different instances and learn the bag-level representation for predicting the LNM category. The proposed dMIL-Transformer can effectively handle the challenging LNM classification task and offers good visualization and interpretability. We conduct various experiments over three LNM datasets, and achieve a 1.79%-7.50% performance improvement compared with other state-of-the-art methods.

NeurIPS Conference 2023 Conference Paper

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

  • Jing Lin
  • Ailing Zeng
  • Shunlin Lu
  • Yuanhao Cai
  • Ruimao Zhang
  • Haoqian Wang
  • Lei Zhang

In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is highly precise, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. Besides, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.

NeurIPS Conference 2022 Conference Paper

Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging

  • Yuanhao Cai
  • Jing Lin
  • Haoqian Wang
  • Xin Yuan
  • Henghui Ding
  • Yulun Zhang
  • Radu Timofte
  • Luc V Gool

In coded aperture snapshot spectral compressive imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover the spatial-spectral signal from a compressed measurement. Among these algorithms, deep unfolding methods demonstrate promising performance but suffer from two issues. Firstly, they do not estimate the degradation patterns and ill-posedness degree from CASSI to guide the iterative learning. Secondly, they are mainly CNN-based, showing limitations in capturing long-range dependencies. In this paper, we propose a principled Degradation-Aware Unfolding Framework (DAUF) that estimates parameters from the compressed image and physical mask, and then uses these parameters to control each iteration. Moreover, we customize a novel Half-Shuffle Transformer (HST) that simultaneously captures local contents and non-local dependencies. By plugging HST into DAUF, we establish the first Transformer-based deep unfolding method, Degradation-Aware Unfolding Half-Shuffle Transformer (DAUHST), for HSI reconstruction. Experiments show that DAUHST surpasses state-of-the-art methods while requiring lower computational and memory costs. Code and models are publicly available at https://github.com/caiyuanhao1998/MST

NeurIPS Conference 2022 Conference Paper

Effective Backdoor Defense by Exploiting Sensitivity of Poisoned Samples

  • Weixin Chen
  • Baoyuan Wu
  • Haoqian Wang

Poisoning-based backdoor attacks are a serious threat when training deep models on data from untrustworthy sources. Given a backdoored model, we observe that the feature representations of poisoned samples with the trigger are more sensitive to transformations than those of clean samples. This inspires us to design a simple sensitivity metric, called feature consistency towards transformations (FCT), to distinguish poisoned samples from clean samples in the untrustworthy training set. Moreover, we propose two effective backdoor defense methods. Built upon a sample-distinguishment module utilizing the FCT metric, the first method trains a secure model from scratch using a two-stage secure training module. The second method removes the backdoor from a backdoored model with a backdoor removal module, which alternately unlearns the distinguished poisoned samples and relearns the distinguished clean samples. Extensive results on three benchmark datasets demonstrate superior defense performance against eight types of backdoor attacks compared with state-of-the-art backdoor defenses. Code is available at: https://github.com/SCLBD/Effective_backdoor_defense.
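
A minimal sketch of a sensitivity score in the spirit of the FCT metric described above: compare a model's feature representation of a sample before and after a fixed transformation and use the consistency as the score. The feature-extractor interface, the chosen transformation, and the cosine-similarity measure are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_consistency(feature_extractor, images: torch.Tensor,
                        transform) -> torch.Tensor:
    """Illustrative FCT-style score per image: cosine similarity between
    features of the original and transformed images. Lower consistency
    (higher sensitivity) would flag a sample as likely poisoned."""
    feats = feature_extractor(images)                    # (B, D)
    feats_t = feature_extractor(transform(images))       # (B, D)
    return F.cosine_similarity(feats, feats_t, dim=1)    # (B,)

# Toy usage: a random linear "backbone" and horizontal flip as the transformation.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
imgs = torch.rand(8, 3, 32, 32)
scores = feature_consistency(backbone, imgs, lambda x: torch.flip(x, dims=[3]))
print(scores.shape)   # torch.Size([8])
```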

ICML Conference 2022 Conference Paper

Flow-Guided Sparse Transformer for Video Deblurring

  • Jing Lin
  • Yuanhao Cai
  • Xiaowan Hu
  • Haoqian Wang
  • Youliang Yan
  • Xueyi Zou
  • Henghui Ding
  • Yulun Zhang 0001

Exploiting similar and sharper scene patches in spatio-temporal neighborhoods is critical for video deblurring. However, CNN-based methods show limitations in capturing long-range dependencies and modeling non-local self-similarity. In this paper, we propose a novel framework, Flow-Guided Sparse Transformer (FGST), for video deblurring. In FGST, we customize a self-attention module, Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each query element on the blurry reference frame, FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse yet highly related key elements corresponding to the same scene patch in neighboring frames. Besides, we present a Recurrent Embedding (RE) mechanism to transfer information from past frames and strengthen long-range temporal dependencies. Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and yields visually pleasant results in real video deblurring. https://github.com/linjing7/VR-Baseline

IJCAI Conference 2022 Conference Paper

Iterative Few-shot Semantic Segmentation from Image Label Text

  • Haohan Wang
  • Liang Liu
  • Wuhao Zhang
  • Jiangning Zhang
  • Zhenye Gan
  • Yabiao Wang
  • Chengjie Wang
  • Haoqian Wang

Few-shot semantic segmentation aims to learn to segment unseen class objects with the guidance of only a few support images. Most previous methods rely on the pixel-level labels of support images. In this paper, we focus on a more challenging setting, in which only the image-level labels are available. We propose a general framework to first generate coarse masks with the help of the powerful vision-language model CLIP, and then iteratively and mutually refine the mask predictions of support and query images. Extensive experiments on the PASCAL-5i and COCO-20i datasets demonstrate that our method not only outperforms the state-of-the-art weakly supervised approaches by a significant margin, but also achieves results comparable to or better than recent supervised methods. Moreover, our method shows excellent generalization ability for images in the wild and uncommon classes. Code will be available at https://github.com/Whileherham/IMR-HSNet.

AAAI Conference 2022 Conference Paper

Unpaired Multi-Domain Stain Transfer for Kidney Histopathological Images

  • Yiyang Lin
  • Bowei Zeng
  • Yifeng Wang
  • Yang Chen
  • Zijie Fang
  • Jian Zhang
  • Xiangyang Ji
  • Haoqian Wang

As an essential step in pathological diagnosis, histochemical staining can show specific tissue structure information and, consequently, assist pathologists in making accurate diagnoses. Clinical kidney histopathological analyses usually employ more than one type of staining: H&E, MAS, PAS, PASM, etc. However, due to the interference of colors among multiple stains, it is not easy to perform multiple stainings simultaneously on one biological tissue. To address this problem, we propose a network based on unpaired training data to virtually generate multiple types of staining from one staining. Our method can preserve the content of input images while transferring them to multiple target styles accurately. To efficiently control the direction of stain transfer, we propose a style guided normalization (SGN). Furthermore, a multiple style encoding (MSE) is devised to represent the relationship among different staining styles dynamically. An improved one-hot label is also proposed to enhance the generalization ability and extendibility of our method. Extensive experiments demonstrate that our model can achieve superior performance on a tiny dataset, and the results exhibit not only good performance but also good visualization and interpretability. In particular, our method also achieves satisfactory results in cross-tissue, cross-staining, and cross-task settings. We believe that our method will significantly influence clinical stain transfer and greatly reduce the workload for pathologists. Our code and supplementary materials are available at https://github.com/linyiyang98/UMDST.

ICML Conference 2022 Conference Paper

Unsupervised Flow-Aligned Sequence-to-Sequence Learning for Video Restoration

  • Jing Lin
  • Xiaowan Hu
  • Yuanhao Cai
  • Haoqian Wang
  • Youliang Yan
  • Xueyi Zou
  • Yulun Zhang 0001
  • Luc Van Gool

How to properly model the inter-frame relation within the video sequence is an important but unsolved challenge for video restoration (VR). In this work, we propose an unsupervised flow-aligned sequence-to-sequence model (S2SVR) to address this problem. On the one hand, the sequence-to-sequence model, which has proven capable of sequence modeling in the field of natural language processing, is explored for the first time in VR. Optimized serialization modeling shows potential in capturing long-range dependencies among frames. On the other hand, we equip the sequence-to-sequence model with an unsupervised optical flow estimator to maximize its potential. The flow estimator is trained with our proposed unsupervised distillation loss, which can alleviate the data discrepancy and inaccurate degraded optical flow issues of previous flow-based methods. With reliable optical flow, we can establish accurate correspondence among multiple frames, narrowing the domain difference between 1D language and 2D misaligned frames and improving the potential of the sequence-to-sequence model. S2SVR shows superior performance in multiple VR tasks, including video deblurring, video super-resolution, and compressed video quality enhancement. https://github.com/linjing7/VR-Baseline

JBHI Journal 2021 Journal Article

Bridging the Gap Between 2D and 3D Contexts in CT Volume for Liver and Tumor Segmentation

  • Lei Song
  • Haoqian Wang
  • Z. Jane Wang

Automatic liver and tumor segmentation remains a challenging topic, which requires the exploration of both 2D and 3D contexts in CT volumes. Existing methods either focus only on the 2D context by treating the CT volume as many independent image slices (ignoring the useful temporal information between adjacent slices), or explore only the 3D context contained in many small voxels (damaging the spatial detail in each slice). Together, these factors lead to inadequate context exploration for automatic liver and tumor segmentation. In this paper, we propose a novel full-context convolutional neural network to bridge the gap between 2D and 3D contexts. The proposed network can utilize the temporal information along the Z axis of the CT volume while retaining the spatial detail in each slice. Specifically, a 2D spatial network for intra-slice feature extraction and a 3D temporal network for inter-slice feature extraction are proposed separately and then guided by a squeeze-and-excitation layer that allows the flow of 2D context and 3D temporal information. To address the severe class imbalance issue in the CT volume and meanwhile improve the segmentation performance, a loss function consisting of weighted cross-entropy and Jaccard distance is proposed. During network training, the 2D and 3D contexts are learned jointly in an end-to-end way. The proposed network achieves competitive results on the Liver Tumor Segmentation Challenge (LiTS) and 3D-IRCADb datasets. This method should provide a new and promising paradigm for exploring contexts in liver and tumor segmentation.
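
As a rough sketch of the kind of combined objective described above (weighted cross-entropy plus a Jaccard distance term), the snippet below shows one common way such a loss is written for binary segmentation; the class weight, smoothing constant, and equal mixing of the two terms are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_ce_jaccard_loss(logits: torch.Tensor, target: torch.Tensor,
                             pos_weight: float = 10.0, eps: float = 1e-6):
    """Illustrative combined loss for binary segmentation.

    logits : (B, 1, H, W) raw predictions
    target : (B, 1, H, W) binary ground-truth masks
    """
    # Weighted cross-entropy: up-weight the rare foreground class.
    ce = F.binary_cross_entropy_with_logits(
        logits, target, pos_weight=torch.tensor(pos_weight))

    # Soft Jaccard distance: 1 - intersection / union on predicted probabilities.
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = (probs + target - probs * target).sum(dim=(1, 2, 3))
    jaccard_distance = 1.0 - (intersection + eps) / (union + eps)

    return ce + jaccard_distance.mean()

# Toy usage.
pred = torch.randn(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(weighted_ce_jaccard_loss(pred, mask))
```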

NeurIPS Conference 2021 Conference Paper

Learning to Generate Realistic Noisy Images via Pixel-level Noise-aware Adversarial Training

  • Yuanhao Cai
  • Xiaowan Hu
  • Haoqian Wang
  • Yulun Zhang
  • Hanspeter Pfister
  • Donglai Wei

Existing deep learning real denoising methods require a large number of noisy-clean image pairs for supervision. Nonetheless, capturing a real noisy-clean dataset is an unacceptably expensive and cumbersome procedure. To alleviate this problem, this work investigates how to generate realistic noisy images. Firstly, we formulate a simple yet reasonable noise model that treats each real noisy pixel as a random variable. This model splits the noisy image generation problem into two sub-problems: image domain alignment and noise domain alignment. Subsequently, we propose a novel framework, namely Pixel-level Noise-aware Generative Adversarial Network (PNGAN). PNGAN employs a pre-trained real denoiser to map the fake and real noisy images into a nearly noise-free solution space to perform image domain alignment. Simultaneously, PNGAN establishes a pixel-level adversarial training to conduct noise domain alignment. Additionally, for better noise fitting, we present an efficient architecture, Simple Multi-scale Network (SMNet), as the generator. Qualitative validation shows that noise generated by PNGAN is highly similar to real noise in terms of intensity and distribution. Quantitative experiments demonstrate that a series of denoisers trained with the generated noisy images achieve state-of-the-art (SOTA) results on four real denoising benchmarks.

IJCAI Conference 2021 Conference Paper

Multi-Scale Selective Feedback Network with Dual Loss for Real Image Denoising

  • Xiaowan Hu
  • Yuanhao Cai
  • Zhihong Liu
  • Haoqian Wang
  • Yulun Zhang

The feedback mechanism in the human visual system extracts high-level semantics from noisy scenes and then guides low-level noise removal, which has not been fully explored in deep-learning-based image denoising networks. The commonly used fully supervised network optimizes parameters through paired training data. However, unpaired images without noise-free labels are ubiquitous in the real world. Therefore, we propose a multi-scale selective feedback network (MSFN) with a dual loss. We allow shallow layers to selectively access valuable contextual information from the following deeper layers between two adjacent time steps. This iterative refinement mechanism removes complex noise in a coarse-to-fine manner. The dual regression is designed to reconstruct noisy images to establish closed-loop supervision that is training-friendly for unpaired data. We use the dual loss to optimize the primary clean-to-noisy task and the dual noisy-to-clean task simultaneously. Extensive experiments prove that our method achieves state-of-the-art results and shows better adaptability on real-world images than existing methods.

JBHI Journal 2020 Journal Article

An End-to-End Multi-Task Deep Learning Framework for Skin Lesion Analysis

  • Lei Song
  • Jianzhe Lin
  • Z. Jane Wang
  • Haoqian Wang

Automatic skin lesion analysis of dermoscopy images remains a challenging topic. In this paper, we propose an end-to-end multi-task deep learning framework for automatic skin lesion analysis. The proposed framework can perform skin lesion detection, classification, and segmentation tasks simultaneously. To address the class imbalance issue in the dataset (as often observed in medical image datasets) and meanwhile improve the segmentation performance, a loss function based on the focal loss and the Jaccard distance is proposed. During the framework training, we employ a three-phase joint training strategy to ensure the efficiency of feature learning. The proposed framework outperforms state-of-the-art methods on the benchmark ISBI 2016 challenge dataset for melanoma classification and the ISIC 2017 challenge dataset for melanoma segmentation, especially for the segmentation task. The proposed framework should be a promising computer-aided tool for melanoma diagnosis.