Arrow Research search

Author name cluster

Wenming Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers
2 author rows

Possible papers

AAAI Conference 2025 Conference Paper

Decoupling Appearance Variations with 3D Consistent Features in Gaussian Splatting

  • Jiaqi Lin
  • Zhihao Li
  • Binxiao Huang
  • Xiao Tang
  • Jianzhuang Liu
  • Shiyong Liu
  • Xiaofei Wu
  • Fenglong Song

Gaussian Splatting has emerged as a prominent 3D representation in novel view synthesis, but it still suffers from appearance variations, which are caused by various factors, such as modern camera ISPs, different times of day, weather conditions, and local light changes. These variations can lead to floaters and color distortions in the rendered images/videos. Recent appearance modeling approaches in Gaussian Splatting are either tightly coupled with the rendering process, hindering real-time rendering, or they only account for mild global variations, performing poorly in scenes with local light changes. In this paper, we propose DAVIGS, a method that decouples appearance variations in a plug-and-play and efficient manner. By transforming the rendering results at the image level instead of the Gaussian level, our approach can model appearance variations with minimal optimization time and memory overhead. Furthermore, our method gathers appearance-related information in 3D space to transform the rendered images, thus building 3D consistency across views implicitly. We validate our method on several appearance-variant scenes, and demonstrate that it achieves state-of-the-art rendering quality with minimal training time and memory usage, without compromising rendering speeds. Additionally, it provides performance improvements for different Gaussian Splatting baselines in a plug-and-play manner.

AAAI Conference 2025 Conference Paper

DM-Adapter: Domain-Aware Mixture-of-Adapters for Text-Based Person Retrieval

  • Yating Liu
  • Zimo Liu
  • Xiangyuan Lan
  • Wenming Yang
  • Yaowei Li
  • Qingmin Liao

Text-based person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to the person domain is now an emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges still remain during fine-tuning: (i) Previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting. (ii) Existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MoE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to MLP layers in both vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features more finely. To promote the router to exploit domain information effectively and alleviate the routing imbalance, Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.
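As a rough illustration of sparse routing over adapters, the sketch below applies a softmax over router logits, keeps the top-k experts, renormalizes their weights, and adds the weighted adapter outputs to the input. It is a generic sparse mixture-of-experts gate, not the paper's Domain-Aware Router; the names `top_k_gate` and `mixture_of_adapters`, the residual form, and the toy adapters are assumptions made for illustration.

```python
import math

def top_k_gate(logits, k=2):
    """Generic sparse MoE gate: softmax over router logits, keep top-k, renormalize."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

def mixture_of_adapters(x, adapters, logits, k=2):
    """Hypothetical residual form: x plus the gate-weighted sum of selected adapter outputs."""
    gate = top_k_gate(logits, k)
    out = list(x)
    for i, w in gate.items():
        y = adapters[i](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out
```

Only the k selected adapters run for a given input, which is what keeps the mixture sparse.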

IROS Conference 2025 Conference Paper

FEG-VON: Frontier Embedding Graph for Efficient Visual Object Navigation

  • Yingru Dai
  • Pengwei Xie
  • Yikai Liu
  • Siang Chen
  • Wenming Yang
  • Guijin Wang

Visual object navigation, requiring agents to locate target objects in novel environments through egocentric visual observation, remains a critical challenge in Embodied AI. We propose FEG-VON, a training-free framework that constructs and maintains a Frontier Embedding Graph for efficient Visual Object Navigation. The graph initializes frontier embeddings using Vision Language Models (VLMs), where visual observations are encoded into spatially anchored semantic embeddings through cross-modal alignment with target text descriptors. We then update the graph by aggregating spatio-temporal semantic relations across frontiers, enabling online adaptation to new targets via similarity scoring without remapping. The evaluation results in public benchmarks demonstrate the superior performance of FEG-VON in both single- and multi-object navigation tasks compared with state-of-the-art methods. Crucially, FEG-VON eliminates dependency on task-specific training for exploration and advances the feasibility of zero-shot navigation in open-world environments.
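The similarity-scoring step can be pictured with a small sketch: each frontier carries an embedding, and a new target only requires re-scoring those embeddings against the target's text embedding. This assumes cosine similarity and toy 2-D vectors; the function names and the dictionary layout are hypothetical and do not reflect the paper's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_frontier(frontier_embeddings, target_embedding):
    """Score every stored frontier against the target and return the best one with all scores."""
    scores = {name: cosine(vec, target_embedding)
              for name, vec in frontier_embeddings.items()}
    return max(scores, key=scores.get), scores
```

Switching to a different target reuses the stored frontier embeddings and changes only `target_embedding`.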

AAAI Conference 2025 Conference Paper

GaussianSR: High Fidelity 2D Gaussian Splatting for Arbitrary-Scale Image Super-Resolution

  • Jintong Hu
  • Bin Xia
  • Bin Chen
  • Wenming Yang
  • Lei Zhang

Implicit neural representations (INRs) have revolutionized arbitrary-scale super-resolution (ASSR) by modeling images as continuous functions. Most existing INR-based ASSR networks first extract features from the given low-resolution image using an encoder, and then render the super-resolved result via a multi-layer perceptron decoder. Although these approaches have shown promising results, their performance is constrained by the limited representation ability of discrete latent codes in the encoded features. In this paper, we propose a novel ASSR method named GaussianSR that overcomes this limitation through 2D Gaussian Splatting (2DGS). Unlike traditional methods that treat pixels as discrete points, GaussianSR represents each pixel as a continuous Gaussian field. The encoded features are simultaneously refined and upsampled by rendering the mutually stacked Gaussian fields. As a result, long-range dependencies are established to enhance representation ability. In addition, a classifier is developed to dynamically assign Gaussian kernels to all pixels to further improve flexibility. All components of GaussianSR (i.e., encoder, classifier, Gaussian kernels, and decoder) are jointly learned end-to-end. Experiments demonstrate that GaussianSR achieves superior ASSR performance with fewer parameters than existing methods while enjoying interpretable and content-aware feature aggregations.
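To make the idea of a pixel as a continuous Gaussian field concrete, the sketch below evaluates a value at an arbitrary real-valued query coordinate as a normalized sum of isotropic 2D Gaussians centered on pixel positions. The fixed `sigma`, the normalization, and the scalar pixel values are assumptions for illustration; the paper learns its kernels and operates on encoded features.

```python
import math

def gaussian_field_value(pixels, query, sigma=0.5):
    """Normalized sum of isotropic 2D Gaussians centered on pixel positions.

    pixels maps (x, y) centers to scalar values; query is any real-valued (x, y).
    """
    qx, qy = query
    num = den = 0.0
    for (px, py), value in pixels.items():
        w = math.exp(-((qx - px) ** 2 + (qy - py) ** 2) / (2.0 * sigma ** 2))
        num += w * value
        den += w
    return num / den
```

Because the query coordinate is continuous, the same field can be sampled on a grid of any density, which is what makes arbitrary-scale upsampling possible.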

ICML Conference 2025 Conference Paper

GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

  • Zhun Mou
  • Bin Xia
  • Zhengchao Huang
  • Wenming Yang
  • Jiaya Jia

Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted by 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.

AAAI Conference 2025 Conference Paper

Pose Magic: Efficient and Temporally Consistent Human Pose Estimation with a Hybrid Mamba-GCN Network

  • Xinyi Zhang
  • Qiqi Bao
  • Qinpeng Cui
  • Wenming Yang
  • Qingmin Liao

Current state-of-the-art (SOTA) methods in 3D Human Pose Estimation (HPE) are primarily based on Transformers. However, existing Transformer-based 3D HPE backbones often encounter a trade-off between accuracy and computational efficiency. To resolve the above dilemma, in this work, we leverage recent advances in state space models and utilize Mamba for high-quality and efficient long-range modeling. Nonetheless, Mamba still faces challenges in precisely exploiting local dependencies between joints. To address these issues, we propose a new attention-free hybrid spatiotemporal architecture named Hybrid Mamba-GCN (Pose Magic). This architecture introduces local enhancement with GCN by capturing relationships between neighboring joints, thus producing new representations to complement Mamba's outputs. By adaptively fusing representations from Mamba and GCN, Pose Magic demonstrates superior capability in learning the underlying 3D structure. To meet the requirements of real-time inference, we also provide a fully causal version. Extensive experiments show that Pose Magic achieves new SOTA results (0.9 mm drop) while saving 74.1% FLOPs. In addition, Pose Magic exhibits optimal motion consistency and the ability to generalize to unseen sequence lengths.
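The local enhancement over neighboring joints can be sketched as a single graph-convolution propagation step on a skeleton graph. The version below uses the common symmetric normalization with self-loops and omits learned weights; it is a generic GCN step under those assumptions, not the paper's exact layer, and `gcn_propagate` is a hypothetical name.

```python
import math

def gcn_propagate(features, edges):
    """One propagation step D^{-1/2} (A + I) D^{-1/2} X, without learned weights.

    features is a list of per-joint feature vectors; edges lists undirected (i, j) pairs.
    """
    n = len(features)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            if adj[i][j]:
                w = 1.0 / math.sqrt(deg[i] * deg[j])
                for d in range(dim):
                    row[d] += w * features[j][d]
        out.append(row)
    return out
```

Each joint's output mixes only its own features and those of directly connected joints, which is the local dependency that a long-range sequence model may capture less precisely.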

IROS Conference 2025 Conference Paper

Region-Centric 6-Dof Grasp Detection: A Data-Efficient Solution for Cluttered Scenes

  • Siang Chen
  • Wei Tang
  • Pengwei Xie
  • Dingchang Hu
  • Wenming Yang
  • Guijin Wang

Robotic grasping, serving as the cornerstone of robot manipulation, is fundamental for embodied intelligence. Manipulation in challenging scenarios demands grasp detection algorithms with higher efficiency and generalizability. However, for general 6-Dof grasp detection, most data-driven methods directly extract scene-level features to generate grasp prediction, relying on a relatively heavy scene-level feature encoder and a significant amount of data with dense grasp labels for model training. In this letter, we propose a novel data-efficient 6-Dof grasp detection framework in cluttered scenes, named Region-Centric Grasp Detection (RCGD), consisting of an Iterative Search Module (ISM) and a Region Grasp Model (RGM). Concretely, ISM aims to retrieve potential region centers and aggregate multiple regions in a coarse-to-fine way. Then, RGM extracts aligned grasp-related embeddings and predicts grasps within these local regions. Benefiting from the region-centric paradigm and the training-free location strategy, RCGD significantly outperforms previous methods and shows minimal performance loss with even a very small portion of training data or labels. Furthermore, real-world robotic experiments in two distinct settings highlight the effectiveness of our method with a 95% success rate.

ICRA Conference 2025 Conference Paper

SAP-SLAM: Semantic-Assisted Perception SLAM with 3D Gaussian Splatting

  • Yuheng Yang
  • Yudong Lin
  • Wenming Yang
  • Guijin Wang
  • Qingmin Liao

The integration of 3D Gaussians has introduced a novel scene representation in Simultaneous Localization and Mapping (SLAM), characterized by explicit representation and differentiable rendering capabilities that enhance scene reconstruction and understanding. However, most current SLAM systems only exploit the basic representational capacity of 3D Gaussians, neglecting their potential to offer richer information and facilitate higher-dimensional scene comprehension. Furthermore, these systems often struggle with reconstruction when encountering rapid camera movements or missing depth. Drawing inspiration from 3D language fields, which explore the intrinsic relationships among scene objects, we propose SAP-SLAM, a dense SLAM system that combines high-fidelity reconstruction and advanced semantic understanding. Our approach leverages pre-trained visual models to extract semantic features, which are then fused, dimensionally reduced, and encoded into the 3D Gaussian model for optimization and rendering. The integration of these features improves the system's semantic comprehension and scene representation, ultimately enabling the creation of high-precision 3D semantic maps. Additionally, we introduce a semantic-guided Gaussian densification and pruning strategy, which uses semantic consistency to prioritize attention on poorly reconstructed areas, greatly improving performance in complex scenarios. SAP-SLAM achieves competitive results on both real-world and synthetic datasets, demonstrating superior capabilities in semantic understanding and reconstruction.

AAAI Conference 2025 Conference Paper

SOVGaussian: Sparse-View 3D Gaussian Splatting for Open-Vocabulary Scene Understanding

  • Peng Ling
  • Tiao Tan
  • Jiaqi Lin
  • Wenming Yang

Modeling 3D open-vocabulary language fields is challenging yet highly anticipated. Despite great progress, existing approaches heavily rely on a large number of training views to construct language-embedded 3D scenes, which is unfortunately impractical in real-world scenarios. This paper introduces SOVGaussian, the first method for few-shot novel view open-vocabulary language querying. We introduce a depth-constrained neural language field to mitigate the geometry degradation caused by overfitting training views. Rather than straightforwardly using dense depth maps for loosely accurate supervision, Language-Aware Depth Distillation (LAD) based on open-vocabulary object masks is proposed, ensuring intra-object geometric accuracy within the language field. To further refine the language-geometry consistency of the language field, we propose a novel Language-Guided Outlier Pruning (LOP) strategy, which identifies floating 3D Gaussian primitives overfitting training views based on their language-grouped densities. Our comprehensive experiments demonstrate that SOVGaussian is able to reconstruct a superior scene representation from few-shot images, outperforming existing state-of-the-art methods and achieving significantly better performance on novel view language querying and synthesis.

YNICL Journal 2024 Journal Article

Abnormalities in subcortical function and their treatment response in Wilson’s disease

  • Sheng Hu
  • Taohua Wei
  • Chuanfu Li
  • Xiaoxiao Wang
  • Benedictor Alexander Nguchu
  • Yanming Wang
  • Ting Dong
  • Yulong Yang

Extensive neuroimaging abnormalities in subcortical regions build the pathophysiological basis of Wilson's disease (WD). Yet the subcortical topographic organization has not been clearly articulated, leaving a huge gap in understanding the neural mechanism of WD. Thus, how functional abnormalities of WD subcortical regions influence complex clinical symptoms and response to treatment remains unknown. Using resting-state functional MRI data from 232 participants (including 130 WD patients and 102 healthy controls), we applied a connectivity-based parcellation technique to develop a subcortical atlas for WD. The atlas was further used to investigate abnormalities in subcortical function (ASF) by exploring intrasubcortical functional connectivity (FC) and topographic organization of cortico-subcortical FC. We further used support vector machine (SVM) to integrate these functional abnormalities into the ASF score, which serves as a biomarker for characterizing individual subcortical dysfunction for WD. Finally, the baseline ASF score and one-year treatment data of the follow-up WD patients were used to assess treatment response. A group set of subcortical parcellations was evaluated, in which 26 bilateral regions well recapitulated the anatomical nuclei of the subcortical areas of WD. The results of cortico-subcortical FC and intrasubcortical FC reveal that dysfunction of the somatomotor networks-lenticular nucleus-thalamic pathways is involved in complex symptoms of WD. The ASF score was able to characterize disease progression and was significantly associated with treatment response of WD. Our findings provide a comprehensive elaboration of functional abnormalities of WD subcortical regions and reveal their association with clinical presentations, improving our understanding of the functional neural underpinnings in WD. Furthermore, abnormalities in subcortical function could serve as a potential biomarker for understanding the disease progression and evaluating treatment response of WD.

AAAI Conference 2024 Conference Paper

Binding-Adaptive Diffusion Models for Structure-Based Drug Design

  • Zhilin Huang
  • Ling Yang
  • Zaixi Zhang
  • Xiangxin Zhou
  • Yu Bao
  • Xiawu Zheng
  • Yuwei Yang
  • Yu Wang

Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM

AAAI Conference 2024 Conference Paper

Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction

  • Qiqi Bao
  • Zheng Hui
  • Rui Zhu
  • Peiran Ren
  • Xuansong Xie
  • Wenming Yang

Generative diffusion prior captured from the off-the-shelf denoising diffusion generative model has recently attained significant interest. However, attempts to adapt diffusion models to noisy inverse problems either fail to achieve satisfactory results or require a few thousand iterations to achieve high-quality reconstructions. In this work, we propose a diffusion-based image restoration with error contraction and error correction (DiffECC) method. Two strategies are introduced to contract the restoration error in the posterior sampling process. First, we combine existing CNN-based approaches with diffusion models to ensure data consistency from the beginning. Second, to amplify the error contraction effects of the noise, a restart sampling algorithm is designed. In the error correction strategy, the estimation-correction idea is proposed on both the data term and the prior term. Solving them iteratively within the diffusion sampling framework leads to superior image generation results. Experimental results for image restoration tasks such as super-resolution (SR), Gaussian deblurring, and motion deblurring demonstrate that our approach can reconstruct high-quality images compared with state-of-the-art sampling-based diffusion models.

ICML Conference 2024 Conference Paper

Interaction-based Retrieval-augmented Diffusion Models for Protein-specific 3D Molecule Generation

  • Zhilin Huang
  • Ling Yang 0006
  • Xiangxin Zhou
  • Chujun Qin
  • Yijie Yu 0001
  • Xiawu Zheng
  • Zikun Zhou
  • Wentao Zhang 0001

Generating ligand molecules that bind to specific protein targets via generative models holds substantial promise for advancing structure-based drug design. Existing methods generate molecules from scratch without reference or template ligands, which poses challenges in model optimization and may yield suboptimal outcomes. To address this problem, we propose an innovative interaction-based retrieval-augmented diffusion model named IRDiff to facilitate target-aware molecule generation. IRDiff leverages a curated set of ligand references, i.e., those with desired properties such as high binding affinity, to steer the diffusion model towards synthesizing ligands that satisfy design criteria. Specifically, we utilize a protein-molecule interaction network (PMINet), which is pretrained with binding affinity signals to: (i) retrieve target-aware ligand molecules with high binding affinity to serve as references, and (ii) incorporate essential protein-ligand binding structures for steering molecular diffusion generation with two effective augmentation mechanisms, i.e., retrieval augmentation and self augmentation. Empirical studies on CrossDocked2020 dataset show IRDiff can generate molecules with more realistic 3D structures and achieve state-of-the-art binding affinities towards the protein targets, while maintaining proper molecular properties. The codes and models are available at https://github.com/YangLing0818/IRDiff

ICML Conference 2024 Conference Paper

LLM-Empowered State Representation for Reinforcement Learning

  • Boyuan Wang
  • Yun Qu 0002
  • Yuhang Jiang 0001
  • Jianzhun Shao
  • Chang Liu 0030
  • Wenming Yang
  • Xiangyang Ji

Conventional state representations in reinforcement learning often omit critical task-related details, presenting a significant challenge for value networks in establishing accurate mappings from states to task rewards. Traditional methods typically depend on extensive sample learning to enrich state representations with task-specific information, which leads to low sample efficiency and high time costs. Recently, surging knowledgeable large language models (LLMs) have provided promising substitutes for prior injection with minimal human intervention. Motivated by this, we propose LLM-Empowered State Representation (LESR), a novel approach that utilizes LLMs to autonomously generate task-related state representation codes which help to enhance the continuity of network mappings and facilitate efficient training. Experimental results demonstrate LESR exhibits high sample efficiency and outperforms state-of-the-art baselines by an average of 29% in accumulated reward in MuJoCo tasks and 30% in success rates in Gym-Robotics tasks. Codes of LESR are accessible at https://github.com/thu-rllab/LESR.

ICLR Conference 2024 Conference Paper

Protein-Ligand Interaction Prior for Binding-aware 3D Molecule Diffusion Models

  • Zhilin Huang
  • Ling Yang 0006
  • Xiangxin Zhou
  • Zhilong Zhang
  • Wentao Zhang 0001
  • Xiawu Zheng
  • Jie Chen 0001
  • Yu Wang 0027

Generating 3D ligand molecules that bind to specific protein targets via diffusion models has shown great promise for structure-based drug design. The key idea is to disrupt molecules into noise through a fixed forward process and learn its reverse process to generate molecules from noise in a denoising way. However, existing diffusion models primarily focus on incorporating protein-ligand interaction information solely in the reverse process, and neglect the interactions in the forward process. The inconsistency between forward and reverse processes may impair the binding affinity of generated molecules towards the target protein. In this paper, we propose a novel Interaction Prior-guided Diffusion model (IPDiff) for protein-specific 3D molecular generation by introducing geometric protein-ligand interactions into both the diffusion and sampling processes. Specifically, we begin by pretraining a protein-ligand interaction prior network (IPNet) by utilizing the binding affinity signals as supervision. Subsequently, we leverage the pretrained prior network to (1) integrate interactions between the target protein and the molecular ligand into the forward process for adapting the molecule diffusion trajectories (prior-shifting), and (2) enhance the binding-aware molecule sampling process (prior-conditioning). Empirical studies on CrossDocked2020 dataset show IPDiff can generate molecules with more realistic 3D structures and state-of-the-art binding affinities towards the protein targets, with up to -6.42 Avg. Vina Score, while maintaining proper molecular properties. https://github.com/YangLing0818/IPDiff
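For readers unfamiliar with the fixed forward process mentioned in this abstract, the sketch below shows the standard Gaussian noising step used by denoising diffusion models on plain coordinate lists. It covers only the conventional, interaction-agnostic forward process; the prior-shifting of the paper is not shown, and the function names and the `betas` schedule are assumptions for illustration.

```python
import math
import random

def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def forward_noise(x0, t, betas, eps=None, rng=random):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps with eps ~ N(0, I)."""
    abar = alpha_bar(t, betas)
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in x0]  # fresh standard normal noise
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
            for x, e in zip(x0, eps)]
```

As `t` grows, `alpha_bar` shrinks toward zero, so the sample is dominated by noise and carries less of the original signal.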

ICLR Conference 2023 Conference Paper

Basic Binary Convolution Unit for Binarized Image Restoration Network

  • Bin Xia
  • Yulun Zhang 0001
  • Yitong Wang
  • Yapeng Tian
  • Wenming Yang
  • Radu Timofte
  • Luc Van Gool

Lighter and faster image restoration (IR) models are crucial for the deployment on resource-limited devices. Binary neural network (BNN), one of the most promising model compression methods, can dramatically reduce the computations and parameters of full-precision convolutional neural networks (CNN). However, there are different properties between BNN and full-precision CNN, and we can hardly use the experience of designing CNN to develop BNN. In this study, we reconsider components in binary convolution, such as residual connection, BatchNorm, activation function, and structure, for IR tasks. We conduct systematic analyses to explain each component's role in binary convolution and discuss the pitfalls. Specifically, we find that residual connection can reduce the information loss caused by binarization; BatchNorm can solve the value range gap between residual connection and binary convolution; the position of the activation function dramatically affects the performance of BNN. Based on our findings and analyses, we design a simple yet efficient basic binary convolution unit (BBCU). Furthermore, we divide IR networks into four parts and specially design variants of BBCU for each part to explore the benefit of binarizing these parts. We conduct experiments on different IR tasks, and our BBCU significantly outperforms other BNNs and lightweight models, which shows that BBCU can serve as a basic unit for binarized IR networks. All codes and models will be released.
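The role of the residual connection can be seen in a toy 1D example: binarization keeps only signs, while a full-precision skip path carries the magnitudes forward. This is a minimal sketch assuming sign binarization of both activations and weights and a valid-mode correlation; it is not the BBCU design itself, and it leaves out BatchNorm and the activation function.

```python
def sign(v):
    """Binarize to +1.0 or -1.0 (zero maps to +1.0 by convention here)."""
    return 1.0 if v >= 0 else -1.0

def binary_conv1d(x, w):
    """Valid-mode 1D correlation of sign(x) with sign(w)."""
    xb = [sign(v) for v in x]
    wb = [sign(v) for v in w]
    k = len(wb)
    return [sum(xb[i + j] * wb[j] for j in range(k))
            for i in range(len(xb) - k + 1)]

def binary_block(x, w):
    """Binary convolution plus a full-precision residual from the aligned input samples."""
    y = binary_conv1d(x, w)
    offset = (len(w) - 1) // 2  # align each output with the center of its window
    return [x[i + offset] + y[i] for i in range(len(y))]
```

Two inputs with the same sign pattern but different magnitudes give identical `binary_conv1d` outputs, yet different `binary_block` outputs, because the skip path is not binarized.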

ICML Conference 2023 Conference Paper

Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

  • Ruofan Zhang
  • Jinjin Gu
  • Haoyu Chen 0003
  • Chao Dong 0005
  • Yulun Zhang 0001
  • Wenming Yang

Super-resolution (SR) techniques designed for real-world applications commonly encounter two primary challenges: generalization performance and restoration accuracy. We demonstrate that when methods are trained using complex, large-range degradations to enhance generalization, a decline in accuracy is inevitable. However, since the degradation in a given real-world application typically exhibits a limited variation range, it becomes feasible to strike a trade-off between generalization performance and testing accuracy within this scope. In this work, we introduce a novel approach to craft training degradation distributions using a small set of reference images. Our strategy is founded upon the binned representation of the degradation space and the Frechet distance between degradation distributions. Our results indicate that the proposed technique significantly improves the performance of test images while preserving generalization capabilities in real-world applications.

ICLR Conference 2023 Conference Paper

Knowledge Distillation based Degradation Estimation for Blind Super-Resolution

  • Bin Xia
  • Yulun Zhang 0001
  • Yitong Wang
  • Yapeng Tian
  • Wenming Yang
  • Radu Timofte
  • Luc Van Gool

Blind image super-resolution (Blind-SR) aims to recover a high-resolution (HR) image from its corresponding low-resolution (LR) input image with unknown degradations. Most of the existing works design an explicit degradation estimator for each degradation to guide SR. However, it is infeasible to provide concrete labels of multiple degradation combinations (e.g., blur, noise, JPEG compression) to supervise the degradation estimator training. In addition, these special designs for certain degradations, such as blur, impede the models from being generalized to handle different degradations. To this end, it is necessary to design an implicit degradation estimator that can extract discriminative degradation representation for all degradations without relying on the supervision of degradation ground-truth. In this paper, we propose a Knowledge Distillation based Blind-SR network (KDSR). It consists of a knowledge distillation based implicit degradation estimator network (KD-IDE) and an efficient SR network. To learn the KDSR model, we first train a teacher network: KD-IDE$_{T}$. It takes paired HR and LR patches as inputs and is optimized with the SR network jointly. Then, we further train a student network KD-IDE$_{S}$, which only takes LR images as input and learns to extract the same implicit degradation representation (IDR) as KD-IDE$_{T}$. In addition, to fully use extracted IDR, we design a simple, strong, and efficient IDR based dynamic convolution residual block (IDR-DCRB) to build an SR network. We conduct extensive experiments under classic and real-world degradation settings. The results show that KDSR achieves SOTA performance and can generalize to various degradation processes. The source codes and pre-trained models will be released.

AAAI Conference 2022 Conference Paper

Coarse-to-Fine Embedded PatchMatch and Multi-Scale Dynamic Aggregation for Reference-Based Super-resolution

  • Bin Xia
  • Yapeng Tian
  • Yucheng Hang
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Reference-based super-resolution (RefSR) has made significant progress in producing realistic textures using an external reference (Ref) image. However, existing RefSR methods obtain high-quality correspondence matchings consuming quadratic computation resources with respect to the input size, limiting their application. Moreover, these approaches usually suffer from scale misalignments between the low-resolution (LR) image and Ref image. In this paper, we propose an Accelerated Multi-Scale Aggregation network (AMSA) for Reference-based Super-Resolution, including Coarse-to-Fine Embedded PatchMatch (CFE-PatchMatch) and Multi-Scale Dynamic Aggregation (MSDA) module. To improve matching efficiency, we design a novel Embedded PatchMatch scheme with random samples propagation, which involves end-to-end training with asymptotic linear computational cost to the input size. To further reduce computational cost and speed up convergence, we apply the coarse-to-fine strategy on Embedded PatchMatch constituting CFE-PatchMatch. To fully leverage reference information across multiple scales and enhance robustness to scale misalignment, we develop the MSDA module consisting of Dynamic Aggregation and Multi-Scale Aggregation. The Dynamic Aggregation corrects minor scale misalignment by dynamically aggregating features, and the Multi-Scale Aggregation brings robustness to large scale misalignment by fusing multi-scale information. Experimental results show that the proposed AMSA achieves superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.

AAAI Conference 2022 Conference Paper

Efficient Non-local Contrastive Attention for Image Super-resolution

  • Bin Xia
  • Yucheng Hang
  • Yapeng Tian
  • Wenming Yang
  • Qingmin Liao
  • Jie Zhou

Non-Local Attention (NLA) brings significant improvement for Single Image Super-Resolution (SISR) by leveraging intrinsic feature correlation in natural images. However, NLA gives noisy information large weights and consumes quadratic computation resources with respect to the input size, limiting its performance and application. In this paper, we propose a novel Efficient Non-Local Contrastive Attention (ENLCA) to perform long-range visual modeling and leverage more relevant non-local features. Specifically, ENLCA consists of two parts, Efficient Non-Local Attention (ENLA) and Sparse Aggregation. ENLA adopts the kernel method to approximate exponential function and obtains linear computation complexity. For Sparse Aggregation, we multiply inputs by an amplification factor to focus on informative features, yet the variance of approximation increases exponentially. Therefore, contrastive learning is applied to further separate relevant and irrelevant features. To demonstrate the effectiveness of ENLCA, we build an architecture called Efficient Non-Local Contrastive Network (ENLCN) by adding a few of our modules in a simple backbone. Extensive experimental results show that ENLCN reaches superior performance over state-of-the-art approaches on both quantitative and qualitative evaluations.
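The step from quadratic to linear cost can be sketched with positive random features, a known kernel approximation of the exponential: if phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) with w_i drawn from a standard normal, then phi(q) . phi(k) approximates exp(q . k) in expectation. Whether ENLA uses exactly this feature map is an assumption here; the sketch also omits the amplification factor and contrastive learning, and all names are hypothetical.

```python
import math

def feature_map(x, projections):
    """Positive random features: phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m)."""
    sq = sum(v * v for v in x) / 2.0
    m = len(projections)
    return [math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq) / math.sqrt(m)
            for w in projections]

def linear_attention(queries, keys, values, projections):
    """Attention with exp(q . k) replaced by phi(q) . phi(k), linear in sequence length."""
    m, dv = len(projections), len(values[0])
    kv = [[0.0] * dv for _ in range(m)]  # running sum of phi(k_j) v_j^T
    ksum = [0.0] * m                     # running sum of phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k, projections)
        for a in range(m):
            ksum[a] += fk[a]
            for d in range(dv):
                kv[a][d] += fk[a] * v[d]
    out = []
    for q in queries:
        fq = feature_map(q, projections)
        norm = sum(fq[a] * ksum[a] for a in range(m))
        out.append([sum(fq[a] * kv[a][d] for a in range(m)) / norm
                    for d in range(dv)])
    return out
```

The key-value summaries `kv` and `ksum` are built once in a single pass over the keys, so no pairwise query-key matrix is ever formed; the result equals the quadratic computation under the same feature map.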

JBHI Journal 2022 Journal Article

MDAN: Mirror Difference Aware Network for Brain Stroke Lesion Segmentation

  • Qiqi Bao
  • Shiyu Mi
  • Bowen Gang
  • Wenming Yang
  • Jie Chen
  • Qingmin Liao

Brain stroke lesion segmentation is of great importance for stroke rehabilitation neuroimaging analysis. Due to the large variance of stroke lesion shapes and similarities of tissue intensity distribution, it remains a challenging task. To help detect abnormalities, the anatomical symmetries of brain magnetic resonance (MR) images have been widely used as visual cues in clinical practice. However, most methods for brain image segmentation do not fully utilize structural symmetry information. This paper presents a novel mirror difference aware network (MDAN) for stroke lesion segmentation. The network uses an encoder-decoder architecture, aiming at holistically exploiting the symmetries of image features. Specifically, a differential feature augmentation (DFA) module is developed in the encoding path to highlight the semantically pathological asymmetries of features in abnormalities. In the DFA module, a Siamese contrastive supervised loss is designed to enhance discriminative features, and a mirror position-based difference augmentation (MDA) module is used to further magnify the discrepancy. Moreover, mirror feature fusion (MFF) modules are applied to efficiently fuse information from both the original input and the horizontally flipped features and transfer it to the decoding path. Extensive experiments on the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset show that the proposed MDAN outperforms the state-of-the-art methods.

AAAI Conference 2022 Conference Paper

Pose-Invariant Face Recognition via Adaptive Angular Distillation

  • Zhenduo Zhang
  • Yongru Chen
  • Wenming Yang
  • Guijin Wang
  • Qingmin Liao

Pose-invariant face recognition is a practically useful but challenging task. This paper introduces a novel method to learn pose-invariant feature representation without normalizing profile faces to frontal ones or learning disentangled features. We first design a novel strategy to learn pose-invariant feature embeddings by distilling the angular knowledge of frontal faces extracted by a teacher network to a student network, which enables the handling of faces with large pose variations. In this way, the features of faces across variant poses can cluster compactly for the same person to create a pose-invariant face representation. Secondly, we propose a Pose-Adaptive Angular Distillation loss that mitigates the negative effect of the uneven distribution of face poses in the training dataset by paying more attention to the samples with large pose variations. Extensive experiments on two challenging benchmarks (IJB-A and CFP-FP) show that our approach consistently outperforms the existing methods.

JBHI Journal 2022 Journal Article

RFormer: Transformer-Based Generative Adversarial Network for Real Fundus Image Restoration on a New Clinical Benchmark

  • Zhuo Deng
  • Yuanhao Cai
  • Lu Chen
  • Zheng Gong
  • Qiqi Bao
  • Xue Yao
  • Dong Fang
  • Wenming Yang

Ophthalmologists have used fundus images to screen and diagnose eye diseases. However, differences in equipment and among ophthalmologists introduce large variations in the quality of fundus images. Low-quality (LQ) degraded fundus images easily lead to uncertainty in clinical screening and generally increase the risk of misdiagnosis. Thus, real fundus image restoration is worth studying. Unfortunately, no real clinical benchmark has been explored for this task so far. In this paper, we investigate the real clinical fundus image restoration problem. First, we establish a clinical dataset, Real Fundus (RF), including 120 low- and high-quality (HQ) image pairs. Then we propose a novel Transformer-based Generative Adversarial Network (RFormer) to restore the real degradation of clinical fundus images. The key component in our network is the Window-based Self-Attention Block (WSAB), which captures non-local self-similarity and long-range dependencies. To produce more visually pleasant results, a Transformer-based discriminator is introduced. Extensive experiments on our clinical benchmark show that the proposed RFormer significantly outperforms the state-of-the-art (SOTA) methods. In addition, experiments on downstream tasks such as vessel segmentation and optic disc/cup detection demonstrate that our proposed RFormer benefits clinical fundus image analysis and applications.

ICML Conference 2021 Conference Paper

Group Fisher Pruning for Practical Network Compression

  • Liyang Liu
  • Shilong Zhang
  • Zhanghui Kuang
  • Aojun Zhou
  • Jing-Hao Xue
  • Xinjiang Wang
  • Yimin Chen
  • Wenming Yang

Network compression has been widely studied since it is able to reduce the memory and computation cost during inference. However, previous methods seldom deal with complicated structures like residual connections, group/depth-wise convolution and feature pyramid networks, where channels of multiple layers are coupled and need to be pruned simultaneously. In this paper, we present a general channel pruning approach that can be applied to various complicated structures. In particular, we propose a layer grouping algorithm to find coupled channels automatically. Then we derive a unified metric based on Fisher information to evaluate the importance of a single channel and coupled channels. Moreover, we find that inference speedup on GPUs is more correlated with the reduction in memory than with the reduction in FLOPs, and thus we employ the memory reduction of each channel to normalize the importance. Our method can be used to prune any structure, including those with coupled channels. We conduct extensive experiments on various backbones, including the classic ResNet and ResNeXt, mobile-friendly MobileNetV2, and the NAS-based RegNet, on both image classification and object detection, the latter of which is under-explored. Experimental results validate that our method can effectively prune sophisticated networks, boosting inference speed without sacrificing accuracy.
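A rough sketch of how a Fisher-style, memory-normalized channel score might be computed. The abstract does not give the formula, so everything here is an assumption: the score uses the common empirical-Fisher approximation (sum of squared per-sample gradients with respect to a channel mask), coupled layers are assumed to share one mask so that their gradients add, and `fisher_importance` and its arguments are hypothetical names.

```python
def fisher_importance(mask_grads, memory_reduction):
    """Score one channel, or one group of coupled channels.

    mask_grads[n][l] is the gradient of sample n's loss with respect to the
    shared channel mask, as contributed by layer l of the coupled group
    (assumed to be precomputed elsewhere). memory_reduction is the memory
    saved by pruning the channel, in any consistent unit.
    """
    total = 0.0
    for per_layer in mask_grads:
        # Coupled layers share a single mask, so their gradient contributions add.
        shared_grad = sum(per_layer)
        # Empirical Fisher approximation: accumulate the squared per-sample gradient.
        total += shared_grad * shared_grad
    # Normalize so that channels are ranked by importance per unit of memory saved.
    return total / memory_reduction


if __name__ == "__main__":
    # Two samples, a group that couples two layers, 2.0 units of memory saved.
    print(fisher_importance([[1.0, 2.0], [-1.0, 0.5]], memory_reduction=2.0))
```

Under these assumptions, a pruning loop would repeatedly remove the lowest-scoring channel or group, so a channel that frees a lot of memory while barely affecting the loss is pruned first.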

ICLR Conference 2021 Conference Paper

Towards Impartial Multi-task Learning

  • Liyang Liu
  • Yi Li 0050
  • Zhanghui Kuang
  • Jing-Hao Xue
  • Yimin Chen
  • Wenming Yang
  • Qingmin Liao
  • Wayne Zhang 0001

Multi-task learning (MTL) has been widely used in representation learning. However, naively training all tasks simultaneously may lead to the partial training issue, where specific tasks are trained more adequately than others. In this paper, we propose to learn multiple tasks impartially. Specifically, for the task-shared parameters, we optimize the scaling factors via a closed-form solution, such that the aggregated gradient (sum of raw gradients weighted by the scaling factors) has equal projections onto individual tasks. For the task-specific parameters, we dynamically weigh the task losses so that all of them are kept at a comparable scale. Further, we find that the above gradient balance and loss balance are complementary and thus propose a hybrid balance method to further improve the performance. Our impartial multi-task learning (IMTL) can be trained end-to-end without any heuristic hyper-parameter tuning, and is general enough to be applied to all kinds of losses without any distribution assumption. Moreover, our IMTL can converge to similar results even when the task losses are designed to have different scales, and thus it is scale-invariant. We extensively evaluate our IMTL on the standard MTL benchmarks including Cityscapes, NYUv2 and CelebA. It outperforms existing loss weighting methods under the same experimental settings.
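The equal-projection condition for the task-shared parameters can be worked through for two tasks. The sketch below is a minimal illustration rather than the authors' code: it assumes the scaling factors sum to one and both gradients are nonzero, and `imtl_g_two_tasks` is a hypothetical name. Requiring the aggregated gradient g = a1*g1 + a2*g2 to satisfy g . u1 = g . u2, where u1 and u2 are the unit task gradients, gives a2 = g1 . (u1 - u2) / ((g1 - g2) . (u1 - u2)).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(g):
    norm = math.sqrt(dot(g, g))  # assumes g is not the zero vector
    return [x / norm for x in g]


def imtl_g_two_tasks(g1, g2):
    """Return (a1, a2) with a1 + a2 = 1 so that a1*g1 + a2*g2 projects equally
    onto the unit directions of g1 and g2."""
    diff_u = [a - b for a, b in zip(unit(g1), unit(g2))]
    denom = dot([a - b for a, b in zip(g1, g2)], diff_u)
    if abs(denom) < 1e-12:
        # The gradients point in the same direction, so any weights satisfy
        # the condition; fall back to uniform weights.
        return 0.5, 0.5
    a2 = dot(g1, diff_u) / denom
    return 1.0 - a2, a2


if __name__ == "__main__":
    g1, g2 = [3.0, 0.0], [0.0, 1.0]
    a1, a2 = imtl_g_two_tasks(g1, g2)
    g = [a1 * x + a2 * y for x, y in zip(g1, g2)]
    print(a1, a2, dot(g, unit(g1)), dot(g, unit(g2)))
```

In the two-task case the solution simplifies to a1 = |g2| / (|g1| + |g2|) and a2 = |g1| / (|g1| + |g2|), so the task with the larger gradient norm receives the smaller weight. That matches the abstract's point that the balance does not depend on how the losses are scaled.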