Arrow Research search

Author name cluster

Chi-Man Pun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
1 author row

Possible papers

JBHI Journal 2026 Journal Article

HDPL: Hypergraph-based Dynamic Prompting Learning for Incomplete Multimodal Medical Learning

  • Xiaomin Zhou
  • Guoheng Huang
  • Qin Zhao
  • Jianbin He
  • Xiaochen Yuan
  • Ming Li
  • Chi-Man Pun
  • Ling Guo

Multimodal learning has garnered significant attention in the medical field due to its ability to provide a more comprehensive perspective using various types of data, which aids in making more accurate decisions. However, the complexity of medical data, coupled with missing modalities, severely hinders predictive accuracy. Existing methods for multimodal learning with missing modalities still face considerable challenges. For instance, approaches that construct multimodal shared feature spaces often result in high computational costs, while methods that infer missing modalities based on complete ones may overly rely on the complete modalities, potentially skewing results. Pre-trained transformer methods address these issues but still have limitations, such as being able to process only one missing modality at the testing stage. This is partly because structured data, unlike sequential data, lacks inherent minimum semantic units or natural order. Additionally, the positional encodings generated by this type of method may introduce information interference when applied to structured data, leading to poor alignment with sequential data during modality fusion in transformer models. To tackle these challenges, we introduce HDPL: Hypergraph-based Dynamic Prompt Learning for Incomplete Multimodal Medical Learning, comprising three modules. The High-Order Hypergraph Embedding module identifies the minimal semantic units within structured data and uses hypergraph structures to extract high-dimensional features from clinical data. The Multimodal Medical Data Integrator module reduces the distance between corresponding embedding vectors in the shared modality-feature space, facilitating the integration of modalities in the transformer. The Dynamic Network Structure Optimization module is a dynamic learning network that changes the width and depth of the network dynamically, improving the overall performance of the model and alleviating, to some extent, the shortcomings caused by incomplete modalities. Through comprehensive experimentation, we demonstrate the efficiency and robustness of our model in dealing with missing modalities and reducing the training burden. Our code and dataset are available at https://github.com/colorful823/HDPL.

AAAI Conference 2026 Conference Paper

IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection

  • Jiajie Zhu
  • Xia Du
  • Xiaoyuan Liu
  • Ji-Zhe Zhou
  • Qizhen Xu
  • Zheng Lin
  • Chi-Man Pun

The rapid advancements in artificial intelligence have significantly accelerated the adoption of speech recognition technology, leading to its widespread integration across various applications. However, this surge in usage also highlights a critical issue: audio data is highly vulnerable to unauthorized exposure and analysis, posing significant privacy risks for businesses and individuals. This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the pioneering method designed to safeguard audio privacy using reversible adversarial examples. IO-RAE leverages large language models to generate misleading yet contextually coherent content, effectively preventing unauthorized eavesdropping by humans and Automatic Speech Recognition (ASR) systems. Additionally, we propose the Cumulative Signal Attack technique, which mitigates high-frequency noise and enhances attack efficacy by targeting low-frequency signals. Our approach ensures the protection of audio data without degrading its quality or usability. Experimental evaluations demonstrate the superiority of our method, achieving a targeted misguidance rate of 96.5% and a remarkable 100% untargeted misguidance rate in obfuscating target keywords across multiple ASR models, including a commercial black-box system from Google. Furthermore, the quality of the recovered audio, measured by the Perceptual Evaluation of Speech Quality score, reached 4.45, comparable to high-quality original recordings. Notably, the recovered audio processed by ASR systems exhibited an error rate of 0%, indicating nearly lossless recovery. These results highlight the practical applicability and effectiveness of our IO-RAE framework in protecting sensitive audio privacy.

AAAI Conference 2026 Conference Paper

OTI: A Model-free and Visually Interpretable Measure of Image Attackability

  • Jiaming Liang
  • Haowei Liu
  • Chi-Man Pun

Despite the tremendous success of neural networks, benign images can be corrupted by adversarial perturbations to deceive these models. Intriguingly, images differ in their attackability. Specifically, given an attack configuration, some images are easily corrupted, whereas others are more resistant. Evaluating image attackability has important applications in active learning, adversarial training, and attack enhancement. This prompts a growing interest in developing attackability measures. However, existing methods are scarce and suffer from two major limitations: (1) They rely on a model proxy to provide prior knowledge (e.g., gradients or minimal perturbation) to extract model-dependent image features. Unfortunately, in practice, many task-specific models are not readily accessible. (2) Extracted features characterizing image attackability lack visual interpretability, obscuring their direct relationship with the images. To address these, we propose a novel Object Texture Intensity (OTI), a model-free and visually interpretable measure of image attackability, which measures image attackability as the texture intensity of the image's semantic object. Theoretically, we describe the principles of OTI from the perspectives of decision boundaries as well as the mid- and high-frequency characteristics of adversarial perturbations. Comprehensive experiments demonstrate that OTI is effective and computationally efficient. In addition, our OTI provides the adversarial machine learning community with a visual understanding of attackability.

AAAI Conference 2026 Conference Paper

SNS-Grasp: Semantic-guided Noise Scaling for Grasp Generation

  • Zhenhua Tang
  • Yudian Zheng
  • Yuzhang Zhong
  • Haolun Li
  • Yanbin Hao
  • Chi-Man Pun

While diffusion models show promise for intent-based grasp generation, their isotropic noise schedules struggle with joint-specific sensitivity and task-aware variability. This limitation leads to grasps with suboptimal semantic alignment or physical feasibility. To address this challenge, we propose Semantic-guided Noise Scaling for grasp generation (SNS-Grasp), a novel framework that integrates two key innovations. First, the Semantic-guided Noise Scaling Diffusion (SNS-Diff) module generates intent-aware grasps by replacing isotropic noise with anisotropic modulation, dynamically adapting to task semantics and joint-specific sensitivity. Specifically, SNS-Diff leverages a pretrained Intent Recognizer to extract task-aware confidence scores and joint-specific gradient sensitivities from the interaction context. These signals adjust the noise scaling during denoising, downweighting perturbations for semantically critical joints to ensure semantic alignment. Second, the Fine-grained Grasp Refinement (FGR) module establishes dynamic joint-vertex coupling through fine-grained hand-object spatial relationships, enabling iterative optimization of physically executable grasps. Extensive experiments on OakInk and GRAB demonstrate SNS-Grasp's superior performance in semantic accuracy and physical feasibility, with robust generalization to unseen objects.

NeurIPS Conference 2025 Conference Paper

ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

  • Bo Du
  • Xuekang Zhu
  • Xiaochen Ma
  • Chenfan Qu
  • Kaiwen Feng
  • Zhe Yang
  • Chi-Man Pun
  • Jian Liu

The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL is still missing. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To break down the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations in dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models (3 of which are reproduced from scratch), 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) establishes an image forensic fusion protocol evaluation mechanism that supports unified training and testing of diverse forensic models across tasks; iv) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. Specifically, ForensicHub includes 4 forensic tasks, 23 datasets, 42 baseline models, 6 backbones, 11 GPU-accelerated pixel- and image-level evaluation metrics, and realizes 16 kinds of cross-domain evaluations. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/ForensicHub.

JBHI Journal 2025 Journal Article

High-Fidelity Functional Ultrasound Reconstruction via a Visual Auto-Regressive Framework

  • Xuhang Chen
  • Zhuo Li
  • Yanyan Shen
  • Mufti Mahmud
  • Hieu Pham
  • Michael Kwok-Po Ng
  • Chi-Man Pun
  • Shuqiang Wang

Functional ultrasound (fUS) imaging provides exceptional spatiotemporal resolution for neurovascular mapping, yet its practical application is significantly hampered by critical challenges. Foremost among these is data scarcity, arising from ethical considerations and signal degradation through the cranium, which collectively limit dataset diversity and compromise the fairness of downstream machine learning models. To address these limitations, we introduce UltraVAR (Ultrasound Visual Auto-Regressive model), the first data augmentation framework designed for fUS imaging that leverages a pre-trained visual auto-regressive generative model. UltraVAR is designed not only to mitigate data scarcity but also to enhance model fairness through the reconstruction of diverse and physiologically plausible fUS samples. The generated samples preserve essential neurovascular coupling features, specifically the dynamic interplay between neural activity and microvascular hemodynamics. This capability distinguishes UltraVAR from conventional augmentation techniques, which often disrupt these vital physiological correlations and consequently fail to improve, or even degrade, downstream task performance. The proposed UltraVAR employs a scale-by-scale reconstruction mechanism that meticulously preserves the spatial topological relationships within vascular networks. The framework's fidelity is further enhanced by two integrated modules: the Smooth Scaling Layer, which ensures the preservation of critical image information during multi-scale feature propagation, and the Perception Enhancement Module, which actively suppresses artifact generation via a dynamic residual compensation mechanism. Comprehensive experimental validation demonstrates that datasets augmented with UltraVAR yield statistically significant improvements in downstream classification accuracy. This work establishes a robust foundation for advancing ultrasound-based neuromodulation techniques and brain-computer interface technologies by enabling the reconstruction of high-fidelity, diverse fUS data.

IJCAI Conference 2025 Conference Paper

LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction

  • Jiesong Bai
  • Yuhao Yin
  • Yihang Dong
  • Xiaofeng Zhang
  • Chi-Man Pun
  • Xuhang Chen

Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet's robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at https://github.com/baijiesong/Lensnet.

AAAI Conference 2025 Conference Paper

Mesoscopic Insights: Orchestrating Multi-Scale & Hybrid Architecture for Image Manipulation Localization

  • Xuekang Zhu
  • Xiaochen Ma
  • Lei Su
  • Zhuohang Jiang
  • Bo Du
  • Xiwen Wang
  • Zeyu Lei
  • Wentao Feng

The mesoscopic level serves as a bridge between the macroscopic and microscopic worlds, addressing gaps overlooked by both. Image manipulation localization (IML), a crucial technique to pursue truth from fake images, has long relied on low-level (microscopic-level) traces. However, in practice, most tampering aims to deceive the audience by altering image semantics. As a result, manipulation commonly occurs at the object level (macroscopic level), which is equally important as microscopic traces. Therefore, integrating these two levels into the mesoscopic level presents a new perspective for IML research. Inspired by this, our paper explores how to simultaneously construct mesoscopic representations of micro and macro information for IML and introduces the Mesorch architecture to orchestrate both. Specifically, this architecture i) combines Transformers and CNNs in parallel, with Transformers extracting macro information and CNNs capturing micro details, and ii) explores across different scales, assessing micro and macro information seamlessly. Additionally, based on the Mesorch architecture, the paper introduces two baseline models aimed at solving IML tasks through mesoscopic representation. Extensive experiments across four datasets have demonstrated that our models surpass the current state-of-the-art in terms of performance, computational complexity, and robustness.

AAAI Conference 2024 Conference Paper

COMMA: Co-articulated Multi-Modal Learning

  • Lianyu Hu
  • Liqing Gao
  • Zekang Liu
  • Chi-Man Pun
  • Wei Feng

Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to the variation of input text prompts and need a selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs to avoid the need for laborious hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, most previous methods usually achieve better performance on seen classes but cause performance degeneration on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our method considers prompts from both branches when generating the prompts, to enhance the representation alignment of both branches. In addition, to alleviate forgetting of the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts in the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Experimental results demonstrate the superiority of our method by exhibiting a favorable performance boost upon all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA.

AAAI Conference 2024 Conference Paper

Devignet: High-Resolution Vignetting Removal via a Dual Aggregated Fusion Transformer with Adaptive Channel Expansion

  • Shenghong Luo
  • Xuhang Chen
  • Weiwen Chen
  • Zinuo Li
  • Shuqiang Wang
  • Chi-Man Pun

Vignetting commonly occurs as a degradation in images resulting from factors such as lens design, improper lens hood usage, and limitations in camera sensors. This degradation affects image details and color accuracy, and presents challenges in computational photography. Existing vignetting removal algorithms predominantly rely on ideal physics assumptions and hand-crafted parameters, resulting in the ineffective removal of irregular vignetting and suboptimal results. Moreover, the substantial lack of real-world vignetting datasets hinders the objective and comprehensive evaluation of vignetting removal. To address these challenges, we present VigSet, a pioneering dataset for vignetting removal. VigSet includes 983 pairs of both vignetting and vignetting-free high-resolution (over 4k) real-world images under various conditions. In addition, we introduce DeVigNet, a novel frequency-aware Transformer architecture designed for vignetting removal. Through the Laplacian Pyramid decomposition, we propose the Dual Aggregated Fusion Transformer to handle global features and remove vignetting in the low-frequency domain. Additionally, we propose the Adaptive Channel Expansion Module to enhance details in the high-frequency domain. The experiments demonstrate that the proposed model outperforms existing state-of-the-art methods. The code, models, and dataset are available at https://github.com/CXH-Research/DeVigNet.

NeurIPS Conference 2024 Conference Paper

IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization

  • Xiaochen Ma
  • Xuekang Zhu
  • Lei Su
  • Bo Du
  • Zhuohang Jiang
  • Bingkui Tong
  • Zeyu Lei
  • Xinyu Yang

A comprehensive benchmark is yet to be established in the Image Manipulation Detection & Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading model evaluations, severely undermining the development of this field. However, the scarcity of open-sourced baseline models and inconsistent training and evaluation protocols make conducting rigorous experiments and faithful comparisons among IMDL models challenging. To address these challenges, we introduce IMDL-BenCo, the first comprehensive IMDL benchmark and modular codebase. IMDL-BenCo: i) decomposes the IMDL framework into standardized, reusable components and revises the model construction pipeline, improving coding efficiency and customization flexibility; ii) fully implements or incorporates training code for state-of-the-art models to establish a comprehensive IMDL benchmark; and iii) conducts deep analysis based on the established benchmark and codebase, offering new insights into IMDL model architecture, dataset characteristics, and evaluation standards. Specifically, IMDL-BenCo includes common processing algorithms, 8 state-of-the-art IMDL models (1 of which is reproduced from scratch), 2 sets of standard training and evaluation protocols, 15 GPU-accelerated evaluation metrics, and 3 kinds of robustness evaluation. This benchmark and codebase represent a significant leap forward in calibrating the current progress in the IMDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/IMDLBenCo

AAAI Conference 2024 Conference Paper

PVALane: Prior-Guided 3D Lane Detection with View-Agnostic Feature Alignment

  • Zewen Zheng
  • Xuemin Zhang
  • Yongqiang Mou
  • Xiang Gao
  • Chengxin Li
  • Guoheng Huang
  • Chi-Man Pun
  • Xiaochen Yuan

Monocular 3D lane detection is essential for a reliable autonomous driving system and has recently been rapidly developing. Existing popular methods mainly employ a predefined 3D anchor for lane detection based on front-viewed (FV) space, aiming to mitigate the effects of view transformations. However, the perspective geometric distortion between FV and 3D space in this FV-based approach introduces extremely dense anchor designs, which ultimately leads to confusing lane representations. In this paper, we introduce a novel prior-guided perspective on lane detection and propose an end-to-end framework named PVALane, which utilizes 2D prior knowledge to achieve precise and efficient 3D lane detection. Since 2D lane predictions can provide strong priors for lane existence, PVALane exploits FV features to generate sparse prior anchors with potential lanes in 2D space. These dynamic prior anchors help PVALane to achieve distinct lane representations and effectively improve the precision of PVALane due to the reduced lane search space. Additionally, by leveraging these prior anchors and representing lanes in both FV and bird-eye-viewed (BEV) spaces, we effectively align and merge semantic and geometric information from FV and BEV features. Extensive experiments conducted on the OpenLane and ONCE-3DLanes datasets demonstrate the superior performance of our method compared to existing state-of-the-art approaches and exhibit excellent robustness.

JBHI Journal 2024 Journal Article

Quaternion Cross-Modality Spatial Learning for Multi-Modal Medical Image Segmentation

  • Junyang Chen
  • Guoheng Huang
  • Xiaochen Yuan
  • Guo Zhong
  • Zewen Zheng
  • Chi-Man Pun
  • Jian Zhu
  • Zhixin Huang

Recently, Deep Neural Networks (DNNs) have had a large impact on image processing, including medical image segmentation, and the real-valued convolution of DNNs has been extensively utilized in multi-modal medical image segmentation to accurately segment lesions via learning data information. However, the weighted summation operation in such convolution limits the ability to maintain spatial dependence that is crucial for identifying different lesion distributions. In this paper, we propose a novel Quaternion Cross-modality Spatial Learning (Q-CSL) which explores the spatial information while considering the linkage between multi-modal images. Specifically, we introduce quaternions to represent data and coordinates that contain spatial information. Additionally, we propose Quaternion Spatial-association Convolution to learn the spatial information. Subsequently, the proposed De-level Quaternion Cross-modality Fusion (De-QCF) module excavates inner space features and fuses cross-modality spatial dependency. Our experimental results demonstrate that our approach performs well compared to competitive methods, with only 0.01061 M parameters and 9.95 G FLOPs.

IJCAI Conference 2023 Conference Paper

A Large-Scale Film Style Dataset for Learning Multi-frequency Driven Film Enhancement

  • Zinuo Li
  • Xuhang Chen
  • Shuqiang Wang
  • Chi-Man Pun

Film, a classic image style, is culturally significant to the whole photographic industry since it marks the birth of photography. However, film photography is time-consuming and expensive, necessitating a more efficient method for collecting film-style photographs. Numerous datasets that have emerged in the field of image enhancement so far are not film-specific. In order to facilitate film-based image stylization research, we construct FilmSet, a large-scale and high-quality film style dataset. Our dataset includes three different film types and more than 5000 in-the-wild high-resolution images. Inspired by the features of FilmSet images, we propose a novel framework called FilmNet based on Laplacian Pyramid for stylizing images across frequency bands and achieving film style outcomes. Experiments reveal that the performance of our model is superior to state-of-the-art techniques. Our dataset and code are available at https://github.com/CXH-Research/FilmNet.

AAAI Conference 2023 Conference Paper

CEE-Net: Complementary End-to-End Network for 3D Human Pose Generation and Estimation

  • Haolun Li
  • Chi-Man Pun

The limited number of actors and actions in existing datasets makes 3D pose estimators prone to overfitting, as shown by the performance degradation of the algorithm in cross-dataset evaluation, especially for rare and complex poses. Although previous data augmentation works have increased the diversity of the training set, the changes in camera viewpoint and position play a dominant role in improving the accuracy of the estimator, while the generated 3D poses are limited and still heavily rely on the source dataset. In addition, these works do not consider the adaptability of the pose estimator to generated data, and complex poses will cause training collapse. In this paper, we propose CEE-Net, a Complementary End-to-End Network for 3D human pose generation and estimation. The generator greatly expands the distribution of each joint angle in the existing dataset while limiting it to a reasonable range. By learning the correlations within and between the torso and limbs, the estimator can combine different body parts more effectively and weaken the influence of specific joint-angle changes on the global pose, improving the generalization ability. Extensive ablation studies show that our pose generator greatly strengthens the joint-angle distribution, and our pose estimator can utilize these poses positively. Compared with the state-of-the-art methods, our method achieves much better performance in various cross-dataset settings and on rare and complex poses.

AAAI Conference 2023 Conference Paper

CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying

  • Weihuang Liu
  • Xiaodong Cun
  • Chi-Man Pun
  • Menghan Xia
  • Yong Zhang
  • Jue Wang

Image inpainting aims to fill the missing hole of the input. It is hard to solve this task efficiently when facing high-resolution images for two reasons: (1) A large receptive field needs to be handled for high-resolution image inpainting. (2) The general encoder and decoder network synthesizes many background pixels synchronously due to the form of the image matrix. In this paper, we try to break the above limitations for the first time thanks to the recent development of continuous implicit representation. In detail, we down-sample and encode the degraded image to produce the spatial-adaptive parameters for each spatial patch via an attentional Fast Fourier Convolution (FFC)-based parameter generation network. Then, we take these parameters as the weights and biases of a series of multi-layer perceptrons (MLPs), where the input is the encoded continuous coordinates and the output is the synthesized color value. Thanks to the proposed structure, we only encode the high-resolution image in a relatively low resolution to capture a larger receptive field. Then, the continuous position encoding helps to synthesize photo-realistic high-frequency textures by re-sampling the coordinates in a higher resolution. Also, our framework enables us to query only the coordinates of missing pixels, in parallel, yielding a more efficient solution than the previous methods. Experiments show that the proposed method achieves real-time performance on 2048×2048 images using a single GTX 2080 Ti GPU and can handle 4096×4096 images, with much better performance than existing state-of-the-art methods visually and numerically. The code is available at: https://github.com/NiFangBaAGe/CoordFill.

JBHI Journal 2023 Journal Article

QGD-Net: A Lightweight Model Utilizing Pixels of Affinity in Feature Layer for Dermoscopic Lesion Segmentation

  • Jingchao Wang
  • Guoheng Huang
  • Guo Zhong
  • Xiaochen Yuan
  • Chi-Man Pun
  • Jie Deng

Pixels with location affinity, which can also be called “pixels of affinity,” have similar semantic information. Group convolution and dilated convolution can utilize them to improve the capability of the model. However, group convolution does not utilize pixels of affinity between layers. For dilated convolution, after multiple convolutions with the same dilation rate, the pixels utilized within each layer do not possess location affinity with each other. To solve the problem of group convolution, our proposed quaternion group convolution uses the quaternion convolution, which promotes communication between channels so that pixels of affinity between channels are utilized. In quaternion group convolution, the feature layers are divided into 4 layers per group, ensuring the quaternion convolution can be performed. To solve the problem of dilated convolution, we propose the quaternion sawtooth wave-like dilated convolutions module (QS module). The QS module utilizes quaternion convolution with sawtooth wave-like dilation rates to effectively leverage the pixels that share location affinity both between and within layers. This allows for an expanded receptive field, ultimately enhancing the performance of the model. In particular, we perform our quaternion group convolution in the QS module to design the quaternion group dilated neural network (QGD-Net). Extensive experiments on Dermoscopic Lesion Segmentation based on ISIC 2016 and ISIC 2017 indicate that our method significantly reduces the model parameters and greatly improves the precision of the model in Dermoscopic Lesion Segmentation. Our method also shows generalizability in retinal vessel segmentation.

AAAI Conference 2021 Conference Paper

Split then Refine: Stacked Attention-guided ResUNets for Blind Single Image Visible Watermark Removal

  • Xiaodong Cun
  • Chi-Man Pun

Digital watermarking is a commonly used technique to protect the copyright of media. Simultaneously, to increase the robustness of watermarks, attack techniques, such as watermark removal, have also attracted attention from the community. Previous watermark removal methods require users to provide the watermark location or train a multi-task network to recover the background indiscriminately. However, when jointly learning, the network performs better on watermark detection than on recovering the texture. Inspired by this observation and to erase the visible watermarks blindly, we propose a novel two-stage framework with stacked attention-guided ResUNets to simulate the process of detection, removal and refinement. In the first stage, we design a multi-task network called SplitNet. It learns the basis features for three sub-tasks altogether while the task-specific features separately use multiple channel attentions. Then, with the predicted mask and coarser restored image, we design RefineNet to smooth the watermarked region with a mask-guided spatial attention. Besides network structure, the proposed algorithm also combines multiple perceptual losses for better quality both visually and numerically. We extensively evaluate our algorithm over four different datasets under various settings and the experiments show that our approach outperforms other state-of-the-art methods by a large margin.

AAAI Conference 2020 Conference Paper

Towards Ghost-Free Shadow Removal via Dual Hierarchical Aggregation Network and Shadow Matting GAN

  • Xiaodong Cun
  • Chi-Man Pun
  • Cheng Shi

Shadow removal is an essential task for scene understanding. Many studies consider only matching the image contents, which often causes two types of ghosts: color inconsistencies in shadow regions or artifacts on shadow boundaries (as shown in Figure 1). In this paper, we tackle these issues in two ways. First, to carefully learn the border artifacts-free image, we propose a novel network structure named the dual hierarchical aggregation network (DHAN). It contains a series of growth dilated convolutions as the backbone without any down-samplings, and we hierarchically aggregate multi-context features for attention and prediction, respectively. Second, we argue that training on a limited dataset restricts the textural understanding of the network, which leads to the shadow region color inconsistencies. Currently, the largest dataset contains 2k+ shadow/shadow-free image pairs. However, it has only 0.1k+ unique scenes since many samples share exactly the same background with different shadow positions. Thus, we design a shadow matting generative adversarial network (SMGAN) to synthesize realistic shadow mattings from a given shadow mask and shadow-free image. With the help of novel masks or scenes, we enhance the current datasets using synthesized shadow images. Experiments show that our DHAN can erase the shadows and produce high-quality ghost-free images. After training on the synthesized and real datasets, our network outperforms other state-of-the-art methods by a large margin. The code is available at http://github.com/vinthony/ghost-free-shadow-removal/