Arrow Research search

Author name cluster

Wenhan Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

37 papers
2 author rows

Possible papers

37

AAAI Conference 2026 Conference Paper

Event-Guided Super-Resolving Blurry Image via Asymmetric Integral Driven Consistency

  • Chi Zhang
  • Xiang Zhang
  • Lei Yu
  • Gui-Song Xia
  • Yuming Fang
  • Wenhan Yang

Super-Resolution from a Blurry low-resolution image (SRB) constitutes a severely ill-posed inverse problem. Current learning-based SRB approaches primarily rely on synthetic, well-labeled paired datasets to regularize solution spaces, yet they exhibit limited generalizability in practical applications due to significant domain discrepancies between simulated degradations and real-world imaging conditions. To bridge this synthetic-to-real gap, we propose a novel Self-supervised Event-based SRB (SE-SRB) framework that leverages neuromorphic event streams as physical priors and adopts a lightweight neural architecture tailored for effective domain adaptation. Specifically, the proposed SE-SRB introduces a self-supervised learning paradigm based on asymmetric integral driven consistency, which enforces temporal coherence between predictions derived from RGB and asynchronous event streams at different time points. Extensive experiments validate that SE-SRB consistently outperforms state-of-the-art methods on both synthetic and real-world datasets. Built upon a lightweight parallel two-stream architecture, SE-SRB achieves high computational efficiency, featuring reduced parameter count, lower FLOPs, and real-time inference capability (40 FPS).

AAAI Conference 2025 Conference Paper

Backdoor Attacks Against No-Reference Image Quality Assessment Models via a Scalable Trigger

  • Yi Yu
  • Song Xia
  • Xun Lin
  • Wenhan Yang
  • Shijian Lu
  • Yap-Peng Tan
  • Alex Kot

No-Reference Image Quality Assessment (NR-IQA), responsible for assessing the quality of a single input image without using any reference, plays a critical role in evaluating and optimizing computer vision systems, e.g., low-light enhancement. Recent research indicates that NR-IQA models are susceptible to adversarial attacks, which can significantly alter predicted scores with visually imperceptible perturbations. Despite revealing vulnerabilities, these attack methods have limitations, including high computational demands, untargeted manipulation, limited practical utility in white-box scenarios, and reduced effectiveness in black-box scenarios. To address these challenges, we shift our focus to another significant threat and present a novel poisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker to manipulate the IQA model's output to any desired target value by simply adjusting a scaling coefficient alpha for the trigger. We propose to inject the trigger in the discrete cosine transform (DCT) domain to improve the local invariance of the trigger for countering trigger diminishment in NR-IQA models due to widely adopted data augmentations. Furthermore, the universal adversarial perturbations (UAP) in the DCT space are designed as the trigger, to increase IQA model susceptibility to manipulation and improve attack effectiveness. In addition to the heuristic method for poison-label BAIQA (P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on alpha sampling and image data refinement, driven by theoretical insights we reveal. Extensive experiments on diverse datasets and various NR-IQA models demonstrate the effectiveness of our attacks.
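
The abstract above describes injecting a trigger in the DCT domain whose strength is set by a scaling coefficient alpha. The following Python sketch only illustrates that general idea and is not the authors' implementation; the precomputed DCT-domain perturbation `uap_dct` is a hypothetical placeholder for the UAP trigger mentioned in the abstract.

```python
# Illustrative sketch (assumption, not the BAIQA code): add a DCT-domain
# trigger scaled by alpha to an image, then transform back to pixels.
import numpy as np
from scipy.fft import dctn, idctn

def poison_image(img: np.ndarray, uap_dct: np.ndarray, alpha: float) -> np.ndarray:
    """img: HxWxC array in [0, 1]; uap_dct: hypothetical trigger in DCT coefficients."""
    poisoned = np.empty_like(img)
    for c in range(img.shape[2]):                  # per-channel 2D DCT
        coeffs = dctn(img[..., c], norm="ortho")
        coeffs += alpha * uap_dct[..., c]          # alpha controls the attack strength / target score
        poisoned[..., c] = idctn(coeffs, norm="ortho")
    return np.clip(poisoned, 0.0, 1.0)
```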

NeurIPS Conference 2025 Conference Paper

End-to-End Low-Light Enhancement for Object Detection with Learned Metadata from RAWs

  • Xuelin Shen
  • Haifeng Jiao
  • Yitong Wang
  • Yulin He
  • Wenhan Yang

Although RAW images offer advantages over sRGB by avoiding ISP-induced distortion and preserving more information in low-light conditions, their widespread use is limited due to high storage costs, transmission burdens, and the need for significant architectural changes for downstream tasks. To address the issues, this paper explores a new raw-based machine vision paradigm, termed Compact RAW Metadata-guided Image Refinement (CRM-IR). In particular, we propose a Machine Vision-oriented Image Refinement (MV-IR) module that refines sRGB images to better suit machine vision preferences, guided by learned raw metadata. Such a design allows the CRM-IR to focus on extracting the most essential metadata from raw images to support downstream machine vision tasks, while remaining plug-and-play and fully compatible with existing imaging pipelines, without any changes to model architectures or ISP modules. We implement our CRM-IR scheme on various object detection networks, and extensive experiments under low-light conditions demonstrate that it can significantly improve performance with an additional bitrate cost of less than $10^{-3}$ bits per pixel.

AAAI Conference 2025 Conference Paper

Fast Omni-Directional Image Super-Resolution: Adapting the Implicit Image Function with Pixel and Semantic-Wise Spherical Geometric Priors

  • Xuelin Shen
  • Yitong Wang
  • Silin Zheng
  • Kang Xiao
  • Wenhan Yang
  • Xu Wang

In the context of Omni-Directional Image (ODI) Super-Resolution (SR), the unique challenge arises from the non-uniform oversampling characteristics caused by EquiRectangular Projection (ERP). Considerable efforts in designing complex spherical convolutions or polyhedron reprojection offer significant performance improvements but at the expense of cumbersome processing procedures and slower inference speeds. Under these circumstances, this paper proposes a new ODI-SR model characterized by its capacity to perform Fast and Arbitrary-scale ODI-SR processes, denoted as FAOR. The key innovation lies in adapting the implicit image function from the planar image domain to the ERP image domain by incorporating spherical geometric priors at both the latent representation and image reconstruction stages, in a low-overhead manner. Specifically, at the latent representation stage, we adopt a pair of pixel-wise and semantic-wise sphere-to-planar distortion maps to perform affine transformations on the latent representation, thereby incorporating it with spherical properties. Moreover, during the image reconstruction stage, we introduce a geodesic-based resampling strategy, aligning the implicit image function with spherical geometrics without introducing additional parameters. As a result, the proposed FAOR outperforms the state-of-the-art ODI-SR models with a much faster inference speed. Extensive experimental results and ablation studies have demonstrated the effectiveness of our design.

JBHI Journal 2025 Journal Article

M2Trans: Multi-Modal Regularized Coarse-to-Fine Transformer for Ultrasound Image Super-Resolution

  • Zhangkai Ni
  • Runyu Xiao
  • Wenhan Yang
  • Hanli Wang
  • Zhihua Wang
  • Lihua Xiang
  • Liping Sun

Ultrasound image super-resolution (SR) aims to transform low-resolution images into high-resolution ones, thereby restoring intricate details crucial for improved diagnostic accuracy. However, prevailing methods relying solely on image modality guidance and pixel-wise loss functions struggle to capture the distinct characteristics of medical images, such as unique texture patterns and specific colors harboring critical diagnostic information. To overcome these challenges, this paper introduces the Multi-Modal Regularized Coarse-to-fine Transformer (M2Trans) for Ultrasound Image SR. By integrating the text modality, we establish joint image-text guidance during training, leveraging the medical CLIP model to incorporate richer priors from text descriptions into the SR optimization process, enhancing detail, structure, and semantic recovery. Furthermore, we propose a novel coarse-to-fine transformer comprising multiple branches infused with self-attention and frequency transforms to efficiently capture signal dependencies across different scales. Extensive experimental results demonstrate significant improvements over state-of-the-art methods on benchmark datasets, including CCA-US, US-CASE, and our newly created dataset MMUS1K, with a minimum improvement of 0.17 dB, 0.30 dB, and 0.28 dB in terms of PSNR.

ICLR Conference 2025 Conference Paper

Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures

  • Dang Nguyen
  • Wenhan Yang
  • Rathul Anand
  • Yu Yang 0007
  • Baharan Mirzasoleiman

Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. However, this approach becomes infeasible and ineffective for LLMs, due to the highly imbalanced mixture of sources in language data, use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing *Coresets for Training LLMs* (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources w.h.p., and thus including all examples of the small sources in the mini-batch coresets is crucial for optimal performance. Second, we normalize the gradients by their historical exponential average to find mini-batch coresets for training with Adam. Finally, we leverage zeroth-order methods to find a smooth gradient of the last *V*-projection matrix and sparsify it to keep the dimensions with the largest normalized gradient magnitude. We apply CoLM to fine-tuning Phi-2, Phi-3, Zephyr, and Llama-3 models with LoRA on the MathInstruct and SuperGLUE benchmarks. Remarkably, CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches. Moreover, CoLM seamlessly integrates with existing memory-efficient training methods like LoRA, further reducing the memory requirements of training LLMs.
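
A minimal sketch of the normalization and sparsification steps named in the abstract, under the assumption that the "historical exponential average" refers to an Adam-style running second-moment estimate; the tensor names and shapes are hypothetical, not the CoLM code.

```python
import torch

def adam_normalized_grads(per_example_grads: torch.Tensor,
                          exp_avg_sq: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
    """Rescale each example's gradient the way Adam would rescale an update.

    per_example_grads: (batch, dim) raw last-layer gradients.
    exp_avg_sq:        (dim,) exponential moving average of squared gradients.
    """
    return per_example_grads / (exp_avg_sq.sqrt() + eps)

def topk_sparsify(grads: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k dimensions with the largest mean magnitude,
    standing in for the sparsification step described in the abstract."""
    idx = grads.abs().mean(dim=0).topk(k).indices
    return grads[:, idx]
```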

ICML Conference 2025 Conference Paper

MTL-UE: Learning to Learn Nothing for Multi-Task Learning

  • Yi Yu 0011
  • Song Xia
  • Siyuan Yang 0001
  • Chenqi Kong
  • Wenhan Yang
  • Shijian Lu
  • Yap-Peng Tan
  • Alex C. Kot

Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings, which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance, which greatly enhances attack robustness. Furthermore, MTL-UE is versatile, with good support for dense prediction tasks in MTL. It is also plug-and-play, allowing existing surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies. Code is available at https://github.com/yuyi-sd/MTL-UE.

ICML Conference 2025 Conference Paper

Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models

  • Xuelin Shen
  • Jiayin Xu
  • Kangsheng Yin
  • Wenhan Yang

The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users’ privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model’s uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.

NeurIPS Conference 2025 Conference Paper

Prompt-Guided Alignment with Information Bottleneck Makes Image Compression Also a Restorer

  • Xuelin Shen
  • Quan Liu
  • Jiayin Xu
  • Wenhan Yang

Learned Image Compression (LIC) models face critical challenges in real-world scenarios due to various environmental degradations, such as fog and rain. Due to the distribution mismatch between degraded inputs and clean training data, well-trained LIC models suffer from reduced compression efficiency, while retraining dedicated models for diverse degradation types is costly and impractical. Our method addresses the above issue by leveraging prompt learning under the information bottleneck principle, enabling compact extraction of shared components between degraded and clean images for improved latent alignment and compression efficiency. In detail, we propose an Information Bottleneck-constrained Latent Representation Unifying (IB-LRU) scheme, in which a Probabilistic Prompt Generator (PPG) is deployed to simultaneously capture the distribution of different degradations. Such a design dynamically guides the latent-representation process at the encoder through a gated modulation process. Moreover, to promote the degradation distribution capture process, the probabilistic prompt learning is guided by the Information Bottleneck (IB) principle. That is, IB constrains the information encoded in the prompt to focus solely on degradation characteristics while avoiding the inclusion of redundant image contextual information. We apply our IB-LRU method to a variety of state-of-the-art LIC backbones, and extensive experiments under various degradation scenarios demonstrate the effectiveness of our design. Our code will be publicly available.

AAAI Conference 2025 Conference Paper

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision

  • Kangsheng Yin
  • Quan Liu
  • Xuelin Shen
  • Yulin He
  • Wenhan Yang
  • Shiqi Wang

The image compression model has long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper innovatively introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to remove the reliance of compression models on downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for tasks relying on information of different granularity. Furthermore, for supporting both human and machine visions with only a unifying bitstream, we incorporate a conditional decoding strategy that takes as conditions human or machine preferences, enabling the bitstream to be decoded into different versions for corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models and tasks. The extensive experiments have shown that the proposed UG-ICM is capable of achieving remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.

AAAI Conference 2025 Conference Paper

UP-Restorer: When Unrolling Meets Prompts for Unified Image Restoration

  • Minghao Liu
  • Wenhan Yang
  • Jinyi Luo
  • Jiaying Liu

All-in-one restoration needs to implicitly distinguish between different degradation conditions and apply specific prior constraints accordingly. To fulfill this goal, our work makes the first effort to create an all-in-one restoration via unrolling from the typical maximum a-posteriori optimization function. This unrolling framework naturally leads to the construction of progressively solving models, which are equivalent to a diffusion enhancer taking as input dynamically generated prompts. Under a score-based diffusion model, the prompts are integrated for propagating and updating several context-related variables, i.e., the transmission map, atmospheric light map, and noise or rain map, progressively. Such a learned prompt generation process, which simulates the nonlinear operations in the unrolled solution, is combined with linear operations with clear physical implications to make the diffusion models well regularized and more effective in learning degradation-related visual priors. Experimental results demonstrate that our method achieves significant performance improvements across various image restoration tasks, realizing true all-in-one image restoration.

ICLR Conference 2025 Conference Paper

Which Tasks Should Be Compressed Together? A Causal Discovery Approach for Efficient Multi-Task Representation Compression

  • Sha Guo
  • Jing Chen
  • Zixuan Hu
  • Zhuo Chen 0006
  • Wenhan Yang
  • Yu Lin
  • Xing Jiang
  • Lingyu Duan

Conventional image compression methods are inadequate for intelligent analysis, as they overemphasize pixel-level precision while neglecting semantic significance and the interaction among multiple tasks. This paper introduces a Taskonomy-Aware Multi-Task Compression framework comprising (1) inter-coherent task grouping, which organizes synergistic tasks into shared representations to improve multi-task accuracy and reduce encoding volume, and (2) a conditional entropy-based directed acyclic graph (DAG) that captures causal dependencies among grouped representations. By leveraging parent representations as contextual priors for child representations, the framework effectively utilizes cross-task information to improve entropy model accuracy. Experiments on diverse vision tasks, including Keypoint 2D, Depth Z-buffer, Semantic Segmentation, Surface Normal, Edge Texture, and Autoencoder, demonstrate significant bitrate-performance gains, validating the method’s capability to reduce system entropy uncertainty. These findings underscore the potential of leveraging representation disentanglement, synergy, and causal modeling to learn compact representations, which enable efficient multi-task compression in intelligent systems.

ICML Conference 2024 Conference Paper

Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks

  • Wenhan Yang
  • Jingdong Gao
  • Baharan Mirzasoleiman

Contrastive Language-Image Pre-training (CLIP) on large image-caption datasets has achieved remarkable success in zero-shot classification and enabled transferability to new domains. However, CLIP is far more vulnerable to targeted data poisoning and backdoor attacks compared to supervised learning. Perhaps surprisingly, poisoning 0.0001% of CLIP pre-training data is enough to make targeted data poisoning attacks successful. This is four orders of magnitude smaller than what is required to poison supervised models. Despite this vulnerability, existing methods are very limited in defending CLIP models during pre-training. In this work, we propose a strong defense, SAFECLIP, to safely pre-train CLIP against targeted data poisoning and backdoor attacks. SAFECLIP warms up the model by applying unimodal contrastive learning (CL) on image and text modalities separately. Then, it divides the data into safe and risky sets by applying a Gaussian Mixture Model to the cosine similarity of image-caption pair representations. SAFECLIP pre-trains the model by applying the CLIP loss to the safe set and applying unimodal CL to image and text modalities of the risky set separately. By gradually increasing the size of the safe set during pre-training, SAFECLIP effectively breaks targeted data poisoning and backdoor attacks without harming the CLIP performance. Our extensive experiments on CC3M, Visual Genome, and MSCOCO demonstrate that SAFECLIP significantly reduces the success rate of targeted data poisoning attacks from 93.75% to 0% and that of various backdoor attacks from up to 100% to 0%, without harming CLIP's performance.
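
A minimal sketch of the safe/risky split described in the abstract, assuming access to L2-normalized per-pair CLIP image and text embeddings; it mirrors the Gaussian Mixture Model step at a high level and is not the released SAFECLIP implementation.

```python
# Sketch only: split image-caption pairs into "safe" and "risky" sets by
# fitting a 2-component GMM to the pairwise cosine similarities.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_safe_risky(img_emb: np.ndarray, txt_emb: np.ndarray):
    """img_emb, txt_emb: (N, d) L2-normalized embeddings of image-caption pairs."""
    cos_sim = np.sum(img_emb * txt_emb, axis=1, keepdims=True)      # (N, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(cos_sim)
    labels = gmm.predict(cos_sim)
    safe_component = int(np.argmax(gmm.means_.ravel()))             # higher-similarity component = "safe"
    safe_idx = np.where(labels == safe_component)[0]
    risky_idx = np.where(labels != safe_component)[0]
    return safe_idx, risky_idx
```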

AAAI Conference 2024 Conference Paper

ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field

  • Zhangkai Ni
  • Peiqi Yang
  • Wenhan Yang
  • Hanli Wang
  • Lin Ma
  • Sam Kwong

Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF) designed to work with sparse input. The collaboration in ColNeRF includes the cooperation among sparse input source images and the cooperation among the output of the NeRF. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representation, thereby facilitating higher-quality results of the novel view synthesis. Our extensive experimental results demonstrate that ColNeRF outperforms state-of-the-art sparse input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF.

NeurIPS Conference 2024 Conference Paper

ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model

  • Yufei Wang
  • Zhihao Li
  • Lanqing Guo
  • Wenhan Yang
  • Alex C. Kot
  • Bihan Wen

Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for novel view synthesis, offering fast rendering speeds and high fidelity. However, the large number of Gaussians and their associated attributes require effective compression techniques. Existing methods primarily compress neural Gaussians individually and independently, i.e., coding all the neural Gaussians at the same time, with little design for their interactions and spatial dependence. Inspired by the effectiveness of the context model in image compression, we propose the first autoregressive model at the anchor level for 3DGS compression in this work. We divide anchors into different levels and the anchors that are not coded yet can be predicted based on the already coded ones in all the coarser levels, leading to more accurate modeling and higher coding efficiency. To further improve the efficiency of entropy coding, e.g., to code the coarsest level with no already coded anchors, we propose to introduce a low-dimensional quantized feature as the hyperprior for each anchor, which can be effectively compressed. Our work pioneers the context model in the anchor level for 3DGS representation, yielding an impressive size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work Scaffold-GS, while achieving comparable or even higher rendering quality.

ICML Conference 2024 Conference Paper

Correcting Diffusion-Based Perceptual Image Compression with Privileged End-to-End Decoder

  • Yiyang Ma
  • Wenhan Yang
  • Jiaying Liu 0001

The images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee distortion, hence the integration of diffusion models and image compression models still needs more comprehensive explorations. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as correction, which achieves better perceptual quality while guaranteeing the distortion to an extent. We build a diffusion model and design a novel paradigm that combines the diffusion model and an end-to-end decoder, and the latter is responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of the diffusion models at the encoder side with the original images being visible. Based on the analysis, we introduce an end-to-end convolutional decoder to provide a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmit the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.

NeurIPS Conference 2024 Conference Paper

DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

  • Juncheng Wu
  • Zhangkai Ni
  • Hanli Wang
  • Wenhan Yang
  • Yuyin Zhou
  • Shiqi Wang

Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through text-driven prompts. Extensive evaluations demonstrate the versatility of DDR as an image descriptor, with strong correlations observed with key image attributes such as complexity, colorfulness, sharpness, and overall quality. Moreover, we demonstrate the efficacy of DDR across a spectrum of applications. It excels as a blind image quality assessment metric, outperforming existing methodologies across multiple datasets. Additionally, DDR serves as an effective unsupervised learning objective in image restoration tasks, yielding notable advancements in image deblurring and single-image super-resolution. Our code is available at: https://github.com/eezkni/DDR.

AAAI Conference 2024 Conference Paper

DeS3: Adaptive Attention-Driven Self and Soft Shadow Removal Using ViT Similarity

  • Yeying Jin
  • Wei Ye
  • Wenhan Yang
  • Yuan Yuan
  • Robby T. Tan

Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% in whole-image RMSE on the LRSS dataset.

UAI Conference 2024 Conference Paper

Graph Contrastive Learning under Heterophily via Graph Filters

  • Wenhan Yang
  • Baharan Mirzasoleiman

Graph contrastive learning (CL) methods learn node representations in a self-supervised manner by maximizing the similarity between the augmented node representations obtained via a GNN-based encoder. However, CL methods perform poorly on graphs with heterophily, where connected nodes tend to belong to different classes. In this work, we address this problem by proposing an effective graph CL method, namely HLCL, for learning graph representations under heterophily. HLCL first identifies a homophilic and a heterophilic subgraph based on the cosine similarity of node features. It then uses a low-pass and a high-pass graph filter to aggregate representations of nodes connected in the homophilic subgraph and differentiate representations of nodes in the heterophilic subgraph. The final node representations are learned by contrasting both the augmented high-pass filtered views and the augmented low-pass filtered node views. Our extensive experiments show that HLCL outperforms state-of-the-art graph CL methods on benchmark datasets with heterophily, as well as large-scale real-world graphs, by up to 7%, and outperforms graph supervised learning methods on datasets with heterophily by up to 10%.
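
A hedged sketch of the two-filter idea from the abstract: edges are split by feature cosine similarity into homophilic and heterophilic subgraphs, then aggregated with a low-pass and a high-pass filter respectively. The threshold and normalization choices below are illustrative assumptions, not HLCL's released code.

```python
import torch

def split_and_filter(x: torch.Tensor, adj: torch.Tensor, tau: float = 0.5):
    """x: (N, d) node features; adj: (N, N) dense 0/1 adjacency matrix."""
    xn = torch.nn.functional.normalize(x, dim=1)
    sim = xn @ xn.t()                                   # pairwise cosine similarity
    homo = adj * (sim >= tau).float()                   # homophilic subgraph
    hetero = adj * (sim < tau).float()                  # heterophilic subgraph

    def norm(a):                                        # symmetric degree normalization
        deg = a.sum(1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

    low_pass = norm(homo) @ x                           # smooth over similar neighbors
    high_pass = x - norm(hetero) @ x                    # differentiate from dissimilar neighbors
    return low_pass, high_pass
```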

AAAI Conference 2024 Conference Paper

Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

  • Peijun Bao
  • Yong Xia
  • Wenhan Yang
  • Boon Poh Ng
  • Meng Hwa Er
  • Alex C. Kot

This paper for the first time leverages multi-modal videos for weakly-supervised temporal video grounding. As labeling the video moment is labor-intensive and subjective, the weakly-supervised approaches have gained increasing attention in recent years. However, these approaches could inherently compromise performance due to inadequate supervision. Therefore, to tackle this challenge, we for the first time pay attention to exploiting complementary information extracted from multi-modal videos (e.g., RGB frames, optical flows), where richer supervision is naturally introduced in the weakly-supervised context. Our motivation is that by integrating different modalities of the videos, the model is learned from synergic supervision and thereby can attain superior generalization capability. However, addressing multiple modalities would also inevitably introduce additional computational overhead, and might become inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: building a multi-modal distillation algorithm to capitalize on the multi-modal knowledge as supervision for model training, while still being able to work with only the single modal input during inference. As such, we can utilize the benefits brought by the supplementary nature of multiple modalities, without compromising the applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model to learn collaboratively from the multi-modal videos. Then we identify two sorts of knowledge from the teacher model, i.e., temporal boundaries and semantic activation map. And we devise a local-global distillation algorithm to transfer this knowledge to a student model of single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with/without multi-modal inputs.

AAAI Conference 2024 Conference Paper

Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency

  • Peijun Bao
  • Zihao Shao
  • Wenhan Yang
  • Boon Poh Ng
  • Meng Hwa Er
  • Alex C. Kot

Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotations. However, this approach encounters two significant challenges: 1) limited input distribution, namely that the limited writing styles of the language query, annotated by human annotators, hinder the model's generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) the incomplete ground truth, whose supervision guidance is insufficient. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input sample is enriched to obtain diverse multi-view versions, while a consistency constraint then regularizes their results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models are supervised by each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model to a lightweight student model by enforcing the consistency between the localization results of the paraphrased sentence and the original one. In addition, to assess the generalization of the model across different dimensions of language variation, we create extensive datasets by building upon existing datasets. Our experiments demonstrate substantial performance improvements adaptively to diverse kinds of language queries.

ICML Conference 2024 Conference Paper

Purify Unlearnable Examples via Rate-Constrained Variational Autoencoders

  • Yi Yu 0011
  • Yufei Wang 0006
  • Song Xia
  • Wenhan Yang
  • Shijian Lu
  • Yap-Peng Tan
  • Alex C. Kot

Unlearnable examples (UEs) seek to maximize testing error by making subtle modifications to training examples that are correctly labeled. Defenses against these poisoning attacks can be categorized based on whether specific interventions are adopted during training. The first approach is training-time defense, such as adversarial training, which can mitigate poisoning effects but is computationally intensive. The other approach is pre-training purification, e.g., image short squeezing, which consists of several simple compressions but often encounters challenges in dealing with various UEs. Our work provides a novel disentanglement mechanism to build an efficient pre-training purification method. Firstly, we uncover rate-constrained variational autoencoders (VAEs), demonstrating a clear tendency to suppress the perturbations in UEs. We subsequently conduct a theoretical analysis for this phenomenon. Building upon these insights, we introduce a disentangle variational autoencoder (D-VAE), capable of disentangling the perturbations with learnable class-wise embeddings. Based on this network, a two-stage purification approach is naturally developed. The first stage focuses on roughly eliminating perturbations, while the second stage produces refined, poison-free results, ensuring effectiveness and robustness across various scenarios. Extensive experiments demonstrate the remarkable performance of our method across CIFAR-10, CIFAR-100, and a 100-class ImageNet-subset. Code is available at https://github.com/yuyi-sd/D-VAE.

AAAI Conference 2024 Conference Paper

Seeing Dark Videos via Self-Learned Bottleneck Neural Representation

  • Haofeng Huang
  • Wenhan Yang
  • Lingyu Duan
  • Jiaying Liu

Enhancing low-light videos in a supervised style presents a set of challenges, including limited data diversity, misalignment, and the domain gap introduced through the dataset construction pipeline. Our paper tackles these challenges by constructing a self-learned enhancement approach that gets rid of the reliance on any external training data. The challenge of self-supervised learning lies in fitting high-quality signal representations solely from input signals. Our work designs a bottleneck neural representation mechanism that extracts those signals. More in detail, we encode the frame-wise representation with a compact deep embedding and utilize a neural network to parameterize the video-level manifold consistently. Then, an entropy constraint is applied to the enhanced results based on the adjacent spatial-temporal context to filter out the degraded visual signals, e.g. noise and frame inconsistency. Last, a novel Chromatic Retinex decomposition is proposed to effectively align the reflectance distribution temporally. It benefits the entropy control on different components of each frame and facilitates noise-to-noise training, successfully suppressing the temporal flicker. Extensive experiments demonstrate the robustness and superior effectiveness of our proposed method. Our project is publicly available at: https://huangerbai.github.io/SLBNR/.

ICLR Conference 2024 Conference Paper

Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution

  • Yiyang Ma
  • Huan Yang 0005
  • Wenhan Yang
  • Jianlong Fu
  • Jiaying Liu 0001

Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performances of diffusion-based SR models are fluctuating at every time of sampling, especially for samplers with few resampled steps. This inherent randomness of diffusion models results in ineffectiveness and instability, making it challenging for users to guarantee the quality of SR results. However, our work takes this randomness as an opportunity: fully analyzing and leveraging it leads to the construction of an effective plug-and-play sampling method that owns the potential to benefit a series of diffusion-based SR methods. More in detail, we propose to steadily sample high-quality SR images from pre-trained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs) and analyze the characteristics between the choices of BCs and their corresponding SR results. Our analysis shows the route to obtain an approximately optimal BC via an efficient exploration in the whole space. The quality of SR results sampled by the proposed method with fewer steps outperforms the quality of results sampled by current methods with randomness from the same pre-trained diffusion-based SR model, which means that our sampling method ''boosts'' current diffusion-based SR models without any additional training.

NeurIPS Conference 2024 Conference Paper

Transferable Adversarial Attacks on SAM and Its Downstream Models

  • Song Xia
  • Wenhan Yang
  • Yi Yu
  • Xun Lin
  • Henghui Ding
  • Lingyu Duan
  • Xudong Jiang

The utilization of large foundational models has a dilemma: while fine-tuning downstream tasks from them holds promise for making use of the well-generalized knowledge in practical applications, their open accessibility also poses threats of adverse usage. This paper, for the first time, explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM), by solely utilizing the information from the open-sourced SAM. In contrast to prevailing transfer-based adversarial attacks, we demonstrate the existence of adversarial dangers even without accessing the downstream task and dataset to train a similar surrogate model. To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm to extract the intrinsic vulnerability inherent in the foundation model, which is then utilized as the prior knowledge to guide the generation of adversarial perturbations. Moreover, by formulating the gradient difference in the attacking process between the open-sourced SAM and its fine-tuned downstream models, we theoretically demonstrate that a deviation occurs in the adversarial update direction by directly maximizing the distance of encoded feature embeddings in the open-sourced SAM. Consequently, we propose a gradient robust loss that simulates the associated uncertainty with gradient-based noise augmentation to enhance the robustness of generated adversarial examples (AEs) towards this deviation, thus improving the transferability. Extensive experiments demonstrate the effectiveness of the proposed universal meta-initialized and gradient robust adversarial attack (UMI-GRAT) toward SAMs and their downstream models. Code is available at https://github.com/xiasong0501/GRAT.

AAAI Conference 2023 Conference Paper

Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

  • Peijun Bao
  • Wenhan Yang
  • Boon Poh Ng
  • Meng Hwa Er
  • Alex C. Kot

This paper for the first time explores audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground-truth to train the model. However, building large-scale multi-modality datasets with category annotations is human-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from audio and visual components by using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task i.e. label contrasting to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background would hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we then propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to the state-of-the-art supervised methods.

IJCAI Conference 2023 Conference Paper

Dual Prompt Learning for Continual Rain Removal from Single Images

  • Minghao Liu
  • Wenhan Yang
  • Yuzhang Hu
  • Jiaying Liu

Recent efforts have achieved remarkable progress on single image deraining on the stationary distributed data. However, catastrophic forgetting raises practical concerns when applying these methods to real applications, where the data distributions change constantly. In this paper, we investigate the continual learning issue for rain removal and develop a novel efficient continual learned deraining transformer. Different from the typical replay or regularization-based methods that increase overall training time or parameter space, our method relies on compact prompts which are learnable parameters, to maintain both task-invariant and task-specific knowledge. Our prompts are applied at both image and feature levels to leverage effectively transferred knowledge of images and features among different tasks. We conduct comprehensive experiments under widely-used rain removal datasets, where our proposed dual prompt learning consistently outperforms prior state-of-the-art methods. Moreover, we observe that, even though our method is designed for continual learning, it still achieves superior results on the stationary distributed data, which further demonstrates the effectiveness of our method. Our website is available at: http://liuminghao.com.cn/DPL/.

AAAI Conference 2023 Conference Paper

Estimating Reflectance Layer from a Single Image: Integrating Reflectance Guidance and Shadow/Specular Aware Learning

  • Yeying Jin
  • Ruoteng Li
  • Wenhan Yang
  • Robby T. Tan

Estimating the reflectance layer from a single image is a challenging task. It becomes more challenging when the input image contains shadows or specular highlights, which often render an inaccurate estimate of the reflectance layer. Therefore, we propose a two-stage learning method, including reflectance guidance and a Shadow/Specular-Aware (S-Aware) network to tackle the problem. In the first stage, an initial reflectance layer free from shadows and specularities is obtained with the constraint of novel losses that are guided by prior-based shadow-free and specular-free images. To further enforce the reflectance layer to be independent of shadows and specularities in the second-stage refinement, we introduce an S-Aware network that distinguishes the reflectance image from the input image. Our network employs a classifier to categorize shadow/shadow-free, specular/specular-free classes, enabling the activation features to function as attention maps that focus on shadow/specular regions. Our quantitative and qualitative evaluations show that our method outperforms the state-of-the-art methods in the reflectance layer estimation that is free from shadows and specularities.

NeurIPS Conference 2023 Conference Paper

Robust Contrastive Language-Image Pretraining against Data Poisoning and Backdoor Attacks

  • Wenhan Yang
  • Jingdong Gao
  • Baharan Mirzasoleiman

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP, makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. RoCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs. It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that RoCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, RoCLIP decreases the success rate for targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model's linear probe performance by 10% and maintaining similar zero-shot performance compared to CLIP. By increasing the frequency of matching, RoCLIP is able to defend strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.
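
A sketch of the pool-matching step described in the abstract: every few epochs each image is paired with the most similar caption from a pool rather than with its own caption. The pool construction and embedding normalization are assumptions for illustration, not the RoCLIP code.

```python
import torch

def pool_match(img_emb: torch.Tensor, caption_pool_emb: torch.Tensor) -> torch.Tensor:
    """Return, for each image, the pool caption embedding most similar to it.

    img_emb:          (B, d) L2-normalized image embeddings of the current batch.
    caption_pool_emb: (P, d) L2-normalized embeddings of a large caption pool.
    """
    sims = img_emb @ caption_pool_emb.t()       # (B, P) cosine similarities
    nearest = sims.argmax(dim=1)                # index of the best-matching caption
    return caption_pool_emb[nearest]            # used as training targets instead of own captions
```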

AAAI Conference 2022 Conference Paper

Low-Light Image Enhancement with Normalizing Flow

  • Yufei Wang
  • Renjie Wan
  • Wenhan Yang
  • Haoliang Li
  • Lap-Pui Chau
  • Alex Kot

Enhancing low-light images to normally exposed ones is highly ill-posed: the mapping relationship between them is one-to-many. Previous works based on the pixel-wise reconstruction losses and deterministic processes fail to capture the complex conditional distribution of normally exposed images, which results in improper brightness, residual noise, and artifacts. In this paper, we investigate modeling this one-to-many relationship via a proposed normalizing flow model: an invertible network that takes the low-light images/features as the condition and learns to map the distribution of normally exposed images into a Gaussian distribution. In this way, the conditional distribution of the normally exposed images can be well modeled, and the enhancement process, i.e., the other inference direction of the invertible network, is equivalent to being constrained by a loss function that better describes the manifold structure of natural images during the training. The experimental results on the existing benchmark datasets show our method achieves better quantitative and qualitative results, obtaining better-exposed illumination, less noise and fewer artifacts, and richer colors.
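
As a toy illustration of the training objective described above, the sketch below conditions a single element-wise affine "flow" on low-light features and penalizes the latent toward a standard Gaussian plus the log-determinant term. This deliberate simplification is an assumption for exposition, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyConditionalFlow(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # Predict an element-wise scale and shift from the conditioning features.
        self.net = nn.Sequential(nn.Linear(cond_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * dim))

    def forward(self, y: torch.Tensor, cond: torch.Tensor):
        """y: (B, dim) flattened normally exposed image; cond: (B, cond_dim) low-light features."""
        log_s, t = self.net(cond).chunk(2, dim=1)
        z = y * torch.exp(log_s) + t                 # invertible element-wise map
        log_det = log_s.sum(dim=1)                   # log |det dz/dy|
        nll = 0.5 * (z ** 2).sum(dim=1) - log_det    # -log N(z; 0, I) - log|det|, up to a constant
        return z, nll.mean()
```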

IJCAI Conference 2022 Conference Paper

Semantic Compression Embedding for Generative Zero-Shot Learning

  • Ziming Hong
  • Shiming Chen
  • Guo-Sen Xie
  • Wenhan Yang
  • Jian Zhao
  • Yuanjie Shao
  • Qinmu Peng
  • Xinge You

Generative methods have been successfully applied in zero-shot learning (ZSL) by learning an implicit mapping to alleviate the visual-semantic domain gaps and synthesizing unseen samples to handle the data imbalance between seen and unseen classes. However, existing generative methods simply use visual features extracted by the pre-trained CNN backbone. These visual features lack attribute-level semantic information. Consequently, seen classes are indistinguishable, and the knowledge transfer from seen to unseen classes is limited. To tackle this issue, we propose a novel Semantic Compression Embedding Guided Generation (SC-EGG) model, which cascades a semantic compression embedding network (SCEN) and an embedding guided generative network (EGGN). The SCEN extracts a group of attribute-level local features for each sample and further compresses them into the new low-dimension visual feature. Thus, a dense-semantic visual space is obtained. The EGGN learns a mapping from the class-level semantic space to the dense-semantic visual space, thus improving the discriminability of the synthesized dense-semantic unseen visual features. Extensive experiments on three benchmark datasets, i.e., CUB, SUN, and AWA2, demonstrate the significant performance gains of SC-EGG over current state-of-the-art methods and its baselines.

AAAI Conference 2022 Conference Paper

Semantically Contrastive Learning for Low-Light Image Enhancement

  • Dong Liang
  • Ling Li
  • Mingqiang Wei
  • Shuo Yang
  • Liyan Zhang
  • Wenhan Yang
  • Yun Du
  • Huiyu Zhou

Low-light image enhancement (LLE) remains challenging due to the unfavorable prevailing low-contrast and weak-visibility problems of single RGB images. In this paper, we respond to the intriguing learning-related question – can leveraging both accessible unpaired over/underexposed images and high-level semantic guidance improve the performance of cutting-edge LLE models? Here, we propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE). Beyond the existing LLE wisdom, it casts the image enhancement task as multi-task joint learning, where LLE is converted into three constraints of contrastive learning, semantic brightness consistency, and feature preservation for simultaneously ensuring the exposure, texture, and color consistency. SCL-LLE allows the LLE model to learn from unpaired positives (normal-light)/negatives (over/underexposed), and enables it to interact with the scene semantics to regularize the image enhancement network, yet the interaction of high-level semantic knowledge and the low-level signal prior is seldom investigated in previous methods. Training on readily available open data, extensive experiments demonstrate that our method surpasses state-of-the-art LLE models over six independent cross-scenes datasets. Moreover, SCL-LLE's potential to benefit the downstream semantic segmentation under extremely dark conditions is discussed. Source Code: https://github.com/LingLIx/SCL-LLE.

AAAI Conference 2020 Conference Paper

Coarse-to-Fine Hyper-Prior Modeling for Learned Image Compression

  • Yueyu Hu
  • Wenhan Yang
  • Jiaying Liu

Approaches to image compression with machine learning now achieve superior performance on the compression rate compared to existing hybrid codecs. The conventional learning-based methods for image compression exploit hyper-prior and spatial context models to facilitate probability estimations. Such models have limitations in modeling long-term dependency and do not fully squeeze out the spatial redundancy in images. In this paper, we propose a coarse-to-fine framework with hierarchical layers of hyper-priors to conduct comprehensive analysis of the image and more effectively reduce spatial redundancy, which improves the rate-distortion performance of image compression significantly. Signal Preserving Hyper Transforms are designed to achieve an in-depth analysis of the latent representation and the Information Aggregation Reconstruction sub-network is proposed to maximally utilize side-information for reconstruction. Experimental results show the effectiveness of the proposed network to efficiently reduce the redundancies in images and improve the rate-distortion performance, especially for high-resolution images. Our project is publicly available at https://huzi96.github.io/coarse-to-fine-compression.html.

AAAI Conference 2020 Conference Paper

Towards Scale-Free Rain Streak Removal via Self-Supervised Fractal Band Learning

  • Wenhan Yang
  • Shiqi Wang
  • Dejia Xu
  • Xiaodong Wang
  • Jiaying Liu

Data-driven rain streak removal methods, most of which rely on synthesized paired data, usually face generalization problems when applied to real cases. In this paper, we propose a novel deep-learning based rain streak removal method injected with self-supervision to improve the ability to remove rain streaks at various scales. To realize this goal, we made efforts in two aspects. First, considering that rain streak removal is highly correlated with texture characteristics, we create a fractal band learning (FBL) network based on frequency band recovery. It integrates commonly seen band feature operations with neural modules and effectively improves the capacity to capture discriminative features for deraining. Second, to further improve the generalization ability of FBL for rain streaks in various scales, we add cross-scale self-supervision to regularize the network training. The constraint forces the extracted features of inputs in different scales to be equivalent after rescaling. Therefore, FBL can offer similar responses based on solely image content without the interleave of scale and is capable of removing rain streaks at various scales. Extensive experiments in quantitative and qualitative evaluations demonstrate the superiority of our FBL for rain streak removal, especially for the real cases where very large rain streaks exist, and prove the effectiveness of each of its components. Our code will be publicly available at: https://github.com/flyywh/AAAI-2020-FBL-SS.

IROS Conference 2016 Conference Paper

Human activity recognition based on weighted limb features

  • Liang Zhang 0010
  • Wenhan Yang
  • Guangming Zhu 0001
  • Peiyi Shen
  • Juan Song

Human activity recognition plays an important role in personal assistive robots; being able to recognize human activity and perform the corresponding assistive action is a great challenge for such robots. The human body is an articulated system of rigid segments that can be divided into five parts, but many existing methods identify actions based on the motion trajectories of the whole body. In this paper, taking into account the fact that most actions can be performed by a few limbs and the other limbs should not impact action recognition, we propose an activity recognition method based on limb weights. The weight of each limb is composed of consistency weight and uniqueness weight, which are learned according to the similarity degree among different sequences for each specific action. The covariance descriptor, which is the concatenation of eigenvalues extracted from covariance matrices, is adopted to represent the motion trajectory of each limb. In order to distinguish action instances from each other in the feature sequences, a simple annotation method is used. Experimental results on the Cornell activity dataset and the Lab dataset show that the proposed method not only outperforms state-of-the-art algorithms, but is also able to recognize actions whose non-core limbs' trajectories differ from each other.