Arrow Research · Search

Author name cluster

Alex Kot

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
1 author row

Possible papers (10)

AAAI 2026 · Conference Paper

From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge

  • Hui Lu
  • Yi Yu
  • Song Xia
  • Yiming Yang
  • Deepu Rajan
  • Boon Poh Ng
  • Alex Kot
  • Xudong Jiang

Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
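
A minimal PyTorch sketch of what such an objective could look like, assuming per-frame features from a frozen VFM encoder; the names and the simple frame-difference "motion cue" are illustrative stand-ins, not the authors' code.

```python
import torch
import torch.nn.functional as F

def tva_style_loss(clean_feats, adv_feats):
    """clean_feats, adv_feats: (batch, time, dim) features from a frozen VFM.

    Minimizing this loss pushes adversarial features away from clean ones
    (the contrastive idea) and disrupts frame-to-frame feature dynamics
    (a crude proxy for the temporal consistency term).
    """
    # Feature-discrepancy term: lower cosine similarity = larger discrepancy.
    discrepancy = F.cosine_similarity(clean_feats, adv_feats, dim=-1).mean()

    # Temporal term: differences of consecutive frame features as motion cues.
    clean_motion = clean_feats[:, 1:] - clean_feats[:, :-1]
    adv_motion = adv_feats[:, 1:] - adv_feats[:, :-1]
    temporal = F.cosine_similarity(clean_motion, adv_motion, dim=-1).mean()

    return discrepancy + temporal

# Toy usage with random tensors standing in for encoder outputs.
clean = torch.randn(2, 8, 512)
adv = (clean + 0.1 * torch.randn_like(clean)).requires_grad_(True)
loss = tva_style_loss(clean, adv)
loss.backward()  # gradients w.r.t. adv would drive the perturbation update
```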

AAAI 2026 · Conference Paper

SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

  • Zhaoxu Li
  • Chenqi Kong
  • Yi Yu
  • Qiangqiang Wu
  • Xinghao Jiang
  • Ngai-Man Cheung
  • Bihan Wen
  • Alex Kot

Large Vision-Language Models (LVLMs) have recently achieved significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision (SAVER), a novel mechanism that dynamically adjusts LVLMs' final outputs based on token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.
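
The revision mechanism itself is not spelled out in the abstract; as a rough illustration only, the sketch below shows one generic way a decoder could use an early-layer visual-attention signal to temper next-token logits. The function name, threshold, and temperature are assumptions, not the paper's method.

```python
import torch

def revise_logits(logits, visual_attention_mass, threshold=0.2, temperature=2.0):
    """logits: (vocab,) next-token logits.
    visual_attention_mass: fraction of an early layer's attention that
    lands on image tokens at the current decoding step (in [0, 1])."""
    if visual_attention_mass < threshold:
        # Weak visual grounding: flatten the distribution so the model is
        # less likely to commit to a confident (possibly hallucinated) token.
        return logits / temperature
    return logits

# Toy usage: a poorly grounded decoding step gets a softened distribution.
logits = torch.randn(32000)
revised = revise_logits(logits, visual_attention_mass=0.05)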

AAAI 2025 · Conference Paper

Backdoor Attacks Against No-Reference Image Quality Assessment Models via a Scalable Trigger

  • Yi Yu
  • Song Xia
  • Xun Lin
  • Wenhan Yang
  • Shijian Lu
  • Yap-Peng Tan
  • Alex Kot

No-Reference Image Quality Assessment (NR-IQA), responsible for assessing the quality of a single input image without using any reference, plays a critical role in evaluating and optimizing computer vision systems, e.g., low-light enhancement. Recent research indicates that NR-IQA models are susceptible to adversarial attacks, which can significantly alter predicted scores with visually imperceptible perturbations. Despite revealing vulnerabilities, these attack methods have limitations, including high computational demands, untargeted manipulation, limited practical utility in white-box scenarios, and reduced effectiveness in black-box scenarios. To address these challenges, we shift our focus to another significant threat and present a novel poisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker to manipulate the IQA model's output to any desired target value by simply adjusting a scaling coefficient alpha for the trigger. We propose to inject the trigger in the discrete cosine transform (DCT) domain to improve the local invariance of the trigger, countering trigger diminishment caused by the data augmentations widely adopted in NR-IQA models. Furthermore, universal adversarial perturbations (UAPs) in the DCT space are designed as the trigger to increase the IQA model's susceptibility to manipulation and improve attack effectiveness. In addition to the heuristic method for poison-label BAIQA (P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on alpha sampling and image data refinement, driven by theoretical insights we reveal. Extensive experiments on diverse datasets and various NR-IQA models demonstrate the effectiveness of our attacks.
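
The trigger mechanics can be sketched concisely. The snippet below shows DCT-domain injection with a scaling coefficient alpha, assuming a grayscale image in [0, 1]; the random pattern is a placeholder for the learned UAP trigger, so this is a shape-level illustration rather than the paper's attack.

```python
import numpy as np
from scipy.fft import dctn, idctn

def inject_dct_trigger(image, trigger, alpha):
    """image: float array (H, W) in [0, 1]; trigger: (H, W) pattern in
    DCT space; alpha: scaling coefficient steering the target score."""
    coeffs = dctn(image, norm="ortho")       # to the DCT domain
    coeffs += alpha * trigger                # additive, scalable trigger
    poisoned = idctn(coeffs, norm="ortho")   # back to pixel space
    return np.clip(poisoned, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64))
trigger = rng.standard_normal((64, 64)) * 0.01  # placeholder for the UAP
poisoned = inject_dct_trigger(image, trigger, alpha=0.5)
```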

TMLR 2025 · Journal Article

Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?

  • Viet-Hung Tran
  • Ngoc-Bao Nguyen
  • Son T. Mai
  • Hans Vandierendonck
  • Ira Assent
  • Alex Kot
  • Ngai-Man Cheung

Model Inversion (MI) attacks pose a significant privacy threat by reconstructing private training data from machine learning models. While existing defenses primarily concentrate on model-centric approaches, the impact of data on MI robustness remains largely unexplored. In this work, we explore Random Erasing (RE)—a technique traditionally used for improving model generalization under occlusion—and uncover its surprising effectiveness as a defense against MI attacks. Specifically, our novel feature-space analysis shows that models trained with RE images introduce a significant discrepancy between the features of MI-reconstructed images and those of the private data. At the same time, features of private images remain distinct from other classes and well separated from different classification regions. These effects collectively degrade MI reconstruction quality and attack accuracy while maintaining reasonable natural accuracy. Furthermore, we explore two critical properties of RE: Partial Erasure and Random Location. First, Partial Erasure prevents the model from observing entire objects during training, and we find that this has a significant impact on MI, which aims to reconstruct entire objects. Second, the Random Location of erasure plays a crucial role in achieving a strong privacy-utility trade-off. Our findings highlight RE as a simple yet effective defense mechanism that can be easily integrated with existing privacy-preserving techniques. Extensive experiments across 37 setups demonstrate that our method achieves SOTA performance in the privacy-utility trade-off. The results consistently demonstrate the superiority of our defense over existing defenses across different MI attacks, network architectures, and attack configurations. For the first time, we achieve a significant degradation in attack accuracy without a decrease in utility for some configurations. Our code and additional results are available at: https://ngoc-nguyen-0.github.io/MIDRE/
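
Because the defense builds on a standard augmentation, applying it needs no custom code. A minimal sketch using torchvision's stock transform follows; the scale and ratio values are illustrative defaults, not the paper's tuned settings.

```python
import torch
from torchvision import transforms

# Partial Erasure at a Random Location: every training image gets a
# random rectangle zeroed out before it reaches the model.
erase = transforms.RandomErasing(p=1.0, scale=(0.1, 0.4), ratio=(0.3, 3.3), value=0)

x = torch.rand(3, 224, 224)  # stand-in training image (C, H, W)
x_re = erase(x)
```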

NeurIPS 2024 · Conference Paper

From Chaos to Clarity: 3DGS in the Dark

  • Zhihao Li
  • Yufei Wang
  • Alex Kot
  • Bihan Wen

Novel view synthesis from raw images provides superior high dynamic range (HDR) information compared to reconstructions from low dynamic range RGB images. However, the inherent noise in unprocessed raw images compromises the accuracy of 3D scene representation. Our study reveals that 3D Gaussian Splatting (3DGS) is particularly susceptible to this noise, leading to numerous elongated Gaussian shapes that overfit the noise, thereby significantly degrading reconstruction quality and reducing inference speed, especially in scenarios with limited views. To address these issues, we introduce a novel self-supervised learning framework designed to reconstruct HDR 3DGS from a limited number of noisy raw images. This framework enhances 3DGS by integrating a noise extractor and employing a noise-robust reconstruction loss that leverages a noise distribution prior. Experimental results show that our method outperforms LDR/HDR 3DGS and previous state-of-the-art (SOTA) self-supervised and supervised pre-trained models in both reconstruction quality and inference speed on the RawNeRF dataset across a broad range of training views. We will release the code upon paper acceptance.
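
The abstract does not give the loss, so the sketch below is only one plausible reading: raw-sensor noise is commonly modeled as signal-dependent shot noise plus read noise, and a reconstruction loss can down-weight residuals by that variance prior. The function and parameter values are assumptions, not the paper's formulation.

```python
import torch

def noise_aware_loss(rendered, noisy_raw, shot=1e-3, read=1e-4):
    """rendered: 3DGS output; noisy_raw: captured raw frame (same shape).
    shot/read: assumed Poisson-Gaussian noise-model parameters, which in
    practice would come from per-camera calibration."""
    var = shot * rendered.clamp(min=0) + read        # variance prior
    return ((rendered - noisy_raw) ** 2 / (var + 1e-8)).mean()

# Toy usage with random tensors in place of renders and raw captures.
rendered = torch.rand(4, 64, 64, requires_grad=True)
noisy_raw = torch.rand(4, 64, 64)
noise_aware_loss(rendered, noisy_raw).backward()
```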

AAAI 2022 · Conference Paper

Low-Light Image Enhancement with Normalizing Flow

  • Yufei Wang
  • Renjie Wan
  • Wenhan Yang
  • Haoliang Li
  • Lap-Pui Chau
  • Alex Kot

Enhancing low-light images to normally exposed ones is highly ill-posed: the mapping between them is one-to-many. Previous works based on pixel-wise reconstruction losses and deterministic processes fail to capture the complex conditional distribution of normally exposed images, which results in improper brightness, residual noise, and artifacts. In this paper, we investigate modeling this one-to-many relationship with a proposed normalizing flow model. An invertible network takes the low-light images/features as the condition and learns to map the distribution of normally exposed images into a Gaussian distribution. In this way, the conditional distribution of normally exposed images can be well modeled, and the enhancement process, i.e., the other inference direction of the invertible network, is equivalent to being constrained by a loss function that better describes the manifold structure of natural images during training. Experimental results on existing benchmark datasets show that our method achieves better quantitative and qualitative results, obtaining better-exposed illumination, less noise, fewer artifacts, and richer colors.
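
The training objective of such a conditional flow has a standard form: map the normally exposed image to a latent z given the low-light condition, and minimize the Gaussian negative log-likelihood plus the flow's log-determinant. A generic PyTorch sketch of that objective follows (not the paper's exact architecture).

```python
import torch

def flow_nll(z, log_det_jacobian):
    """z: latents the invertible network produces for a normal-light image,
    conditioned on the low-light input; log_det_jacobian: (batch,) summed
    log |det dz/dx| reported by the flow's coupling layers."""
    log_2pi = torch.log(torch.tensor(2.0 * torch.pi))
    log_pz = -0.5 * (z ** 2 + log_2pi).sum(dim=tuple(range(1, z.dim())))
    return -(log_pz + log_det_jacobian).mean()

# Toy usage with stand-in latents and a zero log-determinant.
z = torch.randn(4, 3, 32, 32)
print(flow_nll(z, torch.zeros(4)))
```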

TMLR 2022 · Journal Article

Variational Disentanglement for Domain Generalization

  • Yufei Wang
  • Haoliang Li
  • Hao Cheng
  • Bihan Wen
  • Lap-Pui Chau
  • Alex Kot

Domain generalization aims to learn a domain-invariant model that can generalize well to an unseen target domain. In this paper, based on the assumption that there exists an invariant feature mapping, we propose an evidence upper bound on the divergence between the category-specific feature and its invariant ground truth using variational inference. To optimize this upper bound, we further propose an efficient Variational Disentanglement Network (VDN) that is capable of disentangling domain-specific features from category-specific features (which generalize well to unseen samples). In addition, the novel images generated by VDN are used to further improve generalization ability. We conduct extensive experiments to verify our method on three benchmarks, and both quantitative and qualitative results illustrate its effectiveness.
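
The abstract gives only the shape of the method, so here is a generic sketch of the variational ingredient: encode the category-specific feature as a Gaussian and penalize its divergence from a per-class prior. The closed-form KL below is standard; treating the prior as a unit-variance Gaussian around a class prototype is our assumption.

```python
import torch

def gaussian_kl(mu, logvar, prior_mu):
    """KL( N(mu, diag(exp(logvar))) || N(prior_mu, I) ), one value per sample.

    mu, logvar: (batch, dim) encoder outputs for the category-specific
    feature; prior_mu: (dim,) assumed invariant class prototype."""
    return 0.5 * ((mu - prior_mu) ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)

# Toy usage: features far from the prototype incur a larger penalty.
mu, logvar = torch.randn(8, 16), torch.zeros(8, 16)
penalty = gaussian_kl(mu, logvar, prior_mu=torch.zeros(16)).mean()
```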

NeurIPS 2021 · Conference Paper

Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions

  • Chenyu Yi
  • Siyuan Yang
  • Haoliang Li
  • Yap-Peng Tan
  • Alex Kot

State-of-the-art deep neural networks are vulnerable to common corruptions (e.g., input data degradations, distortions, and disturbances caused by weather changes, system errors, and processing). While much progress has been made in analyzing and improving the robustness of models in image understanding, robustness in video understanding is largely unexplored. In this paper, we establish a corruption robustness benchmark, Mini Kinetics-C and Mini SSV2-C, which considers temporal corruptions beyond the spatial corruptions used for images. We make the first attempt to conduct an exhaustive study of the corruption robustness of established CNN-based and Transformer-based spatial-temporal models. The study provides some guidance on robust model design and training: Transformer-based models perform better than CNN-based models on corruption robustness; the generalization ability of spatial-temporal models implies robustness against temporal corruptions; and model corruption robustness (especially robustness in the temporal domain) increases with computational cost and model capacity, which may contradict the current trend of improving the computational efficiency of models. Moreover, we find that robustness interventions for image-related tasks (e.g., training models with noise) may not work for spatial-temporal models. Our code is available at https://github.com/Newbeeyoung/Video-Corruption-Robustness.
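
For a feel of what a temporal corruption looks like, the sketch below perturbs temporal structure while leaving individual frames spatially clean; it is an illustrative example, not necessarily one of the benchmark's corruptions.

```python
import torch

def frame_drop(video, drop_prob=0.3):
    """video: (T, C, H, W). Randomly replaces frames with their predecessor,
    corrupting motion continuity without adding any spatial noise."""
    out = video.clone()
    mask = torch.rand(video.shape[0]) < drop_prob
    for t in range(1, video.shape[0]):
        if mask[t]:
            out[t] = out[t - 1]
    return out

clip = torch.rand(16, 3, 112, 112)  # stand-in video clip
corrupted = frame_drop(clip)
```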

NeurIPS 2020 · Conference Paper

Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization

  • Haoliang Li
  • Yufei Wang
  • Renjie Wan
  • Shiqi Wang
  • Tie-Qiang Li
  • Alex Kot

Recently, we have witnessed great progress in the field of medical imaging classification by adopting deep neural networks. However, recent advanced models still require access to sufficiently large and representative datasets for training, which is often unfeasible in clinically realistic environments. When trained on limited datasets, deep neural networks lack generalization capability, as a network trained on data within a certain distribution (e.g., the data captured by a certain device vendor or patient population) may not be able to generalize to data from another distribution. In this paper, we introduce a simple but effective approach to improve the generalization capability of deep neural networks in the field of medical imaging classification. Motivated by the observation that the domain variability of medical images is to some extent compact, we propose to learn a representative feature space through variational encoding with a novel linear-dependency regularization term that captures the shareable information among medical data collected from different domains. As a result, the trained neural network is expected to be equipped with better generalization capability on "unseen" medical data. Experimental results on two challenging medical imaging classification tasks indicate that our method achieves better cross-domain generalization capability than state-of-the-art baselines.
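
One loose way to picture a linear-dependency regularizer (an assumption about its form, not the paper's exact term) is to push per-domain class features toward a shared low-dimensional subspace, for example with a nuclear-norm penalty:

```python
import torch

def linear_dependency_penalty(domain_feats):
    """domain_feats: (n_domains, dim) mean features of one class, one row
    per training domain. A small nuclear norm means the rows are nearly
    linearly dependent, i.e., the domains share a compact subspace."""
    return torch.linalg.matrix_norm(domain_feats, ord="nuc")

# Toy usage: identical rows (perfectly dependent) score lower than random rows.
base = torch.randn(1, 32)
print(linear_dependency_penalty(base.repeat(4, 1)))   # low
print(linear_dependency_penalty(torch.randn(4, 32)))  # higher
```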

AAMAS 2010 · Conference Paper

A Clustering Approach to Filtering Unfair Testimonies for Reputation Systems

  • Siyuan Liu
  • Chunyan Miao
  • Yin-Leng Theng
  • Alex Kot

The problem of unfair testimonies remains an open issue in reputation systems for online trading communities. A common approach is to use binary ratings to model sellers' reputation, and as a result, research on tackling unfair testimonies has also focused on reputation systems that use binary ratings. In this extended abstract, we propose a two-stage clustering approach to filter unfair testimonies for reputation systems that use multi-nominal ratings. The proposed approach uses clustering to identify unfair testimonies and thereby provides buyers with a more accurate reputation evaluation of the target seller.
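
A toy, one-stage version of the clustering filter can be sketched directly from the abstract; the actual approach has two stages and its own similarity measure, and the synthetic rating distributions and majority-cluster rule below are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row: one advisor's distribution over five rating levels for a seller.
fair = rng.dirichlet([1, 1, 2, 5, 8], size=40)
unfair = rng.dirichlet([8, 5, 2, 1, 1], size=10)  # slandering testimonies
ratings = np.vstack([fair, unfair])

# Cluster advisors by rating profile and keep the majority cluster as fair.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ratings)
majority = int(np.bincount(labels).argmax())
filtered = ratings[labels == majority]  # testimonies kept for aggregation
```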