Arrow Research search

Author name cluster

Lei Tan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

EAAI Journal 2026 Journal Article

A computer vision-based approach for detecting tenon pull-out damage in ancient timber structures

  • Juan Wang
  • Yuan Yao
  • Yujing Yuan
  • Lei Tan
  • Xiaohui Yang
  • Na Yang

Ancient timber structures featuring mortise-tenon joints are vital cultural heritage forms in China and East Asia. Accurate and swift detection of joint pull-out is essential for ensuring structural safety and stability, thereby protecting this priceless heritage. Traditional tenon pull-out damage detection methods grapple with issues like dataset scarcity, complex backgrounds, wood grain interference, and a lack of suitable quantification methods. Considering the critical role of mortise-tenon connections in ancient timber structures’ structural integrity, adopting modern, non-destructive damage identification techniques is crucial. Computer vision, as a non-destructive approach, is especially apt for this purpose. To tackle these challenges, a computer vision-based method is proposed for precisely locating and quantifying tenon pull-out damage in ancient timber structures. A comprehensive detection process is established, including dataset construction for accuracy validation, damage location identification, and quantification of pull-out extent. Comparative analysis with instrument measurements reveals high recognition accuracy of the method. This approach offers a reference for assessing the condition of mortise-tenon joints in ancient timber structures, significantly aiding in the scientific preservation of cultural heritage.
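
The abstract does not spell out the quantification step; below is a minimal sketch of one plausible pixel-to-millimeter conversion for a measured pull-out gap. Everything here (the function name pullout_length_mm, the reference-length calibration, the example values) is hypothetical, not from the paper.

```python
# Hypothetical sketch: converting a detected pull-out gap from pixels to
# millimeters. The paper's actual quantification method is not given in the
# abstract; the reference-length calibration here is an assumption.

def pullout_length_mm(gap_px: float, ref_px: float, ref_mm: float) -> float:
    """Scale a pixel gap by a known reference length visible in the image."""
    mm_per_px = ref_mm / ref_px
    return gap_px * mm_per_px

# Example: a 48 px gap, calibrated against a 120 mm beam edge spanning 600 px.
print(pullout_length_mm(gap_px=48, ref_px=600, ref_mm=120))  # -> 9.6 mm
```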

AAAI Conference 2026 Conference Paper

Aggregating Diverse Cue Experts for AI-Generated Image Detection

  • Lei Tan
  • Shuwei Li
  • Mohan Kankanhalli
  • Robby T. Tan

The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. Existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues as input. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN employs a multi-cue aggregation strategy, leveraging spatial, frequency, and chromaticity-based cues. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.
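
As a rough illustration of what a high-frequency cue can look like, here is a minimal sketch that takes the residual of a Gaussian blur as the cue. The filter choice, the sigma value, and the function name high_frequency_cue are assumptions; the paper's actual cue extraction is not detailed in the abstract.

```python
# Sketch of one plausible high-frequency cue: the residual after Gaussian
# smoothing. MCAN's exact filtering choice is not given in the abstract;
# the sigma value here is an assumption.
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency_cue(image: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Return the image minus its low-pass (Gaussian-blurred) version."""
    low_pass = gaussian_filter(image.astype(np.float32), sigma=sigma)
    return image.astype(np.float32) - low_pass

gray = np.random.rand(256, 256).astype(np.float32)  # stand-in for one image channel
print(high_frequency_cue(gray).shape)  # (256, 256)
```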

AAAI Conference 2026 Conference Paper

Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation

  • Shuwei Li
  • Lei Tan
  • Robby T. Tan

Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucinations, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrödinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
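
A minimal sketch of the prototype-repulsion idea described above: features flagged as hallucinated are pushed away from their class prototype. The cosine-similarity hinge, the margin value, and the name repulsion_loss are assumptions, not the paper's exact loss.

```python
# Sketch of a prototype-repulsion loss: features flagged as hallucinated are
# pushed away from their class prototype. The hinge margin and the use of
# cosine similarity are assumptions; the abstract does not give the exact loss.
import torch
import torch.nn.functional as F

def repulsion_loss(hallucinated_feats: torch.Tensor,  # (N, D)
                   prototypes: torch.Tensor,          # (C, D), one per class
                   class_ids: torch.Tensor,           # (N,) long
                   margin: float = 0.3) -> torch.Tensor:
    """Penalize cosine similarity above `margin` to the matching prototype."""
    protos = F.normalize(prototypes[class_ids], dim=-1)
    feats = F.normalize(hallucinated_feats, dim=-1)
    sim = (feats * protos).sum(dim=-1)       # cosine similarity per feature
    return F.relu(sim - margin).mean()       # hinge: only punish sim > margin

feats = torch.randn(8, 256)
protos = torch.randn(5, 256)
ids = torch.randint(0, 5, (8,))
print(repulsion_loss(feats, protos, ids))
```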

AAAI Conference 2026 Conference Paper

FIND: A Simple Yet Effective Baseline for Diffusion-Generated Image Detection

  • Jie Li
  • Yingying Feng
  • Chi Xie
  • Jie Hu
  • Lei Tan
  • Jiayi Ji

The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, because only noise is added, the augmented images retain visual similarity to the originals, which highlights the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.
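
The key training step stated in the abstract, adding Gaussian noise to real images and labeling the noisy copies as synthetic, lends itself to a short sketch. The noise scale, the clamping, and the helper name augment_batch are assumptions.

```python
# Sketch of FIND's key training step as described in the abstract: each real
# image gets a Gaussian-noised copy that is labeled "synthetic". The noise
# standard deviation and batch layout are assumptions.
import torch

def augment_batch(real_images: torch.Tensor, noise_std: float = 0.1):
    """Return (images, labels): originals labeled real (0), noisy copies
    labeled synthetic (1)."""
    noisy = real_images + noise_std * torch.randn_like(real_images)
    images = torch.cat([real_images, noisy.clamp(0, 1)], dim=0)
    labels = torch.cat([torch.zeros(len(real_images), dtype=torch.long),
                        torch.ones(len(real_images), dtype=torch.long)])
    return images, labels

batch = torch.rand(4, 3, 224, 224)  # stand-in for real training images
imgs, lbls = augment_batch(batch)
print(imgs.shape, lbls.tolist())  # [8, 3, 224, 224] and [0, 0, 0, 0, 1, 1, 1, 1]
```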

AAAI Conference 2026 Conference Paper

Joint Implicit and Explicit Language Learning for Pedestrian Attribute Recognition

  • Yukang Zhang
  • Lei Tan
  • Yang Lu
  • Yan Yan
  • Hanzi Wang

Pedestrian attribute recognition (PAR) has received increasing attention due to its wide application in video surveillance and pedestrian analysis. Some text-enhanced methods tackle this task by converting attributes into language descriptions to facilitate interactive learning between attributes and visual images. However, these generic languages fail to uniquely describe different pedestrian images, missing individual characteristics. In this paper, we propose a Joint Implicit and Explicit Language Guidance Enhancement Learning (JGEL) method, which converts each pedestrian image into a language description with dual language learning to effectively learn enhanced attribute information. Specifically, we first propose an Implicit Language Guidance Learning (ILGL) stream. It projects visual image features into the text embedding space to generate pseudo-word tokens, implicitly modeling image attributes and providing personalized descriptions. Moreover, we propose an Explicit Attribute Enhancement Learning (EAEL) stream that explicitly guides the pseudo-word tokens generated by ILGL toward pedestrian attributes, effectively aligning them with the attribute concepts in the text embedding space. Extensive experiments show that JGEL has significant advantages in improving the performance of PAR and the challenging zero-shot PAR task.
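
A minimal sketch of the implicit-language idea: projecting a visual feature into the text embedding space as a pseudo-word token. The single linear projection, the dimensions, and the class name PseudoWordProjector are assumptions; the paper's projector may be more elaborate.

```python
# Sketch of pseudo-word token generation: a visual feature is projected into
# the text embedding space so it can be spliced into a prompt like a word.
# Dimensions and the single-linear-layer design are assumptions.
import torch
import torch.nn as nn

class PseudoWordProjector(nn.Module):
    def __init__(self, visual_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim)

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # One pseudo-word token per image, shaped like a text token embedding.
        return self.proj(visual_feat).unsqueeze(1)  # (B, 1, text_dim)

projector = PseudoWordProjector()
token = projector(torch.randn(2, 768))
print(token.shape)  # torch.Size([2, 1, 512]) -- ready to splice into a prompt
```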

AAAI Conference 2025 Conference Paper

Embedding Robust Watermarking into Pattern to Protect the Copyright of Ceramic Artifacts

  • Lei Tan
  • Yuliang Xue
  • Guobiao Li
  • Zhenxing Qian
  • Sheng Li
  • Chunlei Bao

Ceramic artworks with elegant patterns present enormous collectible value and profits. To claim copyright, the maker usually affixes a conspicuous stamp on the bottom or side of the ceramic artwork, which inevitably affects the external image of the artwork. In addition, the stamp is weak in resisting forgery attacks due to its visible nature. To address the above issues, we propose in this paper a novel framework for embedding invisible watermarking into patterns of the ceramic artworks. In the framework, a template-based watermarking embedding scheme is designed to map the watermark to an invisible template, which is added to the ceramic pattern to create its watermarked version. A distortion layer is further proposed to model the distortion of ceramic patterns in the ceramic manufacturing process, where a color-halftoning and an adaptive brightness adjustment strategy are developed to counter the print and firing operations that introduce the most significant distortions. Finally, a deep decoder is learned to extract the watermark from the distorted pattern. Various experiments have been conducted to demonstrate the advantage of our proposed method for protecting the copyright of ceramic artworks, which provides reliable watermark extraction accuracy without the need for a conspicuous stamp.
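
For intuition about template-based embedding, here is a toy sketch in which a bit string is tiled into a faint +/-1 template and added to the pattern. In the paper the mapping is learned; the tiling scheme, the strength value, and the name embed_template are all assumptions.

```python
# Toy sketch of additive template watermarking: a bit string is mapped to a
# faint spatial template and added to the pattern. The paper's encoder is
# learned, not hand-coded; this tiling scheme is an assumption.
import numpy as np

def embed_template(pattern: np.ndarray, bits: np.ndarray,
                   strength: float = 2.0) -> np.ndarray:
    """Tile +/-1 bit blocks over the pattern and add them at low amplitude."""
    h, w = pattern.shape[:2]
    side = int(np.ceil(np.sqrt(len(bits))))
    grid = np.resize(bits * 2 - 1, (side, side)).astype(np.float32)  # {0,1} -> {-1,+1}
    template = np.kron(grid, np.ones((h // side + 1, w // side + 1)))[:h, :w]
    return np.clip(pattern + strength * template[..., None], 0, 255)

pattern = np.full((256, 256, 3), 128, dtype=np.float32)  # stand-in ceramic pattern
bits = np.random.randint(0, 2, 64)
print(embed_template(pattern, bits).shape)  # (256, 256, 3)
```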

ICML Conference 2025 Conference Paper

FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification

  • Zhen Sun
  • Lei Tan
  • Yunhang Shen
  • Chengmao Cai
  • Xing Sun 0001
  • Pingyang Dai
  • Liujuan Cao
  • Rongrong Ji

Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
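
A minimal sketch of an adaptive mixture-of-experts layer in the spirit described: a learned gate weighs per-expert outputs. The linear experts, the gate design, and the class name SimpleMoE are assumptions; the abstract does not give FlexiReID's actual architecture.

```python
# Sketch of a gated mixture-of-experts layer: a softmax gate produces per-
# expert weights and the expert outputs are combined as a weighted sum.
# Expert and gate architectures here are assumptions.
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # weighted sum

moe = SimpleMoE()
print(moe(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```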

NeurIPS Conference 2025 Conference Paper

GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification

  • Qiao Li
  • Jie Li
  • Yukang Zhang
  • Lei Tan
  • Jing Chen
  • Jiayi Ji

Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Transformation Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. Extensive experiments on the challenging CARGO benchmark demonstrate the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods.
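
As a sketch of the visibility-aware idea behind the DAM, here is masked pooling where a predicted per-part visibility score reweights part features. The mask head, the part count, and the name VisibilityPooling are assumptions.

```python
# Sketch of visibility-aware pooling: a predicted per-part visibility score
# reweights body-part features before matching, downweighting occluded parts.
# The mask head design is an assumption; the abstract gives no internals.
import torch
import torch.nn as nn

class VisibilityPooling(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mask_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, part_feats: torch.Tensor) -> torch.Tensor:
        # part_feats: (B, P, D) features for P body parts
        vis = self.mask_head(part_feats)                          # (B, P, 1) in [0, 1]
        pooled = (vis * part_feats).sum(1) / vis.sum(1).clamp(min=1e-6)
        return pooled                                             # (B, D)

pool = VisibilityPooling()
print(pool(torch.randn(2, 6, 256)).shape)  # torch.Size([2, 256])
```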

NeurIPS Conference 2025 Conference Paper

MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

  • Yingying Feng
  • Jie Li
  • Jie Hu
  • Yukang Zhang
  • Lei Tan
  • Jiayi Ji

The challenge of inconsistent modalities in real-world applications presents significant obstacles to effective object re-identification (ReID). However, most existing approaches assume modality-matched conditions, significantly limiting their effectiveness in modality-mismatched scenarios. To overcome this limitation and achieve a more flexible ReID, we introduce MDReID to allow any-to-any image-level ReID systems. MDReID is inspired by the widely recognized perspective that modality information comprises both modality-shared features, which are predictable across modalities, and modality-specific features, which are inherently modality-dependent. It consists of two key components: the Modality Decoupling Module (MDM) and Modality-aware Metric Learning (MML). Specifically, MDM explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enhances feature discrimination and decoupling by exploiting distributional relationships between shared and specific modality features. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. It achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively.
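
A minimal sketch of the decoupling idea: two projection heads split one feature into modality-shared and modality-specific parts. The linear heads and the name ModalityDecoupler are assumptions; the abstract does not specify MDM's internals.

```python
# Sketch of modality decoupling: two heads split a feature into a
# modality-shared part and a modality-specific part. How the halves are used
# downstream (matching, metric learning) is beyond this sketch.
import torch
import torch.nn as nn

class ModalityDecoupler(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.shared_head = nn.Linear(dim, dim)    # predictable across modalities
        self.specific_head = nn.Linear(dim, dim)  # modality-dependent residue

    def forward(self, feat: torch.Tensor):
        return self.shared_head(feat), self.specific_head(feat)

dec = ModalityDecoupler()
shared, specific = dec(torch.randn(2, 512))
print(shared.shape, specific.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```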

ICML Conference 2025 Conference Paper

Multi-Modal Object Re-identification via Sparse Mixture-of-Experts

  • Yingying Feng
  • Jie Li
  • Chi Xie
  • Lei Tan
  • Jiayi Ji

We present MFRNet, a novel network for multi-modal object re-identification that integrates multi-modal data features to effectively retrieve specific objects across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with improvements of 8.4% in mAP and 6.9% in accuracy on RGBNT201 with negligible additional parameters.
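
As a generic stand-in for pixel-level cross-modal fusion, here is a sketch that concatenates two modality feature maps per pixel and mixes them with a 1x1 convolution. The FFM's actual design is not described in the abstract; the name PixelFusion and all dimensions are assumptions.

```python
# Generic sketch of pixel-level cross-modal fusion: per-pixel concatenation of
# two modality feature maps, mixed by a 1x1 convolution. This is a stand-in,
# not MFRNet's actual FFM.
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([feat_a, feat_b], dim=1))  # (B, C, H, W)

fusion = PixelFusion()
out = fusion(torch.randn(2, 256, 16, 8), torch.randn(2, 256, 16, 8))
print(out.shape)  # torch.Size([2, 256, 16, 8])
```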

AAAI Conference 2025 Conference Paper

Physical Marker: Revealing Invisible Hyperlinks Hidden in Printed Trademarks

  • Yuliang Xue
  • Lei Tan
  • Guobiao Li
  • Zhenxing Qian
  • Sheng Li
  • Xinpeng Zhang

Embedding links in brand logos is a promising technology, which allows consumers to access the online information of products by capturing physical logo images. Previous physical data hiding methods primarily embed data within cover media in a global manner, making them ineffective for processing brand logos in vector graphics format with a transparent background. To address this issue, we propose in this paper a novel physical deep hiding scheme for invisibly embedding links in printed trademarks. Specifically, the encoder embeds links only into the area of the brand logo under the constraints of a mask, which is generated from the transparency information of the logo image. A background variation distortion is introduced into the distortion layer to approximate practical logo print-camera environments, so that the decoder can learn to retrieve the link from camera-captured logos with various backgrounds. A feature prompt subspace modulator is further proposed and employed in the encoder to enhance the invisibility of the encoded logo pattern and in the decoder to boost hyperlink extraction accuracy. Various experiments have been conducted to demonstrate the advantage of our proposed method for embedding links in printed brand logos, which provides reliable extraction accuracy under both simulated and real scenarios.
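
A minimal sketch of mask-constrained embedding: a watermark residual is applied only where the logo's alpha channel is opaque. In the paper the residual comes from a learned encoder; here it is a random stub, and the name embed_in_logo is hypothetical.

```python
# Sketch of mask-constrained embedding: the encoded residual is applied only
# where the logo's alpha channel is opaque, leaving the transparent background
# untouched. The residual is learned in the paper; here it is a random stub.
import numpy as np

def embed_in_logo(logo_rgba: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Add a watermark residual only inside the logo's opaque region."""
    mask = (logo_rgba[..., 3:4] > 0).astype(np.float32)  # 1 where logo is drawn
    out = logo_rgba.astype(np.float32).copy()
    out[..., :3] = np.clip(out[..., :3] + mask * residual, 0, 255)
    return out

logo = np.zeros((128, 128, 4), dtype=np.uint8)
logo[32:96, 32:96] = (200, 30, 30, 255)                  # opaque red square "logo"
res = np.random.uniform(-3, 3, (128, 128, 3)).astype(np.float32)
print(embed_in_logo(logo, res).shape)  # (128, 128, 4)
```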

AAAI Conference 2024 Conference Paper

Attention Disturbance and Dual-Path Constraint Network for Occluded Person Re-identification

  • Jiaer Xia
  • Lei Tan
  • Pingyang Dai
  • Mingbo Zhao
  • Yongjian Wu
  • Liujuan Cao

Occluded person re-identification (Re-ID) aims to address the potential occlusion problem when matching occluded or holistic pedestrians from different camera views. Many methods use the background as artificial occlusion and rely on attention networks to exclude noisy interference. However, the significant discrepancy between simple background occlusion and realistic occlusion can negatively impact the generalization of the network. To address this issue, we propose a novel transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks. Firstly, to imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise that distracts attention like a realistic occluder, serving as a more complex form of occlusion. Secondly, to fully exploit these complex occluded images, we develop a Dual-Path Constraint Module (DPC) that can obtain preferable supervision information from holistic images through dual-path interaction. With our proposed method, the network can effectively circumvent a wide variety of occlusions using the basic ViT baseline. Comprehensive experimental evaluations conducted on person re-ID benchmarks demonstrate the superiority of ADP over state-of-the-art methods.
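
The abstract does not specify how the dual-path interaction is realized; a simple consistency loss between the occluded path and a detached holistic path is one plausible stand-in, sketched below. The MSE formulation and the name dual_path_loss are assumptions.

```python
# Sketch of a dual-path constraint in spirit: features from an artificially
# occluded image are pulled toward the features of its holistic counterpart.
# A plain MSE consistency loss stands in for DPC's actual interaction.
import torch
import torch.nn.functional as F

def dual_path_loss(occluded_feats: torch.Tensor,
                   holistic_feats: torch.Tensor) -> torch.Tensor:
    """Match occluded-path features to the (detached) holistic path."""
    return F.mse_loss(occluded_feats, holistic_feats.detach())

print(dual_path_loss(torch.randn(4, 768), torch.randn(4, 768)))
```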

AAAI Conference 2024 Conference Paper

Occluded Person Re-identification via Saliency-Guided Patch Transfer

  • Lei Tan
  • Jiaer Xia
  • Wenfeng Liu
  • Pingyang Dai
  • Yongjian Wu
  • Liujuan Cao

While generic person re-identification has made remarkable improvement in recent years, these methods are designed under the assumption that the entire body of the person is available. This assumption brings about a significant performance degradation when suffering from occlusion caused by various obstacles in real-world applications. To address this issue, data-driven strategies have emerged to enhance the model's robustness to occlusion. Following the random erasing paradigm, these strategies typically employ randomly generated noise to replace randomly selected image regions, simulating obstacles. However, this random strategy is not sensitive to location and content, meaning it cannot mimic real-world occlusion cases in application scenarios. To overcome this limitation and fully exploit the real scene information in datasets, this paper proposes a more intuitive and effective data-driven strategy named Saliency-Guided Patch Transfer (SPT). Combined with the vision transformer, SPT separates person instances from background obstacles using salient patch selection. By transferring person instances to different background obstacles, SPT can easily generate photo-realistic occluded samples. Furthermore, we propose an occlusion-aware Intersection over Union (OIoU) with mask-rolling to filter for more suitable combinations and a class-ignoring strategy to achieve more stable processing. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate that SPT provides a significant performance gain among different ViT-based ReID algorithms on occluded ReID.
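
A minimal sketch of the patch-transfer idea: ViT-grid patches selected as salient are copied from the person image onto a background/obstacle image at the same positions. The saliency mask source, the 16-pixel patch size, and the name transfer_person_patches are assumptions.

```python
# Sketch of saliency-guided patch transfer: 16x16 patches flagged as the
# person are pasted onto a background/obstacle image at the same grid
# positions, producing an occluded-looking training sample.
import torch

def transfer_person_patches(person_img: torch.Tensor,     # (3, H, W)
                            background_img: torch.Tensor,  # (3, H, W)
                            saliency_mask: torch.Tensor,   # (H//16, W//16) bool
                            patch: int = 16) -> torch.Tensor:
    """Copy salient 16x16 patches of the person onto the background."""
    out = background_img.clone()
    ys, xs = torch.nonzero(saliency_mask, as_tuple=True)
    for y, x in zip(ys.tolist(), xs.tolist()):
        out[:, y*patch:(y+1)*patch, x*patch:(x+1)*patch] = \
            person_img[:, y*patch:(y+1)*patch, x*patch:(x+1)*patch]
    return out

person = torch.rand(3, 256, 128)
bg = torch.rand(3, 256, 128)
mask = torch.rand(16, 8) > 0.5   # stand-in for salient patch selection
print(transfer_person_patches(person, bg, mask).shape)  # torch.Size([3, 256, 128])
```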