Arrow Research search

Author name cluster

Xiaoshuai Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers (39)

AAAI Conference 2026 Conference Paper

Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

  • Lvpan Cai
  • Haowei Wang
  • Jiayi Ji
  • Yanshu Zhoumen
  • Shen Chen
  • Taiping Yao
  • Xiaoshuai Sun

The rise of AI-generated image tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated "Perception-Creation-Evaluation" pipeline to ensure semantic coherence and visual realism. In addition, we further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying subtle forgery-related features across the entire image. NFA-ViT mines heterogeneous regions in images, i.e., potential edited areas, via noise fingerprints. Subsequently, an attention mechanism is introduced to drive interaction between normal and abnormal features, thereby propagating the traces throughout the entire image, allowing subtle forgeries to influence a broader context and improving overall detection robustness. Extensive experiments demonstrate that BR-Gen constructs entirely new scenarios that are not covered by existing methods. Going a step further, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks.

NeurIPS Conference 2025 Conference Paper

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

  • Qiong Wu
  • Wenhao Lin
  • Yiyi Zhou
  • Weihao Ye
  • Zhanpeng Zeng
  • Xiaoshuai Sun
  • Rongrong Ji

In this paper, we study the visual redundancy problem of multimodal large language models (MLLMs) from the perspective of attention behaviors. Via extensive empirical experiments, we observe and conclude three main inference stages of MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens stop contributing to reasoning once the text tokens have received enough image information. Based on this observation, we propose an effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE), which is orthogonal but collaborative to previous token-wise visual compression methods. To validate the efficacy of DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, EAGLE and InternVL. The experimental results not only show the effectiveness of DyVTE in improving MLLMs' efficiency, e.g., reducing the computation overhead of LLaVA-1.5 by up to 45.7% without a performance drop, but also reveal a general pattern across multiple MLLMs, facilitating in-depth analysis of MLLMs. Our code is anonymously released at https://anonymous.4open.science/r/AnonymousDyVTE-26AB/.
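
As a rough illustration of the token-exit idea described above (not the authors' implementation), the sketch below drops visual tokens once text queries stop attending to them; the threshold, tensor shapes, and function names are assumptions.

```python
# Hypothetical sketch of a dynamic visual-token exit check (not DyVTE's actual code).
# Assumes attn has shape [heads, query_len, key_len] and visual tokens occupy key
# positions [vis_start, vis_end); the 0.05 threshold is an arbitrary placeholder.
import torch

def visual_tokens_can_exit(attn: torch.Tensor, vis_start: int, vis_end: int,
                           threshold: float = 0.05) -> bool:
    """True when text queries no longer place much attention mass on visual keys."""
    text_to_visual = attn[:, :, vis_start:vis_end].mean()
    return text_to_visual.item() < threshold

def prune_visual_tokens(hidden: torch.Tensor, vis_start: int, vis_end: int) -> torch.Tensor:
    """Drop the visual-token slice from the sequence once the exit condition fires."""
    return torch.cat([hidden[:, :vis_start], hidden[:, vis_end:]], dim=1)

# Toy usage with random tensors.
attn = torch.softmax(torch.randn(8, 16, 80), dim=-1)   # [heads, text queries, keys]
hidden = torch.randn(1, 80, 512)                        # [batch, sequence, dim]
if visual_tokens_can_exit(attn, vis_start=0, vis_end=64):
    hidden = prune_visual_tokens(hidden, vis_start=0, vis_end=64)
```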

ICLR Conference 2025 Conference Paper

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

  • Gen Luo
  • Yiyi Zhou
  • Yuxin Zhang 0002
  • Xiawu Zheng
  • Xiaoshuai Sun
  • Rongrong Ji

In existing multimodal large language models (MLLMs), image resolution plays a significant role for granular visual recognition. However, directly increasing image resolution leads to expensive computational cost for MLLMs. In this paper, we reveal that a combination of low- and high-resolution visual features can efficiently mitigate this shortcoming. Based on this principle, we propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). In particular, MRA adopts two visual pathways for images of different resolutions, where high-resolution visual information is embedded into the low-resolution pathway via the novel mixture-of-resolution adapters (MR-Adapters). This design also greatly reduces the input sequence length of MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR. We conduct extensive experiments on 17 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 15 VL tasks, e.g., +5.2% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and faster inference speed than LLaVA-NeXT. Source codes are released at: https://github.com/luogen1996/LLaVA-HR.
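
For intuition only, here is a minimal sketch of injecting pooled high-resolution features into a low-resolution pathway through a gated adapter; the module layout and gating are assumptions, not the paper's MR-Adapter definition.

```python
# Minimal mixture-of-resolution adapter sketch: pool the high-resolution map to the
# low-resolution grid, project channels, and fuse by a gated residual (gate starts at 0,
# so the adapter begins as an identity). Shapes and gating are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRAdapterSketch(nn.Module):
    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(high_dim, low_dim, kernel_size=1)  # align channel dims
        self.gate = nn.Parameter(torch.zeros(1))                  # learned fusion weight

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat: [B, C_low, h, w]; high_feat: [B, C_high, H, W] with H > h.
        high = F.adaptive_avg_pool2d(high_feat, low_feat.shape[-2:])  # match spatial size
        return low_feat + torch.tanh(self.gate) * self.proj(high)     # gated residual injection

x_low = torch.randn(2, 256, 24, 24)
x_high = torch.randn(2, 128, 96, 96)
fused = MRAdapterSketch(256, 128)(x_low, x_high)   # -> [2, 256, 24, 24]
```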

AAAI Conference 2025 Conference Paper

IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation

  • Qi Chen
  • Changli Wu
  • Jiayi Ji
  • Yiwei Ma
  • Danni Yang
  • Xiaoshuai Sun

3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU on the 3D-RES and 3D-GRES tasks, respectively.

ICLR Conference 2025 Conference Paper

Routing Experts: Learning to Route Dynamic Experts in Existing Multi-modal Large Language Models

  • Qiong Wu 0012
  • Zhaoxi Ke
  • Yiyi Zhou
  • Xiaoshuai Sun
  • Rongrong Ji

Recently, mixture of experts (MoE) has become a popular paradigm for achieving the trade-off between modal capacity and efficiency of multimodal large language models (MLLMs). Different from previous efforts, we are dedicated to exploring the dynamic experts in existing MLLMs and showing that a standard MLLM can also be a mixture of experts. However, achieving this target is still notoriously challenging. Well-trained MLLMs are accustomed to a fixed pathway, and a drastic change in their inference manner greatly impedes their performance. To address these issues, we propose a novel dynamic expert routing method for existing MLLMs, termed Routing Experts (RoE), which can achieve example-dependent optimal path routing without obvious structural tweaks. Meanwhile, a new structure-sparsity regularization is introduced to force the well-trained MLLMs to learn more short-cut pathways. In addition, we also address the alignment of the training and inference of MLLMs in terms of network routing. To validate RoE, we apply it to a set of existing MLLMs, including LLaVA-1.5, LLaVA-HR and VILA, and conduct extensive experiments on a range of VL benchmarks. The experimental results not only show the effectiveness of our RoE in improving MLLMs' efficiency, but also yield obvious advantages over MoE-LLaVA in both performance and speed, e.g., an average performance gain of 3.3% on 5 benchmarks while being 1.61 times faster. Our code is anonymously released at https://github.com/DoubtedSteam/RoE.

AAAI Conference 2025 Conference Paper

StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization

  • Jinlu Zhang
  • Jiji Tang
  • Rongsheng Zhang
  • Tangjie Lv
  • Xiaoshuai Sun

Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and text-semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle this challenge, we propose a novel knowledge graph, namely Character-Graph (CG), which represents various story-related knowledge, including the characters, their attributes, and their relationships. We then introduce StoryWeaver, an image generator that achieves Customization via Character-Graph (C-CG), capable of consistent story visualization with rich text semantics. To further improve multi-character generation performance, we incorporate knowledge-enhanced spatial guidance (KE-SG) into StoryWeaver to precisely inject character semantics into generation. To validate the effectiveness of our proposed method, extensive experiments are conducted using a new benchmark called TBC-Bench. The experiments confirm that our StoryWeaver excels not only in creating vivid visual story plots but also in accurately conveying character identities across various scenarios with considerable storage efficiency, e.g., achieving an average increase of +9.03% DINO-I and +13.44% CLIP-T. Furthermore, ablation experiments are conducted to verify the superiority of each proposed module.

ICLR Conference 2025 Conference Paper

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

  • Yaxin Luo
  • Gen Luo
  • Jiayi Ji
  • Yiyi Zhou
  • Xiaoshuai Sun
  • Zhiqiang Shen
  • Rongrong Ji

Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for the layer computation, then they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called γ-MoD. In γ-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely the rank of attention maps (ARank). Through ARank, we can effectively identify which layers are redundant and should be replaced with MoD layers. Based on ARank, we further propose two novel designs to maximize the computational sparsity of the MLLM while maintaining its performance, namely a shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD ones. To validate our method, we apply it to three popular MLLMs, and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs. For example, with a minor performance drop, i.e., -1.5%, γ-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively.
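
The ARank idea of scoring a layer by the rank of its attention maps can be illustrated with a small numerical-rank estimate; the SVD-based estimator and tolerance below are assumptions rather than the paper's exact metric.

```python
# Illustrative attention-map rank (ARank-style) score: layers whose attention maps have
# consistently low numerical rank are candidates for conversion to MoD layers. The
# tolerance and averaging are placeholder choices.
import torch

def attention_rank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Average numerical rank of per-head attention maps; attn has shape [heads, N, N]."""
    ranks = []
    for head in attn:
        s = torch.linalg.svdvals(head)                 # singular values of one head's map
        ranks.append((s > tol * s.max()).sum().item()) # count values above the tolerance
    return sum(ranks) / len(ranks)

attn = torch.softmax(torch.randn(12, 64, 64), dim=-1)
print(attention_rank(attn))   # a low value suggests the layer's token mixing is redundant
```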

AAAI Conference 2024 Conference Paper

3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

  • Changli Wu
  • Yiwei Ma
  • Qi Chen
  • Haowei Wang
  • Gen Luo
  • Jiayi Ji
  • Xiaoshuai Sun

In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only sets new performance standards, registering an mIoU gain of 11.7 points but also achieves a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN.

NeurIPS Conference 2024 Conference Paper

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

  • Mingrui Wu
  • Xinyue Cai
  • Jiayi Ji
  • Jiale Li
  • OuCheng Huang
  • Hao Fei
  • Guannan Jiang
  • Xiaoshuai Sun

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through learnable latent variable optimization. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output during inference, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
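
A toy version of the latent-optimization step might look like the following, where a latent added to the visual tokens is optimized so that attention mass concentrates on a referring region; the energy function, shapes, and step count are illustrative assumptions.

```python
# Toy sketch of training-free visual-prompt control: optimize a latent added to the
# visual tokens so attention mass inside a referring-region index set increases.
# The energy function and update rule are simplified stand-ins for the paper's.
import torch

def region_energy(text_q: torch.Tensor, vis_k: torch.Tensor, region: torch.Tensor) -> torch.Tensor:
    attn = torch.softmax(text_q @ vis_k.T / vis_k.shape[-1] ** 0.5, dim=-1)  # [T, V]
    return attn[:, region].sum()   # attention mass on visual tokens in the region

text_q = torch.randn(4, 64)        # text-prompt query states
vis_tokens = torch.randn(49, 64)   # visual tokens (e.g. a 7x7 grid)
region = torch.arange(10, 20)      # indices of the referring region
latent = torch.zeros_like(vis_tokens, requires_grad=True)

opt = torch.optim.Adam([latent], lr=1e-2)
for _ in range(50):
    loss = -region_energy(text_q, vis_tokens + latent, region)  # maximize region energy
    opt.zero_grad()
    loss.backward()
    opt.step()
```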

NeurIPS Conference 2024 Conference Paper

DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion

  • Ke Sun
  • Shen Chen
  • Taiping Yao
  • Hong Liu
  • Xiaoshuai Sun
  • Shouhong Ding
  • Rongrong Ji

The rapid progress of Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content. Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations. In this paper, we revisit the generation process and identify a universal principle: Deepfake images inherently contain information from both source and target identities, while genuine faces maintain a consistent identity. Building upon this insight, we introduce DiffusionFake, a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. DiffusionFake achieves this by injecting the features extracted by the detection model into a frozen pre-trained Stable Diffusion model, compelling it to reconstruct the corresponding target and source images. This guided reconstruction process constrains the detection network to capture the source- and target-related features needed for reconstruction, thereby learning rich and disentangled representations that are more resilient to unseen forgeries. Extensive experiments demonstrate that DiffusionFake significantly improves cross-domain generalization of various detector architectures without introducing additional parameters during inference. The code is available at https://github.com/skJack/DiffusionFake.git.

ICML Conference 2024 Conference Paper

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

  • Mingrui Wu
  • Jiayi Ji
  • Oucheng Huang
  • Jiale Li
  • Yuhang Wu 0004
  • Xiaoshuai Sun
  • Rongrong Ji

The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset’s long-tail distribution significantly impacts LVLMs’ understanding of visual relationships. Additionally, our analysis reveals that current LVLMs tend to overlook visual content, overly rely on the common sense knowledge of Large Language Models (LLMs), and struggle with spatial relationship reasoning based on contextual information.

ICML Conference 2024 Conference Paper

Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization

  • Jinlu Zhang 0002
  • Yiyi Zhou
  • Qiancheng Zheng
  • Xiaoxiong Du
  • Gen Luo
  • Jun Peng 0007
  • Xiaoshuai Sun
  • Rongrong Ji

Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hot spot in machine learning, which still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed E³-FaceNet. Different from existing complex generation paradigms, E³-FaceNet resorts to a direct mapping from text instructions to 3D-aware visual space. We introduce a novel Style Code Enhancer to enhance cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that E³-FaceNet can not only achieve picture-like 3D face generation and manipulation, but also improve inference speed by orders of magnitude. For instance, compared with Latent3D, E³-FaceNet speeds up five-view generation by almost 470 times, while still exceeding it in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.

NeurIPS Conference 2024 Conference Paper

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

  • Yiwei Ma
  • Jiayi Ji
  • Ke Ye
  • Weihuang Lin
  • Zhibin Wang
  • Yonghan Zheng
  • Qiang Zhou
  • Xiaoshuai Sun

Significant progress has been made in the field of Instruction-based Image Editing (IIE). However, evaluating these models poses a significant challenge. A crucial requirement in this field is the establishment of a comprehensive evaluation benchmark for accurately assessing editing results and providing valuable insights for its further development. In response to this need, we propose I2EBench, a comprehensive benchmark designed to automatically evaluate the quality of edited images produced by IIE models from multiple dimensions. I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension. 3) Valuable Research Insights: By analyzing the advantages and disadvantages of existing IIE models across the 16 dimensions, we offer valuable research insights to guide future development in the field. We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset, and generated images from all IIE models are provided on GitHub: https://github.com/cocoshe/I2EBench.

AAAI Conference 2024 Conference Paper

Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation

  • Tianyu Guo
  • Haowei Wang
  • Yiwei Ma
  • Jiayi Ji
  • Xiaoshuai Sun

Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.

NeurIPS Conference 2024 Conference Paper

RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

  • Changli Wu
  • Qi Chen
  • Jiayi Ji
  • Haowei Wang
  • Yiwei Ma
  • You Huang
  • Hao Fei
  • Xiaoshuai Sun

3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) by utilizing solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing the reasoning capabilities. The RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance’s positioning. Extensive testing on the ScanRefer benchmark has shown that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. All codes are available at https://github.com/sosppxo/RG-SAN.

ICML Conference 2024 Conference Paper

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

  • Danni Yang
  • Jiayi Ji
  • Yiwei Ma
  • Tianyu Guo 0005
  • Haowei Wang 0001
  • Xiaoshuai Sun
  • Rongrong Ji

In this paper, we introduce SemiRES, a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform RES. A significant hurdle in applying semi-supervised techniques to RES is the prevalence of noisy pseudo-labels, particularly at the boundaries of objects. SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarcation, to improve the accuracy of these pseudo-labels. Within SemiRES, we offer two alternative matching strategies: IoU-based Optimal Matching (IOM) and Composite Parts Integration (CPI). These strategies are designed to extract the most accurate masks from SAM’s output, thus guiding the training of the student model with enhanced precision. In instances where a precise mask cannot be matched from the available candidates, we develop the Pixel-Wise Adjustment (PWA) strategy, guiding the student model’s training directly by the pseudo-labels. Extensive experiments on three RES benchmarks—RefCOCO, RefCOCO+, and G-Ref—reveal its superior performance compared to fully supervised methods, especially in low-data scenarios. Remarkably, with only 1% labeled data, our SemiRES outperforms the supervised baseline by a large margin, e.g., +18.64% gains on the RefCOCO val set.
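
The IoU-based matching idea can be sketched as follows: among SAM's candidate masks, keep the one that best overlaps the pseudo-label, and fall back to the pseudo-label when no candidate agrees well. The threshold and helper names are assumptions, not SemiRES code.

```python
# Minimal IoU-based optimal-matching sketch for pseudo-label refinement.
# Shapes, the 0.5 threshold, and function names are illustrative assumptions.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def refine_pseudo_label(pseudo: np.ndarray, sam_masks: list[np.ndarray],
                        min_iou: float = 0.5) -> np.ndarray:
    scores = [iou(pseudo, m) for m in sam_masks]
    best = int(np.argmax(scores))
    # Use the SAM candidate only when it clearly agrees with the noisy pseudo-label.
    return sam_masks[best] if scores[best] >= min_iou else pseudo

pseudo = np.zeros((64, 64), bool); pseudo[10:40, 10:40] = True
cands = [np.zeros((64, 64), bool) for _ in range(3)]
cands[1][12:38, 12:42] = True
refined = refine_pseudo_label(pseudo, cands)   # picks the overlapping candidate
```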

AAAI Conference 2024 Conference Paper

Toward Open-Set Human Object Interaction Detection

  • Mingrui Wu
  • Yuqi Liu
  • Jiayi Ji
  • Xiaoshuai Sun
  • Rongrong Ji

This work is oriented toward the task of open-set Human Object Interaction (HOI) detection. The challenge lies in identifying completely new, out-of-domain relationships, as opposed to in-domain ones which have seen improvements in zero-shot HOI detection. To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM). We utilize a disentangled image-text contrastive learning metric for training and connect the bottom-up visual features to text embeddings through lightweight unary and pair-wise adapters. Our model can benefit from the open-set object detector and the VLM to detect novel action categories and combine actions with novel object categories. We further present the VG-HOI dataset, a comprehensive benchmark with over 17k HOI relationships for open-set scenarios. Experimental results show that our model can detect unknown action classes and combine unknown object classes. Furthermore, it can generalize to over 17k HOI classes while being trained on just 600 HOI classes.

AAAI Conference 2024 Conference Paper

Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks

  • Siyu Zou
  • Jiji Tang
  • Yiyi Zhou
  • Jing He
  • Chaoyi Zhao
  • Rongsheng Zhang
  • Zhipeng Hu
  • Xiaoshuai Sun

Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., 5 to 6 times faster. Our code is available at https://anonymous.4open.science/r/InstDiffEdit-C306
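
For illustration, an instant attention-derived mask could be obtained roughly as below: average the cross-attention of the edited token over heads and threshold it adaptively. The aggregation and threshold are simplifications of the paper's training-free refinement scheme.

```python
# Toy sketch of turning diffusion cross-attention maps into an instant editing mask.
# The head-averaging and mean+std threshold are placeholder choices, not InstDiffEdit's.
import torch

def instant_mask(cross_attn: torch.Tensor, token_idx: int) -> torch.Tensor:
    """cross_attn: [heads, H*W, tokens] -> binary mask of shape [H*W]."""
    m = cross_attn[:, :, token_idx].mean(0)           # average the target token's map over heads
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)    # normalize to [0, 1]
    return (m > m.mean() + m.std()).float()           # adaptive threshold

attn = torch.rand(8, 64 * 64, 77)                     # e.g. cross-attention at a 64x64 latent
mask = instant_mask(attn, token_idx=5).reshape(64, 64)
```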

ICML Conference 2024 Conference Paper

X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

  • Yiwei Ma
  • Zhekai Lin
  • Jiayi Ji
  • Yijun Fan
  • Xiaoshuai Sun
  • Rongrong Ji

Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential "Geometry→Texture→Animation" paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://anonymous1440.github.io/.

AAAI Conference 2024 Conference Paper

X-RefSeg3D: Enhancing Referring 3D Instance Segmentation via Structured Cross-Modal Graph Neural Networks

  • Zhipeng Qian
  • Yiwei Ma
  • Jiayi Ji
  • Xiaoshuai Sun

Referring 3D instance segmentation is a challenging task aimed at accurately segmenting a target instance within a 3D scene based on a given referring expression. However, previous methods have overlooked the distinct roles played by different words in referring expressions. Additionally, they have failed to incorporate the positional relationship within referring expressions with the spatial correlations in 3D scenes. To alleviate these issues, we present a novel model called X-RefSeg3D, which constructs a cross-modal graph for the input 3D scene and unites textual and spatial relationships for reasoning via graph neural networks. Our approach begins by capturing object-specific text features, which are then fused with the instance features to construct a comprehensive cross-modal scene graph. Subsequently, we integrate the obtained cross-modal features into graph neural networks, leveraging the K-nearest algorithm to derive explicit instructions from expressions and factual relationships in scenes. This enables the effective capture of higher-order relationships among instances, thereby enhancing feature fusion and facilitating reasoning. Finally, the refined feature undergoes a matching module to compute the ultimate matching score. Experimental results on ScanRefer demonstrate the effectiveness of our method, surpassing previous approaches by a substantial margin of +3.67% in terms of mIOU.

NeurIPS Conference 2023 Conference Paper

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

  • Gen Luo
  • Yiyi Zhou
  • Tianhe Ren
  • Shengxin Chen
  • Xiaoshuai Sun
  • Rongrong Ji

Recently, growing interest has been aroused in extending the multimodal capability of large language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive, which not only need to optimize excessive parameters, but also require another large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaption of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. Meanwhile, MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions without compromising their ability of natural language understanding. To validate MMA, we apply it to a recent LLM called LLaMA and term the resulting large vision-language instructed model LaVIN. We then conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN over existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, greatly confirming the effectiveness of MMA. Our code is anonymously released at https://anonymous.4open.science/r/LaVIN--1067.
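
A hedged sketch of the adapter-plus-routing idea follows: a lightweight bottleneck adapter whose two branches are mixed by an input-dependent router. Module names, the routing rule, and dimensions are illustrative, not LaVIN's actual implementation.

```python
# Hypothetical mixture-of-modality adapter: two small bottleneck branches (text-oriented
# and multimodal-oriented) combined by a learned router over the pooled input.
import torch
import torch.nn as nn

class ModalityAdapterSketch(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.text_branch = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.mm_branch = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.router = nn.Linear(dim, 2)   # scores the two branches from the input itself

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(x.mean(dim=1)), dim=-1)                # [B, 2]
        out = torch.stack([self.text_branch(x), self.mm_branch(x)], dim=-1)  # [B, T, D, 2]
        return x + (out * w[:, None, None, :]).sum(-1)                       # routed residual adapter

tokens = torch.randn(2, 32, 512)
print(ModalityAdapterSketch(512)(tokens).shape)   # torch.Size([2, 32, 512])
```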

AAAI Conference 2023 Conference Paper

End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

  • Mingrui Wu
  • Jiaxin Gu
  • Yunhang Shen
  • Mingbao Lin
  • Chao Chen
  • Xiaoshuai Sun

Most existing Human-Object Interaction (HOI) Detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously. The fundamental challenges are to discover potential human-object pairs and identify novel HOI categories. To overcome the above challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to achieve interaction distinguishment for human-object pairs in an action-agnostic manner. Then we transfer the distribution of action probability from the pretrained vision-language teacher as well as the seen ground truth to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our method outperforms the previous SOTA under various zero-shot settings. Moreover, our method is generalizable to large-scale object detection data to further scale up the action sets. The source code is available at: https://github.com/mrwu-mac/EoID.

NeurIPS Conference 2023 Conference Paper

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

  • Qiong Wu
  • Wei Yu
  • Yiyi Zhou
  • Shubin Huang
  • Xiaoshuai Sun
  • Rongrong Ji

With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significance of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a range of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g., -11.97% FLOPs for METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in our appendix.

AAAI Conference 2023 Conference Paper

Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

  • Haowei Wang
  • Jiayi Ji
  • Yiyi Zhou
  • Yongjian Wu
  • Xiaoshuai Sun

Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the target regions of an image corresponding to the text description. Existing approaches for PNG are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two innovative designs, i.e., Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to properly handle the many-to-many relationship between textual expressions and visual objects. LPA embeds local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help understand the complex semantic relationships, SAL introduces a bidirectional contrastive objective to regularize the semantic consistency across modalities. Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method achieves a significant improvement of up to 9.4% accuracy. More importantly, our EPNG is 10 times faster than the two-stage model. Meanwhile, the generalization ability of EPNG is also validated by zero-shot experiments on other grounding tasks. The source codes and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git.

NeurIPS Conference 2022 Conference Paper

Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach

  • Peng Mi
  • Li Shen
  • Tianhe Ren
  • Yiyi Zhou
  • Xiaoshuai Sun
  • Rongrong Ji
  • Dacheng Tao

Deep neural networks often suffer from poor generalization caused by complex and non-convex loss landscapes. One of the popular solutions is Sharpness-Aware Minimization (SAM), which smooths the loss landscape via minimizing the maximized change of training loss when adding a perturbation to the weight. However, we find the indiscriminate perturbation of SAM on all parameters is suboptimal, which also results in excessive computation, i.e., double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose an efficient and effective training scheme coined as Sparse SAM (SSAM), which achieves sparse perturbation by a binary mask. To obtain the sparse mask, we provide two solutions which are based on Fisher information and dynamic sparse training, respectively. In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., O(log T/√T). Sparse SAM not only has the potential for training acceleration but also smooths the loss landscape effectively. Extensive experimental results on CIFAR10, CIFAR100, and ImageNet-1K confirm the superior efficiency of our method to SAM, and the performance is preserved or even better with a perturbation of merely 50% sparsity. Code is available at https://github.com/Mi-Peng/Sparse-Sharpness-Aware-Minimization.
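
The sparsified-perturbation idea can be sketched in a few lines: only parameters selected by a binary mask receive the SAM ascent step before the second gradient is taken. The gradient-magnitude mask below is a stand-in for the paper's Fisher-information and dynamic-sparse-training variants, and all hyperparameters are placeholders.

```python
# Hedged sketch of one sharpness-aware step with a sparsified perturbation (not SSAM's code).
import torch

def ssam_step(model, loss_fn, data, target, rho=0.05, sparsity=0.5, lr=0.1):
    # 1) Gradient at the current weights.
    loss_fn(model(data), target).backward()
    grads = [p.grad for p in model.parameters()]
    with torch.no_grad():
        flat = torch.cat([g.abs().flatten() for g in grads])
        k = max(1, int((1 - sparsity) * flat.numel()))      # number of perturbed entries
        thresh = flat.topk(k).values.min()
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        eps = []
        for p, g in zip(model.parameters(), grads):
            mask = (g.abs() >= thresh).float()              # binary sparsity mask
            e = rho * mask * g / norm                       # masked ascent (perturbation)
            p.add_(e)
            eps.append(e)
    # 2) Gradient at the sparsely perturbed weights.
    model.zero_grad()
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                                       # undo the perturbation
            p.sub_(lr * p.grad)                             # plain SGD update with the SAM gradient
    model.zero_grad()

# Toy usage on a linear classifier.
model = torch.nn.Linear(10, 2)
ssam_step(model, torch.nn.functional.cross_entropy,
          torch.randn(16, 10), torch.randint(0, 2, (16,)))
```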

AAAI Conference 2021 Conference Paper

Dual-level Collaborative Transformer for Image Captioning

  • Yunpeng Luo
  • Jiayi Ji
  • Xiaoshuai Sun
  • Liujuan Cao
  • Yongjian Wu
  • Feiyue Huang
  • Chia-Wen Lin
  • Rongrong Ji

Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-way Self-Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both local and online test sets, i.e., 133.8% CIDEr on the Karpathy split and 135.4% CIDEr on the official split.

AAAI Conference 2021 Conference Paper

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

  • Jiayi Ji
  • Yunpeng Luo
  • Xiaoshuai Sun
  • Fuhai Chen
  • Gen Luo
  • Yongjian Wu
  • Yue Gao
  • Rongrong Ji

Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended into the vectorial representations to guide the caption decoding. However, such vectorial representations only contain region-level information without considering the global information reflecting the entire image, which fails to expand the capability of complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation, and then adaptively guide the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for the embedding of the global feature, and a Global Adaptive Decoder is designed for the guidance of the caption generation. The former models intra- and inter-layer global representation by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide the caption generation. Extensive experiments on the MS COCO dataset demonstrate the superiority of our GET over many state-of-the-art methods.

AAAI Conference 2020 Conference Paper

SSAH: Semi-Supervised Adversarial Deep Hashing with Self-Paced Hard Sample Generation

  • Sheng Jin
  • Shangchen Zhou
  • Yao Liu
  • Chao Chen
  • Xiaoshuai Sun
  • Hongxun Yao
  • Xian-Sheng Hua

Deep hashing methods have been proved to be effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. The current solutions to this issue utilize Generative Adversarial Networks (GANs) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generation and hashing learning as two isolated processes, leading to generation ineffectiveness. Besides, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-paced Adversarial Hashing method, named SSAH, to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generated images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled data, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve state-of-the-art models on both the widely-used hashing datasets and fine-grained datasets.

AAAI Conference 2019 Conference Paper

Dynamic Capsule Attention for Visual Question Answering

  • Yiyi Zhou
  • Rongrong Ji
  • Jinsong Su
  • Xiaoshuai Sun
  • Weiqiu Chen

In visual question answering (VQA), recent advances have well advocated the use of attention mechanism to precisely link the question to the potential answer areas. As the difficulty of the question increases, more VQA models adopt multiple attention layers to capture the deeper visual-linguistic correlation. But a negative consequence is the explosion of parameters, which makes the model vulnerable to over-fitting, especially when limited training examples are given. In this paper, we propose an extremely compact alternative to this static multi-layer architecture towards accurate yet efficient attention modeling, termed as Dynamic Capsule Attention (CapsAtt). Inspired by the recent work of Capsule Network, CapsAtt treats visual features as capsules and obtains the attention output via dynamic routing, which updates the attention weights by calculating coupling coefficients between the underlying and output capsules. Meanwhile, CapsAtt also discards redundant projection matrices to make the model much more compact. We quantify CapsAtt on three benchmark VQA datasets, i.e., COCO-QA, VQA1.0 and VQA2.0. Compared to the traditional multi-layer attention model, CapsAtt achieves significant improvements of up to 4.1%, 5.2% and 2.2% on the three datasets, respectively. Moreover, with much fewer parameters, our approach also yields competitive results compared to the latest VQA models. To further verify the generalization ability of CapsAtt, we also deploy it on another challenging multi-modal task of image captioning, where state-of-the-art performance is achieved with a simple network structure.
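
A compact sketch of capsule-style dynamic-routing attention is shown below: coupling coefficients are refined by the agreement between the pooled output and each visual capsule. The squash-style scaling and iteration count are assumptions, not the paper's exact CapsAtt update.

```python
# Illustrative dynamic-routing attention over visual features (not CapsAtt's actual code).
import torch

def capsule_attention(vis: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """vis: [N, D] visual capsules -> [D] attended output via dynamic routing."""
    logits = torch.zeros(vis.shape[0])
    for _ in range(iters):
        weights = torch.softmax(logits, dim=0)   # coupling coefficients
        out = (weights[:, None] * vis).sum(0)    # weighted aggregation of capsules
        out = out / (1 + out.norm())             # squash-style scaling
        logits = logits + vis @ out              # update by agreement with the output
    return out

features = torch.randn(36, 256)                  # e.g. a 6x6 grid of visual features
attended = capsule_attention(features)
```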

AAAI Conference 2019 Conference Paper

Free VQA Models from Knowledge Inertia by Pairwise Inconformity Learning

  • Yiyi Zhou
  • Rongrong Ji
  • Jinsong Su
  • Xiangming Li
  • Xiaoshuai Sun

In this paper, we uncover the issue of knowledge inertia in visual question answering (VQA), which commonly exists in most VQA models and forces the models to mainly rely on the question content to “guess” the answer, without regard to the visual information. Such an issue not only impairs the performance of VQA models, but also greatly reduces the credibility of the answer prediction. To this end, simply highlighting the visual features in the model is not feasible, since the prediction is built upon the joint modeling of two modalities and largely influenced by the data distribution. In this paper, we propose a Pairwise Inconformity Learning (PIL) to tackle the issue of knowledge inertia. In particular, PIL takes full advantage of the similar image pairs with diverse answers to an identical question provided in the VQA2.0 dataset. It builds a multi-modal embedding space to project pos./neg. feature pairs, upon which word vectors of answers are modeled as anchors. By doing so, PIL strengthens the importance of visual features in prediction with a novel dynamic-margin based triplet loss that efficiently increases the semantic discrepancies between pos./neg. image pairs. To verify the proposed PIL, we plug it into a baseline VQA model as well as a set of recent VQA models, and conduct extensive experiments on two benchmark datasets, i.e., VQA1.0 and VQA2.0. Experimental results show that PIL can boost the accuracy of the existing VQA models (1.56%-2.93% gain) with a negligible increase in parameters (0.85%-5.4%). Qualitative results also reveal the elimination of knowledge inertia in the existing VQA models after implementing our PIL.
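
As a rough illustration, a dynamic-margin triplet loss over a positive/negative image pair anchored on the answer embedding might be written as follows; the margin schedule is an assumption, not PIL's exact formulation.

```python
# Hypothetical dynamic-margin triplet loss: the margin grows when the pos./neg. image
# embeddings are still similar, pushing the pair apart in the multi-modal space.
import torch
import torch.nn.functional as F

def dynamic_margin_triplet(anchor: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor,
                           base_margin: float = 0.2) -> torch.Tensor:
    # anchor: answer word embedding; pos/neg: joint embeddings of the paired images.
    sim = F.cosine_similarity(pos, neg, dim=-1).clamp(min=0)
    margin = base_margin * (1 + sim)          # larger margin for pairs that remain similar
    d_pos = (anchor - pos).pow(2).sum(-1)
    d_neg = (anchor - neg).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = dynamic_margin_triplet(torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 300))
```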

IJCAI Conference 2019 Conference Paper

Hypergraph Induced Convolutional Manifold Networks

  • Taisong Jin
  • Liujuan Cao
  • Baochang Zhang
  • Xiaoshuai Sun
  • Cheng Deng
  • Rongrong Ji

Deep convolutional neural networks (DCNN) with manifold embedding have achieved considerable attention in computer vision. However, prior arts are usually based on neighborhood-based graphs that model only the pairwise relationship between two samples, which fail to fully capture intra-class variations and thus suffer from severe performance loss for noisy data. Since such intra-class variations can be well captured via a sophisticated hypergraph structure, we are motivated to propose a hypergraph-induced Convolutional Manifold Network (H-CMN) that significantly improves the representation capacity of DCNN for complex data. Specifically, two innovative designs are provided: 1) our manifold-preserving method is implemented based on a mini-batch, which can be efficiently plugged into existing DCNN training pipelines and is scalable for large datasets; 2) a robust hypergraph is built for each mini-batch, which not only offers strong robustness against typical noise, but also captures the variances from multiple features. Extensive experiments on the image classification task on large benchmarking datasets demonstrate that our model achieves much better performance than the state-of-the-art.

NeurIPS Conference 2019 Conference Paper

Information Competing Process for Learning Diversified Representations

  • Jie Hu
  • Rongrong Ji
  • Shengchuan Zhang
  • Xiaoshuai Sun
  • Qixiang Ye
  • Chia-Wen Lin
  • Qi Tian

Learning representations with diversified information remains as an open problem. Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. Aiming to enrich the information carried by feature representations, ICP separates a representation into two parts with different mutual information constraints. The separated parts are forced to accomplish the downstream task independently in a competitive environment which prevents the two parts from learning what each other learned for the downstream task. Such competing parts are then combined synergistically to complete the task. By fusing representation parts learned competitively under different conditions, ICP facilitates obtaining diversified representations which contain rich information. Experiments on image classification and image reconstruction tasks demonstrate the great potential of ICP to learn discriminative and disentangled representations in both supervised and self-supervised learning settings.

AAAI Conference 2019 Conference Paper

Towards Optimal Discrete Online Hashing with Balanced Similarity

  • Mingbao Lin
  • Rongrong Ji
  • Hong Liu
  • Xiaoshuai Sun
  • Yongjian Wu
  • Yunsheng Wu

When facing large-scale image datasets, online hashing serves as a promising solution for online retrieval and prediction tasks. It encodes the online streaming data into compact binary codes, and simultaneously updates the hash functions to renew codes of the existing dataset. To this end, the existing methods update hash functions solely based on the new data batch, without investigating the correlation between such new data and the existing dataset. In addition, existing works update the hash functions using a relaxation process in its corresponding approximated continuous space, and it remains an open problem to directly apply discrete optimization in online hashing. In this paper, we propose a novel supervised online hashing method, termed Balanced Similarity for Online Discrete Hashing (BSODH), to solve the above problems in a unified framework. BSODH employs a well-designed hashing algorithm to preserve the similarity between the streaming data and the existing dataset via an asymmetric graph regularization. We further identify the “data-imbalance” problem brought by the constructed asymmetric graph, which restricts the application of discrete optimization in our problem. Therefore, a novel balanced similarity is further proposed, which uses two equilibrium factors to balance the similar and dissimilar weights and eventually enables the usage of discrete optimizations. Extensive experiments conducted on three widely-used benchmarks demonstrate the advantages of the proposed method over the state-of-the-art methods.

AAAI Conference 2019 Conference Paper

Towards Optimal Fine Grained Retrieval via Decorrelated Centralized Loss with Normalize-Scale Layer

  • Xiawu Zheng
  • Rongrong Ji
  • Xiaoshuai Sun
  • Baochang Zhang
  • Yongjian Wu
  • Feiyue Huang

Recent advances on fine-grained image retrieval prefer learning a convolutional neural network (CNN) with a loss function designed on the fully-connected layer for discriminative feature representation. Essentially, such a loss should establish a robust metric to efficiently distinguish high-dimensional features within and outside fine-grained categories. To this end, the existing loss functions are deficient in two aspects: (a) The feature relationship is encoded inside the training batch. Such a local scope leads to low accuracy. (b) The error is established by the mean square, which requires pairwise distance computation over the training set and results in low efficiency. In this paper, we propose a novel metric learning scheme, termed Normalize-Scale Layer and Decorrelated Global Centralized Ranking Loss, which achieves extremely efficient and discriminative learning, i.e., a 5× speedup over the triplet loss and a 12% recall boost on CARS196. Our method originates from the classic softmax loss, which has a global structure but does not directly optimize the distance metric or the inter-/intra-class distances. We tackle this issue through a hypersphere layer and a global centralized ranking loss with pairwise decorrelated learning. In particular, we first propose a Normalize-Scale Layer to eliminate the gap between metric distance (for measuring distance in retrieval) and dot product (for dimension reduction in classification). Second, the relationship between features is encoded under a global centralized ranking loss, which targets optimizing the metric distance globally and accelerating the learning procedure. Finally, the centers are further decorrelated by the Gram-Schmidt process, leading to extreme efficiency (requiring only 20 training epochs) and discriminability in feature learning. We have conducted quantitative evaluations on two fine-grained retrieval benchmarks. The superior performance demonstrates the merits of the proposed approach over the state-of-the-arts.
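
The Normalize-Scale Layer itself is simple enough to sketch directly: L2-normalize features onto a hypersphere and multiply by a learnable scale, so that metric distance and dot product agree up to that scale. The initial scale value below is an assumption.

```python
# Minimal normalize-scale layer sketch: unit-norm features times a learned scalar.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizeScale(nn.Module):
    def __init__(self, init_scale: float = 16.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * F.normalize(x, dim=-1)   # features projected onto a hypersphere

emb = NormalizeScale()(torch.randn(4, 128))
print(emb.norm(dim=-1))   # every row has norm equal to the learned scale
```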

NeurIPS Conference 2019 Conference Paper

Variational Structured Semantic Inference for Diverse Image Captioning

  • Fuhai Chen
  • Rongrong Ji
  • Jiayi Ji
  • Xiaoshuai Sun
  • Baochang Zhang
  • Xuri Ge
  • Yongjian Wu
  • Feiyue Huang

Despite the exciting progress in image captioning, generating diverse captions for a given image remains as an open problem. Existing methods typically apply generative models such as Variational Auto-Encoder to diversify the captions, which however neglect two key factors of diverse expression, i.e., the lexical diversity and the syntactic diversity. To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema. VSSI-cap mainly innovates in a novel structure, i.e., the Variational Multi-modal Inferring tree (termed VarMI-tree). In particular, conditioned on the visual-textual features from the encoder, the VarMI-tree models the lexical and syntactic diversities by inferring their latent variables (with variations) in an approximate posterior inference guided by a visual semantic prior. Then, a reconstruction loss and the posterior-prior KL-divergence are jointly estimated to optimize the VSSI-cap model. Finally, diverse captions are generated upon the visual features and the latent variables from this structured encoder-inferer-decoder model. Experiments on the benchmark dataset show that the proposed VSSI-cap achieves significant improvements over the state-of-the-arts.

IJCAI Conference 2018 Conference Paper

Centralized Ranking Loss with Weakly Supervised Localization for Fine-Grained Object Retrieval

  • Xiawu Zheng
  • Rongrong Ji
  • Xiaoshuai Sun
  • Yongjian Wu
  • Feiyue Huang
  • Yanhua Yang

Fine-grained object retrieval has attracted extensive research focus recently. Its state-of-the-art schemes are typically based upon convolutional neural network (CNN) features. Despite the extensive progress, two issues remain open. On one hand, the deep features are coarsely extracted at image level rather than precisely at object level, and are thus interrupted by background clutters. On the other hand, training CNN features with a standard triplet loss is time consuming and incapable of learning discriminative features. In this paper, we present a novel fine-grained object retrieval scheme that conquers these issues in a unified framework. Firstly, we introduce a novel centralized ranking loss (CRL), which achieves very efficient (1,000× training speedup compared to the triplet loss) and discriminative feature learning by a "centralized" global pooling. Secondly, a weakly supervised attractive feature extraction is proposed, which segments object contours with top-down saliency. Consequently, the contours are integrated into the CNN response map to precisely extract features "within" the target object. Interestingly, we have discovered that the combination of CRL and weakly supervised learning can reinforce each other. We evaluate the performance of the proposed scheme on widely-used benchmarks including CUB200-2011 and CARS196. We have reported significant gains over the state-of-the-art schemes, e.g., 5.4% over SCDA [Wei et al., 2017] on CARS196, and 3.7% on CUB200-2011.

AAAI Conference 2017 Conference Paper

An Integrated Model for Effective Saliency Prediction

  • Xiaoshuai Sun
  • Zi Huang
  • Hongzhi Yin
  • Heng Tao Shen

In this paper, we propose an integrated model of both semantic-aware and contrast-aware saliency (SCA), combining bottom-up and top-down cues for effective eye fixation prediction. The proposed SCA model contains two pathways. The first pathway is a deep neural network customized for semantic-aware saliency, which aims to capture the semantic information in images, especially the presence of meaningful objects and object parts. The second pathway is based on on-line feature learning and information maximization, which learns an adaptive representation for the input and discovers the high-contrast salient patterns within the image context. The two pathways characterize both long-term and short-term attention cues and are integrated using maxima normalization. Experimental results on artificial images and several benchmark datasets demonstrate the superior performance and better plausibility of the proposed model over both classic approaches and recent deep models.
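
A minimal sketch of fusing the two pathways with maxima normalization is given below, following the classic normalization idea of weighting each map by how much its global peak stands out from its other local maxima; the filter size and weighting are assumptions.

```python
# Toy fusion of two saliency pathways via maxima normalization (not the SCA model's code).
import numpy as np
from scipy.ndimage import maximum_filter

def maxima_normalize(s: np.ndarray) -> np.ndarray:
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)        # rescale to [0, 1]
    local_max = (s == maximum_filter(s, size=5)) & (s > 0)  # local maxima locations
    peaks = s[local_max]
    others = peaks[peaks < peaks.max()]
    m_bar = others.mean() if others.size else 0.0          # mean of non-global local maxima
    return s * (s.max() - m_bar) ** 2                      # promote maps with a dominant peak

semantic = np.random.rand(60, 80)   # stand-in for the semantic-aware pathway output
contrast = np.random.rand(60, 80)   # stand-in for the contrast-aware pathway output
saliency = maxima_normalize(semantic) + maxima_normalize(contrast)
```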

AAAI Conference 2017 Conference Paper

Web-Based Semantic Fragment Discovery for On-Line Lingual-Visual Similarity

  • Xiaoshuai Sun
  • Jiewei Cao
  • Chao Li
  • Lei Zhu
  • Heng Tao Shen

In this paper, we present an automatic approach for on-line discovery of visual-lingual semantic fragments from weakly labeled Internet images. Instead of learning region-entity correspondences from well-labeled image-sentence pairs, our approach directly collects and enhances the weakly labeled visual contents from the Web and constructs an adaptive visual representation which automatically links generic lingual phrases to their related visual contents. To ensure reliable and efficient semantic discovery, we adopt non-parametric density estimation to re-rank the related visual instances and propose a fast self-similarity-based quality assessment method to identify the high-quality semantic fragments. The discovered semantic fragments provide an adaptive joint representation for texts and images, based on which lingual-visual similarity can be defined for further co-analysis of heterogeneous multimedia data. Experimental results on semantic fragment quality assessment, sentence-based image retrieval, automatic multimedia insertion and ordering demonstrated the effectiveness of the proposed framework. The experiments show that the proposed methods can make effective use of the Web knowledge, and are able to generate competitive results compared to state-of-the-art approaches in various tasks.

TIST Journal 2012 Journal Article

Context-Aware Semi-Local Feature Detector

  • Rongrong Ji
  • Hongxun Yao
  • Qi Tian
  • Pengfei Xu
  • Xiaoshuai Sun
  • Xianming Liu

How can interest point detectors benefit from contextual cues? In this article, we introduce a context-aware semi-local detector (CASL) framework to give a systematic answer with three contributions: (1) We integrate the context of interest points to recurrently refine their detections. (2) This integration boosts interest point detectors from the traditionally local scale to a semi-local scale to discover more discriminative salient regions. (3) Such a context-aware structure further enables us to bring forward category learning (usually in the subsequent recognition phase) into interest point detection to locate category-aware, meaningful salient regions. Our CASL detector consists of two phases. The first phase accumulates multiscale spatial correlations of local features into a difference of contextual Gaussians (DoCG) field. DoCG quantizes detector context to highlight contextually salient regions at a semi-local scale, which also reveals visual attention to a certain extent. The second phase locates contextual peaks by mean shift search over the DoCG field, which subsequently integrates contextual cues into feature description. This phase enables us to integrate category learning into mean shift search kernels. This learning-based CASL mechanism produces more category-aware features, which substantially benefits the subsequent visual categorization process. We conducted experiments in image search, object characterization, and feature detector repeatability evaluations, which reported superior discriminability and comparable repeatability to state-of-the-art works.