Author name cluster

Xiangyang Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

1 author row

EAAI Journal 2026 Journal Article

Anti-forensic for quantization steps estimation based on direct and preemptive adversarial attacks

Jiawei Zhang
Mengjie Wang
Hao Wu
Xin Cheng
Xiangyang Luo
Bin Ma
Hao Wang
Jinwei Wang

Joint Photographic Experts Group (JPEG) quantization steps estimation aims to reveal the compressed history of the images, which can serve as an essential component and powerful technique to support forensics. Nowadays, various deep learning-based estimation methods have been proposed to achieve higher accuracy. However, due to supposing the ideal secure conditions of estimation, their robustness against deliberate attacks (especially adversarial attacks) has not been thoroughly studied, which poses a significant threat to their reliability. To address this issue, as the first attempt, we investigate the robustness of deep learning-based estimation methods against adversarial attacks, which can significantly deteriorate estimation accuracy without noticeable distortion. Specifically, we introduce a generation-based adversarial attack framework and propose two types of anti-forensic attacks, Direct Attack (DA) and Preemptive Attack (PA), to craft adversarial examples on double and single compressed images. To maximize the attack ability, we study the effect of regression and classification objectives on the adversarial property and design a joint loss function for stable and smooth optimization. Extensive experiments prove that the proposed DA and PA can achieve a high attack ability with low perturbation magnitude and satisfactory visual quality. More importantly, the generated adversarial examples present superior transferability across different estimation models and datasets, which proves the generality of the proposed method and also reveals the vulnerability of the existing deep learning-based estimation methods towards adversarial examples. Our code will be publicly available soon.

Details DOI

AAAI Conference 2026 Conference Paper

FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Xiangyang Luo
Qingyu Li
Xiaokun Liu
Wenyu Qin
Miao Yang
Meng Wang
Pengfei Wan
Di Zhang

Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce \textbf{FilmWeaver}, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content.

PDF Details DOI

AAAI Conference 2026 Conference Paper

VoiceCloak: A Multi-Dimensional Defense Framework Against Unauthorized Diffusion-Based Voice Cloning

Qianyue Hu
Junyan Wu
Wei Lu
Xiangyang Luo

Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Additional audio samples of VoiceCloak are available in demo pages.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

Flow Matching Based Sequential Recommender Model

Feng Liu
Lixin Zou
Xiangyu Zhao
Min Tang
Liming Dong
Dan Luo
Xiangyang Luo
Chenliang Li

Generative models, particularly diffusion model, have emerged as powerful tools for sequential recommendation. However, accurately modeling user preferences remains challenging due to the noise perturbations inherent in the forward and reverse processes of diffusion-based methods. Towards this end, this study introduces FMRec, a Flow Matching based model that employs a straight flow trajectory and a modified loss tailored for the recommendation task. Additionally, from the diffusion-model perspective, we integrate a reconstruction loss to improve robustness against noise perturbations, thereby retaining user preferences during the forward process. In the reverse process, we employ a deterministic reverse sampler, specifically an ODE-based updating function, to eliminate unnecessary randomness, thereby ensuring that the generated recommendations closely align with user needs. Extensive evaluations on four benchmark datasets reveal that FMRec achieves an average improvement of 6. 53% over state-of-the-art methods. The replication code is available at https: //github. com/FengLiu-1/FMRec.

PDF Details DOI

AAAI Conference 2025 Conference Paper

GLCF: A Global-Local Multimodal Coherence Analysis Framework for Talking Face Generation Detection

Xiaocan Chen
Qilin Yin
Jiarui Liu
Wei Lu
Xiangyang Luo
Jiantao Zhou

Talking face generation (TFG) allows for producing lifelike talking videos of any character using only facial images and accompanying text. Abuse of this technology could pose significant risks to society, creating the urgent need for research into corresponding detection methods. However, research in this field has been hindered by the lack of public datasets. In this paper, we construct the first large-scale multi-scenario talking face dataset (MSTF), which contains 22 audio and video forgery techniques, filling the gap of datasets in this field. The dataset covers 11 generation scenarios and more than 20 semantic scenarios, closer to the practical application scenario of TFG. Besides, we also propose a TFG detection framework, which leverages the analysis of both global and local coherence in the multimodal content of TFG videos. Therefore, a region-focused smoothness detection module (RSFDM) and a discrepancy capture-time frame aggregation module (DCTAM) are introduced to evaluate the global temporal coherence of TFG videos, aggregating multi-grained spatial information. Additionally, a visual-audio fusion module (V-AFM) is designed to evaluate audiovisual coherence within a localized temporal perspective. Comprehensive experiments demonstrate the reasonableness and challenges of our datasets, while also indicating the superiority of our proposed method compared to the state-of-the-art deepfake detection approaches.

PDF Details DOI

AAAI Conference 2025 Conference Paper

PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis

Yifan Xie
Tao Feng
Xin Zhang
Xiangyang Luo
Zixuan Guo
Weijiang Yu
Heng Chang
Fei Ma

Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, the initial step involves generating the corresponding lip point cloud from the audio signal and capturing its topological structure. The design of the dynamic difference encoder aims to capture the subtle nuances inherent in dynamic lip movements more effectively. Furthermore, we integrate the audio-point enhancement module, which not only ensures the synchronization of the audio signal with the corresponding lip point cloud within the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.

PDF Details DOI

AAAI Conference 2025 Conference Paper

RaCMC: Residual-Aware Compensation Network with Multi-Granularity Constraints for Fake News Detection

Xinquan Yu
Ziqi Sheng
Wei Lu
Xiangyang Luo
Jiantao Zhou

Multimodal fake news detection aims to automatically identify real or fake news, thereby mitigating the adverse effects caused by such misinformation. Although prevailing approaches have demonstrated their effectiveness, challenges persist in cross-modal feature fusion and refinement for classification. To address this, we present a residual-aware compensation network with multi-granularity constraints (RaCMC) for fake news detection, that aims to sufficiently interact and fuse cross-modal features while amplifying the differences between real and fake news. First, a multiscale residual-aware compensation module is designed to interact and fuse features at different scales, and ensure both the consistency and exclusivity of feature interaction, thus acquiring high-quality features. Second, a multi-granularity constraints module is implemented to limit the distribution of both the news overall and the image-text pairs within the news, thus amplifying the differences between real and fake news at the news and feature levels. Finally, a dominant feature fusion reasoning module is developed to comprehensively evaluate news authenticity from the perspectives of both consistency and inconsistency. Experiments on three public datasets, including Weibo17, Politifact and GossipCop, reveal the superiority of the proposed method.

PDF Details DOI

AAAI Conference 2025 Conference Paper

SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints

Ziqi Sheng
Wei Lu
Xiangyang Luo
Jiantao Zhou
Xiaochun Cao

Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address these challenges, we introduce a novel information-theoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several orthogonal individual image features. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL to existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.

PDF Details DOI

IJCAI Conference 2025 Conference Paper

Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network

Junyan Wu
Wenbo Xu
Wei Lu
Xiangyang Luo
Rui Yang
Shize Guo

Audio temporal forgery localization (ATFL) aims to find the precise forgery regions of the partial spoof audio that is purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are obtained costly and challenging in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision manners to prompt localization performance under weak supervision scenarios. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed by using utterance-level annotations together with learnable prompts, which can incorporate semantic priors into temporal content features dynamically. In addition, a forgery localization module is applied to produce forgery proposals based on fused forgery-class activation sequences. Finally, a progressive refinement strategy is introduced to generate pseudo frame-level labels and leverage supervised semantic contrastive learning to amplify the semantic distinction between real and fake content, thereby continuously optimizing forgery-aware features. Extensive experiments show that the proposed LOCO achieves SOTA performance on three public benchmarks.

PDF Details DOI

IJCAI Conference 2019 Conference Paper

A Review-Driven Neural Model for Sequential Recommendation

Chenliang Li
Xichuan Niu
Xiangyang Luo
Zhenzhong Chen
Cong Quan

Writing review for a purchased item is a unique channel to express a user's opinion in E-Commerce. Recently, many deep learning based solutions have been proposed by exploiting user reviews for rating prediction. In contrast, there has been few attempt to enlist the semantic signals covered by user reviews for the task of collaborative filtering. In this paper, we propose a novel review-driven neural sequential recommendation model (named RNS) by considering user's intrinsic preference (long-term) and sequential patterns (short-term). In detail, RNS is devised to encode each user or item with the aspect-aware representations extracted from the reviews. Given a sequence of historical purchased items for a user, we devise a novel hierarchical attention over attention mechanism to capture sequential patterns at both union-level and individual-level. Extensive experiments on three real-world datasets of different domains demonstrate that RNS obtains significant performance improvement over uptodate state-of-the-art sequential recommendation models.

PDF Details