Arrow Research search

Author name cluster

Wenhao Sun

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity-disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI Conference 2026 Conference Paper

AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

  • Jingyi Liao
  • Yongyi Su
  • Rong-Cheng Tu
  • Zhao Jin
  • Wenhao Sun
  • Yiting Li
  • Xun Xu
  • Dacheng Tao

While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO)-based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes, which encourages immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets shows that our method achieves superior accuracy by enabling general-purpose MLLMs to acquire fine-grained visual discrimination for detecting subtle manufacturing defects.
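
A minimal sketch of the fine-grained reward idea described in this abstract, in plain Python. The box format, the IoU-based localization term, the weights, and all function names are illustrative assumptions; the abstract does not specify the authors' exact reward design. The second helper shows why continuous rewards matter for GRPO: group-relative advantages vanish when every sampled response earns the same binary reward.

    def iou(box_a, box_b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def fine_grained_reward(pred_label, gt_label, pred_box, gt_box,
                            w_cls=0.5, w_loc=0.5):
        # Blend binary classification feedback with a continuous localization
        # score, so answers that are correct for the right reason earn more.
        r_cls = 1.0 if pred_label == gt_label else 0.0
        r_loc = iou(pred_box, gt_box) if (pred_box and gt_box) else 0.0
        return w_cls * r_cls + w_loc * r_loc

    def group_relative_advantages(rewards):
        # GRPO normalizes rewards within each sampled group; uniform rewards
        # (all 0 or all 1) give zero advantage and hence no learning signal.
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        return [0.0 if std == 0 else (r - mean) / std for r in rewards]

    # Two responses with the same correct label but different localization
    # quality now receive different rewards, so the group is not uniform.
    rewards = [fine_grained_reward(1, 1, (0, 0, 10, 10), (1, 1, 9, 9)),
               fine_grained_reward(1, 1, (40, 40, 50, 50), (1, 1, 9, 9))]
    print(group_relative_advantages(rewards))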

AAAI Conference 2026 Conference Paper

Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation

  • Jianzhi Long
  • Wenhao Sun
  • Rong-Cheng Tu
  • Dacheng Tao

Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), caching static features to bypass most model layers at inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers to obtain extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
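
A rough illustration of the compute-once caching idea from this abstract. Everything here (FeatureCache, encode_reference, denoise_step, the cache key, the step count) is an invented placeholder showing the control flow, not the paper's actual interface.

    class FeatureCache:
        # Compute-once store for features that stay constant across denoising steps.
        def __init__(self):
            self._store = {}

        def get_or_compute(self, key, compute_fn):
            if key not in self._store:
                self._store[key] = compute_fn()
            return self._store[key]

    def encode_reference(image):
        # Placeholder for the expensive reference-feature encoder.
        return ("static-features-of", image)

    def denoise_step(latents, ref_feats, step):
        # Placeholder for one denoising step consuming the cached features.
        return latents

    cache = FeatureCache()
    latents, ref_image = "noise", "portrait.png"
    for step in range(25):
        # Static features are encoded once at step 0 and reused afterwards,
        # skipping the layers that would otherwise recompute them every step.
        feats = cache.get_or_compute("reference", lambda: encode_reference(ref_image))
        latents = denoise_step(latents, feats, step)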

ICML Conference 2025 Conference Paper

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

  • Wenhao Sun
  • Rong-Cheng Tu
  • Jingyi Liao
  • Zhao Jin
  • Dacheng Tao

Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video diffusion sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. Our AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule to distribute the reduction across components adaptively. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
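
A hedged sketch of token reduction and restoration with a per-block, per-phase keep ratio. The similarity-to-mean redundancy score, the mean-fill restoration, and the schedule dict are crude stand-ins for the paper's matching procedure and reduction schedule, chosen only to make the asymmetric structure concrete.

    import torch

    def reduce_tokens(x, keep_ratio):
        # x: (tokens, dim). Keep the least redundant tokens; redundancy is
        # crudely scored here as cosine similarity to the mean token.
        n = x.shape[0]
        keep = max(1, int(n * keep_ratio))
        sim = torch.nn.functional.cosine_similarity(
            x, x.mean(0, keepdim=True), dim=-1)
        kept_idx = sim.argsort()[:keep]   # most distinctive tokens survive
        return x[kept_idx], kept_idx

    def restore_tokens(y, kept_idx, n):
        # Scatter processed tokens back to the full sequence; dropped slots
        # are filled with the mean output purely for illustration.
        out = y.mean(0, keepdim=True).expand(n, y.shape[-1]).clone()
        out[kept_idx] = y
        return out

    # Asymmetric schedule: different keep ratios per block and denoising phase.
    schedule = {("block_3", "early"): 0.5, ("block_3", "late"): 0.8}

    x = torch.randn(1024, 64)
    reduced, idx = reduce_tokens(x, schedule[("block_3", "early")])
    attended = reduced                     # stand-in for the attention op
    full = restore_tokens(attended, idx, x.shape[0])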

AAAI Conference 2025 Conference Paper

Attentive Eraser: Unleashing Diffusion Model’s Object Removal Potential via Self-Attention Redirection Guidance

  • Wenhao Sun
  • Xue-Mei Dong
  • Benlei Cui
  • Jingqun Tang

Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and the incapacity to repaint foreground object areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method to empower pre-trained diffusion models for stable and effective object removal. Firstly, in light of the observation that the self-attention maps influence the structure and shape details of the generated images, we propose Attention Activation and Suppression (AAS), which re-engineers the self-attention mechanism within the pre-trained diffusion models based on the given mask, thereby prioritizing the background over the foreground object during the reverse generation process. Moreover, we introduce Self-Attention Redirection Guidance (SARG), which utilizes the self-attention redirected by AAS to guide the generation process, effectively removing foreground objects within the mask while simultaneously generating content that is both plausible and coherent. Experiments demonstrate the stability and effectiveness of Attentive Eraser in object removal across a variety of pre-trained diffusion models, outperforming even training-based methods. Furthermore, Attentive Eraser can be implemented in various diffusion model architectures and checkpoints, enabling excellent scalability.
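
A toy sketch of mask-driven attention re-weighting in the spirit of the AAS/SARG description above. The suppression constant, the uniform background boost, and the 2D logits layout are illustrative guesses, not the paper's formulation.

    import torch

    def redirect_self_attention(attn_logits, fg_mask, suppress=-1e9, boost=1.0):
        # attn_logits: (queries, keys); fg_mask: (keys,) bool, True = object
        # to erase. Suppress attention toward foreground keys and mildly boost
        # background keys, so masked pixels are repainted from background context.
        logits = attn_logits.clone()
        logits[:, fg_mask] = suppress      # suppression: stop attending to the object
        logits[:, ~fg_mask] += boost       # activation: favor background keys
        return logits.softmax(dim=-1)

    logits = torch.randn(16, 16)
    fg = torch.zeros(16, dtype=torch.bool)
    fg[4:8] = True                         # hypothetical foreground token span
    weights = redirect_self_attention(logits, fg)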

ICRA Conference 2025 Conference Paper

KARMA: Augmenting Embodied AI Agents with Long-and-Short Term Memory Systems

  • Zixuan Wang
  • Bo Yu 0014
  • Junzhe Zhao
  • Wenhao Sun
  • Sai Hou
  • Shuai Liang
  • Xing Hu 0001
  • Yinhe Han 0001

Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects' positions and states. This dual-memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by $1.3\times$ and $2.3\times$ in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by $3.4\times$ and $62.7\times$. Furthermore, we demonstrate that KARMA's plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms. Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. Our code is available at https://github.com/WZX0Swarm0Robotics/KARMA/tree/master.
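
A minimal sketch of the dual-memory structure described in this abstract. The class names, the least-recently-updated eviction rule, the scene-graph encoding, and the prompt layout are all assumptions for illustration; the released code at the link above is the authoritative implementation.

    from collections import OrderedDict

    class ShortTermMemory:
        # Bounded store of recent object states; least-recently-updated
        # entries are evicted first (one plausible replacement strategy).
        def __init__(self, capacity=32):
            self.capacity = capacity
            self.states = OrderedDict()

        def update(self, obj_id, state):
            self.states.pop(obj_id, None)
            self.states[obj_id] = state
            if len(self.states) > self.capacity:
                self.states.popitem(last=False)   # evict the stalest entry

    class LongTermMemory:
        # Static 3D scene graph: object -> (position, relations).
        def __init__(self, scene_graph):
            self.scene_graph = scene_graph

        def retrieve(self, query_objects):
            return {o: self.scene_graph[o]
                    for o in query_objects if o in self.scene_graph}

    def build_prompt(task, ltm, stm, query_objects):
        # Memory-augmented prompt for the LLM planner.
        lines = [f"Task: {task}"]
        lines += [f"Scene: {o} at {v}" for o, v in ltm.retrieve(query_objects).items()]
        lines += [f"Recent: {o} -> {s}" for o, s in stm.states.items()]
        return "\n".join(lines)

    ltm = LongTermMemory({"mug": ("kitchen counter", ["on(counter)"])})
    stm = ShortTermMemory()
    stm.update("mug", "moved to sink")
    print(build_prompt("wash the mug", ltm, stm, ["mug"]))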

NeurIPS Conference 2025 Conference Paper

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

  • Zhao Jin
  • Rong-Cheng Tu
  • Jingyi Liao
  • Wenhao Sun
  • Xiao Luo
  • Shunyu Liu
  • Dacheng Tao

3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER, a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform a coarse-level localization of potential objects. Furthermore, leveraging retrieved relevant 2D camera images, 3D-2D joint decision-making is efficiently performed to determine the best-matching object. By bridging spatial and semantic reasoning neural streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable gains of 9.0% and 10.9% in accuracy.
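
The progressive pipeline reads naturally as three staged VLM calls. Everything below (the vlm stub, the Scene container, the prompt strings) is invented purely to show the control flow the abstract describes; it is not SPAZER's actual interface.

    from dataclasses import dataclass, field

    def vlm(prompt, images):
        # Stand-in for a real vision-language model query; echoes the prompt
        # so the sketch runs end to end.
        return f"answer({prompt!r}, {len(images)} images)"

    @dataclass
    class Scene:
        renderings: list = field(default_factory=list)
        camera_images: dict = field(default_factory=dict)

    def ground(query, scene):
        # Stage 1: holistic scene analysis, picking the optimal rendered viewpoint.
        view = vlm(f"Best viewpoint for: {query}", scene.renderings)
        # Stage 2: anchor-guided screening for coarse candidate localization.
        candidates = vlm(f"Candidates near anchors for: {query}", [view])
        # Stage 3: 3D-2D joint decision over retrieved camera images.
        images = scene.camera_images.get(candidates, [])
        return vlm(f"Final object for: {query}", [view] + images)

    print(ground("the chair closest to the window", Scene(renderings=["r0", "r1"])))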

NeurIPS Conference 2025 Conference Paper

VORTA: Efficient Video Diffusion via Routing Sparse Attention

  • Wenhao Sun
  • Rong-Cheng Tu
  • Yifu Ding
  • Jingyi Liao
  • Zhao Jin
  • Shunyu Liu
  • Dacheng Tao

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods enhance efficiency by exploiting the local sparsity of attention scores; yet they often struggle with accelerating the long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Codes and weights are available at https://github.com/wenhao728/VORTA.
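
A hedged sketch of routing between a full attention path and one sparse variant. The windowed variant, the variant registry, and caller-supplied routing scores are assumptions made for illustration; VORTA's actual sparse patterns and router are defined in the repository linked above.

    import torch

    def full_attention(q, k, v):
        # Standard scaled dot-product attention over the whole sequence.
        w = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        return w.softmax(-1) @ v

    def windowed_attention(q, k, v, window=16):
        # One possible sparse variant: each query attends only within its
        # local window, cutting the quadratic cost to per-window blocks.
        n = q.shape[-2]
        out = torch.empty_like(q)
        for i in range(0, n, window):
            sl = slice(i, min(i + window, n))
            out[..., sl, :] = full_attention(q[..., sl, :], k[..., sl, :], v[..., sl, :])
        return out

    VARIANTS = {"full": full_attention, "window": windowed_attention}

    def routed_attention(q, k, v, route_scores):
        # Pick the variant with the highest routing score for this block/step;
        # the real router is learned, whereas scores here come from the caller.
        name = max(route_scores, key=route_scores.get)
        return VARIANTS[name](q, k, v)

    q = k = v = torch.randn(2, 64, 32)
    out = routed_attention(q, k, v, {"full": 0.2, "window": 0.8})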