Author name cluster

Ran Yi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers

1 author row

AAAI Conference 2026 Conference Paper

TileGS: Adaptive Gaussian Densification Through Tile-Guided Perceptual Analysis

Yiwen Wang
Ran Yi
Lizhuang Ma

3D Gaussian Splatting (3DGS) has become a powerful technique for real-time novel view synthesis, using explicit, end-to-end optimized 3D Gaussians to represent scenes. However, its training objective is primarily based on pixel-wise photometric loss, and its densification strategy fails to account for structural consistency and localized perceptual priorities. As a result, 3DGS struggles to capture fine textures and boundary details in underconstrained areas, leading to inefficient use of representational capacity and degraded rendering quality in critical regions. To overcome this limitation, we introduce TileGS, a tile-wise, perceptually guided framework designed to refine scene representation based on local rendering quality. Our method features a tile-guided densification approach that performs per-tile perceptual analysis between rendered and ground-truth tiles to identify areas and Gaussians requiring refinement. Additionally, we incorporate a tile-level structural loss to enforce localized consistency during training. TileGS is designed to be a plug-and-play framework, seamlessly integrating into existing 3DGS pipelines with minimal adjustments. Experiments across multiple datasets demonstrate that TileGS improves rendering quality while maintaining an efficient representation, showcasing its versatility and effectiveness in diverse rendering scenarios.


AAAI Conference 2026 Conference Paper

UltraGen: High-Resolution Video Generation with Hierarchical Attention

Teng Hu
Jiangning Zhang
Zihan Su
Ran Yi

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.


AAAI Conference 2026 Conference Paper

Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

XiaoQi Han
Ru Li
Ran Yi
Hongye Tan
Zhuomin Liang
Victor Gutierrez Basulto
Jeff Z. Pan

Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, which obscures overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality. These are operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, which uncovers a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visuals. Token analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.


AAAI Conference 2025 Conference Paper

ID-Sculpt: ID-aware 3D Head Generation from Single In-the-wild Portrait Image

Jinkun Hao
Junshu Tang
Jiangning Zhang
Ran Yi
Yijia Hong
Moran Li
Weijian Cao
Yating Wang

While recent works have achieved great success on one-shot 3D common object generation, high-quality, high-fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions, and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framework, ID-Sculpt, to generate high-quality 3D heads while preserving their identities. Our work incorporates the identity information of the portrait image into three stages: 1) geometry initialization, 2) geometry sculpting, and 3) texture generation. Given a reference portrait image, we first align the identity features with text features to realize ID-aware guidance enhancement, which contains the control signals representing the face information. We then use the canny map, ID features of the portrait image, and a pre-trained text-to-normal/depth diffusion model to generate ID-aware geometry supervision, and 3D-GAN inversion is employed to generate ID-aware geometry initialization. Furthermore, with the ability to inject identity information into 3D head generation, we use ID-aware guidance to calculate ID-aware Score Distillation (ISD) for geometry sculpting. For texture generation, we adopt ID Consistent Texture Inpainting and Refinement, which progressively expands the view for texture inpainting to obtain an initial UV texture map. We then use the ID-aware guidance to provide image-level supervision for noisy multi-view images to obtain a refined texture map. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.


NeurIPS Conference 2025 Conference Paper

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

Jinkun Hao
Naifu Liang
Zhen Luo
Xudong Xu
Weipeng Zhong
Ran Yi
Yichen Jin
Zhaoyang Lyu

The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts.


NeurIPS Conference 2025 Conference Paper

PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Teng Hu
Zhentao Yu
Zhengguang Zhou
Jiangning Zhang
Yuan Zhou
Qinglin Lu
Ran Yi

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines. More comprehensive video results and comparisons are shown on the project page in the supplementary material.


NeurIPS Conference 2025 Conference Paper

WaveAR: Wavelet-Aware Continuous Autoregressive Diffusion for Accurate Human Motion Prediction

Shengchuan Gao
Shuo Wang
Yabiao Wang
Ran Yi

This work tackles a challenging problem: stochastic human motion prediction (SHMP), which aims to forecast diverse and physically plausible future pose sequences based on a short history of observed motion. While autoregressive sequence models have excelled in related generation tasks, their reliance on vector-quantized tokenization limits motion fidelity and training stability. To overcome these drawbacks, we introduce WaveAR, a novel AR-based framework which is, to the best of our knowledge, the first successful application of a continuous autoregressive generation paradigm to human motion prediction. WaveAR consists of two stages. In the first stage, a lightweight Spatio-Temporal VAE (ST-VAE) compresses the raw 3D-joint sequence into a downsampled latent token stream, providing a compact yet expressive foundation. In the second stage, we apply masked autoregressive prediction directly in this continuous latent space, conditioning on both unmasked latents and multi-scale spectral cues extracted via a 2D discrete wavelet transform. A fusion module consisting of alternating cross-attention and self-attention layers adaptively fuses temporal context with low- and high-frequency wavelet subbands, and a small MLP-based diffusion head predicts per-token noise residuals under a denoising loss. By avoiding vector quantization and integrating localized frequency information, WaveAR preserves fine-grained motion details while maintaining fast inference speed. Extensive experiments on standard benchmarks demonstrate that our approach delivers more accurate and computationally efficient predictions than prior state-of-the-art methods.


AAAI Conference 2025 Conference Paper

Weighted Poisson-disk Resampling on Large-Scale Point Clouds

Xianhe Jiao
Chenlei Lv
Junli Zhao
Ran Yi
Yu-Hui Wen
Zhenkuan Pan
Zhongke Wu
Yong-Jin Liu

For large-scale point cloud processing, resampling plays an important role in controlling the number and density of points while preserving geometric consistency. However, current methods cannot balance these different requirements. Particularly with large-scale point clouds, classical methods often struggle with decreased efficiency and accuracy. To address such issues, we propose a weighted Poisson-disk (WPD) resampling method to improve the usability and efficiency of such processing. We first design an initial Poisson resampling with a voxel-based estimation strategy. It is able to estimate a more accurate radius of the Poisson-disk while maintaining high efficiency. Then, we design a weighted tangent smoothing step to further optimize the Voronoi diagram for each point. At the same time, sharp features are detected and kept in the optimized results with isotropic property. Finally, we obtain a resampled copy of the original point cloud with the specified point number, uniform density, and high-quality geometric consistency. Experiments show that our method significantly improves the performance of large-scale point cloud resampling for different applications, and provides a highly practical solution.


AAAI Conference 2024 Conference Paper

AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model

Teng Hu
Jiangning Zhang
Ran Yi
Yuzhen Du
Xu Chen
Liang Liu
Yabiao Wang
Chengjie Wang

Anomaly inspection plays an important role in industrial manufacturing. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they suffer from either poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address the above problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which utilizes the strong prior information of a latent diffusion model learned from a large-scale dataset to enhance the generation authenticity under few-shot training data. Firstly, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Moreover, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling generation of accurately-matched anomalous image-mask pairs. Extensive experiments demonstrate that our model significantly outperforms the state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available at https://github.com/sjtuplayer/anomalydiffusion.


AAAI Conference 2024 Conference Paper

Continuous Piecewise-Affine Based Motion Model for Image Animation

Hexiang Wang
Fengqi Liu
Qianyu Zhou
Ran Yi
Xin Tan
Lizhuang Ma

Image animation aims to bring static images to life according to driving videos and create engaging visual content that can be used for various purposes such as animation, entertainment, and education. Recent unsupervised methods utilize affine and thin-plate spline transformations based on keypoints to transfer the motion in driving frames to the source image. However, limited by the expressive power of the transformations used, these methods always produce poor results when the gap between the motion in the driving frame and the source image is large. To address this issue, we propose to model motion from the source image to the driving frame in highly-expressive diffeomorphism spaces. Firstly, we introduce Continuous Piecewise-Affine based (CPAB) transformation to model the motion and present a well-designed inference algorithm to generate CPAB transformation from control keypoints. Secondly, we propose a SAM-guided keypoint semantic loss to further constrain the keypoint extraction process and improve the semantic consistency between the corresponding keypoints on the source and driving images. Finally, we design a structure alignment loss to align the structure-related features extracted from driving and generated images, thus helping the generator generate results that are more consistent with the driving action. Extensive experiments on four datasets demonstrate the effectiveness of our method against state-of-the-art competitors quantitatively and qualitatively. Code will be publicly available at: https://github.com/DevilPG/AAAI2024-CPABMM.


IJCAI Conference 2023 Conference Paper

RFENet: Towards Reciprocal Feature Evolution for Glass Segmentation

Ke Fan
Changan Wang
Yabiao Wang
Chengjie Wang
Ran Yi
Lizhuang Ma

Glass-like objects are widespread in daily life but remain difficult for most existing methods to segment. Their transparency makes them hard to distinguish from the background, while the tiny separation boundary further impedes the acquisition of their exact contour. In this paper, by revealing the key co-evolution demand of semantic and boundary learning, we propose a Selective Mutual Evolution (SME) module to enable reciprocal feature learning between them. Then, to exploit the global shape context, we propose a Structurally Attentive Refinement (SAR) module to conduct fine-grained feature refinement for the ambiguous points around the boundary. Finally, to further utilize the multi-scale representation, we integrate the above two modules into a cascaded structure and introduce a Reciprocal Feature Evolution Network (RFENet) for effective glass-like object segmentation. Extensive experiments demonstrate that our RFENet achieves state-of-the-art performance on three popular public datasets. Code is available at https://github.com/VankouF/RFENet.


AAAI Conference 2022 Conference Paper

Exploiting Fine-Grained Face Forgery Clues via Progressive Enhancement Learning

Qiqi Gu
Shen Chen
Taiping Yao
Yang Chen
Shouhong Ding
Ran Yi

With the rapid development of facial forgery techniques, forgery detection has attracted more and more attention due to security concerns. Existing approaches attempt to use frequency information to mine subtle artifacts in high-quality forged faces. However, their exploitation of frequency information is coarse-grained, and more importantly, their vanilla learning process struggles to extract fine-grained forgery traces. To address this issue, we propose a progressive enhancement learning framework to exploit both the RGB and fine-grained frequency clues. Specifically, we perform a fine-grained decomposition of RGB images to completely decouple the real and fake traces in the frequency space. Subsequently, we propose a progressive enhancement learning framework based on a two-branch network, combined with self-enhancement and mutual-enhancement modules. The self-enhancement module captures the traces in different input spaces based on spatial noise enhancement and channel attention. The mutual-enhancement module concurrently enhances RGB and frequency features by communicating in the shared spatial dimension. The progressive enhancement process facilitates the learning of discriminative features with fine-grained face forgery clues. Extensive experiments on several datasets show that our method outperforms the state-of-the-art face forgery detection methods.


IJCAI Conference 2022 Conference Paper

Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection

Zhihao Gu
Taiping Yao
Yang Chen
Ran Yi
Shouhong Ding
Lizhuang Ma

The rapid development of face forgery techniques has drawn growing attention due to security concerns. Existing deepfake video detection methods typically attempt to capture discriminative features by directly exploiting static temporal convolution to mine temporal inconsistency, without explicitly exploring the diverse temporal dynamics of different forged regions. To effectively and comprehensively capture the various inconsistencies, in this paper we propose a novel Region-Aware Temporal Filter (RATF) module which automatically generates corresponding temporal filters for different spatial regions. Specifically, we decouple the dynamic temporal kernel into a set of region-agnostic basic filters and region-sensitive aggregation weights. The different weights guide the corresponding regions to adaptively learn temporal inconsistency, which greatly enhances the overall representational ability. Moreover, to cover the long-term temporal dynamics, we divide the video into multiple snippets and propose a Cross-Snippet Attention (CSA) to promote cross-snippet information interaction. Extensive experiments and visualizations on several benchmarks demonstrate the effectiveness of our method against state-of-the-art competitors.
