Arrow Research search

Author name cluster

Lei Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

83 papers
2 author rows

Possible papers (83)

AAAI Conference 2026 Conference Paper

Beyond Semantic Features: Pixel-level Mapping for Generalized AI-Generated Image Detection

  • Chenming Zhou
  • Jiaan Wang
  • Yu Li
  • Lei Li
  • Juan Cao
  • Sheng Tang

The rapid evolution of generative technologies necessitates reliable methods for detecting AI-generated images. A critical limitation of current detectors is their failure to generalize to images from unseen generative models, as they often overfit to source-specific semantic cues rather than learning universal generative artifacts. To overcome this, we introduce a simple yet remarkably effective pixel-level mapping pre-processing step to disrupt the pixel value distribution of images and break the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts. This forces the detector to focus on more fundamental and generalizable high-frequency traces inherent to the image generation process. Through comprehensive experiments on GAN and diffusion-based generators, we show that our approach significantly boosts the cross-generator performance of state-of-the-art detectors. Extensive analysis further verifies our hypothesis that the disruption of semantic cues is the key to generalization.

AAAI Conference 2026 Conference Paper

Multiple Human Motion Understanding

  • Lei Li
  • Sen Jia
  • Jenq-Neng Hwang

We introduce LLaMMo (Large Language and Multi-Person Motion Assistant), the first instruction-tuning multimodal framework tailored for multi-human motion analysis. LLaMMo incorporates a novel human-centric and social-temporal learner that models and fuses both intra-person dynamics and inter-person dependencies, yielding robust, context-aware representations of complex group behaviors while maintaining low computational overhead. To support LLaMMo, we construct LLaVerse, a large-scale dataset with fine-grained manual annotations covering diverse multi-person activities spanning daily social interaction and professional team sports. Built on top of LLaVerse, we also propose LLaMI-Bench, a dedicated benchmark for evaluating multi-human behavior understanding across motion and video modalities. Extensive experiments demonstrate that LLaMMo consistently outperforms baselines in understanding multi-person interactions under low-latency settings, with notable gains in both social and sport-specific contexts.

AAAI Conference 2026 Conference Paper

PUFM: Efficient Point Cloud Upsampling via Flow Matching

  • Zhi-Song Liu
  • Chenhang He
  • Yakun Ju
  • Lei Li

Diffusion models have recently been adopted for point cloud upsampling due to their effectiveness in solving ill-posed problems. However, existing upsampling methods often struggle with inefficiencies, as they generate dense point clouds by mapping Gaussian noise to data, overlooking the geometric information already present in sparse inputs. To address this, we propose PUFM, a novel point cloud upsampling method based on flow matching, which learns to directly transform sparse point clouds into their high-fidelity dense counterparts. Our approach first applies midpoint interpolation to densify the sparse input. Then, we construct a continuous interpolant between sparse and dense point clouds and train a neural network to estimate the velocity field for flow matching. Given the unordered nature of point clouds, we introduce a pre-alignment step based on Earth Mover's Distance (EMD) optimization to ensure coherent and meaningful interpolation between sparse and dense representations. This results in a more stable and efficient learning trajectory during flow matching. Experiments on synthetic benchmarks demonstrate that our method delivers superior upsampling quality with fewer sampling steps. Further experiments on ScanNet and KITTI also show that our approach generalizes well to real-world RGB-D and LiDAR point clouds, making it more practical for real-world applications.
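The two-step recipe in this abstract (midpoint densification, then a linear interpolant whose constant velocity is the flow-matching regression target) can be sketched as a toy. The function names and the linear interpolant form are illustrative assumptions, and the EMD pre-alignment is assumed already applied so the two clouds are point-wise aligned:

```python
import numpy as np

def midpoint_densify(points):
    # Densify a sparse cloud by inserting midpoints of consecutive points
    # (a toy stand-in for the paper's midpoint interpolation step).
    mids = (points[:-1] + points[1:]) / 2.0
    return np.concatenate([points, mids], axis=0)

def flow_interpolant(x_sparse_up, x_dense, t):
    # Linear interpolant between the aligned densified-sparse and dense clouds.
    # A network v_theta(x_t, t) would be regressed onto v_target.
    x_t = (1.0 - t) * x_sparse_up + t * x_dense
    v_target = x_dense - x_sparse_up
    return x_t, v_target
```

Note that at t = 0 the interpolant is the densified sparse input rather than Gaussian noise, which is why sampling can start close to the data and use fewer steps.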

AAAI Conference 2026 Conference Paper

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

  • Shicheng Li
  • Lei Li
  • Kun Ouyang
  • Shuhuai Ren
  • Yuanxin Liu
  • Yuanxing Zhang
  • Fuzheng Zhang
  • Lingpeng Kong

Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm, which collectively result in the absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy which progressively increases perturbation difficulty to maximize data efficiency; and applying preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
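The DPO objective at the core of such preference learning scores each preference pair by the policy-vs-reference log-probability margin. A minimal sketch of the standard DPO loss on one pair; the function name and scalar interface are assumptions for illustration, not the paper's code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward of each response = beta * (policy logp - reference logp);
    # the loss is -log(sigmoid) of the chosen-minus-rejected reward margin.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns no extra preference to either response the margin is zero and the loss is log 2; widening the margin toward the chosen (here, temporally faithful) response drives the loss down.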

NeurIPS Conference 2025 Conference Paper

A Technical Report on “Erasing the Invisible”: The 2024 NeurIPS Competition on Stress Testing Image Watermarks

  • Mucong Ding
  • Bang An
  • Tahseen Rabbani
  • Chenghao Deng
  • Anirudh Satheesh
  • Souradip Chakraborty
  • Mehrdad Saberi
  • Yuxin Wen

AI-generated images have become pervasive, raising critical concerns around content authenticity, intellectual property, and the spread of misinformation. Invisible watermarks offer a promising solution for identifying AI-generated images, preserving content provenance without degrading visual quality. However, their real-world robustness remains uncertain due to the lack of standardized evaluation protocols and large-scale stress testing. To bridge this gap, we organized “Erasing the Invisible,” a NeurIPS 2024 competition and newly established benchmark designed to systematically stress test the resilience of watermarking techniques. The competition introduced two attack tracks—Black-box and Beige-box—that simulate practical scenarios with varying levels of attacker knowledge of watermarks, providing a comprehensive assessment of watermark robustness. The competition attracted significant global participation, with 2,722 submissions from 298 teams. Through a rigorous evaluation pipeline featuring real-time feedback and human-verified final rankings, participants developed and demonstrated new attack strategies that revealed critical vulnerabilities in state-of-the-art watermarking methods. On average, the top-5 teams in both tracks could remove watermarks from $\geq$ 89% of the images while preserving high visual quality, setting strong baselines for future research on watermark attacks and defenses. To support continued progress in this field, we summarize the insights and lessons learned from this competition in this paper, and release the benchmark dataset, evaluation toolkit, and competition results. “Erasing the Invisible” establishes a valuable open resource for advancing more robust watermarking techniques and strengthening content provenance in the era of generative AI.

ECAI Conference 2025 Conference Paper

Adversarial Pretrained Language Model for Multivariate Time Series Anomaly Detection

  • Jianhuan Mao
  • Mengxiao Zhu 0004
  • Lei Li
  • Haogang Zhu

Multivariate time series anomaly detection plays a vital role in safety-critical domains such as industrial systems, finance, and cybersecurity. However, the scarcity of labeled anomalies poses significant challenges for learning robust normal patterns, often blurring the boundary between normal and abnormal behaviors. To address this challenge, we propose ADLM, an unsupervised adversarial framework that integrates a Language-Model-based Predictor for Time Series (LMPTS) with an autoencoder. To capture normal patterns under limited data, LMPTS repurposes a decoder-only pretrained language model as an autoregressive forecaster, leveraging its strong generative prior to capture temporal dependencies. To model complex cross-sensor dependencies, we incorporate graph structure learning into the framework. Furthermore, we introduce an adversarial training strategy to sharpen the model’s normal-pattern representations and amplify deviations indicative of anomalies. Experiments on six public datasets show that ADLM consistently outperforms state-of-the-art baselines and remains robust under severe data scarcity. By coupling decoder-only language models with an adversarial objective, ADLM offers a label-efficient, structure-aware solution to multivariate time series anomaly detection.

AAAI Conference 2025 Conference Paper

An Efficient and Accurate Dynamic Sparse Training Framework Based on Parameter-Freezing

  • Lei Li
  • Haochen Yang
  • Jiacheng Guo
  • Hongkai Yu
  • Minghai Qin
  • Tianyun Zhang

Federated learning is a decentralized machine learning approach that consists of servers and clients. It protects data privacy during model training by keeping the training data locally in each client. However, the requirement for the server and clients to frequently synchronize the parameters of the model brings a heavy burden to the communication links, especially as model sizes have grown drastically in recent years. Several methods have been proposed to compress the model by sparsification to reduce the communication overhead, albeit with significant accuracy degradation. In this work, we propose methods to better trade off model accuracy against training efficiency in federated learning. Our first proposed method is a novel sparse mask readjustment rule on the server, and the second is a parameter-freezing method during training on the clients. Experimental results show that model accuracy improves significantly when our proposed methods are combined. For example, compared with the previous state-of-the-art methods at the same total communication cost and computation FLOPs, our methods increase accuracy on average by 4% and 6% on the CIFAR-10 and CIFAR-100 datasets with ResNet-18, respectively. On the other hand, when targeting the same accuracy, the proposed method can reduce the communication cost by 4-8 times for different datasets with different sparsity levels.
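The parameter-freezing idea — clients skip updates, and hence synchronization, for a frozen subset of weights — can be illustrated with a masked gradient step. This is a minimal sketch under assumed names; the paper's actual freezing criterion and sparse-mask readjustment rule are not reproduced here:

```python
import numpy as np

def frozen_sgd_step(weights, grads, frozen_mask, lr=0.1):
    # Apply the gradient step only to non-frozen entries; frozen entries stay
    # fixed, so they need not be re-transmitted between client and server.
    return weights - lr * grads * ~frozen_mask
```

Communication savings follow directly: only the `~frozen_mask` positions change between synchronization rounds.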

IROS Conference 2025 Conference Paper

Awakening Facial Emotional Expressions in Human-Robot

  • Yongtong Zhu
  • Lei Li
  • Iggy Qian
  • Wenbin Zhou
  • Ye Yuan
  • Qingdu Li
  • Na Liu 0007
  • Jianwei Zhang 0001

The facial expression generation capability of humanoid social robots is critical for achieving natural and human-like interactions, playing a vital role in enhancing the fluidity of human-robot interactions and the accuracy of emotional expression. Currently, facial expression generation in humanoid social robots still relies on pre-programmed behavioral patterns, which are manually coded at high human and time costs. To enable humanoid robots to autonomously acquire generalized expressive capabilities, they need to develop the ability to learn human-like expressions through self-training. To address this challenge, we have designed a highly biomimetic robotic face with physical-electronic animated facial units and developed an end-to-end learning framework based on KAN (Kolmogorov-Arnold Network) and attention mechanisms. Unlike previous humanoid social robots, we have also meticulously designed an automated data collection system based on expert strategies of facial motion primitives to construct the dataset. Notably, to the best of our knowledge, this is the first open-source facial dataset for humanoid social robots. Comprehensive evaluations indicate that our approach achieves accurate and diverse facial mimicry across different test subjects.

JBHI Journal 2025 Journal Article

FRSynergy: A Feature Refinement Network for Synergistic Drug Combination Prediction

  • Lei Li
  • Haitao Li
  • Chunhou Zheng
  • Yansen Su

Synergistic drug combinations have shown promising results in treating cancer cell lines by enhancing therapeutic efficacy and minimizing adverse reactions. The effects of a drug vary across cell lines, and cell lines respond differently to various drugs during treatment. Recently, many AI-based techniques have been developed for predicting synergistic drug combinations. However, existing computational models have not addressed this phenomenon, neglecting the refinement of features for the same drug and cell line in different scenarios. In this work, we propose a feature refinement deep learning framework, termed FRSynergy, to identify synergistic drug combinations. It can guide the refinement of drug and cell line features in different scenarios by capturing relationships among diverse drug-drug-cell line triplet features and learning feature contextual information. A heterogeneous graph attention network is employed to acquire topology-based original features for drugs and cell lines from sampled sub-graphs. Then, the feature refinement network is designed by combining an attention mechanism with contextual information, which can learn context-aware feature representations for each drug and cell line feature in diverse drug-drug-cell line triplet contexts. Extensive experiments affirm the strong performance of FRSynergy in predicting synergistic drug combinations and, more importantly, demonstrate the effectiveness of the feature refinement network in synergistic drug combination prediction.

IJCAI Conference 2025 Conference Paper

In-Context Meta LoRA Generation

  • Yihua Shao
  • Minxi Yan
  • Yang Liu
  • Siyu Chen
  • Wenjie Chen
  • Xinwei Long
  • Ziyang Yan
  • Lei Li

Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies only 283 MB, about 1% of the storage required by the original LoRA models. The code is available at https://github.com/YihuaJerry/ICM-LoRA.

AAAI Conference 2025 Conference Paper

Position-Aware Guided Point Cloud Completion with CLIP Model

  • Feng Zhou
  • Qi Zhang
  • Ju Dai
  • Lei Li
  • Qing Fan
  • Junliang Xing

Point cloud completion aims to recover partial geometric and topological shapes caused by equipment defects or limited viewpoints. Current methods either solely rely on the 3D coordinates of the point cloud to complete it or incorporate additional images with well-calibrated intrinsic parameters to guide the geometric estimation of the missing parts. Although these methods have achieved excellent performance by directly predicting the location of complete points, the extracted features lack fine-grained information regarding the location of the missing area. To address this issue, we propose a rapid and efficient method to expand an unimodal framework into a multimodal framework. This approach incorporates a position-aware module designed to enhance the spatial information of the missing parts through a weighted map learning mechanism. In addition, we establish Point-Text-Image triplet corpora, PCI-TI and MVP-TI, based on existing unimodal point cloud completion datasets and use the pre-trained vision-language model CLIP to provide richer detail information for 3D shapes, thereby enhancing performance. Extensive quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art point cloud completion methods.

JBHI Journal 2025 Journal Article

Towards Clinical Application of Enhanced Timed Up and Go with Markerless Motion Capture and Machine Learning for Balance and Gait Assessment

  • Longbin Zhang
  • Ananda Sidarta
  • Tsung-Lin Wu
  • Prayook Jatesiktat
  • Hao Wang
  • Lei Li
  • Patrick Wai-Hang Kwong
  • Aoyang Long

Balance and gait impairments play a key role in falls among the elderly. Traditional clinical scales used to assess fall risk, such as the Berg Balance Scale (BBS), are often subjective and time-consuming, and do not assess gait performance. Shorter assessments such as the Timed Up and Go (TUG) test are available, but most clinicians only look at the completion time. This study aimed to develop a fast, low-cost, and automated framework for balance function assessment and comprehensive gait analysis by enhancing the traditional TUG test with a markerless motion capture (MoCap) system and machine learning models. In total, we included TUG datasets of 70 participants with varying degrees of fall risk based on their BBS scores. We segmented TUG trials into five phases automatically using data from the MoCap system and extracted features from the phases. These features were then analyzed to identify those that significantly discriminate between high and low fall risk groups. Using the identified features, various machine learning models were tested to estimate the BBS scores. The markers obtained from the markerless MoCap system were used for detailed gait analysis, and lower limb kinematics were compared between the markerless and marker-based methods. Our findings indicate that individuals at high risk of falling had longer completion times, lower performance velocities, and smaller ranges of motion in lower-limb joints. Among the tested machine learning models, random forest demonstrated the best performance in predicting BBS scores (RMSE: 0.98, $R^{2}$: 0.94). Additionally, our markerless MoCap system showed comparable accuracy to state-of-the-art systems, eliminating the need to attach markers or sensors. The findings could help develop a quick and objective tool for balance and gait assessment in older adults, providing quantitative data to improve screening and intervention planning.

JBHI Journal 2025 Journal Article

ZSG-Net: A Zero-Shot Super-Resolution Guided Network for Ultrasound Image Segmentation and Classification

  • Xingtao Lin
  • Xiahai Zhuang
  • Lin Pan
  • Mingjing Yang
  • Liqin Huang
  • Shun Chen
  • Lei Li

Automated ultrasound (US) image analysis is hindered by challenges stemming from low resolution, noise, and non-uniform grayscale distribution, which compromise image quality. While many existing studies address these issues using super-resolution (SR) techniques, they often focus exclusively on SR without considering downstream tasks or tailoring to the unique characteristics of US images. In this work, we propose ZSG-Net, a zero-shot super-resolution-guided network, designed to bridge the gap between US image quality enhancement and its benefits in segmentation and classification. First, we introduce a zero-shot self-supervised cycle generative adversarial network (ZSCycle-GAN), tailored to the unique characteristics of US images, to perform SR while preserving critical structural details. Unlike conventional SR methods that focus solely on image enhancement, ZSCycle-GAN is designed to optimize downstream tasks. Second, we adopt a zero-shot self-supervised learning strategy, eliminating the reliance on labeled data and addressing the scarcity of annotated medical imaging datasets. Third, we incorporate a random image degradation (RID) strategy to expand the degradation space for clinical US images, enabling robust learning of diverse quality variations. Extensive experiments on three US image datasets validate the effectiveness of the proposed model. Results demonstrate superior performance in segmentation and classification tasks compared to existing approaches, underscoring the potential of our method to improve US image analysis in clinical settings.

AAAI Conference 2024 Conference Paper

Ada-Retrieval: An Adaptive Multi-Round Retrieval Paradigm for Sequential Recommendations

  • Lei Li
  • Jianxun Lian
  • Xiao Zhou
  • Xing Xie

Retrieval models aim at selecting a small set of item candidates which match the preference of a given user. They play a vital role in large-scale recommender systems since subsequent models such as rankers highly depend on the quality of item candidates. However, most existing retrieval models employ a single-round inference paradigm, which may not adequately capture the dynamic nature of user preferences and may get stuck in one area of the item space. In this paper, we propose Ada-Retrieval, an adaptive multi-round retrieval paradigm for recommender systems that iteratively refines user representations to better capture potential candidates in the full item space. Ada-Retrieval comprises two key modules: the item representation adapter and the user representation adapter, designed to inject context information into items' and users' representations. The framework maintains a model-agnostic design, allowing seamless integration with various backbone models such as RNNs or Transformers. We perform experiments on three widely used public datasets, incorporating five powerful sequential recommenders as backbone models. Our results demonstrate that Ada-Retrieval significantly enhances the performance of various base models, with consistent improvements observed across different datasets. Our code and data are publicly available at: https://github.com/ll0ruc/Ada-Retrieval.
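The multi-round paradigm can be sketched as a loop that retrieves a batch, then adapts the user representation using context from what was just retrieved. The update rule below (pushing the user vector away from the retrieved centroid) is a deliberately simple stand-in for the paper's learned user/item representation adapters, and all names are assumptions:

```python
import numpy as np

def multi_round_retrieval(user_vec, item_vecs, rounds=3, k=2, lr=0.5):
    # Each round: score unseen items, take top-k, then adapt the user
    # representation so the next round explores a different region.
    selected = []
    u = user_vec.copy()
    for _ in range(rounds):
        scores = item_vecs @ u
        scores[selected] = -np.inf          # mask items already retrieved
        top = np.argsort(-scores)[:k]
        selected.extend(top.tolist())
        # Toy adapter: move away from the centroid of this round's results.
        u = u - lr * item_vecs[top].mean(axis=0)
        u = u / (np.linalg.norm(u) + 1e-8)
    return selected
```

A single-round retriever would return the top rounds*k items of the first scoring pass; the loop instead re-scores with an updated representation each round.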

NeurIPS Conference 2024 Conference Paper

Invisible Image Watermarks Are Provably Removable Using Generative AI

  • Xuandong Zhao
  • Kexun Zhang
  • Zihao Su
  • Saastha Vasan
  • Ilya Grishchenko
  • Christopher Kruegel
  • Giovanni Vigna
  • Yu-Xiang Wang

Invisible watermarks safeguard images' copyrights by embedding hidden messages only detectable by owners. They also prevent people from misusing images, especially those generated by AI models. We propose a family of regeneration attacks to remove these invisible watermarks. The proposed attack method first adds random noise to an image to destroy the watermark and then reconstructs the image. This approach is flexible and can be instantiated with many existing image-denoising algorithms and pre-trained generative models such as diffusion models. Through formal proofs and extensive empirical evaluations, we demonstrate that pixel-level invisible watermarks are vulnerable to this regeneration attack. Our results reveal that, across four different pixel-level watermarking schemes, the proposed method consistently achieves superior performance compared to existing attack techniques, with lower detection rates and higher image quality. However, watermarks that keep the image semantically similar can be an alternative defense against our attacks. Our findings underscore the need for a shift in research/industry emphasis from invisible watermarks to semantic-preserving watermarks. Code is available at https://github.com/XuandongZhao/WatermarkAttacker
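The attack's two steps (noise injection, then reconstruction) can be sketched against a toy additive watermark. The watermark scheme, the mean-blur "reconstruction", and all names below are illustrative assumptions — the paper instantiates the reconstruction step with image denoisers or diffusion models:

```python
import numpy as np

def embed_watermark(img, key, strength=2.0):
    # Toy pixel-level watermark: add a keyed pseudo-random +/-1 pattern.
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=img.shape)
    return np.clip(img + strength * pattern, 0, 255), pattern

def detect(img, pattern, threshold=0.5):
    # Correlate the (centered) image with the known pattern.
    return np.mean((img - img.mean()) * pattern) > threshold

def regeneration_attack(img, noise_std=20.0, kernel=5, seed=0):
    # Step 1: random noise overwhelms the low-amplitude watermark signal.
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0, noise_std, img.shape)
    # Step 2: "reconstruct" with a simple moving-average denoiser.
    pad = kernel // 2
    padded = np.pad(noisy, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(kernel):
        for dx in range(kernel):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return np.clip(out / kernel**2, 0, 255)
```

The denoising step averages the pseudo-random pattern toward zero, so the correlation detector no longer fires even though the image content survives.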

JBHI Journal 2024 Journal Article

MDNNSyn: A Multi-Modal Deep Learning Framework for Drug Synergy Prediction

  • Lei Li
  • Haitao Li
  • Tseren-Onolt Ishdorj
  • Chunhou Zheng
  • Yansen Su

Synergistic drug combination prediction tasks based on computational models have been widely studied and applied in the cancer field. However, most models only consider the interactions between drug pairs and specific cell lines, without taking into account the multiple biological relationships of drug-drug and cell line-cell line that also largely affect synergistic mechanisms. To this end, we propose a multi-modal deep learning framework, termed MDNNSyn, which adequately applies multi-source information and trains multi-modal features to infer potential synergistic drug combinations. MDNNSyn extracts topology modality features by implementing a multi-layer hypergraph neural network on the drug synergy hypergraph and constructs semantic modality features through a similarity strategy. A multi-modal fusion network layer with a gated neural network is then employed for synergy score prediction. MDNNSyn is compared to five classic and state-of-the-art prediction methods on the DrugCombDB and Oncology-Screen datasets. The model achieves area under the curve (AUC) scores of 0.8682 and 0.9013 on the two datasets, an improvement of 3.70% and 2.71% over the second-best model. A case study indicates that MDNNSyn is capable of detecting potential synergistic drug combinations.

NeurIPS Conference 2024 Conference Paper

MindMerger: Efficiently Boosting LLM Reasoning in non-English Languages

  • Zixian Huang
  • Wenhao Zhu
  • Gong Cheng
  • Lei Li
  • Fei Yuan

Reasoning capabilities are crucial for Large Language Models (LLMs), yet a notable gap exists between English and non-English languages. To bridge this disparity, some works fine-tune LLMs to relearn reasoning capabilities in non-English languages, while others replace non-English inputs with an external model's outputs, such as English translation text, to circumvent the challenge of LLMs understanding non-English. Unfortunately, these methods often underutilize the built-in skilled reasoning and useful language understanding capabilities of LLMs. To better utilize both, we propose a new method, namely MindMerger, which merges LLMs with the external language understanding capabilities of multilingual models to boost multilingual reasoning performance. Furthermore, a two-step training scheme is introduced to first embed the external capabilities into LLMs and then train the collaborative utilization of the external and built-in capabilities. Experiments on three multilingual reasoning datasets and a language understanding dataset demonstrate that MindMerger consistently outperforms all baselines, especially in low-resource languages. Without updating the parameters of LLMs, the average accuracy improves by 6.7 and 8.0 across all languages and low-resource languages on the MGSM dataset, respectively.

JBHI Journal 2024 Journal Article

NeighborNet: Learning Intra- and Inter-Image Pixel Neighbor Representation for Breast Lesion Segmentation

  • Weiwei Cao
  • Jianfeng Guo
  • Xiaohui You
  • Yuxin Liu
  • Lei Li
  • Wenju Cui
  • Yuzhu Cao
  • Xinjian Chen

Breast lesion segmentation from ultrasound images is essential in computer-aided breast cancer diagnosis. To alleviate the problems of blurry lesion boundaries and irregular morphologies, common practices combine CNNs and attention to integrate global and local information. However, previous methods use two independent modules to extract global and local features separately; such feature-wise inflexible integration ignores the semantic gap between them, resulting in representation redundancy or insufficiency and undesirable restrictions in clinical practice. Moreover, medical images are highly similar to each other due to the imaging methods and human tissues, but the global information captured by transformer-based methods in the medical domain is limited to individual images; the semantic relations and common knowledge across images are largely ignored. To alleviate the above problems, from the neighbor view, this paper develops a pixel neighbor representation learning method (NeighborNet) to flexibly integrate global and local context within and across images for lesion morphology and boundary modeling. Concretely, we design two neighbor layers to investigate two properties (i.e., number and distribution) of neighbors. The neighbor number for each pixel is not fixed but determined by itself. The neighbor distribution is extended from one image to all images in the dataset. With these two properties, for each pixel at each feature level, the proposed NeighborNet can evolve into a transformer or degenerate into a CNN for adaptive context representation learning to cope with irregular lesion morphologies and blurry boundaries. The state-of-the-art performance on three ultrasound datasets proves the effectiveness of the proposed NeighborNet.

ICRA Conference 2024 Conference Paper

Object-centric Cross-modal Feature Distillation for Event-based Object Detection

  • Lei Li
  • Alexander Liniger
  • Mario Millhäusler
  • Vagia Tsiminaki
  • Yuanyou Li
  • Dengxin Dai

Event cameras are gaining popularity due to their unique properties, such as their low latency and high dynamic range. One task where these benefits can be crucial is real-time object detection. However, RGB detectors still outperform event-based detectors due to the sparsity of the event data and missing visual details. In this paper, we propose a cross-modality feature distillation method that can focus on regions where the knowledge distillation works best to shrink the detection performance gap between these two modalities. We achieve this by using an object-centric slot attention mechanism that can iteratively decouple feature maps into object-centric features and corresponding pixel features used for distillation. We evaluate our novel distillation approach on a synthetic and a real event dataset with aligned grayscale images as a teacher modality. We show that object-centric distillation significantly improves the performance of the event-based student object detector, nearly halving the performance gap with respect to the teacher.

NeurIPS Conference 2024 Conference Paper

Scaling Law for Time Series Forecasting

  • Jingzhe Shi
  • Qinwei Ma
  • Huan Ma
  • Lei Li

Scaling laws that reward large datasets, complex models, and enhanced data granularity have been observed in various fields of deep learning. Yet, studies on time series forecasting have cast doubt on the scaling behaviors of deep learning methods in this setting: while more training data improves performance, more capable models do not always outperform less capable ones, and a longer input horizon may hurt performance for some models. We propose a theory of scaling laws for time series forecasting that can explain these seemingly abnormal behaviors. We take into account the impact of dataset size and model complexity, as well as time series data granularity, particularly focusing on the look-back horizon, an aspect that has been unexplored in previous theories. Furthermore, we empirically evaluate various models using a diverse set of time series forecasting datasets, which (1) verifies the validity of the scaling law on dataset size and model complexity within the realm of time series forecasting, and (2) validates our theoretical framework, particularly regarding the influence of the look-back horizon. We hope our findings will inspire new models targeting time series forecasting datasets of limited size, as well as large foundational datasets and models for time series forecasting in future works.

AAAI Conference 2024 Conference Paper

Where It Really Matters: Few-Shot Environmental Conservation Media Monitoring for Low-Resource Languages

  • Sameer Jain
  • Sedrick Scott Keh
  • Shova Chhetri
  • Karun Dewan
  • Pablo Izquierdo
  • Johanna Prussmann
  • Pooja Shrestha
  • César Suárez

Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the global south where the news of interest is mainly in local low-resource languages, and far fewer experts are available to annotate datasets on a sustainable basis. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration example news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and can achieve comparable performance with models fully fine-tuned using thousands of examples. With NewsSerow, Organization X has been able to deploy the media monitoring tool in Nepal, significantly reducing their operational burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed for countries with other languages like Colombia.

NeurIPS Conference 2023 Conference Paper

ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers

  • Kexun Zhang
  • Danqing Wang
  • Jingtao Xia
  • William Yang Wang
  • Lei Li

Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness. ALGO first generates a reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the synthesized algorithms. Our study shows that the LLM-generated oracles are correct for 88% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8× better one-submission pass rate over the Codex model and a 2.6× better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get a 1.3× better pass rate over the ChatGPT Code Interpreter on unseen problems. The problem set we used for testing, the prompts we used, the verifier and solution programs, and the test cases generated by ALGO are available at https://github.com/zkx06111/ALGO.
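
The oracle-as-verifier idea from the abstract can be illustrated with a self-contained toy sketch (ours, not the ALGO implementation): a brute-force enumerator plays the role of the reference oracle, and a faster candidate algorithm is accepted only if it agrees with the oracle on small random inputs.

```python
import random

def oracle_max_subarray(xs):
    """Reference oracle: exhaustively enumerate all contiguous subarrays (O(n^2))."""
    return max(sum(xs[i:j]) for i in range(len(xs)) for j in range(i + 1, len(xs) + 1))

def fast_max_subarray(xs):
    """Candidate algorithm under test: Kadane's algorithm (O(n))."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def verify(candidate, oracle, trials=200, seed=0):
    """Accept the candidate only if it matches the oracle on random small inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        if candidate(xs) != oracle(xs):
            return False
    return True

print(verify(fast_max_subarray, oracle_max_subarray))  # True: candidate matches oracle
```

The slow oracle is easy to trust precisely because it enumerates everything; the search over candidate algorithms can then be automated against it.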

AAAI Conference 2023 Conference Paper

Converge to the Truth: Factual Error Correction via Iterative Constrained Editing

  • Jiangjie Chen
  • Rui Xu
  • Wenxuan Zeng
  • Changzhi Sun
  • Lei Li
  • Yanghua Xiao

Given a possibly false claim sentence, how can we automatically correct it with minimal editing? Existing methods either require a large number of pairs of false and corrected claims for supervised training or do not handle well errors spanning over multiple tokens within an utterance. In this paper, we propose VENCE, a novel method for factual error correction (FEC) with minimal edits. VENCE formulates the FEC problem as iterative sampling editing actions with respect to a target density function. We carefully design the target function with predicted truthfulness scores from an offline trained fact verification model. VENCE samples the most probable editing positions based on back-calculated gradients of the truthfulness score concerning input tokens and the editing actions using a distantly-supervised language model (T5). Experiments on a public dataset show that VENCE improves the well-adopted SARI metric by 5.3 (or a relative improvement of 11.8%) over the previous best distantly-supervised methods.

JBHI Journal 2023 Journal Article

Deep Learning Segmentation of the Right Ventricle in Cardiac MRI: The M&Ms Challenge

  • Carlos Martín-Isla
  • Víctor M. Campello
  • Cristian Izquierdo
  • Kaisar Kushibar
  • Carla Sendra-Balcells
  • Polyxeni Gkontra
  • Alireza Sojoudi
  • Mitchell J. Fulton

In recent years, several deep learning models have been proposed to accurately quantify and diagnose cardiac pathologies. These automated tools heavily rely on the accurate segmentation of cardiac structures in MRI images. However, segmentation of the right ventricle is challenging due to its highly complex shape and ill-defined borders. Hence, there is a need for new methods to handle such structures' geometrical and textural complexities, notably in the presence of pathologies such as Dilated Right Ventricle, Tricuspid Regurgitation, Arrhythmogenesis, Tetralogy of Fallot, and Inter-atrial Communication. The last MICCAI challenge on right ventricle segmentation was held in 2012 and included only 48 cases from a single clinical center. As part of the 12th Workshop on Statistical Atlases and Computational Models of the Heart (STACOM 2021), the M&Ms-2 challenge was organized to promote the interest of the research community around right ventricle segmentation in multi-disease, multi-view, and multi-center cardiac MRI. Three hundred sixty CMR cases, including short-axis and long-axis 4-chamber views, were collected from three Spanish hospitals using nine different scanners from three different vendors, and included a diverse set of right and left ventricle pathologies. The solutions provided by the participants show that nnU-Net achieved the best results overall. However, multi-view approaches were able to capture additional information, highlighting the need to integrate multiple cardiac diseases, views, scanners, and acquisition protocols to produce reliable automatic cardiac segmentation algorithms.

NeurIPS Conference 2023 Conference Paper

FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

  • Yuanxin Liu
  • Lei Li
  • Shuhuai Ren
  • Rundong Gao
  • Shicheng Li
  • Sishuo Chen
  • Xu Sun
  • Lu Hou

Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown by the qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either only focuses on a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. FETV is also temporal-aware, which introduces several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significantly higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV.

AAAI Conference 2023 Short Paper

On Analyzing the Role of Image for Visual-Enhanced Relation Extraction (Student Abstract)

  • Lei Li
  • Xiang Chen
  • Shuofei Qiao
  • Feiyu Xiong
  • Huajun Chen
  • Ningyu Zhang

Multimodal relation extraction is an essential task for knowledge graph construction. In this paper, we take an in-depth empirical analysis that indicates the inaccurate information in the visual scene graph leads to poor modal alignment weights, further degrading performance. Moreover, the visual shuffle experiments illustrate that the current approaches may not take full advantage of visual information. Based on the above observation, we further propose a strong baseline with an implicit fine-grained multimodal alignment based on Transformer for multimodal relation extraction. Experimental results demonstrate the better performance of our method. Codes are available at https://github.com/zjunlp/DeepKE/tree/main/example/re/multimodal.

TIST Journal 2023 Journal Article

On the Relationship between Explanation and Recommendation: Learning to Rank Explanations for Improved Performance

  • Lei Li
  • Yongfeng Zhang
  • Li Chen

Explaining to users why some items are recommended is critical, as it can help users to make better decisions, increase their satisfaction, and gain their trust in recommender systems (RS). However, existing explainable RS usually consider explanation as a side output of the recommendation model, which has two problems: (1) It is difficult to evaluate the produced explanations, because they are usually model-dependent, and (2) as a result, how the explanations impact the recommendation performance is less investigated. In this article, explaining recommendations is formulated as a ranking task and learned from data, similarly to item ranking for recommendation. This makes it possible for standard evaluation of explanations via ranking metrics (e.g., Normalized Discounted Cumulative Gain). Furthermore, this article extends traditional item ranking to an item–explanation joint-ranking formalization to study if purposely selecting explanations could reach certain learning goals, e.g., improving recommendation performance. A great challenge, however, is that the sparsity issue in the user-item-explanation data would inevitably be more severe than that in traditional user–item interaction data, since not every user–item pair can be associated with all explanations. To mitigate this issue, this article proposes to perform two sets of matrix factorization by considering the ternary relationship as two groups of binary relationships. Experiments on three large datasets verify the solution’s effectiveness on both explanation ranking and item recommendation.
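
For readers unfamiliar with the ranking metric named in the abstract, here is a minimal sketch of NDCG@k using the standard formula (our illustration, not code from the article):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking that places the most relevant explanation first scores 1.0;
# reversing the order lowers the score toward 0.
print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0
print(round(ndcg_at_k([0, 1, 2, 3], k=4), 3))
```

Because NDCG is model-independent, it lets explanation rankings from different models be compared on equal footing, which is the evaluation problem the article targets.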

IJCAI Conference 2023 Conference Paper

One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER

  • Xiang Chen
  • Lei Li
  • Shuofei Qiao
  • Ningyu Zhang
  • Chuanqi Tan
  • Yong Jiang
  • Fei Huang
  • Huajun Chen

Cross-domain NER is a challenging task to address the low-resource problem in practical scenarios. Previous typical solutions mainly obtain a NER model by pre-trained language models (PLMs) with data from a rich-resource domain and adapt it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounding domain-related instructors to transfer knowledge to new domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks.

NeurIPS Conference 2023 Conference Paper

Statistical Knowledge Assessment for Large Language Models

  • Qingxiu Dong
  • Jingjing Xu
  • Lingpeng Kong
  • Zhifang Sui
  • Lei Li

Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? Existing LLMs may generate distinct responses for different prompts. In this paper, we study the problem of quantifying knowledge contained in an LLM regarding a given set of facts. We propose KaRR, a statistical approach to assess factual knowledge for LLMs. The main idea is to estimate the ratio of the LLM generating text corresponding to the answer entity given diverse prompts of the subject and the querying relation, versus generating it by random chance. Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's τ) with the results of human assessment on LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.

IJCAI Conference 2022 Conference Paper

Cross-modal Representation Learning and Relation Reasoning for Bidirectional Adaptive Manipulation

  • Lei Li
  • Kai Fan
  • Chun Yuan

Since single-modal controllable manipulation typically requires supervision of information from other modalities or cooperation with complex software and experts, this paper addresses the problem of cross-modal adaptive manipulation (CAM). The novel task performs cross-modal semantic alignment from mutual supervision and implements bidirectional exchange of attributes, relations, or objects in parallel, benefiting both modalities while significantly reducing manual effort. We introduce a robust solution for CAM, which includes two essential modules, namely Heterogeneous Representation Learning (HRL) and Cross-modal Relation Reasoning (CRR). The former is designed to perform representation learning for cross-modal semantic alignment on heterogeneous graph nodes. The latter is adopted to identify and exchange the focused attributes, relations, or objects in both modalities. Our method produces pleasing cross-modal outputs on CUB and Visual Genome.

JBHI Journal 2022 Journal Article

Cross-Modality Multi-Atlas Segmentation via Deep Registration and Label Fusion

  • Wangbin Ding
  • Lei Li
  • Xiahai Zhuang
  • Liqin Huang

Multi-atlas segmentation (MAS) is a promising framework for medical image segmentation. Generally, MAS methods register multiple atlases, i.e., medical images with corresponding labels, to a target image; and the transformed atlas labels can be combined to generate target segmentation via label fusion schemes. Many conventional MAS methods employed the atlases from the same modality as the target image. However, the number of atlases with the same modality may be limited or even missing in many clinical applications. Besides, conventional MAS methods suffer from the computational burden of registration or label fusion procedures. In this work, we design a novel cross-modality MAS framework, which uses available atlases from a certain modality to segment a target image from another modality. To boost the computational efficiency of the framework, both the image registration and label fusion are achieved by well-designed deep neural networks. For the atlas-to-target image registration, we propose a bi-directional registration network (BiRegNet), which can efficiently align images from different modalities. For the label fusion, we design a similarity estimation network (SimNet), which estimates the fusion weight of each atlas by measuring its similarity to the target image. SimNet can learn multi-scale information for similarity estimation to improve the performance of label fusion. The proposed framework was evaluated by the left ventricle and liver segmentation tasks on the MM-WHS and CHAOS datasets, respectively. Results have shown that the framework is effective for cross-modality MAS in both registration and label fusion. The code is available at https://github.com/NanYoMy/cmmas.
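
The weighted label fusion step can be illustrated with a toy numpy sketch (illustrative only: SimNet itself is a learned network, and `weighted_label_fusion` below is our hypothetical helper, not the authors' code). Each warped atlas casts a vote for each pixel, scaled by its per-atlas similarity weight:

```python
import numpy as np

def weighted_label_fusion(atlas_labels, weights, num_classes):
    """Fuse per-atlas label maps by similarity-weighted voting.

    atlas_labels: (n_atlases, H, W) integer label maps already warped to target space.
    weights:      (n_atlases,) per-atlas similarity weights (e.g. from a similarity net).
    """
    votes = np.zeros((num_classes,) + atlas_labels.shape[1:])
    for labels, w in zip(atlas_labels, weights):
        for c in range(num_classes):
            votes[c] += w * (labels == c)
    return votes.argmax(axis=0)  # per-pixel label with the highest weighted vote

# Three 2x2 atlas label maps; the first atlas is most similar to the target.
labels = np.array([[[0, 1], [1, 1]],
                   [[0, 0], [1, 0]],
                   [[1, 1], [1, 1]]])
fused = weighted_label_fusion(labels, weights=np.array([0.5, 0.3, 0.2]), num_classes=2)
print(fused)  # [[0 1] [1 1]]: disagreements resolved toward higher-weighted atlases
```

In the actual framework the weights are predicted per atlas by SimNet rather than fixed by hand.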

NeurIPS Conference 2022 Conference Paper

Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning

  • Xiang Chen
  • Lei Li
  • Ningyu Zhang
  • Xiaozhuan Liang
  • Shumin Deng
  • Chuanqi Tan
  • Fei Huang
  • Luo Si

Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance while they still follow a parametric-based learning paradigm; the oblivion and rote memorization problems in learning may encounter unstable generalization issues. Specifically, vanilla prompt learning may struggle to utilize atypical instances by rote during fully-supervised training or overfit shallow patterns with low-shot data. To alleviate such limitations, we develop RetroPrompt with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization. In contrast with vanilla prompt learning, RetroPrompt constructs an open-book knowledge-store from training instances and implements a retrieval mechanism during the process of input, training and inference, thus equipping the model with the ability to retrieve related contexts from the training corpus as cues for enhancement. Extensive experiments demonstrate that RetroPrompt can obtain better performance in both few-shot and zero-shot settings. Besides, we further illustrate that our proposed RetroPrompt can yield better generalization abilities with new datasets. Detailed analysis of memorization indeed reveals RetroPrompt can reduce the reliance of language models on memorization; thus, improving generalization for downstream tasks. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/RetroPrompt.

AAAI Conference 2022 Conference Paper

Deepfake Network Architecture Attribution

  • Tianyun Yang
  • Ziyao Huang
  • Juan Cao
  • Lei Li
  • Xirong Li

With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, only capable of handling images generated by seen models with a specific seed, loss and dataset, which is limited in real-world scenarios when fake images may be generated by privately trained models. This motivates us to ask whether it is possible to attribute fake images to the source models’ architectures even if they are finetuned or retrained under different configurations. In this work, we present the first study on Deepfake Network Architecture Attribution to attribute fake images on architecture-level. Based on an observation that GAN architecture is likely to leave globally consistent fingerprints while traces left by model weights vary in different regions, we provide a simple yet effective solution named DNA-Det for this problem. Extensive experiments on multiple cross-test setups and a large-scale dataset demonstrate the effectiveness of DNA-Det.

NeurIPS Conference 2022 Conference Paper

Learning Multi-resolution Functional Maps with Spectral Attention for Robust Shape Matching

  • Lei Li
  • Nicolas Donati
  • Maks Ovsjanikov

In this work, we present a novel non-rigid shape matching framework based on multi-resolution functional maps with spectral attention. Existing functional map learning methods all rely on the critical choice of the spectral resolution hyperparameter, which can severely affect the overall accuracy or lead to overfitting, if not chosen carefully. In this paper, we show that spectral resolution tuning can be alleviated by introducing spectral attention. Our framework is applicable in both supervised and unsupervised settings, and we show that it is possible to train the network so that it can adapt the spectral resolution, depending on the given shape input. More specifically, we propose to compute multi-resolution functional maps that characterize correspondence across a range of spectral resolutions, and introduce a spectral attention network that helps to combine this representation into a single coherent final correspondence. Our approach is not only accurate with near-isometric input, for which a high spectral resolution is typically preferred, but also robust and able to produce reasonable matching even in the presence of significant non-isometric distortion, which poses great challenges to existing methods. We demonstrate the superior performance of our approach through experiments on a suite of challenging near-isometric and non-isometric shape matching benchmarks.

AAAI Conference 2022 Conference Paper

LOREN: Logic-Regularized Reasoning for Interpretable Fact Verification

  • Jiangjie Chen
  • Qiaoben Bao
  • Changzhi Sun
  • Xinbo Zhang
  • Jiaze Chen
  • Hao Zhou
  • Yanghua Xiao
  • Lei Li

Given a natural language statement, how to verify its veracity against a large-scale textual knowledge source like Wikipedia? Most existing neural models make predictions without giving clues about which part of a false claim goes wrong. In this paper, we propose LOREN, an approach for interpretable fact verification. We decompose the verification of the whole claim at phrase-level, where the veracity of the phrases serves as explanations and can be aggregated into the final verdict according to logical rules. The key insight of LOREN is to represent claim phrase veracity as three-valued latent variables, which are regularized by aggregation logical rules. The final claim verification is based on all latent variables. Thus, LOREN enjoys the additional benefit of interpretability — it is easy to explain how it reaches certain results with claim phrase veracity. Experiments on a public fact verification benchmark show that LOREN is competitive against previous approaches while enjoying the merit of faithful and accurate interpretability. The resources of LOREN are available at: https://github.com/jiangjiechen/LOREN.

AAAI Conference 2022 Conference Paper

Non-autoregressive Translation with Layer-Wise Prediction and Deep Supervision

  • Chenyang Huang
  • Hao Zhou
  • Osmar R. Zaïane
  • Lili Mou
  • Lei Li

How do we perform efficient inference while retaining high translation quality? Existing neural machine translation models, such as Transformer, achieve high performance, but they decode words one by one, which is inefficient. Recent non-autoregressive translation models speed up the inference, but their quality is still inferior. In this work, we propose DSLP, a highly efficient and high-performance model for machine translation. The key insight is to train a non-autoregressive Transformer with Deep Supervision and feed additional Layer-wise Predictions. We conducted extensive experiments on four translation tasks (both directions of WMT’14 EN–DE and WMT’16 EN–RO). Results show that our approach consistently improves the BLEU scores compared with respective base models. Specifically, our best variant outperforms the autoregressive model on three translation tasks, while being 14.8 times more efficient in inference.

IJCAI Conference 2022 Conference Paper

Rethinking the Promotion Brought by Contrastive Learning to Semi-Supervised Node Classification

  • Deli Chen
  • Yankai Lin
  • Lei Li
  • Xuancheng Ren
  • Peng Li
  • Jie Zhou
  • Xu Sun

Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, whose underlying working mechanism remains underexplored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework that considers the annotated information distribution across the graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL can bring a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.

AAAI Conference 2022 Conference Paper

Unsupervised Editing for Counterfactual Stories

  • Jiangjie Chen
  • Chun Gan
  • Sijie Cheng
  • Hao Zhou
  • Yanghua Xiao
  • Lei Li

Creating what-if stories requires reasoning about prior statements and possible outcomes of the changed conditions. One can easily generate coherent endings under new conditions, but it would be challenging for current systems to do it with minimal changes to the original story. Therefore, one major challenge is the trade-off between generating a logical story and rewriting with minimal-edits. In this paper, we propose EDUCAT, an editing-based unsupervised approach for counterfactual story rewriting. EDUCAT includes a target position detection strategy based on estimating causal effects of the what-if conditions, which keeps the causal invariant parts of the story. EDUCAT then generates the stories under fluency, coherence and minimal-edits constraints. We also propose a new metric to alleviate the shortcomings of current automatic metrics and better evaluate the trade-off. We evaluate EDUCAT on a public counterfactual story rewriting benchmark. Experiments show that EDUCAT achieves the best trade-off over unsupervised SOTA methods according to both automatic and human evaluation. The resources of EDUCAT are available at: https://github.com/jiangjiechen/EDUCAT.

IROS Conference 2022 Conference Paper

Visual Environment perception for obstacle detection and crossing of lower-limb exoskeletons

  • Manoj Ramanathan
  • Lincong Luo
  • Jie Kai Er
  • Ming Jeat Foo
  • Chye Hsia Chiam
  • Lei Li
  • Wei-Yun Yau
  • Wei Tech Ang

Lower limb exoskeletons offer support for patients suffering from mobility disorders due to injury, stroke, etc. But these devices are not used in day-to-day life and environments due to their limited human-computer interface to perceive and handle different terrains and tasks. In this paper, we introduce a simple vision-based environment perception pipeline for lower-limb exoskeletons for obstacle crossing tasks. The proposed pipeline consists of three stages, namely, ground plane and obstacle detection, estimating obstacle location and dimensions, and obstacle tracking. To reduce noisy artifacts and reliably detect obstacles, we propose a similarity metric based on color, gradient orientation, and 2D surface normal. Depth map of the detected obstacle region is utilized for estimating the obstacle location and dimensions. Also, we consider two obstacle tracking modes for obstacle crossing, visual tracking using an RGB-D camera and positional tracking using a SLAM camera. The proposed vision-based perception pipeline is integrated with an exoskeleton, where we propose a control scheme that can vary step length adaptively to successfully cross detected obstacles. We conduct offline and online experiments to validate the proposed perception pipeline and provide insights on the same. Our experiments show that the proposed pipeline allows exoskeletons to understand their environment and successfully cross obstacles.

AAAI Conference 2022 Conference Paper

Well-Classified Examples Are Underestimated in Classification with Deep Neural Networks

  • Guangxiang Zhao
  • Wenkai Yang
  • Xuancheng Ren
  • Lei Li
  • Yunfang Wu
  • Xu Sun

The conventional wisdom behind learning deep classification models is to focus on badly classified examples and ignore well-classified examples that are far from the decision boundary. For instance, when training with cross-entropy loss, examples with higher likelihoods (i.e., well-classified examples) contribute smaller gradients in back-propagation. However, we theoretically show that this common practice hinders representation learning, energy optimization, and margin growth. To counteract this deficiency, we propose to reward well-classified examples with additive bonuses to revive their contribution to the learning process. This counterexample theoretically addresses these three issues. We empirically support this claim by directly verifying the theoretical results or observing significant performance improvement with our counterexample on diverse tasks, including image classification, graph classification, and machine translation. Furthermore, this paper shows that we can deal with complex scenarios, such as imbalanced classification, OOD detection, and applications under adversarial attacks, because our idea can solve these three issues. Code is available at https://github.com/lancopku/well-classified-examples-are-underestimated.
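
A minimal numpy sketch of the general idea of adding an additive bonus for well-classified examples to cross-entropy; the particular bonus form below is our illustrative choice, not necessarily the paper's exact loss:

```python
import numpy as np

def ce_with_bonus(probs, labels, bonus_weight=0.5):
    """Cross-entropy loss minus an additive bonus that rewards well-classified examples.

    probs:  (N, C) predicted class probabilities.
    labels: (N,) integer class indices.
    The bonus term -log(1 - p_correct) grows as p_correct -> 1, so confident,
    correct predictions keep contributing to the objective instead of fading out.
    """
    p = probs[np.arange(len(labels)), labels]
    ce = -np.log(p)
    bonus = -np.log(1.0 - p + 1e-12)  # larger when the example is well classified
    return (ce - bonus_weight * bonus).mean()

probs = np.array([[0.9, 0.1], [0.6, 0.4]])
labels = np.array([0, 0])
print(ce_with_bonus(probs, labels, bonus_weight=0.0))  # plain cross-entropy
print(ce_with_bonus(probs, labels, bonus_weight=0.5))  # lowered by the bonus
```

With `bonus_weight=0.0` this reduces to plain cross-entropy, which makes the effect of the bonus easy to ablate.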

AAAI Conference 2021 Conference Paper

ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization

  • Xunpeng Huang
  • Runxin Xu
  • Hao Zhou
  • Zhe Wang
  • Zhengyang Liu
  • Lei Li

Stochastic gradient descent (SGD) is a widely used method for its outstanding generalization ability and simplicity. Adaptive gradient methods have been proposed to further accelerate the optimization process. In this paper, we revisit existing adaptive gradient optimization methods with a new interpretation. This new perspective leads to a refreshed understanding of the roles of second moments in stochastic optimization. Based on this, we propose the Angle-Calibration Moment method (ACMo), a novel stochastic optimization method. It enjoys the benefits of second moments with only first moment updates. Theoretical analysis shows that ACMo is able to achieve the same convergence rate as mainstream adaptive methods. Experiments on a variety of CV and NLP tasks demonstrate that ACMo has a comparable convergence to state-of-the-art Adam-type optimizers, and even a better generalization performance in most cases. The code is available at https://github.com/Xunpeng746/ACMo.

AAAI Conference 2021 Conference Paper

Consecutive Decoding for Speech-to-text Translation

  • Qianqian Dong
  • Mingxuan Wang
  • Hao Zhou
  • Shuang Xu
  • Bo Xu
  • Lei Li

Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. It benefits the model training so that additional large parallel text corpus can be fully exploited to enhance the speech translation training. Our method is verified on three mainstream datasets, including the Augmented LibriSpeech English-French dataset, the TED English-German dataset, and the TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-of-the-art methods. The code is available at https://github.com/dqqcasia/st.

NeurIPS Conference 2021 Conference Paper

Duplex Sequence-to-Sequence Learning for Reversible Machine Translation

  • Zaixiang Zheng
  • Hao Zhou
  • Shujian Huang
  • Jiajun Chen
  • Jingjing Xu
  • Lei Li

Sequence-to-sequence learning naturally has two directions. How can supervision signals from both directions be utilized effectively? Existing approaches either require two separate models, or a multitask-learned model with inferior performance. In this paper, we propose REDER (Reversible Duplex Transformer), a parameter-efficient model, and apply it to machine translation. Either end of REDER can simultaneously input and output a distinct language. Thus REDER enables reversible machine translation by simply flipping the input and output ends. Experiments verify that REDER achieves the first success of reversible machine translation, outperforming its multitask-trained baselines by up to 1.3 BLEU.

AAAI Conference 2021 Conference Paper

Finding Sparse Structures for Domain Specific Neural Machine Translation

  • Jianze Liang
  • Chengqi Zhao
  • Mingxuan Wang
  • Xipeng Qiu
  • Lei Li

Neural machine translation often adopts the fine-tuning approach to adapt to specific domains. However, unrestricted fine-tuning can easily degrade performance on the general domain and over-fit to the target domain. To mitigate this issue, we propose PRUNE-TUNE, a novel domain adaptation method via gradual pruning. It learns tiny domain-specific sub-networks during fine-tuning on new domains. PRUNE-TUNE alleviates the over-fitting and degradation problems without model modification. Furthermore, PRUNE-TUNE is able to sequentially learn a single network with multiple disjoint domain-specific sub-networks for multiple domains. Empirical results show that PRUNE-TUNE outperforms several strong competitors on the target-domain test set without sacrificing quality on the general domain, in both single- and multi-domain settings. The source code and data are available at https://github.com/ohlionel/Prune-Tune.

AAAI Conference 2021 Conference Paper

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

  • Qianqian Dong
  • Rong Ye
  • Mingxuan Wang
  • Hao Zhou
  • Shuang Xu
  • Bo Xu
  • Lei Li

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel corpus available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals that decouples the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT utilizes a pre-trained BERT model to enforce the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German, and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.

TIST Journal 2021 Journal Article

Local Graph Edge Partitioning

  • Shengwei Ji
  • Chenyang Bu
  • Lei Li
  • Xindong Wu

Graph edge partitioning, which is essential for the efficiency of distributed graph computation systems, divides a graph into several balanced partitions within a given size to minimize the number of vertices to be cut. Existing graph partitioning models can be classified into two categories: offline and streaming graph partitioning models. The former requires global graph information during the partitioning, which is expensive in terms of time and memory for large-scale graphs. The latter creates partitions based solely on the received graph information. However, the streaming model may result in a lower partitioning quality compared with the offline model. Therefore, this study introduces a Local Graph Edge Partitioning model, which considers only the local information (i.e., a portion of a graph instead of the entire graph) during the partitioning. Considering only the local graph information is meaningful because acquiring complete information for large-scale graphs is expensive. Based on the Local Graph Edge Partitioning model, two local graph edge partitioning algorithms—Two-stage Local Partitioning and Adaptive Local Partitioning—are given. Experimental results obtained on 14 real-world graphs demonstrate that the proposed algorithms outperform rival algorithms in most tested cases. Furthermore, the proposed algorithms are proven to significantly improve the efficiency of the real graph computation system GraphX.
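The streaming/local setting the abstract contrasts with offline partitioning can be illustrated with a minimal greedy sketch: each arriving edge is assigned to the partition that already holds the most of its endpoints (fewest new vertex replicas), breaking ties toward the least-loaded partition. This is only an illustrative toy, not the paper's Two-stage or Adaptive Local Partitioning algorithms; the function name and tie-breaking rule are our own choices.

```python
def greedy_edge_partition(edges, k):
    """Toy greedy streaming edge partitioner: assign each edge to the
    partition already holding the most of its endpoints (fewer new
    vertex replicas), breaking ties toward the least-loaded partition."""
    loads = [0] * k                        # edges assigned per partition
    replicas = [set() for _ in range(k)]   # vertices replicated in each partition
    assignment = []
    for u, v in edges:
        def score(p):
            # prefer larger endpoint overlap, then smaller load
            overlap = (u in replicas[p]) + (v in replicas[p])
            return (-overlap, loads[p])
        p = min(range(k), key=score)
        replicas[p].update((u, v))
        loads[p] += 1
        assignment.append(p)
    return assignment, loads

# Two disjoint triangles end up in separate partitions.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
assignment, loads = greedy_edge_partition(edges, 2)
```

Because the partitioner sees only the edges streamed so far, it needs no global graph information, which is the trade-off the abstract describes.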

AAAI Conference 2021 Conference Paper

Taxonomy Completion via Triplet Matching Network

  • Jieyu Zhang
  • Xiangchen Song
  • Ying Zeng
  • Jiaze Chen
  • Jiaming Shen
  • Yuning Mao
  • Lei Li

Automatically constructing taxonomies finds many applications in e-commerce and web search. One critical challenge is that, as data and business scope grow in real applications, new concepts emerge and need to be added to the existing taxonomy. Previous approaches focus on taxonomy expansion, i.e., finding an appropriate hypernym concept in the taxonomy for a new query concept. In this paper, we formulate a new task, "taxonomy completion", by discovering both the hypernym and hyponym concepts for a query. We propose the Triplet Matching Network (TMN) to find appropriate ⟨hypernym, hyponym⟩ pairs for a given query concept. TMN consists of one primal scorer and multiple auxiliary scorers. The auxiliary scorers capture various fine-grained signals (e.g., query-to-hypernym or query-to-hyponym semantics), and the primal scorer makes a holistic prediction on the ⟨query, hypernym, hyponym⟩ triplet based on the internal feature representations of all auxiliary scorers. An innovative channel-wise gating mechanism that retains task-specific information in concept representations is also introduced to further boost model performance. Experiments on four real-world large-scale datasets show that TMN achieves the best performance on both the taxonomy completion task and the previous taxonomy expansion task, outperforming existing methods.

AAAI Conference 2021 Conference Paper

TextGAIL: Generative Adversarial Imitation Learning for Text Generation

  • Qingyang Wu
  • Lei Li
  • Zhou Yu

Generative Adversarial Networks (GANs) for text generation have recently received much criticism, as they perform worse than their MLE counterparts (Caccia et al. 2020; Tevet et al. 2019; Semeniuta, Severyn, and Gelly 2018). We suspect previous text GANs' inferior performance is due to the lack of a reliable guiding signal in their discriminators. To address this problem, we propose a generative adversarial imitation learning framework for text generation that uses large pre-trained language models to provide more reliable reward guidance. As previous text GANs suffer from high gradient variance, we apply a contrastive discriminator and proximal policy optimization (PPO) to stabilize and improve text generation performance. For evaluation, we conduct experiments on a diverse set of unconditional and conditional text generation tasks. Experimental results show that TextGAIL achieves better performance in terms of both quality and diversity than the MLE baseline. We also validate our intuition that TextGAIL's discriminator demonstrates the capability of providing reasonable rewards with an additional task.

AAAI Conference 2020 Conference Paper

FACT: Fused Attention for Clothing Transfer with Generative Adversarial Networks

  • Yicheng Zhang
  • Lei Li
  • Li Song
  • Rong Xie
  • Wenjun Zhang

Clothing transfer is a challenging task in computer vision where the goal is to transfer the human clothing style in an input image conditioned on a given language description. However, existing approaches have limited ability in delicate colorization and texture synthesis with a conventional fully convolutional generator. To tackle this problem, we propose a novel semantic-based Fused Attention model for Clothing Transfer (FACT), which allows fine-grained synthesis, high global consistency, and plausible hallucination in images. Towards this end, we incorporate two attention modules at the spatial level: (i) soft attention that searches for the most related positions in sentences, and (ii) self-attention that models long-range dependencies on feature maps. Furthermore, we develop a stylized channel-wise attention module to capture correlations at the feature level. We effectively fuse these attention modules in the generator and achieve better performance than the state-of-the-art method on the DeepFashion dataset. Qualitative and quantitative comparisons against the baselines demonstrate the effectiveness of our approach.

IJCAI Conference 2020 Conference Paper

Feature Augmented Memory with Global Attention Network for VideoQA

  • Jiayin Cai
  • Chun Yuan
  • Cheng Shi
  • Lei Li
  • Yangyang Cheng
  • Ying Shan

Recently, Recurrent Neural Network (RNN) based methods and Self-Attention (SA) based methods have achieved promising performance in Video Question Answering (VideoQA). Despite the success of these works, RNN-based methods tend to forget global semantic content due to the inherent drawbacks of the recurrent units themselves, while SA-based methods cannot precisely capture the dependencies of the local neighborhood, leading to insufficient modeling of temporal order. To tackle these problems, we propose a novel VideoQA framework which progressively refines the representations of videos and questions from fine to coarse grain in a sequence-sensitive manner. Specifically, our model improves the feature representations via the following two steps: (1) introducing two fine-grained feature-augmented memories to strengthen the information augmentation of video and text, which improves memory capacity by memorizing more relevant and targeted information; (2) appending the self-attention and co-attention modules to the memory output, so that the model can capture global interactions between high-level semantic information. Experimental results show that our approach achieves state-of-the-art performance on VideoQA benchmark datasets.

AAAI Conference 2020 Conference Paper

Importance-Aware Learning for Neural Headline Editing

  • Qingyang Wu
  • Lei Li
  • Hao Zhou
  • Ying Zeng
  • Zhou Yu

Many social media news writers are not professionally trained, so social media platforms have to hire professional editors to adjust amateur headlines to attract more readers. We propose to automate this headline editing process through neural network models to provide more immediate writing support for these social media news writers. To train such a neural headline editing model, we collected a dataset which contains articles with original headlines and professionally edited headlines. However, it is expensive to collect a large number of professionally edited headlines. To solve this low-resource problem, we design an encoder-decoder model which leverages large-scale pre-trained language models. We further improve the pre-trained model's quality by introducing a headline generation task as an intermediate task before the headline editing task. We also propose a Self Importance-Aware (SIA) loss to address the different levels of editing in the dataset by down-weighting the importance of easily classified tokens and sentences. With the help of pre-training, adaptation, and SIA, the model learns to generate headlines in the professional editor's style. Experimental results show that our method significantly improves the quality of headline editing compared with previous methods.

NeurIPS Conference 2020 Conference Paper

SOLOv2: Dynamic and Fast Instance Segmentation

  • Xinlong Wang
  • Rufeng Zhang
  • Tao Kong
  • Lei Li
  • Chunhua Shen

In this work, we design a simple, direct, and fast framework for instance segmentation with strong performance. To this end, we propose a novel and effective approach, termed SOLOv2, following the principle of the SOLO method [32]. First, our new framework is empowered by an efficient and holistic instance mask representation scheme, which dynamically segments each instance in the image, without resorting to bounding box detection. Specifically, the object mask generation is decoupled into a mask kernel prediction and mask feature learning, which are responsible for generating convolution kernels and the feature maps to be convolved with, respectively. Second, SOLOv2 significantly reduces inference overhead with our novel matrix non-maximum suppression (NMS) technique. Our Matrix NMS performs NMS with parallel matrix operations in one shot, and yields better results. We demonstrate that the proposed SOLOv2 achieves state-of-the-art performance with high efficiency, making it suitable for both mobile and cloud applications. A light-weight version of SOLOv2 executes at 31.3 FPS and yields 37.1% AP on COCO test-dev. Moreover, our state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation show the potential of SOLOv2 to serve as a new strong baseline for many instance-level recognition tasks. Code is available at https://git.io/AdelaiDet.
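The "one shot" idea behind Matrix NMS can be sketched in a few lines of numpy: given candidates sorted by score and their pairwise IoU matrix, every decay factor is computed with matrix operations instead of a sequential suppression loop. This is a simplified sketch of the linear-kernel variant operating on a precomputed IoU matrix; the actual SOLOv2 code computes mask IoUs on the fly and also offers a Gaussian kernel.

```python
import numpy as np

def matrix_nms(scores, ious):
    """Linear-kernel Matrix NMS sketch. `scores` must be sorted in
    descending order; `ious[i, j]` is the IoU between candidates i and j.
    All pairwise decays are computed in one shot with matrix operations."""
    iou = np.triu(ious, k=1)            # keep only pairs where i outranks j
    compensate = iou.max(axis=0)        # max IoU each candidate has with higher-ranked ones
    decay = (1.0 - iou) / (1.0 - compensate[:, None])
    coeff = decay.min(axis=0)           # most severe decay from any higher-ranked candidate
    return scores * coeff

scores = np.array([0.9, 0.8, 0.7])
ious = np.array([[1.0, 0.8, 0.0],
                 [0.8, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
decayed = matrix_nms(scores, ious)      # candidate 1 overlaps candidate 0 heavily and is decayed
```

Instead of hard-suppressing overlapping candidates one by one, every score is soft-decayed in parallel, which is what makes the operation GPU-friendly.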

AAAI Conference 2020 Conference Paper

Task-Aware Monocular Depth Estimation for 3D Object Detection

  • Xinlong Wang
  • Wei Yin
  • Tao Kong
  • Yuning Jiang
  • Lei Li
  • Chunhua Shen

Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. Almost all methods treat foreground and background regions (“things and stuff”) in an image equally. However, not all pixels are equal. Depth of foreground objects plays a crucial role in 3D object recognition and localization. To date, how to boost the depth prediction accuracy of foreground objects is rarely discussed. In this paper, we first analyze the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground and background depth using separate optimization objectives and decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods. Code will be available at: https://github.com/WXinlong/ForeSeE.

AAAI Conference 2020 Conference Paper

Towards Making the Most of BERT in Neural Machine Translation

  • Jiacheng Yang
  • Mingxuan Wang
  • Hao Zhou
  • Chengqi Zhao
  • Weinan Zhang
  • Yong Yu
  • Lei Li

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model retains the previously pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our experiments in machine translation show CTNMT gains of up to 3 BLEU on the WMT14 English-German language pair, which even surpasses the previous state-of-the-art pre-training-aided NMT by 1.4 BLEU. On the large WMT14 English-French task with 40 million sentence pairs, our base model still significantly improves upon the state-of-the-art Transformer-big model by more than 1 BLEU.

ICRA Conference 2019 Conference Paper

A bio-robotic remora disc with attachment and detachment capabilities for reversible underwater hitchhiking

  • Siqi Wang
  • Lei Li
  • Yufeng Chen 0003
  • Yueping Wang
  • Wenguang Sun
  • Junfei Xiao
  • Dylan K. Wainwright
  • Tianmiao Wang

Remoras employ their adhesive discs to rapidly attach to and detach from a wide range of marine surfaces. By analyzing high-speed images of remoras' (Echeneis naucrates) hitchhiking behavior, we describe the fish's detachment mechanism as a lip curling up to break the seal between the disc and substrate. By mimicking the kinematic and morphological properties of the biological disc, we fabricated a multi-material biomimetic disc (whose stiffness spans four orders of magnitude) that is capable of both attachment and detachment. Detachment is realized by a flexible cable-driven mechanism that curls the anterior region of the silicone soft lip, allows leakage under the disc, and equalizes the internal pressure to the external pressure. The disc lamellae with attached carbon fiber spinules can be rotated by hydraulic soft actuators whose internal pressure is precisely tuned to the ambient underwater pressure. During attachment, increasing the rotational angle of the lamellae and the preload of the disc significantly enhanced the adhesive forces. We found that curling up the soft lip and folding down the lamellae rapidly reduced the pulling force of the disc by a factor of 254 compared to that under the attached state, which leads to detachment. Based on these mechanisms, underwater maneuvers involving repeated attachment and detachment were demonstrated with an integrated ROV unit that had a self-contained actuation and control system for the disc. This study lays a foundation for the development of fully untethered robotic systems for underwater hitchhiking in real-world marine environments.

JBHI Journal 2019 Journal Article

An ARIMA Model With Adaptive Orders for Predicting Blood Glucose Concentrations and Hypoglycemia

  • Jun Yang
  • Lei Li
  • Yimeng Shi
  • Xiaolei Xie

The continuous glucose monitoring system is an effective tool, which enables the users to monitor their blood glucose (BG) levels. Based on the continuous glucose monitoring (CGM) data, we aim at predicting future BG levels so that appropriate actions can be taken in advance to prevent hyperglycemia or hypoglycemia. Due to the time-varying nonstationarity of CGM data, verified by Augmented Dickey–Fuller test and analysis of variance, an autoregressive integrated moving average (ARIMA) model with an adaptive identification algorithm of model orders is proposed in the prediction framework. Such identification algorithm adaptively determines the model orders and simultaneously estimates the corresponding parameters using Akaike Information Criterion and least square estimation. A case study is conducted with the CGM data of diabetics under daily living conditions to analyze the prediction performance of the proposed model together with the early hypoglycemic alarms. Results show that the proposed model outperforms the adaptive univariate model and ARIMA model.
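The order-identification step described in the abstract can be illustrated with a pure-AR sketch: fit an AR(p) model by least squares for each candidate order and keep the order with the lowest AIC. This is a simplification of the paper's adaptive ARIMA identification (no differencing or moving-average terms); the function names and toy data are our own.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model with intercept; returns (coef, rss)."""
    rows = len(y) - p
    X = np.ones((rows, p + 1))
    for lag in range(1, p + 1):
        X[:, lag] = y[p - lag: len(y) - lag]   # column of lag-`lag` values
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = float(np.sum((target - X @ coef) ** 2))
    return coef, rss

def select_order(y, max_p=5):
    """Pick the AR order minimizing AIC, mimicking in spirit the
    adaptive order identification described in the abstract."""
    best_p, best_aic = 1, np.inf
    for p in range(1, max_p + 1):
        _, rss = fit_ar(y, p)
        n = len(y) - p
        aic = n * np.log(rss / n + 1e-12) + 2 * (p + 1)   # AIC for Gaussian errors
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p

# Toy AR(2) series driven by fixed-seed noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(400)
y = np.zeros(400)
for t in range(2, 400):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]
coef, rss = fit_ar(y, 2)
best_p = select_order(y)
```

In the paper's setting this selection is re-run adaptively as new CGM readings arrive, so the orders track the time-varying nonstationarity of the data.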

AAAI Conference 2019 Conference Paper

CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling

  • Ning Miao
  • Hao Zhou
  • Lili Mou
  • Rui Yan
  • Lei Li

In real-world applications of natural language generation, there are often constraints on the target sentences in addition to fluency and naturalness requirements. Existing language generation techniques are usually based on recurrent neural networks (RNNs). However, it is non-trivial to impose constraints on RNNs while maintaining generation quality, since RNNs generate sentences sequentially (or with beam search) from the first word to the last. In this paper, we propose CGMH, a novel approach using Metropolis-Hastings sampling for constrained sentence generation. CGMH allows complicated constraints such as the occurrence of multiple keywords in the target sentences, which cannot be handled in traditional RNN-based approaches. Moreover, CGMH works in the inference stage, and does not require parallel corpora for training. We evaluate our method on a variety of tasks, including keywords-to-sentence generation, unsupervised sentence paraphrasing, and unsupervised sentence error correction. CGMH achieves high performance compared with previous supervised methods for sentence generation. Our code is released at https://github.com/NingMiao/CGMH.
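The sampling idea can be shown with a toy word-level Metropolis-Hastings chain: propose a local insert/delete/replace edit, then accept or reject against a target score that hard-enforces keyword constraints. Everything here is a deliberate simplification of CGMH: the score function stands in for a language-model fluency score, the tiny vocabulary is made up, and the acceptance rule ignores proposal asymmetry.

```python
import math
import random

KEYWORDS = {"coffee", "morning"}
VOCAB = ["i", "drink", "coffee", "every", "morning", "the", "hot"]

def score(sent):
    """Toy target log-score: keywords are a hard constraint,
    and shorter sentences are preferred."""
    if not KEYWORDS.issubset(sent):
        return -1e9
    return -0.5 * len(sent)

def propose(sent):
    """Random word-level insert / delete / replace, as in CGMH's local moves."""
    sent = list(sent)
    op = random.choice(["insert", "delete", "replace"])
    i = random.randrange(len(sent) + (op == "insert"))
    if op == "insert":
        sent.insert(i, random.choice(VOCAB))
    elif op == "delete" and len(sent) > 1:
        del sent[i]
    else:
        sent[i] = random.choice(VOCAB)
    return sent

def cgmh_toy(init, steps=2000, seed=0):
    random.seed(seed)
    cur, cur_s = list(init), score(init)
    for _ in range(steps):
        nxt = propose(cur)
        nxt_s = score(nxt)
        # Metropolis acceptance (proposal asymmetry ignored for brevity)
        if math.log(random.random() + 1e-12) < nxt_s - cur_s:
            cur, cur_s = nxt, nxt_s
    return cur

result = cgmh_toy(["i", "drink", "coffee", "every", "morning"])
```

Because any move that drops a keyword scores -1e9 and is essentially always rejected, the chain explores only sentences satisfying the constraint, which mirrors how CGMH keeps keywords present throughout sampling.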

IJCAI Conference 2019 Conference Paper

Correct-and-Memorize: Learning to Translate from Interactive Revisions

  • Rongxiang Weng
  • Hao Zhou
  • Shujian Huang
  • Lei Li
  • Yifan Xia
  • Jiajun Chen

State-of-the-art machine translation models are still not on a par with human translators. Previous work brings human interaction into the neural machine translation process to obtain improved results in target languages. However, not all translation errors are equal: some are critical, while others are minor. Meanwhile, the same translation mistakes occur repeatedly in similar contexts. To solve both issues, we propose CAMIT, a novel method for translating in an interactive environment. Our proposed method works with critical revision instructions and therefore allows humans to correct arbitrary words in model-translated sentences. In addition, CAMIT learns from and softly memorizes revision actions based on the context, alleviating the issue of repeated mistakes. Experiments in both ideal and real interactive translation settings demonstrate that CAMIT enhances machine translation results significantly while requiring fewer revision instructions from humans compared to previous methods.

IJCAI Conference 2019 Conference Paper

Deep Active Learning for Anchor User Prediction

  • Anfeng Cheng
  • Chuan Zhou
  • Hong Yang
  • Jia Wu
  • Lei Li
  • Jianlong Tan
  • Li Guo

Predicting pairs of anchor users plays an important role in cross-network analysis. Due to the expensive cost of labeling anchor users for training prediction models, we consider in this paper the problem of minimizing the number of user pairs across multiple networks to be labeled while improving the accuracy of the prediction. To this end, we present a deep active learning model for anchor user prediction (DALAUP for short). However, active learning for anchor user sampling faces the challenges of non-i.i.d. user-pair data caused by network structures and the correlation among anchor or non-anchor user pairs. To solve these challenges, DALAUP uses a pair of parameter-sharing neural networks to obtain vector representations of user pairs, and ensembles three query strategies to select the most informative user pairs for labeling and model training. Experiments on real-world social network data demonstrate that DALAUP outperforms state-of-the-art approaches.

IJCAI Conference 2019 Conference Paper

GraspSnooker: Automatic Chinese Commentary Generation for Snooker Videos

  • Zhaoyue Sun
  • Jiaze Chen
  • Hao Zhou
  • Deyu Zhou
  • Lei Li
  • Mingmin Jiang

We demonstrate a web-based software system, GraspSnooker, which is able to automatically generate Chinese text commentaries for snooker game videos. It consists of a video analyzer, a strategy predictor and a commentary generator. As far as we know, it is the first attempt on snooker commentary generation, which might be helpful for snooker learners to understand the game.

NeurIPS Conference 2019 Conference Paper

Kernelized Bayesian Softmax for Text Generation

  • Ning Miao
  • Hao Zhou
  • Chengqi Zhao
  • Wenxian Shi
  • Lei Li

Neural models for text generation require a softmax layer with proper token embeddings during the decoding phase. Most existing approaches adopt a single point embedding for each token. However, a word may have multiple senses in different contexts, some of which can be quite distinct. In this paper, we propose KerBS, a novel approach for learning better embeddings for text generation. KerBS embodies two advantages: (a) it employs a Bayesian composition of embeddings for words with multiple senses; (b) it is adaptive to the semantic variances of words and robust to rare sentence contexts by imposing learned kernels to capture the closeness of words (senses) in the embedding space. Empirical studies show that KerBS significantly boosts the performance of several text generation tasks.

IJCAI Conference 2019 Conference Paper

Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling

  • Pengcheng Yang
  • Fuli Luo
  • Peng Chen
  • Lei Li
  • Zhiyi Yin
  • Xiaodong He
  • Xu Sun

The visual storytelling (VST) task aims at generating a reasonable and coherent paragraph-level story with an image stream as input. Different from a caption, which is a direct and literal description of image content, the story in the VST task tends to contain many imaginary concepts that do not appear in the image. This requires the AI agent to reason and associate with the imaginary concepts based on implicit commonsense knowledge to generate a reasonable story describing the image stream. Therefore, in this work, we present a commonsense-driven generative model, which aims to introduce crucial commonsense from an external knowledge base for visual storytelling. Our approach first extracts a set of candidate knowledge graphs from the knowledge base. Then, an elaborately designed vision-aware directional encoding schema is adopted to effectively integrate the most informative commonsense. Besides, we strive to maximize the semantic similarity within the output during decoding to enhance the coherence of the generated text. Results show that our approach outperforms state-of-the-art systems by a large margin, achieving a 29% relative improvement in CIDEr score. With the additional commonsense and the semantic-relevance-based objective, the generated stories are more diverse and coherent.

NeurIPS Conference 2018 Conference Paper

BRITS: Bidirectional Recurrent Imputation for Time Series

  • Wei Cao
  • Dong Wang
  • Jian Li
  • Hao Zhou
  • Lei Li
  • Yitan Li

Time series are widely used as signals in many classification/regression tasks, and missing values are ubiquitous in them. Given multiple correlated time series, how can we fill in missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset, and a localization dataset for human activity. Experiments show that our model outperforms state-of-the-art methods in both imputation and classification/regression accuracy.

AAAI Conference 2017 Conference Paper

A Nearly-Black-Box Online Algorithm for Joint Parameter and State Estimation in Temporal Models

  • Yusuf Erol
  • Yi Wu
  • Lei Li
  • Stuart Russell

Online joint parameter and state estimation is a core problem for temporal models. Most existing methods are either restricted to a particular class of models (e.g., the Storvik filter) or computationally expensive (e.g., particle MCMC). We propose a novel nearly-black-box algorithm, the Assumed Parameter Filter (APF), a hybrid of particle filtering for state variables and assumed density filtering for parameter variables. It has the following advantages: (a) it is online and computationally efficient; (b) it is applicable to both discrete and continuous parameter spaces with arbitrary transition dynamics. On a variety of toy and real models, APF generates more accurate results within a fixed computation budget compared to several standard algorithms from the literature.

IJCAI Conference 2016 Conference Paper

Swift: Compiled Inference for Probabilistic Programming Languages

  • Yi Wu
  • Lei Li
  • Stuart Russell
  • Rastislav Bodik

A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines - even the compiled ones - incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.

IS Journal 2016 Journal Article

Trust Agent-Based Behavior Induction in Social Networks

  • Lei Li
  • Jianping He
  • Meng Wang
  • Xindong Wu

The essence of social networks is that they can influence public opinion and allow group behaviors to form quickly. Negative group behavior significantly harms societal stability, but existing behavior-induction approaches are too simple and inefficient. To automatically and efficiently induce behavior in social networks, this article introduces trust agents and designs their features according to group behavior features. In addition, a dynamics control mechanism can be generated to coordinate participant behaviors in social networks to avoid a specific restricted negative group behavior.

IROS Conference 2015 Conference Paper

Design and control of robotic exoskeleton with balance stabilizer mechanism

  • Lei Li
  • Kay Hiang Hoon
  • Adela Tow
  • P. H. Lim
  • Kin Huat Low

Robotic exoskeletons have drawn much attention recently due to their potential to help stroke and spinal cord injury patients regain the ability to walk. However, the biggest challenge is balancing the exoskeleton, and how to achieve balance is still an open question. Most of the time, patients using such exoskeleton devices require sufficient upper body strength to control upright posture and manipulate crutches or walking frames to partially support body weight and keep balance. The high energy cost and high risk of falling when using these devices remain problems. In this paper, these issues are tackled with a proposed balance stabilizer mechanism that provides active balance assistance for robotic exoskeletons. The design of a robotic exoskeleton together with the balance stabilizer mechanism is presented and discussed. In addition, a trajectory generation method, which can generate dynamically stable and tunable gait patterns, is also shown. Finally, clinical trial results with a tetraplegic subject are presented and discussed.

NeurIPS Conference 2013 Conference Paper

Multilinear Dynamical Systems for Tensor Time Series

  • Mark Rogers
  • Lei Li
  • Stuart Russell

Many scientific data occur as sequences of multidimensional arrays called tensors. How can hidden, evolving trends in such data be extracted while preserving the tensor structure? The model that is traditionally used is the linear dynamical system (LDS), which treats the observation at each time slice as a vector. In this paper, we propose the multilinear dynamical system (MLDS) for modeling tensor time series and an expectation-maximization (EM) algorithm to estimate the parameters. The MLDS models each time slice of the tensor time series as the multilinear projection of a corresponding member of a sequence of latent, low-dimensional tensors. Compared to the LDS with an equal number of parameters, the MLDS achieves higher prediction accuracy and marginal likelihood for both simulated and real datasets.
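For a second-order tensor (matrix) time series, the MLDS observation step reduces to sandwiching each latent slice between two small factor matrices. The sketch below (our own toy dimensions and variable names) shows the multilinear observation map and checks that it equals an LDS observation with a Kronecker-structured matrix, which is where the parameter savings come from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: latent slices Z_t are 2x3, observed slices X_t are 4x5.
A = rng.standard_normal((4, 2))   # mode-1 factor
B = rng.standard_normal((5, 3))   # mode-2 factor

def emit(Z):
    """Multilinear observation map: X = Z x_1 A x_2 B, i.e. A @ Z @ B.T."""
    return A @ Z @ B.T

Z = rng.standard_normal((2, 3))
X = emit(Z)

# On vectorized slices the same map is the Kronecker product kron(A, B):
# an equivalent LDS would store all 20*6 entries of that observation
# matrix, while the MLDS keeps only the 4*2 + 5*3 factor parameters.
X_vec = np.kron(A, B) @ Z.ravel()
```

The latent transition works the same way with its own pair of small factors, so the whole model stays multilinear while the comparable LDS burns its parameter budget on one large dense matrix.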

IJCAI Conference 2011 Conference Paper

Multi-Agent Plan Recognition with Partial Team Traces and Plan Libraries

  • Hankz Hankui Zhuo
  • Lei Li

Multi-Agent Plan Recognition (MAPR) seeks to identify the dynamic team structures and team behaviors from the observed activity sequences (team traces) of a set of intelligent agents, based on a library of known team activity sequences (team plans). Previous MAPR systems require that team traces and team plans be fully observed. In this paper we relax this constraint, i.e., team traces and team plans are allowed to be partial. This is an important step in applying MAPR to real-world domains, since in many applications it is often difficult to collect full team traces or team plans due to environment limitations, e.g., in military operations. It is also a hard problem, since the information available is limited. We propose a novel approach to recognizing team plans from partial team traces and team plans: we encode the MAPR problem as a satisfiability problem and solve it using a state-of-the-art weighted MAX-SAT solver. We empirically show that our algorithm is both effective and efficient.
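The weighted MAX-SAT formulation can be illustrated with a brute-force solver on a toy encoding. The clause semantics and the tiny example below are illustrative assumptions, not the paper's actual encoding; a real system would hand the clauses to an off-the-shelf solver rather than enumerate assignments.

```python
from itertools import product

def weighted_max_sat(clauses, n_vars):
    """Brute-force weighted MAX-SAT: maximize total weight of satisfied clauses.

    Each clause is (weight, literals); literal +i means variable i is true,
    -i means it is false (variables are 1-indexed, as in DIMACS format).
    """
    best_weight, best_assignment = -1, None
    for bits in product([False, True], repeat=n_vars):
        weight = sum(
            w for w, lits in clauses
            if any((lit > 0) == bits[abs(lit) - 1] for lit in lits)
        )
        if weight > best_weight:
            best_weight, best_assignment = weight, bits
    return best_weight, best_assignment

# Toy encoding: x1 = "team executed plan A", x2 = "team executed plan B".
# A heavy clause forbids both at once; soft clauses score partial observations.
clauses = [(10, [-1, -2]), (5, [1]), (3, [2])]
best = weighted_max_sat(clauses, 2)
```

Here the optimum picks plan A only (weight 15), since its observations outweigh plan B's and the mutual-exclusion clause penalizes choosing both.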

AAAI Conference 2010 Conference Paper

Subjective Trust Inference in Composite Services

  • Lei Li
  • Yan Wang

In Service-Oriented Computing (SOC) environments, the trustworthiness of each service is critical for a service client when selecting one from a large pool of services. The trust value of a service is usually in the range of [0, 1] and is evaluated from the ratings given by service clients, which represent the subjective belief of these service clients in the satisfaction of delivered services. A trust value can thus be taken as a subjective probability, with which one party believes that another party can perform an action in a certain situation. Hence, subjective probability theory should be adopted in trust evaluation. In addition, in SOC environments, a service usually invokes other services offered by different service providers, forming a composite service. Thus, the global trust of a composite service should be evaluated based on complex invocation structures. In this paper, firstly, based on Bayesian inference, we propose a novel method to evaluate the subjective trustworthiness of a service component from a series of ratings given by service clients. Secondly, we interpret the trust dependency caused by service invocations as conditional probability, which is evaluated based on the subjective trust values of service components. Furthermore, we propose a joint subjective probability method to evaluate the subjective global trust of a composite service on the basis of trust dependency. Finally, we present experimental results that illustrate the properties of our proposed subjective global trust inference method.
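The Bayesian ingredient can be sketched with a Beta-Bernoulli model over binary ratings, and the trust-dependency idea with a sequential composition where each invocation's success is conditioned on its caller. This is a simplified illustration under those assumptions, not the paper's joint subjective probability method, and the function names are hypothetical.

```python
def beta_trust(ratings, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta-Bernoulli model over binary ratings.

    With a Beta(a, b) prior and s positive ratings out of n, the posterior
    mean (a + s) / (a + b + n) serves as a subjective trust value in [0, 1].
    """
    s, n = sum(ratings), len(ratings)
    return (prior_a + s) / (prior_a + prior_b + n)

def sequential_trust(component_trusts):
    """Trust of a chain of invocations: multiply the conditional
    success probabilities of each component given its caller succeeded."""
    t = 1.0
    for ct in component_trusts:
        t *= ct
    return t

t1 = beta_trust([1, 1, 1, 0])        # 3 of 4 positive -> (1+3)/(2+4) = 2/3
t2 = beta_trust([1, 1, 1, 1, 1])     # 5 of 5 positive -> (1+5)/(2+5) = 6/7
global_trust = sequential_trust([t1, t2])
```

Richer invocation structures (parallel branches, probabilistic routing) would combine component trusts differently, which is what the paper's global inference addresses.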

IS Journal 2007 Journal Article

Cost-Sensitive Data Preprocessing for Mining Customer Relationship Management Databases

  • Junfeng Pan
  • Qiang Yang
  • Yiming Yang
  • Lei Li
  • Frances Li
  • George Li

A staged framework for data preprocessing has been developed to support data mining and help service providers identify customers who might switch to a competitor. The framework pushes the cost sensitivity and data imbalance of customer retention data into the data preprocessing itself. Tests using a data set from the ACM KDD Cup 1998 showed that the framework outperformed the winner of that data mining and knowledge discovery competition. The framework has also been incorporated into a software system called ED-Money. To demonstrate the framework's ability to predict customer attrition with high accuracy, it was applied to benchmark data and to a real customer attrition data set from a large Chinese mobile telecommunications company.
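One common way to push cost sensitivity into preprocessing, as the abstract describes in general terms, is cost-proportional replication of examples before training. This sketch shows only that generic technique, not the staged framework itself; the function name and cost values are illustrative assumptions.

```python
def cost_sensitive_rebalance(rows, labels, cost):
    """Replicate each example in proportion to its misclassification cost,
    so a cost-blind learner trained on the result behaves cost-sensitively."""
    out = []
    for row, y in zip(rows, labels):
        out.extend([(row, y)] * int(cost[y]))
    return out

data = [("cust_a",), ("cust_b",), ("cust_c",)]
labels = [0, 0, 1]                    # class 1 (e.g. churners) is rare but costly
resampled = cost_sensitive_rebalance(data, labels, cost={0: 1, 1: 5})
# 2 majority examples plus 5 copies of the rare, high-cost one -> 7 rows
```

After this rebalancing, any standard classifier implicitly weights errors on the rare class five times as heavily.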