Arrow Research search

Author name cluster

Cunjian Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI 2026 · Conference Paper

IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

  • Donghao Zhou
  • Jingyu Lin
  • Guibao Shen
  • Quande Liu
  • Jialin Gao
  • Lihao Liu
  • Lan Du
  • Cunjian Chen

Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. To tame identity-preserving generators, the framework introduces two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject those identities while preserving the desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.
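
A minimal sketch of how the two components above could interact, using NumPy stand-ins. Every name here (iterative_identity_discovery, redenoising_identity_injection, the iteration count, the blending strength) is an illustrative assumption, not the paper's actual interface.

```python
# Hypothetical sketch of IdentityStory's two stages; all names and
# parameters are illustrative assumptions, not from the paper.
import numpy as np

def iterative_identity_discovery(face_embeddings, n_iters=3):
    """Distill one cohesive identity vector from per-image face embeddings.

    Assumed mechanics: each round keeps the embeddings closest to the
    current centroid and re-averages, down-weighting off-identity outliers.
    """
    identity = face_embeddings.mean(axis=0)
    for _ in range(n_iters):
        sims = face_embeddings @ identity / (
            np.linalg.norm(face_embeddings, axis=1) * np.linalg.norm(identity) + 1e-8
        )
        keep = sims >= np.median(sims)               # drop the least consistent half
        identity = face_embeddings[keep].mean(axis=0)
    return identity / (np.linalg.norm(identity) + 1e-8)

def redenoising_identity_injection(latent, identity_latent, strength=0.4):
    """Schematic re-denoising: partially re-noise an image latent, then pull
    it toward the identity while keeping most of the original context."""
    noised = latent + strength * np.random.randn(*latent.shape)   # partial re-noising
    return (1 - strength) * noised + strength * identity_latent   # identity injection

# Toy usage with random stand-ins for face embeddings and image latents.
faces = np.random.randn(8, 512)
identity = iterative_identity_discovery(faces)
edited = redenoising_identity_injection(np.random.randn(512), identity)
```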

AAAI 2026 · Conference Paper

Mamba-Driven Multi-View Discriminative Clustering via Global-Local Cross-View Sequence Modeling

  • Yuanyang Zhang
  • Xinhang Wan
  • Chao Zhang
  • Jie Xu
  • Cunjian Chen
  • Tien-Tsin Wong
  • Li Yao
  • Yijie Lin

Multi-view clustering (MVC) has recently garnered increasing attention for its ability to partition unlabeled samples into distinct clusters by leveraging complementary and consistent information from different views. Existing MVC methods primarily combine deep neural networks with contrastive learning for cross-view representation learning, yet often overlook the inherent global-local structural relationships among samples. While GNN-based methods capture local structures, they struggle to model global dependencies, leading to inferior inter-cluster separability. In contrast, Transformer-based methods excel at global aggregation but suffer from quadratic complexity, and their attention smoothing effect weakens fine-grained local structures, resulting in suboptimal intra-cluster compactness. To address these limitations, we propose a novel end-to-end MVC framework called Mamba-Driven Multi-View Discriminative Clustering via Global-Local Cross-View Sequence Modeling (MGLC). By flexibly constructing multi-view sequences, MGLC fully exploits the efficient sequence modeling capabilities of Mamba to jointly model cross-view dependencies and global-local structural relationships among samples. Furthermore, MGLC introduces a Cross-Mamba Fusion module to dynamically integrate cross-view and global-local structural representations. Additionally, MGLC incorporates a Dual Calibration Contrastive Learning module, guided by high-confidence pseudo-labels, that adaptively refines both feature and semantic representations while mitigating false negatives among semantically similar samples. Extensive comparative experiments and ablation studies demonstrate the effectiveness of MGLC.
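
The false-negative mitigation described in the last sentences can be pictured as a masked contrastive loss: negatives that share a high-confidence pseudo-label with the anchor are removed from the denominator. The NumPy sketch below is a guess at that mechanism; masked_contrastive_loss, tau, and the masking rule are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of contrastive learning with
# pseudo-label-guided false-negative masking, as motivated in the abstract.
import numpy as np

def masked_contrastive_loss(z1, z2, pseudo_labels, confident, tau=0.5):
    """z1, z2: L2-normalized features of two views, shape (n, d).
    pseudo_labels: cluster assignments (n,); confident: boolean mask (n,)."""
    n = z1.shape[0]
    sims = z1 @ z2.T / tau                          # cross-view similarities
    same = pseudo_labels[:, None] == pseudo_labels[None, :]
    both_conf = confident[:, None] & confident[None, :]
    false_neg = same & both_conf & ~np.eye(n, dtype=bool)
    logits = np.where(false_neg, -np.inf, sims)     # exclude likely false negatives
    log_den = np.log(np.exp(logits).sum(axis=1))    # exp(-inf) contributes 0
    return (log_den - np.diag(sims)).mean()         # InfoNCE with masked denominator

# Toy usage on random normalized features.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 32)); z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 = rng.normal(size=(16, 32)); z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
print(masked_contrastive_loss(z1, z2, rng.integers(0, 4, 16), rng.random(16) > 0.3))
```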

TMLR 2025 · Journal Article

Latte: Latent Diffusion Transformer for Video Generation

  • Xin Ma
  • Yaohui Wang
  • Xinyuan Chen
  • Gengyun Jia
  • Ziwei Liu
  • Yuan-Fang Li
  • Cunjian Chen
  • Yu Qiao

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.
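
One way to picture the spatial/temporal decomposition the abstract mentions is a shape-level sketch of an interleaved variant: blocks alternately attend over the patch axis of each frame and the frame axis of each patch. The stubs below only demonstrate the reshaping; interleaved_latte and the block stand-ins are hypothetical names, not Latte's released code.

```python
# Shape-level sketch of one plausible spatial/temporal factorization; the
# Transformer blocks are stubbed out as identity functions.
import numpy as np

def spatial_block(x):   # stand-in: would attend over the patch axis S
    return x

def temporal_block(x):  # stand-in: would attend over the frame axis T
    return x

def interleaved_latte(tokens, depth=4):
    """tokens: (batch B, frames T, patches S, dim D) latent video tokens."""
    b, t, s, d = tokens.shape
    x = tokens
    for _ in range(depth):
        # Spatial pass: fold frames into the batch, attend within each frame.
        x = spatial_block(x.reshape(b * t, s, d)).reshape(b, t, s, d)
        # Temporal pass: fold patches into the batch, attend across frames.
        xt = x.transpose(0, 2, 1, 3).reshape(b * s, t, d)
        x = temporal_block(xt).reshape(b, s, t, d).transpose(0, 2, 1, 3)
    return x

print(interleaved_latte(np.zeros((1, 8, 64, 16))).shape)  # (1, 8, 64, 16)
```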

NeurIPS 2025 · Conference Paper

SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

  • Quanjian Song
  • Donghao Zhou
  • Jingyu Lin
  • Fei Shen
  • Jiaze Wang
  • Xiaowei Hu
  • Cunjian Chen
  • Pheng-Ann Heng

Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, i.e., maintaining the same scene across multiple stories, which remains largely unexplored. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.
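
A hedged toy of what Long-Term Scene-Sharing Attention might look like mechanically: keys and values cached from earlier scene generations are appended to the current image's own, so later generations can attend to the stored scene. scene_sharing_attention and the bank_* arrays are assumptions for illustration, not the paper's code.

```python
# Toy single-head attention with a shared scene cache; names are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scene_sharing_attention(q, k, v, bank_k, bank_v):
    """q, k, v: (n, d) tokens of the current image.
    bank_k, bank_v: (m, d) keys/values cached from earlier scene images."""
    k_all = np.concatenate([k, bank_k], axis=0)   # current + cached keys
    v_all = np.concatenate([v, bank_v], axis=0)   # current + cached values
    attn = softmax(q @ k_all.T / np.sqrt(q.shape[-1]))
    return attn @ v_all                           # (n, d) scene-aware output

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
bank_k, bank_v = (rng.normal(size=(32, 8)) for _ in range(2))
print(scene_sharing_attention(q, k, v, bank_k, bank_v).shape)  # (16, 8)
```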

NeurIPS 2025 · Conference Paper

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

  • Chenchen Tan
  • Youyang Qu
  • Xinghao Li
  • Hui Zhang
  • Shujie Cui
  • Cunjian Chen
  • Longxiang Gao

The increase in computing power and the necessity of AI-assisted decision-making are driving the growing application of large language models (LLMs). Alongside this, the potential retention of sensitive data by LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression, which attenuates attention to fact-bearing tokens without disrupting the LLM's linguistic structure; and (2) hallucination-resistant response shaping, which discourages fabricated completions when the model is queried about unlearned content. AS realizes these objectives through two attention-level interventions: importance-aware suppression, applied to the unlearning set to reduce reliance on memorized knowledge, and attention-guided retention enhancement, which reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. The two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
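
The dual-loss objective described above can be sketched as two terms over attention maps: a suppression term that penalizes attention mass on fact-bearing tokens of the forget set, and a retention term that rewards attention on essential tokens of the retain set. Everything below (the function names, the log form of the retention term, the weight lam) is an assumption for illustration, not the AS paper's loss.

```python
# Schematic dual-loss; real attention maps from an LLM are stubbed as arrays.
import numpy as np

def suppression_loss(attn, fact_mask):
    """Penalize attention mass landing on fact-bearing (to-forget) tokens."""
    return (attn * fact_mask).sum(axis=-1).mean()

def retention_loss(attn, essential_mask):
    """Reward attention mass on semantically essential retained tokens."""
    return -np.log((attn * essential_mask).sum(axis=-1) + 1e-8).mean()

def attention_shifting_objective(forget_attn, fact_mask,
                                 retain_attn, essential_mask, lam=1.0):
    """Joint objective: suppress on the forget set, retain on the retain set."""
    return (suppression_loss(forget_attn, fact_mask)
            + lam * retention_loss(retain_attn, essential_mask))

# Toy usage: rows are fake attention distributions over 10 tokens.
rng = np.random.default_rng(0)
fa = rng.random((4, 10)); fa /= fa.sum(-1, keepdims=True)
ra = rng.random((4, 10)); ra /= ra.sum(-1, keepdims=True)
print(attention_shifting_objective(fa, fa > 0.15, ra, ra > 0.10))
```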