Arrow Research

Author name cluster

Zixuan Ye

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches; it is not a full author identity disambiguation profile.

4 papers
1 author row

Possible papers


AAAI Conference 2026, Conference Paper

IPFormer: Instance Prompt-guided Transformer for Multi-modal Multi-shot Video Understanding

  • Yujia Liang
  • Jile Jiao
  • Xuetao Feng
  • Xinchen Liu
  • Kun Liu
  • Yuan Wang
  • Zixuan Ye
  • Hao Lu

Video Large Language Models (VideoLLMs), which adopt large language models for video understanding, have demonstrated strong results on single-shot videos. However, they usually struggle with multi-shot videos involving frequent shot changes, varying camera angles, etc., which makes it hard for VideoLLMs to answer questions about multiple instances or shots across the whole video. We attribute this challenge to two issues: 1) the lack of multi-shot, multi-instance annotations in existing datasets, and 2) the neglect of instance-aware modeling in current VideoLLMs. Therefore, we first introduce a new dataset, termed MultiClip-Bench, featuring dense descriptions and question-answering pairs tailored for multi-shot and multi-instance scenarios. Moreover, since existing VideoLLMs neglect explicit modeling of instance-related features, we propose a novel Instance Prompt-guided Transformer, named IPFormer, to achieve instance-aware video understanding. In IPFormer, we design a simple but effective instance-aware feature injection module, which encodes instance features as instance prompts via an attention-based connector. In this way, IPFormer can aggregate instance-specific information across multiple shots. Extensive experiments show not only that our dataset and model significantly improve multi-shot video understanding, but also that MultiClip-Bench provides valuable training data and benchmarks for various video understanding tasks.
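As a rough illustration of the attention-based connector idea described in the abstract, the sketch below shows how per-instance features might be encoded as instance prompts and prepended to the visual tokens fed to a VideoLLM. All module names, tensor shapes, and design details here are assumptions for illustration, not the actual IPFormer implementation.

```python
# Hypothetical sketch of an attention-based connector that turns per-instance
# features into "instance prompts" for a VideoLLM. Names and shapes are assumptions.
import torch
import torch.nn as nn

class InstancePromptConnector(nn.Module):
    def __init__(self, inst_dim: int, vid_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(inst_dim, llm_dim)   # instance features -> queries
        self.kv_proj = nn.Linear(vid_dim, llm_dim)   # video tokens -> keys/values
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out = nn.Linear(llm_dim, llm_dim)

    def forward(self, inst_feats, video_tokens):
        # inst_feats: (B, N_inst, inst_dim) pooled per-instance features across shots
        # video_tokens: (B, N_tok, vid_dim) frame/shot tokens from the visual encoder
        q = self.q_proj(inst_feats)
        kv = self.kv_proj(video_tokens)
        prompts, _ = self.attn(q, kv, kv)            # aggregate instance-specific context
        prompts = self.out(prompts)                  # (B, N_inst, llm_dim) instance prompts
        # Prepend the instance prompts to the visual tokens passed to the language model.
        return torch.cat([prompts, kv], dim=1)
```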

AAAI Conference 2025, Conference Paper

Training Matting Models Without Alpha Labels

  • Wenze Liu
  • Zixuan Ye
  • Hao Lu
  • Zhiguo Cao
  • Xiangyu Yue

The labeling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations, such as trimaps coarsely indicating the foreground/background, as supervision. We show that the cooperation between semantics learned from the indicated known regions and properly assumed matting rules can help infer alpha values in transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) in each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance between similar pairs on the alpha matte and on the corresponding image to be consistent. In this way, alpha values can be propagated from the learned known regions to the unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a loss on the known regions and the proposed DDC loss. Experiments on the AM-2K and P3M-10K datasets show that our paradigm achieves performance comparable to the fine-label-supervised baseline, and sometimes offers even more satisfying results than human-labeled ground truth.
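The directional distance consistency idea can be made concrete with a small sketch. The snippet below is an assumption-laden approximation of such a loss, not the paper's exact formulation: the directional offsets, the similarity threshold tau, and the border handling via torch.roll are all simplifications chosen for brevity.

```python
import torch

def ddc_loss(alpha, image, shifts=((0, 1), (1, 0), (1, 1), (1, -1)), tau=0.1):
    # alpha: (B, 1, H, W) predicted matte; image: (B, C, H, W) input, both in [0, 1].
    # For each directional neighbour, the pixel-pair distance measured on the image
    # should match the distance measured on the alpha matte; only similar pairs
    # (small image distance) contribute. Wrap-around at the borders is ignored here.
    loss = 0.0
    for dy, dx in shifts:
        a_nb = torch.roll(alpha, shifts=(dy, dx), dims=(2, 3))
        i_nb = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
        d_img = (image - i_nb).abs().mean(dim=1, keepdim=True)  # image-space distance
        d_alp = (alpha - a_nb).abs()                            # matte-space distance
        similar = (d_img < tau).float()                         # mask of similar pairs
        loss = loss + (similar * (d_alp - d_img).abs()).mean()
    return loss / len(shifts)
```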

AAAI Conference 2023, Conference Paper

Infusing Definiteness into Randomness: Rethinking Composition Styles for Deep Image Matting

  • Zixuan Ye
  • Yutong Dai
  • Chaoyi Hong
  • Zhiguo Cao
  • Hao Lu

We study the composition style in deep image matting, a notion that characterizes a data generation flow on how to exploit limited foregrounds and random backgrounds to form a training dataset. Prior art executes this flow in a completely random manner by simply going through the foreground pool or by optionally combining two foregrounds before foreground-background composition. In this work, we first show that naive foreground combination can be problematic and therefore derive an alternative formulation to reasonably combine foregrounds. Our second contribution is an observation that matting performance can benefit from a certain occurrence frequency of combined foregrounds and their associated source foregrounds during training. Inspired by this, we introduce a novel composition style that binds the source and combined foregrounds in a definite triplet. In addition, we also find that different orders of foreground combination lead to different foreground patterns, which further inspires a quadruplet-based composition style. Results under controlled experiments on four matting baselines show that our composition styles outperform existing ones and invite consistent performance improvement on both composited and real-world datasets. Code is available at: https://github.com/coconuthust/composition_styles
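To give a concrete picture of the triplet composition style, the sketch below composes two source foregrounds and their combination onto separate backgrounds, binding all three into one training sample. The standard "over" compositing formula and the function names are stand-ins for illustration; the paper derives its own foreground-combination formulation, which may differ.

```python
def combine_foregrounds(fp1, a1, fp2, a2):
    # fp1, fp2: alpha-premultiplied foreground colours (H, W, 3) as arrays;
    # a1, a2: alpha mattes (H, W, 1) in [0, 1].
    # Standard "over" compositing, used here as a stand-in for the paper's formula.
    a = a1 + a2 * (1.0 - a1)
    fp = fp1 + fp2 * (1.0 - a1)
    return fp, a

def triplet_composites(fp1, a1, fp2, a2, backgrounds):
    # Bind the two source foregrounds and their combination into one definite triplet,
    # each composited onto its own background image; returns (image, alpha) pairs.
    fpc, ac = combine_foregrounds(fp1, a1, fp2, a2)
    samples = []
    for (fp, a), bg in zip([(fp1, a1), (fp2, a2), (fpc, ac)], backgrounds):
        samples.append((fp + bg * (1.0 - a), a))  # composited image and its alpha label
    return samples
```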

NeurIPS Conference 2022, Conference Paper

SAPA: Similarity-Aware Point Affiliation for Feature Upsampling

  • Hao Lu
  • Wenze Liu
  • Zixuan Ye
  • Hongtao Fu
  • Yuliang Liu
  • Zhiguo Cao

We introduce point affiliation into feature upsampling, a notion that describes the affiliation of each upsampled point to a semantic cluster formed by local decoder feature points with semantic similarity. By rethinking point affiliation, we present a generic formulation for generating upsampling kernels. The kernels encourage not only semantic smoothness but also boundary sharpness in the upsampled feature maps. Such properties are particularly useful for some dense prediction tasks such as semantic segmentation. The key idea of our formulation is to generate similarity-aware kernels by comparing the similarity between each encoder feature point and the spatially associated local region of decoder features. In this way, the encoder feature point can function as a cue to inform the semantic cluster of upsampled feature points. To embody the formulation, we further instantiate a lightweight upsampling operator, termed Similarity-Aware Point Affiliation (SAPA), and investigate its variants. SAPA invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, depth estimation, and image matting. Code is available at: https://github.com/poppinace/sapa
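A highly simplified sketch of the similarity-aware kernel idea follows: each upsampled location compares an encoder embedding against a local window of decoder embeddings, and the resulting softmax weights mix the decoder values. The class name, projection dimensions, and nearest-neighbour window replication are assumptions for illustration; the published SAPA operator and its variants differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityAwareUpsample(nn.Module):
    # Minimal sketch of similarity-aware kernel upsampling (x2), loosely following the
    # idea described above; not the actual SAPA operator.
    def __init__(self, enc_dim, dec_dim, embed_dim=32, k=3, scale=2):
        super().__init__()
        self.k, self.scale = k, scale
        self.enc_proj = nn.Conv2d(enc_dim, embed_dim, 1)
        self.dec_proj = nn.Conv2d(dec_dim, embed_dim, 1)

    def forward(self, enc, dec):
        # enc: (B, Ce, sH, sW) high-res encoder feature; dec: (B, Cd, H, W) low-res decoder feature.
        B, Cd, H, W = dec.shape
        q = self.enc_proj(enc)                                    # (B, E, sH, sW)
        kfeat = self.dec_proj(dec)                                # (B, E, H, W)
        # Local k x k windows of decoder embeddings, replicated to the upsampled grid.
        k_unf = F.unfold(kfeat, self.k, padding=self.k // 2)
        k_unf = k_unf.view(B, -1, self.k * self.k, H, W)
        k_unf = F.interpolate(k_unf.flatten(1, 2), scale_factor=self.scale, mode='nearest')
        k_unf = k_unf.view(B, -1, self.k * self.k, H * self.scale, W * self.scale)
        # Similarity between each encoder point and its decoder window -> softmax kernel.
        attn = (q.unsqueeze(2) * k_unf).sum(1).softmax(dim=1)     # (B, k*k, sH, sW)
        # Apply the kernel to the decoder values in the same windows.
        v_unf = F.unfold(dec, self.k, padding=self.k // 2).view(B, Cd, self.k * self.k, H, W)
        v_unf = F.interpolate(v_unf.flatten(1, 2), scale_factor=self.scale, mode='nearest')
        v_unf = v_unf.view(B, Cd, self.k * self.k, H * self.scale, W * self.scale)
        return (v_unf * attn.unsqueeze(1)).sum(2)                 # (B, Cd, sH, sW)
```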