Arrow Research · Search

Author name cluster

Zhen Xing

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2026 · Conference Paper

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

  • Sicheng Xie
  • Haidong Cao
  • Zejia Weng
  • Zhen Xing
  • Haoran Chen
  • Shiwei Shen
  • Jiaqi Leng
  • Zuxuan Wu

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot employs a Video Prediction Model to learn a rich and implicit representation of robot dynamics by generating robot videos from human input, which in turn guides a decoupled action decoder. Our real-world experiments demonstrate that this approach not only achieves high performance on seen tasks but also exhibits significant one-shot generalization to novel positions, objects, instances, and even new task categories.
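
To make the two-stage design described in the abstract concrete, here is a minimal, hypothetical PyTorch sketch of a video-prediction backbone feeding a decoupled action decoder. Every module name, tensor shape, and dimension below is an illustrative stand-in and is not taken from the Human2Robot paper.

```python
# Illustrative sketch only: a two-stage layout in the spirit of the abstract
# (video prediction conditioned on human frames, then a decoupled action decoder).
# All module names, shapes, and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class VideoPredictor(nn.Module):
    """Encodes human-video frames and predicts robot-dynamics latents."""
    def __init__(self, frame_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.to_robot_latent = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, human_frames):           # (B, T, frame_dim)
        feats, _ = self.encoder(human_frames)  # (B, T, hidden_dim)
        return self.to_robot_latent(feats)     # predicted robot-dynamics latents

class ActionDecoder(nn.Module):
    """Decoupled head that maps predicted robot latents to per-frame actions."""
    def __init__(self, hidden_dim=256, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, action_dim))

    def forward(self, robot_latents):          # (B, T, hidden_dim)
        return self.head(robot_latents)        # (B, T, action_dim)

# Toy forward pass on random data.
predictor, decoder = VideoPredictor(), ActionDecoder()
human_frames = torch.randn(2, 16, 512)         # batch of 2 clips, 16 frames each
actions = decoder(predictor(human_frames))     # (2, 16, 7)
print(actions.shape)
```

Keeping the action decoder separate from the video predictor, as in this sketch, is what lets the generative stage learn robot dynamics independently of any particular action space.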

AAAI Conference 2025 · Conference Paper

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

  • Hui Zhang
  • Zuxuan Wu
  • Zhen Xing
  • Jie Shao
  • Yu-Gang Jiang

Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on text prompts. However, their generation process involves dozens of denoising steps to produce photorealistic images or videos, which is computationally expensive. Unlike previous methods that design "one-size-fits-all" approaches to speed up inference, we argue that the number of denoising steps should be sample-specific, conditioned on the richness of the input text. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized with a policy gradient method to maximize a carefully designed reward function that balances inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves visual quality similar to the baseline, which uses a fixed 50 denoising steps, while reducing inference time by at least 33% and by as much as 40%. Furthermore, our method can be used on top of other acceleration methods to provide further speed benefits. Lastly, qualitative analysis shows that AdaDiff allocates more steps to more informative prompts and fewer steps to simpler prompts.
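
As a rough illustration of instance-specific step selection trained with a policy gradient, the sketch below implements a REINFORCE-style update for a prompt-conditioned step-count policy in PyTorch. The step options, reward function, and network are hypothetical stand-ins, not AdaDiff's actual design.

```python
# Illustrative sketch only: a REINFORCE-style update for a prompt-conditioned
# step-count policy. Reward shape, step options, and the network are assumptions.
import torch
import torch.nn as nn

STEP_OPTIONS = torch.tensor([10, 20, 30, 40, 50])   # candidate denoising budgets

class StepPolicy(nn.Module):
    def __init__(self, text_dim=768, n_options=len(STEP_OPTIONS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_options))

    def forward(self, text_emb):                      # (B, text_dim)
        return torch.distributions.Categorical(logits=self.net(text_emb))

def reward(quality, steps, max_steps=50, alpha=0.5):
    # Hypothetical reward: trade off generation quality against compute.
    return quality - alpha * steps / max_steps

policy = StepPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

text_emb = torch.randn(8, 768)                        # stand-in prompt embeddings
dist = policy(text_emb)
choice = dist.sample()                                # index into STEP_OPTIONS
steps = STEP_OPTIONS[choice].float()
quality = torch.rand(8)                               # placeholder quality score per sample

r = reward(quality, steps)
loss = -(dist.log_prob(choice) * (r - r.mean())).mean()   # REINFORCE with a mean baseline
opt.zero_grad(); loss.backward(); opt.step()
```

In a real system, the placeholder quality score would come from evaluating the images or videos actually generated with the sampled step budget.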

IJCAI Conference 2025 · Conference Paper

Multimodal Inference with Incremental Tabular Attributes

  • Xinda Chen
  • Zhen Xing
  • Zixian Zhang
  • Weimin Tan
  • Bo Yan

Multimodal learning with visual and tabular modalities has become increasingly popular, especially in healthcare. As new equipment is adopted or new factors are introduced, the tabular modality keeps changing. However, the standard process for training multimodal AI models requires tables to have fixed columns at training and inference time, and is therefore ill-suited to dynamically changing tables. New methods are thus needed to handle such tables efficiently in multimodal learning. In this paper, we introduce a new task, multimodal inference with incremental tabular attributes, which aims to enable trained multimodal models to efficiently leverage incremental attributes in the tabular modality at inference time. We implement a specialized encoder that disentangles the latent representations of the incremental tabular attributes, both from one another and from the old attributes, to reduce information redundancy, and further aligns the incremental attributes with the visual modality through a consistency loss to improve information richness. Experimental results across five public datasets show that our method effectively utilizes incremental tabular attributes, achieving state-of-the-art performance in general scenarios. Beyond inference, we also find that our method achieves better performance in fully supervised settings, suggesting a new training style for multimodal learning with tables.
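
One plausible way to combine a redundancy penalty between old and incremental tabular embeddings with a visual-consistency term is sketched below in PyTorch. The encoders, cosine-based losses, and loss weights are assumptions for illustration only, not the paper's implementation.

```python
# Illustrative sketch only: a redundancy penalty between old and incremental
# tabular embeddings plus a visual-consistency term. All pieces are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabEncoder(nn.Module):
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)

old_enc, new_enc = TabEncoder(in_dim=32), TabEncoder(in_dim=8)

old_attrs = torch.randn(16, 32)      # attributes available at training time
new_attrs = torch.randn(16, 8)       # incremental attributes introduced later
visual_feat = torch.randn(16, 128)   # stand-in features from a visual encoder

z_old, z_new = old_enc(old_attrs), new_enc(new_attrs)

# Redundancy penalty: discourage the incremental embedding from duplicating
# information already carried by the old attributes (absolute cosine similarity).
redundancy = F.cosine_similarity(z_old, z_new, dim=-1).abs().mean()

# Consistency term: pull the incremental embedding toward the visual features.
consistency = 1 - F.cosine_similarity(z_new, visual_feat, dim=-1).mean()

loss = consistency + 0.1 * redundancy
loss.backward()
print(float(loss))
```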

NeurIPS Conference 2024 · Conference Paper

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

  • Miaosen Zhang
  • Yixuan Wei
  • Zhen Xing
  • Yifei Ma
  • Zuxuan Wu
  • Ji Li
  • Zheng Zhang
  • Qi Dai

Modern vision models are trained on very large, noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent to produce the desired results in certain aspects, e.g., visual aesthetics, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural, or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on these findings, we propose a preference-based reinforcement learning method that fine-tunes the vision models to distill knowledge from both LLM reasoning and the aesthetic models, better aligning the vision models with human aesthetics. Meanwhile, since benchmarks designed for evaluating retrieval systems are rare, we leverage large multimodal models (LMMs) and their strong capabilities to evaluate aesthetic performance. As aesthetic assessment is one of the most subjective tasks, to validate the robustness of LMMs we further propose a novel dataset named HPIR to benchmark alignment with human aesthetics. Experiments demonstrate that our method significantly enhances the aesthetic behavior of the vision models under several metrics. We believe the proposed algorithm can serve as a general practice for aligning vision models with human values.
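
As a minimal sketch of preference-based fine-tuning, the snippet below trains a hypothetical (query, image) scorer with a Bradley-Terry style pairwise loss, which is one common way to distill preference signals such as those coming from LLM reasoning or aesthetic models. The scorer, embeddings, and preference labels are stand-ins, not the paper's method.

```python
# Illustrative sketch only: pairwise (Bradley-Terry style) preference fine-tuning
# of a retrieval scorer. Scorer, embeddings, and labels are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scorer(nn.Module):
    """Scores (query, image) pairs; stands in for a fine-tuned vision model head."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, query_emb, image_emb):
        return (self.proj(query_emb) * image_emb).sum(-1)   # dot-product score

scorer = Scorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-5)

query = torch.randn(32, 512)          # query embedding (e.g. after LLM rephrasing)
img_preferred = torch.randn(32, 512)  # image judged more aesthetic by the teacher signal
img_other = torch.randn(32, 512)      # image judged less aesthetic

s_pos = scorer(query, img_preferred)
s_neg = scorer(query, img_other)

# The preferred image should score higher: loss = -log sigmoid(s_pos - s_neg).
loss = F.softplus(-(s_pos - s_neg)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```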

NeurIPS Conference 2024 · Conference Paper

GenRec: Unifying Video Generation and Recognition with Diffusion Models

  • Zejia Weng
  • Xitong Yang
  • Zhen Xing
  • Zuxuan Wu
  • Yu-Gang Jiang

Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we investigate whether such priors derived from a generative process are suitable for video recognition and, eventually, joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs best on class-conditioned image-to-video generation, achieving FVD scores of 46.5 and 49.3 on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. Code will be available at https://github.com/wengzejia1/GenRec.
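
A random-frame conditioning scheme of the kind the abstract mentions can be sketched as masking out a random subset of frames before they are used as the condition, as in the hypothetical PyTorch snippet below; the tensor shapes and masking scheme are assumptions, not GenRec's implementation.

```python
# Illustrative sketch only: build a random-frame conditioning mask so the model
# sees only a random subset of frames as condition during training.
import torch

def random_frame_condition(video, keep_prob=0.5):
    """video: (B, T, C, H, W). Returns masked condition frames and the mask."""
    B, T = video.shape[:2]
    mask = (torch.rand(B, T) < keep_prob).float()   # 1 = frame visible as condition
    cond = video * mask.view(B, T, 1, 1, 1)         # zero out the hidden frames
    return cond, mask

video = torch.randn(2, 14, 3, 64, 64)               # toy clip batch
cond, mask = random_frame_condition(video)
print(cond.shape, mask.sum(dim=1))                  # frames kept per clip
```

Training against conditions with a varying number of visible frames is what would let such a model degrade gracefully when only a few frames are available at recognition time.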