Arrow Research search

Author name cluster

Sirui Han

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
1 author row

Possible papers

9

AAAI Conference 2026 Conference Paper

ManipDreamer3D: Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory

  • Ying Li
  • Xiaobao Wei
  • Xiaowei Chi
  • Yuming Li
  • Zhongyu Zhao
  • Hao Wang
  • Ningning Ma
  • Ming Lu

Data scarcity continues to be a critical bottleneck in robotic manipulation, limiting the ability to train robust and generalizable models. While diffusion models provide a promising approach to synthesizing realistic robotic manipulation videos, their effectiveness hinges on the availability of precise and reasonable control instructions. Current methods primarily rely on 2D trajectories as instruction prompts, which inherently suffer from 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method combines 3D trajectory planning with a 3D occupancy map reconstructed from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory that minimizes path length, avoids collisions, and retimes the motion. Next, we employ a latent editing technique to create video sequences from the initial image latent, the text instruction, and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method significantly reduces the need for human intervention by autonomously planning plausible 3D trajectories. Experimental results demonstrate its superior visual quality and precision.
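
As a rough illustration of the occupancy-aware objective described above, the sketch below scores a candidate end-effector trajectory against a voxel occupancy grid, trading off path length against collisions. This is a minimal sketch under stated assumptions, not the paper's planner: the grid layout, voxel_size, and collision_weight are illustrative choices, and the retiming step from the abstract is omitted.

```python
# Minimal sketch (not the authors' code) of an occupancy-aware trajectory cost.
# Assumes a boolean voxel grid `occ` and a trajectory as an (N, 3) waypoint array.
import numpy as np

def trajectory_cost(traj, occ, voxel_size=0.01, collision_weight=100.0):
    """Path-length term plus a penalty for waypoints inside occupied voxels."""
    length = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()
    idx = np.clip((traj / voxel_size).astype(int), 0, np.array(occ.shape) - 1)
    collisions = occ[idx[:, 0], idx[:, 1], idx[:, 2]].sum()
    return length + collision_weight * collisions

# Toy usage: a straight line through an empty 20x20x20 voxel workspace.
occ = np.zeros((20, 20, 20), dtype=bool)
traj = np.linspace([0.01, 0.01, 0.01], [0.15, 0.15, 0.10], num=16)
print(trajectory_cost(traj, occ))
```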

AAAI Conference 2026 Conference Paper

Outlier Matters: Efficient Long-to-Short Reasoning via Outlier-Guided Model Merging

  • Qiyuan Zhu
  • Dezhi Li
  • Lujun Li
  • Xiaoyu Qin
  • Wei Li
  • Hao Gu
  • Hua Xu
  • Sirui Han

Large Reasoning Language Models (LRMs) have recently shown remarkable performance on complex reasoning tasks, but their extensive reasoning chains incur substantial computational overhead. To address this challenge, we propose Outlier-aware Reasoning Conciseness Adaptive Merge (ORCA), a novel plug-and-play model merging framework that leverages outlier activation patterns to fuse base models with reasoning models. ORCA introduces three key innovations: (1) adaptive alignment that reduces conflicts between disparate activation patterns during merging, (2) outlier-guided allocation that assigns merging coefficients proportional to each layer's reasoning importance as indicated by outlier concentrations, and (3) dynamic probe-based adjustment that adapts merging coefficients during inference based on input-specific activation characteristics. These strategies allow seamless integration into existing merging pipelines while creating unified models that maintain reasoning accuracy with significantly reduced response verbosity. Comprehensive evaluation across six benchmarks using Qwen and LLaMA models shows that ORCA reduces average response length by 55% while improving accuracy by 2.4-5.7% over existing methods.
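
One way to picture the outlier-guided allocation in (2) is sketched below: layers whose cached activations contain more outliers receive larger merging coefficients for the reasoning model's weights. This is a hedged sketch, not ORCA's implementation; the outlier threshold k, the normalization to an average coefficient of 0.5, and the toy shapes are assumptions.

```python
# Hedged sketch of outlier-guided layer-wise merging (not the authors' code).
import torch

def outlier_concentration(acts, k=3.0):
    """Fraction of activations lying more than k standard deviations from the mean."""
    z = (acts - acts.mean()) / (acts.std() + 1e-8)
    return (z.abs() > k).float().mean().item()

def merge_layers(base_layers, reasoning_layers, layer_acts, k=3.0):
    """Per-layer linear merge; coefficients scale with each layer's outlier mass."""
    scores = torch.tensor([outlier_concentration(a, k) for a in layer_acts])
    # Normalize so the average coefficient is 0.5, then clamp to [0, 1].
    alphas = (scores / (scores.sum() + 1e-8) * 0.5 * len(scores)).clamp(0.0, 1.0)
    return [(1 - a) * wb + a * wr
            for a, wb, wr in zip(alphas.tolist(), base_layers, reasoning_layers)]

# Toy usage: two random "layers" per model plus cached activations per layer.
base = [torch.randn(4, 4) for _ in range(2)]
reasoning = [torch.randn(4, 4) for _ in range(2)]
acts = [torch.randn(1000) for _ in range(2)]
print([w.shape for w in merge_layers(base, reasoning, acts)])
```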

AAAI Conference 2026 Conference Paper

Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging

  • Lujun Li
  • Qiyuan Zhu
  • Jiacheng Wang
  • Xiaoyu Qin
  • Wei Li
  • Hao Gu
  • Sirui Han
  • Yike Guo

Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods aim to achieve greater efficiency by consolidating several experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared U-matrices while enabling effective merging of the expert-specific V components. Specifically, Sub-MoE consists of two innovative stages: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first performs Experts Union Decomposition to derive the shared U-matrix across experts in the same group, then applies frequency-based merging for individual V-matrices, and completes expert reconstruction using the merged V-matrix. In this way, we align and fuse experts in a shared subspace. Additionally, the framework can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5/3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%/86% of original performance with 25%/50% expert reduction on Mixtral-8×7B in zero-shot benchmarks.
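
A minimal sketch of the subspace merging idea, under stated assumptions: joint SVD over horizontally concatenated expert weights yields a shared U-matrix, and the per-expert V blocks are then merged before reconstruction. The sketch averages the V blocks for simplicity, whereas the paper uses frequency-based merging; the rank and toy shapes are illustrative.

```python
# Sketch of subspace expert merging via joint SVD (not the paper's code).
import numpy as np

def merge_experts(experts, rank):
    """experts: list of (d_out, d_in) weight matrices from one expert cluster."""
    concat = np.concatenate(experts, axis=1)          # (d_out, n * d_in)
    U, S, Vt = np.linalg.svd(concat, full_matrices=False)
    U, S, Vt = U[:, :rank], S[:rank], Vt[:rank]       # truncate to shared subspace
    d_in = experts[0].shape[1]
    # Split V^T into per-expert blocks and merge them (simple average here;
    # the paper applies frequency-based merging, which this sketch does not model).
    blocks = [Vt[:, i * d_in:(i + 1) * d_in] for i in range(len(experts))]
    Vt_merged = np.mean(blocks, axis=0)
    return U @ np.diag(S) @ Vt_merged                 # merged (d_out, d_in) expert

# Toy usage: merge three random 8x16 experts into one rank-4 expert.
experts = [np.random.randn(8, 16) for _ in range(3)]
print(merge_experts(experts, rank=4).shape)  # (8, 16)
```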

AAAI Conference 2026 Conference Paper

What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

  • Yujin Zhou
  • Pengcheng Wen
  • Jiale Chen
  • Boqin Yin
  • Han Zhu
  • Jiaming Ji
  • Juntao Dai
  • Chi-Min Chan

The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
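
As a concrete picture of the guided-search setting the abstract mentions, the hypothetical sketch below keeps, at each reasoning step, the candidate step a PRM scores highest. The propose and prm_score callables and the <answer> stop marker are stand-ins for illustration, not an API from the paper.

```python
# Hedged sketch of PRM-guided stepwise search; `propose` samples candidate next
# steps and `prm_score` rates a step given the trajectory so far (both stand-ins).
def prm_guided_search(prompt, propose, prm_score, max_steps=8, k=4):
    trajectory = []
    for _ in range(max_steps):
        candidates = propose(prompt, trajectory, k)      # k candidate next steps
        best = max(candidates, key=lambda s: prm_score(prompt, trajectory, s))
        trajectory.append(best)
        if best.endswith("<answer>"):                    # assumed stop marker
            break
    return trajectory

# Toy usage with trivial stand-in callables.
steps = prm_guided_search(
    "2+2?",
    propose=lambda p, t, k: [f"step{len(t)}-{i}" for i in range(k)]
                            + (["done <answer>"] if len(t) >= 1 else []),
    prm_score=lambda p, t, s: len(s),
)
print(steps)
```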

NeurIPS Conference 2025 Conference Paper

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

  • Jiayi Zhou
  • Jiaming Ji
  • Boyuan Chen
  • Jiapeng Sun
  • Wenqi Chen
  • Donghai Hong
  • Sirui Han
  • Yike Guo

Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods such as reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision by comparing grouped responses. Experimental results demonstrate that, beyond out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, whereas baseline RLHF yields only 5.3%. We further validate that Generative RLHF-V achieves near-linear improvement with an increasing number of candidate responses.
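
One plausible reading of the grouped-comparison stage is sketched below: the reward model scores a group of candidate responses to the same prompt, and each response's advantage is normalized within that group. The inputs here are plain numbers for illustration; the paper's GRM is a full multimodal model.

```python
# Hedged sketch of group-relative scoring (an illustration, not the paper's method).
import numpy as np

def grouped_advantages(group_scores):
    """Normalize raw reward scores within one group of sampled responses."""
    scores = np.asarray(group_scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Toy usage: four candidate responses to the same multimodal prompt.
print(grouped_advantages([0.7, 0.2, 0.9, 0.4]))
```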

NeurIPS Conference 2025 Conference Paper

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

  • Boyuan Chen
  • Donghai Hong
  • Jiaming Ji
  • Jiacheng Zheng
  • Bowen Dong
  • Jiayi Zhou
  • Kaile Wang
  • Juntao Dai

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: what essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, not limited to language but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and comprises 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of multi-modal understanding and generation capability, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs to assist judges in multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation, and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data will help facilitate further research on aligning current MLLMs toward the next step.

NeurIPS Conference 2025 Conference Paper

IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

  • Hengyu Liu
  • Chenxin Li
  • Zhengxin Li
  • Yipeng Wu
  • Wuyang Li
  • Zhiqin Yang
  • Zhenyuan Zhang
  • Yunlong Lin

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
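
For flavor, here is a hypothetical geometric-accuracy-style metric, not IR3D-Bench's actual protocol: match reconstructed object centers to ground-truth centers with the Hungarian algorithm and report the mean matched distance.

```python
# Hypothetical geometric-accuracy sketch (an assumption, not the benchmark's metric).
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_matched_center_error(pred_centers, gt_centers):
    """Mean 3D distance between optimally matched predicted and true object centers."""
    dists = np.linalg.norm(pred_centers[:, None, :] - gt_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)   # optimal one-to-one matching
    return dists[rows, cols].mean()

# Toy usage: three predicted vs. three ground-truth 3D object centers.
pred = np.array([[0.0, 0.1, 0.0], [1.0, 0.0, 0.2], [0.5, 0.5, 0.5]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.6, 0.5, 0.5]])
print(mean_matched_center_error(pred, gt))
```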

NeurIPS Conference 2025 Conference Paper

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

  • Jiaming Ji
  • Xinyu Chen
  • Rui Pan
  • Han Zhu
  • Jiahao Li
  • Donghai Hong
  • Boyuan Chen
  • Jiayi Zhou

Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants; however, they pose increasing safety risks. How can we ensure the safety alignment of MLLMs to prevent undesired behaviors? Going further, it is critical to explore how to fine-tune MLLMs to preserve capabilities while meeting safety constraints. Fundamentally, this challenge can be formulated as a min-max optimization problem. However, existing datasets have not yet disentangled single preference signals into explicit safety constraints, hindering systematic investigation in this direction. Moreover, it remains an open question whether such constraints can be effectively incorporated into the optimization process for multi-modal models. In this work, we present Safe RLHF-V, the first multimodal safety alignment framework. The framework consists of: (I) BeaverTails-V, the first open-source dataset featuring dual preference annotations for helpfulness and safety, supplemented with multi-level safety labels (minor, moderate, severe); (II) Beaver-Guard-V, a multi-level guardrail system that proactively defends against unsafe queries and adversarial attacks, where applying the guard model over five rounds of filtering and regeneration improves the precursor model's overall safety by an average of 40.9%; and (III) based on the dual preference data, the first exploration of multi-modal safety alignment within a constrained optimization framework. Experimental results demonstrate that Safe RLHF-V effectively improves both model helpfulness and safety, enhancing safety by 34.2% and helpfulness by 34.3%.
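
The min-max formulation mentioned above can be illustrated with a toy Lagrangian dual-ascent loop: maximize a helpfulness reward subject to a safety-cost constraint. The scalar parameter, reward and cost functions, and step sizes below are assumptions for illustration, not the paper's training setup.

```python
# Toy sketch of constrained optimization via Lagrangian dual ascent.
# The scalar "theta" stands in for an MLLM's parameters (an illustration only).
def reward(theta):   # helpfulness peaks at theta = 2
    return -(theta - 2.0) ** 2

def cost(theta):     # safety cost; the constraint is cost(theta) <= 0
    return theta - 1.5

theta, lam = 0.0, 0.0
for _ in range(500):
    # Primal ascent on the Lagrangian L = reward - lam * cost (finite differences).
    eps = 1e-4
    grad = ((reward(theta + eps) - lam * cost(theta + eps))
            - (reward(theta - eps) - lam * cost(theta - eps))) / (2 * eps)
    theta += 0.01 * grad
    # Dual ascent: increase lam while the safety constraint is violated.
    lam = max(0.0, lam + 0.05 * cost(theta))

print(round(theta, 3), round(lam, 3))  # settles near the constraint boundary, theta ~ 1.5
```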

NeurIPS Conference 2025 Conference Paper

Semantic-guided Diverse Decoding for Large Language Model

  • Weijie Shi
  • Yue Cui
  • Yaguang Wu
  • Jingzhi Fang
  • Shibo Zhang
  • Mengze Li
  • Sirui Han
  • Jia Zhu

Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods achieve primarily lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), which operates directly in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.
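
To make the repulsion idea concrete, the sketch below greedily selects candidates that trade off a quality score against cosine similarity to already-selected responses in embedding space. It approximates only the repulsion mechanism; SemDiD's orthogonal guidance, adaptive gains, and position debiasing are not modeled, and the embeddings and quality scores are assumed given.

```python
# Minimal embedding-space diversity sketch (an approximation, not SemDiD's decoder).
import numpy as np

def select_diverse(embeddings, quality, n, repulsion=1.0):
    """Greedily pick n candidates, penalizing similarity to those already chosen."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = [int(np.argmax(quality))]                 # best-quality seed
    while len(chosen) < n:
        sim = emb @ emb[chosen].T                      # cosine sims to chosen set
        gain = quality - repulsion * sim.max(axis=1)   # penalize redundancy
        gain[chosen] = -np.inf
        chosen.append(int(np.argmax(gain)))
    return chosen

# Toy usage: 6 candidate responses with random embeddings and quality scores.
rng = np.random.default_rng(0)
print(select_diverse(rng.normal(size=(6, 32)), rng.uniform(size=6), n=3))
```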