Arrow Research search

Author name cluster

Sixiang Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

AAAI Conference 2026 · Conference Paper

FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

  • Jiajun Cao
  • Qizhe Zhang
  • Peidong Jia
  • Xuhui Zhao
  • Bo Lan
  • Xiaoan Zhang
  • Lizhuo
  • Xiaobao Wei

Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual token sequences of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLMs) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models that share the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
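
The abstract above centers on keeping a small set of foreground-relevant visual tokens before they reach the language model. A minimal sketch of that top-k pruning idea is below; the scoring head, shapes, and keep ratio are illustrative assumptions, not the released ReconPruner.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Illustrative top-k visual token pruner (not the released ReconPruner)."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # assumed foreground-relevance scorer
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from a frozen visual encoder
        scores = self.score_head(tokens).squeeze(-1)            # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                     # keep k highest-scoring tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)                            # (B, k, dim)

pruner = TokenPruner(dim=768, keep_ratio=0.25)
pruned = pruner(torch.randn(2, 576, 768))
print(pruned.shape)  # torch.Size([2, 144, 768])
```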

NeurIPS Conference 2025 · Conference Paper

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

  • Sixiang Chen
  • Jiaming Liu
  • Siyuan Qian
  • Han Jiang
  • Zhuoyang Liu
  • Chenyang Gu
  • Xiaoqi Li
  • Chengkai Hou

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages of mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We empirically validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks, demonstrating superior performance compared to existing methods.
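
The perception-aware multimodal conditioning described above amounts to dynamically weighting 2D image features against 3D point-cloud features. The sketch below shows one simple gated-fusion reading of that idea; the module name, dimensions, and gating network are assumptions rather than the AC-DiT implementation.

```python
import torch
import torch.nn as nn

class PerceptionAwareFusion(nn.Module):
    """Illustrative gated fusion of 2D image and 3D point-cloud features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
        # feat_2d, feat_3d: (batch, dim) pooled features from separate encoders
        w = torch.softmax(self.gate(torch.cat([feat_2d, feat_3d], dim=-1)), dim=-1)
        # w[:, :1] weights the 2D branch, w[:, 1:] the 3D branch
        return w[:, :1] * feat_2d + w[:, 1:] * feat_3d

fusion = PerceptionAwareFusion(dim=256)
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```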

AAAI Conference 2025 · Conference Paper

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

  • Yunlong Lin
  • Tian Ye
  • Sixiang Chen
  • Zhenqi Fu
  • Yingying Wang
  • Wenhao Chai
  • Zhaohu Xing
  • Wenxue Li

Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) collecting distorted/clean image pairs is often impractical and sometimes even impossible, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome these challenges, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes of normal-light images, such as exposure, structure, and color. These attributes are readily available and impose no assumptions about the degradation process, and they guide the diffusion sampling process toward a reliable high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of both distortion-based and perceptual metrics, and it performs well even under sophisticated in-the-wild degradations.
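
Attribute guidance here means steering a pretrained diffusion sampler with the gradient of an attribute loss instead of retraining anything. The following toy sketch illustrates one such guided update using a placeholder denoiser and a simple exposure attribute; the loss, scale, and sampler loop are assumptions, not the AGLLDiff guidance terms.

```python
import torch

def exposure_loss(x: torch.Tensor, target_mean: float = 0.5) -> torch.Tensor:
    """Toy attribute loss: pull mean brightness toward a normal-light level."""
    return (x.mean() - target_mean) ** 2

def guided_step(x_t: torch.Tensor, denoiser, t: int, guidance_scale: float = 50.0) -> torch.Tensor:
    """One illustrative guided update: denoise, then nudge the estimate down the attribute gradient."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                  # placeholder pretrained denoiser
    grad = torch.autograd.grad(exposure_loss(x0_hat), x_t)[0]
    return (x0_hat - guidance_scale * grad).detach()

# Toy usage with an identity "denoiser"; a real sampler would also re-noise between steps.
x = torch.rand(1, 3, 64, 64) * 0.1             # dark input image
for t in reversed(range(10)):
    x = guided_step(x, lambda z, _t: z, t)
print(float(x.mean()))                          # brightness drifts toward the target
```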

AAAI Conference 2025 · Conference Paper

DPLUT: Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors

  • Yunlong Lin
  • Zhenqi Fu
  • Kairun Wen
  • Tian Ye
  • Sixiang Chen
  • Ge Meng
  • Yingying Wang
  • Chui Kong

Low-light image enhancement (LIE) aims at precisely and efficiently recovering an image degraded in poor illumination environments. Recent advanced LIE techniques rely on deep neural networks, which require large numbers of low/normal-light image pairs, many network parameters, and substantial computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses and predicts pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the noise that is amplified after brightening. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
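
LLUT's role, as described, is to predict pixel-wise curve parameters that adjust an image's dynamic range. The sketch below applies a quadratic brightening curve with per-pixel parameters in that spirit; the curve form and parameter source are assumptions, not the paper's lookup tables.

```python
import torch

def apply_light_curve(x: torch.Tensor, alpha: torch.Tensor, iterations: int = 4) -> torch.Tensor:
    """Illustrative pixel-wise brightening: repeatedly apply x <- x + alpha * x * (1 - x).

    x:     (B, 3, H, W) low-light image in [0, 1]
    alpha: (B, 3, H, W) per-pixel curve parameters in [-1, 1] (assumed LUT/network output)
    """
    for _ in range(iterations):
        x = x + alpha * x * (1.0 - x)   # stays in [0, 1] for alpha in [-1, 1]
    return x

low = torch.rand(1, 3, 32, 32) * 0.2            # synthetic dark input
alpha = torch.full_like(low, 0.8)               # stand-in for predicted parameters
print(float(low.mean()), float(apply_light_curve(low, alpha).mean()))
```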

AAAI Conference 2025 · Conference Paper

PromptHaze: Prompting Real-world Dehazing via Depth Anything Model

  • Tian Ye
  • Sixiang Chen
  • Haoyu Chen
  • Wenhao Chai
  • Jingjing Ren
  • Zhaohu Xing
  • Wenxue Li
  • Lei Zhu

Real-world image dehazing remains a challenging task due to the diverse nature of haze degradation and the lack of large-scale paired datasets. Existing methods based on hand-crafted priors or generative priors struggle to recover accurate backgrounds and fine details from dense haze regions. In this work, we propose a novel paradigm, PromptHaze, for real-world image dehazing via the depth prompt from the Depth Anything model. By employing a prompt-by-prompt strategy, our method iteratively updates the depth prompt and progressively restores the background through a dehazing network with controllable dehazing strength. Extensive experiments on widely-used real-world dehazing benchmarks demonstrate the superiority of PromptHaze in recovering authentic backgrounds and fine details from various haze scenes, outperforming state-of-the-art methods across multiple quality metrics.
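
The prompt-by-prompt strategy is essentially an iterative loop that re-estimates depth from the current restoration and feeds it back as conditioning. A schematic version is below, with stand-in callables for the depth estimator and the depth-conditioned dehazing network; it is not the PromptHaze architecture.

```python
import torch

def iterative_dehaze(hazy, depth_model, dehaze_net, steps: int = 3, strength: float = 0.5):
    """Illustrative prompt-by-prompt loop: re-estimate depth, condition the dehazer, repeat.

    depth_model and dehaze_net are placeholders for a monocular depth estimator
    (e.g. Depth Anything) and a depth-conditioned dehazing network.
    """
    x = hazy
    for _ in range(steps):
        depth_prompt = depth_model(x)                 # updated depth prompt from current estimate
        x = dehaze_net(x, depth_prompt, strength)     # restore with controllable strength
    return x

# Toy usage with stand-in callables.
hazy = torch.rand(1, 3, 64, 64)
depth_model = lambda img: img.mean(dim=1, keepdim=True)           # fake "depth"
dehaze_net = lambda img, d, s: torch.clamp(img * (1 + s * d), 0, 1)
print(iterative_dehaze(hazy, depth_model, dehaze_net).shape)
```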

AAAI Conference 2025 · Conference Paper

Residual Diffusion Deblurring Model for Single Image Defocus Deblurring

  • Haoxuan Feng
  • Haohui Zhou
  • Tian Ye
  • Sixiang Chen
  • Lei Zhu

Defocus deblurring is a challenging task due to the spatially varying nature of defocus blur and the existence of multiple plausible solutions for a single given image. However, most existing methods falter when faced with extensive and variable defocus blur, either ignoring it or relying on additional loss functions to enhance perceptual quality. This often results in unrealistic reconstructions and compromised generalizability. In this paper, we propose a novel Residual Diffusion Deblurring Model framework for single image defocus deblurring. Our approach integrates a pre-trained defocus map estimator and a lightweight pre-deblur module with a learnable receptive field, providing crucial posterior information to effectively address large-scale and variably shaped defocus blur. In addition, a carefully designed denoising network enables the generation of diverse reconstructions from a single input. This approach not only significantly improves the perceptual quality of defocus deblurring outputs through multi-step residual learning, but also offers a more efficient inference strategy. Experimental results demonstrate that our method achieves competitive performance on real-world defocus deblurring datasets across both perceptual and distortion evaluation metrics.
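
One way to read the residual formulation above is that a lightweight pre-deblur stage produces a coarse result and a diffusion model supplies the remaining residual detail. The toy sketch below illustrates that split with placeholder modules; the sampler and networks are assumptions, not the paper's architecture.

```python
import torch

def sample_residual(denoiser, shape, steps: int = 10) -> torch.Tensor:
    """Toy sampling loop: start from noise and repeatedly refine the residual estimate."""
    r = torch.randn(shape)
    for t in reversed(range(steps)):
        r = denoiser(r, t)            # placeholder learned denoiser over residuals
    return r

def deblur(blurred, pre_deblur_net, denoiser):
    """Illustrative residual formulation: sharp ~= pre-deblurred + sampled residual."""
    coarse = pre_deblur_net(blurred)                      # lightweight pre-deblur module
    residual = sample_residual(denoiser, coarse.shape)    # diffusion models the remaining detail
    return coarse + residual

blurred = torch.rand(1, 3, 64, 64)
out = deblur(blurred, lambda x: x, lambda r, t: 0.5 * r)  # stand-in modules
print(out.shape)
```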

NeurIPS Conference 2024 · Conference Paper

RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

  • Haoyu Chen
  • Wenbo Li
  • Jinjin Gu
  • Jingjing Ren
  • Sixiang Chen
  • Tian Ye
  • Renjing Pei
  • Kaiwen Zhou

Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited range of them and often produce overly smooth, low-fidelity outcomes because they fit a broad data distribution. To address these challenges, we first define a new pipeline for restoring images with multiple degradations, and then introduce RestoreAgent, an intelligent image restoration system leveraging multimodal large language models. RestoreAgent autonomously assesses the type and extent of degradation in input images and performs restoration through (1) determining the appropriate restoration tasks, (2) optimizing the task sequence, (3) selecting the most suitable models, and (4) executing the restoration. Experimental results demonstrate the superior performance of RestoreAgent in handling complex degradation, surpassing human experts. Furthermore, the system's modular design facilitates the fast integration of new tasks and models.
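
The four-step loop in the abstract (choose tasks, order them, pick models, execute) can be pictured as a small plan-then-execute pipeline. The sketch below uses a toy severity-based planner and stand-in restoration models in place of the multimodal LLM and a real model zoo.

```python
# Illustrative plan-then-execute pipeline; the planner stands in for the MLLM.
MODEL_REGISTRY = {
    "denoise":  lambda img: img,   # stand-ins for task-specific restoration models
    "deblur":   lambda img: img,
    "lowlight": lambda img: img,
}

def plan_tasks(degradations: dict) -> list:
    """Toy planner: order detected degradations by severity (an MLLM would decide this)."""
    detected = {k: v for k, v in degradations.items() if v > 0.3}
    return sorted(detected, key=detected.get, reverse=True)

def restore(image, degradations: dict):
    for task in plan_tasks(degradations):       # (1)-(2) choose and order tasks
        model = MODEL_REGISTRY[task]            # (3) select a model for the task
        image = model(image)                    # (4) execute the restoration
    return image

print(plan_tasks({"denoise": 0.9, "deblur": 0.2, "lowlight": 0.6}))  # ['denoise', 'lowlight']
```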

AAAI Conference 2024 · Conference Paper

VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook

  • Wenbin Zou
  • Hongxia Gao
  • Tian Ye
  • Liang Chen
  • Weipeng Yang
  • Shasha Huang
  • Hongsheng Chen
  • Sixiang Chen

Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fit end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby introducing greater error into the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors. In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders. Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR.
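
At the core of a vector-quantized codebook prior is a nearest-neighbor lookup that replaces degraded encoder features with learned high-quality code entries. The sketch below shows that quantization step in isolation, with a random codebook standing in for the learned one; it is not the VQCNIR network.

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Illustrative VQ step: replace each feature vector with its nearest codebook entry.

    features: (N, D) encoder features from the degraded image
    codebook: (K, D) learned high-quality prior (here random, for illustration)
    """
    dists = torch.cdist(features, codebook)        # (N, K) pairwise distances
    nearest = dists.argmin(dim=1)                  # index of the closest code per feature
    return codebook[nearest]

feats = torch.randn(16, 64)
codebook = torch.randn(512, 64)
print(quantize(feats, codebook).shape)  # torch.Size([16, 64])
```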