Arrow Research search

Author name cluster

Hang Su

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

50 papers
2 author rows

Possible papers

50

AAAI Conference 2026 Conference Paper

Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

  • Youze Wang
  • Zijun Chen
  • Ruoyu Chen
  • Shishen Gu
  • Wenbo Hu
  • Jiayang Liu
  • Yinpeng Dong
  • Hang Su

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

AAAI Conference 2026 Conference Paper

Dual-Seed Evolutionary Algorithm for Noise Optimization in Diffusion Models

  • Yuzheng Tan
  • Yuan He
  • Yao Zhu
  • Tianlin Huo
  • Huanqian Yan
  • Hang Su
  • Shuxin Zhang
  • Guangneng Hu

Diffusion models have emerged as state-of-the-art generative methods, particularly excelling in conditional tasks such as prompt-driven image synthesis. While recent research emphasizes the pivotal role of noise seeds in enhancing text-image alignment and generating human-preferred outputs, these works predominantly rely on random Gaussian noise or heuristic local adjustments, overlooking the potential of global optimization strategies to systematically improve generation quality. To bridge this gap, we propose Seed Optimization based on Evolution (SOE), a hybrid framework that integrates global evolutionary search with local semantic refinement. The global evolutionary stage conducts seed selection by jointly optimizing text-image alignment (via CLIP-Score) and human preference estimation (via ImageReward), while the local stage employs diffusion inversion to inject conditional semantics into the noise seed. Together, these components constitute a model-agnostic, training-free optimization framework for conditional diffusion models. Extensive experiments across various diffusion models demonstrate that SOE consistently improves semantic fidelity and visual quality, highlighting its generalizability and potential as a plug-and-play enhancement for generative diffusion pipelines.

AAAI Conference 2026 Conference Paper

FedCD: Towards Consolidated Distillation for Heterogeneous Federated Learning

  • Yichen Li
  • Hang Su
  • Huifa Li
  • Haolin Yang
  • Xinlin Zhuang
  • Haochen Xue
  • Haozhao Wang
  • Imran Razzak

Knowledge Distillation (KD) serves as an effective approach to addressing heterogeneity issues in Federated Learning (FL), leveraging additional datasets to better align local and global models. There are two primary distillation paradigms: feature-based distillation, which utilizes intermediate-layer features of the network, and logit-based distillation, which employs the final layer's logit outputs. However, existing studies often select distillation methods based on intuitive and empirical evidence when facing different heterogeneous settings, neglecting the intrinsic relationship between distillation paradigms and heterogeneity. This oversight may result in suboptimal federated knowledge distillation performance under heterogeneous conditions. In this paper, we propose Consolidated Distillation for Heterogeneous Federated Learning (FedCD), which balances knowledge representations from both feature-based and logit-based distillation to enhance performance. Specifically, to address the misalignment between knowledge conveyed by features and logits, we aggregate features from different layers via cross-layer attention to preserve semantic knowledge, followed by distribution modeling using Gaussian Mixture Models. This process strengthens knowledge distillation by constraining the transformation of different network layers' features under a consolidated distribution, thereby mitigating impacts from both data and model heterogeneity. Extensive experiments demonstrate that FedCD outperforms state-of-the-art methods by over 10.72% and validate the effectiveness of our approach.

AAAI Conference 2026 Conference Paper

H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

  • Hongzhe Bi
  • Lingxuan Wu
  • Tianwei Lin
  • Hengkai Tan
  • Zhizhong Su
  • Hang Su
  • Jun Zhu

Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations: the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. The modular design of action encoder and decoder components enables effective knowledge transfer from the unified human embodiment to diverse robot platforms through efficient fine-tuning. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including π0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.

AAAI Conference 2026 Conference Paper

ReflexDiffusion: Reflection-Enhanced Trajectory Planning for High-lateral-acceleration Scenarios in Autonomous Driving

  • Xuemei Yao
  • Xiao Yang
  • Jianbin Sun
  • Liuwei Xie
  • Xuebin Shao
  • Xiyu Fang
  • Hang Su
  • Kewei Yang

Generating safe and reliable trajectories for autonomous vehicles in long-tail scenarios remains a significant challenge, particularly for high-lateral-acceleration maneuvers such as sharp turns, which represent critical safety situations. Existing trajectory planners exhibit systematic failures in these scenarios due to data imbalance: vehicle dynamics, road geometry, and environmental constraints are insufficiently represented in high-risk situations, leading to suboptimal or unsafe trajectory prediction when vehicles operate near their physical boundaries. In this paper, we introduce ReflexDiffusion, a novel inference-stage framework that enhances diffusion-based trajectory planners through reflective adjustment. Our method introduces a gradient-based adjustment mechanism during the iterative denoising process: after each standard trajectory update, we compute the gradient between conditional and unconditional noise predictions to explicitly amplify critical conditioning signals, including road curvature and lateral vehicle dynamics. This amplification enforces strict adherence to physical constraints, particularly improving stability during high-lateral-acceleration maneuvers where precise vehicle-road interaction is paramount. Evaluated on the nuPlan Test14-hard benchmark, ReflexDiffusion achieves a 14.1% improvement in driving score for high-lateral-acceleration scenarios compared to state-of-the-art methods. This demonstrates that inference-time trajectory optimization can effectively compensate for training data sparsity by dynamically reinforcing safety-critical constraints at the handling limits. The framework's architecture-agnostic design enables direct deployment across existing diffusion-based planners, offering a practical solution for improving autonomous vehicle safety in challenging driving conditions.

ECAI Conference 2025 Conference Paper

POSTMAN: Periodic Spectra Transition via Mamba Network for Time Series Forecasting

  • Kaixin Zhao
  • Hang Su
  • Huiyu Liu
  • Yijun Mo

The periodicity of time series has significantly advanced long-term forecasting and has attracted extensive research efforts. However, existing methods still suffer from neglecting critical low-energy periodic components and high sensitivity to outliers. To address these issues, we propose the PeriOdic Spectra Transition via MAmba Network (POSTMAN). This architecture introduces the periodic spectrum deviation forecasting (PSDF) technique, which extracts the shared spectrum to represent the common periodic features and generates deviation spectra to represent the specific periodic features. The shared periodic spectrum retains the critical low-amplitude components, while the deviation spectra preserve the slight differences between periods. To effectively leverage the differences, we develop a spectral convolution-enhanced Frequency Mamba Block (FMB), which learns the transition patterns of periodic deviation spectra and inhibits the impact of outliers during the transition procedure. Experiments on seven mainstream time series datasets demonstrate that POSTMAN outperforms existing state-of-the-art models in accuracy and robustness.

IROS Conference 2025 Conference Paper

RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation

  • Chengbo Yuan
  • Suraj Joshi
  • Shaoting Zhu
  • Hang Su
  • Hang Zhao 0021
  • Yang Gao 0029

Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate the ability to generalize robot manipulation tasks across six entirely new scenes, based solely on demonstrations collected from a single scene, achieving a more than 200% performance improvement compared to the no-augmentation baseline. All datasets, model weights, and the toolkit are released at https://roboengine.github.io/.

IJCAI Conference 2025 Conference Paper

Self-Consistent Model-based Adaptation for Visual Reinforcement Learning

  • Xinning Zhou
  • Chengyang Ying
  • Yao Feng
  • Hang Su
  • Jun Zhu

Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy's representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.

NeurIPS Conference 2024 Conference Paper

Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

  • Huayu Chen
  • Kaiwen Zheng
  • Hang Su
  • Jun Zhu

Drawing upon recent advances in language model alignment, we formulate offline Reinforcement Learning as a two-stage optimization problem: First pretraining expressive generative policies on reward-free behavior datasets, then finetuning these policies to align with task-specific annotations like Q-values. This strategy allows us to leverage abundant and diverse behavior data to enhance generalization and enable rapid adaptation to downstream tasks using minimal annotations. In particular, we introduce Efficient Diffusion Alignment (EDA) for solving continuous control problems. EDA utilizes diffusion models for behavior modeling. However, unlike previous approaches, we represent diffusion policies as the derivative of a scalar neural network with respect to action inputs. This representation is critical because it enables direct density calculation for diffusion models, making them compatible with existing LLM alignment theories. During policy fine-tuning, we extend preference-based alignment methods like Direct Preference Optimization (DPO) to align diffusion behaviors with continuous Q-functions. Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95% of performance and still outperforms several baselines given only 1% of Q-labelled data during fine-tuning.

NeurIPS Conference 2024 Conference Paper

Diffusion Models are Certifiably Robust Classifiers

  • Huanran Chen
  • Yinpeng Dong
  • Shitong Shao
  • Zhongkai Hao
  • Xiao Yang
  • Hang Su
  • Jun Zhu

Generative learning, recognized for its effective modeling of data distributions, offers inherent advantages in handling out-of-distribution instances, especially for enhancing robustness to adversarial attacks. Among these, diffusion classifiers, utilizing powerful diffusion models, have demonstrated superior empirical robustness. However, a comprehensive theoretical understanding of their robustness is still lacking, raising concerns about their vulnerability to stronger future attacks. In this study, we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish their certified robustness, demonstrating their inherent resilience. To achieve non-constant Lipschitzness, thereby obtaining much tighter certified robustness, we generalize diffusion classifiers to classify Gaussian-corrupted data. This involves deriving the evidence lower bounds (ELBOs) for these distributions, approximating the likelihood using the ELBO, and calculating classification probabilities via Bayes' theorem. Experimental results show the superior certified robustness of these Noised Diffusion Classifiers (NDCs). Notably, we achieve over 80% and 70% certified robustness on CIFAR-10 under adversarial perturbations with $\ell_2$ norms less than 0.25 and 0.5, respectively, using a single off-the-shelf diffusion model without any additional data.

NeurIPS Conference 2024 Conference Paper

Full-Distance Evasion of Pedestrian Detectors in the Physical World

  • Zhi Cheng
  • Zhanhao Hu
  • Yuqiu Liu
  • Jianmin Li
  • Hang Su
  • Xiaolin Hu

Many studies have proposed attack methods to generate adversarial patterns for evading pedestrian detection, alarming the computer vision community about the need for more attention to the robustness of detectors. However, adversarial patterns optimized by these methods commonly have limited performance at medium to long distances in the physical world. To overcome this limitation, we identify two main challenges. First, in existing methods, there is commonly an appearance gap between simulated distant adversarial patterns and their physical world counterparts, leading to incorrect optimization. Second, there exists a conflict between adversarial losses at different distances, which causes difficulties in optimization. To overcome these challenges, we introduce a Full Distance Attack (FDA) method. Our physical world experiments demonstrate the effectiveness of our FDA patterns across various detection models like YOLOv5, Deformable-DETR, and Mask R-CNN. Code is available at https://github.com/zhicheng2T0/Full-Distance-Attack.git.

NeurIPS Conference 2024 Conference Paper

Membership Inference on Text-to-Image Diffusion Models via Conditional Likelihood Discrepancy

  • Shengfang Zhai
  • Huanran Chen
  • Yinpeng Dong
  • Jiajun Li
  • Qingni Shen
  • Yansong Gao
  • Hang Su
  • Yang Liu

Text-to-image diffusion models have achieved tremendous success in the field of controllable image generation, while also raising issues of privacy leakage and data copyright. Membership inference arises in these contexts as a potential auditing method for detecting unauthorized data usage. While some efforts have been made on diffusion models, they are not applicable to text-to-image diffusion models due to the high computation overhead and enhanced generalization capabilities. In this paper, we first identify a conditional overfitting phenomenon in text-to-image diffusion models, indicating that these models tend to overfit the conditional distribution of images given the corresponding text rather than the marginal distribution of images only. Based on this observation, we derive an analytical indicator, namely Conditional Likelihood Discrepancy (CLiD), to perform membership inference, which reduces the stochasticity in estimating memorization of individual samples. Experimental results demonstrate that our method significantly outperforms previous methods across various data distributions and dataset scales. Additionally, our method shows superior resistance to overfitting mitigation strategies, such as early stopping and data augmentation.

NeurIPS Conference 2024 Conference Paper

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

  • Yichi Zhang
  • Yao Huang
  • Yitong Sun
  • Chang Liu
  • Zhe Zhao
  • Zhengwei Fang
  • Yifan Wang
  • Huanran Chen

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.

NeurIPS Conference 2024 Conference Paper

Noise Contrastive Alignment of Language Models with Explicit Rewards

  • Huayu Chen
  • Guande He
  • Lifan Yuan
  • Ganqu Cui
  • Hang Su
  • Jun Zhu

User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment, leveraging Noise Contrastive Estimation (NCE) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both enabling the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By comparing NCA and InfoNCA, we demonstrate that the well-observed decreasing-likelihood trend of DPO/InfoNCA is caused by their focus on adjusting relative likelihood across different responses. In contrast, NCA optimizes the absolute likelihood for each response, thereby effectively preventing the chosen likelihood from decreasing. We evaluate our methods in both reward and preference settings with Mistral-8$\times$7B and 7B models. Experiments suggest that InfoNCA/NCA surpasses various preference baselines when reward datasets are available. We also find NCA significantly outperforms DPO in complex reasoning tasks like math and coding.

NeurIPS Conference 2024 Conference Paper

PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning

  • Chengyang Ying
  • Zhongkai Hao
  • Xinning Zhou
  • Xuezhou Xu
  • Hang Su
  • Xingxing Zhang
  • Jun Zhu

Designing generalizable agents capable of adapting to diverse embodiments has attracted significant attention in Reinforcement Learning (RL), which is critical for deploying RL agents in various real-world applications. Previous Cross-Embodiment RL approaches have focused on transferring knowledge across embodiments within specific tasks. These methods often result in knowledge tightly coupled with those tasks and fail to adequately capture the distinct characteristics of different embodiments. To address this limitation, we introduce the notion of Cross-Embodiment Unsupervised RL (CEURL), which leverages unsupervised learning to enable agents to acquire embodiment-aware and task-agnostic knowledge through online interactions within reward-free environments. We formulate CEURL as a novel Controlled Embodiment Markov Decision Process (CE-MDP) and systematically analyze CEURL's pre-training objectives under CE-MDP. Based on these analyses, we develop a novel algorithm Pre-trained Embodiment-Aware Control (PEAC) for handling CEURL, incorporating an intrinsic reward function specifically designed for cross-embodiment pre-training. PEAC not only provides an intuitive optimization strategy for cross-embodiment pre-training but also can integrate flexibly with existing unsupervised RL methods, facilitating cross-embodiment exploration and skill discovery. Extensive experiments in both simulated (e.g., DMC and Robosuite) and real-world environments (e.g., legged locomotion) demonstrate that PEAC significantly improves adaptation performance and cross-embodiment generalization, demonstrating its effectiveness in overcoming the unique challenges of CEURL. The project page and code are at https://yingchengyang.github.io/ceurl.

NeurIPS Conference 2024 Conference Paper

PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs

  • Zhongkai Hao
  • Jiachen Yao
  • Chang Su
  • Hang Su
  • Ziao Wang
  • Fanzhi Lu
  • Zeyu Xia
  • Yichi Zhang

While significant progress has been made on Physics-Informed Neural Networks (PINNs), a comprehensive comparison of these methods across a wide range of Partial Differential Equations (PDEs) is still lacking. This study introduces PINNacle, a benchmarking tool designed to fill this gap. PINNacle provides a diverse dataset, comprising over 20 distinct PDEs from various domains, including heat conduction, fluid dynamics, biology, and electromagnetics. These PDEs encapsulate key challenges inherent to real-world problems, such as complex geometry, multi-scale phenomena, nonlinearity, and high dimensionality. PINNacle also offers a user-friendly toolbox, incorporating about 10 state-of-the-art PINN methods for systematic evaluation and comparison. We have conducted extensive experiments with these methods, offering insights into their strengths and weaknesses. In addition to providing a standardized means of assessing performance, PINNacle also offers an in-depth analysis to guide future research, particularly in areas such as domain decomposition methods and loss reweighting for handling multi-scale problems and complex geometry. To the best of our knowledge, it is the largest benchmark with a diverse and comprehensive evaluation that will undoubtedly foster further research in PINNs.

AAAI Conference 2023 Conference Paper

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

  • Shilong Liu
  • Shijia Huang
  • Feng Li
  • Hao Zhang
  • Yaoyuan Liang
  • Hang Su
  • Jun Zhu
  • Lei Zhang

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve the performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves 91.04% and 83.51% in terms of recall rate on RefCOCO testA and testB with a ResNet-101 backbone.

NeurIPS Conference 2023 Conference Paper

Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality

  • Liyuan Wang
  • Jingyi Xie
  • Xingxing Zhang
  • Mingyi Huang
  • Hang Su
  • Jun Zhu

Prompt-based continual learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning, and has almost reached the performance pinnacle under supervised pre-training. However, our empirical research reveals that the current strategies fall short of their full potential under the more realistic self-supervised pre-training, which is essential for handling vast quantities of unlabeled data in practice. This is largely due to the difficulty of task-specific knowledge being incorporated into instructed representations via prompt parameters and predicted by uninstructed representations at test time. To overcome the exposed sub-optimality, we conduct a theoretical analysis of the continual learning objective in the context of pre-training, and decompose it into hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. Following these empirical and theoretical insights, we propose Hierarchical Decomposition (HiDe-)Prompt, an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics of both uninstructed and instructed representations, further with the coordination of a contrastive regularization strategy. Our extensive experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning (e.g., up to 15.01% and 9.61% lead on Split CIFAR-100 and Split ImageNet-R, respectively).

AAAI Conference 2023 Conference Paper

Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning

  • Mingyang Wang
  • Zhenshan Bing
  • Xiangtong Yao
  • Shuai Wang
  • Huang Kai
  • Hang Su
  • Chenguang Yang
  • Alois Knoll

Meta-reinforcement learning enables artificial agents to learn from related training tasks and adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during the evaluation, thus, restricting its application. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning to address this challenge. We extend meta-RL to broad non-parametric task distributions which have never been explored before, and also achieve state-of-the-art results in non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We utilize the Gaussian mixture model for task representation to imitate the parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react at the first sight of a task change, thus being applicable in non-stationary tasks. MoSS also exhibits strong generalization robustness in out-of-distribution tasks which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm and the entire network is trained completely off-policy to ensure high sample efficiency. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior works in terms of asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.

IJCAI Conference 2023 Conference Paper

On the Reuse Bias in Off-Policy Reinforcement Learning

  • Chengyang Ying
  • Zhongkai Hao
  • Xinning Zhou
  • Hang Su
  • Dong Yan
  • Jun Zhu

Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS --- the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We theoretically show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective, which may cause an erroneous gradient update and degrade performance. We further provide a high-probability upper bound of the Reuse Bias and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms. Based on these analyses, we present a novel yet simple Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias, and show that our BIRIS can significantly reduce the Reuse Bias empirically. Moreover, extensive experimental results show that our BIRIS-based methods can significantly improve the sample efficiency on a series of continuous control tasks in MuJoCo.

NeurIPS Conference 2023 Conference Paper

Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation

  • Yilin Lyu
  • Liyuan Wang
  • Xingxing Zhang
  • Zicheng Sun
  • Hang Su
  • Jun Zhu
  • Liping Jing

Continual learning entails learning a sequence of tasks and balancing their knowledge appropriately. With limited access to old training samples, much of the current work in deep neural networks has focused on overcoming catastrophic forgetting of old tasks in gradient-based optimization. However, the normalization layers provide an exception, as they are updated interdependently by the gradient and the statistics of currently observed training samples, which requires specialized strategies to mitigate recency bias. In this work, we focus on the most popular Batch Normalization (BN) and provide an in-depth theoretical analysis of its sub-optimality in continual learning. Our analysis demonstrates the dilemma between balance and adaptation of BN statistics for incremental tasks, which potentially affects training stability and generalization. Targeting these particular challenges, we propose Adaptive Balance of BN (AdaB²N), which appropriately incorporates a Bayesian-based strategy to adapt task-wise contributions and a modified momentum to balance BN statistics, corresponding to the training and testing stages. By implementing BN in a continual learning fashion, our approach achieves significant performance gains across a wide range of benchmarks, particularly for the challenging yet realistic online scenarios (e.g., up to 7.68%, 6.86% and 4.26% on Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet, respectively). Our code is available at https://github.com/lvyilin/AdaB2N.
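The recency bias the abstract describes comes from BN's exponential-moving-average statistics, where recent batches always dominate. A hedged sketch of the general idea of a task-balanced momentum follows; the decay rule here is illustrative only, not AdaB²N's actual update.

```python
# Illustrative sketch: shrink the BN momentum as more tasks contribute
# statistics, so recently observed batches dominate the running mean less
# (countering recency bias). This rule is a stand-in, not the paper's.

def update_running_mean(running_mean, batch_mean, num_tasks_seen,
                        base_momentum=0.1):
    momentum = base_momentum / num_tasks_seen  # older tasks decay more slowly
    return (1 - momentum) * running_mean + momentum * batch_mean

# Three tasks arrive in sequence, each with a different batch mean.
m = 0.0
for task, batch_mean in enumerate([1.0, 2.0, 3.0], start=1):
    m = update_running_mean(m, batch_mean, num_tasks_seen=task)
```

With a fixed momentum the running mean would drift toward the last task's statistics; the shrinking momentum slows that drift as the task count grows.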

NeurIPS Conference 2023 Conference Paper

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

  • Zhengyi Wang
  • Cheng Lu
  • Yikai Wang
  • Fan Bao
  • Chongxuan Li
  • Hang Su
  • Jun Zhu

Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS, and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models, and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., 7.5). We further present various improvements in the design space for text-to-3D, such as the distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., 512×512) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic.

NeurIPS Conference 2022 Conference Paper

A Unified Hard-Constraint Framework for Solving Geometrically Complex PDEs

  • Songming Liu
  • Hao Zhongkai
  • Chengyang Ying
  • Hang Su
  • Jun Zhu
  • Ze Cheng

We present a unified hard-constraint framework for solving geometrically complex PDEs with neural networks, where the most commonly used Dirichlet, Neumann, and Robin boundary conditions (BCs) are considered. Specifically, we first introduce the "extra fields" from the mixed finite element method to reformulate the PDEs so as to equivalently transform the three types of BCs into linear forms. Based on the reformulation, we derive the general solutions of the BCs analytically, which are employed to construct an ansatz that automatically satisfies the BCs. With such a framework, we can train the neural networks without adding extra loss terms and thus efficiently handle geometrically complex PDEs, alleviating the unbalanced competition between the loss terms corresponding to the BCs and PDEs. We theoretically demonstrate that the "extra fields" can stabilize the training process. Experimental results on real-world geometrically complex PDEs showcase the effectiveness of our method compared with state-of-the-art baselines.
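The core idea of a hard-constraint ansatz can be illustrated in one dimension: compose the network output with functions so the BCs hold by construction, removing the BC loss term entirely. This is a minimal sketch for a Dirichlet problem on [0, 1]; the function names and the stand-in "network" are illustrative, not the paper's construction for complex geometries.

```python
import math

# 1-D hard-constraint ansatz for Dirichlet BCs u(0)=a, u(1)=b:
#   u(x) = g(x) + d(x) * net(x)
# where g interpolates the boundary data exactly and d vanishes on the
# boundary, so the BCs hold for ANY network output.

def ansatz(x, net, a=0.0, b=1.0):
    g = a * (1 - x) + b * x   # satisfies the BCs exactly
    d = x * (1 - x)           # zero at x=0 and x=1
    return g + d * net(x)

net = lambda x: math.sin(5 * x)  # stand-in for a trained neural network
u0 = ansatz(0.0, net)            # exactly a = 0.0, regardless of net
u1 = ansatz(1.0, net)            # exactly b = 1.0, regardless of net
```

Because the BCs are satisfied identically, training only needs the PDE residual loss, which is precisely the unbalanced-competition issue the framework removes.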

IJCAI Conference 2022 Conference Paper

Cluster Attack: Query-based Adversarial Attacks on Graph with Graph-Dependent Priors

  • Zhengyi Wang
  • Zhongkai Hao
  • Ziqiao Wang
  • Hang Su
  • Jun Zhu

While deep neural networks have achieved great success in graph analysis, recent work has shown that they are vulnerable to adversarial attacks. Compared with adversarial attacks on image classification, performing adversarial attacks on graphs is more challenging because of the discrete and non-differentiable nature of a graph's adjacency matrix. In this work, we propose Cluster Attack --- a Graph Injection Attack (GIA) on node classification, which injects fake nodes into the original graph to degrade the performance of graph neural networks (GNNs) on certain victim nodes while affecting the other nodes as little as possible. We demonstrate that a GIA problem can be equivalently formulated as a graph clustering problem; thus, the discrete optimization problem of the adjacency matrix can be solved in the context of graph clustering. In particular, we propose to measure the similarity between victim nodes by a metric of Adversarial Vulnerability, which is related to how the victim nodes will be affected by the injected fake node, and to cluster the victim nodes accordingly. Our attack is performed in a practical and unnoticeable query-based black-box manner, with access to only a small number of nodes on the graph. Theoretical analysis and extensive experiments demonstrate the effectiveness of our method by fooling the node classifiers with only a small number of queries.

ICRA Conference 2022 Conference Paper

Human-Robot Shared Control for Surgical Robot Based on Context-Aware Sim-to-Real Adaptation

  • Dandan Zhang 0001
  • Zicong Wu
  • Junhong Chen
  • Ruiqi Zhu
  • Adnan Munawar
  • Bo Xiao 0002
  • Yuan Guan
  • Hang Su

Human-robot shared control, which integrates the advantages of both humans and robots, is an effective approach to facilitate efficient surgical operation. Learning from demonstration (LfD) techniques can be used to automate some of the surgical sub-tasks for the construction of the shared control mechanism. However, a sufficient amount of data is required for the robot to learn the manoeuvres. Using a surgical simulator to collect data is a less resource-demanding approach. With sim-to-real adaptation, the manoeuvres learned from a simulator can be transferred to a physical robot. To this end, we propose a sim-to-real adaptation method to construct a human-robot shared control framework for robotic surgery. In this paper, a desired trajectory is generated from a simulator using an LfD method, while dynamic motion primitives (DMPs) are used to transfer the desired trajectory from the simulator to the physical robotic platform. Moreover, a role adaptation mechanism is developed such that the robot can adjust its role according to the surgical operation contexts predicted by a neural network model. The effectiveness of the proposed framework is validated on the da Vinci Research Kit (dVRK). Results of the user studies indicated that with the adaptive human-robot shared control framework, the path length of the remote controller, the total clutching number and the task completion time can be reduced significantly. The proposed method outperformed traditional manual control via teleoperation.

AAAI Conference 2022 Conference Paper

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

  • Jialian Li
  • Tongzheng Ren
  • Dong Yan
  • Hang Su
  • Jun Zhu

In high-stake scenarios like medical treatment and autopiloting, it is risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from inherent mismatches between the simulator and the real environment. It is therefore imperative to utilize the simulator to learn a robust policy for real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDP), where the agent tries to seek a robust policy with respect to unexpected perturbations on the environments. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To solve this issue, we propose a generic method which formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and further show that the Nash Equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with high probability. Our method is able to deal with general perturbations under some mild assumptions and can also be extended to more complex problems like the robust partially observable Markov decision process, thanks to the game-theoretical formulation.
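The two-player zero-sum view the abstract describes can be made concrete with classical robust value iteration: at each state the agent maximizes over actions while an adversary picks the worst transition model from an uncertainty set. The tiny two-state MDP and two-model uncertainty set below are made up for illustration; this is the textbook dynamic-programming version, not the paper's sample-based algorithm.

```python
# Robust value iteration on a toy RMDP: max over actions, min over a finite
# uncertainty set of transition models. Converges by the usual contraction
# argument for gamma < 1.

def robust_value_iteration(states, actions, models, reward,
                           gamma=0.9, iters=300):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(                                   # agent: best action
                min(                                  # adversary: worst model
                    reward(s, a)
                    + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                    for P in models
                )
                for a in actions
            )
            for s in states
        }
    return V

# Two states (state 1 is rewarding), two actions (0 = stay, 1 = go).
nominal = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0},
           (1, 0): {1: 1.0}, (1, 1): {1: 1.0}}
# Perturbed model: "go" only succeeds half the time.
perturbed = {(0, 0): {0: 1.0}, (0, 1): {0: 0.5, 1: 0.5},
             (1, 0): {1: 1.0}, (1, 1): {0: 0.5, 1: 0.5}}

V = robust_value_iteration([0, 1], [0, 1], [nominal, perturbed],
                           reward=lambda s, a: 1.0 if s == 1 else 0.0)
```

Here the robust value of state 0 (about 8.18) is strictly below the nominal-model value (9), reflecting the adversary's worst-case choice of transitions.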

JMLR Journal 2022 Journal Article

Tianshou: A Highly Modularized Deep Reinforcement Learning Library

  • Jiayi Weng
  • Huayu Chen
  • Dong Yan
  • Kaichao You
  • Alexis Duburcq
  • Minghao Zhang
  • Yi Su
  • Hang Su

In this paper, we present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend. Tianshou intends to be research-friendly by providing a flexible and reliable infrastructure of DRL algorithms. It supports online and offline training with more than 20 classic algorithms through a unified interface. To facilitate related research and prove Tianshou's reliability, we have released Tianshou's benchmark of MuJoCo environments, covering eight classic algorithms with state-of-the-art performance. We open-sourced Tianshou at https://github.com/thu-ml/tianshou/.

IJCAI Conference 2022 Conference Paper

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

  • Chengyang Ying
  • Xinning Zhou
  • Hang Su
  • Dong Yan
  • Ning Chen
  • Jun Zhu

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
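The risk measure at the heart of CPPO, conditional value-at-risk, has a simple empirical form: the mean of the worst alpha-fraction of return samples. The snippet below is a minimal illustration of that definition with made-up returns, not the paper's constrained optimization procedure.

```python
# Empirical CVaR at level alpha: the average of the worst (lowest)
# alpha-fraction of return samples.

def cvar(samples, alpha=0.1):
    worst = sorted(samples)[: max(1, int(len(samples) * alpha))]
    return sum(worst) / len(worst)

returns = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
risk = cvar(returns, alpha=0.2)   # mean of the two worst returns
```

CPPO keeps this quantity above a threshold during policy optimization, which is less pessimistic than constraining the single worst-case return.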

NeurIPS Conference 2022 Conference Paper

ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints

  • Yinpeng Dong
  • Shouwei Ruan
  • Hang Su
  • Caixin Kang
  • Xingxing Wei
  • Jun Zhu

Recent studies have demonstrated that visual recognition models lack robustness to distribution shift. However, current work mainly considers model robustness to 2D image transformations, leaving viewpoint changes in the 3D world less explored. In general, viewpoint changes are prevalent in various real-world applications (e.g., autonomous driving), making it imperative to evaluate viewpoint robustness. In this paper, we propose a novel method called ViewFool to find adversarial viewpoints that mislead visual recognition models. By encoding real-world objects as neural radiance fields (NeRF), ViewFool characterizes a distribution of diverse adversarial viewpoints under an entropic regularizer, which helps to handle the fluctuations of the real camera pose and mitigate the reality gap between the real objects and their neural representations. Experiments validate that the common image classifiers are extremely vulnerable to the generated adversarial viewpoints, which also exhibit high cross-model transferability. Based on ViewFool, we introduce ImageNet-V, a new out-of-distribution dataset for benchmarking viewpoint robustness of image classifiers. Evaluation results on 40 classifiers with diverse architectures, objective functions, and data augmentations reveal a significant drop in model performance when tested on ImageNet-V, which provides a possibility to leverage ViewFool as an effective data augmentation strategy to improve viewpoint robustness.

NeurIPS Conference 2021 Conference Paper

Accumulative Poisoning Attacks on Real-time Data

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Hang Su
  • Jun Zhu

Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on MNIST and CIFAR-10, we show that model accuracy significantly drops by a single update step on the trigger batch after the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.

IJCAI Conference 2021 Conference Paper

Combining Tree Search and Action Prediction for State-of-the-Art Performance in DouDiZhu

  • Yunsheng Zhang
  • Dong Yan
  • Bei Shi
  • Haobo Fu
  • Qiang Fu
  • Hang Su
  • Jun Zhu
  • Ning Chen

AlphaZero has achieved superhuman performance on various perfect-information games, such as chess, shogi and Go. However, directly applying AlphaZero to imperfect-information games (IIG) is infeasible, due to the fact that traditional MCTS methods cannot handle missing information of other players. Meanwhile, there have been several extensions of MCTS for IIGs, by implicitly or explicitly sampling a state of other players. But, due to the inability to handle private and public information well, the performance of these methods is not satisfactory. In this paper, we extend AlphaZero to multiplayer IIGs by developing a new MCTS method, Action-Prediction MCTS (AP-MCTS). In contrast to traditional MCTS extensions for IIGs, AP-MCTS first builds the search tree based on public information, adopts the policy-value network to generalize between hidden states, and finally predicts other players' actions directly. This design bypasses the inefficiency of sampling and the difficulty of predicting the state of other players. We conduct extensive experiments on the popular 3-player poker game DouDiZhu to evaluate the performance of AP-MCTS combined with the framework AlphaZero. When playing against experienced human players, AP-MCTS achieved a 65.65% winning rate, which is almost twice the human's winning rate. When compared with state-of-the-art DouDiZhu AIs, the Elo rating of AP-MCTS is 50 to 200 points higher than theirs. The ablation study shows that accurate action prediction is the key to AP-MCTS winning.

AAAI Conference 2021 Conference Paper

Composite Adversarial Attacks

  • Xiaofeng Mao
  • Yuefeng Chen
  • Shuhui Wang
  • Hang Su
  • Yuan He
  • Hui Xue

Adversarial attack is a technique for deceiving Machine Learning (ML) models, which provides a way to evaluate adversarial robustness. In practice, attack algorithms are artificially selected and tuned by human experts to break an ML system. However, manual selection of attackers tends to be sub-optimal, leading to a mistaken assessment of model security. In this paper, a new procedure called Composite Adversarial Attack (CAA) is proposed for automatically searching for the best combination of attack algorithms and their hyperparameters from a candidate pool of 32 base attackers. We design a search space where an attack policy is represented as an attacking sequence, i.e., the output of the previous attacker is used as the initialization input for its successors. The multi-objective NSGA-II genetic algorithm is adopted for finding the strongest attack policy with minimum complexity. The experimental results show that CAA beats 10 top attackers on 11 diverse defenses with less elapsed time (6× faster than AutoAttack), and achieves a new state of the art on l∞, l2 and unrestricted adversarial attacks.

AAAI Conference 2021 Conference Paper

Learning Task-Distribution Reward Shaping with Meta-Learning

  • Haosheng Zou
  • Tongzheng Ren
  • Dong Yan
  • Hang Su
  • Jun Zhu

Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment and accelerate reinforcement learning. However, designing shaping functions usually requires rich expert knowledge and hand-engineering, and the difficulties are further exacerbated given multiple tasks to solve. In this paper, we consider reward shaping on a distribution of tasks that share state spaces but not necessarily action spaces. We provide insights into optimal reward shaping, and propose a novel meta-learning framework to automatically learn such reward shaping and apply it to newly sampled tasks. Theoretical analysis and extensive experiments establish our method as the state of the art in learning task-distribution reward shaping, outperforming previous such works (Konidaris and Barto 2006; Snel and Whiteson 2014). We further show that our method outperforms learning intrinsic rewards (Yang et al. 2019; Zheng et al. 2020), outperforms Rainbow (Hessel et al. 2018) in complex pixel-based CoinRun games, and is also better than hand-designed reward shaping on grid mazes. While the goal of this paper is to learn reward shaping rather than to propose new general meta-learning algorithms such as PEARL (Rakelly et al. 2019) or MQL (Fakoor et al. 2020), our framework, based on MAML (Finn, Abbeel, and Levine 2017), also outperforms PEARL/MQL, and could be combined with them for further improvement.
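As background to what is being learned here, the classical potential-based form of reward shaping (Ng, Harada, and Russell 1999) keeps the optimal policy unchanged. The snippet below illustrates that standard form with a toy potential of my own choosing; the learned, task-distribution shaping in the paper is more general.

```python
# Classical potential-based reward shaping:
#   r'(s, a, s') = r + gamma * Phi(s') - Phi(s)
# which provably preserves the optimal policy. Phi here is an illustrative
# potential, e.g., a rough distance-to-goal proxy.

def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    phi_next = 0.0 if done else phi(s_next)
    return r + gamma * phi_next - phi(s)

phi = lambda s: float(s)   # toy potential over integer states
r_shaped = shaped_reward(1.0, s=2, s_next=3, phi=phi, gamma=0.9)
```

Along any trajectory the shaping terms telescope, which is why the extra signal can speed up credit assignment without changing which policy is optimal.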

NeurIPS Conference 2020 Conference Paper

Adversarial Distributional Training for Robust Deep Learning

  • Yinpeng Dong
  • Zhijie Deng
  • Tianyu Pang
  • Jun Zhu
  • Hang Su

Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples. However, most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks. Besides, a single attack algorithm could be insufficient to explore the space of perturbations. In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models. ADT is formulated as a minimax optimization problem, where the inner maximization aims to learn an adversarial distribution to characterize the potential adversarial examples around a natural one under an entropic regularizer, and the outer minimization aims to train robust models by minimizing the expected loss over the worst-case adversarial distributions. Through a theoretical analysis, we develop a general algorithm for solving ADT, and present three approaches for parameterizing the adversarial distributions, ranging from the typical Gaussian distributions to the flexible implicit ones. Empirical results on several benchmarks validate the effectiveness of ADT compared with the state-of-the-art AT methods.

NeurIPS Conference 2020 Conference Paper

Bi-level Score Matching for Learning Energy-based Latent Variable Models

  • Fan Bao
  • Chongxuan Li
  • Kun Xu
  • Hang Su
  • Jun Zhu
  • Bo Zhang

Score matching (SM) provides a compelling approach to learn energy-based models (EBMs) by avoiding the calculation of partition function. However, it remains largely open to learn energy-based latent variable models (EBLVMs), except some special cases. This paper presents a bi-level score matching (BiSM) method to learn EBLVMs with general structures by reformulating SM as a bi-level optimization problem. The higher level introduces a variational posterior of the latent variables and optimizes a modified SM objective, and the lower level optimizes the variational posterior to fit the true posterior. To solve BiSM efficiently, we develop a stochastic optimization algorithm with gradient unrolling. Theoretically, we analyze the consistency of BiSM and the convergence of the stochastic algorithm. Empirically, we show the promise of BiSM in Gaussian restricted Boltzmann machines and highly nonstructural EBLVMs parameterized by deep convolutional neural networks. BiSM is comparable to the widely adopted contrastive divergence and SM methods when they are applicable; and can learn complex EBLVMs with intractable posteriors to generate natural images.

NeurIPS Conference 2020 Conference Paper

Boosting Adversarial Training with Hypersphere Embedding

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Kun Xu
  • Jun Zhu
  • Hang Su

Adversarial training (AT) is one of the most effective defenses against adversarial attacks for deep learning models. In this work, we advocate incorporating the hypersphere embedding (HE) mechanism into the AT procedure by regularizing the features onto compact manifolds, which constitutes a lightweight yet effective module to blend in the strength of representation learning. Our extensive analyses reveal that AT and HE are well coupled to benefit the robustness of the adversarially trained models from several aspects. We validate the effectiveness and adaptability of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In the experiments, we evaluate our methods under a wide range of adversarial attacks on the CIFAR-10 and ImageNet datasets, which verifies that integrating HE can consistently enhance the model robustness for each AT framework with little extra computation.
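The hypersphere embedding mechanism can be sketched concisely: features and classifier weights are L2-normalized so logits become scaled cosine similarities on the unit sphere. The scale value and function names below are illustrative assumptions, not the paper's exact parameterization.

```python
import math

# Hypersphere-embedding-style logits: normalize the feature and the class
# weight onto the unit sphere, then score by scaled cosine similarity.

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_logit(feature, weight, scale=10.0):
    f, w = normalize(feature), normalize(weight)
    return scale * sum(a * b for a, b in zip(f, w))

aligned = cosine_logit([3.0, 4.0], [3.0, 4.0])   # identical directions
orthogonal = cosine_logit([1.0, 0.0], [0.0, 1.0])
```

Restricting features to this compact manifold is the regularization effect the abstract describes as blending representation learning into AT.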

AAAI Conference 2020 Conference Paper

Dynamic Network Pruning with Interpretable Layerwise Channel Selection

  • Yulong Wang
  • Xiaolu Zhang
  • Xiaolin Hu
  • Bo Zhang
  • Hang Su

Dynamic network pruning achieves runtime acceleration by dynamically determining the inference paths based on different inputs. However, previous methods directly generate continuous decision values for each weight channel, which cannot reflect a clear and interpretable pruning process. In this paper, we propose to explicitly model the discrete weight channel selections, which encourages more diverse weights utilization, and achieves more sparse runtime inference paths. Meanwhile, with the help of interpretable layerwise channel selections in the dynamic network, we can visualize the network decision paths explicitly for model interpretability. We observe that there are clear differences in the layerwise decisions between normal and adversarial examples. Therefore, we propose a novel adversarial example detection algorithm by discriminating the runtime decision features. Experiments show that our dynamic network achieves higher prediction accuracy under similar computing budgets on CIFAR10 and ImageNet datasets compared to traditional static pruning methods and other dynamic pruning approaches. The proposed adversarial detection algorithm can significantly improve the state-of-the-art detection rate across multiple attacks, which provides an opportunity to build an interpretable and robust model.

AAAI Conference 2020 Conference Paper

Pruning from Scratch

  • Yulong Wang
  • Xiaolu Zhang
  • Lingxi Xie
  • Jun Zhou
  • Hang Su
  • Bo Zhang
  • Xiaolin Hu

Network pruning is an important research field aiming at reducing computational costs of neural networks. Conventional approaches follow a fixed paradigm which first trains a large and redundant network, and then determines which units (e.g., channels) are less important and thus can be removed. In this work, we find that pre-training an over-parameterized model is not necessary for obtaining the target pruned structure. In fact, a fully-trained over-parameterized model will reduce the search space for the pruned structure. We empirically show that more diverse pruned structures can be directly pruned from randomly initialized weights, including potential models with better performance. Therefore, we propose a novel network pruning pipeline which allows pruning from scratch with little training overhead. In the experiments for compressing classification models on CIFAR10 and ImageNet datasets, our approach not only greatly reduces the pre-training burden of traditional pruning methods, but also achieves similar or even higher accuracy under the same computation budgets. Our results facilitate the community to rethink the effectiveness of existing techniques used for network pruning.
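The conventional pruning paradigm both of these pruning papers contrast against can be sketched as simple magnitude-based channel selection: rank channels by L1 norm and keep the strongest. This is background for the baseline approach, not either paper's method; the data below is made up.

```python
# Conventional static channel pruning: keep the channels whose weights have
# the largest L1 norms, discard the rest.

def prune_channels(channel_weights, keep_ratio=0.5):
    """Return the (sorted) indices of channels to keep."""
    norms = [sum(abs(w) for w in ch) for ch in channel_weights]
    k = max(1, int(len(channel_weights) * keep_ratio))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])

# Four toy channels, each a flat list of weights.
weights = [[0.1, -0.1], [1.0, 2.0], [0.0, 0.05], [-3.0, 0.5]]
kept = prune_channels(weights, keep_ratio=0.5)   # keeps channels 1 and 3
```

"Pruning from Scratch" observes that structures found this way after full pre-training are less diverse than those reachable from random initialization.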

AAAI Conference 2019 Conference Paper

Combo-Action: Training Agent For FPS Game with Auxiliary Tasks

  • Shiyu Huang
  • Hang Su
  • Jun Zhu
  • Ting Chen

Deep reinforcement learning (DRL) has achieved performance surpassing humans on Atari games, using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher levels of human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper, we explore a novel method which can plan on temporally-extended action sequences, which we refer to as Combo-Action, to compress the action space. We further train a deep recurrent Q-learning network model as a high-level controller, called the supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction), which enable the agent to extract high-level concepts in FPS games. Extensive experiments show that our method is efficient in the training process and outperforms previous state-of-the-art approaches by a large margin. Ablation experiments also indicate that our method can boost the performance of the FPS agent in a reasonable way.

NeurIPS Conference 2019 Conference Paper

Improving Black-box Adversarial Attacks with a Transfer-based Prior

  • Shuyu Cheng
  • Yinpeng Dong
  • Tianyu Pang
  • Hang Su
  • Jun Zhu

We consider the black-box adversarial setting, where the adversary has to generate adversarial perturbations without access to the target models to compute gradients. Previous methods tried to approximate the gradient either by using a transfer gradient of a surrogate white-box model, or based on the query feedback. However, these methods often suffer from low attack success rates or poor query efficiency since it is non-trivial to estimate the gradient in a high-dimensional space with limited information. To address these problems, we propose a prior-guided random gradient-free (P-RGF) method to improve black-box adversarial attacks, which takes advantage of a transfer-based prior and the query information simultaneously. The transfer-based prior, given by the gradient of a surrogate model, is appropriately integrated into our algorithm via an optimal coefficient derived from a theoretical analysis. Extensive experiments demonstrate that our method requires far fewer queries to attack black-box models with higher success rates compared with alternative state-of-the-art methods.
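The random gradient-free estimation underlying P-RGF can be sketched as directional finite differences along random directions, with sampling biased toward a prior direction. The fixed mixing weight below is my own simplification; the paper derives an optimal coefficient instead.

```python
import random

# Sketch of prior-guided random gradient-free estimation: query the loss
# along random directions biased toward a transfer-based prior, and
# accumulate directional finite differences into a gradient estimate.
# The fixed mixing weight `lam` is illustrative, not the paper's optimum.

def rgf_gradient(loss, x, prior, queries=200, sigma=1e-4, lam=0.5):
    dim = len(x)
    grad = [0.0] * dim
    base = loss(x)
    for _ in range(queries):
        u = [random.gauss(0, 1) for _ in range(dim)]
        # bias the random direction toward the prior direction
        u = [lam * p + (1 - lam) * ui for p, ui in zip(prior, u)]
        norm = sum(c * c for c in u) ** 0.5
        u = [c / norm for c in u]
        d = (loss([xi + sigma * ui for xi, ui in zip(x, u)]) - base) / sigma
        grad = [g + d * ui / queries for g, ui in zip(grad, u)]
    return grad

random.seed(0)
loss = lambda v: sum(c * c for c in v)        # toy loss, true gradient = 2x
g = rgf_gradient(loss, [1.0, 0.0], prior=[1.0, 0.0])
```

With a useful prior, far fewer queries concentrate on informative directions, which is the query-efficiency gain the abstract reports.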

AAMAS Conference 2019 Conference Paper

Learn a Robust Policy in Adversarial Games via Playing with an Expert Opponent

  • Jialian Li
  • Tongzheng Ren
  • Hang Su
  • Jun Zhu

Reinforcement learning methods such as AlphaZero have achieved super-human performance in adversarial games by training in a self-play manner. However, they generally require a large amount of computational resources to search for an (approximately) optimal policy in the joint state-action space involving both players and the environment. To accelerate the exploration process, we propose a new paradigm of “learning by playing” by considering scenarios where expert opponents are accessible. By observing the opponent's actions, the agent accelerates exploration by assigning more search resources to these actions. To alleviate the sparse reward issue when facing the expert opponent at the beginning, we propose a novel method called Ladder Opponent Modeling (LOM), which builds a ladder opponent to facilitate the learning process. The agent alternately plays against both the expert and the ladder opponent, gradually improving its competence. The online manner of the ladder opponent generates auxiliary tasks gradually, yielding a tractable improvement for the agent.

IJCAI Conference 2019 Conference Paper

Playing FPS Games With Environment-Aware Hierarchical Reinforcement Learning

  • Shihong Song
  • Jiayi Weng
  • Hang Su
  • Dong Yan
  • Haosheng Zou
  • Jun Zhu

Learning rational behaviors in First-person-shooter (FPS) games is a challenging task for Reinforcement Learning (RL) with the primary difficulties of huge action space and insufficient exploration. To address this, we propose a hierarchical agent based on combined options with intrinsic rewards to drive exploration. Specifically, we present a hierarchical model that works in a manager-worker fashion over two levels of hierarchy. The high-level manager learns a policy over options, and the low-level workers, motivated by intrinsic reward, learn to execute the options. Performance is further improved with environmental signals appropriately harnessed. Extensive experiments demonstrate that our trained bot significantly outperforms the alternative RL-based models on FPS games requiring maze solving and combat skills, etc. Notably, we achieved first place in VDAIC 2018 Track 1.

AAAI Conference 2019 Conference Paper

Sparse Adversarial Perturbations for Videos

  • Xingxing Wei
  • Jun Zhu
  • Sha Yuan
  • Hang Su

Although adversarial samples of deep neural networks (DNNs) have been intensively studied on static images, their extensions to videos have rarely been explored. Compared with images, attacking a video needs to consider not only spatial cues but also temporal cues. Moreover, to improve the imperceptibility as well as reduce the computation cost, perturbations should be added on as few frames as possible, i.e., adversarial perturbations are temporally sparse. This further motivates the propagation of perturbations, which denotes that perturbations added on the current frame can transfer to the next frames via their temporal interactions; thus, no (or few) extra perturbations are needed for these frames to misclassify them. To this end, we propose the first white-box video attack method, which utilizes an l2,1-norm based optimization algorithm to compute the sparse adversarial perturbations for videos. We choose action recognition as the targeted task, and networks with a CNN+RNN architecture as threat models to verify our method. Thanks to the propagation, we can compute perturbations on a shortened version of the video, and then adapt them to the long version to fool DNNs. Experimental results on the UCF101 dataset demonstrate that even when only one frame in a video is perturbed, the fooling rate can still reach 59.7%.
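The l2,1 norm that induces the temporal sparsity described above has a simple definition: the sum over frames of each frame's l2 norm, so the optimizer is pushed to zero out entire frames. A minimal illustration with made-up perturbation values:

```python
import math

# l2,1 norm of a video perturbation: sum over frames of each frame's
# l2 norm. Minimizing it drives whole frames' perturbations to zero,
# yielding temporally sparse attacks.

def l21_norm(perturbation):
    """perturbation: list of frames, each a flat list of pixel deltas."""
    return sum(math.sqrt(sum(p * p for p in frame)) for frame in perturbation)

video_pert = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]   # only frame 0 perturbed
sparsity_penalty = l21_norm(video_pert)
```

Unlike a plain l2 penalty over all pixels, the group structure makes "perturb few frames" cheaper than "perturb all frames a little".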

AAAI Conference 2018 Conference Paper

Graph Correspondence Transfer for Person Re-Identification

  • Qin Zhou
  • Heng Fan
  • Shibao Zheng
  • Hang Su
  • Xinzhe Li
  • Shuang Wu
  • Haibin Ling

In this paper, we propose a graph correspondence transfer (GCT) approach for person re-identification. Unlike existing methods, the GCT model formulates person re-identification as an offline graph matching and online correspondence transfer problem. Specifically, during training, the GCT model learns offline a set of correspondence templates from positive training pairs with various pose-pair configurations via patch-wise graph matching. During testing, for each pair of test samples, we select a few training pairs with the most similar pose-pair configurations as references, and transfer the correspondences of these references to the test pair for feature distance calculation. The matching score is derived by aggregating distances from the different references. For each probe image, the gallery image with the highest matching score is the re-identification result. Compared to existing algorithms, GCT can handle spatial misalignment caused by large variations in view angles and human poses, owing to the benefits of patch-wise graph matching. Extensive experiments on five benchmarks, including VIPeR, Road, PRID450S, 3DPES and CUHK01, evidence the superior performance of the GCT model over other state-of-the-art methods.
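As a rough illustration of the correspondence-transfer scoring described above, the sketch below picks the most pose-similar reference pairs, applies each reference's patch correspondence to the test pair, and aggregates the resulting distances. All names, the data layout, and the aggregation-by-mean choice are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def gct_style_score(probe_patches, gallery_patches, references, k=3):
    """Hypothetical GCT-style matching score.
    probe_patches, gallery_patches: (num_patches, feat_dim) arrays.
    references: list of (pose_similarity, correspondence) where
    correspondence maps probe patch index -> gallery patch index.
    Patch distances under the k most pose-similar references are
    averaged; the negated mean serves as a similarity score, so a
    higher score means a better match."""
    top_refs = sorted(references, key=lambda r: -r[0])[:k]
    dists = []
    for _, corr in top_refs:
        d = np.mean([np.linalg.norm(probe_patches[i] - gallery_patches[j])
                     for i, j in corr.items()])
        dists.append(d)
    return -float(np.mean(dists))
```

Under this sketch, ranking a probe against the gallery amounts to computing this score for every gallery image and taking the argmax.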

IJCAI Conference 2018 Conference Paper

Learning to Write Stylized Chinese Characters by Reading a Handful of Examples

  • Danyang Sun
  • Tongzheng Ren
  • Chongxuan Li
  • Hang Su
  • Jun Zhu

Automatically writing stylized characters is an attractive yet challenging task, especially for Chinese characters with complex shapes and structures. Most current methods are restricted to generating stylized characters already present in the training set and require retraining the model to generate characters of new styles. In this paper, we develop a novel framework, the Style-Aware Variational Auto-Encoder (SA-VAE), which disentangles the content-relevant and style-relevant components of a Chinese character's features with a novel intercross pair-wise optimization method. Our method can thus generate Chinese characters flexibly after reading only a few examples. Experiments demonstrate that our method has powerful one-shot/few-shot generalization ability by inferring the style representation; to our knowledge, this is the first attempt to learn to write new-style Chinese characters from only one or a few examples.

AAAI Conference 2018 Conference Paper

Understanding Human Behaviors in Crowds by Imitating the Decision-Making Process

  • Haosheng Zou
  • Hang Su
  • Shihong Song
  • Jun Zhu

Crowd behavior understanding is crucial yet challenging across a wide range of applications, since crowd behavior is inherently determined by a sequential decision-making process based on various factors, such as pedestrians' own destinations, interactions with nearby pedestrians, and anticipation of upcoming events. In this paper, we propose a novel framework, Social-Aware Generative Adversarial Imitation Learning (SA-GAIL), to mimic the underlying decision-making process of pedestrians in crowds. Specifically, we infer the latent factors of the human decision-making process in an unsupervised manner by extending the Generative Adversarial Imitation Learning framework to anticipate pedestrians' future paths. Different factors of human decision making are disentangled via mutual information maximization, with the process modeled by a collision avoidance regularization and Social-Aware LSTMs. Experimental results demonstrate our framework's potential in disentangling the latent decision-making factors of pedestrians and its stronger ability to predict future trajectories.

IJCAI Conference 2017 Conference Paper

Forecast the Plausible Paths in Crowd Scenes

  • Hang Su
  • Jun Zhu
  • Yinpeng Dong
  • Bo Zhang

Forecasting the plausible future paths of pedestrians in crowd scenes has wide applications, but it remains a challenging task due to the complexities and uncertainties of crowd motions. To address these issues, we propose to explore the inherent crowd dynamics via a social-aware recurrent Gaussian process model, which facilitates path prediction by exploiting the interplay between rich prior knowledge and motion uncertainties. Specifically, we derive a social-aware LSTM to explore the crowd dynamics, resulting in a hidden feature that embeds the rich prior found in massive data. Afterwards, we integrate this descriptor into deep Gaussian processes with motion uncertainties appropriately harnessed. Crowd motion forecasting is implemented by regressing relative motion against the current positions, yielding predicted paths based on a functional object associated with a distribution. Extensive experiments on public datasets demonstrate that our method obtains state-of-the-art performance in both structured and unstructured scenes by exploring complex and uncertain motion patterns, even when occlusion is severe or the observed trajectories are noisy.

IJCAI Conference 2017 Conference Paper

Semi-supervised Max-margin Topic Model with Manifold Posterior Regularization

  • Wenbo Hu
  • Jun Zhu
  • Hang Su
  • Jingwei Zhuo
  • Bo Zhang

Supervised topic models leverage label information to learn discriminative latent topic representations. As collecting a fully labeled dataset is often time-consuming, semi-supervised learning is of high interest. In this paper, we present an effective semi-supervised max-margin topic model, named LapMedLDA, by naturally introducing manifold posterior regularization into a regularized Bayesian topic model. The model jointly learns latent topics and a related classifier with only a small fraction of labeled documents. To perform approximate inference, we derive an efficient stochastic gradient MCMC method. Unlike previous semi-supervised topic models, our model adopts a tight coupling between the generative topic model and the discriminative classifier. Extensive experiments demonstrate that this tight coupling brings significant benefits in quantitative and qualitative performance.

IJCAI Conference 2016 Conference Paper

Crowd Scene Understanding with Coherent Recurrent Neural Networks

  • Hang Su
  • Yinpeng Dong
  • Jun Zhu
  • Haibin Ling
  • Bo Zhang

Exploring crowd dynamics is essential to understanding crowd scenes, which remains a challenging task due to the nonlinear characteristics and coherent spatio-temporal motion patterns of crowd behaviors. To address these issues, we present a Coherent Long Short-Term Memory (cLSTM) network that captures the nonlinear crowd dynamics by learning an informative representation of crowd motions, which facilitates critical tasks in crowd scene analysis. Describing the crowd motion patterns with a cloud of keypoint tracklets, we explore the nonlinear crowd dynamics embedded in the tracklets with a stacked LSTM model, which is further improved to capture collective properties by introducing a coherent regularization term; finally, we adopt an unsupervised encoder-decoder framework to learn, for each input tracklet, a hidden feature that embeds its inherent dynamics. With the learnt features properly harnessed, crowd scene understanding is conducted effectively: predicting the future paths of agents, estimating group states, and classifying crowd events. Extensive experiments on hundreds of public crowd videos demonstrate that our method achieves state-of-the-art performance by exploring the coherent spatio-temporal structures in crowd behaviors.
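One way to picture the coherent regularization term mentioned in this abstract is as a penalty pulling each tracklet's hidden state toward the mean state of its spatial neighbors, so that tracklets moving together share similar representations. The sketch below is a hypothetical simplification; the actual cLSTM term, its weighting, and the neighborhood definition may differ:

```python
import numpy as np

def coherence_penalty(hidden_states, neighbors, weight=0.1):
    """Hypothetical coherence regularizer in the spirit of cLSTM.
    hidden_states: (num_tracklets, dim) array of per-tracklet states.
    neighbors: list where neighbors[i] holds the indices of tracklet
    i's spatial neighbors. Each state is penalized for deviating from
    the mean state of its neighbors, encouraging collective motion
    features; tracklets with no neighbors contribute nothing."""
    penalty = 0.0
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        mean_nbr = hidden_states[nbrs].mean(axis=0)
        penalty += np.sum((hidden_states[i] - mean_nbr) ** 2)
    return weight * penalty
```

In training, such a term would simply be added to the reconstruction loss of the encoder-decoder, trading off fidelity to each tracklet against coherence with its neighborhood.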