Author name cluster

Kunyu Peng

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers

2 author rows

AAAI Conference 2026 Conference Paper

HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen
Omar Moured
Ruiping Liu
Junwei Zheng
Kunyu Peng
Jiaming Zhang
Rainer Stiefelhagen

Conventional document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M6Doc benchmarks demonstrate that HybriDLA sets a state-of-the-art performance, outperforming previous approaches.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding

Yuchen Zhou
Jiamin Wu
Zichen Ren
Zhouheng Yao
Weiheng Lu
Kunyu Peng
Qihao Zheng
Chunfeng Song

Understanding and decoding human brain activity from electroencephalography (EEG) signals is a fundamental problem in neuroscience and artificial intelligence, with applications ranging from cognition and emotion recognition to clinical diagnosis and brain–computer interfaces. While recent EEG foundation models have made progress in generalized brain decoding by leveraging unified architectures and large-scale pretraining, they inherit a scale-agnostic dense modeling paradigm from NLP and vision. This design overlooks an intrinsic property of neural activity—cross-scale spatiotemporal structure. Different EEG task patterns span a broad range of temporal and spatial scales, from brief neural activations to slow-varying rhythms, and from localized cortical activations to large-scale distributed interactions. Ignoring this diversity may lead to suboptimal representations and weakened generalization ability. To address these limitations, we propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces two key components: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features within localized temporal windows and anatomical brain regions into compact scale-aware token representations; and (ii) Structured Sparse Attention (SSA), which models cross-window and cross-region dependencies for diverse decoding tasks, further enriching scale diversities while eliminating the spurious dependencies. CST and SSA are alternately stacked to progressively integrate cross-scale spatiotemporal dependencies. Extensive experiments across 11 representative EEG tasks and 16 datasets demonstrate that CSBrain consistently outperforms both task-specific models and strong foundation baselines. These results establish cross-scale modeling as a key inductive bias for generalized EEG decoding and highlight CSBrain as a robust backbone for future brain–AI research.

PDF Details

IROS Conference 2025 Conference Paper

Dense Semantic Bird-Eye-View Map Generation from Sparse LiDAR Point Clouds via Distribution-aware Feature Fusion

Jinsong Li
Kunyu Peng
Yuxiang Sun 0002

Semantic scene understanding in bird-eye view (BEV) plays a crucial role in autonomous driving. A common approach to generating BEV maps from LiDAR point-cloud data involves constructing a pillar-level representation by projecting 3D point clouds onto a 2D plane. This process partially discards spatial geometric information, and produces sparse semantic maps. However, downstream tasks (e. g. , trajectory planning and prediction), typically require dense grid-like semantic BEV maps rather than sparse segmentation outputs. To bridge this gap, we propose PointDenseBEV, an end-to-end, distribution-aware feature fusion framework. It takes as input sparse LiDAR point clouds and directly generates dense semantic BEV maps. Spatial geometric information and temporal context are embedded as auxiliary semantic cues within the BEV grid representation to enhance semantic density. Extensive experiments on the SemanticKITTI dataset demonstrate that our method achieves competitive performance compared to existing approaches.

Details

ICLR Conference 2025 Conference Paper

Graph-based Document Structure Analysis

Yufan Chen 0001
Ruiping Liu 0001
Junwei Zheng
Di Wen 0006
Kunyu Peng
Jiaming Zhang 0001
Rainer Stiefelhagen

When reading a document, glancing at the spatial layout of a document is an initial step to understand it roughly. Traditional document layout analysis (DLA) methods, however, offer only a superficial parsing of documents, focusing on basic instance detection and often failing to capture the nuanced spatial and logical relationships between instances. These limitations hinder DLA-based models from achieving a gradually deeper comprehension akin to human reading. In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that model not only detects document elements but also generates spatial and logical relations in form of a graph structure, allowing to understand documents in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset(GraphDoc) with 80K document images and 4.13M relation annotations, enabling training models to complete multiple tasks like reading order, hierarchical structures analysis, and complex inter-element relationship inference. Furthermore, a document relation graph generator (DRGG) is proposed to address the gDSA task, which achieves performance with 57.6% at $mAP_g$@$0.5$ for a strong benchmark baseline on this novel task and dataset. We hope this graphical representation of document structure can mark an innovative advancement in document structure analysis and understanding. The new dataset and code will be made publicly available.

Details

NeurIPS Conference 2025 Conference Paper

HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng
Junchao Huang
Xiangsheng Huang
Di Wen
Junwei Zheng
Yufan Chen
Kailun Yang
Jiamin Wu

Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i. e. , RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i. e. , HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https: //github. com/KPeng9510/HopaDIFF.

PDF Details

ICML Conference 2025 Conference Paper

MindAligner: Explicit Brain Functional Alignment for Cross-Subject Visual Decoding from Limited fMRI Data

Yuqin Dai
Zhouheng Yao
Chunfeng Song
Qihao Zheng
Weijian Mai
Kunyu Peng
Shuai Lu
Wanli Ouyang

Brain decoding aims to reconstruct visual perception of human subject from fMRI signals, which is crucial for understanding brain’s perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject to one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.

Details

NeurIPS Conference 2025 Conference Paper

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu
Junwei Zheng
Yufan Chen
Zirui Wang
Kunyu Peng
Kailun Yang
Jiaming Zhang
Marc Pollefeys

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs. The established dataset and source code are publicly available at: https: //github. com/RuipingL/Situat3DChange.

PDF Details

IROS Conference 2025 Conference Paper

VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility

Yitian Shi
Di Wen 0006
Guanqi Chen
Edgar Welte
Sheng Liu
Kunyu Peng
Rainer Stiefelhagen
Rania Rayyes

We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of 87. 5% in target-oriented grasping with the fewest grasp attempts outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact

Details

NeurIPS Conference 2024 Conference Paper

Advancing Open-Set Domain Generalization Using Evidential Bi-Level Hardest Domain Scheduler

Kunyu Peng
Di Wen
Kailun Yang
Ao Luo
Yufan Chen
Jia Fu
M. Saquib Sarfraz
Alina Roitberg

In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliabilities in utilizing a follower network, trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bilevel manner. We verify our approach on three OSDG benchmarks, i. e. , PACS, DigitsDG, and OfficeHome. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories, underscoring the advantage of a judicious domain scheduler for the generalizability to unseen domains and unseen categories. The source code is publicly available at https: //github. com/KPeng9510/EBiL-HaDS.

PDF Details DOI

ICRA Conference 2024 Conference Paper

MateRobot: Material Recognition in Wearable Robotics for People with Visual Impairments

Junwei Zheng
Jiaming Zhang 0001
Kailun Yang 0001
Kunyu Peng
Rainer Stiefelhagen

People with Visual Impairments (PVI) typically recognize objects through haptic perception. Knowing objects and materials before touching is desired by the target users but under-explored in the field of human-centered robotics. To fill this gap, in this work, a wearable vision-based robotic system, MATERobot, is established for PVI to recognize materials and object categories beforehand. To address the computational constraints of mobile platforms, we propose a lightweight yet accurate model MATEViT to perform pixel-wise semantic segmentation, simultaneously recognizing both objects and materials. Our methods achieve respective 40. 2% and 51. 1% of mIoU on COCOStuff-10K and DMS datasets, surpassing the previous method with +5. 7% and +7. 0% gains. Moreover, on the field test with participants, our wearable system reaches a score of 28 in the NASA-Task Load Index, indicating low cognitive demands and ease of use. Our MATERobot demonstrates the feasibility of recognizing material property through visual cues and offers a promising step towards improving the functionality of wearable robots for PVI. The source code has been made publicly available at MATERobot.

Details

NeurIPS Conference 2024 Conference Paper

Muscles in Time: Learning to Understand Human Motion In-Depth by Simulating Muscle Activations

David Schneider
Simon Reiß
Marco Kugler
Alexander Jaus
Kunyu Peng
Susanne Sutschet
M. S. Sarfraz
Sven Matthiesen

Exploring the intricate dynamics between muscular and skeletal structures is pivotal for understanding human motion. This domain presents substantial challenges, primarily attributed to the intensive resources required for acquiring ground truth muscle activation data, resulting in a scarcity of datasets. In this work, we address this issue by establishing Muscles in Time (MinT), a large-scale synthetic muscle activation dataset. For the creation of MinT, we enriched existing motion capture datasets by incorporating muscle activation simulations derived from biomechanical human body models using the OpenSim platform, a common framework used in biomechanics and human motion research. Starting from simple pose sequences, our pipeline enables us to extract detailed information about the timing of muscle activations within the human musculoskeletal system. Muscles in Time contains over nine hours of simulation data covering 227 subjects and 402 simulated muscle strands. We demonstrate the utility of this dataset by presenting results on neural network-based muscle activation estimation from human pose sequences with two different sequence-to-sequence architectures.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Navigating Open Set Scenarios for Skeleton-Based Action Recognition

Kunyu Peng
Cheng Yin
Junwei Zheng
Ruiping Liu
David Schneider
Jiaming Zhang
Kailun Yang
M. Saquib Sarfraz

In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information.To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. We will release the benchmark, code, and models to the community.

PDF Details DOI

ICML Conference 2024 Conference Paper

Position: Quo Vadis, Unsupervised Time Series Anomaly Detection?

M. Saquib Sarfraz
Mei-Yen Chen
Lukas Layer
Kunyu Peng
Marios Koulakis

The current state of machine learning scholarship in Timeseries Anomaly Detection (TAD) is plagued by the persistent use of flawed evaluation metrics, inconsistent benchmarking practices, and a lack of proper justification for the choices made in novel deep learning-based model designs. Our paper presents a critical analysis of the status quo in TAD, revealing the misleading track of current research and highlighting problematic methods, and evaluation practices. Our position advocates for a shift in focus from solely pursuing novel model designs to improving benchmarking practices, creating non-trivial datasets, and critically evaluating the utility of complex methods against simpler baselines. Our findings demonstrate the need for rigorous evaluation protocols, the creation of simple baselines, and the revelation that state-of-the-art deep anomaly detection models effectively learn linear mappings. These findings suggest the need for more exploration and development of simple and interpretable TAD methods. The increment of model complexity in the state-of-the-art deep-learning based models unfortunately offers very little improvement. We offer insights and suggestions for the field to move forward.

Details

IROS Conference 2024 Conference Paper

Skeleton-Based Human Action Recognition with Noisy Labels

Yi Xu
Kunyu Peng
Di Wen 0006
Ruiping Liu 0001
Junwei Zheng
Yufan Chen 0001
Jiaming Zhang 0001
Alina Roitberg

Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model’s training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at https://github.com/xuyizdby/NoiseEraSAR.

Details

ICRA Conference 2024 Conference Paper

SynthAct: Towards Generalizable Human Action Recognition based on Synthetic Data

David Schneider 0006
Marco Keller
Zeyun Zhong
Kunyu Peng
Alina Roitberg
Jürgen Beyerer
Rainer Stiefelhagen

Synthetic data generation is a proven method for augmenting training sets without the need for extensive setups, yet its application in human activity recognition is underexplored. This is particularly crucial for human-robot collaboration in household settings, where data collection is often privacy-sensitive. In this paper, we introduce SynthAct, a synthetic data generation pipeline designed to significantly minimize the reliance on real-world data. Leveraging modern 3D pose estimation techniques, SynthAct can be applied to arbitrary 2D or 3D video action recordings, making it applicable for uncontrolled in-the-field recordings by robotic agents or smarthome monitoring systems. We present two SynthAct datasets: AMARV, a large synthetic collection with over 800k multi-view action clips, and Synthetic Smarthome, mirroring the Toyota Smarthome dataset. SynthAct generates a rich set of data, including RGB videos and depth maps from four synchronized views, 3D body poses, normal maps, segmentation masks and bounding boxes. We validate the efficacy of our datasets through extensive synthetic-to-real experiments on NTU RGB+D and Toyota Smarthome. SynthAct is available on our project page 4.

Details

IROS Conference 2023 Conference Paper

Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments

Calvin Tanama
Kunyu Peng
Zdravko Marinov
Rainer Stiefelhagen
Alina Roitberg

Deep learning-based models are at the top of most driver observation benchmarks due to their remarkable accuracies but come with a high computational cost, while the resources are often limited in real-world driving scenarios. This paper presents a lightweight framework for resource- efficient driver activity recognition. We enhance 3D MobileNet, a speed-optimized neural architecture for video classification, with two paradigms for improving the trade-off between model accuracy and computational efficiency: knowledge distillation and model quantization. Knowledge distillation prevents large drops in accuracy when reducing the model size by harvesting knowledge from a large teacher model (I3D) via soft labels instead of using the original ground truth. Quantization further drastically reduces the memory and computation requirements by representing the model weights and activations using lower precision integers. Extensive experiments on a public dataset for in-vehicle monitoring during autonomous driving show that our proposed framework leads to an 3- fold reduction in model size and 1. 4-fold improvement in inference time compared to an already speed-optimized architecture. Our code is available at https://github.com/calvintanama/qd-driver-activity-reco.

Details

IROS Conference 2022 Conference Paper

TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration

Kunyu Peng
Alina Roitberg
Kailun Yang 0001
Jiaming Zhang 0001
Rainer Stiefelhagen

Traditional video-based human activity recognition has experienced remarkable progress linked to the rise of deep learning, but this effect was slower as it comes to the downstream task of driver behavior understanding. Understanding the situation inside the vehicle cabin is essential for Advanced Driving Assistant System (ADAS) as it enables identifying distraction, predicting driver's intent and leads to more convenient human-vehicle interaction. At the same time, driver observation systems face substantial obstacles as they need to capture different granularities of driver states, while the complexity of such secondary activities grows with the rising automation and increased driver freedom. Furthermore, a model is rarely deployed under conditions identical to the ones in the training set, as sensor placements and types vary from vehicle to vehicle, constituting a substantial obstacle for real-life deployment of data-driven models. In this work, we present a novel vision-based framework for recognizing secondary driver behaviours based on visual transformers and an additional augmented feature distribution calibration module. This module operates in the latent feature-space enriching and diversifying the training set at feature-level in order to improve generalization to novel data appearances, (e. g. , sensor changes) and general feature quality. Our framework consistently leads to better recognition rates, surpassing previous state-of-the-art results of the public Drive&Act benchmark on all granularity levels. Our code will be made publicly available at https://github.com/KPeng9510/TransDARC.

Details