Author name cluster

Chenyu Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers

2 author rows

IROS Conference 2025 Conference Paper

Beyond Anthropomorphism: Enhancing Grasping and Eliminating a Degree of Freedom by Fusing the Abduction of Digits Four and Five

Simon Fritsch
Liam Achenbach
Riccardo Bianco
Nicola Irmiger
Gawain Marti
Samuel Visca
Chenyu Yang
Davide Liconti

This paper presents the SABD hand, a 16-degree-of-freedom (DoF) robotic hand that departs from purely anthropomorphic designs to achieve an expanded grasp envelope, enable manipulation poses beyond human capability, and reduce the required number of actuators. This is achieved by combining the adduction/abduction (Add/Abd) joint of digits four and five into a single joint with a large range of motion. The combined joint increases the workspace of the digits by 400% and reduces the required DoFs while retaining dexterity. Experimental results demonstrate that the combined Add/Abd joint enables the hand to grasp objects with a side distance of up to 200 mm. Reinforcement learning-based investigations show that the design enables grasping policies that are effective not only for handling larger objects but also for achieving enhanced grasp stability. In teleoperated trials, the hand successfully performed 86% of attempted grasps on suitable YCB objects, including challenging non-anthropomorphic configurations. These findings validate the design’s ability to enhance grasp stability, flexibility, and dexterous manipulation without added complexity, making it well-suited for a wide range of applications.

Details

NeurIPS Conference 2025 Conference Paper

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Shun Lei
Yaoxun Xu
Huaicheng Zhang
Wei Tan
Hangting Chen
Yixuan Zhang
Chenyu Yang
Haina Zhu

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language model based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https: //levo-demo. github. io and https: //github. com/tencent-ailab/songgeneration.

PDF Details

IROS Conference 2025 Conference Paper

ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning

Clemens C. Christoph
Maximilian Eberlein
Filippos Katsimalis
Arturo Roberti
Aristotelis Sympetheros
Michel R. Vogt
Davide Liconti
Chenyu Yang

General-purpose robots should possess humanlike dexterity and agility to perform tasks with the same versatility as us. A human-like form factor further enables the use of vast datasets of human-hand interactions. However, the primary bottleneck in dexterous manipulation lies not only in software but arguably even more in hardware. Robotic hands that approach human capabilities are often prohibitively expensive, bulky, or require enterprise-level maintenance, limiting their accessibility for broader research and practical applications. What if the research community could get started with reliable dexterous hands within a day? We present the open-source ORCA hand, a reliable and anthropomorphic 17-DoF tendon-driven robotic hand with integrated tactile sensors, fully assembled in less than eight hours and built for a material cost below 2, 000 CHF. We showcase ORCA’s key design features such as popping joints, auto-calibration, and tensioning systems that significantly reduce complexity while increasing reliability, accuracy, and robustness. We benchmark the ORCA hand across a variety of tasks, ranging from teleoperation and imitation learning to zero-shot sim-to-real reinforcement learning. Furthermore, we demonstrate its durability, withstanding more than 10, 000 continuous operation cycles—equivalent to approximately 20 hours—without hardware failure, the only constraint being the duration of the experiment itself. Video is here: youtu.be/kUbPSYMmOds.Design files, source code, and documentation are available at srl. ethz. ch/orcahand.

Details

NeurIPS Conference 2025 Conference Paper

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

Chenyu Yang
Shuai Wang
Hangting Chen
Wei Tan
Jianwei Yu
Haizhou Li

Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces SongBloom, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https: //cypress-yang. github. io/SongBloom_demo.

PDF Details

AAAI Conference 2025 Conference Paper

SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor

Chenyu Yang
Shuai Wang
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Yaoxun Xu
Yizhi Zhou

The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models are capable of synthesizing both vocals and accompaniment tracks up to several minutes long concurrently, research about partial adjustments or editing of existing songs is still underexplored, which allows for more flexible and effective production. In this paper, we present SongEditor, the first song editing paradigm that introduces the editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as synthesizing songs from scratch. The core components of SongEditor include a music tokenizer, an autoregressive language model, and a diffusion generator, enabling generating an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that the proposed SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

CRAG - Comprehensive RAG Benchmark

Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Nikita Bhalla
Xiangsen Chen
Sajal Choudhary
Rongze D. Gui

Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)’s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4, 409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve $\le 34\%$ accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracted thousands of participants and submissions. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions. CRAG is available at https: //github. com/facebookresearch/CRAG/.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
Xuan Dong
Wenhai Wang
Lewei Lu

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e. g. , LAION), but can also leverage interleaved pre-training data (e. g. , MMC4) to learn robust visual representations from scratch, showcasing the potential of vision model pre-training with interleaved image-text data.

PDF Details DOI

IROS Conference 2020 Conference Paper

Dynamic Legged Manipulation of a Ball Through Multi-Contact Optimization

Chenyu Yang
Bike Zhang
Jun Zeng 0002
Ayush Agrawal
Koushil Sreenath

The feet of robots are typically used to design locomotion strategies, such as balancing, walking, and running. However, they also have great potential to perform manipulation tasks. In this paper, we propose a model predictive control (MPC) framework for a quadrupedal robot to dynamically balance on a ball and simultaneously manipulate it to follow various trajectories such as straight lines, sinusoids, circles and in-place turning. We numerically validate our controller on the Mini Cheetah robot using different gaits including trotting, bounding, and pronking on the ball.

Details