Arrow Research search

Author name cluster

Gus Xia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

ICLR 2025 Conference Paper

MuPT: A Generative Symbolic Music Pretrained Transformer

  • Xingwei Qu
  • Yuelin Bai
  • Yinghao Ma
  • Ziya Zhou
  • Ka Man Lo
  • Jiaheng Liu
  • Ruibin Yuan
  • Lejun Min

In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a **S**ynchronized **M**ulti-**T**rack ABC Notation (**SMT-ABC Notation**), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the **S**ymbolic **M**usic **S**caling Law (**SMS Law**) on model performance. The results indicate a promising research direction in music generation, offering extensive resources for further research through our open-source contributions.
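
The synchronized notation can be pictured as emitting the same measure of every track before moving on to the next measure. Below is a minimal, hypothetical Python sketch of that bar-level interleaving; the voice tags and separators are illustrative assumptions, not the paper's actual SMT-ABC specification.

```python
# Hypothetical illustration only: interleave per-bar ABC fragments from several
# tracks so that bar i of every track appears together before bar i+1 of any
# track. The "[V:n]" voice tags and joining format are assumptions.

def smt_abc_interleave(tracks):
    """tracks[t][i] is the ABC text of bar i in track t."""
    n_bars = max(len(track) for track in tracks)
    lines = []
    for i in range(n_bars):
        # Emit bar i of every track before moving on, keeping measures aligned.
        group = [f"[V:{t + 1}] {track[i]}" for t, track in enumerate(tracks) if i < len(track)]
        lines.append(" ".join(group) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    melody = ["C2 E2 G2 c2", "B2 G2 E2 C2"]
    bass = ["C,4 G,4", "C,8"]
    print(smt_abc_interleave([melody, bass]))
```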

NeurIPS 2025 Conference Paper

Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

  • Longshen Ou
  • Jingwei Zhao
  • Ziyu Wang
  • Gus Xia
  • Qihao Liang
  • Torin Hopkins
  • Ye Wang

We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. At its core is a segment-level reconstruction objective operating on token-level disentangled content and style, allowing for flexible any-to-any instrumentation transformations at inference time. To support track-wise modeling, we introduce REMI-z, a structured tokenization scheme for multitrack symbolic music that enhances modeling efficiency and effectiveness for both arrangement tasks and unconditional generation. Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios---band arrangement, piano reduction, and drum arrangement, in both objective metrics and perceptual evaluations. Taken together, our framework demonstrates strong generality and suggests broader applicability in symbolic music-to-music transformation.
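
As a rough illustration of what a track-aware, bar-structured tokenization can look like, here is a hypothetical Python sketch; the token names, note attributes, and ordering are assumptions for illustration, not the actual REMI-z vocabulary.

```python
# Hypothetical sketch of a bar-and-track structured tokenization: each bar
# groups its notes by track, and each note becomes (position, pitch, duration)
# tokens. Token names are illustrative only.

def tokenize_bar(notes_by_track):
    """notes_by_track maps a MIDI program number to a list of (pos, pitch, dur) notes."""
    tokens = ["<bar>"]
    for program in sorted(notes_by_track):
        tokens.append(f"<track_{program}>")
        for pos, pitch, dur in sorted(notes_by_track[program]):
            tokens += [f"<pos_{pos}>", f"<pitch_{pitch}>", f"<dur_{dur}>"]
    return tokens

if __name__ == "__main__":
    bar = {0: [(0, 60, 4), (4, 64, 4)], 32: [(0, 36, 8)]}  # piano and acoustic bass
    print(tokenize_bar(bar))
```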

ICLR 2025 Conference Paper

Unsupervised Disentanglement of Content and Style via Variance-Invariance Constraints

  • Yuxuan Wu
  • Ziyu Wang 0008
  • Bhiksha Raj
  • Gus Xia

We contribute an unsupervised method that effectively learns disentangled content and style representations from sequences of observations. Unlike most disentanglement algorithms that rely on domain-specific labels or knowledge, our method is based on the insight of domain-general statistical differences between content and style --- content varies more among different fragments within a sample but maintains an invariant vocabulary across data samples, whereas style remains relatively invariant within a sample but exhibits more significant variation across different samples. We integrate such inductive bias into an encoder-decoder architecture and name our method V3 (variance-versus-invariance). Experimental results show that V3 generalizes across multiple domains and modalities, successfully learning disentangled content and style representations, such as pitch and timbre from music audio, digit and color from images of hand-written digits, and action and character appearance from simple animations. V3 demonstrates strong disentanglement performance compared to existing unsupervised methods, along with superior out-of-distribution generalization and few-shot learning capabilities compared to supervised counterparts. Lastly, symbolic-level interpretability emerges in the learned content codebook, forging a near one-to-one alignment between machine representation and human knowledge.
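
The variance-versus-invariance statistics described above can be summarized in a few lines of numpy; the sketch below is not the authors' implementation, and the shapes, the choice of statistics, and the ratio form of the regularizer are assumptions for illustration.

```python
# A numpy sketch of the variance-versus-invariance intuition: content codes
# should vary across fragments within a sample, style codes should vary across
# samples. Shapes and the regularizer form are illustrative assumptions.
import numpy as np

def v3_regularizer(content, style, eps=1e-6):
    """content, style: arrays of shape (n_samples, n_fragments, dim)."""
    content_within = content.std(axis=1).mean()        # want high: content changes fragment to fragment
    style_within = style.std(axis=1).mean()            # want low: style is stable within a sample
    style_across = style.mean(axis=1).std(axis=0).mean()      # want high: style differs between samples
    content_across = content.mean(axis=1).std(axis=0).mean()  # want low: content vocabulary is shared
    return (style_within + content_across) / (content_within + style_across + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(8, 16, 4))   # e.g. pitch-like codes per fragment
    style = rng.normal(size=(8, 16, 4))     # e.g. timbre-like codes per fragment
    print(v3_regularizer(content, style))
```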

IJCAI 2024 Conference Paper

Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls

  • Liwei Lin
  • Gus Xia
  • Yixiao Zhang
  • Junyan Jiang

Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To bridge this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive language models to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source code and a demo page showcasing our work are available at https://kikyo-16.github.io/AIR.
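
To make the masking training scheme concrete, here is a hypothetical span-masking sketch for a token sequence; the mask token id, mask ratio, span length, and the -100 ignore index are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical sketch of span masking for inpainting-style training.
import random

MASK_ID = 0  # assumed id of a special mask token

def mask_spans(tokens, mask_ratio=0.3, span_len=8):
    """Replace random spans with MASK_ID; return (corrupted input, targets)."""
    corrupted = list(tokens)
    targets = [-100] * len(tokens)   # -100 marks positions excluded from the loss
    n_to_mask = int(len(tokens) * mask_ratio)
    masked = 0
    while masked < n_to_mask:
        start = random.randrange(0, max(1, len(tokens) - span_len + 1))
        for i in range(start, min(start + span_len, len(tokens))):
            if corrupted[i] != MASK_ID:
                targets[i] = tokens[i]
                corrupted[i] = MASK_ID
                masked += 1
    return corrupted, targets

if __name__ == "__main__":
    random.seed(0)
    tokens = list(range(1, 65))          # a toy 64-token music sequence (ids > 0)
    corrupted, targets = mask_spans(tokens)
    print(sum(t == MASK_ID for t in corrupted), "of", len(tokens), "tokens masked")
```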

ICLR 2024 Conference Paper

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

  • Yizhi Li
  • Ruibin Yuan
  • Ge Zhang 0009
  • Yinghao Ma
  • Xingran Chen
  • Hanzhi Yin
  • Chenghao Xiao
  • Chenghua Lin 0002

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic **M**usic und**ER**standing model with large-scale self-supervised **T**raining (**MERT**), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantization - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
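
The two-teacher objective can be pictured as predicting both an acoustic target and a musical target at masked frames. The numpy sketch below is schematic (not the released MERT code): the loss weighting, shapes, and the exact form of the targets are assumptions for illustration.

```python
# Schematic sketch: combine per-frame discrete codes from an acoustic
# (RVQ-VAE-style) teacher with a CQT-style spectral target from a musical
# teacher at masked frames.
import numpy as np

def cross_entropy(logits, targets):
    """logits: (n, n_classes); targets: (n,) integer codes."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def combined_teacher_loss(pred_codes, pred_cqt, teacher_codes, teacher_cqt, mask, w_musical=1.0):
    """Masked-frame objective against both teachers (weight w_musical is assumed)."""
    loss_acoustic = cross_entropy(pred_codes[mask], teacher_codes[mask])
    loss_musical = ((pred_cqt[mask] - teacher_cqt[mask]) ** 2).mean()
    return loss_acoustic + w_musical * loss_musical

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames, n_codes, n_bins = 100, 1024, 84
    mask = rng.random(n_frames) < 0.5                   # which frames are masked
    pred_codes = rng.normal(size=(n_frames, n_codes))   # student logits over acoustic codes
    pred_cqt = rng.normal(size=(n_frames, n_bins))      # student CQT prediction
    teacher_codes = rng.integers(0, n_codes, size=n_frames)
    teacher_cqt = rng.normal(size=(n_frames, n_bins))
    print(combined_teacher_loss(pred_codes, pred_cqt, teacher_codes, teacher_cqt, mask))
```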

IJCAI 2024 Conference Paper

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

  • Yixiao Zhang
  • Yukara Ikemiya
  • Gus Xia
  • Naoki Murata
  • Marco A. Martínez-Ramírez
  • Wei-Hsiang Liao
  • Yuki Mitsufuji
  • Simon Dixon

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, editing the generated music remains a significant challenge. This paper introduces a novel approach to edit music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while maintaining other aspects unchanged. Our method transforms text-based editing into latent-space manipulation and adds an additional constraint to enforce consistency. It seamlessly integrates with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. We also show the practical applicability of our approach in real-world music editing scenarios.
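
One way to picture "editing as latent manipulation with a consistency constraint" is to run the sampler with the source and target prompts and softly pull the edited trajectory toward the original one so that unrelated attributes stay unchanged. The sketch below is hypothetical: the denoiser, conditioning, and blending weight are stand-ins, not the MusicMagus implementation.

```python
# Hypothetical sketch: prompt-swapped sampling with a soft consistency pull.
import numpy as np

def edit_with_consistency(denoise_step, z_T, cond_src, cond_tgt, n_steps, lam=0.2):
    z_src, z_tgt = z_T.copy(), z_T.copy()
    for t in reversed(range(n_steps)):
        z_src = denoise_step(z_src, cond_src, t)
        z_tgt = denoise_step(z_tgt, cond_tgt, t)
        z_tgt = (1 - lam) * z_tgt + lam * z_src   # stay close to the source trajectory
    return z_tgt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_step = lambda z, cond, t: 0.9 * z + 0.1 * cond   # stand-in for a real denoiser
    z_T = rng.normal(size=16)
    cond_src, cond_tgt = rng.normal(size=16), rng.normal(size=16)
    print(edit_with_consistency(toy_step, z_T, cond_src, cond_tgt, n_steps=50).shape)
```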

NeurIPS 2024 Conference Paper

Structured Multi-Track Accompaniment Arrangement via Style Prior Modelling

  • Jingwei Zhao
  • Gus Xia
  • Ziyu Wang
  • Ye Wang

In the realm of music AI, arranging rich and structured multi-track accompaniments from a simple lead sheet presents significant challenges. Such challenges include maintaining track cohesion, ensuring long-term coherence, and optimizing computational efficiency. In this paper, we introduce a novel system that leverages prior modelling over disentangled style factors to address these challenges. Our method presents a two-stage process: initially, a piano arrangement is derived from the lead sheet by retrieving piano texture styles; subsequently, a multi-track orchestration is generated by infusing orchestral function styles into the piano arrangement. Our key design is the use of vector quantization and a unique multi-stream Transformer to model the long-term flow of the orchestration style, which enables flexible, controllable, and structured music generation. Experiments show that by factorizing the arrangement task into interpretable sub-stages, our approach enhances generative capacity while improving efficiency. Additionally, our system supports a variety of music genres and provides style control at different composition hierarchies. We further show that our system achieves superior coherence, structure, and overall arrangement quality compared to existing baselines.
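
The two-stage structure can be read as a simple composition of learned components. The sketch below only traces the data flow described in the abstract; every callable name is a placeholder, not the authors' API.

```python
# High-level, hypothetical sketch of the two-stage arrangement pipeline.

def arrange_from_lead_sheet(lead_sheet, retrieve_texture, piano_arranger,
                            style_prior, orchestrator):
    # Stage 1: lead sheet -> piano arrangement via retrieved piano-texture styles.
    piano = piano_arranger(lead_sheet, retrieve_texture(lead_sheet))
    # Stage 2: sample a long-term sequence of orchestral function styles from the
    # prior, then infuse them into the piano arrangement to get the multi-track result.
    return orchestrator(piano, style_prior(piano))

if __name__ == "__main__":
    # Toy stand-ins that just trace the data flow.
    out = arrange_from_lead_sheet(
        "lead sheet",
        retrieve_texture=lambda ls: "texture codes",
        piano_arranger=lambda ls, tx: f"piano({ls}, {tx})",
        style_prior=lambda piano: "per-bar orchestral function codes",
        orchestrator=lambda piano, styles: f"multitrack({piano}, {styles})",
    )
    print(out)
```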

ICLR 2024 Conference Paper

Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

  • Ziyu Wang 0008
  • Lejun Min
  • Gus Xia

Recent deep music generation studies have put much emphasis on long-term generation with structures. However, we are yet to see high-quality, well-structured **whole-song** generation. In this paper, we make the first attempt to model a full music piece under the realization of *compositional hierarchy*. With a focus on symbolic representations of pop songs, we define a hierarchical language, in which each level of hierarchy focuses on the semantics and context dependency at a certain music scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is *controllable* in a flexible way. By sampling from the interpretable hierarchical languages or adjusting pre-trained external representations, users can control the music flow via various features such as phrase harmonic structures, rhythmic patterns, and accompaniment texture.
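
The cascade can be sketched as a chain of per-level generators, each conditioned on everything produced above it. The Python below is a hypothetical illustration of that control flow only; the level names and toy generators are stand-ins, not the cascaded diffusion models themselves.

```python
# Hypothetical sketch of top-down cascaded generation over a hierarchy.

def generate_whole_song(level_models):
    """level_models: callables ordered high-level -> low-level; each receives
    the list of already-generated upper levels as conditioning."""
    generated = []
    for model in level_models:
        generated.append(model(generated))   # condition on all upper levels
    return generated[-1]                     # lowest level: notes and accompaniment

if __name__ == "__main__":
    # Toy stand-ins for the per-level models.
    form = lambda upper: ["verse", "chorus", "verse", "chorus"]
    harmony = lambda upper: [f"{section}:I-V-vi-IV" for section in upper[0]]
    notes = lambda upper: [f"notes for {chords}" for chords in upper[1]]
    print(generate_whole_song([form, harmony, notes]))
```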

NeurIPS 2023 Conference Paper

Learning Interpretable Low-dimensional Representation via Physical Symmetry

  • Xuanjie Liu
  • Daniel Chin
  • Yichen Huang
  • Gus Xia

We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge. It remains an open question what general computational principles give rise to interpretable representations, especially low-dimensional factors that agree with human perception. In this study, we take inspiration from modern physics and use physical symmetry as a self-consistency constraint for the latent space. Specifically, it requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to certain group transformations. We show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space from videos of a simple moving object without labels. Furthermore, physical symmetry naturally leads to counterfactual representation augmentation, a new technique which improves sample efficiency.
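
The equivariance constraint says that the latent dynamics model f should commute with a group transformation g, i.e. f(g(z)) ≈ g(f(z)). Below is a minimal numpy sketch of such a penalty; the dynamics, the transformation (a simple translation), and the shapes are illustrative assumptions, not the paper's model.

```python
# Numpy sketch of an equivariance penalty on latent dynamics.
import numpy as np

def equivariance_penalty(f, g, z_batch):
    """Mean squared difference between f(g(z)) and g(f(z))."""
    return np.mean((f(g(z_batch)) - g(f(z_batch))) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # toy linear latent dynamics
    f = lambda z: z @ A.T
    g = lambda z, shift=1.0: z + shift              # translation group element
    z = rng.normal(size=(32, 4))
    # Nonzero here, since this arbitrary f does not commute with g; training
    # would minimize this penalty to enforce the symmetry.
    print(equivariance_penalty(f, g, z))
```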

NeurIPS 2023 Conference Paper

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

  • Ruibin Yuan
  • Yinghao Ma
  • Yizhi Li
  • Ge Zhang
  • Xingran Chen
  • Hanzhi Yin
  • Le Zhuo
  • Yiqi Liu

In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 18 tasks on 12 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published to promote future music AI research.
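
A standardized protocol of this kind typically freezes the pre-trained model, extracts one embedding per clip, and fits a lightweight downstream classifier per task. The sketch below is a hypothetical illustration of that probing pattern with random stand-in data; it is not MARBLE's actual evaluation suite.

```python
# Hypothetical linear-probe sketch over frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_task(train_embs, train_labels, test_embs, test_labels):
    """Fit a linear classifier on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embs, train_labels)
    return accuracy_score(test_labels, clf.predict(test_embs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(200, 64))        # stand-in for frozen model embeddings
    labels = rng.integers(0, 4, size=200)    # stand-in for, e.g., genre labels
    print(probe_task(embs[:150], labels[:150], embs[150:], labels[150:]))
```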

IJCAI 2023 Conference Paper

Q&A: Query-Based Representation Learning for Multi-Track Symbolic Music re-Arrangement

  • Jingwei Zhao
  • Gus Xia
  • Ye Wang

Music rearrangement is a common musical practice of reconstructing and reconceptualizing a piece using new composition or instrumentation styles, and it is also an important task in automatic music generation. Existing studies typically model the mapping from a source piece to a target piece via supervised learning. In this paper, we tackle rearrangement problems via self-supervised learning, in which the mapping styles can be regarded as conditions and controlled in a flexible way. Specifically, we are inspired by the representation disentanglement idea and propose Q&A, a query-based algorithm for multi-track music rearrangement under an encoder-decoder framework. Q&A learns both a content representation from the mixture and function (style) representations from each individual track, while the latter queries the former in order to rearrange a new piece. Our current model focuses on popular music and provides a controllable pathway to four scenarios: 1) re-instrumentation, 2) piano cover generation, 3) orchestration, and 4) voice separation. Experiments show that our query system achieves high-quality rearrangement results with delicate multi-track structures, significantly outperforming the baselines.
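
The query mechanism can be pictured as each track's function (style) vector attending over content representations extracted from the mixture to produce a track-specific representation to decode from. The numpy sketch below illustrates that pattern only; the dimensions and the single-head dot-product attention are assumptions, not the Q&A architecture.

```python
# Hypothetical sketch: per-track function vectors query mixture content.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_content(function_vec, content_seq):
    """function_vec: (d,); content_seq: (T, d) content tokens from the mixture."""
    attn = softmax(content_seq @ function_vec / np.sqrt(len(function_vec)))
    return attn @ content_seq   # (d,) track-specific summary to decode from

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(32, 16))          # content tokens from the mixture
    track_functions = rng.normal(size=(4, 16))   # one function (style) vector per target track
    per_track = np.stack([query_content(q, content) for q in track_functions])
    print(per_track.shape)                       # (4, 16): one representation per rearranged track
```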