Arrow Research search

Author name cluster

Mark D. Plumbley

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers (4)

AAAI Conference 2024 · Conference Paper

Learning Temporal Resolution in Spectrogram for Audio Classification

  • Haohe Liu
  • Xubo Liu
  • Qiuqiang Kong
  • Wenwu Wang
  • Mark D. Plumbley

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of its key attributes is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume a constant hop size (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound, and it affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous fixed-resolution methods, the DiffRes-based method achieves equivalent or better classification accuracy with at least a 25% reduction in computational cost. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of the input acoustic features, without adding to the computational cost.
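The frame-reduction idea can be illustrated with a deliberately simplified toy. This is not the paper's DiffRes module: here frame "importance" is just spectral energy and the selection is a hard top-k, whereas DiffRes learns a differentiable merge. All names and parameters below are illustrative assumptions.

```python
import numpy as np

def reduce_frames(spec, keep_ratio=0.75):
    """Toy stand-in for DiffRes-style frame reduction: rank frames by
    energy (a crude importance proxy) and keep the top `keep_ratio`
    fraction, preserving their original temporal order."""
    n_frames = spec.shape[0]
    importance = spec.sum(axis=1)                 # per-frame energy
    k = max(1, int(round(n_frames * keep_ratio)))
    keep = np.sort(np.argsort(importance)[-k:])   # top-k frames, in time order
    return spec[keep]

rng = np.random.default_rng(0)
spec = rng.random((100, 64))                 # 100 frames x 64 mel bins
out = reduce_frames(spec, keep_ratio=0.75)
print(out.shape)  # (75, 64): a 25% reduction in frames fed to the classifier
```

The 25% figure mirrors the cost reduction reported in the abstract; the real module makes this selection trainable end-to-end with the classifier.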

ICML Conference 2023 · Conference Paper

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

  • Haohe Liu
  • Zehua Chen 0005
  • Yi Yuan
  • Xinhao Mei
  • Xubo Liu 0001
  • Danilo P. Mandic
  • Wenwu Wang 0001
  • Mark D. Plumbley

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn continuous audio representations from contrastive language-audio pretraining (CLAP) embeddings. The pretrained CLAP models enable us to train LDMs with audio embeddings while providing text embeddings as the condition during sampling. By learning the latent representations of audio signals without modelling the cross-modal relationship, AudioLDM improves both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance compared to other open-sourced systems, measured by both objective and subjective metrics. AudioLDM is also the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.
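Text conditioning during sampling is commonly implemented with classifier-free guidance in latent diffusion systems; a minimal sketch, assuming AudioLDM follows this standard recipe (the guidance scale here is illustrative, not the paper's setting):

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, scale=3.0):
    """Classifier-free guidance: blend the unconditional and the
    text-conditioned noise predictions at each denoising step.
    scale > 1 pushes the sample toward the text condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy noise predictions for one latent of dimension 3.
eps_u = np.zeros(3)
eps_c = np.array([1.0, -2.0, 0.5])
blended = guided_eps(eps_u, eps_c, scale=2.0)  # 2 * eps_c here, since eps_u = 0
```

In a real sampler, `eps_cond` would come from the LDM conditioned on the CLAP text embedding and `eps_uncond` from the same model with the condition dropped.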

IJCAI Conference 2019 · Conference Paper

Single-Channel Signal Separation and Deconvolution with Generative Adversarial Networks

  • Qiuqiang Kong
  • Yong Xu
  • Philip J. B. Jackson
  • Wenwu Wang
  • Mark D. Plumbley

Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture. It is a challenging problem in which no prior knowledge of the mixing filters is available: both the individual sources and the mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise which is unseen in the training set. We propose a synthesizing-decomposition (S-D) approach to solve the single-channel separation and deconvolution problem. In synthesizing, a generative model for the sources is built using a generative adversarial network (GAN). In decomposition, both the mixing filters and the sources are optimized to minimize the reconstruction error of the mixture. The proposed S-D approach achieves a peak signal-to-noise ratio (PSNR) of 18.9 dB and 15.4 dB in image inpainting and completion, outperforming a baseline convolutional neural network (15.3 dB and 12.2 dB, respectively), and achieves a PSNR of 13.2 dB in source separation together with deconvolution, outperforming a convolutive non-negative matrix factorization (NMF) baseline of 10.1 dB.
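The decomposition step can be sketched with a heavily simplified toy: a known linear map in place of the GAN generator, a scalar gain in place of the mixing filter, and alternating least squares in place of gradient-based optimization. Everything here is an illustrative assumption, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 4))   # toy linear "generator": G(z) = W @ z
z_true = rng.standard_normal(4)
x = 0.7 * (W @ z_true)             # observed mixture; gain 0.7 plays the "filter"

# Decomposition: alternately fit the source code z and the gain h
# to minimise the reconstruction error ||x - h * G(z)||^2.
z, h = np.zeros(4), 1.0
for _ in range(5):
    z = np.linalg.lstsq(h * W, x, rcond=None)[0]  # source step (closed form)
    g = W @ z
    h = float(x @ g) / float(g @ g)               # filter step (closed form)

residual = float(np.sum((h * (W @ z) - x) ** 2))
```

Note the scale ambiguity: only the product `h * G(z)` is identifiable, which is why the real approach constrains sources through the generative model rather than solving them freely.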

JMLR Journal 2013 · Journal Article

Segregating Event Streams and Noise with a Markov Renewal Process Model

  • Dan Stowell
  • Mark D. Plumbley

We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available.
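A crude intuition for why signal streams and clutter are separable (a toy statistic for illustration only; the paper's actual inference tracks Markov renewal processes, it does not use this test): a rhythmic event stream has low variability in its inter-event gaps, while uniform clutter does not.

```python
import numpy as np

def gap_cv(times):
    """Coefficient of variation of inter-event gaps: near 0 for a
    rhythmic stream, near 1 for Poisson-like uniform clutter."""
    gaps = np.diff(np.sort(np.asarray(times, dtype=float)))
    return float(gaps.std() / gaps.mean())

stream = np.arange(0.0, 10.0, 0.5)          # rhythmic "song" stream, gap = 0.5 s
rng = np.random.default_rng(2)
clutter = rng.uniform(0.0, 10.0, size=20)   # independent uniform clutter

print(gap_cv(stream) < gap_cv(clutter))  # True
```

The paper's model generalises this intuition: each source stream has its own renewal-process gap distribution, and inference assigns each observation to a stream or to the clutter process.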