YNIMG Journal 2025 Journal Article
Cortical oscillations predict auditory grouping in listeners with various hearing configurations
- Nour Alsabbagh
- Bob McMurray
- Timothy D Griffiths
- Joel I Berger
- Kyogu Lee
- Phillip E Gander
- Inyong Choi
Author name cluster
Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches and is not a full identity-disambiguation profile.
NeurIPS Conference 2025 Conference Paper
We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories.
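A minimal sketch of the conditional-inpainting formulation described above, assuming a hypothetical latent denoiser `eps_model`, a toy noise schedule, and an interface in which the mixture latent is re-noised and clamped at every reverse step while the stem latent is sampled. This illustrates the general latent-inpainting technique, not the paper's actual implementation.

```python
# Sketch: sampling a stem latent while clamping a known mixture latent.
# `eps_model`, its call signature, and the schedule are assumptions.
import torch

def inpaint_sample(eps_model, z_mix, text_emb, steps=50):
    """z_mix: (B, C, T) known mixture latent; returns a sampled stem latent."""
    betas = torch.linspace(1e-4, 2e-2, steps)            # toy noise schedule
    alphas = torch.cumprod(1.0 - betas, dim=0)
    z_stem = torch.randn_like(z_mix)                     # start from pure noise
    for t in reversed(range(steps)):
        a_t = alphas[t]
        # Re-noise the known mixture to the current noise level so both
        # tracks sit at the same diffusion time (standard inpainting trick).
        z_mix_t = a_t.sqrt() * z_mix + (1 - a_t).sqrt() * torch.randn_like(z_mix)
        # The denoiser sees the clamped mixture, the noisy stem, and the query.
        eps = eps_model(torch.cat([z_mix_t, z_stem], dim=1), t, text_emb)
        z0_hat = (z_stem - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_prev = alphas[t - 1] if t > 0 else torch.tensor(1.0)
        noise = torch.randn_like(z_stem) if t > 0 else torch.zeros_like(z_stem)
        z_stem = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * noise
    return z_stem
```

Source imputation follows the same pattern with the roles reversed: the given stems are clamped and the missing track is sampled.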
ICML Conference 2025 Conference Paper
Flow and diffusion models have demonstrated strong performance and training stability across various tasks but lack two critical properties of simulation-based methods: freedom of dimensionality and adaptability to different inference trajectories. To address these limitations, we propose the Multidimensional Adaptive Coefficient (MAC), a plug-in module for flow and diffusion models that extends conventional unidimensional coefficients to multidimensional ones and enables inference trajectory-wise adaptation. MAC is trained via simulation-based feedback through adversarial refinement. Empirical results across diverse frameworks and datasets demonstrate that MAC enhances generative quality with high training efficiency. Consequently, our work offers a new perspective on inference trajectory optimality, encouraging future research to move beyond vector field design and to leverage training-efficient, simulation-based optimization.
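To make the unidimensional-versus-multidimensional distinction concrete, here is a toy sketch contrasting a scalar flow-matching interpolant with per-dimension coefficients emitted by a small network. Shapes, bounds, and module names are assumptions for illustration; the simulation-based adversarial refinement used to train such coefficients is omitted.

```python
# Sketch: scalar coefficients vs. per-dimension adaptive coefficients.
import torch
import torch.nn as nn

def scalar_interpolant(x0, x1, t):
    # Conventional unidimensional coefficients: one scalar pair per time step.
    return t * x1 + (1 - t) * x0

class AdaptiveCoefficient(nn.Module):
    """Plug-in module emitting one coefficient pair per dimension."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, 2 * dim))

    def forward(self, x0, x1, t):
        # x0: noise, x1: data, both (B, dim); t: (B, 1) in [0, 1].
        a, b = self.net(t).chunk(2, dim=-1)     # per-dimension coefficients
        a, b = torch.sigmoid(a), torch.sigmoid(b)
        return a * x1 + b * x0                  # multidimensional interpolant
```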
IJCAI Conference 2025 Conference Paper
Generalization of synthesizer sound matching to external instrument sounds is highly challenging due to the non-differentiability of the sound synthesis process, which prohibits the use of out-of-domain sounds for training with a synthesis parameter loss. We propose SynthRL, a novel reinforcement learning (RL)-based approach for cross-domain synthesizer sound matching. By incorporating sound similarity into the reward function, SynthRL effectively optimizes synthesis parameters without ground-truth labels, allowing fine-tuning on out-of-domain sounds. Furthermore, we introduce a transformer-based model architecture and reward-based prioritized experience replay to enhance RL training efficiency, tailored to the unique characteristics of the task. Experimental results demonstrate that SynthRL outperforms state-of-the-art methods on both in-domain and out-of-domain tasks. Further experimental analysis validates the effectiveness of our reward design, showing a strong correlation with human perception of sound similarity.
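The reward design lends itself to a compact illustration: because the reward needs no gradients, the non-differentiable synthesizer can sit directly inside it. The sketch below assumes hypothetical `render` (synthesizer) and `embed` (audio encoder) functions; the paper's actual similarity measure may differ.

```python
# Sketch: an embedding-similarity reward for RL-based sound matching.
import numpy as np

def similarity_reward(params, target_audio, render, embed):
    """Score predicted synth parameters against a target sound, label-free."""
    pred_audio = render(params)      # non-differentiable synth call is fine here:
    e_pred = embed(pred_audio)       # the reward path never needs gradients.
    e_tgt = embed(target_audio)
    cos = np.dot(e_pred, e_tgt) / (np.linalg.norm(e_pred) * np.linalg.norm(e_tgt) + 1e-8)
    return float(cos)                # higher = perceptually closer
```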
AAAI Conference 2025 Conference Paper
Uncertainty estimation has been widely applied for trustworthy automatic speech recognition (ASR) systems across the training and inference stages. In the training stage, previous studies show that uncertainty can facilitate self-training by filtering out unlabeled data samples with high uncertainty. However, the current sequence-level uncertainty estimation method for connectionist temporal classification (CTC)-based ASR models drops the output probability information and depends only on the textual distance between decoded predictions. In this study, we argue that this results in limited performance improvement and propose a novel output-probability-based sequence-level uncertainty estimation method. We also categorize uncertainty into pseudo-label uncertainty and in-training uncertainty for the self-training process. Finally, we present uncertainty-aware self-training for CTC-based ASR models and experimentally show the effectiveness of the proposed method compared to the baselines.
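The contrast between the two uncertainty styles can be sketched in a few lines: a textual-distance measure compares decoded strings and discards the posterior, while an output-probability measure keeps it. Both functions below are illustrative stand-ins, not the paper's exact formulations.

```python
# Sketch: textual-distance vs. output-probability sequence-level uncertainty.
import torch
import editdistance  # pip install editdistance

def distance_based_uncertainty(hyps):
    """Baseline style: average pairwise edit distance over N-best decodes;
    the output probabilities are thrown away."""
    pairs = [(a, b) for i, a in enumerate(hyps) for b in hyps[i + 1:]]
    return sum(editdistance.eval(a, b) for a, b in pairs) / max(len(pairs), 1)

def prob_based_uncertainty(log_probs):
    """log_probs: (T, V) frame-wise CTC log-posteriors for one utterance.
    Mean per-frame entropy is high when mass is spread across labels."""
    probs = log_probs.exp()
    frame_entropy = -(probs * log_probs).sum(dim=-1)   # (T,)
    return frame_entropy.mean().item()
```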
NeurIPS Conference 2024 Conference Paper
While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages mechanical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.
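For readers unfamiliar with modal synthesis, the linear skeleton such a model builds on can be written as a sum of damped spatial modes. The sketch below is a purely classical, hand-parameterized version (no neural network, no nonlinearity); all constants are illustrative.

```python
# Sketch: classical modal synthesis of a (linear) string displacement field.
import numpy as np

def modal_string(f0, length=1.0, n_modes=8, damping=2.0, dur=1.0, sr=16000, n_x=64):
    t = np.arange(int(dur * sr)) / sr                   # time axis (T,)
    x = np.linspace(0.0, length, n_x)                   # positions along string (X,)
    u = np.zeros((len(t), n_x))
    for m in range(1, n_modes + 1):
        shape = np.sin(m * np.pi * x / length)          # spatial mode shape (X,)
        osc = np.exp(-damping * m * t) * np.sin(2 * np.pi * m * f0 * t)  # (T,)
        u += np.outer(osc, shape) / m                   # superpose damped modes
    return u                                            # displacement u(t, x)

u = modal_string(f0=110.0)   # u[i] is the string's spatial profile at time step i
```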
NeurIPS Conference 2021 Conference Paper
We present a neural analysis and synthesis (NANSY) framework that can manipulate the voice, pitch, and speed of an arbitrary speech signal. Most previous work has focused on using an information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting the synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with the speech data, such as text or speaker information, but instead uses a new set of analysis features, i.e., the wav2vec feature and a newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can easily be extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY achieves significant performance improvements in several applications such as zero-shot voice conversion, pitch shifting, and time-scale modification.
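A rough sketch of the information-perturbation idea: corrupt pitch, formant-like, and channel cues in the input so the synthesis network is forced to recover those attributes from the analysis features instead. The specific perturbations below (librosa-based) are stand-ins; the paper's exact perturbation functions may differ.

```python
# Sketch: input perturbations for information-perturbation training.
import numpy as np
import librosa

def perturb(wav, sr=16000, rng=None):
    rng = rng or np.random.default_rng()
    # 1) Random pitch shift destroys absolute-pitch information.
    wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=rng.uniform(-4, 4))
    # 2) Crude formant-like warp: resample to a random rate and back.
    rate = rng.uniform(0.9, 1.1)
    warped = librosa.resample(wav, orig_sr=sr, target_sr=int(sr * rate))
    back = librosa.resample(warped, orig_sr=int(sr * rate), target_sr=sr)
    wav = np.pad(back, (0, max(0, len(wav) - len(back))))[: len(wav)]
    # 3) Random frequency response (toy FIR filter) perturbs channel cues.
    b = rng.uniform(0.5, 1.5, size=3)
    return np.convolve(wav, b / b.sum(), mode="same")
```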
ICLR Conference 2020 Conference Paper
This work explores the possibility of generating a human face from voice alone, based on audio-visual data without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links an inference stage and a generation stage. First, the inference networks are trained to match speaker identity between the two modalities. Then the pre-trained inference networks cooperate with the generation network by providing conditional information about the voice. The proposed method exploits recent developments in GAN techniques and generates the human face directly from the speech waveform, making our system fully end-to-end. We analyze the extent to which the network can naturally disentangle the two latent factors that contribute to the generation of a face image (one that comes directly from the speech signal and one that is unrelated to it) and explore whether the network can learn to generate the natural distribution of human face images by modeling these factors. Experimental results show that the proposed network can not only match the relationship between human faces and speech but can also generate high-quality face samples conditioned on speech. Finally, the correlation between the generated face and the corresponding speech is quantitatively measured to analyze the relationship between the two modalities.
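The two-stage design (a pretrained cross-modal inference network feeding a conditional GAN) can be sketched as follows. Every shape and module here is an illustrative assumption; the speech waveform encoder and the image discriminator are omitted.

```python
# Sketch: a GAN generator conditioned on a speaker embedding from voice.
import torch
import torch.nn as nn

class VoiceConditionedGenerator(nn.Module):
    def __init__(self, voice_dim=256, noise_dim=128):
        super().__init__()
        self.fc = nn.Linear(voice_dim + noise_dim, 512 * 4 * 4)
        self.up = nn.Sequential(                        # 4x4 -> 32x32 RGB
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, voice_emb, z):
        # voice_emb: identity information from the pretrained inference network;
        # z: noise modeling the factors unrelated to the speech signal.
        h = self.fc(torch.cat([voice_emb, z], dim=-1)).view(-1, 512, 4, 4)
        return self.up(h)

face = VoiceConditionedGenerator()(torch.randn(1, 256), torch.randn(1, 128))  # (1, 3, 32, 32)
```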