Arrow Research search

Author name cluster

Yingru Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

  • Xueyao Zhang
  • Xiaohui Zhang
  • Kainan Peng
  • Zhenyu Tang
  • Vimal Manohar
  • Yingru Liu
  • Jeff Hwang
  • Dangna Li

The imitation of voice, targeted at specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre and style effectively, which makes controllable generation difficult, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or the content tokens of speech as input, we use an autoregressive transformer, prompted by a style reference, to generate the content-style tokens; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer, prompted by a timbre reference, to produce acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Trained solely in a self-supervised manner on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo’s effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at https://versavoice.github.io/.
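
As a rough illustration of the codebook-size bottleneck mentioned in the abstract, the following minimal sketch assigns feature vectors to their nearest codebook entry. It is not the Vevo implementation: the function `quantize`, the toy 2-D frames, and both codebooks are hypothetical, and a real VQ-VAE learns its codebook rather than using fixed entries. The only point shown is that a smaller codebook forces distinct frames onto the same token, while a larger one preserves finer detail.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each frame, the index of its nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical toy "hidden features" (stand-ins for continuous encoder outputs).
frames = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [1.0, 0.9]]

# Assumed fixed codebooks of two different vocabulary sizes.
small_codebook = [[0.0, 0.0], [1.0, 1.0]]
large_codebook = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.1], [1.0, 0.9]]

tokens_small = quantize(frames, small_codebook)
tokens_large = quantize(frames, large_codebook)
print(tokens_small, tokens_large)
```

With the two-entry codebook, frames that differ only slightly collapse to the same token; with four entries, each frame keeps its own token.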

AAAI Conference 2020 Conference Paper

Adaptive Activation Network and Functional Regularization for Efficient and Flexible Deep Multi-Task Learning

  • Yingru Liu
  • Xuewen Yang
  • Dongliang Xie
  • Xin Wang
  • Li Shen
  • Haozhi Huang
  • Niranjan Balasubramanian

Multi-task learning (MTL) is a common paradigm that seeks to improve the generalization performance of task learning by training related tasks simultaneously. However, it is still a challenging problem to search for a flexible and accurate architecture that can be shared among multiple tasks. In this paper, we propose a novel deep learning model called Task Adaptive Activation Network (TAAN) that can automatically learn the optimal network architecture for MTL. The main principle of TAAN is to derive flexible activation functions for different tasks from the data, with the other parameters of the network fully shared. We further propose two functional regularization methods that improve the MTL performance of TAAN. The improved performance of both TAAN and the regularization methods is demonstrated by comprehensive experiments.
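
The idea of task-specific activation functions over fully shared weights can be sketched as follows. This is an illustrative assumption, not the TAAN formulation from the paper: here each task's activation is a weighted sum of a small, hand-picked set of basis functions (`BASIS`), and the names `adaptive_activation`, `forward`, and `task_coeffs` are hypothetical. In a trained model the coefficients would be learned from data rather than fixed.

```python
import math
from typing import Callable, Dict, List, Sequence

# Assumed basis functions; the choice of ReLU, tanh, and identity is illustrative.
BASIS: List[Callable[[float], float]] = [
    lambda z: max(z, 0.0),  # ReLU
    math.tanh,              # tanh
    lambda z: z,            # identity
]


def adaptive_activation(z: float, coeffs: Sequence[float]) -> float:
    """Task-specific activation: weighted sum of the shared basis functions."""
    return sum(c * basis(z) for c, basis in zip(coeffs, BASIS))


def forward(x: Sequence[float], shared_w: Sequence[float], shared_b: float,
            coeffs: Sequence[float]) -> float:
    """One neuron: weights and bias are shared, only the activation varies by task."""
    z = sum(w * xi for w, xi in zip(shared_w, x)) + shared_b
    return adaptive_activation(z, coeffs)


# Shared parameters used by every task.
shared_w = [0.5, -1.0]
shared_b = 0.0

# Hypothetical per-task coefficients (fixed here; learned in practice).
task_coeffs: Dict[str, List[float]] = {
    "task_a": [1.0, 0.0, 0.0],  # behaves like ReLU
    "task_b": [0.0, 0.0, 1.0],  # behaves like the identity
}

x = [1.0, 2.0]
out_a = forward(x, shared_w, shared_b, task_coeffs["task_a"])
out_b = forward(x, shared_w, shared_b, task_coeffs["task_b"])
print(out_a, out_b)
```

Both tasks compute the same pre-activation from the shared weights; only the coefficient vector changes the output.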

AAAI Conference 2019 Conference Paper

Dynamic Spatial-Temporal Graph Convolutional Neural Networks for Traffic Forecasting

  • Zulong Diao
  • Xin Wang
  • Dafang Zhang
  • Yingru Liu
  • Kun Xie
  • Shaoyao He

Graph convolutional neural networks (GCNNs) have become an increasingly active field of research. They model the spatial dependencies of nodes in a graph with a pre-defined Laplacian matrix based on node distances. However, in many application scenarios, spatial dependencies change over time, and a fixed Laplacian matrix cannot capture the change. To track the spatial dependencies among traffic data, we propose a dynamic spatio-temporal GCNN for accurate traffic forecasting. The core of our deep learning framework is a dynamic Laplacian matrix estimator that tracks the change of the Laplacian matrix. To enable timely learning with low complexity, we incorporate tensor decomposition into the deep learning framework, where real-time traffic data are decomposed into a global component that is stable and depends on long-term temporal-spatial traffic relationships and a local component that captures the traffic fluctuations. We propose a novel design to estimate the dynamic Laplacian matrix of the graph from these two components based on our theoretical derivation, and introduce our design basis. The forecasting performance is evaluated with two real-time traffic datasets. Experimental results demonstrate that our network can achieve an accuracy improvement of up to 25%.
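
For context on the "pre-defined Laplacian matrix based on node distances" that the abstract contrasts with its dynamic estimator, here is a minimal sketch of one common construction: a Gaussian-kernel adjacency matrix followed by the combinatorial Laplacian L = D - W. The kernel, the `sigma` parameter, the toy distance matrix, and the function names are assumptions for illustration; the paper's own fixed and dynamic Laplacians may be defined differently.

```python
import math
from typing import List

Matrix = List[List[float]]


def gaussian_adjacency(dist: Matrix, sigma: float = 1.0) -> Matrix:
    """Edge weight exp(-d^2 / sigma^2) for distinct nodes, 0 on the diagonal."""
    n = len(dist)
    return [
        [0.0 if i == j else math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]


def laplacian(adj: Matrix) -> Matrix:
    """Combinatorial graph Laplacian L = D - W, with D the diagonal degree matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    return [
        [(degree[i] if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical pairwise distances between three road sensors (symmetric).
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]

W = gaussian_adjacency(dist, sigma=1.0)
L = laplacian(W)
for row in L:
    print([round(v, 4) for v in row])
```

Because this matrix depends only on the fixed sensor distances, it stays constant over time, which is the limitation the abstract's dynamic estimator is meant to address.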