Author name cluster

Yanqi Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

ICLR Conference 2024 Conference Paper

Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

  • Sheng Shen 0001
  • Le Hou
  • Yanqi Zhou
  • Nan Du 0002
  • Shayne Longpre
  • Jason Wei
  • Hyung Won Chung
  • Barret Zoph

Sparse Mixture-of-Experts (MoE) is a neural architecture design that adds learnable parameters to Large Language Models (LLMs) without increasing computational complexity (FLOPs). Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (in the second and third scenarios), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MoE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.

ICML Conference 2023 Conference Paper

Brainformers: Trading Simplicity for Efficiency

  • Yanqi Zhou
  • Nan Du 0002
  • Yanping Huang
  • Daiyi Peng
  • Chang Lan
  • Da Huang
  • Siamak Shakeri
  • David R. So

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.

NeurIPS Conference 2023 Conference Paper

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

  • Tao Lei
  • Junwen Bai
  • Siddhartha Brahma
  • Joshua Ainslie
  • Kenton Lee
  • Yanqi Zhou
  • Nan Du
  • Vincent Zhao

We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.

NeurIPS Conference 2023 Conference Paper

Learning Large Graph Property Prediction via Graph Segment Training

  • Kaidi Cao
  • Mangpo Phothilimthana
  • Sami Abu-El-Haija
  • Dustin Zelle
  • Yanqi Zhou
  • Charith Mendis
  • Jure Leskovec
  • Bryan Perozzi

Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
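The training step described in the abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `gst_graph_embedding`, the `encode` callback, the dictionary used as the historical embedding table, and the choice of mean pooling are all assumptions made for the example. In a real framework only the freshly encoded segments would carry gradients.

```python
import random

def gst_graph_embedding(segments, encode, table, graph_id,
                        num_sampled, drop_prob, rng):
    """Build one graph-level embedding from per-segment embeddings.

    Hypothetical sketch: `encode` maps a segment to a list of floats,
    `table` is a dict acting as the historical embedding table.
    """
    # Pick the few segments that would be backpropagated through.
    sampled = set(rng.sample(range(len(segments)), num_sampled))
    parts = []
    for j, seg in enumerate(segments):
        key = (graph_id, j)
        if j in sampled:
            emb = encode(seg)      # fresh embedding (gradient path)
            table[key] = emb       # refresh the historical table
            parts.append(emb)
        elif key in table and rng.random() >= drop_prob:
            # Stale embedding reused without recomputation; it is
            # dropped with probability `drop_prob`.
            parts.append(table[key])
    # Mean pooling over the kept segment embeddings (an assumption).
    dim = len(parts[0])
    return [sum(p[d] for p in parts) / len(parts) for d in range(dim)]

# Toy usage: each "segment" is a list of numbers, encoded by its mean.
segments = [[1.0, 2.0], [3.0], [5.0, 7.0]]
encode = lambda seg: [sum(seg) / len(seg)]
table = {}
rng = random.Random(0)

full = gst_graph_embedding(segments, encode, table, "g0", 3, 0.0, rng)
mixed = gst_graph_embedding(segments, encode, table, "g0", 1, 0.0, rng)
fresh_only = gst_graph_embedding(segments, encode, table, "g0", 1, 1.0, rng)
```

Memory per step depends on `num_sampled` rather than on the number of segments, since only sampled segments are encoded.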

ICML Conference 2023 Conference Paper

Lifelong Language Pretraining with Distribution-Specialized Experts

  • Wuyang Chen 0001
  • Yanqi Zhou
  • Nan Du 0002
  • Yanping Huang
  • James Laudon
  • Zhifeng Chen
  • Claire Cui

Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on NLP tasks. More impressively, Lifelong-MoE surpasses multi-task learning on 19 downstream NLU tasks.

ICML Conference 2022 Conference Paper

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  • Nan Du 0002
  • Yanping Huang
  • Andrew M. Dai
  • Simon Tong
  • Dmitry Lepikhin
  • Yuanzhong Xu
  • Maxim Krikun
  • Yanqi Zhou

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall few-shot performance across 29 NLP tasks.

NeurIPS Conference 2022 Conference Paper

Mixture-of-Experts with Expert Choice Routing

  • Yanqi Zhou
  • Tao Lei
  • Hanxiao Liu
  • Nan Du
  • Yanping Huang
  • Vincent Zhao
  • Andrew M. Dai
  • Zhifeng Chen

Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g., one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
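The routing idea in the abstract above (experts pick their top-k tokens, rather than tokens picking experts) can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the paper's implementation: the function names, the `logits` input, and the `capacity` parameter (the fixed bucket size per expert) are hypothetical, and real systems would do this with batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def expert_choice_route(logits, capacity):
    """logits[t][e] is the affinity of token t for expert e.

    Returns, for each expert, a list of (token_index, gate_weight)
    pairs of length `capacity`: the tokens that expert selected.
    """
    probs = [softmax(row) for row in logits]
    num_tokens, num_experts = len(probs), len(probs[0])
    assignments = []
    for e in range(num_experts):
        # Each expert ranks all tokens by affinity and keeps its top-k.
        ranked = sorted(range(num_tokens),
                        key=lambda t: probs[t][e], reverse=True)
        assignments.append([(t, probs[t][e]) for t in ranked[:capacity]])
    return assignments

def experts_per_token(assignments, num_tokens):
    """Count how many experts selected each token."""
    counts = [0] * num_tokens
    for chosen in assignments:
        for t, _ in chosen:
            counts[t] += 1
    return counts

# Toy example: 4 tokens, 3 experts, bucket size 2 per expert.
logits = [
    [2.0, 2.0, 0.0],  # token 0: strong affinity for experts 0 and 1
    [3.0, 0.0, 0.0],  # token 1: prefers expert 0
    [0.0, 3.0, 0.0],  # token 2: prefers expert 1
    [0.0, 0.0, 1.0],  # token 3: prefers expert 2
]
assignments = expert_choice_route(logits, capacity=2)
counts = experts_per_token(assignments, num_tokens=4)
print(counts)  # [3, 1, 1, 1]
```

Every expert processes exactly `capacity` tokens, so load is balanced by construction, while token 0 ends up handled by three experts and the other tokens by one each.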

JMLR Journal 2020 Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

  • Colin Raffel
  • Noam Shazeer
  • Adam Roberts
  • Katherine Lee
  • Sharan Narang
  • Michael Matena
  • Yanqi Zhou
  • Wei Li

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
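To make the text-to-text framing in the abstract above concrete, the sketch below casts three different task types into plain (input text, target text) pairs. The function name, the dictionary field names, and the exact prefix strings are illustrative assumptions for this example, not an API from the released code.

```python
def to_text_to_text(task, example):
    """Convert one labeled example into an (input_text, target_text) pair.

    Hypothetical helper: `task` and the keys of `example` are assumptions.
    """
    if task == "translation":
        return ("translate English to German: " + example["source"],
                example["target"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    if task == "classification":
        # The class label is emitted as a word, not as a class index.
        return ("cola sentence: " + example["sentence"],
                example["label_name"])
    raise ValueError("unknown task: " + task)

pairs = [
    to_text_to_text("translation",
                    {"source": "That is good.", "target": "Das ist gut."}),
    to_text_to_text("summarization",
                    {"document": "A long article.", "summary": "Short."}),
    to_text_to_text("classification",
                    {"sentence": "The course is jumping well.",
                     "label_name": "not acceptable"}),
]
for source_text, target_text in pairs:
    print(repr(source_text), "->", repr(target_text))
```

Because every task reduces to string in, string out, one model, one loss, and one decoding procedure can serve all of them.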

ICRA Conference 2020 Conference Paper

Omnidirectional Depth Extension Networks

  • Xinjing Cheng
  • Peng Wang 0001
  • Yanqi Zhou
  • Chenye Guan
  • Ruigang Yang

Omnidirectional 360° cameras are proliferating rapidly on autonomous robots because they significantly enhance perception by widening the field of view (FoV). However, the corresponding 360° depth sensors, which are also critical for the perception system, are still difficult or expensive to obtain. In this paper, we propose a low-cost 3D sensing system that combines an omnidirectional camera with a calibrated projective depth camera, where the depth from the limited FoV can be automatically extended to the rest of the recorded omnidirectional image. To accurately recover the missing depths, we design an omnidirectional depth extension convolutional neural network (ODE-CNN), in which a spherical feature transform layer (SFTL) is embedded at the end of the feature encoding layers, and a deformable convolutional spatial propagation network (D-CSPN) is appended at the end of the feature decoding layers. The former re-samples the neighborhood of each pixel from omnidirectional coordinates to projective coordinates, which reduces the difficulty of feature learning, and the latter automatically finds a proper context to align the structures in the estimated depths via CNN w.r.t. the reference image, which significantly improves the visual quality. Finally, we demonstrate the effectiveness of the proposed ODE-CNN on the popular 360D dataset, and show that ODE-CNN significantly outperforms other state-of-the-art (SoTA) methods, with a 33% relative reduction in depth error.

NeurIPS Conference 2020 Conference Paper

Transferable Graph Optimizers for ML Compilers

  • Yanqi Zhou
  • Sudip Roy
  • Amirali Abdolrashidi
  • Daniel Wong
  • Peter Ma
  • Qiumin Xu
  • Hanxiao Liu
  • Phitchaya Phothilimtha

Most compilers for machine learning (ML) frameworks need to solve many correlated optimization problems to generate efficient machine code. Current ML compilers rely on heuristics-based algorithms to solve these optimization problems one at a time. However, this approach is not only hard to maintain but often leads to sub-optimal solutions, especially for newer model architectures. Existing learning-based approaches in the literature are sample inefficient, tackle a single optimization problem, and do not generalize to unseen graphs, making them infeasible to deploy in practice. To address these limitations, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO), based on a scalable sequential attention mechanism over an inductive graph neural network. GO generates decisions on the entire graph rather than on each individual node autoregressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves on average 21% improvement over human experts and 18% improvement over the prior state of the art with 15x faster convergence, on a device placement task evaluated in real systems.

NeurIPS Conference 2018 Conference Paper

Neural Voice Cloning with a Few Samples

  • Sercan Arik
  • Jitong Chen
  • Kainan Peng
  • Wei Ping
  • Yanqi Zhou

Voice cloning is a highly desired feature for personalized speech interfaces. We introduce a neural voice cloning system that learns to synthesize a person's voice from only a few audio samples. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model. Speaker encoding is based on training a separate model to directly infer a new speaker embedding, which will be applied to a multi-speaker generative model. In terms of naturalness of the speech and similarity to the original speaker, both approaches can achieve good performance, even with a few cloning audios. While speaker adaptation can achieve slightly better naturalness and similarity, cloning time and required memory for the speaker encoding approach are significantly less, making it more favorable for low-resource deployment.

NeurIPS Conference 2017 Conference Paper

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

  • Andrew Gibiansky
  • Sercan Arik
  • Gregory Diamos
  • John Miller
  • Kainan Peng
  • Wei Ping
  • Jonathan Raiman
  • Yanqi Zhou

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.