
Author name cluster

Zhifeng Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers (15)

EAAI Journal 2026 Journal Article

Dynamic path smooth unfolding network and learnable random smoothing strategy for magnetic resonance imaging compressed sensing

  • Ziqi Yang
  • Mingfeng Jiang
  • Chenghu Geng
  • Zhifeng Chen
  • Mengyu Jia
  • Xiaocheng Yang
  • Sumei Huang
  • Feng Liu

Deep Unfolding Networks (DUNs) have become the mainstream approach for compressed sensing Magnetic Resonance Imaging (MRI) reconstruction from highly under-sampled k-space data. In this paper, a novel Dynamic Path Smooth Unfolding Network (DPSU-Net) is proposed for compressed sensing MRI reconstruction by dynamically selecting different paths for smooth unfolding. Furthermore, a learnable random smoothing strategy is used to enhance model robustness by introducing perturbations through a noise generator during the training stage. Experimental results on FastMRI T1-weighted and T2-weighted images show that DPSU-Net achieves superior reconstruction performance across different under-sampling rates, with Peak Signal-to-Noise Ratio (PSNR)/Structural Similarity Index Measure (SSIM) of 48.70/0.9889 on T1-weighted images and 45.68/0.9715 on T2-weighted images, surpassing existing state-of-the-art networks. Ablation studies further confirm the effectiveness and robustness of the dynamic path selection and learnable random smoothing strategies, demonstrating improvements in reconstruction quality.
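
The learnable random smoothing strategy is only described at a high level above. As a minimal illustration of the idea (not the paper's implementation; the function name, shapes, and the fixed `sigma` standing in for the learnable noise generator are all assumptions), the sketch below perturbs intermediate features with zero-mean Gaussian noise during training only:

```python
import numpy as np

def random_smoothing(features: np.ndarray, sigma: float, training: bool,
                     rng: np.random.Generator) -> np.ndarray:
    """Perturb intermediate features with zero-mean Gaussian noise.

    `sigma` stands in for the output of the paper's learnable noise
    generator; here it is just a fixed scalar for illustration. The
    perturbation is applied only during training, so inference sees
    the unperturbed features.
    """
    if not training:
        return features
    noise = rng.normal(loc=0.0, scale=sigma, size=features.shape)
    return features + noise

# Usage: perturb a dummy feature map during a "training" step.
rng = np.random.default_rng(0)
feats = np.zeros((4, 8, 8))                      # hypothetical feature map
perturbed = random_smoothing(feats, sigma=0.05, training=True, rng=rng)
print(perturbed.std())                           # roughly 0.05
```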

YNIMG Journal 2025 Journal Article

Accelerating multi-directional diffusion MRI through patch-based joint reconstruction

  • Zhongbiao Xu
  • Rongli Zhang
  • Wei Huang
  • Guanhua Deng
  • Xiaoyun Liang
  • Li Guo
  • Junying Cheng
  • Yaohui Wang

Diffusion magnetic resonance imaging (dMRI) is a valuable technique for studying tissue microstructure and connectivity in the brain. However, acquiring high-resolution dMRI data is time-consuming, limiting its clinical applicability. Traditional parallel imaging techniques can accelerate the acquisition of dMRI, but they are constrained by the geometry factor. In this study, we propose PB-SENSE, a novel patch-based joint reconstruction method across multiple diffusion directions that simultaneously capitalizes on the intra- and inter-image correlation by grouping similar 3D image patches and then enforcing the sparsity of these groups within sensitivity encoding (SENSE) reconstruction. The simulation and in vivo experiments demonstrated that the proposed method can achieve high-quality images comparable to those obtained from fully sampled data, even at an acceleration factor of 5. This suggests that the proposed method has the potential to enhance the practical application of high-resolution diffusion imaging.
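
PB-SENSE itself enforces the group sparsity inside a SENSE reconstruction, which is not reproduced here. Purely to illustrate the patch-grouping idea, the sketch below block-matches similar patches across diffusion-direction images and jointly shrinks each group via singular-value soft-thresholding; the patch size, candidate stride, and thresholding choice are assumptions, not the paper's algorithm:

```python
import numpy as np

def group_similar_patches(images, ref_idx, ref_pos, patch=8, k=16):
    """Collect the k patches (across all diffusion-direction images) most
    similar to a reference patch, returned as a (k, patch*patch) matrix."""
    ndir, H, W = images.shape
    y0, x0 = ref_pos
    ref = images[ref_idx, y0:y0 + patch, x0:x0 + patch].ravel()
    candidates, coords = [], []
    for d in range(ndir):
        for y in range(0, H - patch + 1, patch // 2):
            for x in range(0, W - patch + 1, patch // 2):
                candidates.append(images[d, y:y + patch, x:x + patch].ravel())
                coords.append((d, y, x))
    candidates = np.stack(candidates)
    dists = np.linalg.norm(candidates - ref, axis=1)
    best = np.argsort(dists)[:k]
    return candidates[best], [coords[i] for i in best]

def joint_threshold(group, tau=0.1):
    """Enforce joint sparsity / low rank of a patch group by
    soft-thresholding its singular values (one common proxy)."""
    U, s, Vt = np.linalg.svd(group, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return U @ np.diag(s) @ Vt

# Toy usage on random "diffusion direction" images.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(6, 64, 64))          # 6 directions, 64x64 each
group, coords = group_similar_patches(imgs, ref_idx=0, ref_pos=(8, 8))
denoised_group = joint_threshold(group)
print(group.shape, denoised_group.shape)     # (16, 64) twice
```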

ICML Conference 2024 Conference Paper

Controlled Decoding from Language Models

  • Sidharth Mudgal
  • Jong Lee
  • Harish Ganapathy
  • YaGuang Li
  • Tao Wang
  • Yanping Huang
  • Zhifeng Chen
  • Heng-Tze Cheng

KL-regularized reinforcement learning (RL) is a popular alignment framework to steer language model responses toward high-reward outcomes. We pose a tokenwise RL objective and propose a modular solver for it, called controlled decoding (CD). CD exerts control through a separate prefix scorer module, which is trained to learn a value function for the reward. The prefix scorer is used at inference time to control the generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that prefix scorers for multiple rewards may be combined at inference time, effectively solving a multi-objective RL problem with no additional training. We show that the benefits of applying CD transfer to an unseen base model with no further tuning as well. Finally, we show that CD can be applied in a blockwise decoding fashion at inference time, essentially bridging the gap between the popular best-of-$K$ strategy and tokenwise control through reinforcement learning. This makes CD a promising approach for alignment of language models.
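
As a rough sketch of tokenwise control (hypothetical code, not the paper's; `base_logits`, `value_estimates`, and `alpha` are assumed names), the next-token distribution of a frozen base model can be shifted by a prefix scorer's value estimates before sampling:

```python
import numpy as np

def controlled_next_token(base_logits, value_estimates, alpha, rng):
    """Sample the next token from a frozen base model's logits shifted by
    a prefix scorer's value estimates (one per candidate token).

    alpha trades off reward control (large alpha) against staying close
    to the base model (alpha = 0 recovers ordinary sampling).
    """
    adjusted = base_logits + alpha * value_estimates
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy usage with a 5-token vocabulary.
rng = np.random.default_rng(0)
base_logits = np.array([1.0, 0.5, 0.2, -0.3, -1.0])   # from the frozen LM
values = np.array([0.0, 2.0, 0.0, 0.0, 0.0])          # prefix scorer favors token 1
token = controlled_next_token(base_logits, values, alpha=1.0, rng=rng)
print(token)
```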

NeurIPS Conference 2024 Conference Paper

Stylus: Automatic Adapter Selection for Diffusion Models

  • Michael Luo
  • Justin Wong
  • Brandon Trabucco
  • Yanping Huang
  • Joseph E. Gonzalez
  • Zhifeng Chen
  • Ruslan Salakhutdinov
  • Ion Stoica

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. To generate high-quality images, this paper explores the problem of matching the prompt to a set of relevant adapters, building on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP/FID Pareto efficiency and is preferred twice as often as the base model, with both humans and multimodal models as evaluators.
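
As an illustration of the retrieval stage only (the adapter names, embeddings, and dimensions below are invented; this is not the Stylus code), pre-computed adapter-description embeddings can be ranked against a prompt embedding by cosine similarity:

```python
import numpy as np

def top_k_adapters(prompt_emb, adapter_embs, adapter_names, k=2):
    """Rank adapters by cosine similarity to the prompt embedding."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    A = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    scores = A @ p
    order = np.argsort(scores)[::-1][:k]
    return [(adapter_names[i], float(scores[i])) for i in order]

# Toy usage with 3 hypothetical adapters and 4-dim embeddings.
names = ["watercolor-style", "cyberpunk-city", "portrait-detail"]
adapter_embs = np.array([[0.9, 0.1, 0.0, 0.0],
                         [0.0, 0.8, 0.5, 0.0],
                         [0.1, 0.0, 0.2, 0.9]])
prompt_emb = np.array([0.85, 0.2, 0.0, 0.1])   # e.g. "a watercolor landscape"
print(top_k_adapters(prompt_emb, adapter_embs, names))
```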

ICML Conference 2023 Conference Paper

Brainformers: Trading Simplicity for Efficiency

  • Yanqi Zhou
  • Nan Du 0002
  • Yanping Huang
  • Daiyi Peng
  • Chang Lan
  • Da Huang
  • Siamak Shakeri
  • David R. So

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS, with similar computation per token, on few-shot evaluations.
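
As a toy sketch of the idea that a block is a permutation of heterogeneous layer primitives (the primitives and the particular ordering below are invented, not the searched Brainformer block), a block can be represented as an ordered list of sublayer functions applied in sequence:

```python
import numpy as np

def layer_norm(x):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True) + 1e-6
    return (x - mu) / sd

def dense_ffn(x):
    return np.maximum(x, 0.0)                       # stand-in for a dense FFN

def self_attention(x):
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def sparse_ffn(x):
    # Route each token to one of two "experts" by the sign of its mean.
    return np.where(x.mean(-1, keepdims=True) > 0,
                    np.tanh(x), np.maximum(x, 0.0))

# A block is just an ordered list of layer primitives; different
# permutations give different (hypothetical) block designs.
block = [layer_norm, self_attention, layer_norm, sparse_ffn, dense_ffn]

x = np.random.default_rng(0).normal(size=(6, 8))    # 6 tokens, width 8
for layer in block:
    x = layer(x)
print(x.shape)                                      # (6, 8)
```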

EAAI Journal 2023 Journal Article

Deep Deterministic Policy Gradient and Active Disturbance Rejection Controller based coordinated control for gearshift manipulator of driving robot

  • Gang Chen
  • Zhifeng Chen
  • Liangmo Wang
  • Weigong Zhang

In order to improve the shift precision of the gearshift manipulator, a coordinated control method for the gearshift manipulator based on the Deep Deterministic Policy Gradient (DDPG) and Active Disturbance Rejection Controller (ADRC) is proposed. Firstly, the kinematic model and dynamics model of the gearshift manipulator are established. Secondly, a coordinated control strategy is proposed to solve the target rotation angle of the servo motor, so as to deal with the nonlinear trajectory problem caused by the mechanical decoupling strategy. Then, an ADRC-based controller and a DDPG-based parameter adjuster are proposed. A DDPG model is established to adaptively adjust the control parameters of ADRC through the target rotation angle, actual rotation angle and rotation angle error. Finally, the theoretical stability proof and experimental verification of the proposed method are conducted. The test results show that the mean absolute error of the proposed method is 34% and 18% lower than that of the ADRC and fuzzy PID methods, respectively, demonstrating the effectiveness of the proposed method.

ICML Conference 2023 Conference Paper

Lifelong Language Pretraining with Distribution-Specialized Experts

  • Wuyang Chen 0001
  • Yanqi Zhou
  • Nan Du 0002
  • Yanping Huang
  • James Laudon
  • Zhifeng Chen
  • Claire Cui

Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on NLP tasks. More impressively, Lifelong-MoE surpasses multi-task learning on 19 downstream NLU tasks.
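
As a hypothetical sketch of the capacity-expansion idea only (not Lifelong-MoE's architecture or training recipe; the routing and "training" below are placeholders), a set of experts can be grown when a new data distribution arrives, with the older experts frozen so that previous knowledge is preserved:

```python
import numpy as np

class GrowingMoE:
    """Toy mixture of experts that adds one expert per data distribution.

    Older experts are frozen when a new one is added, which mirrors the
    capacity-expansion idea in the abstract; the regularization used in
    the paper is omitted, and the "training" below is a placeholder.
    """
    def __init__(self):
        self.experts = []   # list of (prototype, weight) pairs
        self.frozen = []    # True for experts that are no longer updated

    def add_expert(self, data):
        prototype = data.mean(axis=0)            # routes similar inputs here
        weight = data.T @ data / len(data)       # placeholder for a trained map
        self.frozen = [True] * len(self.experts) + [False]
        self.experts.append((prototype, weight))

    def forward(self, x):
        # Route the input to the expert whose prototype is closest.
        protos = np.stack([p for p, _ in self.experts])
        idx = int(np.argmin(np.linalg.norm(protos - x, axis=1)))
        return x @ self.experts[idx][1]

rng = np.random.default_rng(0)
moe = GrowingMoE()
moe.add_expert(rng.normal(loc=0.0, size=(32, 4)))   # first data distribution
moe.add_expert(rng.normal(loc=3.0, size=(32, 4)))   # a later distribution shift
print(moe.forward(rng.normal(loc=3.0, size=4)).shape)   # routed to the new expert
```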

TMLR Journal 2023 Journal Article

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

  • Weicheng Kuo
  • AJ Piergiovanni
  • Dahun Kim
  • Xiyang Luo
  • Benjamin Caine
  • Wei Li
  • Abhijit Ogale
  • Luowei Zhou

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.

ICML Conference 2022 Conference Paper

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  • Nan Du 0002
  • Yanping Huang
  • Andrew M. Dai
  • Simon Tong
  • Dmitry Lepikhin
  • Yuanzhong Xu
  • Maxim Krikun
  • Yanqi Zhou

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall few-shot performance across 29 NLP tasks.
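
As a very small sketch of sparse activation (not GLaM's implementation; the expert count, widths, and top-2 gating details are assumptions), each token is routed to only its top-2 experts, so compute per token stays roughly constant even as the number of experts, and hence parameters, grows:

```python
import numpy as np

def moe_layer_top2(x, gate_w, expert_ws):
    """Sparsely activated MoE layer: each token uses only its top-2 experts.

    x        : (tokens, dim) activations
    gate_w   : (dim, n_experts) gating weights
    expert_ws: list of (dim, dim) expert weight matrices
    """
    logits = x @ gate_w                                    # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top2 = np.argsort(probs, axis=-1)[:, -2:]              # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top2[t]:
            out[t] += probs[t, e] * (x[t] @ expert_ws[e])  # weighted expert outputs
    return out

rng = np.random.default_rng(0)
tokens, dim, n_experts = 4, 8, 16
x = rng.normal(size=(tokens, dim))
gate_w = rng.normal(size=(dim, n_experts))
expert_ws = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
print(moe_layer_top2(x, gate_w, expert_ws).shape)          # (4, 8)
```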

NeurIPS Conference 2022 Conference Paper

Mixture-of-Experts with Expert Choice Routing

  • Yanqi Zhou
  • Tao Lei
  • Hanxiao Liu
  • Nan Du
  • Yanping Huang
  • Vincent Zhao
  • Andrew M. Dai
  • Zhifeng Chen

Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g., one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
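
As a minimal sketch of the routing rule itself (shapes, the capacity value, and names are assumptions, not the paper's code): instead of each token picking its top-k experts, each expert picks its top-c tokens from the token-to-expert affinity matrix, so every expert processes a fixed-size bucket while a token may be picked by zero, one, or several experts:

```python
import numpy as np

def expert_choice_routing(affinity, capacity):
    """Each expert selects its `capacity` highest-affinity tokens.

    affinity: (tokens, experts) router scores
    Returns a list, one entry per expert, of the token indices it takes.
    A token may appear under several experts or under none.
    """
    n_tokens, n_experts = affinity.shape
    buckets = []
    for e in range(n_experts):
        chosen = np.argsort(affinity[:, e])[::-1][:capacity]
        buckets.append(chosen)
    return buckets

rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 4))        # 10 tokens, 4 experts
for e, bucket in enumerate(expert_choice_routing(scores, capacity=3)):
    print(f"expert {e} takes tokens {sorted(bucket.tolist())}")
```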

ICLR Conference 2022 Conference Paper

Scene Transformer: A unified architecture for predicting future trajectories of multiple agents

  • Jiquan Ngiam
  • Vijay Vasudevan
  • Benjamin Caine
  • Zhengdong Zhang
  • Hao-Tien Lewis Chiang
  • Jeffrey Ling
  • Rebecca Roelofs
  • Alex Bewley

Predicting the motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence one another. Most prior work has focused on predicting independent futures for each agent based on all past motion, and planning against these independent predictions. However, planning against independent predictions can make it challenging to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly, producing consistent futures that account for interactions between agents. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as conditioning on the goal or full future trajectory of the autonomous vehicle, or on the behavior of other agents in the environment. Our model architecture employs attention to combine features across road elements, agent interactions, and time steps. We evaluate our approach on autonomous driving datasets for both marginal and joint motion prediction, and achieve state-of-the-art performance across two popular datasets. Through combining a scene-centric approach, an agent-permutation-equivariant model, and a sequence masking strategy, we show that our model can unify a variety of motion prediction tasks from joint motion prediction to conditioned prediction.

ICLR Conference 2021 Conference Paper

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  • Dmitry Lepikhin
  • HyoukJoong Lee
  • Yuanzhong Xu
  • Dehao Chen
  • Orhan Firat
  • Yanping Huang
  • Maxim Krikun
  • Noam Shazeer

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. In this paper we demonstrate conditional computation as a remedy to the above mentioned impediments, and demonstrate its efficacy and utility. We make extensive use of GShard, a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler to enable large scale models with up to trillions of parameters. GShard and conditional computation enable us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts. We demonstrate that such a giant model with 600 billion parameters can efficiently be trained on 2048 TPU v3 cores in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

NeurIPS Conference 2019 Conference Paper

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

  • Yanping Huang
  • Youlong Cheng
  • Ankur Bapna
  • Orhan Firat
  • Dehao Chen
  • Mia Chen
  • HyoukJoong Lee
  • Jiquan Ngiam

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other machine learning tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
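
As a toy sketch of why batch splitting helps (not the library's code; the stage and micro-batch counts are made up): with S pipeline stages and M micro-batches, the forward pass finishes in S + M - 1 stage-steps rather than S * M, so the idle "bubble" shrinks as M grows:

```python
def pipeline_forward_schedule(num_stages: int, num_microbatches: int):
    """Return, per time step, which (stage, micro-batch) pairs run in parallel.

    Classic pipeline-parallel forward schedule: micro-batch m reaches
    stage s at time s + m, so the whole forward pass finishes in
    num_stages + num_microbatches - 1 steps.
    """
    steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule

for t, active in enumerate(pipeline_forward_schedule(num_stages=4, num_microbatches=8)):
    print(f"step {t:2d}: " + ", ".join(f"stage{s}<-mb{m}" for s, m in active))
# 11 steps total versus 4 * 8 = 32 if micro-batches were run one at a time.
```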

NeurIPS Conference 2018 Conference Paper

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

  • Ye Jia
  • Yu Zhang
  • Ron Weiss
  • Quan Wang
  • Jonathan Shen
  • Fei Ren
  • Zhifeng Chen
  • Patrick Nguyen

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
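
As a minimal sketch of the conditioning mechanism only (not the paper's system; the speaker encoder below is a placeholder and all shapes are assumptions), a fixed-dimensional speaker embedding can be broadcast and concatenated to every frame of the synthesizer's encoder output before decoding:

```python
import numpy as np

def speaker_embedding(reference_audio, emb_dim=4):
    """Placeholder speaker encoder: a real system would use a network
    trained on speaker verification. Here we just pool simple statistics."""
    stats = np.array([reference_audio.mean(), reference_audio.std(),
                      reference_audio.min(), reference_audio.max()])
    return stats[:emb_dim] / (np.linalg.norm(stats[:emb_dim]) + 1e-8)

def condition_on_speaker(encoder_outputs, spk_emb):
    """Concatenate the same speaker embedding to every encoder frame,
    so the decoder sees text features plus the target voice."""
    tiled = np.tile(spk_emb, (encoder_outputs.shape[0], 1))
    return np.concatenate([encoder_outputs, tiled], axis=1)

rng = np.random.default_rng(0)
text_features = rng.normal(size=(20, 16))          # 20 text frames, 16 dims
ref_audio = rng.normal(size=16000)                 # 1 second of reference speech
emb = speaker_embedding(ref_audio)
conditioned = condition_on_speaker(text_features, emb)
print(conditioned.shape)                           # (20, 20)
```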

NeurIPS Conference 2016 Conference Paper

Reward Augmented Maximum Likelihood for Neural Structured Prediction

  • Mohammad Norouzi
  • Samy Bengio
  • Zhifeng Chen
  • Navdeep Jaitly
  • Mike Schuster
  • Yonghui Wu
  • Dale Schuurmans

A key problem in structured output prediction is enabling direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient method that incorporates task reward into maximum likelihood training. We establish a connection between maximum likelihood and regularized expected reward, showing that they are approximately equivalent in the vicinity of the optimal solution. Then we show how maximum likelihood can be generalized by optimizing the conditional probability of auxiliary outputs that are sampled proportional to their exponentiated scaled rewards. We apply this framework to optimize edit distance in the output space, by sampling from edited targets. Experiments on speech recognition and machine translation for neural sequence-to-sequence models show notable improvements over a maximum likelihood baseline by simply sampling from target output augmentations.
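
As a worked toy example of the sampling step (the vocabulary, reward, and temperature below are invented for illustration), auxiliary targets are sampled with probability proportional to exp(reward / tau), here using negative Hamming distance to the ground-truth output as the reward:

```python
import itertools
import numpy as np

def exponentiated_payoff_distribution(target, vocab, tau=1.0):
    """Distribution over all same-length outputs, weighted by
    exp(-hamming_distance(candidate, target) / tau)."""
    candidates = list(itertools.product(vocab, repeat=len(target)))
    rewards = np.array([-sum(a != b for a, b in zip(c, target)) for c in candidates])
    weights = np.exp(rewards / tau)
    return candidates, weights / weights.sum()

target = ("a", "b", "b")
candidates, probs = exponentiated_payoff_distribution(target, vocab=("a", "b"), tau=0.5)

# Sample augmented training targets instead of always using the ground truth.
rng = np.random.default_rng(0)
for i in rng.choice(len(candidates), size=5, p=probs):
    print(candidates[i], f"p={probs[i]:.3f}")
```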