Arrow Research search

Author name cluster

Fei Tian

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

NeurIPS Conference 2019 Conference Paper

Neural Machine Translation with Soft Prototype

  • Yiren Wang
  • Yingce Xia
  • Fei Tian
  • Fei Gao
  • Tao Qin
  • Cheng Xiang Zhai
  • Tie-Yan Liu

Neural machine translation models usually use the encoder-decoder framework and generate translation from left to right (or right to left) without fully utilizing the target-side global information. A few recent approaches seek to exploit the global information through two-pass decoding, yet have limitations in translation quality and model efficiency. In this work, we propose a new framework that introduces a soft prototype into the encoder-decoder architecture, which allows the decoder to have indirect access to both past and future information, such that each target word can be generated based on a better global understanding. We further provide an efficient and effective method to generate the prototype. Empirical studies on various neural machine translation tasks show that our approach brings significant improvement in generation quality over the baseline model, with little extra cost in storage and inference time, demonstrating the effectiveness of our proposed framework. Specifically, we achieve state-of-the-art results on the WMT 2014, 2015, and 2017 English-to-German translation tasks.
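
As a rough illustration of the soft-prototype idea in this abstract, the sketch below turns a per-position probability distribution over the target vocabulary into an expected embedding, which a decoder could then consult as a fuzzy global view of the target sentence. The sizes and the randomly generated scores are illustrative stand-ins, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, T = 10, 4, 5                     # toy vocab size, embedding dim, target length

E = rng.standard_normal((V, d))        # target-side word embeddings
logits = rng.standard_normal((T, V))   # a generator's per-position scores (random stand-in)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# soft prototype: the expected embedding at each target position, giving
# the decoder indirect access to both past and future target-side words
prototype = probs @ E                  # shape (T, d)
```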

AAAI Conference 2019 Conference Paper

Non-Autoregressive Machine Translation with Auxiliary Regularization

  • Yiren Wang
  • Fei Tian
  • Di He
  • Tao Qin
  • ChengXiang Zhai
  • Tie-Yan Liu

As a new neural machine translation approach, Non-Autoregressive machine Translation (NAT) has attracted attention recently due to its high efficiency in inference. However, the high efficiency has come at the cost of not capturing the sequential dependency on the target side of translation, which causes NAT to suffer from two kinds of translation errors: 1) repeated translations (due to indistinguishable adjacent decoder hidden states), and 2) incomplete translations (due to incomplete transfer of source side information via the decoder hidden states). In this paper, we propose to address these two problems by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model. First, to make the hidden states more distinguishable, we regularize the similarity between consecutive hidden states based on the corresponding target tokens. Second, to force the hidden states to contain all the information in the source sentence, we leverage the dual nature of translation tasks (e.g., English to German and German to English) and minimize a backward reconstruction error to ensure that the hidden states of the NAT decoder are able to recover the source side sentence. Extensive experiments conducted on several benchmark datasets show that both regularization strategies are effective and can alleviate the issues of repeated translations and incomplete translations in NAT models. The accuracy of NAT models is therefore improved significantly over the state-of-the-art NAT models with even better efficiency for inference.
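
The first auxiliary term can be sketched as follows. The abstract does not give the exact loss form, so the cosine-based penalty here, which pushes adjacent hidden states apart when their target tokens differ and pulls them together otherwise, is an illustrative assumption.

```python
import numpy as np

def similarity_regularizer(hidden, targets):
    """Regularize adjacent decoder hidden states: push them apart when
    their target tokens differ (to discourage repeated translations),
    pull them together when the tokens agree. The cosine-based form is
    an assumption, not the paper's exact loss."""
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    cos = np.sum(h[:-1] * h[1:], axis=1)       # cosine of adjacent states
    differ = targets[:-1] != targets[1:]       # do adjacent target tokens differ?
    loss = np.where(differ, np.maximum(cos, 0.0), 1.0 - cos)
    return float(loss.mean())

rng = np.random.default_rng(0)
hidden = rng.standard_normal((6, 8))           # (target length, hidden dim)
targets = np.array([3, 3, 7, 7, 2, 5])
reg = similarity_regularizer(hidden, targets)
```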

AAAI Conference 2019 Conference Paper

Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder

  • Yingce Xia
  • Tianyu He
  • Xu Tan
  • Fei Tian
  • Di He
  • Tao Qin

Sharing source and target side vocabularies and word embeddings has been a popular practice in neural machine translation (briefly, NMT) for similar languages (e.g., English to French or German translation). The success of such word-level sharing motivates us to move one step further: we consider model-level sharing and tie all the parameters of the encoder and decoder of an NMT model. We share the encoder and decoder of Transformer (Vaswani et al. 2017), the state-of-the-art NMT model, and obtain a compact model named Tied Transformer. Experimental results demonstrate that such a simple method works well for both similar and dissimilar language pairs. We empirically verify our framework for both supervised NMT and unsupervised NMT: we achieve a 35.52 BLEU score on IWSLT 2014 German to English translation, 28.98/29.89 BLEU scores on WMT 2014 English to German translation without/with monolingual data, and a 22.05 BLEU score on WMT 2016 unsupervised German to English translation.
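
Model-level sharing can be illustrated with a toy stand-in for a Transformer block (a single tanh layer here, not the paper's architecture): one parameter set serves both the encoder pass and the decoder pass, halving the parameter count relative to untied blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

class Block:
    """A stand-in for a Transformer block (a single tanh layer)."""
    def __init__(self, d):
        self.W = rng.standard_normal((d, d)) / np.sqrt(d)
    def __call__(self, x):
        return np.tanh(x @ self.W)

d = 8
shared = Block(d)                  # ONE parameter set...
encode, decode = shared, shared    # ...used by both encoder and decoder

src = rng.standard_normal((5, d))
memory = encode(src)               # "encoder" pass
out = decode(memory)               # "decoder" pass reuses the same weights

untied_params = 2 * d * d          # separate encoder + decoder blocks
tied_params = d * d                # Tied-Transformer-style sharing
```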

NeurIPS Conference 2018 Conference Paper

Learning to Teach with Dynamic Loss Functions

  • Lijun Wu
  • Fei Tian
  • Yingce Xia
  • Yang Fan
  • Tao Qin
  • Lai Jian-Huang
  • Tie-Yan Liu

Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) that account for each student's particular situation. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as perfect counterparts of the learning objectives set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (we call it the student) is defined by another machine learning model (we call it the teacher). The ultimate goal of the teacher model is to cultivate the student to achieve better performance measured on a development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient-based optimization possible, avoiding less effective alternatives such as policy optimization. We name our method ``learning to teach with dynamic loss functions'' (L2T-DLF for short). Extensive experiments on real-world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.
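
The dynamic-loss idea can be sketched in miniature. Here a hand-written "teacher" (not the paper's learned, parametric teacher) maps the training stage to the coefficients of a loss that mixes a squared-error data term with a decaying L2 penalty, and a linear-regression student minimizes whatever loss the teacher currently outputs; all names and numbers are illustrative.

```python
import numpy as np

def teacher(step, total):
    """Toy teacher: returns (data-term weight, L2 weight) for this stage.
    The real L2T-DLF teacher is itself a trained model."""
    progress = step / total
    return 1.0, 0.1 * (1.0 - progress)   # regularization fades out over time

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
total = 200
for step in range(total):
    a, b = teacher(step, total)          # dynamic loss coefficients
    # gradient of a * mean((Xw - y)^2) + b * ||w||^2
    grad = a * 2 * X.T @ (X @ w - y) / len(y) + b * 2 * w
    w -= 0.05 * grad
```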

ICML Conference 2018 Conference Paper

Model-Level Dual Learning

  • Yingce Xia
  • Xu Tan 0003
  • Fei Tian
  • Tao Qin 0001
  • Nenghai Yu
  • Tie-Yan Liu

Many artificial intelligence tasks appear in dual forms like English$\leftrightarrow$French translation and speech$\leftrightarrow$text transformation. Existing dual learning schemes, which are proposed to solve a pair of such dual tasks, explore how to leverage such dualities at the data level. In this work, we propose a new learning framework, model-level dual learning, which takes the duality of tasks into consideration while designing the architectures for the primal/dual models, and ties the model parameters that play similar roles in the two tasks. We study both symmetric and asymmetric model-level dual learning. Our algorithms achieve significant improvements on neural machine translation and sentiment analysis.

NeurIPS Conference 2018 Conference Paper

Neural Architecture Optimization

  • Renqian Luo
  • Fei Tian
  • Tao Qin
  • Enhong Chen
  • Tie-Yan Liu

Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient-based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architecture discovered by our method is very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods with a significant reduction of computational resources. Specifically, we obtain a $2.11\%$ test set error rate for the CIFAR-10 image classification task and a $56.0$ test set perplexity for the PTB language modeling task. The best discovered architectures on both tasks are successfully transferred to other tasks such as CIFAR-100 and WikiText-2. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate $3.53\%$) and on PTB (with test set perplexity $56.6$), with very limited computational resources (less than $10$ GPU hours) for both tasks.
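
The encoder/predictor/decoder loop described above can be caricatured with a toy continuous space: an identity "encoder", a quadratic surrogate "predictor" whose gradient we ascend, and a rounding "decoder". Every function and number here is an illustrative stand-in, not NAO's actual implementation.

```python
import numpy as np

def encode(arch):                      # architecture -> continuous embedding
    return np.asarray(arch, dtype=float)

def predict(z, w):                     # surrogate accuracy predictor (toy)
    return -np.sum((z - w) ** 2)       # peaks at z == w

def grad_predict(z, w):                # gradient of the predictor wrt z
    return -2.0 * (z - w)

def decode(z):                         # embedding -> nearest discrete arch
    return np.round(z).astype(int)

w = np.array([1.0, 3.0, 2.0])          # the predictor's optimum (toy)
z = encode([0, 0, 0])
for _ in range(50):                    # gradient ascent in continuous space
    z = z + 0.1 * grad_predict(z, w)
best = decode(z)                       # decode the improved embedding
```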

ICML Conference 2018 Conference Paper

Towards Binary-Valued Gates for Robust LSTM Training

  • Zhuohan Li 0001
  • Di He 0001
  • Fei Tian
  • Wei Chen 0034
  • Tao Qin 0001
  • Liwei Wang 0001
  • Tie-Yan Liu

Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e.g., whether or not to skip some information) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way for LSTM training, which pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) although it seems that we restrict the model capacity, there is no performance drop: we achieve better or comparable performance due to better generalization ability; (2) the outputs of gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression.
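
The effect of pushing gate outputs toward 0 or 1 can be seen with a temperature-sharpened sigmoid. The paper trains toward binary gates with a stochastic sharpening scheme; the plain low-temperature sigmoid below is a simplified stand-in for illustration only.

```python
import numpy as np

def sharp_gate(x, temperature=0.2):
    """Sigmoid gate with a low temperature: outputs saturate toward
    0 or 1 far faster than an ordinary gate (a simplified stand-in
    for the paper's binary-valued gate training)."""
    return 1.0 / (1.0 + np.exp(-x / temperature))

x = np.array([-1.0, -0.1, 0.1, 1.0])   # pre-activations of a gate
soft = 1.0 / (1.0 + np.exp(-x))        # ordinary LSTM gate values
hard = sharp_gate(x)                   # mostly open or closed
```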

AAAI Conference 2018 Conference Paper

Word Attention for Sequence to Sequence Text Understanding

  • Lijun Wu
  • Fei Tian
  • Li Zhao
  • Jianhuang Lai
  • Tie-Yan Liu

Attention mechanism has been a key component in Recurrent Neural Networks (RNNs) based sequence to sequence learning framework, which has been adopted in many text understanding tasks, such as neural machine translation and abstractive summarization. In these tasks, the attention mechanism models how important each part of the source sentence is to generate a target side word. To compute such importance scores, the attention mechanism summarizes the source side information in the encoder RNN hidden states (i.e., h_t), and then builds a context vector for a target side word upon a subsequence representation of the source sentence, since h_t actually summarizes the information of the subsequence containing the first t words in the source sentence. In this paper, we show that an additional attention mechanism called word attention, which builds itself upon word level representations, significantly enhances the performance of sequence to sequence learning. Our word attention can enrich the source side contextual representation by directly promoting the clean word level information in each step. Furthermore, we propose to use contextual gates to dynamically combine the subsequence level and word level contextual information. Experimental results on abstractive summarization and neural machine translation show that word attention significantly improves over strong baselines. In particular, we achieve the state-of-the-art result on the WMT’14 English-French translation task with 12M training data.
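
The two attention paths and the contextual gate can be sketched as below: one attention over RNN hidden states, one over raw word embeddings, combined by a gate. The gate here is a plain sigmoid of the decoder query, which is a simplification of the paper's contextual gate; all tensors are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
S, d = 6, 4                            # source length, model dim
word_emb = rng.standard_normal((S, d)) # raw word-level representations
hidden = rng.standard_normal((S, d))   # encoder RNN hidden states h_t
query = rng.standard_normal(d)         # decoder state for one target word

# standard attention over hidden states vs. word attention over embeddings
ctx_h = softmax(hidden @ query) @ hidden
ctx_w = softmax(word_emb @ query) @ word_emb

# a contextual gate combining the two context vectors (simplified form)
g = 1.0 / (1.0 + np.exp(-query))
context = g * ctx_h + (1.0 - g) * ctx_w
```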

NeurIPS Conference 2017 Conference Paper

Deliberation Networks: Sequence Generation Beyond One-Pass Decoding

  • Yingce Xia
  • Fei Tian
  • Lijun Wu
  • Jianxin Lin
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

The encoder-decoder framework has achieved promising progress for many sequence generation tasks, including machine translation, text summarization, dialog systems, image captioning, etc. Such a framework adopts a one-pass forward process while decoding and generating a sequence, but lacks the deliberation process: a generated sequence is directly used as final output without further polishing. However, deliberation is a common behavior in daily human life, such as reading news and writing papers, articles, or books. In this work, we introduce the deliberation process into the encoder-decoder framework and propose deliberation networks for sequence generation. A deliberation network has two levels of decoders, where the first-pass decoder generates a raw sequence and the second-pass decoder polishes and refines the raw sentence with deliberation. Since the second-pass deliberation decoder has global information about what the sequence to be generated might be, it has the potential to generate a better sequence by looking into future words in the raw sentence. Experiments on neural machine translation and text summarization demonstrate the effectiveness of the proposed deliberation networks. On the WMT 2014 English-to-French translation task, our model establishes a new state-of-the-art BLEU score of 41.5.
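
The two-pass structure can be shown with stand-in decoders (plain functions here, not the paper's neural models): the second pass receives the entire first-pass draft, so it can use global information, including "future" tokens, to polish the output.

```python
def first_pass(src):
    # draft decoder: a deliberately noisy "translation" with a repeat
    return [src[0].upper(), src[0].upper(), src[1].upper()]

def second_pass(draft):
    # refinement decoder: sees the whole draft at once, so it can use
    # global information (here: drop adjacent repeats) to polish it
    out = []
    for tok in draft:
        if not out or out[-1] != tok:
            out.append(tok)
    return out

draft = first_pass(["hello", "world"])   # ["HELLO", "HELLO", "WORLD"]
final = second_pass(draft)               # ["HELLO", "WORLD"]
```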

AAAI Conference 2015 Conference Paper

Generalization Analysis for Game-Theoretic Machine Learning

  • Haifang Li
  • Fei Tian
  • Wei Chen
  • Tao Qin
  • Zhi-Ming Ma
  • Tie-Yan Liu

For Internet applications like sponsored search, caution needs to be taken when using machine learning to optimize their mechanisms (e.g., auctions), since self-interested agents in these applications may change their behaviors (and thus the data distribution) in response to the mechanisms. To tackle this problem, a framework called game-theoretic machine learning (GTML) was recently proposed, which first learns a Markov behavior model to characterize agents' behaviors, and then learns the optimal mechanism by simulating agents' behavior changes in response to the mechanism. While GTML has demonstrated practical success, its generalization analysis is challenging because the behavior data are non-i.i.d. and dependent on the mechanism. To address this challenge, first, we decompose the generalization error for GTML into the behavior learning error and the mechanism learning error; second, for the behavior learning error, we obtain novel non-asymptotic error bounds for both parametric and non-parametric behavior learning methods; third, for the mechanism learning error, we derive a uniform convergence bound based on a new concept called the nested covering number of the mechanism space and the generalization analysis techniques developed for mixing sequences.

IJCAI Conference 2015 Conference Paper

Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective

  • Yitan Li
  • Linli Xu
  • Fei Tian
  • Liang Jiang
  • Xiaowei Zhong
  • Enhong Chen

Recently, significant advances have been witnessed in the area of distributed word representations based on neural networks, which are also known as word embeddings. Among the new word embedding models, skip-gram negative sampling (SGNS) in the word2vec toolbox has attracted much attention due to its simplicity and effectiveness. However, the principles behind SGNS are still not well understood, except for a recent work that explains SGNS as an implicit matrix factorization of the pointwise mutual information (PMI) matrix. In this paper, we provide a new perspective for further understanding SGNS. We point out that SGNS is essentially a representation learning method, which learns to represent the co-occurrence vector for a word. Based on the representation learning view, SGNS is in fact an explicit matrix factorization (EMF) of the words' co-occurrence matrix. Furthermore, extended supervised word embedding can be established based on our proposed representation learning view.
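
The matrix-factorization view of word embeddings that the abstract builds on can be made concrete on a toy corpus: form a PMI matrix from co-occurrence counts and factorize it with SVD to get dense vectors (this sketches the implicit-factorization baseline the abstract cites, not the paper's EMF algorithm; the counts are made up).

```python
import numpy as np

C = np.array([[0., 4., 1.],
              [4., 0., 3.],
              [1., 3., 0.]])           # symmetric co-occurrence counts (toy)
total = C.sum()
pw = C.sum(axis=1) / total             # marginal word probabilities
with np.errstate(divide="ignore"):
    # PMI(i, j) = log p(i, j) / (p(i) p(j)); zero counts give -inf
    pmi = np.log((C / total) / np.outer(pw, pw))
ppmi = np.maximum(pmi, 0.0)            # positive PMI, common in practice
U, S, Vt = np.linalg.svd(ppmi)
dim = 2
vectors = U[:, :dim] * np.sqrt(S[:dim])  # rank-2 word embeddings
```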

AAAI Conference 2014 Conference Paper

Agent Behavior Prediction and Its Generalization Analysis

  • Fei Tian
  • Haifang Li
  • Wei Chen
  • Tao Qin
  • Enhong Chen
  • Tie-Yan Liu

Machine learning algorithms have been applied to predict agent behaviors in real-world dynamic systems, such as advertiser behaviors in sponsored search and worker behaviors in crowdsourcing. Behavior data in these systems are generated by live agents: once systems change due to the adoption of prediction models learnt from behavior data, agents will observe and respond to these changes by changing their own behaviors accordingly. Therefore, the evolving behavior data will not be identically and independently distributed, posing great challenges to theoretical analysis. To tackle this challenge, in this paper, we propose to use Markov Chain in Random Environments (MCRE) to describe the behavior data, and perform generalization analysis of machine learning algorithms on its basis. We propose a novel technique that transforms the original time-variant MCRE into a higher-dimensional time-homogeneous Markov chain, which is easier to deal with. We prove the convergence of the new Markov chain when time approaches infinity. Then we obtain a generalization bound for the machine learning algorithms on the behavior data generated by the new Markov chain. To the best of our knowledge, this is the first work that performs the generalization analysis on data generated by complex processes in real-world dynamic systems.

AAAI Conference 2014 Conference Paper

Learning Deep Representations for Graph Clustering

  • Fei Tian
  • Bin Gao
  • Qing Cui
  • Enhong Chen
  • Tie-Yan Liu

Recently deep learning has been successfully adopted in many applications such as speech recognition and image classification. In this work, we explore the possibility of employing deep learning in graph clustering. We propose a simple method, which first learns a nonlinear embedding of the original graph by a stacked autoencoder, and then runs the k-means algorithm on the embedding to obtain the clustering result. We show that this simple method has a solid theoretical foundation, due to the similarity between the autoencoder and spectral clustering in terms of what they actually optimize. Then, we demonstrate that the proposed method is more efficient and flexible than spectral clustering. First, the computational complexity of the autoencoder is much lower than that of spectral clustering: the former can be linear in the number of nodes in a sparse graph while the latter is super-quadratic due to eigenvalue decomposition. Second, when an additional sparsity constraint is imposed, we can simply employ the sparse autoencoder developed in the deep learning literature; whereas it is not straightforward to implement a sparse spectral method. The experimental results on various graph datasets show that the proposed method significantly outperforms conventional spectral clustering, which clearly indicates the effectiveness of deep learning in graph clustering.
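
The embed-then-cluster pipeline can be sketched end to end on a tiny graph. The stacked autoencoder is replaced here by a single linear autoencoder layer trained with plain gradient descent, and k-means is a minimal hand-rolled version; both are simplifications for illustration, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

# two 3-node cliques joined by a single edge (nodes 0-2 vs. 3-5)
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
X = A / A.sum(axis=1, keepdims=True)       # row-normalized adjacency

# one linear autoencoder layer trained by gradient descent
d = 2
W = rng.standard_normal((6, d)) * 0.1      # encoder
V = rng.standard_normal((d, 6)) * 0.1      # decoder
for _ in range(2000):
    Z = X @ W                              # encode
    E = Z @ V - X                          # reconstruction error
    W -= 0.05 * (2 * X.T @ E @ V.T)        # gradient of ||E||^2 wrt W
    V -= 0.05 * (2 * Z.T @ E)              # gradient of ||E||^2 wrt V

def kmeans(Z, init, iters=20):
    """Tiny k-means; `init` gives row indices of the initial centers."""
    centers = Z[init].copy()
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(len(init)):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

labels = kmeans(X @ W, init=[0, 5])        # cluster the learned embedding
```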