Author name cluster

Xing Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers

2 author rows

EAAI Journal 2026 Journal Article

Most complex networks, if not all, inherently possess community and hierarchical structure. The hierarchy between nodes within these communities provides a more refined perspective for network analysis and optimization compared to the mesoscale community structure. To this end, we introduce a novel method for Local Community Division based on Authority Hierarchy (LCDAH). Our method advances network data mining by constructing an authority hierarchy graph -- a directed structure that explicitly models pairwise authority relationships. Within this graph, densely connected core nodes are efficiently identified at its apex and serve as co-leaders for community formation; communities are subsequently assigned to each node by traversing downward from these cores through the graph. The method not only detects community boundaries with high accuracy, outperforming benchmarks on six real-world networks, but also reveals the internal hierarchical structure, offering insights beyond mere partitioning. We demonstrate its utility in two data mining applications: image clustering via network transformation and analysis of an international trade network, validating its effectiveness in modeling complex systems.

Details DOI

AAAI Conference 2026 Conference Paper

Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency

Zihan Li
Wei Sun
Jing Hu
Jianhua Yin
Xing Wang
Erwei Yin
Jianlong Wu

While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model's task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals are then used to feed back the efficient, joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Jinlai Liu
Jian Han
Bin Yan
Hui Wu
Fengda Zhu
Xing Wang
Yi Jiang
Bingyue Peng

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long-duration video synthesis via straightforward temporal autoregression. Through extensive experiments, InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10$\times$ faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

PDF Details

NeurIPS Conference 2025 Conference Paper

LABridge: Text–Image Latent Alignment Framework via Mean-Conditioned OU Process

Huiyang Shao
Xin Xia
Yuxi Ren
Xing Wang
Xuefeng Xiao

Diffusion models have emerged as the state of the art in image synthesis. However, they often suffer from semantic instability and slow iterative denoising. We introduce the Latent Alignment Framework (LABridge), a novel text–image latent alignment framework via an Ornstein–Uhlenbeck (OU) process, which explicitly preserves and aligns textual and visual semantics in an aligned latent space. LABridge employs a Text-Image Alignment Encoder (TIAE) to encode text prompts into structured priors that are directly aligned with image latents. Instead of starting from a homogeneous Gaussian, a Mean-Conditioned OU process smoothly interpolates between these text-conditioned priors and image latents, improving stability and reducing the number of denoising steps. Extensive experiments on standard text-to-image benchmarks show that LABridge achieves better text–image alignment metrics and competitive FID scores compared to leading diffusion baselines. By unifying text and image representations through principled latent alignment, LABridge paves the way for more efficient, semantically consistent, and high-fidelity text-to-image generation.
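For readers unfamiliar with the OU process mentioned in this abstract, the following is a minimal sketch of a generic scalar Ornstein–Uhlenbeck process, dx = θ(μ − x) dt + σ dW, simulated with the Euler–Maruyama scheme. It is not LABridge's formulation: the function names, the scalar state, and all parameter values are assumptions chosen for illustration, with μ standing in for whatever mean the process is conditioned on.

```python
import math
import random


def simulate_ou(x0, mu, theta, sigma, dt, steps, rng):
    """Euler-Maruyama simulation of dx = theta * (mu - x) dt + sigma dW.

    Hypothetical helper: mu plays the role of the conditioning mean.
    """
    x = x0
    path = [x]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift pulls x toward mu; diffusion adds Gaussian noise scaled by sqrt(dt).
        x = x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * noise
        path.append(x)
    return path


def ou_mean(x0, mu, theta, t):
    """Closed-form mean of the OU process at time t."""
    return mu + (x0 - mu) * math.exp(-theta * t)


rng = random.Random(0)
# With sigma = 0 the simulation reduces to the deterministic mean dynamics.
noiseless = simulate_ou(5.0, 1.0, 1.0, 0.0, 0.01, 500, rng)
noisy = simulate_ou(5.0, 1.0, 1.0, 0.3, 0.01, 500, rng)
print(noiseless[-1], ou_mean(5.0, 1.0, 1.0, 5.0), noisy[-1])
```

The noiseless path tracks the closed-form mean, which decays exponentially from the start value toward μ; this mean-reverting behaviour is what lets an OU process interpolate between two endpoints.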

PDF Details

NeurIPS Conference 2025 Conference Paper

VarFlow: Proper Scoring-Rule Diffusion Distillation via Energy Matching

Huiyang Shao
Xin Xia
Yuxi Ren
Xing Wang
Xuefeng Xiao

**Diffusion models** achieve remarkable generative performance but are hampered by slow, iterative inference. Model distillation seeks to train a fast student generator. **Variational Score Distillation (VSD)** offers a principled KL-divergence minimization framework for this task. This method cleverly avoids computing the teacher model's Jacobian, but its student gradient relies on the score of the student's own noisy marginal distribution, $\nabla_{\mathbf{x}_t} \log p_{\phi, t}(\mathbf{x}_t)$. VSD thus requires approximations, such as training an auxiliary network to estimate this score. These approximations can introduce biases, cause training instability, or lead to an incomplete match of the target distribution, potentially focusing on conditional means rather than broader distributional features. We introduce **VarFlow**, a method based on a **Score-Rule Variational Distillation (SRVD)** framework. VarFlow trains a one-step generator $g_{\phi}(\mathbf{z})$ by directly minimizing an energy distance (derived from the strictly proper energy score) between the student's induced noisy data distribution $p_{\phi, t}(\mathbf{x}_t)$ and the teacher's target noisy distribution $q_t(\mathbf{x}_t)$. This objective is estimated entirely using samples from these two distributions. Crucially, VarFlow bypasses the need to compute or approximate the intractable student score. By directly matching the full noisy marginal distributions, VarFlow aims for a more comprehensive and robust alignment between student and teacher, offering an efficient and theoretically grounded path to high-fidelity one-step generation.
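The abstract notes that the energy distance can be estimated from samples alone. As a rough illustration, the sketch below computes the standard sample (V-statistic) estimate of 2 E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖ between two point sets. It is not VarFlow's training objective: the helper names, the Gaussian toy data, and the sample sizes are assumptions made for the example.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    total = 0.0
    for x in a:
        for y in b:
            total += math.dist(x, y)
    return total / (len(a) * len(b))


def energy_distance(xs, ys):
    """V-statistic estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def gaussian_samples(rng, n, dim, mean):
    """Hypothetical toy data: n points from an isotropic unit-variance Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
same_a = gaussian_samples(rng, 200, 2, 0.0)
same_b = gaussian_samples(rng, 200, 2, 0.0)
shifted = gaussian_samples(rng, 200, 2, 3.0)

d_same = energy_distance(same_a, same_b)    # two samples of one distribution
d_shift = energy_distance(same_a, shifted)  # samples of different distributions
print(d_same, d_shift)
```

The estimate needs only pairwise distances between samples, so no density or score of either distribution is evaluated; it is small when both sets come from the same distribution and grows as they move apart.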

PDF Details

ICLR Conference 2024 Conference Paper

Forward Learning of Graph Neural Networks

Namyong Park
Xing Wang
Antoine Simoulin
Shuai Yang
Grey Yang
Ryan A. Rossi
Puja Trivedi
Nesreen K. Ahmed

Graph neural networks (GNNs) have achieved remarkable success across a wide range of applications, such as recommendation, drug discovery, and question answering. Behind the success of GNNs lies the backpropagation (BP) algorithm, which is the de facto standard for training deep neural networks (NNs). However, despite its effectiveness, BP imposes several constraints, which are not only biologically implausible, but also limit the scalability, parallelism, and flexibility in learning NNs. Examples of such constraints include storage of neural activities computed in the forward pass for use in the subsequent backward pass, and the dependence of parameter updates on non-local signals. To address these limitations, the forward-forward algorithm (FF) was recently proposed as an alternative to BP in the image classification domain, which trains NNs by performing two forward passes over positive and negative data. Inspired by this advance, we propose ForwardGNN in this work, a new forward learning procedure for GNNs, which avoids the constraints imposed by BP via an effective layer-wise local forward training. ForwardGNN extends the original FF to deal with graph data and GNNs, and makes it possible to operate without generating negative inputs (hence no longer forward-forward). Further, ForwardGNN enables each layer to learn from both the bottom-up and top-down signals without relying on the backpropagation of errors. Extensive experiments on real-world datasets show the effectiveness and generality of the proposed forward graph learning framework. We release our code at https://github.com/facebookresearch/forwardgnn.

Details

NeurIPS Conference 2024 Conference Paper

Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

Yuxi Ren
Xin Xia
Yanzuo Lu
Jiacheng Zhang
Jie Wu
Pan Xie
Xing Wang
Xuefeng Xiao

Recently, a series of diffusion-aware distillation algorithms have emerged to alleviate the computational overhead associated with the multi-step inference process of Diffusion Models (DMs). Current distillation techniques often dichotomize into two distinct aspects: i) ODE Trajectory Preservation; and ii) ODE Trajectory Reformulation. However, these approaches suffer from severe performance degradation or domain shifts. To address these limitations, we propose Hyper-SD, a novel framework that synergistically amalgamates the advantages of ODE Trajectory Preservation and Reformulation, while maintaining near-lossless performance during step compression. Firstly, we introduce Trajectory Segmented Consistency Distillation to progressively perform consistent distillation within pre-defined time-step segments, which facilitates the preservation of the original ODE trajectory from a higher-order perspective. Secondly, we incorporate human feedback learning to boost the performance of the model in a low-step regime and mitigate the performance loss incurred by the distillation process. Thirdly, we integrate score distillation to further improve the low-step generation capability of the model and offer the first attempt to leverage a unified LoRA to support the inference process at all steps. Extensive experiments and user studies demonstrate that Hyper-SD achieves SOTA performance from 1 to 8 inference steps for both SDXL and SD1.5. For example, Hyper-SDXL surpasses SDXL-Lightning by +0.68 in CLIP Score and +0.51 in Aes Score in the 1-step inference.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Improving Gloss-free Sign Language Translation by Reducing Representation Density

Jinhui Ye
Xing Wang
Wenxiang Jiao
Junwei Liang
Hui Xiong

Gloss-free sign language translation (SLT) aims to develop well-performing SLT systems with no requirement for the costly gloss annotations, but currently still lags behind gloss-based approaches significantly. In this paper, we identify a representation density problem that could be a bottleneck in restricting the performance of gloss-free SLT. Specifically, the representation density problem describes that the visual representations of semantically distinct sign gestures tend to be closely packed together in feature space, which makes gloss-free methods struggle with distinguishing different sign gestures and suffer from a sharp performance drop. To address the representation density problem, we introduce a simple but effective contrastive learning strategy, namely SignCL, which encourages gloss-free models to learn more discriminative feature representation in a self-supervised manner. Our experiments demonstrate that the proposed SignCL can significantly reduce the representation density and improve performance across various translation frameworks. Specifically, SignCL achieves a significant improvement in BLEU score for the Sign Language Transformer and GFSLT-VLP on the CSL-Daily dataset by 39% and 46%, respectively, without any increase of model parameters. Compared to Sign2GPT, a state-of-the-art method based on large-scale pre-trained vision and language models, SignCL achieves better performance with only 35% of its parameters. We will release our code and model to facilitate further research.

PDF Details DOI

TIST Journal 2024 Journal Article

KGDA: A Knowledge Graph Driven Decomposition Approach for Cellular Traffic Prediction

Jiahui Gong
Tong Li
Huandong Wang
Yu Liu
Xing Wang
Zhendong Wang
Chao Deng
Junlan Feng

Understanding and accurately predicting cellular traffic data is vital for communication operators and device users, as it facilitates efficient resource allocation and ensures superior service quality. However, large-scale cellular traffic data forecasting remains challenging due to intricate temporal variations and complex spatial relationships. This article proposes a Knowledge Graph Driven Decomposition Approach (KGDA) for precise cellular traffic prediction. The KGDA breaks down the impact of static environmental factors and dynamic autocorrelations of cellular traffic time series, enabling the capture of overall traffic changes and understanding of traffic dependence on past values. Specifically, we propose an urban knowledge graph to capture the static environmental context of base stations, mapping these entities into the same latent space while retaining static environmental knowledge. The cellular traffic is divided into a regular pattern and fluctuating residual components, with the KGDA comprising four modules: a Knowledge Graph Representation Learning model, a traffic regular pattern prediction module, a traffic residual dynamic prediction module, and an attentional fusion module. The first leverages graph neural networks to extract spatial contexts and predict regular patterns, the second utilizes the Bi-directional Long Short-Term Memory (Bi-LSTM) model to capture autocorrelations of traffic time series, and the final module integrates the patterns and residuals to produce the final prediction result. Comprehensive experiments demonstrate that our proposed model outperforms state-of-the-art models by more than 10% in forecasting cellular traffic.

Details DOI

AAAI Conference 2024 Conference Paper

Reliable Data Generation and Selection for Low-Resource Relation Extraction

Junjie Yu
Xing Wang
Wenliang Chen

Automated construction of annotated data holds significant importance in Relation Extraction (RE) tasks due to the hardness and cost of human annotation. In this work, we propose Self-RDGS, a method for Self-supervised Reliable Data Generation and Selection in low-resource RE tasks. At first, we fully utilize the knowledge of triplets as prompts to generate sentences by employing the Large Language Models (LLMs). Since the auto-generated data contains noise, we then propose a ranking-based data selection method to select reliable sentences. Finally, we integrate the data selection and RE model training within a self-supervised iterative framework. Through experimentation on three datasets with low-resource settings, we demonstrate the effectiveness of our proposed approach in constructing annotated data and achieving noteworthy improvements in comparison to multiple baselines. Code, data and models are available at https://github.com/jjyunlp/GenerationRE.

PDF Details DOI

TCS Journal 2023 Journal Article

Complexity and approximation algorithms for two parallel dedicated machine scheduling with conflict constraints

An Zhang
Liang Zhang
Yong Chen
Guangting Chen
Xing Wang

We investigate two parallel dedicated machine scheduling with conflict constraints. The problem of minimizing the makespan has been shown to be NP-hard in the strong sense under the assumption that the processing sequence of jobs on one machine is given and fixed a priori. The problem without any fixed sequence was previously recognized as weakly NP-hard. In this paper, we first present a 9/5-approximation algorithm for the problem with a fixed sequence. Then we show that the tight approximation ratios of the algorithm are 7/4 and 5/3 for two subproblems which remain strongly NP-hard. We also present an improved algorithm with approximation ratio 3 − √2 ≈ 1.586 for one subproblem. Finally, we prove that the problem without any fixed sequence is actually strongly NP-hard, and design a 5/3-approximation algorithm.

Details DOI

NeurIPS Conference 2023 Conference Paper

GLEMOS: Benchmark for Instantaneous Graph Learning Model Selection

Namyong Park
Ryan Rossi
Xing Wang
Antoine Simoulin
Nesreen K. Ahmed
Christos Faloutsos

The choice of a graph learning (GL) model (i.e., a GL algorithm and its hyperparameter settings) has a significant impact on the performance of downstream tasks. However, selecting the right GL model becomes increasingly difficult and time consuming as more and more GL models are developed. Accordingly, it is of great significance and practical value to equip users of GL with the ability to perform a near-instantaneous selection of an effective GL model without manual intervention. Despite the recent attempts to tackle this important problem, there has been no comprehensive benchmark environment to evaluate the performance of GL model selection methods. To bridge this gap, we present GLEMOS in this work, a comprehensive benchmark for instantaneous GL model selection that makes the following contributions. (i) GLEMOS provides extensive benchmark data for fundamental GL tasks, i.e., link prediction and node classification, including the performances of 366 models on 457 graphs on these tasks. (ii) GLEMOS designs multiple evaluation settings, and assesses how effectively representative model selection techniques perform in these different settings. (iii) GLEMOS is designed to be easily extended with new models, new graphs, and new performance records. (iv) Based on the experimental results, we discuss the limitations of existing approaches and highlight future research directions. To promote research on this significant problem, we make the benchmark data and code publicly available at https://namyongpark.github.io/glemos.

PDF Details

JBHI Journal 2023 Journal Article

ProtoHAR: Prototype Guided Personalized Federated Learning for Human Activity Recognition

Dongzhou Cheng
Lei Zhang
Can Bu
Xing Wang
Hao Wu
Aiguo Song

Federated Learning (FL) has recently attracted great interest in sensor-based human activity recognition (HAR) tasks. However, in real-world environments, sensor data on devices is non-independently and identically distributed (Non-IID), e.g., activity data recorded by most devices is sparse, and sensor data distribution for each client may be inconsistent. As a result, the traditional FL methods in the heterogeneous environment may incur a drifted global model that causes slow convergence and a heavy communication burden. Although some FL methods are gradually being applied to HAR, they are designed for overly ideal scenarios and do not address such Non-IID problems in the real-world setting. It is still a question whether they can be applied to cross-device FL. To tackle this challenge, we propose ProtoHAR, a prototype-guided FL framework for HAR, which aims to decouple the representation and classifier in the heterogeneous FL setting efficiently. It leverages the global prototype to correct the activity feature representation to make the prototype knowledge flow among clients without leaking privacy while solving a better classifier to avoid excessive drift of the local model in personalized training. Extensive experiments are conducted on four publicly available datasets: USC-HAD, UNIMIB-SHAR, PAMAP2, and HARBOX, which are collected in both controlled environments and real-world scenarios. The results show that compared with the state-of-the-art FL algorithms, ProtoHAR achieves the best performance and faster convergence speed in HAR datasets.

Details DOI

TCS Journal 2021 Journal Article

Improved hardness and approximation results for single allocation hub location problems

Xing Wang
Guangting Chen
Yong Chen
Guohui Lin
Yonghao Wang
An Zhang

Given a metric graph G = (V, E, w) and an integer k, we aim to find a single allocation k-hub location, which is a spanning subgraph consisting of a clique of size k such that every node outside of the clique is adjacent to exactly one node inside the clique. For various objective functions studied in the literature, we present improved hardness and approximation results.

Details DOI

AAAI Conference 2020 Conference Paper

Neuron Interaction Based Representation Composition for Neural Machine Translation

Jian Li
Xing Wang
Baosong Yang
Shuming Shi
Michael R. Lyu
Zhaopeng Tu

Recent NLP studies reveal that substantial linguistic information can be attributed to single neurons, i.e., individual dimensions of the representation vectors. We hypothesize that modeling strong interactions among neurons helps to better capture complex information by composing the linguistic properties embedded in individual neurons. Starting from this intuition, we propose a novel approach to compose representations learned by different components in neural machine translation (e.g., multi-layer networks or multi-head attention), based on modeling strong interactions among neurons in the representation vectors. Specifically, we leverage bilinear pooling to model pairwise multiplicative interactions among individual neurons, and a low-rank approximation to make the model computationally feasible. We further propose extended bilinear pooling to incorporate first-order representations. Experiments on WMT14 English⇒German and English⇒French translation tasks show that our model consistently improves performances over the SOTA TRANSFORMER baseline. Further analyses demonstrate that our approach indeed captures more syntactic and semantic information as expected.
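To make the low-rank idea concrete, the sketch below uses the common factorization of bilinear pooling, z = Pᵀ((Uᵀx) ∘ (Vᵀy)), and checks it against the full bilinear form z_k = xᵀ W_k y with W_k = Σ_r P[r][k] u_r v_rᵀ. The abstract does not give the paper's exact parameterization, so the matrices U, V, P, their shapes, and the function names here are assumptions for illustration only.

```python
import random


def matvec_t(m, v):
    """Compute m^T v for a matrix m stored as a list of rows (len(v) rows)."""
    cols = len(m[0])
    return [sum(m[i][j] * v[i] for i in range(len(v))) for j in range(cols)]


def low_rank_bilinear(x, y, U, V, P):
    """z = P^T ((U^T x) * (V^T y)), where * is the element-wise product."""
    ux = matvec_t(U, x)
    vy = matvec_t(V, y)
    joint = [a * b for a, b in zip(ux, vy)]
    return matvec_t(P, joint)


def full_bilinear(x, y, U, V, P):
    """Reference form: z_k = x^T W_k y with W_k = sum_r P[r][k] * u_r v_r^T."""
    rank, out = len(P), len(P[0])
    z = []
    for k in range(out):
        total = 0.0
        for i in range(len(x)):
            for j in range(len(y)):
                w_ij = sum(P[r][k] * U[i][r] * V[j][r] for r in range(rank))
                total += x[i] * w_ij * y[j]
        z.append(total)
    return z


rng = random.Random(0)
d, rank, out = 4, 3, 2  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


U, V, P = rand_matrix(d, rank), rand_matrix(d, rank), rand_matrix(rank, out)
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
y = [rng.uniform(-1.0, 1.0) for _ in range(d)]

z_low = low_rank_bilinear(x, y, U, V, P)
z_full = full_bilinear(x, y, U, V, P)
print(z_low, z_full)
```

Both functions return the same vector, but the factored form never materializes the d × d interaction matrices, which is why a low-rank approximation keeps pairwise multiplicative interactions affordable.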

PDF Details

YNICL Journal 2020 Journal Article

Spatiotemporal EEG microstate analysis in drug-free patients with Parkinson's disease

Chunguang Chu
Xing Wang
Lihui Cai
Lei Zhang
Jiang Wang
Chen Liu
Xiaodong Zhu

The clinical diagnosis of Parkinson's disease (PD) is very difficult, especially in the early stage of the disease, because there is no physiological indicator that can be referenced. Drug-free patients with early PD are characterized by clinical symptoms such as impaired motor function and cognitive decline, which are caused by the dysfunction of the brain's dynamic activities. The indicators of brain dysfunction in patients with PD at an early unmedicated condition may provide a valuable basis for the diagnosis of early PD and later treatment. In order to find the spatiotemporal characteristic markers of brain dysfunction in PD, resting-state EEG microstate analysis is used to explore the transient state of the whole brain of 23 drug-free patients with PD on the sub-second timescale compared to 23 healthy controls. EEG microstates reflect a transiently stable brain topological structure with spatiotemporal characteristics, and the spatial characteristic microstate classes and temporal parameters provide insight into the brain's functional activities in PD patients. We further explored the relation between temporal microstate parameters and significant clinical symptoms to determine whether these parameters could be used as a basis for clinically assisted diagnosis. Therefore, we used a general linear model (GLM) to explore the relevance of microstate parameters to clinical scales and multiple patient attributes, and the Wilcoxon rank sum test was used to quantify the linear relation between influencing factors and microstate parameters. Results of microstate analysis revealed that there was a unique spatial microstate in PD that differed from healthy controls, and several other typical microstates had significant differences compared with the normal control group. These differences were reflected in the microstate parameters, such as longer durations and more occurrences of one class of microstates in PD compared with healthy controls. Furthermore, correlation analysis showed that there was a significant correlation between multiple microstate classes' parameters and significant clinical symptoms, including impaired motor function and cognitive decline. These results indicate that we have found multiple quantifiable feature tags that reflect brain dysfunction in the early stage of PD. Importantly, such temporal dynamics in microstates are correlated with clinical scales which represent motor function and cognitive level. The obtained results may deepen our understanding of the brain dysfunction caused by PD, and provide some quantifiable signatures as an auxiliary reference for the early diagnosis of PD.

Details DOI

AAAI Conference 2019 Conference Paper

Context-Aware Self-Attention Networks

Baosong Yang
Jian Li
Derek F. Wong
Lidia S. Chao
Xing Wang
Zhaopeng Tu

The self-attention model has shown its flexibility in parallel computation and its effectiveness in modeling both long- and short-term dependencies. However, it calculates the dependencies between representations without considering the contextual information, which has proven useful for modeling dependencies among neural representations in various natural language tasks. In this work, we focus on improving self-attention networks through capturing the richness of context. To maintain the simplicity and flexibility of the self-attention networks, we propose to contextualize the transformations of the query and key layers, which are used to calculate the relevance between elements. Specifically, we leverage the internal representations that embed both global and deep contexts, thus avoiding reliance on external resources. Experimental results on WMT14 English⇒German and WMT17 Chinese⇒English translation tasks demonstrate the effectiveness and universality of the proposed methods. Furthermore, we conducted extensive analyses to quantify how the context vectors participate in the self-attention model.

PDF Details

AAAI Conference 2019 Conference Paper

Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement

Zi-Yi Dou
Zhaopeng Tu
Xing Wang
Longyue Wang
Shuming Shi
Tong Zhang

With the promising progress of deep neural networks, layer aggregation has been used to fuse information across layers in various fields, such as computer vision and machine translation. However, most of the previous methods combine layers in a static fashion in that their aggregation strategy is independent of specific hidden states. Inspired by recent progress on capsule networks, in this paper we propose to use routing-by-agreement strategies to aggregate layers dynamically. Specifically, the algorithm learns the probability of a part (individual layer representations) assigned to a whole (aggregated representations) in an iterative way and combines parts accordingly. We implement our algorithm on top of the state-of-the-art neural machine translation model TRANSFORMER and conduct experiments on the widely-used WMT14 English⇒German and WMT17 Chinese⇒English translation datasets. Experimental results across language pairs show that the proposed approach consistently outperforms the strong baseline model and a representative static aggregation model.

PDF Details

AAAI Conference 2019 Conference Paper

Multiple Independent Subspace Clusterings

Xing Wang
Jun Wang
Carlotta Domeniconi
Guoxian Yu
Guoqiang Xiao
Maozu Guo

Multiple clustering aims at discovering diverse ways of organizing data into clusters. Despite the progress made, it’s still a challenge for users to analyze and understand the distinctive structure of each output clustering. To ease this process, we consider diverse clusterings embedded in different subspaces, and analyze the embedding subspaces to shed light on the structure of each clustering. To this end, we provide a two-stage approach called MISC (Multiple Independent Subspace Clusterings). In the first stage, MISC uses independent subspace analysis to seek multiple statistically independent (i.e., non-redundant) subspaces, and determines the number of subspaces via the minimum description length principle. In the second stage, to account for the intrinsic geometric structure of samples embedded in each subspace, MISC performs graph regularized semi-nonnegative matrix factorization to explore clusters. It additionally integrates the kernel trick into matrix factorization to handle non-linearly separable clusters. Experimental results on synthetic datasets show that MISC can find different interesting clusterings from the sought independent subspaces, and it also outperforms other related and competitive approaches on real-world datasets.

PDF Details

AAAI Conference 2017 Conference Paper

Neural Machine Translation Advised by Statistical Machine Translation

Xing Wang
Zhengdong Lu
Zhaopeng Tu
Hang Li
Deyi Xiong
Min Zhang

Neural Machine Translation (NMT) is a new approach to machine translation that has made great progress in recent years. However, recent studies show that NMT generally produces fluent but inadequate translations (Tu et al. 2016b; 2016a; He et al. 2016; Tu et al. 2017). This is in contrast to conventional Statistical Machine Translation (SMT), which usually yields adequate but non-fluent translations. It is natural, therefore, to leverage the advantages of both models for better translations, and in this work we propose to incorporate an SMT model into the NMT framework. More specifically, at each decoding step, SMT offers additional recommendations of generated words based on the decoding information from NMT (e.g., the generated partial translation and attention history). Then we employ an auxiliary classifier to score the SMT recommendations and a gating function to combine the SMT recommendations with NMT generations, both of which are jointly trained within the NMT architecture in an end-to-end manner. Experimental results on Chinese-English translation show that the proposed approach achieves significant and consistent improvements over state-of-the-art NMT and SMT systems on multiple NIST test sets.
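One simple way to picture a gate that combines two word distributions is linear interpolation, sketched below. The abstract does not state the paper's gating formula, so the convex combination, the fixed gate value (which in a trained system would be predicted per decoding step), the function name, and the toy vocabularies are all assumptions for illustration.

```python
def combine_with_gate(p_nmt, p_smt, gate):
    """Interpolate two word distributions: (1 - gate) * NMT + gate * SMT.

    Hypothetical helper; gate is a scalar in [0, 1]. Words missing from
    one distribution are treated as having probability 0 there.
    """
    words = set(p_nmt) | set(p_smt)
    return {
        w: (1.0 - gate) * p_nmt.get(w, 0.0) + gate * p_smt.get(w, 0.0)
        for w in words
    }


# Toy distributions over the next target word (made-up numbers).
p_nmt = {"the": 0.6, "a": 0.3, "cat": 0.1}
p_smt = {"cat": 0.7, "dog": 0.3}

combined = combine_with_gate(p_nmt, p_smt, gate=0.25)
for word in sorted(combined):
    print(word, round(combined[word], 3))
```

Because both inputs sum to one and the gate is a convex weight, the combined scores remain a valid probability distribution, and a word recommended only by SMT (here "dog") can still receive non-zero mass.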

PDF Details