Arrow Research search

Author name cluster

Lijun Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI Conference 2025 Conference Paper

AgentMixer: Multi-Agent Correlated Policy Factorization

  • Zhiyuan Li
  • Wenshuai Zhao
  • Lijun Wu
  • Joni Pajarinen

In multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) methods typically assume that agents make decisions based on their local observations independently, which may fail to produce the correlated joint policy needed for coordination. Coordination can be explicitly encouraged during training, and individual policies can be trained to imitate the correlated joint policy. However, this may lead to an asymmetric learning failure due to the observation mismatch between the joint and individual policies. Inspired by the concept of correlated equilibrium, we introduce a strategy modification called AgentMixer that allows agents to correlate their policies. AgentMixer combines individual partially observable policies into a joint fully observable policy non-linearly. To enable decentralized execution, we introduce Individual-Global-Consistency to guarantee mode consistency during joint training of the centralized and decentralized policies, and prove that AgentMixer converges to an ϵ-approximate Correlated Equilibrium. On the Multi-Agent MuJoCo, SMAC-v2, Matrix Game, and Predator-Prey benchmarks, AgentMixer outperforms or matches state-of-the-art methods.

IJCAI Conference 2025 Conference Paper

Explainable Graph Neural Networks via Structural Externalities

  • Lijun Wu
  • Dong Hao
  • Zhiyi Fan

Graph Neural Networks (GNNs) have achieved outstanding performance across a wide range of graph-related tasks. However, their "black-box" nature poses significant challenges to their explainability, and existing methods often fail to effectively capture the intricate interaction patterns among nodes within the network. In this work, we propose a novel explainability framework, GraphEXT, which leverages cooperative game theory and the concept of social externalities. GraphEXT partitions graph nodes into coalitions, decomposing the original graph into independent subgraphs. By integrating graph structure as an externality and incorporating the Shapley value under externalities, GraphEXT quantifies node importance through their marginal contributions to GNN predictions as the nodes transition between coalitions. Unlike traditional Shapley value-based methods that primarily focus on node attributes, our GraphEXT places greater emphasis on the interactions among nodes and the impact of structural changes on GNN predictions. Experimental studies on both synthetic and real-world datasets show that GraphEXT outperforms existing baseline methods in terms of fidelity across diverse GNN architectures, significantly enhancing the explainability of GNN models.
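
As a rough illustration of the Shapley-value machinery this abstract builds on (plain Monte Carlo Shapley sampling, not GraphEXT's externality-aware variant), here is a minimal sketch; `score_fn` is a hypothetical callable that evaluates the GNN prediction on a node-induced subgraph:

```python
import random

def shapley_node_importance(nodes, score_fn, num_samples=200, seed=0):
    """Monte Carlo estimate of each node's Shapley value.

    score_fn(frozenset_of_nodes) -> float is assumed to evaluate the
    GNN prediction on the subgraph induced by those nodes.
    """
    rng = random.Random(seed)
    phi = {v: 0.0 for v in nodes}
    for _ in range(num_samples):
        order = list(nodes)
        rng.shuffle(order)
        coalition = set()
        prev = score_fn(frozenset(coalition))
        for v in order:
            coalition.add(v)
            cur = score_fn(frozenset(coalition))
            phi[v] += cur - prev  # marginal contribution of v in this ordering
            prev = cur
    return {v: s / num_samples for v, s in phi.items()}

# Toy usage: the "prediction" is just a concave function of coalition size.
if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    score = lambda s: float(len(s)) ** 0.5
    print(shapley_node_importance(nodes, score))
```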

ICLR Conference 2025 Conference Paper

Fast and Accurate Blind Flexible Docking

  • Zizhuo Zhang
  • Lijun Wu
  • Kaiyuan Gao
  • Jiangchao Yao
  • Tao Qin
  • Bo Han 0003

Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) a pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios; (2) a ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states; (3) a pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208×) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.

ICLR Conference 2025 Conference Paper

Hot-pluggable Federated Learning: Bridging General and Personalized FL via Dynamic Selection

  • Lei Shen
  • Zhenheng Tang
  • Lijun Wu
  • Yonggang Zhang 0003
  • Xiaowen Chu 0001
  • Tao Qin
  • Bo Han 0003

Personalized federated learning (PFL) achieves high performance by assuming that clients only encounter test data locally, an assumption that does not hold in many generic federated learning (GFL) scenarios. In this work, we theoretically show that personalized models (PMs) can be used to enhance GFL with a new learning problem named Selective FL (SFL), which involves optimizing PFL together with model selection. However, storing and selecting whole models requires impractical computation and communication costs. To practically solve SFL, inspired by model components that attempt to edit a sub-model for specific purposes, we design an efficient and effective framework named Hot-Pluggable Federated Learning (HPFL). Specifically, clients individually train personalized plug-in modules based on a shared backbone and upload them, each with a plug-in marker, to a modular store on the server. At the inference stage, an accurate selection algorithm allows clients to identify and retrieve suitable plug-in modules from the modular store to enhance their generalization performance on the target data distribution. Furthermore, we provide differential privacy protection during the selection with a theoretical guarantee. Our comprehensive experiments and ablation studies demonstrate that HPFL significantly outperforms state-of-the-art GFL and PFL algorithms. Additionally, we empirically show HPFL's remarkable potential to resolve other practical FL problems such as continual federated learning and discuss its possible applications in one-shot FL, anarchic FL, and FL plug-in markets. Our work is the first attempt towards improving GFL performance through a selection mechanism with personalized plug-ins.
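
The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of one plausible plug-in selection rule, assuming (hypothetically) that each stored marker is the mean backbone feature of the plug-in owner's data:

```python
import numpy as np

def select_plugin(client_feats: np.ndarray, store: dict[str, np.ndarray]) -> str:
    """Pick the plug-in whose stored marker is closest to the client's
    mean backbone feature. client_feats: (n, d) features extracted by the
    shared backbone on the client's target data; store maps
    plug-in id -> (d,) marker vector."""
    query = client_feats.mean(axis=0)
    return min(store, key=lambda pid: np.linalg.norm(store[pid] - query))

# Toy usage with synthetic features.
rng = np.random.default_rng(0)
store = {"client_a": rng.normal(0, 1, 8), "client_b": rng.normal(3, 1, 8)}
feats = rng.normal(3, 1, (32, 8))
print(select_plugin(feats, store))  # likely "client_b"
```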

NeurIPS Conference 2025 Conference Paper

Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

  • Honglin Lin
  • Qizhi Pei
  • Zhuoshi Pan
  • Yu Li
  • Xin Gao
  • Juntao Li
  • Conghui He
  • Lijun Wu

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strongly competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
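
A minimal sketch of the execution-based filtering step described above, assuming candidates are self-contained Python programs and that clean termination within a timeout is the keep criterion (the paper's rule-based filters are not modeled here):

```python
import os
import subprocess
import sys
import tempfile

def executes_ok(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run a candidate code-CoT in a subprocess; keep it only if it
    terminates cleanly within the timeout. Returns (ok, stdout)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.returncode == 0, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return False, ""
    finally:
        os.unlink(path)

candidates = ["print(sum(range(10)))", "print(unknown_var)"]
kept = [c for c in candidates if executes_ok(c)[0]]
print(len(kept), "of", len(candidates), "candidates pass execution")
```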

TMLR Journal 2025 Journal Article

SE3Set: Harnessing Equivariant Hypergraph Neural Networks for Molecular Representation Learning

  • Hongfei Wu
  • Lijun Wu
  • Guoqing Liu
  • Zhirong Liu
  • Bin Shao
  • Zun Wang

In this paper, we develop SE3Set, an SE(3) equivariant hypergraph neural network architecture tailored for advanced molecular representation learning. Hypergraphs are not merely an extension of traditional graphs; they are pivotal for modeling high-order relationships, a capability that conventional equivariant graph-based methods lack due to their inherent limitations in representing intricate many-body interactions. To achieve this, we first construct hypergraphs by proposing a new fragmentation method that considers both chemical and three-dimensional spatial information of the molecular system. We then design SE3Set, which incorporates equivariance into the hypergraph neural network. This ensures that the learned molecular representations are invariant to spatial transformations, thereby providing robustness essential for the accurate prediction of molecular properties. SE3Set has shown performance on par with state-of-the-art (SOTA) models for small molecule datasets like QM9 and MD17. It demonstrates outstanding performance on the MD22 dataset, achieving a remarkable ~20% improvement in accuracy across all molecules. Furthermore, on the OE62 dataset, SE3Set outperforms all short-range models. We also conducted a detailed analysis of OE62, highlighting the prevalence of complex many-body interactions in large molecules. This exceptional performance of SE3Set across diverse molecular structures underscores its transformative potential in computational chemistry, offering a route to more accurate and physically nuanced modeling. The code of this work is available at https://github.com/Navantock/SE3Set.

AAAI Conference 2024 Conference Paper

Backpropagation Through Agents

  • Zhiyuan Li
  • Wenshuai Zhao
  • Lijun Wu
  • Joni Pajarinen

A fundamental challenge in multi-agent reinforcement learning (MARL) is to learn the joint policy in an extremely large search space, which grows exponentially with the number of agents. Moreover, fully decentralized policy factorization significantly restricts the search space, which may lead to sub-optimal policies. In contrast, the auto-regressive joint policy can represent a much richer class of joint policies by factorizing the joint policy into the product of a series of conditional individual policies. While such factorization introduces the action dependency among agents explicitly in sequential execution, it does not take full advantage of the dependency during learning. In particular, the subsequent agents do not give the preceding agents feedback about their decisions. In this paper, we propose a new framework, Back-Propagation Through Agents (BPTA), that directly accounts for both agents' own policy updates and the learning of their dependent counterparts. This is achieved by propagating the feedback through action chains. With the proposed framework, our Bidirectional Proximal Policy Optimisation (BPPO) outperforms the state-of-the-art methods. Extensive experiments on matrix games, StarCraft II v2, Multi-agent MuJoCo, and Google Research Football demonstrate the effectiveness of the proposed method.

NeurIPS Conference 2024 Conference Paper

Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models

  • Zun Wang
  • Chang Liu
  • Nianlong Zou
  • He Zhang
  • Xinran Wei
  • Lin Huang
  • Lijun Wu
  • Bin Shao

In this study, we introduce a unified neural network architecture, the Deep Equilibrium Density Functional Theory Hamiltonian (DEQH) model, which incorporates Deep Equilibrium Models (DEQs) for predicting Density Functional Theory (DFT) Hamiltonians. The DEQH model inherently captures the self-consistent nature of the Hamiltonian, a critical aspect often overlooked by traditional machine learning approaches to Hamiltonian prediction. By employing DEQ within our model architecture, we circumvent the need for DFT calculations during the training phase to introduce the Hamiltonian's self-consistency, thus addressing computational bottlenecks associated with large or complex systems. We propose a versatile framework that combines DEQ with off-the-shelf machine learning models for predicting Hamiltonians. When benchmarked on the MD17 and QH9 datasets, DEQHNet, an instantiation of the DEQH framework, demonstrates a significant improvement in prediction accuracy. Beyond a predictor, the DEQH model is a Hamiltonian solver, in the sense that it uses the fixed-point solving capability of the deep equilibrium model to iteratively solve for the Hamiltonian. Ablation studies of DEQHNet further elucidate the network's effectiveness, offering insights into the potential of DEQ-integrated networks for Hamiltonian learning. We open-source our implementation at https://github.com/Zun-Wang/DEQHNet.
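
For intuition, a minimal sketch of the fixed-point solving that a deep equilibrium layer performs, run here on a toy contraction standing in for the learned update (the function `f` below is hypothetical, not DEQHNet):

```python
import numpy as np

def fixed_point(f, x, h0, tol=1e-8, max_iter=500, damping=0.5):
    """Damped fixed-point iteration h <- (1 - a) * h + a * f(h, x),
    the solver pattern a deep equilibrium layer uses at inference time."""
    h = h0
    for _ in range(max_iter):
        h_new = (1 - damping) * h + damping * f(h, x)
        if np.linalg.norm(h_new - h) < tol:
            return h_new
        h = h_new
    return h

# Toy "layer": a contraction whose fixed point depends on the input x.
f = lambda h, x: np.tanh(0.5 * h + x)
x = np.array([0.3, -0.1])
h_star = fixed_point(f, x, np.zeros_like(x))
print(h_star, np.allclose(h_star, f(h_star, x)))  # h* satisfies h = f(h, x)
```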

AAAI Conference 2023 Conference Paper

AMOM: Adaptive Masking over Masking for Conditional Masked Language Model

  • Yisheng Xiao
  • Ruiyang Xu
  • Lijun Wu
  • Juntao Li
  • Tao Qin
  • Tie-Yan Liu
  • Min Zhang

Transformer-based autoregressive (AR) methods have achieved appealing performance for varied sequence-to-sequence generation tasks, e.g., neural machine translation, summarization, and code generation, but suffer from low inference efficiency. To speed up the inference stage, many non-autoregressive (NAR) strategies have been proposed in the past few years. Among them, the conditional masked language model (CMLM) is one of the most versatile frameworks, as it can support many different sequence generation scenarios and achieves very competitive performance on these tasks. In this paper, we further introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder and make the encoder optimization easier. Experiments on 3 different tasks (neural machine translation, summarization, and code generation) with 15 datasets in total confirm that our proposed simple method achieves significant performance improvement over the strong CMLM model. Surprisingly, our proposed model yields state-of-the-art performance on neural machine translation (34.62 BLEU on WMT16 EN to RO, 34.82 BLEU on WMT16 RO to EN, and 34.84 BLEU on IWSLT De to En) and even better performance than the AR Transformer on 7 benchmark datasets with at least a 2.2x speedup. Our code is available on GitHub.
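
A minimal sketch of the confidence-based re-masking step in CMLM-style iterative refinement, which AMOM's strategy adapts; the schedule at the bottom is a hypothetical stand-in for the paper's adaptive masking ratios:

```python
import torch

def remask_low_confidence(token_probs: torch.Tensor, ratio: float) -> torch.Tensor:
    """Return a boolean mask marking the lowest-confidence positions for
    re-prediction in the next refinement iteration. token_probs: (seq_len,)
    probability the model assigned to its own prediction at each position."""
    n_mask = max(1, int(ratio * token_probs.numel()))
    idx = torch.topk(token_probs, n_mask, largest=False).indices
    mask = torch.zeros_like(token_probs, dtype=torch.bool)
    mask[idx] = True
    return mask

probs = torch.tensor([0.95, 0.40, 0.88, 0.15, 0.70])
# Hypothetical adaptive schedule: mask more when average confidence is low.
ratio = float((1.0 - probs.mean()).clamp(0.1, 0.5))
print(remask_low_confidence(probs, ratio))
```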

NeurIPS Conference 2023 Conference Paper

FABind: Fast and Accurate Protein-Ligand Binding

  • Qizhi Pei
  • Kaiyuan Gao
  • Lijun Wu
  • Jinhua Zhu
  • Yingce Xia
  • Shufang Xie
  • Tao Qin
  • Kun He

Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose FABind, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed FABind demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind.

AAAI Conference 2023 Conference Paper

Retrosynthesis Prediction with Local Template Retrieval

  • Shufang Xie
  • Rui Yan
  • Junliang Guo
  • Yingce Xia
  • Lijun Wu
  • Tao Qin

Retrosynthesis, which predicts the reactants of a given target molecule, is an essential task for drug discovery. In recent years, machine learning based retrosynthesis methods have achieved promising results. In this work, we introduce RetroKNN, a local reaction template retrieval method to further boost the performance of template-based systems with non-parametric retrieval. We first build an atom-template store and a bond-template store that contain the local templates in the training data, then retrieve from these stores with a k-nearest-neighbor (KNN) search during inference. The retrieved templates are combined with neural network predictions as the final output. Furthermore, we propose a lightweight adapter to adjust the weights when combining neural network and KNN predictions, conditioned on the hidden representation and the retrieved templates. We conduct comprehensive experiments on two widely used benchmarks, USPTO-50K and USPTO-MIT. In particular, we improve top-1 accuracy by 7.1% on the USPTO-50K dataset and 12.0% on the USPTO-MIT dataset. These results demonstrate the effectiveness of our method.
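
A minimal sketch of combining a parametric prediction with a non-parametric KNN vote; the fixed interpolation weight `lam` stands in for the paper's learned adapter, and the temperature `T` is an assumption:

```python
import torch

def knn_augmented_probs(model_logits, knn_dists, knn_labels, num_classes,
                        lam=0.5, T=1.0):
    """Interpolate a parametric prediction with a nonparametric KNN vote:
    p = lam * p_knn + (1 - lam) * p_model. knn_dists/knn_labels: (k,)."""
    p_model = torch.softmax(model_logits, dim=-1)
    weights = torch.softmax(-knn_dists / T, dim=-1)  # closer neighbors weigh more
    p_knn = torch.zeros(num_classes).index_add_(0, knn_labels, weights)
    return lam * p_knn + (1 - lam) * p_model

logits = torch.tensor([2.0, 0.5, 0.1])     # neural network scores per template
dists = torch.tensor([0.2, 0.9, 1.5])      # distances of retrieved neighbors
labels = torch.tensor([1, 1, 0])           # template ids of those neighbors
print(knn_augmented_probs(logits, dists, labels, num_classes=3))
```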

TMLR Journal 2022 Journal Article

Direct Molecular Conformation Generation

  • Jinhua Zhu
  • Yingce Xia
  • Chang Liu
  • Lijun Wu
  • Shufang Xie
  • Yusong Wang
  • Tong Wang
  • Tao Qin

Molecular conformation generation aims to generate the three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances, or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without these intermediate values has not been fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on the GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have properties (e.g., the HOMO-LUMO gap) closer to those of the ground-truth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at https://github.com/DirectMolecularConfGen/DMCG.
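
As a worked example of a roto-translation-invariant coordinate comparison (Kabsch alignment followed by RMSD; the paper's actual loss may differ in detail, e.g., in how symmetric-atom permutations are handled):

```python
import numpy as np

def kabsch_rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
    """RMSD after optimal rigid alignment (Kabsch), so the value is
    invariant to rotating/translating the predicted conformation.
    pred, ref: (n_atoms, 3) coordinate arrays."""
    p = pred - pred.mean(axis=0)                 # remove translation
    r = ref - ref.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ r)
    d = np.sign(np.linalg.det(u @ vt))           # avoid improper reflection
    rot = u @ np.diag([1.0, 1.0, d]) @ vt        # optimal rotation
    return float(np.sqrt(np.mean(np.sum((p @ rot - r) ** 2, axis=1))))

ref = np.random.default_rng(0).normal(size=(5, 3))
theta = 0.7
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(kabsch_rmsd(ref @ rz.T + 1.0, ref))  # ~0: rotation + shift factored out
```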

NeurIPS Conference 2021 Conference Paper

Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering

  • Weijiang Yu
  • Haoteng Zheng
  • Mengfei Li
  • Lei Ji
  • Lijun Wu
  • Nong Xiao
  • Nan Duan

Recent advances in the video question answering (i.e., VideoQA) task have achieved strong success by following the paradigm of fine-tuning each clip-text pair independently on the pretrained transformer-based model via supervised learning. Intuitively, multiple samples (i.e., clips) should be interdependent to capture similar visual and key semantic information in the same video. To incorporate the interdependent knowledge between contextual clips into the network inference, we propose a Siamese Sampling and Reasoning (SiaSamRea) approach, which consists of a siamese sampling mechanism to generate sparse and similar clips (i.e., siamese clips) from the same video, and a novel reasoning strategy for integrating the interdependent knowledge between contextual clips into the network. The reasoning strategy contains two modules: (1) siamese knowledge generation to learn the inter-relationship among clips; (2) siamese knowledge reasoning to produce the refined soft label by propagating the weights of inter-relationship to the predicted candidates of all clips. Finally, our SiaSamRea can endow the current multimodal reasoning paradigm with the ability of learning from inside via the guidance of soft labels. Extensive experiments demonstrate that SiaSamRea achieves state-of-the-art performance on five VideoQA benchmarks, e.g., a significant +2.1% gain on MSRVTT-QA, +2.9% on MSVD-QA, +1.0% on ActivityNet-QA, +1.8% on How2QA and +4.3% (action) on TGIF-QA.

AAAI Conference 2021 Conference Paper

Learning to Reweight with Deep Interactions

  • Yang Fan
  • Yingce Xia
  • Lijun Wu
  • Shufang Xie
  • Weiqing Liu
  • Jiang Bian
  • Tao Qin
  • Xiang-Yang Li

Recently, the concept of teaching has been introduced into machine learning, in which a teacher model is used to guide the training of a student model (which will be used in real tasks) through data selection, loss function design, etc. Learning to reweight, which is a specific kind of teaching that reweights training data using a teacher model, receives much attention due to its simplicity and effectiveness. In existing learning to reweight works, the teacher model only utilizes shallow/surface information such as training iteration number and loss/accuracy of the student model from training/validation sets, but ignores the internal states of the student model, which limits the potential of learning to reweight. In this work, we propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model, and the teacher model returns adaptive weights of training samples to enhance the training of the student model. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
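
A minimal sketch of the forward weighting path, with the raw input features standing in for the richer internal states the paper feeds to the teacher; the meta-gradient training of the teacher through a validation set is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Teacher(nn.Module):
    """Maps per-example signals to a positive weight per training example."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, signals: torch.Tensor) -> torch.Tensor:  # (batch, in_dim)
        return torch.sigmoid(self.net(signals)).squeeze(-1)    # (batch,)

student = nn.Linear(4, 3)
teacher = Teacher(in_dim=1 + 4)  # signal = per-example loss + a feature summary
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))

per_ex_loss = F.cross_entropy(student(x), y, reduction="none")
signals = torch.cat([per_ex_loss.detach().unsqueeze(-1), x], dim=-1)
weights = teacher(signals).detach()  # teacher updates (meta-gradients) omitted
# Weighted student objective: the teacher's weights rescale each example.
loss = (weights * per_ex_loss).sum() / weights.sum()
loss.backward()
```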

NeurIPS Conference 2021 Conference Paper

R-Drop: Regularized Dropout for Neural Networks

  • Xiaobo Liang
  • Lijun Wu
  • Juntao Li
  • Yue Wang
  • Qi Meng
  • Tao Qin
  • Wei Chen
  • Min Zhang

Dropout is a powerful and widely used technique to regularize the training of deep neural networks. Though effective and performing well, the randomness introduced by dropout causes non-negligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the above inconsistency. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available on GitHub: https://github.com/dropreg/R-Drop.
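
The core regularizer is compact enough to sketch directly: two stochastic forward passes through the same dropout-enabled model, cross-entropy on both, plus a symmetric KL term (a minimal sketch; model and hyperparameters below are arbitrary):

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """Two forward passes differ because dropout is active; penalize the
    divergence between their output distributions."""
    logits1, logits2 = model(x), model(x)
    ce = 0.5 * (F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y))
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Dropout(0.3),
                            torch.nn.ReLU(), torch.nn.Linear(32, 4))
model.train()  # keep dropout active so the two passes disagree
x, y = torch.randn(16, 10), torch.randint(0, 4, (16,))
r_drop_loss(model, x, y).backward()
```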

NeurIPS Conference 2020 Conference Paper

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

  • Xiang Li
  • Wenhai Wang
  • Lijun Wu
  • Shuo Chen
  • Xiaolin Hu
  • Jun Li
  • Jinhui Tang
  • Jian Yang

One-stage detectors basically formulate object detection as dense classification and localization (i.e., bounding box regression). The classification is usually optimized by Focal Loss and the box location is commonly learned under a Dirac delta distribution. A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the representations of the above three fundamental elements: quality estimation, classification, and localization. Two problems are discovered in existing practices, including (1) the inconsistent usage of the quality estimation and classification between training and inference, and (2) the inflexible Dirac delta distribution for localization. To address these problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation, and use a vector to represent the arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain continuous labels, which is beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL) that generalizes Focal Loss from its discrete form to the continuous version for successful optimization. On COCO test-dev, GFL achieves 45.0% AP using a ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5%) and ATSS (43.6%) with higher or comparable inference speed.
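
A minimal sketch of the quality-focal form of this idea, where the target is a soft label in [0, 1] (e.g., the IoU of the predicted box) rather than a discrete class label; values below are synthetic:

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(logits, targets, beta=2.0):
    """Focal-style binary cross-entropy against soft quality targets in
    [0, 1]: the loss is down-weighted where the prediction already matches
    the continuous target."""
    sigma = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return ((targets - sigma).abs().pow(beta) * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([0.9, 0.0, 0.4])  # e.g., predicted-box IoU as the label
print(quality_focal_loss(logits, targets))
```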

AAAI Conference 2020 Conference Paper

Transductive Ensemble Learning for Neural Machine Translation

  • Yiren Wang
  • Lijun Wu
  • Yingce Xia
  • Tao Qin
  • ChengXiang Zhai
  • Tie-Yan Liu

Ensemble learning, which aggregates multiple diverse models for inference, is a common practice to improve the accuracy of machine learning tasks. However, it has been observed that conventional ensemble methods bring only marginal improvement for neural machine translation (NMT) when individual models are strong or there is a large number of individual models. In this paper, we study how to effectively aggregate multiple NMT models under the transductive setting, where the source sentences of the test set are known. We propose a simple yet effective approach named transductive ensemble learning (TEL), in which we use all individual models to translate the source test set into the target language space and then finetune a strong model on the translated synthetic corpus. We conduct extensive experiments on different settings (with/without monolingual data) and different language pairs (English↔{German, Finnish}). The results show that our approach boosts strong individual models with significant improvement and benefits substantially from more individual models. Specifically, we achieve state-of-the-art performance on the WMT2016-2018 English↔German translations.

NeurIPS Conference 2018 Conference Paper

Learning to Teach with Dynamic Loss Functions

  • Lijun Wu
  • Fei Tian
  • Yingce Xia
  • Yang Fan
  • Tao Qin
  • Lai Jian-Huang
  • Tie-Yan Liu

Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) considering the different situations of a student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as the perfect counterpart of the learning objectives set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings, in which the loss function of a machine learning model is predefined and fixed, in our framework the loss function of a machine learning model (we call it the student) is defined by another machine learning model (we call it the teacher). The ultimate goal of the teacher model is to cultivate a student with better performance as measured on the development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient-based optimization possible, avoiding ineffective solutions such as policy optimization. We name our method "learning to teach with dynamic loss functions" (L2T-DLF for short). Extensive experiments on real-world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.

AAAI Conference 2018 Conference Paper

Word Attention for Sequence to Sequence Text Understanding

  • Lijun Wu
  • Fei Tian
  • Li Zhao
  • Jianhuang Lai
  • Tie-Yan Liu

The attention mechanism has been a key component in Recurrent Neural Network (RNN) based sequence-to-sequence learning frameworks, which have been adopted in many text understanding tasks, such as neural machine translation and abstractive summarization. In these tasks, the attention mechanism models how important each part of the source sentence is for generating a target-side word. To compute such importance scores, the attention mechanism summarizes the source-side information in the encoder RNN hidden states (i.e., h_t), and then builds a context vector for a target-side word upon a subsequence representation of the source sentence, since h_t actually summarizes the information of the subsequence containing the first t words in the source sentence. In this paper, we show that an additional attention mechanism called word attention, which builds itself upon word-level representations, significantly enhances the performance of sequence-to-sequence learning. Our word attention can enrich the source-side contextual representation by directly promoting the clean word-level information in each step. Furthermore, we propose to use contextual gates to dynamically combine the subsequence-level and word-level contextual information. Experimental results on abstractive summarization and neural machine translation show that word attention significantly improves over strong baselines. In particular, we achieve the state-of-the-art result on the WMT'14 English-French translation task with 12M training data.
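
A minimal sketch of dual attention with a contextual gate, using plain dot-product scores for brevity (the paper's exact scoring and gating functions may differ):

```python
import torch
import torch.nn.functional as F

def gated_dual_context(dec_state, enc_states, word_embs, gate_layer):
    """Attend separately over RNN hidden states and raw word embeddings,
    then mix the two context vectors with a learned sigmoid gate.
    dec_state: (d,); enc_states, word_embs: (src_len, d)."""
    ctx_h = F.softmax(enc_states @ dec_state, dim=0) @ enc_states  # subsequence-level
    ctx_w = F.softmax(word_embs @ dec_state, dim=0) @ word_embs    # word-level
    g = torch.sigmoid(gate_layer(torch.cat([dec_state, ctx_h, ctx_w])))
    return g * ctx_h + (1 - g) * ctx_w

d, src_len = 8, 5
gate = torch.nn.Linear(3 * d, d)
ctx = gated_dual_context(torch.randn(d), torch.randn(src_len, d),
                         torch.randn(src_len, d), gate)
print(ctx.shape)  # torch.Size([8])
```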

NeurIPS Conference 2017 Conference Paper

Deliberation Networks: Sequence Generation Beyond One-Pass Decoding

  • Yingce Xia
  • Fei Tian
  • Lijun Wu
  • Jianxin Lin
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

The encoder-decoder framework has achieved promising progress for many sequence generation tasks, including machine translation, text summarization, dialog systems, image captioning, etc. Such a framework adopts a one-pass forward process when decoding and generating a sequence, but lacks a deliberation process: a generated sequence is directly used as the final output without further polishing. However, deliberation is a common behavior in daily human life, such as reading news and writing papers, articles, or books. In this work, we introduce the deliberation process into the encoder-decoder framework and propose deliberation networks for sequence generation. A deliberation network has two levels of decoders, where the first-pass decoder generates a raw sequence and the second-pass decoder polishes and refines the raw sentence with deliberation. Since the second-pass deliberation decoder has global information about what the sequence to be generated might be, it has the potential to generate a better sequence by looking into future words in the raw sentence. Experiments on neural machine translation and text summarization demonstrate the effectiveness of the proposed deliberation networks. On the WMT 2014 English-to-French translation task, our model establishes a new state-of-the-art BLEU score of 41.5.

IJCAI Conference 2017 Conference Paper

Sequence Prediction with Unlabeled Data by Reward Function Learning

  • Lijun Wu
  • Li Zhao
  • Tao Qin
  • Jianhuang Lai
  • Tie-Yan Liu

Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and use the predicted reward as a pseudo reward for unlabeled data, so that we can learn from unlabeled data using the pseudo reward. To obtain good pseudo rewards on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained on a purposely biased data distribution. Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
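
A minimal sketch of the pseudo-reward idea, with a small feed-forward regressor standing in for the paper's attention-based RNN reward network; the features and rewards below are synthetic placeholders:

```python
import torch
import torch.nn as nn

# Tiny reward regressor: features of a generated sequence -> scalar reward.
reward_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

feats_lab = torch.randn(64, 16)      # labeled data: sequence features
true_reward = torch.rand(64, 1)      # e.g., BLEU against the reference
for _ in range(200):                 # fit the reward function on labeled data
    opt.zero_grad()
    loss = nn.functional.mse_loss(reward_net(feats_lab), true_reward)
    loss.backward()
    opt.step()

feats_unlab = torch.randn(8, 16)     # unlabeled data: no reference available
pseudo_reward = reward_net(feats_unlab).detach()  # supervises RL on unlabeled data
print(pseudo_reward.squeeze(-1))
```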