Author name cluster

Qun Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers

1 author row

EAAI Journal 2026 Journal Article

Adaptive weighted disentangling variational autoencoder with fine-grained feedback

Zhenyao Yu
Yue Liu
Zitu Liu
Zhengwei Yang
Yike Guo
Qun Liu
Guoyin Wang

TMLR Journal 2026 Journal Article

Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

Ge Zhang
Mohammad Ali Alomrani
Hongjian Gu
Jiaming Zhou
Yaochen Hu
Bin Wang
Qun Liu
Mark Coates

Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework for solving relation reasoning that decomposes the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a reasoning graph that identifies crucial entities, relations, and attributes within the context. Subsequently, PoT identifies query-relevant reasoning paths within the graph, facilitating downstream reasoning of potential answers. Experimental evaluations across four datasets of relational reasoning demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (up to 21.3%) without requiring fine-tuning or extensive LLM calls. Furthermore, unlike prior neuro-symbolic methods, PoT exhibits improved resilience against LLM extraction errors and input ambiguity by leveraging the compositional nature of graphs.

AAAI Conference 2026 Conference Paper

Large Language Models Struggle with Unreasonability in Math Problems

Jingyuan Ma
Damai Dai
Zihang Yuan
Rui Li
Weilin Luo
Bin Wang
Qun Liu
Lei Sha

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem is well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o struggle on UMP. While reasoning models such as DeepSeek-R1 demonstrate a higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long and meaningless responses that fail to converge. We further find that prompting and fine-tuning enhance the detection of unreasonable inputs, with minor and acceptable trade-offs, making them practical solutions in this challenging setting.

TMLR Journal 2026 Journal Article

The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Jierun Chen
Tiezheng Yu
Haoli Bai
Lewei Yao
Jiannan Wu
Kaican Li
Fei Mi
Chaofan Tao

Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs. Code, dataset, and fine-tuned models are available at https://github.com/JierunChen/SFT-RL-SynergyDilemma.

AAAI Conference 2026 Conference Paper

ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool learning

Xingshan Zeng
Weiwen Liu
Xu Huang
Zezhong Wang
Lingzhi Wang
Liangyou Li
Yasheng Wang
Lifeng Shang

Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, existing approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel framework that includes both model-aware iterative training and adaptive refinement for tool learning. ToolACE-R features a model-aware iterative training procedure that progressively adjusts training samples based on the model’s evolving capabilities to maximize its potential. Additionally, it incorporates a self-refinement training corpus that emphasizes the LLM's ability to iteratively refine its tool calls, optimizing performance without requiring external feedback. Furthermore, we introduce adaptive self-refinement for efficient test-time scaling, where the trained model can autonomously determine when to stop the process based on iterative self-refinement. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced LLMs. The performance can be further improved efficiently through adaptive self-refinement. These results highlight the effectiveness and generalizability of ToolACE-R, offering a promising direction for more efficient and scalable tool learning.

AAAI Conference 2025 Conference Paper

A Comprehensive Evaluation on Event Reasoning of Large Language Models

Zhengwei Tao
Zhi Jin
Yifan Zhang
Xiancai Chen
Haiyan Zhao
Jia Li
Bin Liang
Chongyang Tao

Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and needs to deal with the diversity of the inter-event relations and the reasoning paradigms. The extent to which LLMs excel in event reasoning across various relations and reasoning paradigms has not been thoroughly investigated. Additionally, it is still unclear whether LLMs utilize event knowledge in the same way humans do. To mitigate this disparity, we comprehensively evaluate the abilities of event reasoning of LLMs on different relations, paradigms, and levels of abstraction. We introduce a novel benchmark EV2 for EValuation of EVent reasoning. EV2 consists of two levels of evaluation on schema and instance and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that 1) LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory. 2) There are imbalances of event reasoning abilities on different relations and paradigms. 3) LLMs have event schema knowledge, however, they're not aligned with humans on how to utilize the knowledge. Based on these findings, we guide the LLMs in utilizing the event schema knowledge as memory leading to improvements in event reasoning.

AAAI Conference 2025 Conference Paper

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

Yueqian Wang
Xiaojun Meng
Yuxuan Wang
Jianxin Liang
Qun Liu
Dongyan Zhao

Multi-modal multi-party conversation (MMC) is a less studied yet important research topic because it fits real-world scenarios well and thus potentially has wider applications. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC in this paper, an MMC dataset that contains 24,000+ unique utterances paired with video context. To explore the character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of faces that appear in the video. Based on this Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have the multi-party nature with the video or image as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline method that leverages an optimization solver to utilize the context of two modalities to achieve better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset will be publicly available, and thus we call for more attention on modelling speaker information when understanding conversations.

NeurIPS Conference 2025 Conference Paper

RidgeLoRA: Matrix Ridge Enhanced Low-Rank Adaptation of Large Language Models

Junda Zhu
Jun Ai
Yujun Li
Yichun Yin
Yasheng Wang
Lifeng Shang
Qun Liu

As one of the state-of-the-art parameter-efficient fine-tuning (PEFT) methods, Low-Rank Adaptation (LoRA) enables model optimization with reduced computational cost through trainable low-rank matrices. However, the low-rank nature makes it prone to a decrease in representation ability, leading to suboptimal performance. In order to break this limitation, we propose RidgeLoRA, a lightweight architecture like LoRA that incorporates novel architecture and matrix ridge enhanced full-rank approximation, to match the performance of full-rank training, while eliminating the need for high memory and a large number of parameters to restore the rank of matrices. We provide a rigorous mathematical derivation to prove that RidgeLoRA has a better upper bound on the representations than vanilla LoRA. Furthermore, extensive experiments across multiple domains demonstrate that RidgeLoRA achieves better performance than other LoRA variants, and can even match or surpass full-rank training.

AAAI Conference 2024 Conference Paper

Preparing Lessons for Progressive Training on Language Models

Yu Pan
Ye Yuan
Yichun Yin
Jiaxin Shi
Zenglin Xu
Ming Zhang
Lifeng Shang
Xin Jiang

The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepares lessons for expanding operations by learning high-layer functionality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.

AAAI Conference 2024 Conference Paper

Unsupervised Extractive Summarization with Learnable Length Control Strategies

Renlong Jie
Xiaojun Meng
Xin Jiang
Qun Liu

Unsupervised extractive summarization is an important technique in information extraction and retrieval. Compared with supervised methods, it does not require high-quality human-labelled summaries for training and thus can be easily applied to documents of different types, domains or languages. Most existing unsupervised methods, including TextRank and PACSUM, rely on graph-based ranking of sentence centrality. However, this scorer cannot be directly applied in end-to-end training, and a position-related prior assumption is often needed to achieve good summaries. In addition, less attention is paid to length-controllable extractors, where users can decide to summarize texts under a particular length constraint. This paper introduces an unsupervised extractive summarization model based on a siamese network, for which we develop a trainable bidirectional prediction objective between the selected summary and the original document. Unlike the centrality-based ranking methods, our extractive scorer can be trained in an end-to-end manner, with no positional assumption required. In addition, we introduce a differentiable length control module by approximating a 0-1 knapsack solver for end-to-end length-controllable extraction. Experiments show that our unsupervised method largely outperforms the centrality-based baseline using the same sentence encoder. In terms of length control ability, via our trainable knapsack module, the performance consistently outperforms the strong baseline without utilizing end-to-end training. Human evaluation further shows that our method performs the best among baselines in terms of relevance and consistency.
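As a rough illustration of the 0-1 knapsack view of length-controlled extraction mentioned in the abstract above, the sketch below picks the set of sentences with the highest total score whose total length fits a budget. It is an exact dynamic-programming solver, not the paper's differentiable approximation, and the function name `select_sentences`, the per-sentence scores, and the integer token lengths are all hypothetical choices made for this example.

```python
def select_sentences(scores, lengths, budget):
    """Exact 0-1 knapsack: maximise total score with total length <= budget.

    scores  -- hypothetical per-sentence relevance scores (floats)
    lengths -- positive integer sentence lengths, e.g. token counts
    budget  -- maximum total length of the summary
    Returns the sorted indices of the chosen sentences.
    """
    n = len(scores)
    # best[i][cap] = best total score using the first i sentences within cap
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for cap in range(budget + 1):
            best[i][cap] = best[i - 1][cap]  # skip sentence i-1
            if lengths[i - 1] <= cap:
                candidate = best[i - 1][cap - lengths[i - 1]] + scores[i - 1]
                if candidate > best[i][cap]:
                    best[i][cap] = candidate  # take sentence i-1
    # Walk back through the table to recover which sentences were taken.
    chosen = []
    cap = budget
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= lengths[i - 1]
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    print(select_sentences([0.9, 0.5, 0.8], [30, 10, 25], 40))
```

A hard argmax over subsets like this is not differentiable, which is presumably why the abstract speaks of approximating the solver for end-to-end training.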

AAAI Conference 2023 Conference Paper

KPT: Keyword-Guided Pre-training for Grounded Dialog Generation

Qi Zhu
Fei Mi
Zheng Zhang
Yasheng Wang
Yitong Li
Xin Jiang
Qun Liu
Xiaoyan Zhu

Incorporating external knowledge into the response generation process is essential to building more helpful and reliable dialog agents. However, collecting knowledge-grounded conversations is often costly, calling for a better pre-trained model for grounded dialog generation that generalizes well w.r.t. different types of knowledge. In this work, we propose KPT (Keyword-guided Pre-Training), a novel self-supervised pre-training method for grounded dialog generation without relying on extra knowledge annotation. Specifically, we use a pre-trained language model to extract the most uncertain tokens in the dialog as keywords. With these keywords, we construct two kinds of knowledge and pre-train a knowledge-grounded response generation model, aiming at handling two different scenarios: (1) the knowledge should be faithfully grounded; (2) it can be selectively used. For the former, the grounding knowledge consists of keywords extracted from the response. For the latter, the grounding knowledge is additionally augmented with keywords extracted from other utterances in the same dialog. Since the knowledge is extracted from the dialog itself, KPT can be easily performed on a large volume and variety of dialogue data. We considered three data sources (open-domain, task-oriented, conversational QA) with a total of 2.5M dialogues. We conduct extensive experiments on various few-shot knowledge-grounded generation tasks, including grounding on dialog acts, knowledge graphs, persona descriptions, and Wikipedia passages. Our comprehensive experiments and analyses demonstrate that KPT consistently outperforms state-of-the-art methods on these tasks with diverse grounding knowledge.

IJCAI Conference 2023 Conference Paper

Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video

Zenan Xu
Xiaojun Meng
Yasheng Wang
Qinliang Su
Zexuan Qiu
Xin Jiang
Qun Liu

Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary to describe the highlights of a video according to multimodal resources, in our case, the video content and its transcript. Inspired by the success of the large-scale generative pre-trained language model (GPLM) in generating high-quality textual content (e.g., summaries), recent MAS methods have proposed to adapt the GPLM to this task by equipping it with visual information, which is often obtained through a general-purpose visual feature extractor. However, the generally extracted visual features may overlook some summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization. Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model can achieve a significant improvement on small datasets or even datasets with limited training data.

NeurIPS Conference 2023 Conference Paper

Reusing Pretrained Models by Multi-linear Operators for Efficient Training

Yu Pan
Ye Yuan
Yichun Yin
Zenglin Xu
Lifeng Shang
Xin Jiang
Qun Liu

Training large models from scratch usually costs a substantial amount of resources. Towards this problem, recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model (termed the "target model"), leading to a considerable acceleration in training. Despite the successes of these previous studies, they grew pretrained models by mapping partial weights only, ignoring potential correlations across the entire model. As we show in this paper, there are inter- and intra-interactions among the weights of both the pretrained and the target models. As a result, the partial mapping may not capture the complete information and lead to inadequate growth. In this paper, we propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model to further enhance acceleration ability. We utilize multi-linear operators to reduce computational and spatial complexity, enabling acceptable resource requirements. Experiments demonstrate that our method can save 76% computational costs on DeiT-base transferred from DeiT-small, which outperforms bert2BERT by +12% and LiGO by +21%, respectively.

AAAI Conference 2022 Conference Paper

From Fully Trained to Fully Random Embeddings: Improving Neural Machine Translation with Compact Word Embedding Tables

Krtin Kumar
Peyman Passban
Mehdi Rezagholizadeh
Yiusing Lau
Qun Liu

Embedding matrices are key components in neural natural language processing (NLP) models that are responsible for providing numerical representations of input tokens (i.e., words or subwords). In this paper, we analyze the impact and utility of such matrices in the context of neural machine translation (NMT). We show that detracting syntactic and semantic information from word embeddings and running NMT systems with random embeddings is not as damaging as it initially sounds. We also show how incorporating only a limited amount of task-specific knowledge from fully-trained embeddings can boost the performance of NMT systems. Our findings demonstrate that in exchange for negligible deterioration in performance, any NMT model can be run with partially random embeddings. Working with such structures means a minimal memory requirement as there is no longer a need to store large embedding tables, which is a significant gain in industrial and on-device settings. We evaluated our embeddings in translating English into German and French and achieved a 5.3x compression rate. Despite having a considerably smaller architecture, our models in some cases are even able to outperform state-of-the-art baselines.

NeurIPS Conference 2022 Conference Paper

TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models

Huibin Ge
Xiaohu Zhao
Chuang Liu
Yulong Zeng
Qun Liu
Deyi Xiong

In order to diagnostically analyze and improve the capability of pretrained language models (PLMs) in text generation, we propose TGEA 2.0, to date the largest dataset built on machine-authored texts by PLMs with fine-grained semantic annotations on a wide variety of pathological generation errors. We collect 170K nominal, phrasal and sentential prompts from 6M natural sentences in 3 domains. These prompts are fed into 4 generative PLMs with their best decoding strategy to generate paragraphs. 195,629 sentences are extracted from these generated paragraphs for manual annotation, where 36K erroneous sentences are detected, 42K erroneous spans are located and categorized into an error type defined in a two-level error taxonomy. We define a Minimal Set of Error-related Words (MiSEW) for each erroneous span, which not only provides error-associated words but also rationalizes the reasoning behind the error. Quality control with a pre-annotation and feedback loop is performed before and during the entire annotation process. With the diagnostically annotated dataset, we propose 5 diagnosis benchmark tasks (i.e., erroneous text detection, MiSEW extraction, erroneous span location and correction together with error type classification) and 2 pathology mitigation benchmark tasks (pairwise comparison and word prediction). Experiment results on these benchmark tasks demonstrate that TGEA 2.0 is a challenging dataset that could facilitate further research on automatic diagnosis and pathology mitigation over machine texts. The dataset will be publicly available at https://github.com/tjunlp-lab/TGEA/.

AAAI Conference 2022 Conference Paper

UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

Zhengkun Zhang
Xiaojun Meng
Yasheng Wang
Xin Jiang
Qun Liu
Zhenglu Yang

With the rapid increase of multimedia data, a large body of literature has emerged to work on multimodal summarization, the majority of which aims at refining salient information from textual and visual modalities to output a pictorial summary with the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on qualified image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounding on BART, UniMS, that integrates extractive and abstractive objectives, as well as selecting the image output. Specifically, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which avoids any requirement on the existence and quality of image captions. Besides, we introduce a visual guided decoder to better integrate textual and visual modalities in guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task.

AAAI Conference 2021 Conference Paper

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Peyman Passban
Yimeng Wu
Mehdi Rezagholizadeh
Qun Liu

Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is selected so we can achieve compression and yet deliver high-quality results. In such a setting, distillation only happens for final predictions whereas the student could also benefit from teacher’s supervision for internal components. Motivated by this, we studied the problem of distillation for intermediate layers. Since there might not be a one-to-one alignment between student and teacher layers, existing techniques skip some teacher layers and only distill from a subset of them. This shortcoming directly impacts quality, so we instead propose a combinatorial technique which relies on attention. Our model fuses teacher-side information and takes each layer’s significance into consideration, then performs distillation between combined teacher layers and those of the student. Using our technique, we distilled a 12-layer BERT (Devlin et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE tasks (Wang et al. 2018). Experimental results show that our combinatorial approach is able to outperform other existing techniques.

AAAI Conference 2021 Conference Paper

HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions

Shaobo Li
Xiaoguang Li
Lifeng Shang
Xin Jiang
Qun Liu
Chengjie Sun
Zhenzhou Ji
Bingquan Liu

Collecting supporting evidence from large corpora of text (e.g., Wikipedia) is a great challenge for open-domain Question Answering (QA). In particular, for multi-hop open-domain QA, scattered pieces of evidence must be gathered together to support the answer extraction. In this paper, we propose a new retrieval target, hop, to collect the hidden reasoning evidence from Wikipedia for complex question answering. Specifically, the hop in this paper is defined as the combination of a hyperlink and the corresponding outbound link document. The hyperlink is encoded as the mention embedding which models the structured knowledge of how the outbound link entity is mentioned in the textual context, and the corresponding outbound link document is encoded as the document embedding representing the unstructured knowledge within it. Accordingly, we build HopRetriever which retrieves hops over Wikipedia to answer complex questions. Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. Moreover, our approach also yields quantifiable interpretations of the evidence collection process.

AAAI Conference 2021 Conference Paper

Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings?

Xuancheng Ren
Xu Sun
Houfeng Wang
Qun Liu

Self-supervised pre-training techniques, albeit relying on large amounts of text, have enabled rapid growth in learning language representations for natural language understanding. However, as radically empirical models on sentences, they are subject to the input data distribution, inevitably incorporating data bias and reporting bias, which may lead to inaccurate understanding of sentences. To address this problem, we propose to adopt a human learner’s approach: when we cannot make sense of a word in a sentence, we often consult the dictionary for specific meanings; but can the same work for empirical models? In this work, we try to inform the pre-trained masked language models of word meanings for semantics-enhanced pre-training. To achieve a contrastive and holistic view of word meanings, a definition pair of two related words is presented to the masked language model such that the model can better associate a word with its crucial semantic features. Both intrinsic and extrinsic evaluations validate the proposed approach on semantics-orientated tasks, with an almost negligible increase of training data.

IJCAI Conference 2020 Conference Paper

Bridging the Gap between Training and Inference for Neural Machine Translation (Extended Abstract)

Wen Zhang
Yang Feng
Qun Liu

Neural Machine Translation (NMT) generates target words sequentially in the way of predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context while at inference it has to generate the entire sequence from scratch. This discrepancy of the fed context leads to error accumulation among the translation. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence which leads to overcorrection over different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the predicted sequence during training. Experimental results on NIST Chinese->English and WMT2014 English->German translation tasks demonstrate that our method can achieve significant improvements on multiple data sets compared to strong baselines.
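The idea of drawing context words from both the ground truth and the model's own predictions during training can be sketched as follows. This is a minimal illustration only: the function name `mix_context`, the fixed probability `p_gold`, and the per-position independent draw are assumptions made for the example, and the abstract does not specify how the sampling probability is chosen.

```python
import random


def mix_context(gold, predicted, p_gold, rng):
    """Build the context fed to the decoder during training.

    For each position, keep the ground-truth word with probability p_gold,
    otherwise use the word the model predicted at that position.
    p_gold and the independent per-position draw are illustrative assumptions.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return [g if rng.random() < p_gold else p for g, p in zip(gold, predicted)]


if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["the", "cat", "sat", "down"]
    predicted = ["a", "cat", "sits", "down"]
    # With p_gold = 1.0 this reduces to standard teacher forcing.
    print(mix_context(gold, predicted, 1.0, rng))
    # With p_gold = 0.0 the model only ever sees its own predictions.
    print(mix_context(gold, predicted, 0.0, rng))
    print(mix_context(gold, predicted, 0.5, rng))
```

Exposing the model to its own predictions at training time narrows the gap with inference, where no ground-truth context is available.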

AAAI Conference 2020 Conference Paper

Dialog State Tracking with Reinforced Data Augmentation

Yichun Yin
Lifeng Shang
Xin Jiang
Xiao Chen
Qun Liu

Neural dialog state trackers are generally limited due to the lack of quantity and diversity of annotated training data. In this paper, we address this difficulty by proposing a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the neural state tracker. Specifically, we introduce a novel contextual bandit generator to learn fine-grained augmentation policies that can generate new effective instances by choosing suitable replacements for specific context. Moreover, by alternately learning between the generator and the state tracker, we can keep refining the generative policies to generate more high-quality training data for neural state tracker. Experimental results on the WoZ and MultiWoZ (restaurant) datasets demonstrate that the proposed framework significantly improves the performance over the state-of-the-art models, especially with limited training data.

NeurIPS Conference 2020 Conference Paper

DynaBERT: Dynamic BERT with Adaptive Width and Depth

Lu Hou
Zhiqi Huang
Lifeng Shang
Xin Jiang
Xiao Chen
Qun Liu

The pre-trained language models like BERT, though powerful in many natural language processing tasks, are both computation and memory expensive. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually compress the large BERT model to a fixed smaller size, and cannot fully satisfy the requirements of different edge devices with various hardware performances. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust the size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance to BERT-base (or RoBERTa-base), while at smaller widths and depths consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.

AAAI Conference 2020 Conference Paper

Multi-Channel Reverse Dictionary Model

Lei Zhang
Fanchao Qi
Zhiyuan Liu
Yasheng Wang
Qun Liu
Maosong Sun

A reverse dictionary takes the description of a target word as input and outputs the target word together with other words that match the description. Existing reverse dictionary methods cannot deal with highly variable input queries and low-frequency target words successfully. Inspired by the description-to-word inference process of humans, we propose the multi-channel reverse dictionary model, which can mitigate the two problems simultaneously. Our model comprises a sentence encoder and multiple predictors. The predictors are expected to identify different characteristics of the target word from the input query. We evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions. Experimental results show that our model achieves the state-of-the-art performance, and even outperforms the most popular commercial reverse dictionary system on the human-written description dataset. We also conduct quantitative analyses and a case study to demonstrate the effectiveness and robustness of our model. All the code and data of this work can be obtained on https://github.com/thunlp/MultiRD.

IJCAI Conference 2018 Conference Paper

Learning Tag Dependencies for Sequence Tagging

Yuan Zhang
Hongshen Chen
Yihong Zhao
Qun Liu
Dawei Yin

Sequence tagging is the basis for multiple applications in natural language processing. Despite successes in learning long-term token sequence dependencies with neural networks, tag dependencies have rarely been considered. Sequence tagging actually possesses complex dependencies and interactions among the input tokens and the output tags. We propose a novel multi-channel model, which handles different ranges of token-tag dependencies and their interactions simultaneously. A tag LSTM is augmented to manage the output tag dependencies and word-tag interactions, while three mechanisms are presented to efficiently incorporate token context representation and tag dependency. Extensive experiments on part-of-speech tagging and named entity recognition tasks show that the proposed model outperforms the BiLSTM-CRF baseline by effectively incorporating the tag dependency feature.

AAAI Conference 2018 Conference Paper

Translating Pro-Drop Languages With Reconstruction Models

Longyue Wang
Zhaopeng Tu
Shuming Shi
Tong Zhang
Yvette Graham
Qun Liu

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. To date, very little attention has been paid to the dropped pronoun (DP) problem within neural machine translation (NMT). In this work, we propose a novel reconstruction-based approach to alleviating DP translation problems for NMT models. Firstly, DPs within all source sentences are automatically annotated with parallel information extracted from the bilingual training corpus. Next, the annotated source sentence is reconstructed from hidden representations in the NMT model. With auxiliary training objectives, in terms of reconstruction scores, the parameters associated with the NMT model are guided to produce enhanced hidden representations that are encouraged as much as possible to embed annotated DP information. Experimental results on both Chinese–English and Japanese–English dialogue translation tasks show that the proposed approach significantly and consistently improves translation performance over a strong NMT baseline, which is directly built on the training data annotated with DPs.

IJCAI Conference 2017 Conference Paper

ME-MD: An Effective Framework for Neural Machine Translation with Multiple Encoders and Decoders

Jinchao Zhang
Qun Liu
Jie Zhou

The encoder-decoder neural framework is widely employed for Neural Machine Translation (NMT) with a single encoder to represent the source sentence and a single decoder to generate target words. The translation performance heavily relies on the representation ability of the encoder and the generation ability of the decoder. To further enhance NMT, we propose to extend the original encoder-decoder framework to a novel one, which has multiple encoders and decoders (ME-MD). In this way, multiple encoders extract more diverse features to represent the source sequence and multiple decoders capture more complicated translation knowledge. Our proposed ME-MD framework makes it convenient to integrate heterogeneous encoders and decoders with multiple depths and multiple types. Experiments on a Chinese-English translation task show that our ME-MD system surpasses the state-of-the-art NMT system by 2.1 BLEU points and surpasses the phrase-based Moses by 7.38 BLEU points. Our framework is general and can be applied to other sequence-to-sequence tasks.

AIJ Journal 2016 Journal Article

Topic-based term translation models for statistical machine translation

Deyi Xiong
Fandong Meng
Qun Liu

IJCAI Conference 2015 Conference Paper

Joint Learning of Constituency and Dependency Grammars by Decomposed Cross-Lingual Induction

Wenbin Jiang
Qun Liu
Thepchai Supnithi

Cross-lingual induction aims to acquire linguistic structures for one language by resorting to annotations from another language. It works well for simple structured prediction problems such as part-of-speech tagging and dependency parsing, but lacks significant progress for more complicated problems such as constituency parsing and deep semantic parsing, mainly due to the structural non-isomorphism between languages. We propose a decomposed projection strategy for cross-lingual induction, where cross-lingual projection is performed in units of the fundamental decisions of the structured prediction. Compared with the structured projection that projects the complete structures, decomposed projection achieves better adaptation to non-isomorphism between languages and efficiently acquires the structured information across languages, thus leading to better performance. For joint cross-lingual induction of constituency and dependency grammars, decomposed cross-lingual induction achieves very significant improvement in both constituency and dependency grammar induction.

IJCAI Conference 2015 Conference Paper

Syntax-Based Deep Matching of Short Texts

Mingxuan Wang
Zhengdong Lu
Hang Li
Qun Liu

Many tasks in natural language processing, ranging from machine translation to question answering, can be reduced to the problem of matching two sentences or more generally two short texts. We propose a new approach to the problem, called Deep Match Tree (DEEPMATCHtree), under a general setting. The approach consists of two components, 1) a mining algorithm to discover patterns for matching two short texts, defined in the product space of dependency trees, and 2) a deep neural network for matching short texts using the mined patterns, as well as a learning algorithm to build the network having a sparse structure. We test our algorithm on the problem of matching a tweet and a response in social media, a hard matching problem proposed in [Wang et al., 2013], and show that DEEPMATCHtree can outperform a number of competitor models including one without using dependency trees and one based on word-embedding, all with large margins.

AAAI Conference 2014 Conference Paper

Joint Morphological Generation and Syntactic Linearization

Linfeng Song
Yue Zhang
Kai Song
Qun Liu

There has been growing interest in stochastic methods for natural language generation (NLG). While most NLG pipelines separate morphological generation and syntactic linearization, the two tasks are closely related. In this paper, we study joint morphological generation and linearization, making use of word order and inflection information for both tasks and reducing error propagation. Experiments show that the joint method significantly outperforms a strong pipelined baseline (by 1.1 BLEU points). It also achieves the best reported result on the Generation Challenge 2011 shared task.

IJCAI Conference 2013 Conference Paper

Modeling Lexical Cohesion for Document-Level Machine Translation

Deyi Xiong
Guosheng Ben
Min Zhang
Yajuan Lü
Qun Liu

Lexical cohesion arises from a chain of lexical items that establish links between sentences in a text. In this paper we propose three different models to capture lexical cohesion for document-level machine translation: (a) a direct reward model where translation hypotheses are rewarded whenever lexical cohesion devices occur in them, (b) a conditional probability model where the appropriateness of using lexical cohesion devices is measured, and (c) a mutual information trigger model where a lexical cohesion relation is considered as a trigger pair and the strength of the association between the trigger and the triggered item is estimated by mutual information. We integrate the three models into hierarchical phrase-based machine translation and evaluate their effectiveness on the NIST Chinese-English translation tasks with large-scale training data. Experiment results show that all three models can achieve substantial improvements over the baseline and that the mutual information trigger model performs better than the others.
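As a rough illustration of the idea behind model (c), the association between a trigger and a triggered item can be scored with pointwise mutual information (PMI) estimated from co-occurrence counts. The sketch below is not the authors' implementation: the function name `pmi_table`, the toy documents, and the choice of document-level co-occurrence as the counting unit are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import permutations


def pmi_table(documents):
    """Estimate PMI(x, y) = log(p(x, y) / (p(x) * p(y))) for word pairs.

    Probabilities are relative frequencies over documents: a word (or pair)
    counts once per document in which it appears (hypothetical setup).
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        vocab = set(doc)
        word_counts.update(vocab)
        # Ordered pairs (trigger, triggered item) that co-occur in this document.
        pair_counts.update(permutations(sorted(vocab), 2))

    table = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[(x, y)] = math.log(p_xy / (p_x * p_y))
    return table


# Toy corpus (illustrative only).
docs = [
    ["bank", "loan", "rate"],
    ["bank", "loan"],
    ["river", "bank"],
    ["rate", "loan"],
]
scores = pmi_table(docs)
```

A positive score means the two items co-occur more often than independence would predict. How such a score enters the translation model's scoring function is specific to the paper and is not reproduced here.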

AAAI Conference 2010 Conference Paper

Forest-Based Semantic Role Labeling

Hao Xiong
Haitao Mi
Yang Liu
Qun Liu

Parsing plays an important role in semantic role labeling (SRL) because most SRL systems infer semantic relations from 1-best parses. Therefore, parsing errors inevitably lead to labeling mistakes. To alleviate this problem, we propose to use packed forest, which compactly encodes all parses for a sentence. We design an algorithm to exploit exponentially many parses to learn semantic relations efficiently. Experimental results on the CoNLL-2005 shared task show that using forests achieves an absolute improvement of 1.2% in terms of F1 score over using 1-best parses and 0.6% over using 50-best parses.