Arrow Research search

Author name cluster

Xiaojun Wan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

37 papers
1 author row

Possible papers

37

AAAI Conference 2026 Conference Paper

SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs

  • Zhenliang Zhang
  • Xinyu Hu
  • Xiaojun Wan

Large language models sometimes inadvertently reproduce copyrighted passages, exposing downstream applications to legal risk. Most existing inference-time defenses focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, a sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; benefiting from this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.
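As a rough illustration of the activation-clamping idea described in this abstract, the Python sketch below assumes a pre-trained ReLU sparse autoencoder and a previously identified set of copyright-sensitive feature indices; all names and shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

def scope_clamp(hidden, W_enc, b_enc, W_dec, b_dec, sensitive_idx, ceiling=0.0):
    """Project a hidden state into the SAE feature space, clamp the
    copyright-sensitive features, and map the result back (sketch only).

    hidden:        (d,) hidden state at the hooked layer
    W_enc, b_enc:  SAE encoder weights (d, m) and bias (m,)
    W_dec, b_dec:  SAE decoder weights (m, d) and bias (d,)
    sensitive_idx: indices of features treated as copyright-sensitive
    ceiling:       maximum activation allowed on those features
    """
    acts = np.maximum(hidden @ W_enc + b_enc, 0.0)        # ReLU SAE activations
    acts[sensitive_idx] = np.minimum(acts[sensitive_idx], ceiling)
    return acts @ W_dec + b_dec                           # back to model space
```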

AAAI Conference 2025 Conference Paper

DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

  • Jinxiang Xie
  • Yilin Li
  • Xunjian Yin
  • Xiaojun Wan

Evaluating the performance of Grammatical Error Correction (GEC) models has become increasingly challenging, as large language model (LLM)-based GEC systems often produce corrections that diverge from provided gold references. This discrepancy undermines the reliability of traditional reference-based evaluation metrics. In this study, we propose a novel evaluation framework for GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency, and utilizing a dynamic weighting mechanism. Our framework employs the Analytic Hierarchy Process (AHP) in conjunction with large language models to ascertain the relative importance of various evaluation criteria. Additionally, we develop a dataset incorporating human annotations and LLM-simulated sentences to validate our algorithms and fine-tune more cost-effective models. Experimental results indicate that our proposed approach enhances the effectiveness of GEC model evaluations.
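For readers unfamiliar with the Analytic Hierarchy Process, the sketch below shows one common way pairwise-comparison judgments can be turned into criterion weights and combined with sub-metric scores; the comparison values and scores are invented for illustration and do not come from the paper.

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive criterion weights from an AHP pairwise-comparison matrix
    via its principal eigenvector, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(np.asarray(pairwise, dtype=float))
    principal = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return principal / principal.sum()

# Hypothetical comparisons of Semantic Coherence, Edit Level and Fluency
# for one evaluation scenario, plus sub-metric scores for one correction.
pairwise = [[1.0, 2.0, 3.0],
            [1/2, 1.0, 2.0],
            [1/3, 1/2, 1.0]]
weights = ahp_weights(pairwise)
scores = np.array([4.0, 3.5, 4.5])
print("weights:", weights, "weighted score:", float(weights @ scores))
```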

NeurIPS Conference 2025 Conference Paper

Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

  • Huanming Shen
  • Baizhou Huang
  • Xiaojun Wan

Watermarking is widely regarded as a promising defense against the misuse of large language models (LLMs); however, existing methods are fundamentally constrained by their vulnerability to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work expands the trade-off boundary by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support detection. Based on this redundancy, we propose a watermark scheme with Sub-vocabulary decomposed Equivalent tExture Keys (SEEK). SEEK achieves a Pareto improvement, enhancing robustness to scrubbing attacks without sacrificing resistance to spoofing.

AAAI Conference 2024 Conference Paper

Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

  • Jie Ruan
  • Xiao Pu
  • Mingqi Gao
  • Xiaojun Wan
  • Yuesheng Zhu

Human evaluation is viewed as a reliable evaluation method for NLG, but it is expensive and time-consuming. To save labor and costs, researchers usually perform human evaluation on a small subset of data sampled from the whole dataset in practice. However, different sampled subsets lead to different rankings of the systems. To give a more correct inter-system ranking and make the gold-standard human evaluation more reliable, we propose a Constrained Active Sampling Framework (CASF) for reliable human judgment. CASF operates through a Learner, a Systematic Sampler and a Constrained Controller to select representative samples for a more correct inter-system ranking. Experimental results on 137 real NLG evaluation setups with 44 human evaluation metrics across 16 datasets and 5 NLG tasks demonstrate that CASF achieves 93.18% top-ranked system recognition accuracy and ranks first or second on 90.91% of the human metrics, with an overall inter-system ranking Kendall correlation of 0.83. Code and data are publicly available online.

AAAI Conference 2024 Conference Paper

History Matters: Temporal Knowledge Editing in Large Language Model

  • Xunjian Yin
  • Jin Jiang
  • Liming Yang
  • Xiaojun Wan

The imperative task of revising or updating the knowledge stored within large language models arises from two distinct sources: intrinsic errors inherent in the model, which should be corrected, and knowledge that has become outdated due to external shifts in the real world, which should be updated. Prevailing efforts in model editing conflate these two distinct categories of edits and directly overwrite the original knowledge in models with new knowledge. However, we argue that preserving the model's original knowledge remains pertinent. Specifically, if a model's knowledge becomes outdated due to evolving worldly dynamics, it should retain recollection of the historical knowledge while integrating the newfound knowledge. In this work, we introduce the task of Temporal Knowledge Editing (TKE) and establish a benchmark, AToKe (Assessment of TempOral Knowledge Editing), to evaluate current model editing methods. We find that while existing model editing methods are effective at making models remember new knowledge, the edited models catastrophically forget historical knowledge. To address this gap, we propose a simple and general framework termed Multi-Editing with Time Objective (METO) for enhancing existing editing methods, which edits both historical and new knowledge concurrently and optimizes the model's prediction for the time of each fact. Our assessments demonstrate that while AToKe remains difficult, METO maintains the effectiveness of learning new knowledge while substantially improving the performance of edited models on historical knowledge.

NeurIPS Conference 2023 Conference Paper

SituatedGen: Incorporating Geographical and Temporal Contexts into Generative Commonsense Reasoning

  • Yunxiang Zhang
  • Xiaojun Wan

Recently, commonsense reasoning in text generation has attracted much attention. Generative commonsense reasoning is the task that requires machines, given a group of keywords, to compose a single coherent sentence with commonsense plausibility. While existing datasets targeting generative commonsense reasoning focus on everyday scenarios, it is unclear how well machines reason under specific geographical and temporal contexts. We formalize this challenging task as SituatedGen, where machines with commonsense should generate a pair of contrastive sentences given a group of keywords including geographical or temporal entities. We introduce a corresponding English dataset consisting of 8,268 contrastive sentence pairs, which are built upon several existing commonsense reasoning benchmarks with minimal manual labor. Experiments show that state-of-the-art generative language models struggle to generate sentences with commonsense plausibility and still lag far behind human performance. Our dataset is publicly available at https://github.com/yunx-z/situated_gen.

AAAI Conference 2022 Short Paper

Automatic Slides Generation for Scholarly Papers: A Fine-Grained Dataset and Baselines (Student Abstract)

  • Sheng Xu
  • Xiaojun Wan

Slides are broadly used to present research work, and there have been several studies on the problem of automatic slide generation. However, the lack of a dataset hinders further research. In this paper, we construct a benchmark dataset for the problem of slide generation from scholarly papers. The dataset is fine-grained and consists of aligned pairs of a single slide and the specific region of a paper it covers. We then deploy several baseline models and conduct preliminary experiments. The results show that this task is challenging and awaits more exploration. The dataset and code will be released.

AAAI Conference 2022 Conference Paper

BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles

  • Yunxiang Zhang
  • Xiaojun Wan

A riddle is a question or statement with double or veiled meanings, followed by an unexpected answer. Solving riddles is a challenging task for both machines and humans, testing the capability of understanding figurative, creative natural language and reasoning with commonsense knowledge. We introduce BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles. For each riddle-answer pair, we provide four distractors with additional information from Wikipedia. The distractors are automatically generated at scale with minimal bias. Existing monolingual and multilingual QA models fail to perform well on our dataset, indicating that there is a long way to go before machines can beat humans at solving tricky riddles. The dataset is publicly available at https://forms.gle/NvT7DfWhAPhvoFvH7.

AAAI Conference 2022 System Paper

PosterBot: A System for Generating Posters of Scientific Papers with Neural Models

  • Sheng Xu
  • Xiaojun Wan

Posters are broadly used to present the important points of academic papers and can be seen as a special form of document summarization. However, the problem of automatic poster generation is under-investigated. In this paper, we present PosterBot, an automatic poster generation system for academic papers. Given a scholarly paper, PosterBot takes three steps to generate the poster. It first selects the most important sections, and then generates corresponding panels from them. Finally, all panels are integrated to get the complete poster. The demonstration shows the efficacy of our proposed system.

NeurIPS Conference 2022 Conference Paper

Relation-Constrained Decoding for Text Generation

  • Xiang Chen
  • Zhixian Yang
  • Xiaojun Wan

The dominant paradigm for neural text generation nowadays is seq2seq learning with large-scale pretrained language models. However, it is usually difficult to manually constrain the generation process of these models. Prior studies have introduced Lexically Constrained Decoding (LCD) to ensure the presence of pre-specified words or phrases in the output. However, simply applying lexical constraints provides no guarantee about the grammatical or semantic relations between words. Thus, more elaborate constraints are needed. To this end, we first propose a new constrained decoding scenario named Relation-Constrained Decoding (RCD), which requires the model's output to contain several given word pairs with respect to the given relations between them. For this scenario, we present a novel plug-and-play decoding algorithm named RElation-guided probability Surgery and bEam ALlocation (RESEAL), which can handle different categories of relations, e.g., syntactic relations or factual relations. Moreover, RESEAL can adaptively "reseal" the relations to form a high-quality sentence, and it can be applied to the inference stage of any autoregressive text generation model. To evaluate our method, we first construct an RCD benchmark based on dependency relations from treebanks with annotated dependencies. Experimental results demonstrate that our approach achieves better preservation of the input dependency relations compared to previous methods. To further illustrate the effectiveness of RESEAL, we apply our method to three downstream tasks: sentence summarization, fact-based text editing, and data-to-text generation. We observe an improvement in generation quality. The source code is available at https://github.com/CasparSwift/RESEAL.

AAAI Conference 2021 Conference Paper

Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

  • Ke Wang
  • Guandan Chen
  • Zhongqiang Huang
  • Xiaojun Wan
  • Fei Huang

Despite the near-human performance already achieved on formal texts such as news articles, neural machine translation still has difficulty in dealing with "user-generated" texts that have diverse linguistic phenomena but lack large-scale high-quality parallel corpora. To address this problem, we propose a counterfactual domain adaptation method to better leverage both large-scale source-domain data (formal texts) and small-scale target-domain data (informal texts). Specifically, by considering effective counterfactual conditions (the concatenations of source-domain texts and the target-domain tag), we construct counterfactual representations to fill the sparse latent space of the target domain caused by the small amount of data, that is, bridging the gap between the source-domain data and the target-domain data. Experiments on English-to-Chinese and Chinese-to-English translation tasks show that our method outperforms the base model trained only on the informal corpus by a large margin, and consistently surpasses different baseline methods by +1.12 to +4.34 BLEU points on different datasets. Furthermore, we also show that our method achieves competitive performance on cross-domain language translation on four language pairs.

AAAI Conference 2021 Conference Paper

Neural Sentence Simplification with Semantic Dependency Information

  • Zhe Lin
  • Xiaojun Wan

Most previous works on neural sentence simplification exploit the seq2seq model to rewrite a sentence without explicitly considering its semantic information. This may lead to semantic deviation in the simplified sentence. In this paper, we leverage semantic dependency graphs to aid neural sentence simplification. We propose a new sentence simplification model with semantic dependency information, called SDISS (shorthand for Semantic Dependency Information guided Sentence Simplification), which incorporates the semantic dependency graph to guide sentence simplification. We evaluate SDISS on three benchmark datasets, and it outperforms a number of strong baseline models on the SARI and FKGL metrics. Human evaluation also shows SDISS can produce simplified sentences of better quality.

IJCAI Conference 2020 Conference Paper

Better AMR-To-Text Generation with Graph Structure Reconstruction

  • Tianming Wang
  • Xiaojun Wan
  • Shaowei Yao

AMR-to-text generation is a challenging task of generating texts from graph-based semantic representations. Recent studies formalize this task as a graph-to-sequence learning problem and use various graph neural networks to model graph structure. In this paper, we propose a novel approach that generates texts from AMR graphs while reconstructing the input graph structures. Our model employs a graph attention mechanism to aggregate information for encoding the inputs. Moreover, better node representations are learned by optimizing two simple but effective auxiliary reconstruction objectives: a link prediction objective, which requires predicting the semantic relationship between nodes, and a distance prediction objective, which requires predicting the distance between nodes. Experimental results on two benchmark datasets show that our proposed model improves considerably over strong baselines and achieves new state-of-the-art results.

AAAI Conference 2020 Conference Paper

MultiSumm: Towards a Unified Model for Multi-Lingual Abstractive Summarization

  • Yue Cao
  • Xiaojun Wan
  • Jinge Yao
  • Dian Yu

Automatic text summarization aims at producing a shorter version of the input text that conveys the most important information. However, multi-lingual text summarization, where the goal is to process texts in multiple languages and output summaries in the corresponding languages with a single model, has rarely been studied. In this paper, we present MultiSumm, a novel multi-lingual model for abstractive summarization. The MultiSumm model uses the following training regime: (I) multi-lingual learning that contains language model training, auto-encoder training, translation and back-translation training, and (II) joint summary generation training. We conduct experiments on summarization datasets for five rich-resource languages: English, Chinese, French, Spanish, and German, as well as two low-resource languages: Bosnian and Croatian. Experimental results show that our proposed model significantly outperforms a multi-lingual baseline model. Specifically, our model achieves comparable or even better performance than models trained separately on each language. As an additional contribution, we construct the first summarization dataset for Bosnian and Croatian, containing 177,406 and 204,748 samples, respectively.

AAAI Conference 2020 Conference Paper

SemSUM: Semantic Dependency Guided Neural Abstractive Summarization

  • Hanqi Jin
  • Tianming Wang
  • Xiaojun Wan

In neural abstractive summarization, the generated summaries often suffer from semantic irrelevance and content deviation from the input sentences. To address this problem, we incorporate semantic dependency graphs, which capture the predicate-argument structure of input sentences, into neural abstractive summarization. We propose a novel semantic dependency guided summarization model (SemSUM), which can leverage the information of the original input texts and the corresponding semantic dependency graphs in a complementary way to guide the summarization process. We evaluate our model on the English Gigaword, DUC 2004 and MSR abstractive sentence summarization datasets. Experiments show that the proposed model improves semantic relevance and reduces content deviation, and also brings significant improvements on automatic ROUGE evaluation metrics.

NeurIPS Conference 2019 Conference Paper

Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation

  • Ke Wang
  • Hang Hua
  • Xiaojun Wan

Unsupervised text attribute transfer automatically transforms a text to alter a specific attribute (e.g., sentiment) without using any parallel data, while simultaneously preserving its attribute-independent content. The dominant approaches try to model the content-independent attribute separately, e.g., by learning different attributes' representations or using multiple attribute-specific decoders. However, this may lead to inflexibility when controlling the degree of transfer or transferring over multiple aspects at the same time. To address these problems, we propose a more flexible unsupervised text attribute transfer framework which replaces the process of modeling the attribute with minimal editing of latent representations based on an attribute classifier. Specifically, we first propose a Transformer-based autoencoder to learn an entangled latent representation for a discrete text, then we cast the attribute transfer task as an optimization problem and propose the Fast-Gradient-Iterative-Modification algorithm to edit the latent representation until it conforms to the target attribute. Extensive experimental results demonstrate that our model achieves very competitive performance on three public datasets. Furthermore, we also show that our model can not only freely control the degree of transfer but also transfer over multiple aspects at the same time.
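The core loop of gradient-based latent editing can be sketched as follows; this is a simplified, hypothetical rendering (single attribute, fixed step size) rather than the paper's Fast-Gradient-Iterative-Modification algorithm, and it assumes a differentiable attribute classifier defined over the latent space.

```python
import torch
import torch.nn.functional as F

def edit_latent(z, attr_classifier, target_label, steps=30, lr=1.0, threshold=0.9):
    """Iteratively move an entangled latent code toward a target attribute
    by following the gradient of an attribute classifier (sketch only)."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = attr_classifier(z)                       # (num_classes,)
        if torch.softmax(logits, dim=-1)[target_label] >= threshold:
            break                                         # target attribute reached
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_label]))
        loss.backward()
        with torch.no_grad():
            z -= lr * z.grad                              # step toward the target
        z.grad.zero_()
    return z.detach()                                     # decode with the autoencoder
```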

AAAI Conference 2019 Conference Paper

Hierarchical Attention Networks for Sentence Ordering

  • Tianming Wang
  • Xiaojun Wan

Modeling discourse coherence is an important problem in natural language generation and understanding. Sentence ordering, the goal of which is to organize a set of sentences into a coherent text, is a commonly used task to learn and evaluate such models. In this paper, we propose a novel hierarchical attention network that captures word clues and dependencies between sentences to address this problem. Our model outperforms prior methods and achieves state-of-the-art performance on several datasets in different domains. Furthermore, our experiments demonstrate that the model performs very well even when noisy sentences are added to the set, which shows its robustness and effectiveness. Visualization analysis and a case study show that our model captures the structure and patterns of coherent texts not only through simple word clues but also through consecution in context.

IJCAI Conference 2019 Conference Paper

Multi-Domain Sentiment Classification Based on Domain-Aware Embedding and Attention

  • Yitao Cai
  • Xiaojun Wan

Sentiment classification is a fundamental task in NLP. However, as many studies have revealed, sentiment classification models are highly domain-dependent. It is therefore worth investigating how to leverage data from different domains to improve the classification performance in each domain. In this work, we propose a novel completely-shared multi-domain neural sentiment classification model that learns domain-aware word embeddings and makes use of a domain-aware attention mechanism. Our model first utilizes a BiLSTM for domain classification and extracts domain-specific features for words, which are then combined with general word embeddings to form domain-aware word embeddings. The domain-aware word embeddings are fed into another BiLSTM to extract sentence features. The domain-aware attention mechanism is used for selecting significant features, using the domain-aware sentence representation as the query vector. Evaluation results on public datasets with 16 different domains demonstrate the efficacy of our proposed model. Further experiments show the generalization ability and transferability of our model.

IJCAI Conference 2019 Conference Paper

T-CVAE: Transformer-Based Conditioned Variational Autoencoder for Story Completion

  • Tianming Wang
  • Xiaojun Wan

Story completion is a very challenging task of generating the missing plot for an incomplete story, which requires not only understanding but also inference of the given contextual clues. In this paper, we present a novel conditional variational autoencoder based on Transformer for missing plot generation. Our model uses shared attention layers for encoder and decoder, which make the most of the contextual clues, and a latent variable for learning the distribution of coherent story plots. Through drawing samples from the learned distribution, diverse reasonable plots can be generated. Both automatic and manual evaluations show that our model generates better story plots than state-of-the-art models in terms of readability, diversity and coherence.

IJCAI Conference 2018 Conference Paper

Learning to Explain Ambiguous Headlines of Online News

  • Tianyu Liu
  • Wei Wei
  • Xiaojun Wan

With the purpose of attracting clicks, online news publishers and editors use diverse strategies to make their headlines catchy, with a sacrifice of accuracy. Specifically, a considerable portion of news headlines is ambiguous. Such headlines are unclear relative to the content of the story, and largely degrade the reading experience of the audience. In this paper, we focus on dealing with the information gap caused by the ambiguous news headlines. We define a new task of explaining ambiguous headlines with short informative texts, and build a benchmark dataset for evaluation. We address the task by selecting a proper sentence from the news body to resolve the ambiguity in an ambiguous headline. Both feature engineering methods and neural network methods are explored. For feature engineering, we improve a standard SVM classifier with elaborately designed features. For neural networks, we propose an ambiguity-aware neural matching model based on a previous model. Utilizing automatic and manual evaluation metrics, we demonstrate the efficacy and the complementarity of the two methods, and the ambiguity-aware neural matching model achieves the state-of-the-art performance on this challenging task.

IJCAI Conference 2018 Conference Paper

SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks

  • Ke Wang
  • Xiaojun Wan

Generating texts with different sentiment labels is attracting increasing attention in the area of natural language generation. Recently, the Generative Adversarial Net (GAN) has shown promising results in text generation. However, the texts generated by GANs usually suffer from poor quality, lack of diversity and mode collapse. In this paper, we propose a novel framework, SentiGAN, which has multiple generators and one multi-class discriminator, to address the above problems. In our framework, multiple generators are trained simultaneously, aiming at generating texts with different sentiment labels without supervision. We propose a penalty-based objective in the generators to force each of them to generate diversified examples of a specific sentiment label. Moreover, the use of multiple generators and one multi-class discriminator makes each generator focus on generating its own examples of a specific sentiment label accurately. Experimental results on four datasets demonstrate that our model consistently outperforms several state-of-the-art text generation methods in terms of sentiment accuracy and the quality of generated texts.

AAAI Conference 2017 Short Paper

ATSUM: Extracting Attractive Summaries for News Propagation on Microblogs

  • Fang Liu
  • Xiaojun Wan

In this paper, we investigate how to automatically extract attractive summaries for news propagation on microblogs and propose a novel system called ATSUM to achieve this goal via text attractiveness analysis. It first analyzes the sentences in a news article and automatically predicts the attractiveness score of each sentence using the support vector regression method. The predicted attractiveness scores are then incorporated into a summarization system. Experimental results on a manually labeled dataset verify the effectiveness of the proposed methods.

IJCAI Conference 2017 Conference Paper

From Neural Sentence Summarization to Headline Generation: A Coarse-to-Fine Approach

  • Jiwei Tan
  • Xiaojun Wan
  • Jianguo Xiao

Headline generation is a task of abstractive text summarization that previously suffered from the immaturity of natural language generation techniques. The recent success of neural sentence summarization models shows their capacity for generating informative, fluent headlines conditioned on selected recapitulative sentences. In this paper, we investigate the extension of sentence summarization models to the document headline generation task. The challenge is that extending the sentence summarization model to consider more document information will mostly confuse the model and hurt performance. We therefore propose a coarse-to-fine approach, which first identifies the important sentences of a document using document summarization techniques, and then exploits a multi-sentence summarization model with hierarchical attention to leverage the important sentences for headline generation. Experimental results on a large real dataset demonstrate that the proposed approach significantly improves the performance of neural sentence summarization models on the headline generation task.

AAAI Conference 2017 Conference Paper

Greedy Flipping for Constrained Word Deletion

  • Jin-ge Yao
  • Xiaojun Wan

In this paper we propose a simple yet efficient method for constrained word deletion to compress sentences, based on top-down greedy local flipping from multiple random initializations. The algorithm naturally integrates various grammatical constraints into the compression process, without using time-consuming integer linear programming solvers. Our formulation suits any objective function involving arbitrary local score definitions. Experimental results show that the proposed method achieves nearly identical performance to an explicit ILP formulation while being much more efficient.
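A toy version of greedy local flipping with random restarts is sketched below; `score` and `feasible` are placeholders for the paper's local objective and grammatical constraints, and the initialization details are assumptions.

```python
import random

def greedy_flip(tokens, score, feasible, restarts=10, seed=0):
    """Constrained word deletion by flipping one keep/delete bit at a time,
    accepting a flip only if it stays feasible and improves the score."""
    rng, n = random.Random(seed), len(tokens)
    best_mask, best_val = None, float("-inf")
    for _ in range(restarts):
        mask = [rng.random() < 0.5 for _ in range(n)]     # random initialization
        if not feasible(mask):
            mask = [True] * n                             # fall back to the full sentence
        current = score(mask)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                mask[i] = not mask[i]
                if feasible(mask) and score(mask) > current:
                    current, improved = score(mask), True
                else:
                    mask[i] = not mask[i]                 # revert the flip
        if current > best_val:
            best_mask, best_val = list(mask), current
    return [t for t, keep in zip(tokens, best_mask) if keep]
```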

IJCAI Conference 2017 Conference Paper

Learning to Identify Ambiguous and Misleading News Headlines

  • Wei Wei
  • Xiaojun Wan

Accuracy is one of the basic principles of journalism. However, it is increasingly hard to maintain due to the diversity of news media. Some editors of online news tend to use catchy headlines which trick readers into clicking. These headlines are either ambiguous or misleading, degrading the reading experience of the audience. Thus, identifying inaccurate news headlines is a task worth studying. Previous work calls these headlines "clickbait" and mainly focuses on features extracted from the headlines themselves, which limits performance since the consistency between headlines and news bodies is underappreciated. In this paper, we clearly redefine the problem and identify ambiguous and misleading headlines separately. We utilize class sequential rules to exploit structure information when detecting ambiguous headlines. For the identification of misleading headlines, we extract features based on the congruence between headlines and bodies. To make use of a large unlabeled data set, we apply a co-training method and gain an increase in performance. The experiment results show the effectiveness of our methods. We then use our classifiers to detect inaccurate headlines crawled from different sources and conduct a data analysis.

AAAI Conference 2017 Conference Paper

Phrase-Based Presentation Slides Generation for Academic Papers

  • Sida Wang
  • Xiaojun Wan
  • Shikang Du

Automatic generation of presentation slides for academic papers is a very challenging task. Previous methods for addressing this task are mainly based on document summarization techniques and extract document sentences to form presentation slides, which are neither well-structured nor concise. In this study, we propose a phrase-based approach to generate well-structured and concise presentation slides for academic papers. Our approach first extracts phrases from the given paper, and then learns both the saliency of each phrase and the hierarchical relationship between pairs of phrases. Finally, a greedy algorithm is used to select and align the salient phrases in order to form the well-structured presentation slides. Evaluation results on a real dataset verify the efficacy of our proposed approach.

AAAI Conference 2016 Conference Paper

MicroScholar: Mining Scholarly Information from Chinese Microblogs

  • Yang Yu
  • Xiaojun Wan

For many researchers, one of the biggest issues is the lack of an efficient way to keep up with the latest academic progress in related research fields. We notice that many researchers tend to share their research progress or recommend scholarly information they know of on their microblogs. In order to exploit microblogging to benefit scientific research, we build a system called MicroScholar to automatically collect and mine scholarly information from Chinese microblogs. In this paper, we briefly introduce the system framework and focus on the component of scholarly microblog categorization. Several kinds of features are used in this component, and experimental results demonstrate their usefulness.

AAAI Conference 2016 Conference Paper

Tweet Timeline Generation with Determinantal Point Processes

  • Jin-ge Yao
  • Feifan Fan
  • Wayne Xin Zhao
  • Xiaojun Wan
  • Edward Chang
  • Jianguo Xiao

The task of tweet timeline generation (TTG) aims at selecting a small set of representative tweets to generate a meaningful timeline and providing enough coverage for a given topical query. This paper presents an approach based on determinantal point processes (DPPs) by jointly modeling the topical relevance of each selected tweet and overall selectional diversity. Aiming at better treatment for balancing relevance and diversity, we introduce two novel strategies, namely spectral rescaling and topical prior. Extensive experiments on the public TREC 2014 dataset demonstrate that our proposed DPP model along with the two strategies can achieve fairly competitive results against the state-of-the-art TTG systems.
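To make the DPP formulation concrete, here is a minimal greedy MAP sketch over a quality-times-diversity kernel; the spectral rescaling and topical prior strategies from the paper are not modeled, and the kernel construction is a standard textbook choice rather than the authors' exact one.

```python
import numpy as np

def dpp_greedy(relevance, similarity, k):
    """Greedily pick k tweets under L = diag(q) @ S @ diag(q), where q holds
    topical relevance scores and S is a PSD tweet-similarity matrix (sketch)."""
    q = np.asarray(relevance, dtype=float)
    L = np.outer(q, q) * np.asarray(similarity, dtype=float)
    selected = []
    for _ in range(min(k, len(q))):
        best_j, best_logdet = None, -np.inf
        for j in range(len(q)):
            if j in selected:
                continue
            idx = selected + [j]
            logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if logdet > best_logdet:                      # largest determinant gain wins
                best_j, best_logdet = j, logdet
        selected.append(best_j)
    return selected
```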

IJCAI Conference 2015 Conference Paper

Compressive Document Summarization via Sparse Optimization

  • Jin-ge Yao
  • Xiaojun Wan
  • Jianguo Xiao

In this paper, we formulate a sparse optimization framework for extractive document summarization. The proposed framework has a decomposable convex objective function, and we derive an efficient ADMM algorithm to solve it. To encourage diversity in the summaries, we explicitly introduce an additional sentence dissimilarity term into the optimization framework. We achieve significant improvement over previous related work under a similar data reconstruction framework. We then generalize our formulation to the case of compressive summarization and derive a block coordinate descent algorithm to optimize the objective function. Performance on the DUC 2006 and DUC 2007 datasets shows that our compressive summarization results are competitive with the state-of-the-art results while maintaining reasonable readability.

AAAI Conference 2015 Conference Paper

Learning to Recommend Quotes for Writing

  • Jiwei Tan
  • Xiaojun Wan
  • Jianguo Xiao

In this paper, we propose and address a novel task of recommending quotes for writing. Quote is short for quotation, the repetition of someone else's statement or thoughts. It is common in writing to cite someone's statement, such as a proverb or a remark by a famous person, to make a composition more elegant or convincing. However, sometimes we want to cite a quote but have no idea which quote would express our idea, because knowing or remembering so many quotes is not easy; it would therefore be helpful to have a system that recommends relevant quotes while we write. In this paper we tackle this appealing AI task and build a learning framework for quote recommendation. We collect abundant quotes from the Internet, and mine real contexts containing these quotes from a large number of electronic books, to build a dataset for experiments. We explore the particular characteristics of this task, and propose several useful features to model the characteristics of quotes and the relevance of quotes to contexts. We apply a supervised learning-to-rank model to integrate multiple features. Experimental results show that our proposed approach is well suited to this task and outperforms other recommendation methods.

AAAI Conference 2015 Conference Paper

Representation Learning for Aspect Category Detection in Online Reviews

  • Xinjie Zhou
  • Xiaojun Wan
  • Jianguo Xiao

User-generated reviews are valuable resources for decision making. Identifying the aspect categories discussed in a given review sentence (e.g., "food" and "service" in restaurant reviews) is an important task of sentiment analysis and opinion mining. Given a predefined aspect category set, most previous studies leverage handcrafted features and a classification algorithm to accomplish the task. The crucial step to achieving better performance is feature engineering, which consumes much human effort and may be unstable when the product domain changes. In this paper, we propose a representation learning approach to automatically learn useful features for aspect category detection. Specifically, a semi-supervised word embedding algorithm is first proposed to obtain continuous word representations on a large set of reviews with noisy labels. Afterwards, we propose to generate deeper and hybrid features through neural networks stacked on the word vectors. A logistic regression classifier is finally trained with the hybrid features to predict the aspect category. The experiments are carried out on a benchmark dataset released by SemEval-2014. Our approach achieves state-of-the-art performance and outperforms the best participating team as well as a few strong baselines.

IJCAI Conference 2009 Conference Paper

  • Xiaojun Wan
  • Jianguo Xiao

Graph-based manifold-ranking methods have been successfully applied to topic-focused multi-document summarization. This paper further proposes to use a multi-modality manifold-ranking algorithm for extracting a topic-focused summary from multiple documents by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs). Three different fusion schemes, namely the linear form, the sequential form and the score combination form, are exploited in the algorithm. Experimental results on the DUC benchmark datasets demonstrate the effectiveness of the proposed multi-modality learning algorithm with all three fusion schemes.

AAAI Conference 2008 Conference Paper

Single Document Keyphrase Extraction Using Neighborhood Knowledge

  • Xiaojun Wan

Existing methods for single document keyphrase extraction usually make use of only the information contained in the specified document. This paper proposes to use a small number of nearest neighbor documents to provide more knowledge to improve single document keyphrase extraction. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and the graph-based ranking algorithm is then applied on the expanded document set to make use of both the local information in the specified document and the global information in the neighbor documents. Experimental results demonstrate the good effectiveness and robustness of our proposed approach.

IJCAI Conference 2007 Conference Paper

  • Xiaojun Wan
  • Jianwu Yang
  • Jianguo Xiao

Topic-focused multi-document summarization aims to produce a summary biased to a given topic or user profile. This paper presents a novel extractive approach to this summarization task based on manifold-ranking of sentences. The manifold-ranking process can naturally make full use of both the relationships among all the sentences in the documents and the relationships between the given topic and the sentences. A ranking score is obtained for each sentence in the manifold-ranking process to denote the biased information richness of the sentence. Then a greedy algorithm is employed to impose a diversity penalty on each sentence. The summary is produced by choosing the sentences with both high biased information richness and high information novelty. Experiments are performed on DUC2003 and DUC2005, and the ROUGE evaluation results show that the proposed approach significantly outperforms existing approaches of the top performing systems in DUC tasks as well as baseline approaches.
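The manifold-ranking propagation itself is compact enough to sketch; the version below uses symmetric graph normalization and the usual update f ← αSf + (1 − α)y, omits the diversity-penalty step, and is an illustration rather than the exact system described here.

```python
import numpy as np

def manifold_rank(sent_sim, topic_sim, alpha=0.85, iters=100):
    """Propagate topic-biased scores over the sentence similarity graph.
    sent_sim:  (n, n) sentence-to-sentence similarity matrix
    topic_sim: (n,) similarity of each sentence to the query topic
    Returns a biased information richness score for each sentence."""
    S = np.asarray(sent_sim, dtype=float)
    np.fill_diagonal(S, 0.0)
    deg = S.sum(axis=1)
    deg[deg == 0] = 1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    S_norm = D_inv_sqrt @ S @ D_inv_sqrt                  # symmetric normalization
    y = np.asarray(topic_sim, dtype=float)
    if y.sum() > 0:
        y = y / y.sum()
    f = y.copy()
    for _ in range(iters):
        f = alpha * S_norm @ f + (1 - alpha) * y          # ranking propagation
    return f
```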

AAAI Conference 2007 Conference Paper

Single Document Summarization with Document Expansion

  • Xiaojun Wan

Existing methods for single document summarization usually make use of only the information contained in the specified document. This paper proposes the technique of document expansion to provide more knowledge to help single document summarization. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and a graph-ranking based algorithm is then applied on the expanded document set to extract sentences from the single document, making use of both the within-document relationships between sentences of the specified document and the cross-document relationships between sentences of all documents in the set. The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion. The cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.