Arrow Research search

Author name cluster

Furu Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

89 papers
2 author rows

Possible papers

89

AAAI Conference 2026 Conference Paper

Reasoning with Exploration: An Entropy Perspective

  • Daixuan Cheng
  • Shaohan Huang
  • Xuekai Zhu
  • Bo Dai
  • Xin Zhao
  • Zhenliang Zhang
  • Furu Wei

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting deeper and longer reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
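
The "one line of code" described above lends itself to a small illustration. Below is a minimal sketch, assuming a PPO/GRPO-style per-token advantage tensor; the function name, the coefficient beta, and the detach choice are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def entropy_shaped_advantage(advantages: torch.Tensor,
                             logits: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Augment per-token advantages with an entropy-based bonus (sketch).

    advantages: (batch, seq_len) advantage estimates from the RL algorithm.
    logits:     (batch, seq_len, vocab) policy logits at each generated token.
    beta:       weight of the entropy term (illustrative value).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq_len)
    # The "one line" modification: shape the advantage with the entropy signal,
    # detached so it acts as a reward-like bonus rather than a gradient path.
    return advantages + beta * entropy.detach()
```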

TMLR Journal 2026 Journal Article

Scaling Large Language Models with Fully Sparse Activations

  • Hongyu Wang
  • Shuming Ma
  • Ruiping Wang
  • Furu Wei

Activation sparsity can reduce the inference cost of large language models (LLMs) by lowering both compute and memory traffic. Yet most existing approaches sparsify only FFN intermediate states, leaving substantial portions of inference effectively dense. We study how to scale fully sparsely activated LLMs, in which every activation participating in linear transformations is sparse. We focus on two questions: how to train such models effectively, and how activation sparsity affects model quality as scale increases. We develop a pre-training recipe that enables effective training of fully sparsely activated LLMs from scratch, including using squared ReLU as the activation function, top-K sparsification, and a straight-through estimator for the remaining linear layers. Extensive experiments spanning model sizes, training-token budgets, and target sparsity levels reveal that the performance gap to dense baselines narrows with model scale and increases nonlinearly with sparsity, while remaining largely insensitive to the training-token budget. Finally, we investigate post-training activation sparsification of pre-trained dense models via both training-free techniques and supervised fine-tuning, and observe a trend similar to the pre-training experiments: larger models are more robust to sparsification and exhibit increasingly sparse activation patterns. Overall, our results provide practical training recipes and empirical guidance for building and scaling LLMs with fully sparse activations.
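
The three ingredients named in the recipe (squared ReLU, top-K sparsification, and a straight-through estimator for the remaining linear layers) can be sketched roughly as below; layer sizes, the value of k, and the per-token masking scheme are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseLinear(nn.Module):
    """Linear layer whose input activations are top-K sparsified,
    with a straight-through estimator (STE) for the mask (sketch)."""

    def __init__(self, d_in: int, d_out: int, k: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep only the k largest-magnitude activations per token.
        topk = torch.topk(x.abs(), self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, topk.indices, 1.0)
        x_sparse = x * mask
        # STE: forward uses the sparse activations, backward passes
        # gradients to the dense input as if no mask were applied.
        x_ste = x + (x_sparse - x).detach()
        return self.proj(x_ste)

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU activation, which naturally encourages sparse outputs."""
    return F.relu(x) ** 2
```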

ICLR Conference 2025 Conference Paper

ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

  • Zongyi Li
  • Shujie Hu
  • Shujie Liu 0001
  • Long Zhou
  • Jeongsoo Choi
  • Lingwei Meng
  • Xun Guo
  • Jinyu Li 0001

Text-to-video (T2V) models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive (AR) models for long (LON) video generation, integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model effectively. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact and highly quantized visual tokens, bridging the AR and DiT models and balancing learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To improve tolerance to the noise introduced by AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation, outperforming other open-source models in this domain. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. Project page: http://aka.ms/arlon.

JMLR Journal 2025 Journal Article

BitNet: 1-bit Pre-training for Large Language Models

  • Hongyu Wang
  • Shuming Ma
  • Lingxiao Ma
  • Lei Wang
  • Wenhui Wang
  • Li Dong
  • Shaohan Huang
  • Huaijie Wang

The increasing size of large language models (LLMs) has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. Previous research typically applies quantization after pre-training. While these methods avoid the need for model retraining, they often cause notable accuracy loss at extremely low bit-widths. In this work, we explore the feasibility and scalability of 1-bit pre-training. We introduce BitNet b1 and BitNet b1.58, scalable and stable 1-bit Transformer architectures designed for LLMs. Specifically, we introduce BitLinear as a drop-in replacement for the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results show that BitNet b1 achieves competitive performance compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. With ternary weights, BitNet b1.58 matches the half-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, BitNet defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. It enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
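
As a rough illustration of the BitLinear idea, the sketch below quantizes weights to ternary values ({-1, 0, +1}, the "b1.58" case) on the fly with a straight-through estimator; the actual BitNet quantization, activation handling, and scaling details may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """nn.Linear drop-in that quantizes weights to {-1, 0, +1} on the fly
    (a simplified sketch of ternary "b1.58" weights with an STE)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)          # per-tensor scale
        w_ternary = torch.round(w / scale).clamp(-1, 1) * scale
        # Straight-through estimator: quantized forward, full-precision backward.
        w_q = w + (w_ternary - w).detach()
        return F.linear(x, w_q, self.bias)
```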

NeurIPS Conference 2025 Conference Paper

Chain-of-Retrieval Augmented Generation

  • Liang Wang
  • Haonan Chen
  • Nan Yang
  • Xiaolong Huang
  • Zhicheng Dou
  • Furu Wei

This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe an improvement of more than 10 points in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
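
A schematic of the chain-of-retrieval idea at inference time is sketched below: the model alternates between reformulating a sub-query, retrieving, and accumulating evidence before answering. The callables, the stop signal, and the step limit are placeholders, not CoRAG's exact decoding strategy.

```python
from typing import Callable, List

def chain_of_retrieval(question: str,
                       generate_subquery: Callable[[str, List[str]], str],
                       retrieve: Callable[[str], List[str]],
                       generate_answer: Callable[[str, List[str]], str],
                       max_steps: int = 6) -> str:
    """Iteratively reformulate, retrieve, and reason before answering (sketch)."""
    evidence: List[str] = []
    for _ in range(max_steps):
        subquery = generate_subquery(question, evidence)
        if subquery.strip().upper() == "DONE":   # model signals it has enough evidence
            break
        evidence.extend(retrieve(subquery))
    return generate_answer(question, evidence)
```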

ICLR Conference 2025 Conference Paper

Data Selection via Optimal Control for Language Models

  • Yuxian Gu
  • Li Dong 0004
  • Hongning Wang
  • Yaru Hao
  • Qingxiu Dong
  • Furu Wei
  • Minlie Huang

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when pre-training data is limited, reducing data demand by 1.8 times, which helps mitigate the rapid exhaustion of available web-crawled corpora. Our code, model, and data can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.

ICLR Conference 2025 Conference Paper

Differential Transformer

  • Tianzhu Ye
  • Li Dong 0004
  • Yuqing Xia
  • Yutao Sun
  • Yi Zhu
  • Gao Huang 0001
  • Furu Wei

The Transformer tends to over-allocate attention to irrelevant context. In this work, we introduce the Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms the Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture for large language models.
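
The differential attention score can be illustrated with a single-head sketch: two softmax attention maps computed from two query/key projections are subtracted, with a weight on the second map. Head splitting, the learnable lambda re-parameterization, and the normalization used in the paper are omitted.

```python
import math
import torch

def differential_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    """Attention as the difference of two softmax maps (single-head sketch).

    q1, k1, q2, k2: (batch, seq, d) query/key projections for the two maps.
    v:              (batch, seq, d_v) values.
    lam:            weight on the second map (learnable in the paper).
    """
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)
    # Subtracting the second map cancels common-mode "noise" attention.
    return (a1 - lam * a2) @ v
```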

ICLR Conference 2025 Conference Paper

Generative Representational Instruction Tuning

  • Niklas Muennighoff
  • Hongjin Su
  • Liang Wang 0046
  • Nan Yang 0002
  • Furu Wei
  • Tao Yu 0009
  • Amanpreet Singh
  • Douwe Kiela

All text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT), whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM-7B is among the top models on the Massive Text Embedding Benchmark (MTEB) and outperforms various models up to its size on a range of generative tasks. By scaling up further, GritLM-8x7B achieves even stronger generative performance while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, so both can be unified with no loss in performance. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by more than 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. are freely available at https://github.com/ContextualAI/gritlm.

ICML Conference 2025 Conference Paper

Imagine While Reasoning in Space: Multimodal Visualization-of-Thought

  • Chengzu Li
  • Wenshan Wu
  • Huanyu Zhang
  • Yan Xia 0005
  • Shaoguang Mao
  • Li Dong 0004
  • Ivan Vulic
  • Furu Wei

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet it struggles with complex spatial reasoning tasks. Human cognition, however, extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce a token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.

ICLR Conference 2025 Conference Paper

Preference Optimization for Reasoning with Pseudo Feedback

  • Fangkai Jiao
  • Geyang Guo
  • Xingxing Zhang
  • Nancy F. Chen
  • Shafiq Joty
  • Furu Wei

Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. On GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.3 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.
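
One way to picture the test-case framing: each candidate solution is scored by how many associated test cases it passes, and higher-scoring solutions are preferred over lower-scoring ones when building preference pairs. The sketch below is a toy version for code-style tasks; run_solution and the pairing rule are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def score_by_test_cases(solution: str,
                        test_cases: List[Tuple[str, str]],
                        run_solution: Callable[[str, str], str]) -> float:
    """Fraction of test cases a candidate solution passes (pseudo feedback)."""
    passed = sum(run_solution(solution, inp) == expected
                 for inp, expected in test_cases)
    return passed / max(len(test_cases), 1)

def build_preference_pairs(solutions: List[str],
                           test_cases: List[Tuple[str, str]],
                           run_solution: Callable[[str, str], str]):
    """Turn pseudo feedback into (chosen, rejected) pairs for DPO-style training."""
    scored = sorted(((score_by_test_cases(s, test_cases, run_solution), s)
                     for s in solutions), reverse=True)
    best, worst = scored[0][1], scored[-1][1]
    return [(best, worst)] if scored[0][0] > scored[-1][0] else []
```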

NeurIPS Conference 2025 Conference Paper

Reward Reasoning Models

  • Jiaxin Guo
  • Zewen Chi
  • Li Dong
  • Qingxiu Dong
  • Xun Wu
  • Shaohan Huang
  • Furu Wei

Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained models are available at https://huggingface.co/Reward-Reasoning.

ICLR Conference 2025 Conference Paper

Scaling Optimal LR Across Token Horizons

  • Johan Björck
  • Alon Benhaim
  • Vishrav Chaudhary
  • Furu Wei
  • Xia Song

State-of-the-art LLMs are powered by scaling -- scaling model size, training tokens, and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in muP. However, hyperparameter transfer across training tokens -- or token horizon -- has not been studied yet. To remedy this, we conduct a large-scale empirical study on how the optimal learning rate (LR) depends on the token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with the token horizon -- longer training necessitates a smaller LR. Secondly, we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule of thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that LLaMA-1 used a learning rate that was too high, and thus argue that hyperparameter transfer across data size is an overlooked component of LLM training.
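
The claim that the optimal LR follows a scaling law in the token horizon can be illustrated with a back-of-the-envelope fit: estimate a power law lr(T) = a * T^(-b) from small-horizon sweeps and extrapolate to a longer horizon. The functional form and the numbers below are assumptions for illustration, not the paper's fitted law.

```python
import numpy as np

# Hypothetical (token_horizon, tuned_optimal_lr) pairs from small-scale sweeps.
horizons = np.array([1e9, 4e9, 16e9])
optimal_lrs = np.array([6e-4, 4e-4, 2.7e-4])

# Fit log(lr) = log(a) - b * log(T), i.e. a power law lr(T) = a * T**(-b).
slope, intercept = np.polyfit(np.log(horizons), np.log(optimal_lrs), 1)
a, b = np.exp(intercept), -slope

def predict_lr(token_horizon: float) -> float:
    """Extrapolate the optimal LR to a longer token horizon."""
    return a * token_horizon ** (-b)

print(predict_lr(100e9))  # estimated LR for a (hypothetical) 100B-token run
```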

ICLR Conference 2025 Conference Paper

Self-Boosting Large Language Models with Synthetic Preference Data

  • Qingxiu Dong
  • Li Dong 0004
  • Xingxing Zhang 0002
  • Zhifang Sui
  • Furu Wei

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.

ICLR Conference 2025 Conference Paper

Semi-Parametric Retrieval via Binary Bag-of-Tokens Index

  • Jiawei Zhou 0003
  • Li Dong 0004
  • Furu Wei
  • Lei Chen 0002

Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples the retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
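
As a toy illustration of a non-parametric bag-of-tokens index: documents are stored as binary vectors over the vocabulary and queries are scored by token overlap, with no neural parameters involved at indexing time. The whitespace tokenization and overlap scoring below are deliberately simplistic stand-ins, not SiDR's actual design.

```python
import numpy as np
from typing import Dict, List

def build_vocab(corpus: List[str]) -> Dict[str, int]:
    vocab: Dict[str, int] = {}
    for doc in corpus:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def binary_bag_of_tokens(text: str, vocab: Dict[str, int]) -> np.ndarray:
    """Binary presence vector over the vocabulary."""
    vec = np.zeros(len(vocab), dtype=np.int8)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

corpus = ["neural retrieval with dense embeddings",
          "binary bag of tokens index for search"]
vocab = build_vocab(corpus)
index = np.stack([binary_bag_of_tokens(d, vocab) for d in corpus])  # (docs, vocab)

query = binary_bag_of_tokens("binary tokens index", vocab)
scores = index @ query                     # token-overlap counts per document
print(scores, np.argsort(-scores))         # ranking by descending overlap
```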

TMLR Journal 2025 Journal Article

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

  • Haoran Li
  • Qingxiu Dong
  • Zhengyang Tang
  • Chaojun Wang
  • Xingxing Zhang
  • Haoyang Huang
  • Shaohan Huang
  • Xiaolong Huang

We introduce Generalized Instruction Tuning (GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields, and ultimately distinct disciplines, semi-automatically and facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization, and new fields or skills can be added by simply incorporating a new node into our taxonomy. While promising, our approach may inherit biases or inaccuracies from LLM-generated data, as in other synthetic-data work, and is primarily evaluated on exam-style benchmarks. Broader evaluations and data quality control are left for future work.

NeurIPS Conference 2025 Conference Paper

Think Only When You Need with Large Hybrid-Reasoning Models

  • Lingjie Jiang
  • Xun Wu
  • Shaohan Huang
  • Qingxiu Dong
  • Zewen Chi
  • Li Dong
  • Xingxing Zhang
  • Tengchao Lv

Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform reasoning based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate reasoning mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model’s capability for hybrid reasoning. Extensive experimental results show that LHRMs can adaptively perform hybrid reasoning on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended reasoning processes and provides a solid starting point for building hybrid reasoning systems.

NeurIPS Conference 2025 Conference Paper

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

  • Wenkai Yang
  • Shuming Ma
  • Yankai Lin
  • Furu Wei

Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current research continues to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with the teacher model QwQ-32B-Preview that produces the seed data.
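
The self-improvement step can be read roughly as: sample responses at several reasoning-effort levels, keep the shortest correct one, and fine-tune on it. A schematic sketch follows; sample_response, is_correct, and the effort labels are placeholders.

```python
from typing import Callable, List, Optional

def shortest_correct_response(problem: str,
                              efforts: List[str],
                              sample_response: Callable[[str, str], str],
                              is_correct: Callable[[str, str], bool]) -> Optional[str]:
    """Pick the shortest correct response across reasoning-effort levels."""
    correct: List[str] = []
    for effort in efforts:                    # e.g. ["short", "medium", "long"]
        response = sample_response(problem, effort)
        if is_correct(problem, response):
            correct.append(response)
    # Shortest correct response becomes a self-improvement training target.
    return min(correct, key=len) if correct else None
```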

ICLR Conference 2024 Conference Paper

Adapting Large Language Models via Reading Comprehension

  • Daixuan Cheng
  • Shaohan Huang
  • Furu Wei

We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension -- practice after reading improves the ability to answer questions based on the learned knowledge -- we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data are available at https://github.com/microsoft/LMOps.

NeurIPS Conference 2024 Conference Paper

Boosting Text-to-Video Generative Model with MLLMs Feedback

  • Xun Wu
  • Shaohan Huang
  • Guolong Wang
  • Jing Xiong
  • Furu Wei

Recent advancements in text-to-video generative models, such as Sora, have showcased impressive capabilities. These models have attracted significant interest for their potential applications. However, they often rely on extensive datasets of variable quality, which can result in generated videos that lack aesthetic appeal and do not accurately reflect the input text prompts. A promising approach to mitigate these issues is to leverage Reinforcement Learning from Human Feedback (RLHF), which aims to align the outputs of text-to-video generative models with human preferences. However, the considerable costs associated with manual annotation have led to a scarcity of comprehensive preference datasets. In response to this challenge, our study begins by investigating the efficacy of annotations generated by Multimodal Large Language Models (MLLMs) in capturing video preferences, discovering a high degree of concordance with human judgments. Building upon this finding, we utilize MLLMs to perform fine-grained video preference annotations across two dimensions, resulting in the creation of VideoPrefer, which includes 135,000 preference annotations. Utilizing this dataset, we introduce VideoRM, the first general-purpose reward model tailored for video preference in the text-to-video domain. Our comprehensive experiments confirm the effectiveness of both VideoPrefer and VideoRM, representing a significant step forward in the field.

ICLR Conference 2024 Conference Paper

Grounding Multimodal Large Language Models to the World

  • Zhiliang Peng
  • Wenhui Wang 0003
  • Li Dong 0004
  • Yaru Hao
  • Shaohan Huang
  • Shuming Ma
  • Qixiang Ye
  • Furu Wei

We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent text spans (i.e., referring expressions and noun phrases) as links in Markdown, i.e., [text span](bounding boxes), where object descriptions are sequences of location tokens. To train the model, we construct a large-scale dataset of grounded image-text pairs (GrIT) together with multimodal corpora. Kosmos-2 integrates the grounding capability into downstream applications, while maintaining the conventional capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning). Kosmos-2 is evaluated on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This study sheds light on the big convergence of language, multimodal perception, and world modeling, which is a key step toward artificial general intelligence. Code can be found at https://aka.ms/kosmos-2.
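
The Markdown-link convention for grounding can be illustrated with a small formatter that renders a text span plus a normalized box as [span](location tokens); the token naming and bin count below are assumptions, not Kosmos-2's exact location-token scheme.

```python
from typing import Tuple

def box_to_location_tokens(box: Tuple[float, float, float, float],
                           num_bins: int = 32) -> str:
    """Discretize a normalized (x1, y1, x2, y2) box into location tokens.

    The token naming and bin count are illustrative, not the paper's exact scheme.
    """
    def bin_idx(v: float) -> int:
        return min(int(v * num_bins), num_bins - 1)
    x1, y1, x2, y2 = box
    tl = bin_idx(y1) * num_bins + bin_idx(x1)   # top-left patch index
    br = bin_idx(y2) * num_bins + bin_idx(x2)   # bottom-right patch index
    return f"<loc_{tl}><loc_{br}>"

def ground_span(text_span: str, box: Tuple[float, float, float, float]) -> str:
    """Render a grounded span as a Markdown-style link: [span](location tokens)."""
    return f"[{text_span}]({box_to_location_tokens(box)})"

print(ground_span("a snowman", (0.12, 0.20, 0.55, 0.88)))
```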

ICLR Conference 2024 Conference Paper

In-context Autoencoder for Context Compression in a Large Language Model

  • Tao Ge 0001
  • Jing Hu 0001
  • Lei Wang
  • Xun Wang 0012
  • Siqing Chen
  • Furu Wei

We propose the In-context Autoencoder (ICAE), leveraging the power of a large language model (LLM) to compress a long context into short, compact memory slots that can be directly conditioned on by the LLM for various purposes. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context. It is then fine-tuned on instruction data to produce desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing about 1% additional parameters, effectively achieves 4x context compression based on Llama, offering advantages in both latency and GPU memory cost during inference, and providing an interesting insight into memorization as well as potential for scalability. These promising results imply a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, revealing ICAE's significant implications for addressing the long-context problem and suggesting further research in LLM context management. Our data, code, and models are available at https://github.com/getao/icae.

IROS Conference 2024 Conference Paper

KOSMOS-E: Learning to Follow Instruction for Robotic Grasping

  • Zhi Wang
  • Xun Wu
  • Shaohan Huang
  • Li Dong 0004
  • Wenhui Wang 0003
  • Shuming Ma
  • Furu Wei

Tuning on instruction-following data has been shown to enhance the capabilities and controllability of language models, but the idea is less explored in the robotics field. In this work, we introduce KOSMOS-E, a Multimodal Large Language Model (MLLM) that leverages instruction-following robotic grasping data to enhance capabilities for precise and intricate robotic grasping maneuvers. To achieve this, we craft a large-scale instruction-following robotic grasping dataset, termed INSTRUCT-GRASP, primarily comprising two aspects: (i) grasping a single object following varying levels of granularity in descriptions, e.g., different angles and aspects, and (ii) grasping a specific object within a multi-object environment following specific attributes, e.g., color and shape. Extensive experiments show the effectiveness of KOSMOS-E on robotic grasping tasks across a variety of environments.

ICLR Conference 2024 Conference Paper

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

  • Xichen Pan
  • Li Dong 0004
  • Shaohan Huang
  • Zhiliang Peng
  • Wenhu Chen
  • Furu Wei

Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation."

AAAI Conference 2024 Conference Paper

Learning to Rank in Generative Retrieval

  • Yongqi Li
  • Nan Yang
  • Liang Wang
  • Furu Wei
  • Wenjie Li

Generative retrieval stands out as a promising new paradigm in text retrieval that aims to generate identifier strings of relevant passages as the retrieval target. This generative paradigm taps into powerful generative language models, distinct from traditional sparse or dense retrieval methods. However, only learning to generate is insufficient for generative retrieval. Generative retrieval learns to generate identifiers of relevant passages as an intermediate goal and then converts predicted identifiers into the final passage rank list. The disconnect between the learning objective of autoregressive models and the desired passage ranking target leads to a learning gap. To bridge this gap, we propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. LTRGR enables generative retrieval to learn to rank passages directly, optimizing the autoregressive model toward the final passage ranking target via a rank loss. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems and does not add any burden to the inference stage. We conducted experiments on three public benchmarks, and the results demonstrate that LTRGR achieves state-of-the-art performance among generative retrieval methods. The code and checkpoints are released at https://github.com/liyongqi67/LTRGR.
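
The rank loss can be sketched as a margin loss over the scores a generative retriever assigns to passage identifiers: the score of a relevant passage's identifier should exceed that of an irrelevant one by a margin. The margin value and score definition below are illustrative, not LTRGR's exact loss.

```python
import torch
import torch.nn.functional as F

def margin_rank_loss(pos_scores: torch.Tensor,
                     neg_scores: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Passage-level rank loss over autoregressive identifier scores (sketch).

    pos_scores / neg_scores: (batch,) summed log-probabilities that the
    generative retriever assigns to identifiers of relevant / irrelevant passages.
    """
    # Hinge on the score gap: push relevant identifiers above irrelevant ones.
    return F.relu(margin - (pos_scores - neg_scores)).mean()
```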

ICML Conference 2024 Conference Paper

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

  • Zhengyang Tang
  • Xingxing Zhang 0002
  • Benyou Wang
  • Furu Wei

Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To comprehensively evaluate the mathematical reasoning abilities of LLMs, we construct MWPBench, a benchmark of Math Word Problems, which is a collection of 9 datasets (including GSM8K and MATH) covering K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.8% in micro average accuracy and 43.6% in macro average accuracy, respectively.

NeurIPS Conference 2024 Conference Paper

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  • Wenshan Wu
  • Shaoguang Mao
  • Yadong Zhang
  • Yan Xia
  • Li Dong
  • Lei Cui
  • Furu Wei

Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning in LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) on these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and code on our project page.

ICLR Conference 2024 Conference Paper

MiniLLM: Knowledge Distillation of Large Language Models

  • Yuxian Gu
  • Li Dong 0004
  • Furu Wei
  • Minlie Huang

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or to training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and better long-text generation performance than the baselines. Our method is scalable across model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/minillm.
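
The forward-versus-reverse KLD distinction can be made concrete at the token level: forward KD minimizes KL(teacher || student), while reverse KD minimizes KL(student || teacher), penalizing the student for placing mass where the teacher has little. The sketch below operates on logits; MiniLLM's policy-gradient-style optimization of this objective is not shown.

```python
import torch
import torch.nn.functional as F

def forward_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(p_teacher || p_student): the standard (forward) KD objective."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(-1).mean()

def reverse_kld(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(p_student || p_teacher): discourages the student from covering
    low-probability regions of the teacher distribution."""
    p_s = F.softmax(student_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (p_s * (log_p_s - log_p_t)).sum(-1).mean()
```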

ICLR Conference 2024 Conference Paper

Mixture of LoRA Experts

  • Xun Wu
  • Shaohan Huang
  • Furu Wei

LoRA has gained widespread acceptance in the fine-tuning of large pre-trained models to cater to a diverse array of downstream tasks, showcasing notable effectiveness and efficiency, thereby solidifying its position as one of the most prevalent fine-tuning techniques. Due to the modular nature of LoRA's plug-and-play plugins, researchers have delved into the amalgamation of multiple LoRAs to empower models to excel across various downstream tasks. Nonetheless, extant approaches for LoRA fusion grapple with inherent challenges. Direct arithmetic merging may result in the loss of the original pre-trained model's generative capabilities or the distinct identity of LoRAs, thereby yielding suboptimal outcomes. On the other hand, reference tuning-based fusion exhibits limitations concerning the requisite flexibility for the effective combination of multiple LoRAs. In response to these challenges, this paper introduces the Mixture of LoRA Experts (MoLE) approach, which harnesses hierarchical control and unfettered branch selection. The MoLE approach not only achieves superior LoRA fusion performance in comparison to direct arithmetic merging but also retains the crucial flexibility for combining LoRAs effectively. Extensive experimental evaluations conducted in both the Natural Language Processing (NLP) and Vision & Language (V&L) domains substantiate the efficacy of MoLE.

NeurIPS Conference 2024 Conference Paper

Multi-Head Mixture-of-Experts

  • Xun Wu
  • Shaohan Huang
  • Wenhui Wang
  • Shuming Ma
  • Li Dong
  • Furu Wei

Sparse Mixture-of-Experts (SMoE) scales model capacity without significant increases in computational costs. However, it exhibits the low expert activation issue, i.e., only a small subset of experts are activated for optimization, leading to suboptimal performance and limiting its effectiveness in learning a larger number of experts in complex tasks. In this paper, we propose Multi-Head Mixture-of-Experts (MH-MoE). MH-MoE splits each input token into multiple sub-tokens; these sub-tokens are then assigned to and processed by a diverse set of experts in parallel and seamlessly reintegrated into the original token form. These operations enable MH-MoE to significantly enhance expert activation while collectively attending to information from various representation spaces within different experts to deepen context understanding. Moreover, MH-MoE is straightforward to implement and is decoupled from other SMoE frameworks, making it easy to integrate with them for enhanced performance. Extensive experimental results across different parameter scales (300M to 7B) and three pre-training tasks -- English-focused language modeling, multilingual language modeling, and masked multi-modality modeling -- along with multiple downstream validation tasks demonstrate the effectiveness of MH-MoE.
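
A shape-level sketch of the split-route-merge idea: each token's hidden vector is split into sub-tokens, each sub-token is routed to an expert, and the processed sub-tokens are concatenated back to the original dimension. The top-1 router, the expert MLPs, and the dense loop over experts below are simplifications, not MH-MoE's exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadMoE(nn.Module):
    """Minimal multi-head MoE sketch: split each token into sub-tokens,
    route each sub-token to one expert (top-1), then merge back."""

    def __init__(self, d_model: int, num_heads: int, num_experts: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        d_sub = d_model // num_heads
        self.router = nn.Linear(d_sub, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_sub, 4 * d_sub), nn.GELU(),
                           nn.Linear(4 * d_sub, d_sub)) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        b, s, d = x.shape
        sub = x.reshape(b, s, self.num_heads, d // self.num_heads)
        gate = self.router(sub).softmax(dim=-1)             # (b, s, heads, experts)
        gate_val, expert_idx = gate.max(dim=-1)              # top-1 routing per sub-token
        out = torch.zeros_like(sub)
        # Dense loop over experts for clarity; real implementations dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (expert_idx == e).unsqueeze(-1).float()   # (b, s, heads, 1)
            out = out + mask * gate_val.unsqueeze(-1) * expert(sub)
        return out.reshape(b, s, d)                           # merge sub-tokens back
```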

NeurIPS Conference 2024 Conference Paper

Multimodal Large Language Models Make Text-to-Image Generative Models Align Better

  • Xun Wu
  • Shaohan Huang
  • Guolong Wang
  • Jing Xiong
  • Furu Wei

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects -- prompt-following, aesthetics, fidelity, and harmlessness -- to construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to fine-tune generative models in order to evaluate the performance of VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetics, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.

ICLR Conference 2024 Conference Paper

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

  • Dawei Zhu
  • Nan Yang 0002
  • Liang Wang 0046
  • Yifan Song 0002
  • Wenhao Wu
  • Furu Wei
  • Sujian Li

Large Language Models (LLMs) are trained with a pre-defined context length, restricting their use in scenarios requiring long inputs. Previous efforts to adapt LLMs to a longer length usually require fine-tuning at this target length (full-length fine-tuning), incurring an intensive training cost. To decouple training length from target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training, which smartly simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. These bias terms and the lengths of each chunk are altered for every training example, allowing the model to adapt to all positions within the target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage at inference time. With ongoing progress in efficient inference, we believe PoSE can further scale the context window beyond 128k.
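
The position-index manipulation can be sketched directly: a short training window is split into chunks, and each chunk's position ids are offset by a random skipping bias so that, across examples, the ids cover the full target length. The chunking and sampling details below are assumptions for illustration.

```python
import random
from typing import List

def pose_position_ids(train_len: int, target_len: int, num_chunks: int = 2) -> List[int]:
    """Simulate long-context positions inside a short training window (PoSE-style sketch).

    Splits the train_len window into chunks and adds a random skipping bias
    to each chunk, so position ids can span [0, target_len) across examples.
    """
    # Random chunk boundaries inside the training window.
    bounds = sorted(random.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + bounds + [train_len]

    skip_budget = target_len - train_len
    position_ids, offset = [], 0
    for i in range(num_chunks):
        start, end = bounds[i], bounds[i + 1]
        # Random extra skip before this chunk (keep some budget for later chunks).
        skip = random.randint(0, skip_budget)
        skip_budget -= skip
        offset += skip
        position_ids.extend(range(start + offset, end + offset))
    return position_ids   # len == train_len, every id < target_len

print(pose_position_ids(train_len=8, target_len=32))
```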

AAAI Conference 2024 Conference Paper

Text Diffusion with Reinforced Conditioning

  • Yuxuan Liu
  • Tianchi Yang
  • Shaohan Huang
  • Zihan Zhang
  • Haizhen Huang
  • Furu Wei
  • Weiwei Deng
  • Feng Sun

Diffusion models have demonstrated exceptional capability in generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they provide a strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to a challenge in handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. Motivated by our findings, we propose a novel Text Diffusion model called TReC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TReC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.

NeurIPS Conference 2024 Conference Paper

xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token

  • Xin Cheng
  • Xun Wang
  • Xingxing Zhang
  • Tao Ge
  • Si-Qing Chen
  • Furu Wei
  • Huishuai Zhang
  • Dongyan Zhao

This paper introduces xRAG, an innovative context compression method tailored for retrieval-augmented generation. xRAG reinterprets document embeddings in dense retrieval -- traditionally used solely for retrieval -- as features from the retrieval modality. By employing a modality fusion methodology, xRAG seamlessly integrates these embeddings into the language model representation space, effectively eliminating the need for their textual counterparts and achieving an extreme compression rate. In xRAG, the only trainable component is the modality bridge, while both the retriever and the language model remain frozen. This design choice allows for the reuse of offline-constructed document embeddings and preserves the plug-and-play nature of retrieval augmentation. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks, adaptable to various language model backbones ranging from a dense 7B model to an 8x7B Mixture of Experts configuration. xRAG not only significantly outperforms previous context compression methods but also matches the performance of uncompressed models on several datasets, while reducing overall FLOPs by a factor of 3.53. Our work pioneers new directions in retrieval-augmented generation from the perspective of multimodality fusion, and we hope it lays the foundation for future efficient and scalable retrieval-augmented systems.

NeurIPS Conference 2024 Conference Paper

You Only Cache Once: Decoder-Decoder Architectures for Language Models

  • Yutao Sun
  • Li Dong
  • Yi Zhu
  • Shaohan Huang
  • Wenhui Wang
  • Shuming Ma
  • Quanlu Zhang
  • Jianyong Wang

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to exit early without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to the Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to a 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes.

TMLR Journal 2023 Journal Article

A Unified View of Masked Image Modeling

  • Zhiliang Peng
  • Li Dong
  • Hangbo Bao
  • Furu Wei
  • Qixiang Ye

Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves comparable or superior performance to state-of-the-art methods. Using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8 semantic segmentation mIoU on ADE20k (512 size). Code is enclosed in the supplementary materials.

ICLR Conference 2023 Conference Paper

Are More Layers Beneficial to Graph Transformers?

  • Haiteng Zhao
  • Shuming Ma
  • Dongdong Zhang 0001
  • Zhi-Hong Deng 0001
  • Furu Wei

Although going deep has proven successful in many neural architectures, existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from a bottleneck when trying to improve performance by increasing depth. Our further analysis reveals that the reason is that deep graph transformers are limited by the vanishing capacity of global attention, restricting the graph transformer from focusing on the critical substructure and obtaining expressive features. To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation, and applies local attention on related nodes to obtain substructure-based attention encoding. Our model enhances the ability of the global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method unblocks the depth limitation of graph transformers and achieves state-of-the-art performance across various graph benchmarks with deeper models.

NeurIPS Conference 2023 Conference Paper

Augmenting Language Models with Long-Term Memory

  • Weizhi Wang
  • Li Dong
  • Hao Cheng
  • Xiaodong Liu
  • Xifeng Yan
  • Jianfeng Gao
  • Furu Wei

Existing large language models (LLMs) can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long history. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LongMem can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping language models memorize and utilize long-form content.

ICML Conference 2023 Conference Paper

BEATs: Audio Pre-Training with Acoustic Tokenizers

  • Sanyuan Chen
  • Yu Wu 0012
  • Chengyi Wang 0002
  • Shujie Liu 0001
  • Daniel Tompkins
  • Zhuo Chen 0006
  • Wanxiang Che
  • Xiangzhan Yu

We introduce a self-supervised learning (SSL) framework, BEATs, for general audio representation pre-training, in which we optimize an acoustic tokenizer and an audio SSL model through iterations. Unlike previous audio SSL models that employ a reconstruction loss for pre-training, our audio SSL model is trained with a discrete label prediction task, where the labels are generated by a semantic-rich acoustic tokenizer. We propose an iterative pipeline to jointly optimize the tokenizer and the pre-trained model, aiming to abstract high-level semantics and discard redundant details for audio. The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics and that our audio SSL models achieve state-of-the-art (SOTA) results across various audio classification benchmarks, even outperforming previous models that use significantly more training data and model parameters. Specifically, we set a new SOTA mAP of 50.6% on AudioSet-2M without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.

ICLR Conference 2023 Conference Paper

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

  • Yuxin Fang
  • Li Dong 0004
  • Hangbo Bao
  • Xinggang Wang
  • Furu Wei

We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial [MASK] tokens, where some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns to either recover all the original image pixels, or predict whether each visual token is replaced by a generator sample or not. The generator and the enhancer are simultaneously trained and synergistically updated. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that is suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experimental results show that our approach achieves compelling results in vision benchmarks, such as ImageNet classification and ADE20K semantic segmentation.

NeurIPS Conference 2023 Conference Paper

Extensible Prompts for Language Models on Zero-shot Language Style Customization

  • Tao Ge
  • Hu Jing
  • Li Dong
  • Shaoguang Mao
  • Yan Xia
  • Xun Wang
  • Si-Qing Chen
  • Furu Wei

We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompts, which are designed to fit in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment with X-Prompt on zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs.
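How imaginary words can be registered is easiest to see as an extension of the tokenizer and embedding table. The sketch below adds new tokens, resizes the embedding matrix, and freezes the backbone; the GPT-2 checkpoint, the token strings, and the trainable/frozen split are illustrative assumptions, and the paper's context-augmented learning objective is not reproduced here.

```python
# Hypothetical registration of "imaginary words"; the model choice and tokens are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

new_words = ["<style-token-1>", "<style-token-2>"]   # placeholder imaginary words
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))        # adds embedding rows for the new entries

for p in model.parameters():                         # freeze the backbone LLM
    p.requires_grad = False
embeddings = model.get_input_embeddings().weight
embeddings.requires_grad = True                      # re-enable the embedding table;
# in practice, gradients for the original rows would be zeroed (e.g. with a hook)
# so that only the newly registered imaginary-word vectors are learned.
```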

NeurIPS Conference 2023 Conference Paper

Language Is Not All You Need: Aligning Perception with Language Models

  • Shaohan Huang
  • Li Dong
  • Wenhui Wang
  • Yaru Hao
  • Saksham Singhal
  • Shuming Ma
  • Tengchao Lv
  • Lei Cui

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train KOSMOS-1 from scratch on web-scale multi-modal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ tests, which diagnoses the nonverbal reasoning capability of MLLMs.

ICML Conference 2023 Conference Paper

Magneto: A Foundation Transformer

  • Hongyu Wang 0009
  • Shuming Ma
  • Shaohan Huang
  • Li Dong 0004
  • Wenhui Wang 0003
  • Zhiliang Peng
  • Yu Wu
  • Payal Bajaj

A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).

AAAI Conference 2023 Conference Paper

MoEC: Mixture of Expert Clusters

  • Yuan Xie
  • Shaohan Huang
  • Tianyu Chen
  • Furu Wei

Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE models convert dense layers into sparse experts, and utilize a gated routing network to make experts conditionally activated. However, as the number of experts grows, MoE with an enormous number of parameters suffers from overfitting and sparse data allocation. Such problems are especially severe on tasks with limited data, thus hindering the progress towards improving performance by scaling up. We verify that there exists a performance upper bound of scaling up sparse MoE. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach to enable expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. Building on this, we further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments reveal that MoEC could improve performance on machine translation and natural language understanding tasks. MoEC plays a positive role in mitigating overfitting and sparse data allocation problems, thus fully releasing the potential of large-scale sparse models.

NeurIPS Conference 2023 Conference Paper

On the Pareto Front of Multilingual Neural Machine Translation

  • Liang Chen
  • Shuming Ma
  • Dongdong Zhang
  • Furu Wei
  • Baobao Chang

In this work, we study how the performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find it interesting that the performance of a given translation direction does not always improve with the increase of its weight in the multi-task optimization objective. Accordingly, the scalarization method leads to a multitask trade-off front that deviates from the traditional Pareto front when there is data imbalance in the training corpus, which poses a great challenge to improving the overall performance of all directions. Based on our observations, we propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across various languages, data adequacy, and the number of tasks. Finally, we formulate the sample ratio selection problem in MNMT as an optimization problem based on the Double Power Law. Extensive experiments show that it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget. We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction.

NeurIPS Conference 2023 Conference Paper

Optimizing Prompts for Text-to-Image Generation

  • Yaru Hao
  • Zewen Chi
  • Li Dong
  • Furu Wei

Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts.

ICLR Conference 2023 Conference Paper

Prototypical Calibration for Few-shot Learning of Language Models

  • Zhixiong Han
  • Yaru Hao
  • Li Dong 0004
  • Yutao Sun
  • Furu Wei

In-context learning of GPT-like models has been recognized as fragile across different hand-crafted templates and demonstration permutations. In this work, we propose prototypical calibration to adaptively learn a more robust decision boundary for zero- and few-shot classification, instead of greedy decoding. Concretely, our method first adopts a Gaussian mixture distribution to estimate the prototypical clusters for all categories. Then we assign each cluster to the corresponding label by solving a weighted bipartite matching problem. Given an example, its prediction is calibrated by the likelihood of the prototypical clusters. Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.
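The two calibration steps named in the abstract, Gaussian-mixture estimation of prototypical clusters and a weighted bipartite matching of clusters to labels, map naturally onto off-the-shelf tools. The sketch below is one plausible rendering, assuming `probs` holds the LM's label-word probabilities for a set of unlabeled examples; the cost definition and hyperparameters are illustrative, not the paper's exact formulation.

```python
# Hypothetical prototypical-calibration sketch; cost matrix and settings are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.optimize import linear_sum_assignment

def fit_prototypes(probs, n_classes):
    # probs: [N, C] label probabilities produced by the LM on unlabeled examples.
    gmm = GaussianMixture(n_components=n_classes, random_state=0).fit(probs)
    resp = gmm.predict_proba(probs)                  # [N, K] cluster responsibilities
    # Cost of assigning cluster k to label c: negative responsibility-weighted probability of c.
    cost = -(resp.T @ probs)                         # [K, C]
    clusters, labels = linear_sum_assignment(cost)   # weighted bipartite matching
    return gmm, dict(zip(clusters, labels))

def calibrated_predict(gmm, cluster_to_label, probs):
    resp = gmm.predict_proba(probs)                  # likelihood of each prototypical cluster
    label_scores = np.zeros_like(probs)
    for k, c in cluster_to_label.items():            # re-map cluster scores onto labels
        label_scores[:, c] = resp[:, k]
    return label_scores.argmax(axis=1)               # calibrated prediction per example
```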

NeurIPS Conference 2023 Conference Paper

TextDiffuser: Diffusion Models as Text Painters

  • Jingye Chen
  • Yupan Huang
  • Tengchao Lv
  • Lei Cui
  • Qifeng Chen
  • Furu Wei

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we demonstrate that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. We will make the code, model and dataset publicly available.

AAAI Conference 2023 Conference Paper

TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

  • Minghao Li
  • Tengchao Lv
  • Jingye Chen
  • Lei Cui
  • Yijuan Lu
  • Dinei Florencio
  • Cha Zhang
  • Zhoujun Li

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.
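Since the abstract notes that the models are publicly released, a minimal usage sketch with the Hugging Face transformers wrappers may help; the checkpoint name and the single text-line image input are assumptions about how the released models are packaged.

```python
# Hypothetical usage sketch with a released TrOCR checkpoint via transformers.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("text_line.png").convert("RGB")        # a cropped text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)              # wordpiece-level autoregressive decoding
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```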

ICLR Conference 2023 Conference Paper

Visually-Augmented Language Modeling

  • Weizhi Wang
  • Li Dong 0004
  • Hao Cheng 0002
  • Haoyu Song 0002
  • Xiaodong Liu 0003
  • Xifeng Yan
  • Jianfeng Gao 0001
  • Furu Wei

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on the text-only self-supervised training with massive text data, which precludes them from utilizing relevant visual information when necessary. To address this, we propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling. Specifically, VaLM builds on a novel latent text-image alignment method via an image retrieval module to fetch corresponding images given a textual context. With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending on both text context and visual knowledge in images. We evaluate VaLM on various visual knowledge intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VaLM outperforms all strong language-only and vision-language baselines with substantial gains on reasoning object commonsense including color, size, and shape.

IJCAI Conference 2022 Conference Paper

A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model

  • Xin Sun
  • Tao Ge
  • Shuming Ma
  • Jingjing Li
  • Furu Wei
  • Houfeng Wang

Synthetic data construction of Grammatical Error Correction (GEC) for non-English languages relies heavily on human-designed and language-specific rules, which produce limited error-corrected patterns. In this paper, we propose a generic and language-independent strategy for multilingual GEC, which can train a GEC system effectively for a new non-English language with only two easy-to-access resources: 1) a pre-trained cross-lingual language model (PXLM) and 2) parallel translation data between English and the language. Our approach creates diverse parallel GEC data without any language-specific operations by taking the non-autoregressive translation generated by PXLM and the gold translation as error-corrected sentence pairs. Then, we reuse PXLM to initialize the GEC model and pre-train it with the synthetic data generated by itself, which yields further improvement. We evaluate our approach on three public benchmarks of GEC in different languages. It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian). Further analysis demonstrates that our data construction method is complementary to rule-based approaches.

ICLR Conference 2022 Conference Paper

BEiT: BERT Pre-Training of Image Transformers

  • Hangbo Bao
  • Li Dong 0004
  • Songhao Piao
  • Furu Wei

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16 x 16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods.
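The masked image modeling objective parallels masked language modeling: corrupt a subset of patch positions and predict the discrete visual token at each masked position. Below is a minimal sketch under that reading; the tokenizer, backbone, prediction head, and mask ratio are placeholders rather than the paper's actual components.

```python
# Hypothetical masked-image-modeling loss; components and mask ratio are assumptions.
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(patch_embeddings, visual_tokens, backbone, head,
                               mask_token, mask_ratio=0.4):
    # patch_embeddings: [batch, num_patches, dim]; visual_tokens: [batch, num_patches] token ids.
    b, n, d = patch_embeddings.shape
    mask = torch.rand(b, n, device=patch_embeddings.device) < mask_ratio
    # Replace masked patch embeddings with a learnable mask token.
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(b, n, d), patch_embeddings)
    logits = head(backbone(corrupted))               # [batch, num_patches, visual_vocab_size]
    # Cross-entropy only over the masked positions, as in masked language modeling.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```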

IJCAI Conference 2022 Conference Paper

High-resource Language-specific Training for Multilingual Neural Machine Translation

  • Jian Yang
  • Yuwei Yin
  • Shuming Ma
  • Dongdong Zhang
  • Zhoujun Li
  • Furu Wei

Multilingual neural machine translation (MNMT) trained in multiple language pairs has attracted considerable attention due to fewer model parameters and lower training costs by sharing knowledge among multiple languages. Nonetheless, multilingual training is plagued by language interference degeneration in shared parameters because of the negative interference among different translation directions, especially on high-resource languages. In this paper, we propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference, which adopts the two-stage training with the language-specific selection mechanism. Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder to enhance the translation quality of high-resource directions. Next, the model is further trained on all available corpora to transfer knowledge from high-resource languages (HRLs) to low-resource languages (LRLs). Experimental results show that HLT-MT outperforms various strong baselines on WMT-10 and OPUS-100 benchmarks. Furthermore, the analytic experiments validate the effectiveness of our method in mitigating the negative interference in multilingual training.

NeurIPS Conference 2022 Conference Paper

On the Representation Collapse of Sparse Mixture of Experts

  • Zewen Chi
  • Li Dong
  • Shaohan Huang
  • Damai Dai
  • Shuming Ma
  • Barun Patra
  • Saksham Singhal
  • Payal Bajaj

Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
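The abstract's key change, estimating routing scores on a low-dimensional hypersphere, amounts to projecting tokens and expert embeddings down, L2-normalizing both, and matching them by cosine similarity. The sketch below illustrates that computation; the dimensions, the linear projection, and the temperature are illustrative assumptions.

```python
# Hypothetical spherical-routing sketch; sizes and temperature are assumptions.
import torch
import torch.nn.functional as F

class SphericalRouter(torch.nn.Module):
    def __init__(self, hidden_dim=768, routing_dim=16, num_experts=8, temperature=0.07):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, routing_dim, bias=False)
        self.expert_emb = torch.nn.Parameter(torch.randn(num_experts, routing_dim))
        self.temperature = temperature

    def forward(self, tokens):                          # tokens: [batch, seq, hidden_dim]
        t = F.normalize(self.proj(tokens), dim=-1)      # unit-norm token representations
        e = F.normalize(self.expert_emb, dim=-1)        # unit-norm expert embeddings
        scores = t @ e.t() / self.temperature           # cosine routing scores on the hypersphere
        return scores.softmax(dim=-1)                   # routing probabilities per token
```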

AAAI Conference 2022 Conference Paper

Sequence Level Contrastive Learning for Text Summarization

  • Shusheng Xu
  • Xingxing Zhang
  • Yi Wu
  • Furu Wei

Contrastive learning models have achieved great success in unsupervised visual representation learning, which maximize the similarities between feature representations of different views of the same image, while minimizing the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document and they have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary and its model generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared to its counterpart without contrastive objectives. We release our code at https://github.com/xssstory/SeqCo.

IJCAI Conference 2022 Conference Paper

UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation

  • Jian Yang
  • Yuwei Yin
  • Shuming Ma
  • Dongdong Zhang
  • ShuangZhi Wu
  • Hongcheng Guo
  • Zhoujun Li
  • Furu Wei

Most translation tasks among languages belong to the zero-resource translation problem where parallel corpora are unavailable. Multilingual neural machine translation (MNMT) enables one-pass translation using shared semantic space for all languages compared to the two-pass pivot translation but often underperforms the pivot-based method. In this paper, we propose a novel method named Unified Multilingual Multiple teacher-student Model for NMT (UM4). Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for the zero-resource translation. The source teacher and target teacher force the student to learn the direct source-target translation by the distilled knowledge on both source and target sides. The monolingual corpus is further leveraged by the pivot-teacher model to enhance the student model. Experimental results demonstrate that our model covering 72 directions significantly outperforms previous methods on the WMT benchmark.

NeurIPS Conference 2022 Conference Paper

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

  • Hangbo Bao
  • Wenhui Wang
  • Li Dong
  • Qiang Liu
  • Owais Khan Mohammed
  • Kriti Aggarwal
  • Subhojit Som
  • Songhao Piao

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Multiway Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of Multiway Transformer, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval.
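A Multiway Transformer block, as described, keeps one shared self-attention layer and swaps in a modality-specific feed-forward expert. The sketch below shows that structure; the layer sizes, the pre-norm placement, and routing by a single modality id per call are simplifying assumptions.

```python
# Hypothetical Multiway Transformer block; sizes and routing granularity are assumptions.
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim=768, heads=12, modalities=("vision", "language", "fusion")):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({                                   # modality-specific FFNs
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):                     # x: [batch, seq, dim]
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.experts[modality](self.norm2(x))   # route to the expert for this modality
        return x
```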

AAAI Conference 2021 Conference Paper

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

  • Yaru Hao
  • Li Dong
  • Furu Wei
  • Ke Xu

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but they fail to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside Transformer. Finally, we show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.

ICML Conference 2021 Conference Paper

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

  • Chengyi Wang 0002
  • Yu Wu 0012
  • Yao Qian
  • Ken'ichi Kumatani
  • Shujie Liu 0001
  • Furu Wei
  • Michael Zeng 0001
  • Xuedong Huang 0001

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both labeled and unlabeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 26.9% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also verified on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.

NeurIPS Conference 2020 Conference Paper

BERT Loses Patience: Fast and Robust Inference with Early Exit

  • Wangchunshu Zhou
  • Canwen Xu
  • Tao Ge
  • Julian McAuley
  • Ke Xu
  • Furu Wei

In this paper, we propose Patience-based Early Exit, a straightforward yet effective inference method that can be used as a plug-and-play technique to simultaneously improve the efficiency and robustness of a pretrained language model (PLM). To achieve this, our approach couples an internal-classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers do not change for a pre-defined number of steps. Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers. Meanwhile, experimental results with an ALBERT model show that our method can improve the accuracy and robustness of the model by preventing it from overthinking and exploiting multiple classifiers for prediction, yielding a better accuracy-speed trade-off compared to existing early exit methods.
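The inference rule is simple to state in code: run layer by layer, read the internal classifier at each layer, and stop once the prediction has been stable for a fixed number of consecutive layers. The sketch below assumes per-layer hidden states and classifier heads are available; the patience value is illustrative.

```python
# Hypothetical patience-based early-exit sketch; inputs and patience value are assumptions.
import torch

@torch.no_grad()
def patient_early_exit(hidden_states, classifiers, patience=3):
    """Stop as soon as `patience` consecutive internal classifiers agree.

    hidden_states: list of per-layer feature vectors (e.g. the [CLS] vector of each layer).
    classifiers:   list of classifier heads, one per layer.
    """
    prev_pred, streak = None, 0
    for layer_idx, (h, clf) in enumerate(zip(hidden_states, classifiers)):
        pred = clf(h).argmax(dim=-1).item()
        streak = streak + 1 if pred == prev_pred else 1
        prev_pred = pred
        if streak >= patience:              # predictions stable: exit early
            return pred, layer_idx + 1      # label and number of layers actually used
    return prev_pred, len(hidden_states)    # fell through: all layers were used
```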

AAAI Conference 2020 Conference Paper

Cross-Lingual Natural Language Generation via Pre-Training

  • Zewen Chi
  • Li Dong
  • Furu Wei
  • Wenhui Wang
  • Xian-Ling Mao
  • Heyan Huang

In this work we focus on transferring supervision signals of natural language generation (NLG) tasks between multiple languages. We propose to pretrain the encoder and the decoder of a sequence-to-sequence model under both monolingual and cross-lingual settings. The pre-training objective encourages the model to represent different languages in the shared space, so that we can conduct zero-shot cross-lingual transfer. After the pre-training procedure, we use monolingual data to fine-tune the pre-trained model on downstream NLG tasks. Then the sequence-to-sequence model trained in a single language can be directly evaluated beyond that language (i.e., accepting multi-lingual input and producing multi-lingual output). Experimental results on question generation and abstractive summarization show that our model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation. Moreover, cross-lingual transfer improves NLG performance of low-resource languages by leveraging rich-resource language data. Our implementation and data are available at https://github.com/CZWin32768/xnlg.

AAAI Conference 2020 Conference Paper

Fact-Aware Sentence Split and Rephrase with Permutation Invariant Training

  • Yinuo Guo
  • Tao Ge
  • Furu Wei

Sentence Split and Rephrase aims to break down a complex sentence into several simple sentences with its meaning preserved. Previous studies tend to address the issue by seq2seq learning from parallel sentence pairs, which takes a complex sentence as input and sequentially generates a series of simple sentences. However, the conventional seq2seq learning has two limitations for this task: (1) it does not take into account the facts stated in the long sentence; as a result, the generated simple sentences may miss or inaccurately state the facts in the original sentence; (2) the order variance of the simple sentences to be generated may confuse the seq2seq model during training because the simple sentences derived from the long source sentence could be in any order. To overcome the challenges, we first propose the Fact-aware Sentence Encoding, which enables the model to learn facts from the long sentence and thus improves the precision of sentence split; then we introduce Permutation Invariant Training to alleviate the effects of order variance in seq2seq learning for this task. Experiments on the WebSplit-v1.0 benchmark dataset show that our approaches can largely improve the performance over the previous seq2seq learning approaches. Moreover, an extrinsic evaluation on the OIE benchmark verifies the effectiveness of our approaches by an observation that splitting long sentences with our state-of-the-art model as preprocessing is helpful for improving OpenIE performance.
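Permutation invariant training here means scoring the model against the best-matching ordering of the reference simple sentences, so the arbitrary order of the references no longer penalizes the model. A minimal sketch follows; `sentence_loss` is a placeholder for a per-sentence seq2seq loss, and brute-force enumeration is acceptable only because each example yields a small number of simple sentences.

```python
# Hypothetical permutation-invariant loss; the per-sentence loss is a placeholder.
from itertools import permutations

def permutation_invariant_loss(predicted, references, sentence_loss):
    """Return the loss under the reference ordering that best matches the predictions."""
    best = None
    for perm in permutations(references):               # try every reference ordering
        total = sum(sentence_loss(p, r) for p, r in zip(predicted, perm))
        best = total if best is None else min(best, total)
    return best
```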

NeurIPS Conference 2020 Conference Paper

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

  • Wenhui Wang
  • Furu Wei
  • Li Dong
  • Hangbo Bao
  • Nan Yang
  • Ming Zhou

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.
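The distillation objective described, matching the teacher's last-layer attention distributions and its value-value relations, can be written as two KL terms. The sketch below is one plausible rendering; the tensor shapes, the way heads are handled, and the equal weighting of the two terms are simplifying assumptions rather than the paper's exact formulation.

```python
# Hypothetical deep self-attention distillation loss; shapes and weighting are assumptions.
import torch
import torch.nn.functional as F

def relation(x, d_head):
    # Scaled dot-product "relation" between positions of the same tensor (e.g. V with V).
    return F.softmax(x @ x.transpose(-1, -2) / d_head ** 0.5, dim=-1)

def self_attention_distillation_loss(t_q, t_k, t_v, s_q, s_k, s_v, d_t, d_s):
    # Attention distributions from queries and keys of teacher and student.
    t_attn = F.softmax(t_q @ t_k.transpose(-1, -2) / d_t ** 0.5, dim=-1)
    s_attn = F.softmax(s_q @ s_k.transpose(-1, -2) / d_s ** 0.5, dim=-1)
    attn_kl = F.kl_div(s_attn.clamp_min(1e-9).log(), t_attn, reduction="batchmean")
    # Value-value relations as the additional deep self-attention knowledge.
    val_kl = F.kl_div(relation(s_v, d_s).clamp_min(1e-9).log(),
                      relation(t_v, d_t), reduction="batchmean")
    return attn_kl + val_kl
```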

ECAI Conference 2020 Conference Paper

Multimodal Matching Transformer for Live Commenting

  • Chaoqun Duan
  • Lei Cui 0001
  • Shuming Ma
  • Furu Wei
  • Conghui Zhu
  • Tiejun Zhao

Automatic live commenting aims to provide real-time comments on videos for viewers. It encourages user engagement on online video sites, and is also a good benchmark for video-to-text generation. Recent work on this task adopts encoder-decoder models to generate comments. However, these methods do not model the interaction between videos and comments explicitly, so they tend to generate popular comments that are often irrelevant to the videos. In this work, we aim to improve the relevance between live comments and videos by modeling the cross-modal interactions among different modalities. To this end, we propose a multimodal matching transformer to capture the relationships among comments, vision, and audio. The proposed model is based on the transformer framework and can iteratively learn the attention-aware representations for each modality. We evaluate the model on a publicly available live commenting dataset. Experiments show that the multimodal matching transformer model outperforms the state-of-the-art methods.

ICLR Conference 2020 Conference Paper

Self-Adversarial Learning with Comparative Discrimination for Text Generation

  • Wangchunshu Zhou
  • Tao Ge 0001
  • Ke Xu 0001
  • Furu Wei
  • Ming Zhou 0001

Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs' performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation.

ICML Conference 2020 Conference Paper

UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

  • Hangbo Bao
  • Li Dong 0004
  • Furu Wei
  • Wenhui Wang 0003
  • Nan Yang 0002
  • Xiaodong Liu 0003
  • Yu Wang 0009
  • Jianfeng Gao 0001

We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of language understanding and generation tasks across several widely used benchmarks. The code and pre-trained models are available at https://github.com/microsoft/unilm.

ICLR Conference 2020 Conference Paper

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

  • Weijie Su 0002
  • Xizhou Zhu
  • Yue Cao
  • Bin Li 0025
  • Lewei Lu
  • Furu Wei
  • Jifeng Dai

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved the first place of single model on the leaderboard of the VCR benchmark.

AAAI Conference 2019 Conference Paper

Dictionary-Guided Editing Networks for Paraphrase Generation

  • Shaohan Huang
  • Yu Wu
  • Furu Wei
  • Zhongzhi Luan

An intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct. We propose a novel approach to modeling the process with dictionary-guided editing networks which effectively conduct rewriting on the source sentence to generate paraphrase sentences. It jointly learns the selection of the appropriate word level and phrase level paraphrase pairs in the context of the original sentence from an off-the-shelf dictionary as well as the generation of fluent natural language sentences. Specifically, the system retrieves a set of word level and phrase level paraphrase pairs derived from the Paraphrase Database (PPDB) for the original sentence, which is used to guide the decision of which words might be deleted or inserted with the soft attention mechanism under the sequence-to-sequence framework. We conduct experiments on two benchmark datasets for paraphrase generation, namely the MSCOCO and Quora datasets. The automatic evaluation results demonstrate that our dictionary-guided editing networks outperform the baseline methods. In human evaluation, results indicate that the generated paraphrases are grammatically correct and relevant to the input sentence.

AAAI Conference 2019 Conference Paper

LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts

  • Shuming Ma
  • Lei Cui
  • Damai Dai
  • Furu Wei
  • Xu Sun

We introduce the task of automatic live commenting. Live commenting, which is also called “video barrage”, is an emerging feature on online video sites that allows real-time comments from viewers to fly across the screen like bullets or roll at the right side of the screen. The live comments are a mixture of opinions for the video and the chit chats with other comments. Automatic live commenting requires AI agents to comprehend the videos and interact with human viewers who also make the comments, so it is a good testbed of an AI agent’s ability to deal with both dynamic vision and language. In this work, we construct a large-scale live comment dataset with 2,361 videos and 895,929 live comments. Then, we introduce two neural models to generate live comments based on the visual and textual contexts, which achieve better performance than previous neural baselines such as the sequence-to-sequence model. Finally, we provide a retrieval-based evaluation protocol for automatic live commenting where the model is asked to sort a set of candidate comments based on the log-likelihood score, and evaluated on metrics such as mean-reciprocal-rank. Putting it all together, we demonstrate the first “LiveBot”. The datasets and the codes can be found at https://github.com/lancopku/livebot.

AAAI Conference 2019 Conference Paper

Read + Verify: Machine Reading Comprehension with Unanswerable Questions

  • Minghao Hu
  • Furu Wei
  • Yuxing Peng
  • Zhen Huang
  • Nan Yang
  • Dongsheng Li

Machine reading comprehension with unanswerable questions aims to abstain from answering when no answer can be inferred. In addition to extracting answers, previous works usually predict an additional “no-answer” probability to detect unanswerable cases. However, they fail to validate the answerability of the question by verifying the legitimacy of the predicted answer. To address this problem, we propose a novel read-then-verify system, which not only utilizes a neural reader to extract candidate answers and produce no-answer probabilities, but also leverages an answer verifier to decide whether the predicted answer is entailed by the input snippets. Moreover, we introduce two auxiliary losses to help the reader better handle answer extraction as well as no-answer detection, and investigate three different architectures for the answer verifier. Our experiments on the SQuAD 2.0 dataset show that our system obtains a score of 74.2 F1 on the test set, achieving state-of-the-art results at the time of submission (Aug. 28th, 2018).

AAAI Conference 2019 Conference Paper

Response Generation by Context-Aware Prototype Editing

  • Yu Wu
  • Furu Wei
  • Shaohan Huang
  • Yunli Wang
  • Zhoujun Li
  • Ming Zhou

Open domain response generation has achieved remarkable progress in recent years, but sometimes yields short and uninformative responses. We propose a new paradigm, prototype-then-edit, for response generation, which first retrieves a prototype response from a pre-defined index and then edits the prototype response according to the differences between the prototype context and current context. Our motivation is that the retrieved prototype provides a good starting point for generation because it is grammatical and informative, and the post-editing process further improves the relevance and coherence of the prototype. In practice, we design a context-aware editing model that is built upon an encoder-decoder framework augmented with an editing vector. We first generate an edit vector by considering lexical differences between a prototype context and current context. After that, the edit vector and the prototype response representation are fed to a decoder to generate a new response. Experiment results on a large scale dataset demonstrate that our new paradigm significantly increases the relevance, diversity and originality of generation results, compared to traditional generative models. Furthermore, our model outperforms retrieval-based methods in terms of relevance and originality.

NeurIPS Conference 2019 Conference Paper

Unified Language Model Pre-training for Natural Language Understanding and Generation

  • Li Dong
  • Nan Yang
  • Wenhui Wang
  • Furu Wei
  • Xiaodong Liu
  • Yu Wang
  • Jianfeng Gao
  • Ming Zhou

This paper presents a new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UniLM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UniLM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
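The "specific self-attention masks" that select the pre-training task can be built directly: a full mask for bidirectional (autoencoding) modeling, a lower-triangular mask for unidirectional modeling, and a hybrid mask for sequence-to-sequence modeling where the source segment is fully visible and the target segment is causal. The sketch below constructs all three; the two-segment layout is an illustrative assumption.

```python
# Hypothetical mask construction for the three language modeling modes; 1 = may attend, 0 = blocked.
import torch

def unilm_masks(src_len, tgt_len):
    n = src_len + tgt_len
    bidirectional = torch.ones(n, n)                 # every token sees every token
    unidirectional = torch.tril(torch.ones(n, n))    # each token sees only its left context
    seq2seq = torch.zeros(n, n)
    seq2seq[:, :src_len] = 1                         # all tokens see the full source segment
    seq2seq[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))  # target stays causal
    return bidirectional, unidirectional, seq2seq
```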

IJCAI Conference 2018 Conference Paper

Attention-Fused Deep Matching Network for Natural Language Inference

  • Chaoqun Duan
  • Lei Cui
  • Xinchi Chen
  • Furu Wei
  • Conghui Zhu
  • Tiejun Zhao

Natural language inference aims to predict whether a premise sentence can infer another hypothesis sentence. Recent progress on this task only relies on a shallow interaction between sentence pairs, which is insufficient for modeling complex relations. In this paper, we present an attention-fused deep matching network (AF-DMN) for natural language inference. Unlike existing models, AF-DMN takes two sentences as input and iteratively learns the attention-aware representations for each side by multi-level interactions. Moreover, we add a self-attention mechanism to fully exploit local context information within each sentence. Experiment results show that AF-DMN achieves state-of-the-art performance and outperforms strong baselines on Stanford natural language inference (SNLI), multi-genre natural language inference (MultiNLI), and Quora duplicate questions datasets.

AAAI Conference 2018 Conference Paper

Faithful to the Original: Fact Aware Neural Abstractive Summarization

  • Ziqiang Cao
  • Furu Wei
  • Wenjie Li
  • Sujian Li

Unlike extractive summarization, abstractive summarization has to fuse different parts of the source text, which inclines to create fake facts. Our preliminary study reveals nearly 30% of the outputs from a state-of-the-art neural summarization system suffer from this problem. While previous abstractive summarization approaches usually focus on the improvement of informativeness, we argue that faithfulness is also a vital prerequisite for a practical abstractive summarization system. To avoid generating fake facts in a summary, we leverage open information extraction and dependency parse technologies to extract actual fact descriptions from the source text. The dual-attention sequence-to-sequence framework is then proposed to force the generation conditioned on both the source text and the extracted fact descriptions. Experiments on the Gigaword benchmark dataset demonstrate that our model can greatly reduce fake summaries by 80%. Notably, the fact descriptions also bring significant improvement on informativeness since they often condense the meaning of the source text.

AAAI Conference 2018 Conference Paper

Hierarchical Attention Flow for Multiple-Choice Reading Comprehension

  • Haichao Zhu
  • Furu Wei
  • Bing Qin
  • Ting Liu

In this paper, we focus on multiple-choice reading comprehension which aims to answer a question given a passage and multiple candidate options. We present the hierarchical attention flow to adequately leverage candidate options to model the interactions among passages, questions and candidate options. We observe that leveraging candidate options to boost evidence gathering from the passages plays a vital role in this task, which is ignored in previous works. In addition, we explicitly model the option correlations with attention mechanism to obtain better option representations, which are further fed into a bilinear layer to obtain the ranking score for each option. On a large-scale multiple-choice reading comprehension dataset (i.e., the RACE dataset), the proposed model outperforms two previous neural network baselines on both RACE-M and RACE-H subsets and yields the state-of-the-art overall results.

IJCAI Conference 2018 Conference Paper

Multiway Attention Networks for Modeling Sentence Pairs

  • Chuanqi Tan
  • Furu Wei
  • Wenhui Wang
  • Weifeng Lv
  • Ming Zhou

Modeling sentence pairs plays the vital role for judging the relationship between two sentences, such as paraphrase identification, natural language inference, and answer sentence selection. Previous work achieves very promising results using neural networks with attention mechanism. In this paper, we propose the multiway attention networks which employ multiple attention functions to match sentence pairs under the matching-aggregation framework. Specifically, we design four attention functions to match words in corresponding sentences. Then, we aggregate the matching information from each function, and combine the information from all functions to obtain the final representation. Experimental results demonstrate that the proposed multiway attention networks improve the result on the Quora Question Pairs, SNLI, MultiNLI, and answer sentence selection task on the SQuAD dataset.

IJCAI Conference 2018 Conference Paper

Reinforced Mnemonic Reader for Machine Reading Comprehension

  • Minghao Hu
  • Yuxing Peng
  • Zhen Huang
  • Xipeng Qiu
  • Furu Wei
  • Ming Zhou

In this paper, we introduce the Reinforced Mnemonic Reader for machine reading comprehension tasks, which enhances previous attentive readers in two aspects. First, a reattention mechanism is proposed to refine current attentions by directly accessing past attentions that are temporally memorized in a multi-round alignment architecture, so as to avoid the problems of attention redundancy and attention deficiency. Second, a new optimization approach, called dynamic-critical reinforcement learning, is introduced to extend the standard supervised method. It always encourages the model to predict a more acceptable answer so as to address the convergence suppression problem that occurs in traditional reinforcement learning algorithms. Extensive experiments on the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results. Meanwhile, our model outperforms previous systems by over 6% in terms of both Exact Match and F1 metrics on two adversarial SQuAD datasets.

AAAI Conference 2018 Conference Paper

S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension

  • Chuanqi Tan
  • Furu Wei
  • Nan Yang
  • Bowen Du
  • Weifeng Lv
  • Ming Zhou

In this paper, we present a novel approach to machine reading comprehension for the MS-MARCO dataset. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a question from multiple passages and the words in the answer are not necessary in the passages. We therefore develop an extraction-then-synthesis framework to synthesize answers from extraction results. Specifically, the answer extraction model is first employed to predict the most important sub-spans from the passage as evidence, and the answer synthesis model takes the evidence as additional features along with the question and passage to further elaborate the final answers. We build the answer extraction model with state-of-the-art neural networks for single passage reading comprehension, and propose an additional task of passage ranking to help answer extraction in multiple passages. The answer synthesis model is based on the sequence-to-sequence neural networks with the extracted evidence as features. Experiments show that our extraction-then-synthesis method outperforms state-of-the-art methods.

AAAI Conference 2018 Conference Paper

Sequential Copying Networks

  • Qingyu Zhou
  • Nan Yang
  • Furu Wei
  • Ming Zhou

Copying mechanisms show effectiveness in sequence-to-sequence based neural network models for text generation tasks, such as abstractive sentence summarization and question generation. However, existing works on modeling copying or pointing mechanisms only consider single-word copying from the source sentences. In this paper, we propose a novel copying framework, named Sequential Copying Networks (SeqCopyNet), which not only learns to copy single words, but also copies sequences from the input sentence. It leverages the pointer networks to explicitly select a sub-span from the source side to the target side, and integrates this sequential copying mechanism into the generation process in the encoder-decoder paradigm. Experiments on abstractive sentence summarization and question generation tasks show that the proposed SeqCopyNet can copy meaningful spans and outperforms the baseline models.

AAAI Conference 2017 Conference Paper

Improving Multi-Document Summarization via Text Classification

  • Ziqiang Cao
  • Wenjie Li
  • Sujian Li
  • Furu Wei

Multi-document summarization has so far reached a bottleneck due to the lack of sufficient training data and diverse categories of documents. Text classification just makes up for these deficiencies. In this paper, we propose a novel summarization system called TCSum, which leverages plentiful text classification data to improve the performance of multi-document summarization. TCSum projects documents onto distributed representations which act as a bridge between text classification and summarization. It also utilizes the classification results to produce summaries of different styles. Extensive experiments on DUC generic multi-document summarization datasets show that TCSum can achieve the state-of-the-art performance without using any hand-crafted features and has the capability to catch the variations of summary styles with respect to different text categories.

AAAI Conference 2016 Conference Paper

TGSum: Build Tweet Guided Multi-Document Summarization Dataset

  • Ziqiang Cao
  • Chengyao Chen
  • Wenjie Li
  • Sujian Li
  • Furu Wei
  • Ming Zhou

The development of summarization research has been significantly hampered by the costly acquisition of reference summaries. This paper proposes an effective way to automatically collect large-scale collections of news-related multi-document summaries with reference to social media’s reactions. We utilize two types of social labels in tweets, i.e., hashtags and hyper-links. Hashtags are used to cluster documents into different topic sets. Also, a tweet with a hyper-link often highlights certain key points of the corresponding document. We synthesize a linked document cluster to form a reference summary which can cover most key points. To this end, we adopt the ROUGE metrics to measure the coverage ratio, and develop an Integer Linear Programming solution to discover the sentence set reaching the upper bound of ROUGE. Since we allow summary sentences to be selected from both documents and high-quality tweets, the generated reference summaries could be abstractive. Both informativeness and readability of the collected summaries are verified by manual judgment. In addition, we train a Support Vector Regression summarizer on DUC generic multi-document summarization benchmarks. With the collected data as extra training resource, the performance of the summarizer improves a lot on all the test sets. We release this dataset for further research.

IJCAI Conference 2016 Conference Paper

Unsupervised Word and Dependency Path Embeddings for Aspect Term Extraction

  • Yichun Yin
  • Furu Wei
  • Li Dong
  • Kaimeng Xu
  • Ming Zhang
  • Ming Zhou

In this paper, we develop a novel approach to aspect term extraction based on unsupervised learning of distributed representations of words and dependency paths. The basic idea is to connect two words (w1 and w2) with the dependency path (r) between them in the embedding space. Specifically, our method optimizes the objective w1 + r ≈ w2 in the low-dimensional space, where the multi-hop dependency paths are treated as a sequence of grammatical relations and modeled by a recurrent neural network. Then, we design the embedding features that consider linear context and dependency context information, for the conditional random field (CRF) based aspect term extraction. Experimental results on the SemEval datasets show that, (1) with only embedding features, we can achieve state-of-the-art results; (2) our embedding method which incorporates the syntactic information among words yields better performance than other representative ones in aspect term extraction.
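The objective w1 + r ≈ w2, with the multi-hop dependency path r encoded by a recurrent network, can be sketched as follows. The GRU encoder, the squared-error form of the loss, and the embedding sizes are illustrative assumptions; the paper's actual training objective (e.g. any negative sampling) is not reproduced.

```python
# Hypothetical word-and-dependency-path embedding sketch; objective form is an assumption.
import torch
import torch.nn as nn

class WordPathModel(nn.Module):
    def __init__(self, vocab_size, num_relations, dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.rel_emb = nn.Embedding(num_relations, dim)
        self.path_rnn = nn.GRU(dim, dim, batch_first=True)   # encodes multi-hop dependency paths

    def forward(self, w1, path, w2):
        # w1, w2: [batch] word ids; path: [batch, path_len] grammatical-relation ids.
        _, h = self.path_rnn(self.rel_emb(path))              # h: [1, batch, dim]
        r = h.squeeze(0)
        # Push w1 + r toward w2 in the embedding space (squared error as a stand-in loss).
        return ((self.word_emb(w1) + r - self.word_emb(w2)) ** 2).sum(dim=-1).mean()
```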

IJCAI Conference 2015 Conference Paper

A Hybrid Neural Model for Type Classification of Entity Mentions

  • Li Dong
  • Furu Wei
  • Hong Sun
  • Ming Zhou
  • Ke Xu

The semantic class (i.e., type) of an entity plays a vital role in many natural language processing tasks, such as question answering. However, most existing type classification systems extensively rely on hand-crafted features. This paper introduces a hybrid neural model which classifies entity mentions to a wide-coverage set of 22 types derived from DBpedia. It consists of two parts. The mention model uses recurrent neural networks to recursively obtain the vector representation of an entity mention from the words it contains. The context model, on the other hand, employs multilayer perceptrons to obtain the hidden representation for contextual information of a mention. Representations obtained by the two parts are used together to predict the type distribution. Using automatically generated data, these two parts are jointly learned. Experimental studies illustrate that the proposed approach outperforms baseline methods. Moreover, when type information provided by our method is used in a question answering system, we observe a 14.7% relative improvement for the top-1 accuracy of answers.

AAAI Conference 2015 Conference Paper

Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization

  • Ziqiang Cao
  • Furu Wei
  • Li Dong
  • Sujian Li
  • Ming Zhou

We develop a Ranking framework upon Recursive Neural Networks (R2N2) to rank sentences for multi-document summarization. It formulates sentence ranking as a hierarchical regression process that simultaneously measures the salience of a sentence and of its constituents (e.g., phrases) in the parse tree. This enables us to draw on supervision, derived from reference summaries, from the word level up to the sentence level. Recursive neural networks are used to automatically learn ranking features over the tree, with hand-crafted feature vectors of words as inputs, and hierarchical regression is then conducted on the learned features concatenated with the raw features. The ranking scores of sentences and words are used to select informative and non-redundant sentences for the summaries. Experiments on the DUC 2001, 2002, and 2004 multi-document summarization datasets show that R2N2 outperforms state-of-the-art extractive summarization approaches.
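
A compact sketch, assuming a binary parse tree, of how a recursive network can attach a salience score to every node so that supervision applies from words up to sentences; the composition and scoring functions here are simplified stand-ins.

```python
# Sketch of hierarchical regression over a parse tree: each internal node
# composes its children's vectors, and every node (leaf or internal) receives
# a salience score, allowing word- to sentence-level supervision.
import torch
import torch.nn as nn

class RecursiveRanker(nn.Module):
    def __init__(self, dim=50):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, node, leaf_vectors, scores):
        # node is either an int (leaf index) or a (left, right) tuple
        if isinstance(node, int):
            vec = leaf_vectors[node]
        else:
            left = self.forward(node[0], leaf_vectors, scores)
            right = self.forward(node[1], leaf_vectors, scores)
            vec = torch.tanh(self.compose(torch.cat([left, right], dim=-1)))
        scores.append(self.score(vec))   # salience regression at every node
        return vec

model = RecursiveRanker()
leaves = torch.randn(4, 50)              # pretend word feature vectors for 4 words
tree = ((0, 1), (2, 3))                  # toy binary parse
scores = []
root_vec = model(tree, leaves, scores)
print(len(scores))                       # 7 nodes scored: 4 leaves + 3 internal
```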

AAAI Conference 2014 Conference Paper

Adaptive Multi-Compositionality for Recursive Neural Models with Applications to Sentiment Analysis

  • Li Dong
  • Furu Wei
  • Ming Zhou
  • Ke Xu

Recursive neural models have achieved promising results in many natural language processing tasks. The main difference among these models lies in the composition function, i.e., how to obtain the vector representation of a phrase or sentence from the representations of the words it contains. This paper introduces a novel Adaptive Multi-Compositionality (AdaMC) layer for recursive neural models. The basic idea is to maintain more than one composition function and adaptively select among them depending on the input vectors. We present a general framework that models each semantic composition as a distribution over these composition functions; the composition functions and the parameters used for adaptive selection are learned jointly from data. We integrate AdaMC into existing recursive neural models and conduct extensive experiments on the Stanford Sentiment Treebank. The results show that AdaMC significantly outperforms state-of-the-art sentiment classification methods, pushing the best accuracy of sentence-level negative/positive classification from 85.4% to 88.5%.
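
A small PyTorch sketch of the adaptive-composition idea: several candidate composition functions mixed by weights predicted from the child vectors themselves; the gating form and the number of functions are assumptions.

```python
# Sketch of adaptive multi-compositionality: K composition functions, mixed
# by an input-conditioned distribution over them.
import torch
import torch.nn as nn

class AdaMCComposition(nn.Module):
    def __init__(self, dim=50, n_functions=4):
        super().__init__()
        self.compositions = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(n_functions)]
        )
        self.gate = nn.Linear(2 * dim, n_functions)

    def forward(self, left, right):
        children = torch.cat([left, right], dim=-1)
        # distribution over composition functions, conditioned on the children
        weights = torch.softmax(self.gate(children), dim=-1)           # (batch, K)
        candidates = torch.stack(
            [torch.tanh(f(children)) for f in self.compositions], dim=1
        )                                                               # (batch, K, dim)
        return (weights.unsqueeze(-1) * candidates).sum(dim=1)         # (batch, dim)

comp = AdaMCComposition()
parent = comp(torch.randn(2, 50), torch.randn(2, 50))
print(parent.shape)   # torch.Size([2, 50])
```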

TIST Journal 2013 Journal Article

Named entity recognition for tweets

  • Xiaohua Liu
  • Furu Wei
  • Shaodian Zhang
  • Ming Zhou

Two main challenges of Named Entity Recognition (NER) for tweets are the insufficient information in a single tweet and the lack of training data. We propose a method built on three core elements: (1) normalization of tweets; (2) a combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Field (CRF) model; and (3) a semi-supervised learning framework. The normalization preprocessing corrects common ill-formed words using a global linear model. The KNN-based classifier performs pre-labeling to collect coarse global evidence across tweets, while the CRF model performs sequential labeling to capture the fine-grained information encoded in a tweet. Semi-supervised learning, together with gazetteers, alleviates the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of normalization, the KNN classifier, and semi-supervised learning.
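
A toy sketch of the prelabel-then-CRF pipeline, assuming scikit-learn and sklearn-crfsuite are installed; the features, the tiny training data, and the omission of the normalization step are illustrative simplifications, not the paper's setup.

```python
# Sketch: a KNN classifier over character n-grams provides a coarse "global"
# prelabel per token, which is then added as a feature for a sequential CRF.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
import sklearn_crfsuite

# toy token-level training data for the KNN prelabeler
knn_tokens = ["obama", "paris", "pizza", "google", "monday"]
knn_labels = ["PER", "LOC", "O", "ORG", "O"]

vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
knn = KNeighborsClassifier(n_neighbors=1).fit(vec.fit_transform(knn_tokens), knn_labels)

def token_features(tokens, i):
    prelabel = knn.predict(vec.transform([tokens[i]]))[0]   # coarse global evidence
    return {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        "knn.prelabel": prelabel,
    }

# one toy tweet as a labeled sequence for the CRF
tweet = ["Obama", "visits", "Paris"]
X = [[token_features(tweet, i) for i in range(len(tweet))]]
y = [["B-PER", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```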

AAAI Conference 2013 Conference Paper

The Automated Acquisition of Suggestions from Tweets

  • Li Dong
  • Furu Wei
  • Yajuan Duan
  • Xiaohua Liu
  • Ming Zhou
  • Ke Xu

This paper targets the automatic detection and classification of users' suggestions in tweets. The short and informal nature of tweets, along with the imbalanced distribution of suggestion tweets, makes the task extremely challenging. To this end, we develop a classification framework based on Factorization Machines, which are effective and efficient in classification tasks with sparse features. Moreover, we tackle the imbalance problem by introducing cost-sensitive learning into Factorization Machines. Extensive experimental studies on a manually annotated real-life dataset show that the proposed approach significantly improves over the baseline, yielding a precision of 71.06% and a recall of 67.86%. We also investigate why Factorization Machines perform better. Finally, we introduce the first manually annotated dataset for suggestion classification.
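
A minimal sketch of a degree-2 Factorization Machine trained with a class-weighted loss, as one way to realize cost-sensitive learning on sparse features; the weighting scheme and hyper-parameters are assumptions, not the paper's.

```python
# Degree-2 FM score: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j, with the
# pairwise term computed via the usual 0.5 * ((xV)^2 - x^2 V^2) identity.
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    def __init__(self, n_features, k=8):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))
        self.w = nn.Parameter(torch.zeros(n_features))
        self.V = nn.Parameter(torch.randn(n_features, k) * 0.01)

    def forward(self, x):                       # x: (batch, n_features)
        linear = self.w0 + x @ self.w
        xv = x @ self.V                         # (batch, k)
        x2v2 = (x ** 2) @ (self.V ** 2)
        pairwise = 0.5 * (xv ** 2 - x2v2).sum(dim=-1)
        return linear + pairwise                # raw score (logit)

model = FactorizationMachine(n_features=1000)
x = torch.zeros(4, 1000)
x[0, 3] = 1.0; x[1, 42] = 1.0                   # toy sparse inputs
y = torch.tensor([1.0, 0.0, 1.0, 0.0])          # 1 = suggestion tweet

# cost-sensitive learning: up-weight the rare positive (suggestion) class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
loss = loss_fn(model(x), y)
loss.backward()
```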

AAAI Conference 2012 Conference Paper

Collective Nominal Semantic Role Labeling for Tweets

  • Xiaohua Liu
  • Zhongyang Fu
  • Furu Wei
  • Ming Zhou

Tweets have become an increasingly popular source of fresh information. We investigate Nominal Semantic Role Labeling (NSRL) for tweets, which aims to identify predicate-argument structures defined by nominals in tweets. This task can support fine-grained information extraction and retrieval from tweets. There are two main challenges: (1) the lack of information in a single tweet, rooted in the short and noisy nature of tweets; and (2) the recovery of implicit arguments. We propose jointly conducting NSRL on multiple similar tweets using a graphical model, leveraging the redundancy among tweets to tackle these challenges. Extensive evaluations on a human-annotated dataset demonstrate that our method outperforms two baselines with an absolute gain of 2.7% in F1.
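
A toy sketch of the first step any such collective approach needs: grouping similar tweets so that they can be labeled jointly; the similarity measure and threshold here are illustrative assumptions, not the paper's.

```python
# Retrieve tweets that likely describe the same content so that joint
# labeling can exploit their redundancy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "Obama met Putin in Geneva today",
    "Putin and Obama hold talks in Geneva",
    "Great pizza recipe for the weekend",
]

tfidf = TfidfVectorizer().fit_transform(tweets)
sim = cosine_similarity(tfidf)

def similar_group(i, threshold=0.2):
    """Indices of tweets similar enough to tweet i to be labeled jointly."""
    return [j for j in range(len(tweets)) if j != i and sim[i, j] >= threshold]

print(similar_group(0))   # likely [1]: the two Geneva tweets get grouped
```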

AAAI Conference 2012 Conference Paper

Exacting Social Events for Tweets Using a Factor Graph

  • Xiaohua Liu
  • Xiangyang Zhou
  • Zhongyang Fu
  • Furu Wei
  • Ming Zhou

Social events are events that occur between people where at least one person is aware of the other and of the event taking place. Extracting social events can play an important role in a wide range of applications, such as the construction of social networks. In this paper, we introduce the task of social event extraction for tweets, an important source of fresh events. One main challenge is the lack of information in a single tweet, which is rooted in the short and noise-prone nature of tweets. We propose to collectively extract social events from multiple similar tweets using a novel factor graph, harvesting the redundancy in tweets, i.e., the repeated occurrences of a social event across several tweets. We evaluate our method on a human-annotated dataset and show that it outperforms all baselines, with an absolute gain of 21% in F1.
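
A minimal stand-in for collective inference over similar tweets: local per-tweet scores plus pairwise agreement factors, decoded with iterated conditional modes (ICM); the paper's factor graph and its inference procedure are more elaborate.

```python
# Collective decoding: each tweet has local label scores, and pairwise
# factors reward agreement with similar tweets; ICM greedily updates labels.
import numpy as np

def collective_decode(local_scores, similarity, agreement_weight=0.5, n_iters=10):
    """local_scores: (n_tweets, n_labels) log-scores from a per-tweet model.
    similarity:   (n_tweets, n_tweets) similarity used as factor strength."""
    labels = local_scores.argmax(axis=1)
    for _ in range(n_iters):
        for i in range(len(labels)):
            best = local_scores[i].copy()
            for j in range(len(labels)):
                if j != i:
                    # reward labels that agree with similar neighbors
                    best[labels[j]] += agreement_weight * similarity[i, j]
            labels[i] = best.argmax()
    return labels

local = np.array([[2.0, 1.0], [0.9, 1.0], [0.1, 3.0]])   # second tweet is ambiguous
sim = np.array([[0.0, 0.9, 0.0], [0.9, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(collective_decode(local, sim))   # the similar neighbor pulls tweet 2 to label 0
```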

AAAI Conference 2010 Conference Paper

Constrained Coclustering for Textual Documents

  • Yangqiu Song
  • Shimei Pan
  • Shixia Liu
  • Furu Wei
  • Michelle Zhou
  • Weihong Qian

In this paper, we present a constrained co-clustering approach for clustering textual documents. Our approach combines the benefits of information-theoretic co-clustering and constrained clustering. We use a two-sided hidden Markov random field (HMRF) to model both the document and word constraints, and develop an alternating expectation maximization (EM) algorithm to optimize the constrained co-clustering model. We conduct two sets of experiments on a benchmark dataset: (1) using human-provided category labels to derive document and word constraints for semi-supervised document clustering, and (2) using automatically extracted named entities to derive document constraints for unsupervised document clustering. Compared to several representative constrained clustering and co-clustering approaches, our approach proves more effective for high-dimensional, sparse text data.
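
A simplified hard-assignment sketch of constrained co-clustering: documents and words are alternately reassigned against co-cluster profiles, with a soft penalty for splitting a must-link document pair; this approximates, but is not, the paper's HMRF/EM formulation.

```python
# Alternating co-clustering with a must-link penalty on document assignments.
import numpy as np

def constrained_cocluster(X, n_doc_clusters=2, n_word_clusters=2,
                          must_link=(), penalty=1.0, n_iters=20):
    n_docs, n_words = X.shape
    doc_c = np.arange(n_docs) % n_doc_clusters       # crude deterministic init
    word_c = np.arange(n_words) % n_word_clusters

    for _ in range(n_iters):
        # document step: profile each document over current word clusters
        doc_prof = np.stack(
            [X[:, word_c == k].sum(axis=1) for k in range(n_word_clusters)], axis=1)
        centers = np.stack(
            [doc_prof[doc_c == g].mean(axis=0) if (doc_c == g).any()
             else np.zeros(n_word_clusters) for g in range(n_doc_clusters)])
        for d in range(n_docs):
            cost = ((doc_prof[d] - centers) ** 2).sum(axis=1)
            for a, b in must_link:                   # soft must-link penalty
                other = b if d == a else (a if d == b else None)
                if other is not None:
                    cost += penalty * (np.arange(n_doc_clusters) != doc_c[other])
            doc_c[d] = cost.argmin()

        # word step (left unconstrained in this sketch)
        word_prof = np.stack(
            [X[doc_c == g].sum(axis=0) for g in range(n_doc_clusters)], axis=1)
        wcenters = np.stack(
            [word_prof[word_c == k].mean(axis=0) if (word_c == k).any()
             else np.zeros(n_doc_clusters) for k in range(n_word_clusters)])
        for w in range(n_words):
            word_c[w] = ((word_prof[w] - wcenters) ** 2).sum(axis=1).argmin()
    return doc_c, word_c

# toy term-frequency matrix with two obvious document/word blocks
X = np.array([[5, 4, 0, 0],
              [4, 5, 1, 0],
              [0, 1, 6, 5],
              [0, 0, 5, 6]], dtype=float)
docs, words = constrained_cocluster(X, must_link=[(0, 1)])
print(docs, words)   # should recover the two blocks, e.g. [0 0 1 1] [0 0 1 1]
```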