Arrow Research

Author name cluster

Liyuan Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers (21)

NeurIPS Conference 2025 Conference Paper

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

  • Liliang Ren
  • Congcong Chen
  • Haoran Xu
  • Young Jin Kim
  • Adam Atkinson
  • Zheng Zhan
  • Jiankai Sun
  • Baolin Peng

Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10× higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
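
A minimal sketch of the gating idea behind a GMU, assuming a simple element-wise gate derived from the current hidden state (the exact projections and how the memory readout is produced in SambaY are not specified here):

    import torch
    import torch.nn as nn

    class GatedMemoryUnit(nn.Module):
        # Sketch only: the current hidden state produces an element-wise gate that
        # modulates a memory readout shared from an earlier (self-decoder) layer,
        # so the cross-decoder reuses that state instead of recomputing it.
        def __init__(self, d_model: int):
            super().__init__()
            self.gate_proj = nn.Linear(d_model, d_model, bias=False)
            self.out_proj = nn.Linear(d_model, d_model, bias=False)

        def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
            gate = torch.sigmoid(self.gate_proj(hidden))  # assumed gate form
            return self.out_proj(gate * memory)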

NeurIPS Conference 2025 Conference Paper

Mixture of Inputs: Text Generation Beyond Discrete Token Sampling

  • Yufan Zhuang
  • Liyuan Liu
  • Chandan Singh
  • Jingbo Shang
  • Jianfeng Gao

In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution’s rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
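
As a rough illustration of the input-mixing step, the sketch below blends the sampled one-hot with the full next-token distribution and feeds the expectation through the embedding table; a plain convex combination with weight beta stands in for the Bayesian posterior expectation described above:

    import torch

    def mixture_of_inputs_embedding(logits, embedding, beta=0.5):
        # beta is an illustrative mixing weight, not the paper's Bayesian estimate.
        probs = torch.softmax(logits, dim=-1)              # distribution that standard decoding discards
        token = torch.multinomial(probs, num_samples=1)    # standard sampling step
        one_hot = torch.nn.functional.one_hot(token.squeeze(-1), probs.size(-1)).to(probs.dtype)
        mixed = beta * one_hot + (1.0 - beta) * probs      # continuous input distribution
        return mixed @ embedding.weight                    # expected embedding used as the next input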

ICML Conference 2025 Conference Paper

On the Generalization Ability of Next-Token-Prediction Pretraining

  • Zhihao Li
  • Xue Jiang
  • Liyuan Liu
  • Xuelin Zhang
  • Hong Chen 0004
  • Feng Zheng 0001

Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model’s generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.

NeurIPS Conference 2025 Conference Paper

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

  • Yiping Wang
  • Qing Yang
  • Zhiyuan Zeng
  • Liliang Ren
  • Liyuan Liu
  • Baolin Peng
  • Hao Cheng
  • Xuehai He

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We further discuss related observations about format correction, label robustness, and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, models, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
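
The exploration point above (an entropy bonus added to the policy-gradient objective) can be sketched as follows; the loss shape and coefficient are illustrative, not the paper's exact GRPO/PPO objective:

    import torch

    def policy_loss_with_entropy(logprobs, advantages, probs, entropy_coef=0.01):
        # logprobs: log-probabilities of sampled tokens; advantages: from verifiable rewards.
        pg_loss = -(advantages.detach() * logprobs).mean()
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return pg_loss - entropy_coef * entropy  # subtracting entropy encourages exploration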

NeurIPS Conference 2025 Conference Paper

Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection

  • Zheng Zhan
  • Liliang Ren
  • Shuohang Wang
  • Liyuan Liu
  • Yang Liu
  • Yeyun Gong
  • Yanzhi Wang
  • Yelong Shen

State Space Models (SSMs) offer remarkable performance gains in efficient sequence modeling, with constant per-step inference-time computation and memory complexity. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations, positioning them as strong alternatives to Transformers for long sequence modeling. However, efficiently scaling the expressive power of SSMs, particularly with Mixture of Experts (MoE), remains challenging, as naive integration attempts often falter or degrade performance. In this work, we introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts. By sharing routing decisions between projection layers and lightweight sub-modules within Mamba across experts, RoM leverages synergies among linear projection experts for effective and efficient sparse scaling of Mamba layers. At a scale of 1.3B active parameters (10B total) and 16K training sequence length, RoM achieves language modeling performance equivalent to a dense Mamba model requiring over 2.3× more active parameters, and demonstrates consistent perplexity across context lengths. Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPS saving compared to dense Mamba scaling for similar performance. We release our training codebase at https://github.com/zhanzheng8585/Routing-Mamba.
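
A toy sketch of the routing-sharing idea, assuming top-1 routing and plain linear experts; the Mamba-specific projections, gating, and load balancing are omitted:

    import torch
    import torch.nn as nn

    class SharedRoutingProjection(nn.Module):
        # Sketch only: one router picks an expert per token, and that same decision
        # selects the matching expert for several projection matrices in the layer,
        # rather than routing each projection independently.
        def __init__(self, d_model: int, d_inner: int, n_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.in_proj = nn.Parameter(torch.randn(n_experts, d_model, d_inner) * 0.02)
            self.out_proj = nn.Parameter(torch.randn(n_experts, d_inner, d_model) * 0.02)

        def forward(self, x):                           # x: (tokens, d_model)
            expert = self.router(x).argmax(dim=-1)      # shared top-1 decision (assumption)
            h = torch.einsum('td,tdi->ti', x, self.in_proj[expert])
            return torch.einsum('ti,tid->td', h, self.out_proj[expert])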

NeurIPS Conference 2025 Conference Paper

Training Language Models to Generate Quality Code with Program Analysis Feedback

  • Feng Yao
  • Zilong Wang
  • Liyuan Liu
  • Junxia Cui
  • Li Zhong
  • Xiaohan Fu
  • Haohui Mai
  • Viswanathan Krishnan

Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL (Reinforcement rEwards from Automated anaLysis), a reinforcement learning framework that trains LLMs to generate production-quality code using program analysis–guided feedback. Specifically, REAL integrates two automated signals: (1) static analyzers detecting security and maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
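
A hypothetical reward function in the spirit of REAL's two signals; the weighting and the per-finding penalty are assumptions, not the paper's exact formulation:

    def quality_aware_reward(tests_passed: bool, analyzer_findings: list, w_quality: float = 0.5) -> float:
        # Functional correctness comes from unit tests, quality from static-analysis findings.
        functional = 1.0 if tests_passed else 0.0
        quality = max(0.0, 1.0 - 0.1 * len(analyzer_findings))  # assumed per-defect penalty
        return (1.0 - w_quality) * functional + w_quality * quality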

IJCAI Conference 2025 Conference Paper

Trajectory-Dependent Generalization Bounds for Pairwise Learning with φ-mixing Samples

  • Liyuan Liu
  • Hong Chen
  • Weifu Li
  • Tieliang Gong
  • Hao Deng
  • Yulong Wang

Recently, the mathematical tool from fractal geometry (i.e., fractal dimension) has been employed to investigate optimization trajectory-dependent generalization ability for some pointwise learning models with independent and identically distributed (i.i.d.) observations. This paper goes beyond the limitations of pointwise learning and i.i.d. samples, and establishes generalization bounds for pairwise learning with uniformly strong mixing samples. The derived theoretical results fill the gap of trajectory-dependent generalization analysis for pairwise learning, and can be applied to a wide range of learning paradigms, e.g., metric learning, ranking, and gradient learning. Technically, our framework brings concentration estimation with Rademacher complexity and trajectory-dependent fractal dimension together in a coherent way for felicitous learning theory analysis. In addition, the efficient computation of fractal dimension can be guaranteed for random algorithms (e.g., the stochastic gradient descent algorithm for deep neural networks) by bridging topological data analysis tools and the trajectory-dependent fractal dimension.

ICLR Conference 2025 Conference Paper

Vector-ICL: In-context Learning with Continuous Vector Representations

  • Yufan Zhuang
  • Chandan Singh
  • Liyuan Liu
  • Jingbo Shang
  • Jianfeng Gao 0001

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities on textual data. We explore whether these capabilities can be extended to continuous vectors from diverse domains, obtained from black-box pretrained encoders. By aligning input data with an LLM's embedding space through lightweight projectors, we observe that LLMs can effectively process and learn from these projected vectors, which we term Vector-ICL. In particular, we find that pretraining projectors with general language modeling objectives enables Vector-ICL, while task-specific finetuning further enhances performance. In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific models or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms.
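
The projector itself can be as small as a single linear layer; the sketch below maps a frozen encoder's output into the LLM's token-embedding space so it can be spliced into the prompt as a continuous "token" (dimensions and the training objective are assumptions):

    import torch.nn as nn

    class VectorICLProjector(nn.Module):
        # Lightweight projector: encoder vector -> LLM embedding space.
        def __init__(self, encoder_dim: int, llm_embed_dim: int):
            super().__init__()
            self.proj = nn.Linear(encoder_dim, llm_embed_dim)

        def forward(self, x):
            return self.proj(x)  # shape (..., llm_embed_dim), spliced into the input embeddings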

ICLR Conference 2024 Conference Paper

Fast-ELECTRA for Efficient Pre-training

  • Chengyu Dong
  • Liyuan Liu
  • Hao Cheng 0002
  • Jingbo Shang
  • Jianfeng Gao 0001
  • Xiaodong Liu 0003

ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model. Although ELECTRA offers a significant boost in efficiency, its potential is constrained by the training cost brought by the auxiliary model. Notably, this model, which is jointly trained with the main model, only serves to assist the training of the main model and is discarded post-training. This results in a substantial amount of training cost being expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model. To construct a learning curriculum for the main model, we smooth its output distribution via temperature scaling following a descending schedule. Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model. Our method also reduces the sensitivity to hyper-parameters and enhances the pre-training stability.
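
The curriculum amounts to dividing the frozen auxiliary LM's logits by a temperature that decays over training; a linear schedule is assumed here purely for illustration:

    def smoothed_generator_logits(logits, step: int, total_steps: int, t_start: float = 2.0, t_end: float = 1.0):
        # Higher temperature early in training -> softer replacement distribution -> easier detection task.
        t = t_start + (t_end - t_start) * (step / total_steps)  # assumed linear decay
        return logits / t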

TMLR Journal 2024 Journal Article

Learning a Decision Tree Algorithm with Transformers

  • Yufan Zhuang
  • Liyuan Liu
  • Chandan Singh
  • Jingbo Shang
  • Jianfeng Gao

Decision trees are renowned for their ability to achieve high predictive performance while remaining interpretable, especially on tabular data. Traditionally, they are constructed through recursive algorithms, where they partition the data at every node in a tree. However, identifying a good partition is challenging, as decision trees optimized for local segments may not yield global generalization. To address this, we introduce MetaTree, a transformer-based model trained via meta-learning to directly produce strong decision trees. Specifically, we fit both greedy decision trees and globally optimized decision trees on a large number of datasets, and train MetaTree to produce only the trees that achieve strong generalization performance. This training enables MetaTree to emulate these algorithms and intelligently adapt its strategy according to the context, thereby achieving superior generalization performance.

ICLR Conference 2024 Conference Paper

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

  • Suyu Ge
  • Yunan Zhang 0001
  • Liyuan Liu
  • Minjia Zhang
  • Jiawei Han 0001
  • Jianfeng Gao 0001

In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates a substantial reduction in GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.
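
The per-head behavior can be pictured as simple eviction rules applied to each head's cache; the sketch below is illustrative and omits the profiling step that assigns a policy to each head:

    def compress_kv_for_head(k, v, policy: str, special_idx, window: int = 64):
        # k, v: (seq_len, head_dim) for a single attention head (assumed layout).
        if policy == "local":        # head attends mostly to recent tokens
            return k[-window:], v[-window:]
        if policy == "special":      # head attends mostly to special tokens (indices in special_idx)
            return k[special_idx], v[special_idx]
        return k, v                  # broadly-attending head keeps the full cache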

ICLR Conference 2024 Conference Paper

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

  • Qingru Zhang
  • Chandan Singh
  • Liyuan Liu
  • Xiaodong Liu 0003
  • Bin Yu 0001
  • Jianfeng Gao 0001
  • Tuo Zhao

In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need -- steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA -- Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Our code is publicly available at https://github.com/QingruZhang/PASTA.
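
The steering step can be sketched as reweighting a selected head's attention distribution toward the emphasized span and renormalizing; the downscaling coefficient and where it is applied are assumptions in the spirit of the method:

    import torch

    def steer_attention(attn_probs, emphasized_mask, alpha: float = 0.01):
        # attn_probs: post-softmax attention weights (..., seq_len); emphasized_mask: bool (seq_len,).
        reweighted = torch.where(emphasized_mask, attn_probs, attn_probs * alpha)
        return reweighted / reweighted.sum(dim=-1, keepdim=True)  # renormalize per query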

ICLR Conference 2024 Conference Paper

Toward Student-oriented Teacher Network Training for Knowledge Distillation

  • Chengyu Dong
  • Liyuan Liu
  • Jingbo Shang

How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a best-performing teacher does not necessarily yield the best-performing student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher’s capability to approximate the true label distribution of training inputs. We theoretically establish that the ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.

NeurIPS Conference 2023 Conference Paper

Bridging Discrete and Backpropagation: Straight-Through and Beyond

  • Liyuan Liu
  • Chengyu Dong
  • Xiaodong Liu
  • Bin Yu
  • Jianfeng Gao

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun’s method, a second-order numerical method for solving ODEs. ReinMax does not require Hessian or other second-order derivatives, thus having negligible computation overheads. Extensive experimental results on various tasks demonstrate the superiority of ReinMax over the state of the art.
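
For reference, the first-order Straight-Through estimator that the paper analyzes looks like the sketch below; ReinMax replaces this implicit first-order approximation with a Heun-style second-order correction, which is not shown here:

    import torch
    import torch.nn.functional as F

    def straight_through_sample(logits):
        probs = logits.softmax(dim=-1)
        idx = torch.multinomial(probs, num_samples=1).squeeze(-1)
        hard = F.one_hot(idx, logits.size(-1)).to(logits.dtype)
        # Forward pass emits the discrete one-hot; backward pass flows through probs.
        return hard + probs - probs.detach()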

ICML Conference 2023 Conference Paper

Understand and Modularize Generator Optimization in ELECTRA-style Pretraining

  • Chengyu Dong
  • Liyuan Liu
  • Hao Cheng 0002
  • Jingbo Shang
  • Jianfeng Gao 0001
  • Xiaodong Liu 0003

Despite the effectiveness of ELECTRA-style pre-training, its performance is dependent on the careful selection of the model size for the auxiliary generator, leading to high trial-and-error costs. In this paper, we present the first systematic study of this problem. Our theoretical investigation highlights the importance of controlling the generator capacity in ELECTRA-style training. Meanwhile, we find that it is not handled properly in the original ELECTRA design, leading to the sensitivity issue. Specifically, since adaptive optimizers like Adam will cripple the weighting of individual losses in the joint optimization, the original design fails to control the generator training effectively. To regain control over the generator, we modularize the generator optimization by decoupling the generator optimizer and discriminator optimizer completely, instead of simply relying on the weighted objective combination. Our simple technique reduces the sensitivity of ELECTRA training significantly and obtains considerable performance gains compared to the original design.
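
The modularization boils down to giving the generator and discriminator separate optimizers and separate update steps, rather than one step on a weighted sum of losses. In the sketch below, run_generator and run_discriminator are hypothetical helpers:

    def electra_train_step(generator, discriminator, batch, gen_opt, disc_opt):
        # run_generator / run_discriminator are hypothetical helpers; the corrupted
        # token ids are discrete samples, so no gradient crosses between the two models.
        gen_loss, corrupted_ids = run_generator(generator, batch)
        gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()

        disc_loss = run_discriminator(discriminator, corrupted_ids, batch)
        disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()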

NeurIPS Conference 2022 Conference Paper

Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting

  • Chengyu Dong
  • Liyuan Liu
  • Jingbo Shang

We show that label noise exists in adversarial training. Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples – the true label distribution is distorted by the adversarial perturbation, but is neglected by the common practice that inherits labels from clean examples. Recognizing label noise sheds insights on the prevalence of robust overfitting in adversarial training, and explains its intriguing dependence on perturbation radius and data quality. Also, our label noise perspective aligns well with our observations of the epoch-wise double descent in adversarial training. Guided by our analyses, we propose a method that automatically calibrates labels to address label noise and robust overfitting. Our method achieves consistent performance improvements across various models and datasets without introducing new hyper-parameters or additional tuning.

AAAI Conference 2021 Conference Paper

Empower Distantly Supervised Relation Extraction with Collaborative Adversarial Training

  • Tao Chen
  • Haochen Shi
  • Liyuan Liu
  • Siliang Tang
  • Jian Shao
  • Zhigang Chen
  • Yueting Zhuang

With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention is attracted to leveraging multi-instance learning (MIL) to distill high-quality supervision from the noisy DS data. Here, we go beyond label noise and identify the key bottleneck of DS-MIL to be its low data utilization: as MIL refines high-quality supervision, it abandons a large number of training instances, which leads to low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve data utilization, which coordinates virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ instance-level VAT to recycle instances abandoned by MIL. Besides, we deploy AT at the bag level to unleash the full potential of the high-quality supervision obtained by MIL. Our proposed method brings consistent improvements (∼5 points of absolute AUC) over the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.

ICLR Conference 2020 Conference Paper

On the Variance of the Adaptive Learning Rate and Beyond

  • Liyuan Liu
  • Haoming Jiang
  • Pengcheng He
  • Weizhu Chen
  • Xiaodong Liu 0003
  • Jianfeng Gao 0001
  • Jiawei Han 0001

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate -- its variance is problematically large in the early stage, and presume warmup works as a variance reduction technique. We provide both empirical and theoretical evidence to verify our hypothesis. We further propose Rectified Adam (RAdam), a novel variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the efficacy and robustness of RAdam.
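
The rectification term from the paper can be written out as below: when the estimated variance of the adaptive learning rate is intractable (small ρ_t), the update falls back to un-adapted momentum, and otherwise the adaptive step is scaled by r_t:

    import math

    def radam_rectifier(step: int, beta2: float = 0.999):
        # step >= 1; returns None when the adaptive term should be skipped.
        rho_inf = 2.0 / (1.0 - beta2) - 1.0
        rho_t = rho_inf - 2.0 * step * beta2**step / (1.0 - beta2**step)
        if rho_t <= 4.0:
            return None  # variance of the adaptive lr not tractable: use plain momentum update
        return math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                         ((rho_inf - 4) * (rho_inf - 2) * rho_t))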

ICML Conference 2020 Conference Paper

Towards Adaptive Residual Network Training: A Neural-ODE Perspective

  • Chengyu Dong
  • Liyuan Liu
  • Zichao Li 0002
  • Jingbo Shang

In pursuit of resource-economical machine learning, attempts have been made to dynamically adjust computation workloads in different training stages, i.e., starting with a shallow network and gradually increasing the model depth (and computation workloads) during training. However, there is neither guarantee nor guidance on designing such network growth, due to the lack of theoretical underpinnings. In this work, we explore the underlying theory by conducting analyses from an ordinary differential equation perspective. Specifically, we illustrate the dynamics of network growth and propose a novel performance measure specific to the depth increase. Illuminated by our analyses, we move towards theoretically sound growing operations and schedulers, giving rise to an adaptive training algorithm for residual networks, LipGrow, which automatically increases network depth and thus accelerates training. In our experiments, it achieves comparable performance while reducing ∼50% of training time.

AAAI Conference 2019 Conference Paper

Cross-Relation Cross-Bag Attention for Distantly-Supervised Relation Extraction

  • Yujin Yuan
  • Liyuan Liu
  • Siliang Tang
  • Zhongfei Zhang
  • Yueting Zhuang
  • Shiliang Pu
  • Fei Wu
  • Xiang Ren

Distant supervision leverages knowledge bases to automatically label instances, thus allowing us to train relation extractors without human annotations. However, the generated training data typically contain massive noise, and may result in poor performance with vanilla supervised learning. In this paper, we propose to conduct multi-instance learning with a novel Cross-relation Cross-bag Selective Attention (C²SA), which leads to noise-robust training for distantly supervised relation extractors. Specifically, we employ sentence-level selective attention to reduce the effect of noisy or mismatched sentences, while correlations among relations are captured to improve the quality of attention weights. Moreover, instead of treating all entity pairs equally, we try to pay more attention to entity pairs with higher quality. Similarly, we adopt the selective attention mechanism to achieve this goal. Experiments with two types of relation extractors demonstrate the superiority of the proposed approach over the state of the art, while further ablation studies verify our intuitions and demonstrate the effectiveness of our two proposed techniques.

AAAI Conference 2018 Conference Paper

Empower Sequence Labeling with Task-Aware Neural Language Model

  • Liyuan Liu
  • Jingbo Shang
  • Xiang Ren
  • Frank Xu
  • Huan Gui
  • Jian Peng
  • Jiawei Han

Linguistic sequence labeling is a general approach encompassing a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted features. However, in many cases, it is hard to obtain sufficient annotations to train these models. In this study, we develop a neural framework to extract knowledge from raw texts and empower the sequence labeling task. Besides word-level knowledge contained in pretrained word embeddings, character-aware neural language models are incorporated to extract character-level knowledge. Transfer learning techniques are further adopted to mediate different components and guide the language model towards the key knowledge. Compared to previous methods, this task-specific knowledge allows us to adopt a more concise model and conduct more efficient training. Different from most transfer learning methods, the proposed framework does not rely on any additional supervision. It extracts knowledge from the self-contained order information of training sequences. Extensive experiments on benchmark datasets demonstrate the effectiveness of leveraging character-level knowledge and the efficiency of co-training. For example, on the CoNLL03 NER task, model training completes in about 6 hours on a single GPU, reaching an F1 score of 91.71±0.10 without using any extra annotations.