Arrow Research search

Author name cluster

Kewei Tu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

ICML Conference 2025 Conference Paper

Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

  • Xiang Hu
  • Zhihao Teng
  • Jun Zhao
  • Wei Wu
  • Kewei Tu

Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention, which often requires post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 $\times$ the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-$k$ relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner, which adapts better to causal language models. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is $1000 \times$ the training length.

NeurIPS Conference 2025 Conference Paper

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access

  • Xiang Hu
  • Jiaqi Leng
  • Jun Zhao
  • Kewei Tu
  • Wei Wu

A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose \textbf{H}ierarchical \textbf{S}parse \textbf{A}ttention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selecting the top-$k$ chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.

ICLR Conference 2024 Conference Paper

Augmenting Transformers with Recursively Composed Multi-grained Representations

  • Xiang Hu
  • Qingyang Zhu
  • Kewei Tu
  • Wei Wu

We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications. To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained representations fully contextualized with other spans. Moreover, the CIO layers can be jointly pre-trained with Transformers, making ReCAT enjoy scaling ability, strong performance, and interpretability at the same time. We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and recursive models on natural language inference tasks. More interestingly, the hierarchical structures induced by ReCAT exhibit strong consistency with human-annotated syntactic trees, indicating good interpretability brought by the CIO layers.

AAAI Conference 2024 Conference Paper

Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields

  • Chaoyi Ai
  • Kewei Tu

This paper presents an approach to frame semantic role labeling (FSRL), a task in natural language processing that identifies semantic roles within a text following the theory of frame semantics. Unlike previous approaches which do not adequately model correlations and interactions amongst arguments, we propose arbitrary-order conditional random fields (CRFs) that are capable of modeling full interaction amongst an arbitrary number of arguments of a given predicate. To achieve tractable representation and inference, we apply canonical polyadic decomposition to the arbitrary-order factor in our proposed CRF and utilize mean-field variational inference for approximate inference. We further unfold our iterative inference procedure into a recurrent neural network that is connected to our neural encoder and scorer, enabling end-to-end training and inference. Finally, we also improve our model with several techniques such as span-based scoring and decoding. Our experiments show that our approach achieves state-of-the-art performance in FSRL.

AAAI Conference 2024 Conference Paper

SeqGPT: An Out-of-the-Box Large Language Model for Open Domain Sequence Understanding

  • Tianyu Yu
  • Chengyue Jiang
  • Chao Lou
  • Shen Huang
  • Xiaobin Wang
  • Wei Liu
  • Jiong Cai
  • Yangning Li

Large language models (LLMs) have shown impressive abilities for open-domain NLP tasks. However, LLMs are sometimes too footloose for natural language understanding (NLU) tasks which always have restricted output and input format. Their performances on NLU tasks are highly related to prompts or demonstrations and are shown to be poor at performing several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks with two atomic tasks, which define fixed instructions to restrict the input and output format but still ``open'' for arbitrarily varied label sets. The model is first instruction-tuned with extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned by 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability, and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on the transfer across tasks. Our models are accessible at https://github.com/Alibaba-NLP/SeqGPT.

ICLR Conference 2023 Conference Paper

A Multi-Grained Self-Interpretable Symbolic-Neural Model For Single/Multi-Labeled Text Classification

  • Xiang Hu
  • Xinyu Kong
  • Kewei Tu

Deep neural networks based on layer-stacking architectures have historically suffered from poor inherent interpretability. Meanwhile, symbolic probabilistic models function with clear interpretability, but how to combine them with neural networks to enhance their performance remains to be explored. In this paper, we try to marry these two systems for text classification via a structured language model. We propose a Symbolic-Neural model that can learn to explicitly predict class labels of text spans from a constituency tree without requiring any access to span-level gold labels. As the structured language model learns to predict constituency trees in a self-supervised manner, only raw texts and sentence-level labels are required as training data, which makes it essentially a general constituent-level self-interpretable classification model. Our experiments demonstrate that our approach could achieve good prediction accuracy in downstream tasks. Meanwhile, the predicted span labels are consistent with human rationales to a certain degree.

AAAI Conference 2022 Conference Paper

Span-Based Semantic Role Labeling with Argument Pruning and Second-Order Inference

  • Zixia Jia
  • Zhaohui Yan
  • Haoyi Wu
  • Kewei Tu

We study graph-based approaches to span-based semantic role labeling. This task is difficult due to the need to enumerate all possible predicate-argument pairs and the high degree of imbalance between positive and negative samples. Based on these difficulties, high-order inference that considers interactions between multiple arguments and predicates is often deemed beneficial but has rarely been used in span-based semantic role labeling. Because even for second-order inference, there are already O(n5 ) parts for a sentence of length n, and exact high-order inference is intractable. In this paper, we propose a framework consisting of two networks: a predicate-agnostic argument pruning network that reduces the number of candidate arguments to O(n), and a semantic role labeling network with an optional second-order decoder that is unfolded from an approximate inference algorithm. Our experiments show that our framework achieves significant and consistent improvement over previous approaches.

AAAI Conference 2019 Conference Paper

Bidirectional Transition-Based Dependency Parsing

  • Yunzhe Yuan
  • Yong Jiang
  • Kewei Tu

Transition-based dependency parsing is a fast and effective approach for dependency parsing. Traditionally, a transitionbased dependency parser processes an input sentence and predicts a sequence of parsing actions in a left-to-right manner. During this process, an early prediction error may negatively impact the prediction of subsequent actions. In this paper, we propose a simple framework for bidirectional transitionbased parsing. During training, we learn a left-to-right parser and a right-to-left parser separately. To parse a sentence, we perform joint decoding with the two parsers. We propose three joint decoding algorithms that are based on joint scoring, dual decomposition, and dynamic oracle respectively. Empirical results show that our methods lead to competitive parsing accuracy and our method based on dynamic oracle consistently achieves the best performance.

AAAI Conference 2018 Conference Paper

Maximum A Posteriori Inference in Sum-Product Networks

  • Jun Mei
  • Yong Jiang
  • Kewei Tu

Sum-product networks (SPNs) are a class of probabilistic graphical models that allow tractable marginal inference. However, the maximum a posteriori (MAP) inference in SPNs is NP-hard. We investigate MAP inference in SPNs from both theoretical and algorithmic perspectives. For the theoretical part, we reduce general MAP inference to its special case without evidence and hidden variables; we also show that it is NP-hard to approximate the MAP problem to 2n for fixed 0 ≤ < 1, where n is the input size. For the algorithmic part, we first present an exact MAP solver that runs reasonably fast and could handle SPNs with up to 1k variables and 150k arcs in our experiments. We then present a new approximate MAP solver with a good balance between speed and accuracy, and our comprehensive experiments on real-world datasets show that it has better overall performance than existing approximate solvers.

AAAI Conference 2017 Conference Paper

Latent Dependency Forest Models

  • Shanbo Chu
  • Yong Jiang
  • Kewei Tu

Probabilistic modeling is one of the foundations of modern machine learning and artificial intelligence. In this paper, we propose a novel type of probabilistic models named latent dependency forest models (LDFMs). A LDFM models the dependencies between random variables with a forest structure that can change dynamically based on the variable values. It is therefore capable of modeling context-specific independence. We parameterize a LDFM using a first-order non-projective dependency grammar. Learning LDFMs from data can be formulated purely as a parameter learning problem, and hence the difficult problem of model structure learning is circumvented. Our experimental results show that LDFMs are competitive with existing probabilistic models.

IJCAI Conference 2016 Conference Paper

Stochastic and-or Grammars: A Unified Framework and Logic Perspective

  • Kewei Tu

Stochastic And-Or grammars (AOG) extend traditional stochastic grammars of language to model other types of data such as images and events. In this paper we propose a representation framework of stochastic AOGs that is agnostic to the type of the data being modeled and thus unifies various domain-specific AOGs. Many existing grammar formalisms and probabilistic models in natural language processing, computer vision, and machine learning can be seen as special cases of this framework. We also propose a domain-independent inference algorithm of stochastic context-free AOGs and show its tractability under a reasonable assumption. Furthermore, we provide two interpretations of stochastic context-free AOGs as a subset of probabilistic logic, which connects stochastic AOGs to the field of statistical relational learning and clarifies their relation with a few existing statistical relational models.

NeurIPS Conference 2013 Conference Paper

Unsupervised Structure Learning of Stochastic And-Or Grammars

  • Kewei Tu
  • Maria Pavlovskaia
  • Song-Chun Zhu

Stochastic And-Or grammars compactly represent both compositionality and reconfigurability and have been used to model different types of data such as images and events. We present a unified formalization of stochastic And-Or grammars that is agnostic to the type of the data being modeled, and propose an unsupervised approach to learning the structures as well as the parameters of such grammars. Starting from a trivial initial grammar, our approach iteratively induces compositions and reconfigurations in a unified manner and optimizes the posterior probability of the grammar. In our empirical evaluation, we applied our approach to learning event grammars and image grammars and achieved comparable or better performance than previous approaches.

IJCAI Conference 2011 Conference Paper

On the Utility of Curricula in Unsupervised Learning of Probabilistic Grammars

  • Kewei Tu
  • Vasant Honavar

We examine the utility of a curriculum (a means of presenting training samples in a meaningful order) in unsupervised learning of probabilistic grammars. We introduce the {\em incremental construction hypothesis} that explains the benefits of a curriculum in learning grammars and offers some useful insights into the design of curricula as well as learning algorithms. We present results of experiments with (a) carefully crafted synthetic data that provide support for our hypothesis and (b) natural language corpus that demonstrate the utility of curricula in unsupervised learning of probabilistic grammars.