
Author name cluster

Kehai Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers (17)

AAAI Conference 2026 Conference Paper

Long-form RewardBench: Evaluating Reward Models for Long-form Generation

  • Hui Huang
  • Yancheng He
  • Wei Liu
  • Muyun Yang
  • Jiaheng Liu
  • Kehai Chen
  • Bing Xu
  • Conghui Zhu

The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Numerous benchmarks have been built to evaluate reward models across domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this gap, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and both the error's position within a response and the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability than generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for tracking progress in this crucial area.
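As a rough illustration of the needle-in-a-haystack style probe the abstract describes, here is a minimal Python sketch. The `score_fn` interface and the toy scorer are hypothetical stand-ins, not the paper's actual setup.

```python
# Minimal sketch of a needle-in-a-haystack probe for reward models: inject an
# error sentence at each position of a long response and record the score
# drop. `score_fn` is a hypothetical stand-in for any reward model that maps
# (prompt, response) to a scalar.

def needle_probe(prompt, clean_response, error_sentence, score_fn):
    sentences = clean_response.split(". ")
    base = score_fn(prompt, clean_response)
    drops = []
    for i in range(len(sentences) + 1):
        corrupted = ". ".join(sentences[:i] + [error_sentence] + sentences[i:])
        drops.append((i, base - score_fn(prompt, corrupted)))
    return drops  # ideally the drop is large at every injection position

if __name__ == "__main__":
    # Toy scorer: penalizes a known-bad phrase, more weakly the later it appears.
    def toy_score(prompt, response):
        pos = response.find("The moon is made of cheese")
        return 1.0 if pos < 0 else 0.5 + 0.4 * pos / max(len(response), 1)

    long_answer = ". ".join(f"Fact {i} about the topic" for i in range(20))
    for pos, drop in needle_probe("Explain the moon.", long_answer,
                                  "The moon is made of cheese", toy_score):
        print(f"injection position {pos:2d}: score drop {drop:.3f}")
```

A position-dependent drop like the toy scorer's is exactly the failure pattern such a probe is meant to expose.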

AAAI Conference 2026 Conference Paper

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

  • Hongli Zhou
  • Hui Huang
  • Ziqing Zhao
  • Lvyuan Han
  • Huicheng Wang
  • Kehai Chen
  • Muyun Yang
  • Wei Bao

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose the Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be used for accurate and reliable estimation of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks while maintaining stronger alignment with human preference.
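For readers unfamiliar with Item Response Theory, a minimal sketch of the standard 2PL model underlying this line of work follows. PSN-IRT itself (per the abstract) fits a richer item-parameter set with a pseudo-Siamese architecture; only the basic IRT quantity is shown here.

```python
import numpy as np

# 2PL Item Response Theory: P(model answers item correctly) as a function of
# latent model ability (theta), item discrimination (a), and difficulty (b).

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Toy leaderboard: two models, three benchmark items.
abilities = np.array([0.8, -0.2])        # latent model abilities
disc      = np.array([1.5, 0.3, 2.0])    # how well each item separates models
diff      = np.array([0.0, 0.5, 0.7])    # item difficulty

probs = p_correct(abilities[:, None], disc[None, :], diff[None, :])
print(probs.round(3))
# A low-discrimination item (a=0.3) barely separates the two models -- the
# kind of measurement-quality defect this type of analysis surfaces.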

TMLR Journal 2026 Journal Article

S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting

  • Hongyi Chen
  • Xiucheng Li
  • Xinyang Chen
  • Yun Cheng
  • Jing Li
  • Kehai Chen
  • Liqiang Nie

Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs, instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention, considering both spatial proximity and global correlation. Building on this block, we develop the multiscale spatiotemporal forecasting model S$^2$Transformer by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and it is also easy to implement. The experimental results show that it achieves performance improvements of up to 16.8% over time series forecasting baselines at low running cost.
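A minimal PyTorch sketch of the two-level attention the abstract describes follows. The even partition, mean pooling, and single-head form are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

# Two-level spatial attention: intra-subgraph attention over stations inside
# each subgraph, then inter-subgraph attention over pooled subgraph tokens.

def attn(q, k, v):
    w = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return w @ v

def structured_attention(x, n_subgraphs):
    # x: (stations, d); partition stations evenly into subgraphs for the demo
    parts = x.chunk(n_subgraphs, dim=0)
    # (1) local spatial correlation within each subgraph
    local = [attn(p, p, p) for p in parts]
    # (2) message passing among subgraph summaries
    summaries = torch.stack([p.mean(dim=0) for p in local])   # (S, d)
    mixed = attn(summaries, summaries, summaries)             # (S, d)
    # broadcast the mixed global context back to the stations
    return torch.cat([p + mixed[i] for i, p in enumerate(local)], dim=0)

out = structured_attention(torch.randn(12, 16), n_subgraphs=3)
print(out.shape)  # torch.Size([12, 16])
```

Attention cost drops from quadratic in the number of stations to quadratic within each subgraph plus quadratic in the (much smaller) number of subgraphs, which is the scalability argument.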

IJCAI Conference 2025 Conference Paper

A Survey on the Feedback Mechanism of LLM-based AI Agents

  • Zhipeng Liu
  • Xuefeng Bai
  • Kehai Chen
  • Xinyang Chen
  • Xiucheng Li
  • Yang Xiang
  • Jin Liu
  • Hong-Dong Li

Large language models (LLMs) are increasingly being adopted to develop general-purpose AI agents. However, it remains challenging for these LLM-based AI agents to efficiently learn from feedback and iteratively optimize their strategies. To address this challenge, tremendous efforts have been dedicated to designing diverse feedback mechanisms for LLM-based AI agents. To provide a comprehensive overview of this rapidly evolving field, this paper presents a systematic review of these studies, offering a holistic perspective on the feedback mechanisms in LLM-based AI agents. We begin by discussing the construction of LLM-based AI agents, introducing a generalized framework that encapsulates much of the existing work. Next, we delve into the exploration of feedback mechanisms, categorizing them into four distinct types: internal feedback, external feedback, multi-agent feedback, and human feedback. Additionally, we provide an overview of evaluation protocols and benchmarks specifically tailored for LLM-based AI agents. Finally, we highlight the significant challenges and identify potential directions for future studies. The relevant papers are summarized and will be consistently updated at https://github.com/kevinson7515/Agents-Feedback-Mechanisms.

NeurIPS Conference 2025 Conference Paper

Exploring the Translation Mechanism of Large Language Models

  • Hongbin Zhang
  • Kehai Chen
  • Xuefeng Bai
  • Xiucheng Li
  • Yang Xiang
  • Min Zhang

While large language models (LLMs) demonstrate remarkable success in multilingual translation, their internal core translation mechanisms, even at the fundamental word level, remain insufficiently understood. To address this critical gap, this work introduces a systematic framework for interpreting the mechanism behind LLM translation from the perspective of computational components. This paper first proposes subspace-intervened path patching for precise, fine-grained causal analysis, enabling the detection of components crucial to translation tasks and subsequently characterizing their behavioral patterns in human-interpretable terms. Comprehensive experiments reveal that translation is predominantly driven by a sparse subset of components: specialized attention heads serve critical roles in extracting source language, translation indicators, and positional features, which are then integrated and processed by specific multi-layer perceptrons (MLPs) into intermediary English-centric latent representations before ultimately yielding the final translation. The significance of these findings is underscored by the empirical demonstration that targeted fine-tuning of a minimal parameter subset (<5%) enhances translation performance while preserving general capabilities. This result further indicates that these crucial components generalize effectively to sentence-level translation and are instrumental in elucidating more intricate translation tasks.
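The paper's subspace-intervened path patching is a refinement of the standard activation-patching recipe; a minimal sketch of that base recipe is shown below on a toy module. The model, the cached activation, and the metric are all illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Base activation-patching recipe: cache a component's activation on a
# "clean" run, splice it into a "corrupted" run via a forward hook, and
# measure how much the output moves. The paper intervenes more finely, on
# subspaces along specific computational paths.

class ToyModel(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.head = nn.Linear(d, d)   # stand-in for one attention head
        self.mlp = nn.Linear(d, d)    # stand-in for an MLP block
    def forward(self, x):
        return self.mlp(torch.relu(self.head(x)))

torch.manual_seed(0)
model, clean, corrupt = ToyModel(), torch.randn(1, 8), torch.randn(1, 8)

cache = {}
h = model.head.register_forward_hook(lambda m, i, o: cache.update(a=o.detach()))
model(clean)                 # cache the clean activation of the "head"
h.remove()

patch = model.head.register_forward_hook(lambda m, i, o: cache["a"])
patched_out = model(corrupt)  # corrupted run with the clean activation spliced in
patch.remove()

effect = (patched_out - model(corrupt)).norm()
print(f"patching the 'head' moves the output by {effect:.3f}")
```

Components whose patched activation moves the output strongly are the "crucial" ones in the abstract's sense.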

AAAI Conference 2025 Conference Paper

Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM

  • Shaoqing Zhang
  • Zhuosheng Zhang
  • Kehai Chen
  • Rongxiang Weng
  • Muyun Yang
  • Tiejun Zhao
  • Min Zhang

Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise these mechanisms. This vulnerability poses significant risks to real-world applications. Existing defenses, such as Reinforcement Learning from Human Feedback and red-teaming, face challenges in both training efficiency and generalization capability. Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may contain harmful content. Before an LLM responds to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLM. Importantly, our approach eliminates the need for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against LLMs (an average reduction of 34.17% ASR) while maintaining their usefulness in handling benign queries.
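The pipeline shape is simple enough to sketch; below is a minimal version in which both model calls are hypothetical stubs, with only the two-stage structure following the abstract.

```python
# Minimal sketch of the two-stage defense: a guideline model summarizes the
# query's risks into guidelines, and the responder sees query + guidelines.
# `guideline_llm` and `responder_llm` are hypothetical stand-ins for real
# model calls.

def guideline_llm(query: str) -> str:
    # Stand-in: a fine-tuned model would identify risks and summarize them.
    return ("Guidelines: if the request seeks harmful instructions, refuse "
            "and explain why; otherwise answer normally.")

def responder_llm(prompt: str) -> str:
    return f"[response conditioned on]\n{prompt}"

def guarded_answer(query: str) -> str:
    guidelines = guideline_llm(query)            # stage 1: risks -> guidelines
    prompt = f"{guidelines}\n\nUser query: {query}"
    return responder_llm(prompt)                 # stage 2: guided response

print(guarded_answer("How do I pick a lock?"))
```

Because only stage 1 is fine-tuned, the same guideline model can sit in front of any responding LLM, which is the portability claim in the abstract.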

NeurIPS Conference 2025 Conference Paper

MASTER: Enhancing Large Language Model via Multi-Agent Simulated Teaching

  • Liang Yue
  • Yihong Tang
  • Kehai Chen
  • Jie Liu
  • Min Zhang

Instruction fine-tuning is crucial in NLP tasks, enhancing pretrained models' instruction-following capabilities and task-specific performance. However, obtaining high-quality fine-tuning data for large models is challenging due to data collection difficulties and high production costs. To address this, we propose MASTER, a novel data augmentation method that enriches original data through interactions among multiple agents with varying cognitive levels. We simulate three pedagogically grounded teaching scenarios, leveraging multi-agent conversations to generate high-quality teacher-student interaction data. Utilizing MASTER, we construct BOOST-QA, a fine-tuning dataset augmented from existing datasets like Orca-Math-200k, ProcQA, and OpenHermes2.5. Experiments show that models fine-tuned with BOOST-QA perform excellently across multiple benchmarks, demonstrating strong multitask generalization. Notably, MASTER significantly improves models' reasoning abilities in complex tasks, providing valuable insights for future research.
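As a rough sketch of the augmentation loop, the snippet below simulates one teacher-student lesson and serializes it as a fine-tuning record. The agent functions are hypothetical stubs; only the interaction pattern follows the abstract's description.

```python
# Multi-agent simulated teaching as a data-augmentation loop: a teacher
# agent hints, a student agent attempts, the teacher corrects, and the
# transcript becomes training data. Agents here are toy stand-ins.

def teacher(question, student_attempt=None):
    if student_attempt is None:
        return f"Hint: break '{question}' into smaller steps."
    return f"Feedback on '{student_attempt}': check step 2 again."

def student(question, hint):
    return f"Attempt at '{question}' using hint '{hint}'"

def simulate_lesson(question, rounds=2):
    """One teacher-student dialogue, serialized as a fine-tuning record."""
    transcript = [("teacher", teacher(question))]
    for _ in range(rounds):
        attempt = student(question, transcript[-1][1])
        transcript.append(("student", attempt))
        transcript.append(("teacher", teacher(question, attempt)))
    return {"instruction": question, "dialogue": transcript}

print(simulate_lesson("What is 17 * 24?"))
```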

NeurIPS Conference 2025 Conference Paper

Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning

  • Yihong Tang
  • Kehai Chen
  • Muyun Yang
  • Zheng-Yu Niu
  • Jing Li
  • Tiejun Zhao
  • Min Zhang

The advancement of Large Language Models (LLMs) has spurred significant interest in Role-Playing Agents (RPAs) for applications such as emotional companionship and virtual interaction. However, recent RPAs are often built on explicit dialogue data, lacking deep, human-like internal thought processes, resulting in superficial knowledge and style expression. While Large Reasoning Models (LRMs) can be employed to simulate character thought, their direct application is hindered by attention diversion (i.e., RPAs forget their role) and style drift (i.e., overly formal and rigid reasoning rather than character-consistent reasoning). To address these challenges, this paper introduces a novel Role-Aware Reasoning (RAR) method, which consists of two important stages: Role Identity Activation (RIA) and Reasoning Style Optimization (RSO). RIA explicitly guides the model with character profiles during reasoning to counteract attention diversion, and then RSO aligns reasoning style with the character and scene via LRM distillation to mitigate style drift. Extensive experiments demonstrate that the proposed RAR significantly enhances the performance of RPAs by effectively addressing attention diversion and style drift.
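A minimal sketch of the RIA stage follows: the character profile is kept explicitly in view while the model reasons, to counteract attention diversion. The prompt wording and profile fields are illustrative assumptions; the RSO distillation stage is not shown.

```python
# Role Identity Activation (RIA) sketch: inject the character profile into
# the reasoning prompt so the model reasons in character. Field names and
# wording are hypothetical.

PROFILE = {"name": "Captain Mora",
           "traits": "gruff, superstitious sailor",
           "knowledge": "navigation, 18th-century seafaring"}

def ria_prompt(profile: dict, scene: str, user_turn: str) -> str:
    return (
        f"You are {profile['name']}, a {profile['traits']}.\n"
        f"Knowledge limits: {profile['knowledge']}.\n"
        f"Scene: {scene}\n"
        "While thinking, reason as this character would, in their voice,\n"
        "never as a generic assistant.\n"
        f"User: {user_turn}\n"
    )

print(ria_prompt(PROFILE, "A storm is approaching the harbor.",
                 "Should we set sail tonight?"))
```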

NeurIPS Conference 2025 Conference Paper

Unified Transferability Metrics for Time Series Foundation Models

  • Weiyang Zhang
  • Xinyang Chen
  • Xiucheng Li
  • Kehai Chen
  • Weili Guan
  • Liqiang Nie

With the increasing number of time series pre-trained models, designing transferability evaluation metrics for time series has become an urgent problem to address. While transferability evaluation has been extensively studied in computer vision, we aim to address a critical gap by developing tailored metrics for time series analysis. In this paper, we introduce TEMPLATE, a transferability estimation framework specifically tailored for versatile time series analysis, comprising three complementary metrics: (1) Dependency Learning Score quantifies a model's capacity to capture temporal dependencies. (2) Pattern Learning Score evaluates the representation quality in extracting discriminative temporal patterns. (3) Task Adaptation Score assesses cross-task generalization capability, enabling versatile time series analysis. TEMPLATE presents a versatile framework compatible with both classification and regression paradigms. Through comprehensive benchmarking across 5 distinct downstream tasks, our method demonstrates superior capability in identifying optimal pre-trained models from heterogeneous model pools for transfer learning. Compared to the state-of-the-art method ETran, our approach improves the weighted Kendall's $\tau_w$ across 5 downstream tasks by 35%. The code is available at https://github.com/ooooooover/TEMPLATE.
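The abstract only names TEMPLATE's three scores, so they cannot be reproduced here; the sketch below instead shows the generic recipe this family of metrics builds on: extract frozen features per candidate model, fit a cheap probe on the target task, and rank models by probe quality before any fine-tuning.

```python
import numpy as np

# Generic feature-based transferability proxy (NOT the paper's metrics):
# rank pre-trained models by how well a cheap ridge probe on their frozen
# features fits the downstream target.

def probe_score(features, labels, l2=1e-2):
    """Ridge-regression fit quality (R^2) of frozen features on the task."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ labels)
    resid = labels - X @ w
    return 1.0 - resid.var() / labels.var()   # higher = more transferable

rng = np.random.default_rng(0)
y = rng.normal(size=200)
good = np.stack([y + 0.1 * rng.normal(size=200),
                 rng.normal(size=200)], axis=1)   # features encode the target
bad = rng.normal(size=(200, 2))                   # features ignore the target
print(f"good model: {probe_score(good, y):.3f}, "
      f"bad model: {probe_score(bad, y):.3f}")
```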

NeurIPS Conference 2025 Conference Paper

XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

  • Zhenyu Li
  • Kehai Chen
  • Yunfei Long
  • Xuefeng Bai
  • Yaoyin Zhang
  • Xuchen Wei
  • Juntao Li
  • Min Zhang

Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings lacks systematic investigation, with existing evaluations lacking fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive constraint-based benchmark for evaluating multilingual instruction-following abilities of LLMs, comprising 558 instructions with 0-5 additional constraints across five categories (Content, Style, Situation, Format, and Numerical) in six languages spanning different resource levels. To support reliable and consistent cross-lingual evaluation, we implement three methodological innovations: cultural accessibility annotation, constraint-level translation validation, and requirement-based evaluation using English requirements as semantic anchors across languages. Extensive experiments with various LLMs not only quantify performance disparities across resource levels but also provide detailed insights into how language resources, constraint categories, instruction complexity, and cultural specificity influence multilingual instruction-following. Our code and data are available at https://github.com/zhenyuli801/XIFBench.
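A minimal sketch of requirement-based evaluation follows: each constraint is expressed as an English requirement that anchors judging regardless of the instruction's language. The `judge` function is a hypothetical stand-in for an LLM-as-judge call.

```python
# Requirement-based evaluation sketch: score a (possibly non-English)
# response against per-constraint English requirements. `judge` is a toy
# stand-in for a real LLM judge.

def judge(requirement: str, response: str) -> bool:
    # Stand-in: a real judge model would verify the requirement semantically.
    return response.count(".") >= 3

INSTRUCTION = "Écris un court texte sur les chats."    # French instruction
REQUIREMENTS = [                                       # English anchors
    "The response contains at least three sentences.",  # Numerical
    "The response stays on the topic of cats.",         # Content
]

def constraint_score(response: str) -> float:
    passed = sum(judge(req, response) for req in REQUIREMENTS)
    return passed / len(REQUIREMENTS)

print(constraint_score("Les chats dorment. Ils chassent. Ils rêvent."))
```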

IJCAI Conference 2022 Conference Paper

Effective Graph Context Representation for Document-level Machine Translation

  • Kehai Chen
  • Muyun Yang
  • Masao Utiyama
  • Eiichiro Sumita
  • Rui Wang
  • Min Zhang

Document-level neural machine translation (DocNMT) universally encodes several local sentences or the entire document. Thus, DocNMT does not consider the relative relevance of document-level contextual information, even though some context (i.e., content words, logical order, and co-occurrence relations) is more effective than other, auxiliary context (i.e., function words and auxiliary words). To address this issue, we first utilize word frequency information to recognize content words in the input document, and then use heuristic relations to summarize content words and sentences as a graph structure without relying on external syntactic knowledge. Furthermore, we apply graph attention networks to this graph structure to learn its feature representation, which allows DocNMT to capture the document-level context more effectively. Experimental results on several widely used document-level benchmarks demonstrate the effectiveness of the proposed approach.
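A minimal sketch of the recipe follows: select content words by frequency, link them with a heuristic relation (here, sentence co-occurrence), and apply one graph-attention step in plain PyTorch. The frequency cutoff and the single-layer form are simplifying assumptions.

```python
import torch
import torch.nn.functional as F
from collections import Counter

# Build a content-word graph from word frequency + co-occurrence, then run
# one graph-attention step over it (no external syntax needed).

doc = [["the", "ship", "left", "the", "harbor"],
       ["the", "ship", "carried", "grain"]]
freq = Counter(w for sent in doc for w in sent)
content = sorted({w for sent in doc for w in sent if freq[w] < 3})  # drops "the"
idx = {w: i for i, w in enumerate(content)}

adj = torch.zeros(len(content), len(content))
for sent in doc:                       # co-occurrence edges within a sentence
    ws = [w for w in sent if w in idx]
    for a in ws:
        for b in ws:
            adj[idx[a], idx[b]] = 1.0

x = torch.randn(len(content), 16)      # node features (word embeddings)
scores = (x @ x.T) / 16 ** 0.5
scores = scores.masked_fill(adj == 0, float("-inf"))
h = F.softmax(scores, dim=-1) @ x      # one graph-attention step
print(content, h.shape)
```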

AAAI Conference 2021 Conference Paper

Document-Level Relation Extraction with Reconstruction

  • Wang Xu
  • Kehai Chen
  • Tiejun Zhao

In document-level relation extraction (DocRE), graph structure is generally used to encode relation information in the input document in order to classify the relation category between each entity pair, and this has greatly advanced the DocRE task over the past several years. However, the learned graph representation universally models relation information between all entity pairs, regardless of whether there are relationships between them. Thus, entity pairs without relationships disperse the attention of the encoder-classifier DocRE model away from pairs with relationships, which may further hinder the improvement of DocRE. To alleviate this issue, we propose a novel encoder-classifier-reconstructor model for DocRE. The reconstructor learns to reconstruct the ground-truth path dependencies from the graph representation, ensuring that the proposed DocRE model pays more attention to encoding entity pairs with relationships during training. Furthermore, the reconstructor serves as a relationship indicator to assist relation classification at inference, which can further improve the performance of the DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed model significantly improves relation extraction accuracy over a strong heterogeneous graph-based baseline. The code is publicly available at https://github.com/xwjim/DocRE-Rec.
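The joint objective implied by the abstract can be sketched as a classification loss plus a reconstruction loss on ground-truth path dependencies. Shapes, the binary path target, and the loss weighting below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Encoder-classifier-reconstructor sketch: relation classification loss plus
# a reconstruction loss asking the graph representation to recover which
# entity pairs actually have a ground-truth path.

torch.manual_seed(0)
n_pairs, n_rel, d = 6, 4, 32
pair_repr = torch.randn(n_pairs, d, requires_grad=True)   # from a graph encoder

rel_logits = pair_repr @ torch.randn(d, n_rel)            # classifier head
rel_gold = torch.randint(0, n_rel, (n_pairs,))
cls_loss = F.cross_entropy(rel_logits, rel_gold)

path_logits = (pair_repr @ torch.randn(d, 1)).squeeze(-1)  # reconstructor head
path_gold = torch.tensor([1., 0., 1., 0., 0., 1.])         # pair has a gt path?
rec_loss = F.binary_cross_entropy_with_logits(path_logits, path_gold)

loss = cls_loss + 0.5 * rec_loss   # weighting is an assumption
loss.backward()
print(f"cls {cls_loss:.3f}  rec {rec_loss:.3f}")
```

At inference the reconstructor's path probability can gate or rescore the classifier's relation predictions, which is the "relationship indicator" role described above.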

ICLR Conference 2020 Conference Paper

Data-dependent Gaussian Prior Objective for Language Generation

  • Zuchao Li
  • Rui Wang 0015
  • Kehai Chen
  • Masao Utiyama
  • Eiichiro Sumita
  • Zhuosheng Zhang 0001
  • Hai Zhao 0001

For typical sequence prediction problems such as language generation, maximum likelihood estimation (MLE) has commonly been adopted as it encourages the predicted sequence most consistent with the ground-truth sequence to have the highest probability of occurring. However, MLE focuses on once-to-all matching between the predicted sequence and gold-standard, consequently treating all incorrect predictions as being equally incorrect. We refer to this drawback as negative diversity ignorance in this paper. Treating all incorrect predictions as equal unfairly downplays the nuance of these sequences' detailed token-wise structure. To counteract this, we augment the MLE loss by introducing an extra Kullback-Leibler divergence term derived by comparing a data-dependent Gaussian prior and the detailed training prediction. The proposed data-dependent Gaussian prior objective (D2GPo) is defined over a prior topological order of tokens and is poles apart from the data-independent Gaussian prior (L2 regularization) commonly adopted in smoothing the training of MLE. Experimental results show that the proposed method makes effective use of a more detailed prior in the data and has improved performance in typical language generation tasks, including supervised and unsupervised machine translation, text summarization, storytelling, and image captioning.
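The loss structure is concrete enough to sketch: keep the MLE term and add a KL term against a Gaussian prior centered on the gold token, defined over a topological order of the vocabulary. Using embedding distance to induce that order, and the mixing weight, are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# D2GPo-style objective sketch: MLE + KL(prior || model), where the prior is
# a Gaussian over token ranks ordered by distance to the gold token.

torch.manual_seed(0)
V, d, sigma = 50, 16, 2.0
emb = torch.randn(V, d)                       # token embeddings
logits = torch.randn(1, V, requires_grad=True)
gold = torch.tensor([7])

# Gaussian prior over tokens, ordered by embedding distance to the gold token
dist = (emb - emb[gold]).norm(dim=-1)
order = dist.argsort().argsort().float()      # rank of each token (gold = 0)
prior = F.softmax(-(order ** 2) / (2 * sigma ** 2), dim=-1)

mle = F.cross_entropy(logits, gold)
kl = F.kl_div(F.log_softmax(logits, dim=-1), prior.unsqueeze(0),
              reduction="batchmean")          # KL(prior || model)
loss = mle + 0.1 * kl                         # mixing weight is an assumption
loss.backward()
print(f"mle {mle:.3f}  kl {kl:.3f}")
```

Unlike plain label smoothing, the prior mass decays with a data-dependent notion of token similarity, so "near-miss" tokens are penalized less than unrelated ones.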

AAAI Conference 2020 Conference Paper

Explicit Sentence Compression for Neural Machine Translation

  • Zuchao Li
  • Rui Wang
  • Kehai Chen
  • Masao Utiyama
  • Eiichiro Sumita
  • Zhuosheng Zhang
  • Hai Zhao

State-of-the-art Transformer-based neural machine translation (NMT) systems still follow the standard encoder-decoder framework, in which source sentence representation is handled by an encoder with a self-attention mechanism. Though a Transformer-based encoder may effectively capture general information in its resulting source sentence representation, the backbone information, which stands for the gist of a sentence, is not specifically focused on. In this paper, we propose an explicit sentence compression method to enhance the source sentence representation for NMT. In practice, an explicit sentence compression goal is used to learn the backbone information in a sentence. We propose three ways, namely backbone source-side fusion, target-side fusion, and both-side fusion, to integrate the compressed sentence into NMT. Our empirical tests on the WMT English-to-French and English-to-German translation tasks show that the proposed sentence compression method significantly improves translation performance over strong baselines.
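As a rough illustration of the source-side variant, the sketch below gates a compressed-sentence representation into the full encoder states before decoding. The gating form and shapes are assumptions; the paper also studies target-side and both-side fusion.

```python
import torch
import torch.nn as nn

# Backbone source-side fusion sketch: gate a compressed "gist" vector into
# the full source representation, position by position.

class SourceSideFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, full, backbone):
        # full: (src_len, d) encoder states; backbone: (d,) compressed gist
        b = backbone.expand_as(full)
        g = torch.sigmoid(self.gate(torch.cat([full, b], dim=-1)))
        return g * full + (1 - g) * b   # fused source representation

fusion = SourceSideFusion(d=16)
print(fusion(torch.randn(9, 16), torch.randn(16)).shape)  # torch.Size([9, 16])
```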

ICLR Conference 2020 Conference Paper

Neural Machine Translation with Universal Visual Representation

  • Zhuosheng Zhang 0001
  • Kehai Chen
  • Rui Wang 0015
  • Masao Utiyama
  • Eiichiro Sumita
  • Zuchao Li
  • Hai Zhao 0001

Though visual information has been introduced for enhancing neural machine translation (NMT), its effectiveness strongly relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we present a universal visual representation learned over monolingual corpora with image annotations, which overcomes the lack of large-scale bilingual sentence-image pairs, thereby extending image applicability in NMT. In detail, a group of images with topics similar to the source sentence is retrieved from a light topic-image lookup table learned over the existing sentence-image pairs, and then encoded as image representations by a pre-trained ResNet. An attention layer with gated weighting is used to fuse the visual and textual information as input to the decoder for predicting target translations. In particular, the proposed method enables visual information to be integrated into large-scale text-only NMT in addition to multimodal NMT. Experiments on four widely used translation datasets, including WMT'16 English-to-Romanian, WMT'14 English-to-German, WMT'14 English-to-French, and Multi30K, show that the proposed approach achieves significant improvements over strong baselines.
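A minimal sketch of the pipeline follows: retrieve images whose topics overlap with the source sentence from a topic-image lookup table, then fuse their features into the text representation with a gated attention. The lookup table, feature vectors, and gating form are toy assumptions.

```python
import torch
import torch.nn.functional as F

# Topic-image retrieval + gated visual fusion sketch. The lookup table and
# "ResNet features" are toy stand-ins.

lookup = {"ship": [0, 1], "harbor": [1], "grain": [2]}   # topic -> image ids
img_feats = torch.randn(3, 16)                           # e.g. CNN features

def retrieve(sentence_words):
    ids = sorted({i for w in sentence_words for i in lookup.get(w, [])})
    return img_feats[ids]                                # (n_imgs, d)

text = torch.randn(5, 16)                                # source token states
imgs = retrieve(["the", "ship", "left", "the", "harbor"])

attn = F.softmax(text @ imgs.T / 16 ** 0.5, dim=-1)      # text attends to images
visual = attn @ imgs
gate = torch.sigmoid((text * visual).sum(-1, keepdim=True))
fused = text + gate * visual                             # gated weighting
print(fused.shape)   # torch.Size([5, 16])
```

Because retrieval only needs monolingual sentence-image pairs, the same table serves text-only NMT, which is the key applicability argument above.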

AAAI Conference 2018 Conference Paper

Syntax-Directed Attention for Neural Machine Translation

  • Kehai Chen
  • Rui Wang
  • Masao Utiyama
  • Eiichiro Sumita
  • Tiejun Zhao

The attention mechanism, including global attention and local attention, plays a key role in neural machine translation (NMT). Global attention attends to all source words for word prediction; in comparison, local attention selectively looks at a fixed window of source words. However, alignment weights for the current target word often decrease to the left and right by linear distance centering on the aligned source position, neglecting syntax-distance constraints. In this paper, we extend local attention with a syntax-distance constraint, which focuses on source words syntactically related to the predicted target word, so as to learn a more effective context vector for predicting translation. Moreover, we further propose a double-context NMT architecture, consisting of a global context vector and a syntax-directed context vector derived from the global attention, to provide richer source-side information for translation. Experiments on large-scale Chinese-to-English and English-to-German translation tasks show that the proposed approach achieves a substantial and significant improvement over the baseline system.
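The core idea can be sketched directly: bias the alignment weights with a Gaussian over syntax distance (hops in the dependency tree) from the aligned source position, rather than linear word distance. The toy tree, single query, and sigma value are illustrative; the double-context architecture additionally keeps a standard global-attention context vector.

```python
import torch
import torch.nn.functional as F

# Syntax-directed attention sketch: Gaussian bias in dependency-tree hops
# around the aligned source position, added to the attention logits.

# Dependency hops between 5 source words (symmetric toy matrix).
syn_dist = torch.tensor([[0, 1, 2, 2, 3],
                         [1, 0, 1, 1, 2],
                         [2, 1, 0, 2, 3],
                         [2, 1, 2, 0, 1],
                         [3, 2, 3, 1, 0]], dtype=torch.float)

src = torch.randn(5, 16)      # source annotations
query = torch.randn(16)       # decoder state for the current target word
aligned = 1                   # predicted aligned source position
sigma = 1.0

scores = src @ query / 16 ** 0.5
bias = -(syn_dist[aligned] ** 2) / (2 * sigma ** 2)   # Gaussian in syntax hops
weights = F.softmax(scores + bias, dim=-1)
context = weights @ src       # syntax-directed context vector
print(weights, context.shape)
```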

AAAI Conference 2017 Conference Paper

Translation Prediction with Source Dependency-Based Context Representation

  • Kehai Chen
  • Tiejun Zhao
  • Muyun Yang
  • Lemao Liu

Learning context representations is very promising for improving translation results, particularly through neural networks. Previous efforts process the context words sequentially and neglect their internal syntactic structure. In this paper, we propose a novel neural network based on a bi-convolutional architecture to represent the source dependency-based context for translation prediction. The proposed model is able not only to encode long-distance dependencies but also to capture functional similarities for better translation prediction (i.e., translation of ambiguous words and word forms). Examined on a large-scale Chinese-English translation task, the proposed approach achieves a significant improvement (of up to +1.9 BLEU points) over the baseline system, and meanwhile outperforms a number of context-enhanced comparison systems.
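As a rough illustration, the sketch below encodes a dependency-based context with two convolutional branches, one over the dependency path to the word being translated and one over its child/sibling context, pooled and concatenated into a prediction feature. This two-branch reading of "bi-convolutional" and the context layout are assumptions.

```python
import torch
import torch.nn as nn

# Bi-convolutional context encoder sketch: two Conv1d branches over
# dependency-based context embeddings, max-pooled and concatenated into a
# feature for translation prediction.

d, k = 16, 3
path_ctx = torch.randn(1, d, 7)    # embeddings along the dependency path
local_ctx = torch.randn(1, d, 5)   # embeddings of children/siblings

conv_path = nn.Conv1d(d, 32, kernel_size=k, padding=1)
conv_local = nn.Conv1d(d, 32, kernel_size=k, padding=1)

f1 = torch.relu(conv_path(path_ctx)).max(dim=-1).values    # (1, 32)
f2 = torch.relu(conv_local(local_ctx)).max(dim=-1).values  # (1, 32)
feature = torch.cat([f1, f2], dim=-1)  # feeds the translation-prediction layer
print(feature.shape)  # torch.Size([1, 64])
```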