Arrow Research search

Author name cluster

Lei Hou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers


EAAI Journal 2026 Journal Article

Ensemble Kalman filter-driven adaptive modeling for real-time fracturing pressure forecasting

  • Zhengxin Zhang
  • Lei Hou
  • Qian Sun
  • Mao Sheng
  • Fengshou Zhang
  • Tingxue Jiang
  • Xiaobing Bian
  • Jiangfeng Luo

Accurately forecasting fracturing pressure significantly enhances the safety and efficiency of hydraulic fracturing design. Traditional machine learning methods rely on offline datasets for model training, making it challenging to capture the dynamic variation of features under real-time conditions. This study presents a data assimilation-driven workflow for predicting fracturing pressure, enabling real-time capture of feature variations and adaptive model updates. By integrating the Gated Recurrent Unit (GRU) with the Ensemble Kalman Filter (EnKF), this study develops and evaluates three updating strategies: updating the GRU model parameters, updating the GRU hidden states, and updating both simultaneously. Sensitivity analyses were conducted on two key parameters: process noise and ensemble size. The results demonstrate that updating the GRU model parameters, with a process noise of 0.01 and an ensemble size of 50, delivers optimal performance. This configuration was adopted for the subsequent case studies, where improved pressure prediction accuracy in the first case and an optimized fracturing design in the second confirmed the workflow's overall effectiveness. The EnKF-updated predictions reduced the Symmetric Mean Absolute Percentage Error (SMAPE) from 3.56%-6.48% to 0.90%-3.02% and the Root Mean Squared Error (RMSE) from 3.51-6.19 to 1.05-2.80, outperforming direct GRU predictions. Furthermore, the optimized fracturing design increased the cumulative proppant volume from 124.96 cubic meters (m³) to 147.89 m³, an 18.35% improvement. By capturing real-time feature variations, this new workflow provides a robust solution for improving pressure prediction accuracy, thereby enabling optimization of fracturing designs.
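The parameter-updating strategy summarized above can be sketched as a standard EnKF analysis step over an ensemble of model parameter vectors. The function below is an illustrative sketch, not the paper's implementation: the shapes, the scalar observation, and the default noise values are assumptions, and a real workflow would re-run the GRU after the forecast perturbation rather than reuse stale predictions.

```python
import numpy as np

def enkf_parameter_update(ensemble, predictions, observation,
                          obs_noise=0.1, process_noise=0.01):
    """One EnKF analysis step over an ensemble of parameter vectors.

    ensemble:     (N, d) array, one parameter sample per ensemble member
    predictions:  (N,) predicted pressures from each member
    observation:  the measured pressure (scalar)
    """
    N = ensemble.shape[0]
    # Forecast: perturb parameters with process noise
    # (simplification: predictions are not re-computed after this step)
    ensemble = ensemble + np.random.normal(0.0, np.sqrt(process_noise),
                                           ensemble.shape)
    # Ensemble anomalies around the means
    X = ensemble - ensemble.mean(axis=0)        # (N, d)
    Y = predictions - predictions.mean()        # (N,)
    # Cross-covariance and innovation covariance (scalar observation)
    P_xy = X.T @ Y / (N - 1)                    # (d,)
    P_yy = Y @ Y / (N - 1) + obs_noise          # scalar
    K = P_xy / P_yy                             # Kalman gain, (d,)
    # Analysis: nudge each member toward a perturbed observation
    obs_perturbed = observation + np.random.normal(0.0, np.sqrt(obs_noise), N)
    return ensemble + np.outer(obs_perturbed - predictions, K)
```

With an ensemble size of 50 and process noise 0.01 (the configuration the abstract reports as optimal), each call assimilates one new pressure measurement into the parameter ensemble.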

NeurIPS Conference 2025 Conference Paper

AGENTIF: Benchmarking Large Language Models Instruction Following Ability in Agentic Scenarios

  • Yunjia Qi
  • Hao Peng
  • Xiaozhi Wang
  • Amy Xin
  • Youfeng Liu
  • Bin Xu
  • Lei Hou
  • Juanzi Li

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
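Code-based constraint evaluation of the kind the abstract mentions can be illustrated with a couple of toy checks. Both constraints below (a word limit and a required tool mention) and all names are hypothetical, not taken from the benchmark:

```python
def evaluate_response(response, max_words=100, required_phrase="search_tool"):
    """Run two illustrative, programmatically checkable constraints
    against an agent's response and report pass/fail per constraint."""
    return {
        "length_ok": len(response.split()) <= max_words,
        "mentions_tool": required_phrase in response,
    }
```

Real benchmark constraints are far more varied (tool-call formats, conditional requirements), and many need LLM-based rather than code-based judging, which is why the dataset annotates an evaluation method per constraint.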

AAAI Conference 2025 Conference Paper

EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents

  • Mengna Zhu
  • Kaisheng Zeng
  • Mao Wang
  • Kaiming Xiao
  • Lei Hou
  • Hongbin Huang
  • Juanzi Li

In real life, many dynamic events, such as major disasters and large-scale sports events, evolve continuously over time. Obtaining an overview of these events can help people quickly understand the situation and respond more effectively. This is challenging because the key information of the event is often scattered across multiple documents, involving complex event knowledge understanding and reasoning, which is under-explored in previous work. Therefore, we propose the Event-Centric Multi-Document Summarization task, which aims to generate concise and comprehensive summaries of a given event based on multiple related news documents. Based on this, we construct EventSum, a dataset built from Baidu Baike entries with extensive human annotation, to facilitate relevant research. It is the first large-scale Chinese multi-document summarization dataset, containing 5,100 events and a total of 57,984 news documents, with an average of 11.4 input news documents and 13,471 characters per event. To ensure data quality and mitigate potential data leakage, we adopted a multi-stage annotation approach for manually labeling the test set. Given the complexity of event-related information, existing metrics struggle to comprehensively assess the quality of generated summaries. We designed specific metrics including Event Recall, Argument Recall, Causal Recall, and Temporal Recall along with corresponding calculation methods for evaluation. We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information.
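The recall-style metrics named above share one shape: what fraction of annotated reference elements (events, arguments, causal or temporal links) surface in the generated summary. A deliberately crude string-containment sketch of that shape, not the paper's actual calculation method, looks like:

```python
def element_recall(reference_elements, summary_text):
    """Fraction of annotated reference elements that appear verbatim
    in the generated summary (string containment as a stand-in for
    the real matching procedure)."""
    if not reference_elements:
        return 1.0  # nothing to recall
    hits = sum(1 for e in reference_elements if e in summary_text)
    return hits / len(reference_elements)
```

A real implementation would match elements semantically rather than by exact substring, but the recall denominator (the annotated gold elements) is the part that makes these metrics sensitive to completeness, which ROUGE-style overlap is not.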

NeurIPS Conference 2025 Conference Paper

How do Transformers Learn Implicit Reasoning?

  • Jiaran Ye
  • Zijun Yao
  • Zhidian Huang
  • Liangming Pan
  • Jinxin Liu
  • Yushi Bai
  • Amy Xin
  • Liu Weichuan

Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly---producing correct answers without explicitly verbalizing intermediate steps---but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with cosine-based clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
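The core operation behind a cosine-based representational lens is simple: project normalized hidden states onto a normalized anchor direction so that states clustering around the anchor score near 1. The sketch below shows only that geometric primitive; how anchors are chosen and which layers are probed are the paper's contribution and are not reproduced here.

```python
import numpy as np

def cosine_lens(hidden_states, anchor):
    """Cosine similarity of each hidden state against an anchor direction.

    hidden_states: (n, d) array of hidden vectors
    anchor:        (d,) reference direction
    Returns an (n,) array of similarities in [-1, 1].
    """
    h = hidden_states / np.linalg.norm(hidden_states, axis=-1, keepdims=True)
    a = anchor / np.linalg.norm(anchor)
    return h @ a
```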

NeurIPS Conference 2025 Conference Paper

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

  • Jianhui Chen
  • Xiaozhi Wang
  • Zijun Yao
  • Yushi Bai
  • Lei Hou
  • Juanzi Li

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about 5% of neurons as safety neurons, and by only patching their activations we can restore over 90% of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the ''alignment tax'' phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet the two behaviors require different activation patterns over those shared neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation.
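At its simplest, activation contrasting means ranking neurons by how differently they fire on two contrasting input sets and keeping the top fraction. The function below is a bare caricature of that idea under assumed inputs (precomputed per-neuron activations for safe and unsafe prompts); the paper's inference-time procedure and its dynamic activation patching are considerably more involved.

```python
import numpy as np

def locate_safety_neurons(acts_safe, acts_unsafe, top_frac=0.05):
    """Rank neurons by the gap between their mean activations on safe
    vs. unsafe inputs; return indices of the top fraction.

    acts_safe, acts_unsafe: (num_prompts, num_neurons) activation arrays
    """
    diff = np.abs(acts_safe.mean(axis=0) - acts_unsafe.mean(axis=0))
    k = max(1, int(top_frac * diff.size))
    return np.argsort(diff)[::-1][:k]  # largest contrasts first
```

The 5% default mirrors the fraction of neurons the abstract reports as sufficient to restore most safety behavior when patched.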

NeurIPS Conference 2023 Conference Paper

Benchmarking Foundation Models with Language-Model-as-an-Examiner

  • Yushi Bai
  • Jiahao Ying
  • Yixin Cao
  • Xin Lv
  • Yuze He
  • Xiaozhi Wang
  • Jifan Yu
  • Kaisheng Zeng

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets, however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for broad knowledge acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: http://lmexam.xlore.cn.

AAAI Conference 2023 Conference Paper

Unveiling the Black Box of PLMs with Semantic Anchors: Towards Interpretable Neural Semantic Parsing

  • Lunyiu Nie
  • Jiuding Sun
  • Yanlin Wang
  • Lun Du
  • Shi Han
  • Dongmei Zhang
  • Lei Hou
  • Juanzi Li

The recent prevalence of pretrained language models (PLMs) has dramatically shifted the paradigm of semantic parsing, where the mapping from natural language utterances to structured logical forms is now formulated as a Seq2Seq task. Despite the promising performance, previous PLM-based approaches often suffer from hallucination problems due to their negligence of the structural information contained in the sentence, which essentially constitutes the key semantics of the logical forms. Furthermore, most works treat the PLM as a black box in which the generation process of the target logical form is hidden beneath the decoder modules, which greatly hinders the model's intrinsic interpretability. To address these two issues, we propose to augment current PLMs with a hierarchical decoder network. Taking the first-principle structures as semantic anchors, we propose two novel intermediate supervision tasks, namely Semantic Anchor Extraction and Semantic Anchor Alignment, for training the hierarchical decoders and probing the model's intermediate representations in a self-adaptive manner alongside the fine-tuning process. We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines. More importantly, by analyzing the intermediate representations of the hierarchical decoders, our approach also takes a substantial step toward the interpretability of PLMs in the domain of semantic parsing.

AAAI Conference 2020 Conference Paper

Image Enhanced Event Detection in News Articles

  • Meihan Tong
  • Shuai Wang
  • Yixin Cao
  • Bin Xu
  • Juanzi Li
  • Lei Hou
  • Tat-Seng Chua

Event detection is a crucial and challenging sub-task of event extraction, which suffers from a severe ambiguity issue of trigger words. Existing works mainly focus on using textual context information, while there naturally exist many images accompanied by news articles that are yet to be explored. We believe that images not only reflect the core events of the text, but are also helpful for the disambiguation of trigger words. In this paper, we first contribute an image dataset supplement to ED benchmarks (i.e., ACE2005) for training and evaluation. We then propose a novel Dual Recurrent Multimodal Model, DRMM, to conduct deep interactions between images and sentences for modality feature aggregation. DRMM utilizes pre-trained BERT and ResNet to encode sentences and images, and employs an alternating dual attention to select informative features for mutual enhancement. Our superior performance compared to six state-of-the-art baselines, as well as further ablation studies, demonstrates the significance of the image modality and the effectiveness of the proposed architecture. The code and image dataset are available at https://github.com/shuaiwa16/image-enhanced-event-extraction.

AAAI Conference 2019 Conference Paper

DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization

  • Jiaxin Shi
  • Chen Liang
  • Lei Hou
  • Juanzi Li
  • Zhiyuan Liu
  • Hanwang Zhang

We propose DeepChannel, a robust, data-efficient, and interpretable neural model for extractive document summarization. Given any document-summary pair, we estimate a salience score, which is modeled using an attention-based deep neural network, to represent the salience degree of the summary for yielding the document. We devise a contrastive training strategy to learn the salience estimation network, and then use the learned salience score as a guide and iteratively extract the most salient sentences from the document as our generated summary. In experiments, our model not only achieves state-of-the-art ROUGE scores on the CNN/Daily Mail dataset, but also shows strong robustness in the out-of-domain test on the DUC2007 test set. Moreover, our model reaches a ROUGE-1 F-1 score of 39.41 on the CNN/Daily Mail test set with merely 1/100 of the training set, demonstrating tremendous data efficiency.
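One common way to realize a contrastive training strategy of this kind is a margin (hinge) objective that pushes the salience score of a true document-summary pair above that of a corrupted pair. The snippet illustrates that general idea only; the paper's actual loss, network, and corruption scheme are not specified here.

```python
def contrastive_hinge_loss(score_pos, score_neg, margin=1.0):
    """Zero loss once the true pair's salience score exceeds the
    corrupted pair's score by at least `margin`; linear penalty otherwise."""
    return max(0.0, margin - score_pos + score_neg)
```

At inference the same trained scorer can be reused greedily: repeatedly pick the sentence whose addition maximizes the salience score of the partial summary, as the abstract's iterative extraction describes.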

AAAI Conference 2019 Conference Paper

Learning to Embed Sentences Using Attentive Recursive Trees

  • Jiaxin Shi
  • Lei Hou
  • Juanzi Li
  • Zhiyuan Liu
  • Hanwang Zhang

Sentence embedding is an effective feature representation for most deep learning-based NLP tasks. One prevailing line of methods is using recursive latent tree-structured networks to embed sentences with task-specific structures. However, existing models have no explicit mechanism to emphasize task-informative words in the tree structure. To this end, we propose an Attentive Recursive Tree model (AR-Tree), where the words are dynamically located according to their importance in the task. Specifically, we construct the latent tree for a sentence with a proposed important-first strategy, and place more attentive words nearer to the root; thus, AR-Tree can inherently emphasize important words during the bottom-up composition of the sentence embedding. We propose an end-to-end reinforced training strategy for AR-Tree, which is demonstrated to consistently outperform, or be at least comparable to, the state-of-the-art sentence embedding methods on three sentence understanding tasks.
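The important-first construction described above admits a very direct sketch: the highest-scoring word in a span becomes the root, and the words on either side recursively form its subtrees. The importance scores here are assumed to be given; in AR-Tree they are learned, and the structure feeds a Tree-LSTM-style composition not shown here.

```python
def build_ar_tree(words, importance):
    """Build an important-first binary tree: the word with the highest
    importance in the current span is the root; the left and right
    remainders become its subtrees."""
    if not words:
        return None
    i = max(range(len(words)), key=lambda j: importance[j])
    return {
        "word": words[i],
        "left": build_ar_tree(words[:i], importance[:i]),
        "right": build_ar_tree(words[i + 1:], importance[i + 1:]),
    }
```

Because higher-importance words always sit closer to the root, a bottom-up composition over this tree touches them last, which is the emphasis mechanism the abstract describes.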

IJCAI Conference 2013 Conference Paper

What Users Care About: A Framework for Social Content Alignment

  • Lei Hou
  • Juanzi Li
  • Xiaoli Li
  • Jiangfeng Qu
  • Xiaofei Guo
  • Ou Hui
  • Jie Tang

With the rapid proliferation of social media, more and more people freely express their opinions (or comments) on news, products, and movies through online services such as forums, discussion groups, and microblogs. Those comments may be concerned with different aspects (topics) of the target Web document (e.g., a news page). It would be interesting to align the social comments to the corresponding subtopics contained in the Web document. In this paper, we propose a novel framework that is able to automatically detect the subtopics from a given Web document, and also align the associated social comments with the detected subtopics. This provides a new view of the standard Web document and its associated user-generated content through topics, which helps readers quickly focus on hot topics or grasp the topics they are interested in. Extensive experiments show that our proposed framework significantly outperforms the existing state-of-the-art methods in social content alignment.