Arrow Research search

Author name cluster

Zhen Tan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

Iterative Multi-Granular RAG with Contextual Hierarchical Graph

  • Yanli Hu
  • Teng Liu
  • Zhuangyi Zhou
  • Weixin Zeng
  • Zhen Tan
  • Xiang Zhao

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) with external knowledge retrieval, improving factual accuracy and knowledge coverage. However, existing RAG approaches face a fundamental trade-off when handling complex reasoning: while traditional iterative retrieval methods offer flexibility, their local perspective limits their ability to establish global knowledge connections. In contrast, structure-augmented RAG methods capture global relationships but incur significant construction costs. To fill this gap, we propose MGranRAG, an innovative framework designed to integrate precise local retrieval with structured global reasoning. Our approach circumvents expensive semantic extraction by employing a lightweight contextual hierarchical graph, effectively combining the local adaptability of iterative retrieval with the global consistency of structured knowledge. The framework adopts a novel iterative optimization scheme: at the local level, the LLM identifies multi-granular contextual evidence, such as key sentences and phrases, within retrieved passages to refine retrieval. At the global level, these multi-granular evidence nodes are then mapped and propagated within the structured hierarchical graph, enabling the diffusion of rich contextual information at different levels to introduce global semantic constraints and reorder retrieval results. This coordination between local and global iterative processes dynamically balances retrieval accuracy and contextual coherence. Experimental results on challenging multi-hop and open-domain question answering datasets show that our proposal achieves new state-of-the-art performance in both retrieval and answer accuracy.

AAAI Conference 2026 Conference Paper

Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm

  • Baixiang Huang
  • Zhen Tan
  • Haoran Wang
  • Zijie Liu
  • Dawei Li
  • Ali Payani
  • Huan Liu
  • Tianlong Chen

Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent’s global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.

AAAI Conference 2026 Conference Paper

Multi-granularity Temporal Knowledge Editing over Large Language Models

  • Simiao Zhao
  • Ning Pang
  • Zhen Tan
  • Yanli Hu
  • Weidong Xiao
  • Xiang Zhao

The evolving worldly dynamics necessitate continuous revision and updating of knowledge within Large Language Models (LLMs), driving the development of Knowledge Editing (KE) techniques. Recently, a novel paradigm of Temporal Knowledge Editing (TKE) has been proposed, emphasizing that models deployed in dynamic environments should integrate new information while retaining historical knowledge. However, we observe that current definitions and methods for TKE are insufficient, as they do not effectively capture or adapt to the fine-grained temporal dynamics inherent in real-world knowledge evolution. In this paper, we introduce the notion of multi-granularity TKE, encompassing temporal knowledge across yearly, monthly, and daily granularities, and propose a corresponding dataset, named MTKE. We argue that comprehending and retaining knowledge across different temporal granularities is crucial for LLMs to accurately reflect real-world changes. The key challenge lies in integrating new temporal knowledge at various granularities while also preserving relevant historical knowledge, thus ensuring LLMs maintain a consistent and accurate understanding over time. To achieve this, we propose a Sparse Parameter-Injected Knowledge Editing method, dubbed SPIKE, which anchors both temporal knowledge and subject positions within the model. Experiments demonstrate that our method effectively preserves historical knowledge performance while accurately incorporating dynamic temporal knowledge across multi-granularity temporal scenarios.

AAAI Conference 2026 Conference Paper

OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

  • Zezhen Ding
  • Zhen Tan
  • Jiheng Zhang
  • Tianlong Chen

Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7%, using only 1/10 of the synthetic data required by prior methods such as ORLM, exceeding ORLM’s solving accuracy by up to 4.2%. Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%–6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7%. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
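
The abstract above compares single-attempt (Pass@1) and multi-attempt (Pass@8) accuracy. As a rough illustration of what such a metric measures, the sketch below uses the widely used unbiased pass@k estimator; whether OR-R1 computes Pass@k this way is an assumption, and the function name `pass_at_k` and its arguments are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct.

    n: total attempts sampled for a problem (hypothetical evaluation setup)
    c: number of those attempts that solved the problem
    k: attempt budget being scored (e.g. 1 or 8)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct attempt
    # is C(n - c, k) / C(n, k); math.comb returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 8 sampled attempts of which 2 are correct:
    print(pass_at_k(8, 2, 1))
    print(pass_at_k(8, 2, 8))
```

Averaging this quantity over all benchmark problems gives a dataset-level score, so a shrinking difference between the k=1 and k=8 averages indicates more consistent single attempts.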

NeurIPS Conference 2025 Conference Paper

BetaConform: Efficient MAP Estimation of LLM Ensemble Judgment Performance with Prior Transfer

  • Huaizhi Qu
  • Inyoung Choi
  • Zhen Tan
  • Song Wang
  • Sukwon Yun
  • Qi Long
  • Faizan Siddiqui
  • Kwonjoon Lee

LLM ensembles are widely used for LLM judges. However, how to estimate their accuracy, especially in an efficient way, is unknown. In this paper, we present a principled maximum a posteriori (MAP) framework for an economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, revising from the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that utilizes learned distributions on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimum labeled samples. BetaConform is also validated empirically. For instance, with only 10 samples from the TruthfulQA dataset, for a Llama ensembled judge, BetaConform gauges its performance with an error margin as small as 3.37%.
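
To make the distributional assumption above concrete, here is a minimal sketch of a Beta-Binomial mixture. It assumes, for illustration only, that the modeled quantity is the number of judges (out of n) that give the correct verdict; the function names, the component format `(weight, a, b)`, and the majority-vote helper are hypothetical and not taken from the paper.

```python
from math import comb, exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(K = k) for a Beta-Binomial with n trials and shape parameters a, b."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def mixture_pmf(k: int, n: int, components) -> float:
    """PMF of a mixture; components is a list of (weight, a, b) tuples."""
    return sum(w * beta_binomial_pmf(k, n, a, b) for w, a, b in components)


def majority_correct_prob(n: int, components) -> float:
    """Probability that strictly more than half of n judges are correct."""
    return sum(mixture_pmf(k, n, components) for k in range(n // 2 + 1, n + 1))


if __name__ == "__main__":
    # Illustrative two-component mixture: an "easy" and a "hard" question mode.
    components = [(0.7, 8.0, 2.0), (0.3, 2.0, 8.0)]
    print(majority_correct_prob(5, components))
```

Unlike a plain Binomial, each component lets the per-question success rate vary, and mixing two components can represent a bimodal split between questions the ensemble mostly gets right and questions it mostly gets wrong.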

AAAI Conference 2025 Conference Paper

BrainMAP: Learning Multiple Activation Pathways in Brain Networks

  • Song Wang
  • Zhenyu Lei
  • Zhen Tan
  • Jiaqi Ding
  • Xinyu Zhao
  • Yushun Dong
  • Guorong Wu
  • Tianlong Chen

Functional Magnetic Resonance Imaging (fMRI) is commonly employed to study human brain activity, since it offers insight into the relationship between functional fluctuations and human behavior. To enhance analysis and comprehension of brain activity, Graph Neural Networks (GNNs) have been widely applied to the analysis of functional connectivities (FC) derived from fMRI data, due to their ability to capture the synergistic interactions among brain regions. However, in the human brain, performing complex tasks typically involves the activation of certain pathways, which could be represented as paths across graphs. As such, conventional GNNs struggle to learn from these pathways due to the long-range dependencies of multiple pathways. To address these challenges, we introduce a novel framework, BrainMAP, to learn multiple pathways in brain networks. BrainMAP leverages sequential models to identify long-range correlations among sequentialized brain regions and incorporates an aggregation module based on Mixture of Experts (MoE) to learn from multiple pathways. Our comprehensive experiments highlight BrainMAP's superior performance. Furthermore, our framework enables explanatory analyses of crucial brain regions involved in tasks.

AAAI Conference 2025 Conference Paper

Cross-modal Multi-task Learning for Multimedia Event Extraction

  • Jianwei Cao
  • Yanli Hu
  • Zhen Tan
  • Xiang Zhao

Multimedia event extraction aims to jointly extract event structural knowledge from multiple modalities, thus improving the comprehension and utilization of events in the growing multimedia content (e.g., multimedia news). A key challenge in multimedia event extraction is to establish cross-modal correlations during training without multimedia event annotations. Considering the complexity and cost of annotation across modalities, the multimedia event extraction task only provides parallel annotated data for evaluation. Previous works attempt to learn implicit correlations directly from unlabeled image-text pairs, but do not yield substantially better performance for event-centric tasks. To address this problem, we propose a cross-modal multi-task learning framework, X-MTL, to establish cross-modal correlations at the task level, which can simultaneously address four key tasks of multimedia event extraction: trigger detection, argument extraction, verb classification, and role classification. Specifically, to process inputs from different modalities and tasks, we utilize two separate modality-specific encoders and a modality-shared encoder to learn joint task representations, and introduce textual and visual prompt learning methods to enrich and unify task inputs. To resolve task conflict in cross-modal multi-task learning, we propose a pseudo-label-based knowledge distillation method, combined with a dynamic weight adjustment method, which can effectively lift the performance to surpass the separately trained models. On the Multimedia Event Extraction benchmark M2E2, experimental results show that X-MTL surpasses the current state-of-the-art (SOTA) methods by 4.1% for multimedia event mention and 8.2% for multimedia argument role.

ICML Conference 2025 Conference Paper

Editable Concept Bottleneck Models

  • Lijie Hu
  • Chenyang Ren
  • Zhengyu Hu
  • Hongbin Lin
  • Cheng-Long Wang
  • Zhen Tan
  • Weimin Lyu
  • Jingfeng Zhang

Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we need to remove or insert training data or concepts in trained CBMs for various reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.

TMLR Journal 2025 Journal Article

Generative Risk Minimization for Out-of-Distribution Generalization on Graphs

  • Song Wang
  • Zhen Tan
  • Yaochen Zhu
  • Chuxu Zhang
  • Jundong Li

Out-of-distribution (OOD) generalization on graphs aims at dealing with scenarios where the test graph distribution differs from the training graph distributions. Compared to i.i.d. data like images, the OOD generalization problem on graph-structured data remains challenging due to the non-i.i.d. property and complex structural information on graphs. Recently, several works on graph OOD generalization have explored extracting invariant subgraphs that share crucial classification information across different distributions. Nevertheless, such a strategy could be suboptimal for entirely capturing the invariant information, as the extraction of discrete structures could potentially lead to the loss of invariant information or the involvement of spurious information. In this paper, we propose an innovative framework, named Generative Risk Minimization (GRM), designed to generate an invariant subgraph for each input graph to be classified, instead of extraction. To address the challenge of optimization in the absence of optimal invariant subgraphs (i.e., ground truths), we derive a tractable form of the proposed GRM objective by introducing a latent causal variable, and its effectiveness is validated by our theoretical analysis. We further conduct extensive experiments across a variety of real-world graph datasets for both node-level and graph-level OOD generalization, and the results demonstrate the superiority of our framework GRM.

NeurIPS Conference 2025 Conference Paper

IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

  • Yifan Li
  • Yuhang Chen
  • Anh Dao
  • Lichi Li
  • Zhongyi Cai
  • Zhen Tan
  • Tianlong Chen
  • Yu Kong

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical industrial warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse scenarios and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

IJCAI Conference 2025 Conference Paper

Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)

  • Zhen Tan
  • Lu Cheng
  • Song Wang
  • Yuan Bo
  • Jundong Li
  • Huan Liu

Pretrained language models (PLMs) achieve state-of-the-art results but often function as "black boxes", hindering interpretability and responsible deployment. While methods like attention analysis exist, they often lack clarity and intuitiveness. We propose interpreting PLMs through high-level, human-understandable concepts using Concept Bottleneck Models (CBMs). This extended abstract introduces C3M (ChatGPT-guided Concept augmentation with Concept-level Mixup), a novel framework for training Concept-Bottleneck-Enabled PLMs (CBE-PLMs). C3M leverages Large Language Models (LLMs) like ChatGPT to augment concept sets and generate noisy concept labels, combined with a concept-level MixUp mechanism to enhance robustness and effectively learn from both human-annotated and machine-generated concepts. Empirical results show our approach provides intuitive explanations, aids model diagnosis via test-time intervention, and improves the interpretability-utility trade-off, even with limited or noisy concept annotations. This is a concise version of [Tan et al., 2024b], recipient of the Best Paper Award at PAKDD 2024. Code and data are released at https://github.com/Zhen-Tan-dmml/CBM_NLP.git.

AAAI Conference 2025 Conference Paper

Logic Induced High-Order Reasoning Network for Event-Event Relation Extraction

  • Peixin Huang
  • Xiang Zhao
  • Minghao Hu
  • Zhen Tan
  • Weidong Xiao

To understand a document with multiple events, event-event relation extraction (ERE) emerges as a crucial task, aiming to discern how natural events temporally or structurally associate with each other. To achieve this goal, our work addresses the problems of temporal event relation extraction (TRE) and subevent relation extraction (SRE). The latest methods for such problems have commonly built document-level event graphs for global reasoning across sentences. However, the edges between events are usually derived from external tools heuristically, which are not always reliable and may introduce noise. Moreover, they are not capable of preserving logical constraints among event relations, e.g., coreference constraint, symmetry constraint and conjunction constraint. These constraints guarantee coherence between different relation types, enabling the generation of a unified event evolution graph. In this work, we propose a novel method named LogicERE, which performs high-order event relation reasoning through modeling logic constraints. Specifically, different from conventional event graphs, we design a logic constraint induced graph (LCG) without any external tools. LCG involves event nodes, where the interactions among them can model the coreference constraint, and event-pair nodes, where the interactions among them can retain the symmetry constraint and conjunction constraint. Then we perform high-order reasoning on LCG with a relational graph transformer to obtain enhanced event and event-pair embeddings. Finally, we further incorporate logic constraint information via a joint logic learning module. Extensive experiments demonstrate the effectiveness of the proposed method with state-of-the-art performance on benchmark datasets.

NeurIPS Conference 2025 Conference Paper

Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

  • Tianyu Hu
  • Zhen Tan
  • Song Wang
  • Huaizhi Qu
  • Tianlong Chen

With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models the judges' collective correct-rate dynamics using a time-varying mixture of Beta-Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
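
The adaptive stopping idea above can be illustrated with a two-sample Kolmogorov-Smirnov statistic computed between consecutive debate rounds. This is a simplified sketch, not the paper's procedure: it compares raw per-round samples rather than fitted Beta-Binomial mixtures, and the names `ks_statistic` and `should_stop` as well as the `threshold` and `patience` parameters are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(xs, ys) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


def should_stop(history, threshold: float = 0.05, patience: int = 2) -> bool:
    """Stop once the last `patience` round-to-round comparisons are all stable.

    history: list of per-round samples, e.g. the number of judges voting
    "correct" on each evaluated item in that round (assumed representation).
    """
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return all(ks_statistic(a, b) < threshold for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    rounds = [[1, 2, 3, 3], [2, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]]
    for t in range(1, len(rounds) + 1):
        print(t, should_stop(rounds[:t]))
```

Requiring several consecutive stable comparisons guards against halting on a single round where the judges happen not to change their votes.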

AAAI Conference 2025 Conference Paper

Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach

  • Zhen Tan
  • Jie Peng
  • Song Wang
  • Lijie Hu
  • Tianlong Chen
  • Huan Liu

Large Language Models (LLMs) have brought significant advances across various NLP tasks through few-shot or zero-shot prompting, bypassing the need for parameter tuning. However, the "black-box" nature behind their massive parameter sizes increases the "hallucination" concerns, especially in high-stakes applications (e.g., healthcare), where decision mistakes can lead to severe consequences. In contrast, human decision-making relies on complex cognitive processes, such as the ability to sense and adaptively correct mistakes through conceptual understanding. Drawing inspiration from human cognition, we propose an innovative metacognitive approach, CLEAR, to equip LLMs with capabilities for self-aware error identification and correction. Our framework constructs concept-specific sparse subnetworks that indicate decision processes. This provides a novel interface for model intervention after deployment. The benefits include: (i) at inference time, our metacognitive LLMs can self-consciously identify potential mispredictions with minimum human involvement, (ii) the model can self-correct its errors efficiently without additional tuning, and (iii) the correction procedure is not only self-explanatory but also user-friendly, enhancing model interpretability and accessibility. With these metacognitive features, our approach pioneers a new path toward the trustworthiness of LLMs.

AAAI Conference 2024 Conference Paper

Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention

  • Zhen Tan
  • Tianlong Chen
  • Zhenyu Zhang
  • Huan Liu

Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements.

AAAI Conference 2019 Conference Paper

Jointly Extracting Multiple Triplets with Multilayer Translation Constraints

  • Zhen Tan
  • Xiang Zhao
  • Wei Wang
  • Weidong Xiao

Triplet extraction is an essential and pivotal step in automatic knowledge base construction, which captures structural information from unstructured text corpora. Conventional extraction models use a pipeline of named entity recognition and relation classification to extract entities and relations, respectively, which ignores the connection between the two tasks. Recently, several neural network-based models were proposed to tackle the problem, and achieved state-of-the-art performance. However, most of them are unable to extract multiple triplets from a single sentence, a case commonly seen in real-life scenarios. To close the gap, we propose in this paper a joint neural extraction model for multiple triplets, namely TME, which is capable of adaptively discovering multiple triplets simultaneously in a sentence via ranking with a translation mechanism. In experiments, TME exhibits superior performance and achieves an improvement of 37.6% in F1 score over state-of-the-art competitors.
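
"Ranking with a translation mechanism" can be pictured with a translation-style score in which a relation vector should carry the head entity's embedding close to the tail entity's embedding. The sketch below uses the generic TransE-style form head + relation ≈ tail; it is an assumption that TME's constraint resembles this, and the function names, the toy two-dimensional vectors, and the relation labels are hypothetical.

```python
from math import sqrt


def translation_score(head, relation, tail) -> float:
    """Negative L2 distance of head + relation from tail; higher is better."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("all vectors must share the same dimension")
    return -sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_relations(head, tail, relations):
    """Return relation names ordered from most to least plausible.

    relations: dict mapping a relation name to its embedding vector.
    """
    return sorted(
        relations,
        key=lambda name: translation_score(head, relations[name], tail),
        reverse=True,
    )


if __name__ == "__main__":
    head, tail = [0.0, 0.0], [1.0, 0.0]
    relations = {"born_in": [1.0, 0.0], "works_at": [0.0, 1.0]}
    print(rank_relations(head, tail, relations))
```

Because every candidate (head, relation, tail) combination receives its own score, several triplets from one sentence can pass a ranking cutoff at once, which is the multi-triplet setting the abstract targets.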