Arrow Research search

Author name cluster

Xi Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

147 papers
2 author rows

Possible papers

147

AAAI Conference 2026 Conference Paper

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

  • Pengze Li
  • Jiaqi Liu
  • Junchi Yu
  • Lihao Liu
  • Mingyu Ding
  • Wanli Ouyang
  • Shixiang Tang
  • Xi Chen

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce’s fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
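The Reasoning Logic Tree (RLT) described above can be pictured as a small recursive structure in which every node carries one of Peirce's three inference modes. The sketch below is illustrative only: the field names and the toy claims are assumptions based on the abstract, not ARCHE Bench's actual schema.

```python
from dataclasses import dataclass, field

MODES = {"deduction", "induction", "abduction"}  # Peirce's three inference modes

@dataclass
class RLTNode:
    """One reasoning step: a claim derived from child steps via one inference mode."""
    claim: str
    mode: str = "deduction"
    children: list = field(default_factory=list)

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown inference mode: {self.mode}")

    def count_steps(self):
        return 1 + sum(c.count_steps() for c in self.children)

# A toy tree: an abductive conclusion supported by a deduction and an induction.
tree = RLTNode(
    "protein X regulates pathway Y",
    mode="abduction",
    children=[
        RLTNode("knockout of X removes the signal", mode="deduction"),
        RLTNode("signal correlates with X level across samples", mode="induction"),
    ],
)
```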

AAAI Conference 2026 Conference Paper

How Does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

  • Xi Chen
  • Aske Plaat
  • Niki van Stein

Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear contrast between these two scales. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
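The feature-swapping procedure can be sketched with a toy sparse autoencoder. The dimensions and random weights below are placeholders for a trained SAE, and the top-K selection mirrors the patching the abstract describes only schematically.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # toy sizes; real SAEs are far larger
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(h):
    # ReLU yields sparse, non-negative feature activations
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(f):
    return f @ W_dec

def patch_features(h_nocot, h_cot, k):
    """Swap the k most active CoT features into the noCoT activation."""
    f_nocot, f_cot = sae_encode(h_nocot), sae_encode(h_cot)
    top_k = np.argsort(f_cot)[-k:]          # indices of the strongest CoT features
    f_patched = f_nocot.copy()
    f_patched[top_k] = f_cot[top_k]
    return sae_decode(f_patched)

h_cot, h_nocot = rng.normal(size=d_model), rng.normal(size=d_model)
h_patched = patch_features(h_nocot, h_cot, k=8)
```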

AAAI Conference 2026 Conference Paper

SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

  • Zeyu Yang
  • Lai Wei
  • Roman Koshkin
  • Xi Chen
  • Satoshi Nakamura

This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus (En to De/Zh/Ja) demonstrate significant translation quality improvements across languages, validating the effectiveness of syntactic structures in LLM-driven SimulST systems.
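A minimal sketch of chunking a token stream at complete-unit boundaries. Punctuation alone stands in here for the full dependency-based boundary test described above; a real implementation would consult a dependency parse.

```python
def chunk(tokens, is_boundary):
    """Split a token stream into chunks, closing a chunk at each boundary token."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if is_boundary(tok):
            chunks.append(current)
            current = []
    if current:                      # flush any trailing partial chunk
        chunks.append(current)
    return chunks

# Punctuation-only stand-in for the dependency-aware boundary decision.
is_boundary = lambda t: t in {",", ".", ";"}
segments = chunk("the cat sat on the mat , then left .".split(), is_boundary)
```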

AAAI Conference 2026 Conference Paper

Seeing Is Believing: Grounding Long-Video Understanding in Spatio-Temporal Visual Evidence

  • Zhaoyang Wei
  • Guoliang Wang
  • Guohua Gao
  • Yanchao Hao
  • Mingda Li
  • Wenchao Ding
  • Xi Chen
  • Shizhu He

Although Vision Language Models (VLMs) have excelled at image and video understanding, applying them to hour-long videos is held back by two interrelated challenges: exorbitant computational expense and a qualitative breakdown in long-term temporal reasoning. Models thus tend to generate answers based on speculation instead of solid visual facts, producing hallucinations that are plausible yet factually incorrect. This problem is compounded by current benchmarks that, by emphasizing only final answers, lack an effective mechanism to check whether reasoning is substantiated by specific visual evidence. This makes it hard to differentiate true understanding from pretend comprehension, inhibiting targeted model refinement. To address these interrelated challenges of model fragility and evaluation weakness, we adopt a twofold strategy. First, we present EV²-Bench, a large-scale benchmark that breaks new ground with an evaluation paradigm built upon spatio-temporal visual evidence, forcing models to justify answers with checkable hints. Second, we put forward DynamicSelect, an adaptive token compression system that efficiently condenses salient information via a dynamic semantic selector and a hierarchical compression strategy. Comprehensive experiments demonstrate that DynamicSelect significantly outperforms the baselines on EV²-Bench as well as other public benchmarks. Our study offers not only a more effective approach to long-video understanding but also a more stringent evaluation paradigm, pointing the way toward more robust models.

TMLR Journal 2026 Journal Article

Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO

  • Peter Chen
  • Xiaopeng Li
  • Ziniu Li
  • Xi Chen
  • Tianyi Lin

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO) (Shao et al., 2024), has shown strong empirical results in training recent reasoning models (Guo et al., 2025), but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a step-wise judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO's learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
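The all-negative-group failure mode follows directly from GRPO's group-relative advantage: when every response in a group receives the same (zero) reward, all advantages vanish and the group contributes no gradient. The sketch below uses hypothetical step-wise judge scores to show how diversification restores a learning signal; it is not the paper's exact formulation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: reward minus group mean, over group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# An all-negative group: every response scores 0, so every advantage is 0
# and the policy receives no update from this group.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])

# Adding hypothetical step-wise judge scores in [0, 1] restores diversity,
# so the group again yields a non-trivial update direction.
judge_scores = (0.2, 0.7, 0.1, 0.5)
shaped = group_advantages([0.0 + s for s in judge_scores])
```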

AAAI Conference 2026 Conference Paper

Topological Federated Clustering via Gravitational Potential Fields Under Local Differential Privacy

  • Yunbo Long
  • Jiaquan Zhang
  • Xi Chen
  • Alexandra Brintrup

Clustering non-independent and identically distributed (non-IID) data under local differential privacy (LDP) in federated settings presents a critical challenge: preserving privacy while maintaining accuracy without iterative communication. Existing one-shot methods rely on unstable pairwise centroid distances or neighborhood rankings, degrading severely under strong LDP noise and data heterogeneity. We present Gravitational Federated Clustering (GFC), a novel approach to privacy-preserving federated clustering that overcomes the limitations of distance-based methods under varying LDP. Addressing the critical challenge of clustering non-IID data with diverse privacy guarantees, GFC transforms privatized client centroids into a global gravitational potential field where true cluster centers emerge as topologically persistent singularities. Our framework introduces two key innovations: (1) a client-side compactness-aware perturbation mechanism that encodes local cluster geometry as "mass" values, and (2) a server-side topological aggregation phase that extracts stable centroids through persistent homology analysis of the potential field's superlevel sets. Theoretically, we establish a closed-form bound between the privacy budget ε and centroid estimation error, proving the potential field's Lipschitz smoothing properties exponentially suppress noise in high-density regions. Empirically, GFC outperforms state-of-the-art methods on ten benchmarks, especially under strong LDP constraints (ε < 1), while maintaining comparable performance at lower privacy budgets. By reformulating federated clustering as a topological persistence problem in a synthetic physics-inspired space, GFC achieves unprecedented privacy-accuracy trade-offs without iterative communication, providing a new perspective for privacy-preserving distributed learning.
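A one-dimensional sketch of the potential-field idea: privatized client centroids superpose a softened 1/distance potential, and true cluster centers show up as the field's strongest peaks. Uniform masses and the simple kernel below are assumptions for illustration; the actual method uses compactness-aware masses and persistent homology over superlevel sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def potential_field(grid, centroids, masses, soft=0.2):
    """Superpose a gravitational-style potential from noisy client centroids.
    `soft` regularizes the singularity at zero distance."""
    field = np.zeros(len(grid))
    for c, m in zip(centroids, masses):
        field += m / (np.abs(grid - c) + soft)
    return field

# Privatized 1-D centroids from many clients, noisy around true centers 0 and 5.
true_centers = [0.0, 5.0]
centroids = np.concatenate([t + rng.normal(scale=0.3, size=20) for t in true_centers])
masses = np.ones_like(centroids)   # compactness-aware masses in the real method

grid = np.linspace(-2.0, 7.0, 901)
field = potential_field(grid, centroids, masses)
peak = grid[np.argmax(field)]      # strongest singularity of the field
```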

JBHI Journal 2025 Journal Article

AI-Assisted in Silico Trial for the Optimization of Osmotherapy After Ischaemic Stroke

  • Xi Chen
  • Lei Lu
  • Tamás I. Józsa
  • Jiandong Zhou
  • David A. Clifton
  • Stephen J. Payne

Over the past few decades, osmotherapy has commonly been employed to reduce intracranial pressure in post-stroke oedema. However, evaluating the effectiveness of osmotherapy has been challenging due to the difficulties in clinical intracranial pressure measurement. As a result, there are no established guidelines regarding the selection of administration protocol parameters. Considering that the infusion of osmotic agents can also give rise to various side effects, the effectiveness of osmotherapy has remained a subject of debate. In previous studies, we proposed the first mathematical model for the investigation of osmotherapy and validated the model with clinical intracranial pressure data. The physiological parameters vary among patients and such variations can result in the failure of osmotherapy. Here, we propose an AI-assisted in silico trial for further investigation of the optimisation of administration protocols. The proposed deep neural network predicts intracranial pressure evolution over osmotherapy episodes. The effects of the parameters and the choice of dose of osmotic agents are investigated using the model. In addition, clinical stratifications of patients are related to a brain model for the first time for the optimisation of treatment of different patient groups. This provides an alternative approach to tackle clinical challenges with in silico trials supported by both mathematical/physical laws and patient-specific biomedical information.

AAAI Conference 2025 Conference Paper

Asynchronous Federated Clustering with Unknown Number of Clusters

  • Yunfan Zhang
  • Yiqun Zhang
  • Yang Lu
  • Mengke Li
  • Xi Chen
  • Yiu-ming Cheung

Federated Clustering (FC) is crucial to mining knowledge from unlabeled non-Independent Identically Distributed (non-IID) data provided by multiple clients while preserving their privacy. Most existing attempts learn cluster distributions at local clients, then securely pass the desensitized information to the server for aggregation. However, some tricky but common FC problems are still relatively unexplored, including heterogeneity in clients' communication capacity and the unknown number of proper clusters. To further bridge the gap between FC and real application scenarios, this paper first shows that the clients' communication asynchrony and unknown proper cluster numbers are complex coupling problems, and then proposes an Asynchronous Federated Cluster Learning (AFCL) method accordingly. It spreads an excess of seed points to clients as a learning medium and coordinates them across clients to form a consensus. To alleviate the distribution imbalance accumulated due to unforeseen asynchronous uploading from heterogeneous clients, we also design a balancing mechanism for seed updating. As a result, the seeds gradually adapt to each other to reveal a proper number of clusters. Extensive experiments demonstrate the efficacy of AFCL.

IJCAI Conference 2025 Conference Paper

AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing

  • Biao Yang
  • Muqi Huang
  • Yuhui Zhang
  • Yun Xiong
  • Kun Zhou
  • Xi Chen
  • Shiyang Zhou
  • Huishuai Bao

Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reutilize the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with friendly interaction. Our results demonstrate a performance that surpasses most state-of-the-art methods with significantly faster speeds, showing a more efficient and semantically coherent solution for point-based image editing tasks. Code is released at: https://github.com/GPlaying/AttentionDrag.

NeurIPS Conference 2025 Conference Paper

ComPO: Preference Alignment via Comparison Oracles

  • Peter Chen
  • Xi Chen
  • Wotao Yin
  • Tianyi Lin

Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on zeroth-order, comparison-based optimization via comparison oracles and provide convergence guarantees for its basic scheme. Second, we improve our method with several heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative for addressing the limitations of existing direct alignment methods. A highlight of our work is that we provide evidence for the importance of designing specialized methods for preference pairs with distinct likelihood margins, which complements the recent findings of Razin et al. (2025).
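The comparison-oracle idea can be sketched as a sign-based zeroth-order update: probe a random direction and move toward whichever side the oracle prefers, using no gradients at all. The toy distance-based oracle below is a stand-in for real preference comparisons, not the paper's scheme.

```python
import numpy as np

def comparison_step(theta, oracle, rng, delta=0.1, lr=0.05):
    """One zeroth-order update from a pairwise comparison oracle:
    probe a random unit direction and step toward the preferred side."""
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)
    s = oracle(theta + delta * u, theta - delta * u)  # +1 or -1
    return theta + lr * s * u

# Toy stand-in for a preference signal: prefer parameters nearer a target.
target = np.array([1.0, -2.0])
def oracle(a, b):
    return 1.0 if np.linalg.norm(a - target) < np.linalg.norm(b - target) else -1.0

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    theta = comparison_step(theta, oracle, rng)
```

After a few hundred comparisons, theta drifts close to the target even though the optimizer never sees a gradient or a numeric objective value, only binary preferences.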

IJCAI Conference 2025 Conference Paper

Connector-S: A Survey of Connectors in Multi-modal Large Language Models

  • Xun Zhu
  • Zheng Zhang
  • Xi Chen
  • Yiming Shi
  • Miao Li
  • Ji Wu

With the rapid advancements in multi-modal large language models (MLLMs), connectors play a pivotal role in bridging diverse modalities and enhancing model performance. However, the design and evolution of connectors have not been comprehensively analyzed, leaving gaps in understanding how these components function and hindering the development of more powerful connectors. In this survey, we systematically review the current progress of connectors in MLLMs and present a structured taxonomy that categorizes connectors into atomic operations (mapping, compression, mixture of experts) and holistic designs (multi-layer, multi-encoder, multi-modal scenarios), highlighting their technical contributions and advancements. Furthermore, we discuss several promising research frontiers and challenges, including high-resolution input, dynamic compression, guide information selection, combination strategy, and interpretability. This survey is intended to serve as a foundational reference and a clear roadmap for researchers, providing valuable insights into the design and optimization of next-generation connectors to enhance the performance and adaptability of MLLMs.

ICLR Conference 2025 Conference Paper

CryoGEN: Generative Energy-based Models for Cryogenic Electron Tomography Reconstruction

  • Yunfei Teng
  • Yuxuan Ren
  • Kai Chen
  • Xi Chen
  • Zhaoming Chen
  • Qiwei Ye

Cryogenic electron tomography (Cryo-ET) is a powerful technique for visualizing subcellular structures in their native states. Nonetheless, its effectiveness is compromised by anisotropic resolution artifacts caused by the missing-wedge effect. To address this, IsoNet, a deep learning-based method, proposes iteratively reconstructing the missing-wedge information. While successful, IsoNet's dependence on recursive prediction updates often leads to training instability and model divergence. In this study, we introduce CryoGEN, an energy-based probabilistic model that not only mitigates resolution anisotropy but also removes the need for recursive subtomogram averaging, delivering an approximate 10× speedup for training. Evaluations across various biological datasets, including immature HIV-1 virions and ribosomes, demonstrate that CryoGEN significantly enhances structural completeness and interpretability of the reconstructed samples.

AAAI Conference 2025 Conference Paper

Decoupling Metacognition from Cognition: A Framework for Quantifying Metacognitive Ability in LLMs

  • Guoqing Wang
  • Wen Wu
  • Guangze Ye
  • Zhenxiao Cheng
  • Xi Chen
  • Hong Zheng

Large Language Models (LLMs) are known to hallucinate facts and make non-factual statements which can undermine trust in their output. The essence of hallucination lies in the absence of metacognition in LLMs, namely the understanding of their own cognitive processes. However, there has been limited research on quantitatively measuring metacognition within LLMs. Drawing inspiration from cognitive psychology theories, we first quantify the metacognitive ability of LLMs as their ability to evaluate the correctness of responses through confidence. Subsequently, we introduce a general framework called DMC designed to decouple metacognitive ability and cognitive ability. This framework tackles the challenge of noisy quantification caused by the coupling of metacognition and cognition in current research, such as calibration-based metrics. Specifically, the DMC framework comprises two key steps. Initially, the framework tasks the LLM with failure prediction, aiming to evaluate the model's performance in predicting failures, a performance jointly determined by both the cognitive and metacognitive abilities of the LLM. Following this, the framework disentangles metacognitive ability and cognitive ability based on the failure prediction performance, providing a quantification of the LLM's metacognitive ability independent of cognitive influences. Experiments conducted on eight datasets across five domains reveal that (1) our proposed DMC framework effectively separates the metacognition and cognition of LLMs; (2) various confidence elicitation methods impact the quantification of metacognitive ability differently; (3) stronger metacognitive ability is exhibited by LLMs with better overall performance; and (4) enhancing metacognition holds promise for alleviating hallucination issues.

AAAI Conference 2025 Conference Paper

Disentangled Modeling of Preferences and Social Influence for Group Recommendation

  • Guangze Ye
  • Wen Wu
  • Guoqing Wang
  • Xi Chen
  • Hong Zheng
  • Liang He

Group recommendation (GR) aims to suggest items for a group of users in social networks. Existing work typically considers individual preferences as the sole factor in aggregating group preferences. In practice, social influence is also an important factor in modeling users' contributions to the final group decision. However, existing methods either neglect the social influence of individual members or bundle preferences and social influence together as a unified representation. As a result, these models emphasize the preferences of the majority within the group rather than the actual interaction items, which we refer to as the preference bias issue in GR. Moreover, the self-supervised learning (SSL) strategies they designed to address the issue of group data sparsity fail to account for users' contextual social weights when regulating group representations, leading to suboptimal results. To tackle these issues, we propose a novel model based on Disentangled Modeling of Preferences and Social Influence for Group Recommendation (DisRec). Concretely, we first design a user-level disentangling network to disentangle the preferences and social influence of group members with separate embedding propagation schemes based on (hyper)graph convolution networks. We then introduce a social-based contrastive learning strategy, selectively excluding user nodes based on their social importance to enhance group representations and alleviate the group-level data sparsity issue. The experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two real-world datasets.

JBHI Journal 2025 Journal Article

Fall Detection Method Based on a Human Electrostatic Field and VMD-ECANet Architecture

  • Xi Chen
  • Jiaao Yan
  • Sichao Qin
  • Pengfei Li
  • Shuangqian Ning
  • Yuting Liu

Falls are one of the most serious health risks faced by older adults worldwide, and they can have a significant impact on their physical and mental well-being as well as their quality of life. Detecting falls promptly and accurately and providing assistance can effectively reduce the harm caused by falls to older adults. This paper proposes a noncontact fall detection method based on the human electrostatic field and a VMD-ECANet framework. An electrostatic measurement system was used to measure the electrostatic signals of four types of falling postures and five types of daily actions. The signals were randomly divided in proportion and by individuals to construct a training set and test set. A fall detection model based on the VMD-ECA network was proposed that decomposes electrostatic signals into modal component signals using the variational mode decomposition (VMD) technique. These signals were then fed into a multichannel convolutional neural network for feature extraction. Information fusion was achieved through the efficient channel attention network (ECANet) module. Finally, the extracted features were input into a classifier to obtain the output results. The constructed model achieved an accuracy of 96.44%. The proposed fall detection solution has several advantages, including being noncontact, cost-effective, and privacy friendly. It is suitable for detecting indoor falls by older individuals living alone and helps to reduce the harm caused by falls.
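The channel-attention fusion step can be sketched as follows. The uniform cross-channel filter is a stand-in for ECANet's learned 1-D convolution, and the random matrices stand in for VMD modal components of an electrostatic signal; this is a toy illustration of the gating pattern, not the paper's trained module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca_gate(features, k=3):
    """ECA-style channel attention over features shaped (channels, time):
    global average pool per channel, a cheap 1-D cross-channel filter
    (uniform weights stand in for the learned conv), then a sigmoid gate."""
    gap = features.mean(axis=1)                       # (C,) pooled descriptor
    pad = k // 2
    padded = np.pad(gap, pad, mode="edge")
    mixed = np.array([padded[i:i + k].mean() for i in range(gap.size)])
    gate = sigmoid(mixed)                             # per-channel weight in (0, 1)
    return features * gate[:, None]                   # rescale each channel

# Stand-ins for VMD modal components: 4 modes, 100 time samples.
rng = np.random.default_rng(0)
modes = rng.normal(size=(4, 100))
fused = eca_gate(modes)
```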

NeurIPS Conference 2025 Conference Paper

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

  • Xi Chen
  • Kaituo Feng
  • Changsheng Li
  • Xunhao Lai
  • Xiangyu Yue
  • Ye Yuan
  • Guoren Wang

Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. To resolve this, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. In light of this, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to achieve this goal. Moreover, we find that there are potential loss spikes during training. To address this, we further put forward a norm-growth limiter to smooth the gradient. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore. Notably, for pre-training LLaMA 7B, our Fira uses 8× less optimizer-state memory than GaLore, yet outperforms it by a large margin.
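The norm-growth limiter can be sketched as a simple cap on step-to-step gradient-norm growth, which is how a spike gets smoothed away. The growth factor and the rule below are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def norm_growth_limiter(grad, prev_norm, gamma=1.01):
    """Cap how fast the gradient norm may grow between steps.
    If the new norm exceeds gamma * prev_norm, rescale the gradient
    (keeping its direction) so the norm grows by at most gamma."""
    norm = np.linalg.norm(grad)
    if prev_norm is not None and norm > gamma * prev_norm:
        grad = grad * (gamma * prev_norm / norm)
        norm = gamma * prev_norm
    return grad, norm

# A sudden 20x spike in gradient norm is clipped back to 1% growth.
spiky = np.ones(4) * 10.0                 # norm 20, vs. previous norm 1
limited, new_norm = norm_growth_limiter(spiky, prev_norm=1.0)
```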

TMLR Journal 2025 Journal Article

Foundation Models Meet Federated Learning: A One-shot Feature-sharing Method with Privacy and Performance Guarantees

  • Mahdi Beitollahi
  • Alex Bie
  • Sobhan Hemati
  • Leo Maxime Brunswic
  • Xu Li
  • Xi Chen
  • Guojun Zhang

Adapting foundation models for downstream tasks via Federated Learning (FL) is a promising strategy for protecting privacy while leveraging the capability of foundation models. However, FL's iterative training and model transmission result in high communication costs and GPU memory demands, making large foundation models impractical for FL. This paper introduces a one-shot FL method with a server-side performance bound to enable foundation models by reducing communication costs and GPU memory requirements. Our approach, FedPFT (FL with Parametric Feature Transfer), involves clients learning and transferring parametric models for features extracted from frozen foundation models in a single round. Parametric models are then used to generate synthetic features at the server to train a classifier head. We evaluate FedPFT across eight vision datasets using three vision foundation models. Our findings demonstrate that FedPFT is agnostic to data heterogeneity and network topology, and it improves the communication-accuracy frontier by up to 7.8%. Finally, we show FedPFT's compatibility with differential privacy and its resilience against reconstruction attacks. Our work highlights the capability of private, feature-sharing methods for one-shot knowledge transfer using foundation models.
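The parametric feature transfer can be sketched as fitting a per-class distribution over client-side features and sampling synthetic features at the server. The Gaussian family and toy dimensions below are illustrative assumptions; the raw features never leave the client, only the fitted parameters do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Client side: summarize frozen-backbone features of one class as a Gaussian
# (one possible parametric family; the method's actual choice may differ).
def fit_parametric(features):
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Server side: regenerate synthetic features from the transferred parameters,
# then use them to train a classifier head.
def sample_synthetic(mean, cov, n):
    return rng.multivariate_normal(mean, cov, size=n)

client_features = rng.normal(loc=3.0, scale=1.0, size=(200, 4))  # stand-in features
mu, cov = fit_parametric(client_features)
synthetic = sample_synthetic(mu, cov, n=500)
```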

AAAI Conference 2025 Conference Paper

HFF-Tracker: A Hierarchical Fine-grained Fusion Tracker for Referring Multi-Object Tracking

  • Zeyong Zhao
  • Yanchao Hao
  • Minghao Zhang
  • Qingbin Liu
  • Bo Li
  • Dianbo Sui
  • Shizhu He
  • Xi Chen

Referring Multi-Object Tracking (RMOT) aims to track multiple objects based on a provided language expression. Although prior studies have sought to accomplish this by integrating a textual module into the multi-object tracker, these methods combine text and image features in a basic way, neglecting the importance of text features. In this study, we propose a Hierarchical Fine-grained text-image Fusion tracker, named HFF-Tracker, which can perform fine-grained fusion of pixel-level visual features and text features across various semantic levels. Specifically, we have devised a Hierarchical Multi-Modal Fusion (HMMF) module to merge text and image features at an early stage in a hierarchical and detailed manner. The Text-Guided Decoder (TGD) is designed to provide the query with prior semantic information during the decoding process. Additionally, we have crafted a Text-Guided Prediction Head (TGPH) that utilizes text information to enhance the performance of the prediction head. Furthermore, we have implemented an adaptive Look-Back training strategy to maximize the utilization of valuable labeled data. Extensive experiments on the Refer-KITTI dataset and the Refer-KITTI-V2 dataset demonstrate that our proposed HFF-Tracker outperforms other state-of-the-art methods by remarkable margins.

ICML Conference 2025 Conference Paper

iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection

  • Huahui Yi
  • Wei Xu 0046
  • Ziyuan Qin 0001
  • Xi Chen
  • Xiaohu Wu
  • Kang Li 0004
  • Qicheng Lao

Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as ODinM-13, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 12.88%, and 4.59% in full data, 1-shot, 10-shot, and 50-shot settings, respectively.

NeurIPS Conference 2025 Conference Paper

LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

  • Zhenpeng Huang
  • Jiaqi Li
  • Zihan Jia
  • Xinhao Li
  • Desen Meng
  • Lingxue Song
  • Xi Chen
  • Liang Li

We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model's scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, and then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.

NeurIPS Conference 2025 Conference Paper

MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

  • Xi Chen
  • Mingkang Zhu
  • Shaoteng Liu
  • Xiaoyang Wu
  • Xiaogang Xu
  • Yu Liu
  • Xiang Bai
  • Hengshuang Zhao

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual cues and perform logical reasoning to succeed. Experimental results demonstrate that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.

NeurIPS Conference 2025 Conference Paper

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

  • Yuanhao Cai
  • He Zhang
  • Xi Chen
  • Jinbo Xing
  • Yiwei Hu
  • Yuqian Zhou
  • Kai Zhang
  • Zhifei Zhang

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Project page is at https://caiyuanhao1998.github.io/project/OmniVCus/

NeurIPS Conference 2025 Conference Paper

PlayerOne: Egocentric World Simulator

  • Yuanpeng Tu
  • Hao Luo
  • Xi Chen
  • Xiang Bai
  • Fan Wang
  • Hengshuang Zhao

We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.

AAAI Conference 2025 Conference Paper

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

  • Zhengfei Xu
  • Sijia Zhao
  • Yanchao Hao
  • Xiaolong Liu
  • Lili Li
  • Yuyang Yin
  • Bo Li
  • Xi Chen

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding towards a fine-grained level. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention by a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.

NeurIPS Conference 2025 Conference Paper

ROSE: Remove Objects with Side Effects in Videos

  • Chenxuan Miao
  • Yutong Feng
  • Jianshu Zeng
  • Zixiang Gao
  • Hantang Liu
  • Yunfeng Yan
  • Donglian Qi
  • Xi Chen

Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE (Remove Objects with Side Effects), a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model's performance on various kinds of side-effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.

JBHI Journal 2025 Journal Article

RPD: Regional Prior Distillation for Breast Cancer Diagnosis in Ultrasound Images

  • Yi Lin
  • Haosen Wang
  • Yingnan Zhao
  • Dan Lu
  • Yanchen Xu
  • Jiexiao Xue
  • Xi Chen
  • Jingchi Jiang

Breast cancer is the leading cause of death among women worldwide. Ultrasound imaging is an important means for the early detection of breast cancer, improving the survival rate. Due to the shortage of experienced sonographers, computer-aided systems for breast cancer recognition become particularly important. Some recent studies analyze tumor types in lesion regions but rely on predefined ROIs. Some other studies recognize cancer in the whole ultrasound image, but always suffer from the extremely variable proportion, location and quantity of the tumor lesions. In this paper, we propose a regional prior distillation (RPD) framework for breast cancer diagnosis in ultrasound images. To enhance the analysis of the tumor region, we propose an Image-Cross Attention (ICA) to fuse the predefined ROI prior information with ultrasound images and train a prior-fused model. To remove the constraint of predefined ROIs, we propose a Distribution Distillation Learning (DDL) to distill the prior-fused sample distribution from the prior-fused model into a diagnostic model, which analyzes the disease from only ultrasound images, based on the knowledge distillation paradigm of the teacher-student framework. Comprehensive experiments are conducted on multi-institutional datasets to validate the proposed RPD framework. The results demonstrate the following points. The ICA fuses regional prior information adequately, leading to a high-performance prior-fused model. The DDL distills the prior information effectively, enhancing the diagnostic model to focus on the tumor lesions. The performance of the diagnostic model surpasses that of current SOTA methods by 1.66% in accuracy and 0.64% in AUC. In addition, the diagnostic model is robust to slight perturbations and achieves good generalization performance.

NeurIPS Conference 2025 Conference Paper

Seg-VAR: Image Segmentation with Visual Autoregressive Modeling

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Hengshuang Zhao

While visual autoregressive modeling (VAR) strategies have shed light on image generation with the autoregressive models, their potential for segmentation, a task that requires precise low-level spatial perception, remains unexplored. Inspired by the multi-scale modeling of classic Mask2Former-based models, we propose Seg-VAR, a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing the discriminative learning with the latent learning process. Specifically, our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of segmentation mask) encoder that maps segmentation masks into discrete latent tokens using a location-sensitive color mapping to distinguish instances, and (3) a decoder reconstructing masks from these latents. A multi-stage training strategy is introduced: first learning seglat representations via image-seglat joint training, then refining latent transformations, and finally aligning image-encoder-derived latents with seglat distributions. Experiments show Seg-VAR outperforms previous discriminative and generative methods on various segmentation tasks and validation benchmarks. By framing segmentation as a sequential hierarchical prediction task, Seg-VAR opens new avenues for integrating autoregressive reasoning into spatial-aware vision systems.

AAAI Conference 2025 Conference Paper

TC-LLaVA: Rethinking the Transfer of LLava from Image to Video Understanding with Temporal Considerations

  • Mingze Gao
  • Jingyu Liu
  • Mingda Li
  • Jiangtao Xie
  • Qingbin Liu
  • Kevin Zhao
  • Xi Chen
  • Hui Xiong

Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.

AAAI Conference 2025 Conference Paper

The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments could be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. Thus, it is important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general, class of unknown perturbations, and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in reward-perturbed environments.

RLC Conference 2025 Conference Paper

Understanding Learned Representations and Action Collapse in Visual Reinforcement Learning

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

In contrast to deep learning models trained with supervised data, visual reinforcement learning (VRL) models learn to represent their environment implicitly via the process of seeking higher rewards. However, there has been little research on the specific representations VRL models learn. Using linear probing, we study the extent to which VRL models learn to linearly represent the ground truth vectorized state of an environment, on which layers these representations are most accessible, and how this relates to the reward achieved by the final model. We observe that poorly performing agents differ substantially from well-performing ones in the representation learned in their later MLP layers, but not their earlier CNN layers. When an agent is initialized by reusing the later layers of a poorly performing agent, the result is always poor. These poorly performing agents end up with no entropy in their actor network output, a phenomenon we call action collapse. Based on these observations, we propose a simple rule to prevent action collapse during training, leading to better performance on tasks with image observations with no additional computational cost. Code is available at: https://github.com/cx441000319/action-collapse.

RLJ Journal 2025 Journal Article

Understanding Learned Representations and Action Collapse in Visual Reinforcement Learning

  • Xi Chen
  • Zhihui Zhu
  • Andrew Perrault

In contrast to deep learning models trained with supervised data, visual reinforcement learning (VRL) models learn to represent their environment implicitly via the process of seeking higher rewards. However, there has been little research on the specific representations VRL models learn. Using linear probing, we study the extent to which VRL models learn to linearly represent the ground truth vectorized state of an environment, on which layers these representations are most accessible, and how this relates to the reward achieved by the final model. We observe that poorly performing agents differ substantially from well-performing ones in the representation learned in their later MLP layers, but not their earlier CNN layers. When an agent is initialized by reusing the later layers of a poorly performing agent, the result is always poor. These poorly performing agents end up with no entropy in their actor network output, a phenomenon we call action collapse. Based on these observations, we propose a simple rule to prevent action collapse during training, leading to better performance on tasks with image observations with no additional computational cost. Code is available at: https://github.com/cx441000319/action-collapse.

TMLR Journal 2025 Journal Article

Uniform Noise Distribution and Compact Clusters: Unveiling the Success of Self-Supervised Learning in Label Noise

  • Pengcheng Xu
  • Li Yi
  • Gezheng Xu
  • Xi Chen
  • Ian McLeod
  • Charles Ling
  • Boyu Wang

Label noise is ubiquitous in real-world datasets, posing significant challenges to machine learning models. While self-supervised learning (SSL) algorithms have empirically demonstrated effectiveness in learning noisy labels, the theoretical understanding of their effectiveness remains underexplored. In this paper, we present a theoretical framework to understand how SSL methods enhance learning with noisy labels, especially for the instance-dependent label noise. We reveal that the uniform and compact cluster structures induced by contrastive SSL play a crucial role in mitigating the adverse effects of label noise. Specifically, we theoretically show that a classifier trained on SSL-learned representations significantly outperforms one trained using traditional supervised learning methods. This results from two key merits of SSL representations over label noise: 1. Uniform Noise Distribution: Label noise becomes uniformly distributed over SSL representations with respect to the true class labels, rather than the noisy ones, leading to an easier learning task. 2. Enhanced Cluster Structure: SSL enhances the formation of well-separated and compact categorical clusters, increasing inter-class distances while tightening intra-class clusters. We further theoretically justify the benefits of training a classifier on such structured representations, demonstrating that it encourages the classifier trained on noisy data to be aligned with the optimal classifier. Extensive experiments validate the robustness of SSL representations in combating label noise, confirming the practical values of our theoretical findings.

NeurIPS Conference 2025 Conference Paper

Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models

  • Siwei Zhang
  • Yun Xiong
  • Yateng Tang
  • Jiarong Xu
  • Xi Chen
  • Zehao Gu
  • Xuehao Zheng
  • Zi'an Jia

Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present CROSS, a flexible framework that seamlessly extends existing TGNNs for TTAG modeling. CROSS is designed by decomposing the TTAG modeling process into two phases: (i) temporal semantics extraction; and (ii) semantic-structural information unification. The key idea is to advance the large language models (LLMs) to dynamically extract the temporal semantics in text space and then generate cohesive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the CROSS framework, which empowers LLMs to offer the temporal semantic understanding of node's evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experiments show that CROSS achieves state-of-the-art results on four public datasets and one industrial dataset, with 24.7% absolute MRR gain on average in temporal link prediction and 3.7% AUC gain in node classification of industrial application.

NeurIPS Conference 2025 Conference Paper

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

  • Chaofan Gan
  • Yuanpeng Tu
  • Xi Chen
  • Tieyuan Chen
  • Yuxi Li
  • Mehrtash Harandi
  • Weiyao Lin

Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).

AAAI Conference 2025 Conference Paper

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

  • Yongxin Guo
  • Jingyu Liu
  • Mingda Li
  • Dingxin Cheng
  • Xiaoying Tang
  • Dianbo Sui
  • Qingbin Liu
  • Xi Chen

Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries, significantly impacting downstream tasks like video browsing and editing. Unlike traditional task-specific models, Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner. Consequently, exploring the application of video LLMs for VTG tasks has become a burgeoning research area. However, despite considerable advancements in video content understanding, video LLMs often struggle to accurately pinpoint timestamps within videos, limiting their effectiveness in VTG tasks. To address this, we introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities. Our approach includes: (1) effectively integrating timestamp knowledge into visual tokens; (2) incorporating absolute-time tokens to manage timestamp knowledge without concept shifts; and (3) introducing a lightweight, high-performance, slot-based token compression technique designed to accommodate the demands of a large number of frames to be sampled for VTG tasks. Additionally, we present VTG-IT-120K, a collection of publicly available VTG datasets that we have re-annotated to improve upon low-quality annotations. Our comprehensive experiments demonstrate the superior performance of VTG-LLM in comparison to other video LLM methods across a variety of VTG tasks.

NeurIPS Conference 2025 Conference Paper

Zero-shot Denoising via Neural Compression: Theoretical and algorithmic framework

  • Ali Zafari
  • Xi Chen
  • Shirin Jalali

Zero-shot denoising aims to denoise observations without access to training samples or clean reference images. This setting is particularly relevant in practical imaging scenarios involving specialized domains such as medical imaging or biology. In this work, we propose the Zero-Shot Neural Compression Denoiser (ZS-NCD), a novel denoising framework based on neural compression. ZS-NCD treats a neural compression network as an untrained model, optimized directly on patches extracted from a single noisy image. The final reconstruction is then obtained by aggregating the outputs of the trained model over overlapping patches. Thanks to the built-in entropy constraints of compression architectures, our method naturally avoids overfitting and does not require manual regularization or early stopping. Through extensive experiments, we show that ZS-NCD achieves state-of-the-art performance among zero-shot denoisers for both Gaussian and Poisson noise, and generalizes well to both natural and non-natural images. Additionally, we provide new finite-sample theoretical results that characterize upper bounds on the achievable reconstruction error of general maximum-likelihood compression-based denoisers. These results further establish the theoretical foundations of compression-based denoising. Our code is available at: https://github.com/Computational-Imaging-RU/ZS-NCDenoiser.

TMLR Journal 2024 Journal Article

3D Molecular Generation via Virtual Dynamics

  • Shuqi Lu
  • Lin Yao
  • Xi Chen
  • Hang Zheng
  • Di He
  • Guolin Ke

Structure-based drug design, a critical aspect of drug discovery, aims to identify high-affinity molecules for target protein pockets. Traditional virtual screening methods, which involve exhaustive searches within large molecular databases, are inefficient and limited in discovering novel molecules. The pocket-based 3D molecular generation model offers a promising alternative by directly generating molecules with 3D structures and binding positions in the pocket. In this paper, we present VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen features a series of carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket, VD-Gen randomly initializes multiple virtual particles within the pocket and learns to iteratively move them to approximate the distribution of molecular atoms in 3D space. After the iterative movement, a 3D molecule is extracted and further refined through additional iterative movement, yielding a high-quality 3D molecule with a confidence score. Comprehensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules that fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.

ICML Conference 2024 Conference Paper

Bagged Deep Image Prior for Recovering Images in the Presence of Speckle Noise

  • Xi Chen
  • Zhewen Hou
  • Christopher A. Metzler
  • Arian Maleki
  • Shirin Jalali

We investigate both the theoretical and algorithmic aspects of likelihood-based methods for recovering a complex-valued signal from multiple sets of measurements, referred to as looks, affected by speckle (multiplicative) noise. Our theoretical contributions include establishing the first existing theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our theoretical results capture the dependence of MSE upon the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we introduce the concept of bagged Deep Image Priors (Bagged-DIP) and integrate them with projected gradient descent. Furthermore, we show how employing the Newton-Schulz algorithm for calculating matrix inverses within the iterations of PGD reduces the computational complexity of the algorithm. We show that this method achieves state-of-the-art performance.
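The Newton-Schulz trick mentioned in the abstract replaces an explicit matrix inverse inside each PGD iteration with a short multiplication-only fixed-point iteration. As a minimal illustrative sketch (the seeding and iteration count below are standard textbook choices, not details taken from the paper):

```python
def newton_schulz_inverse(a, iters=30):
    """Approximate the inverse of a square matrix A via Newton-Schulz:
        X_{k+1} = X_k (2I - A X_k),
    which converges quadratically when the spectral radius of (I - A X_0)
    is below 1. The seed X_0 = A^T / (||A||_1 * ||A||_inf) is a standard
    choice that guarantees convergence for nonsingular A.
    """
    n = len(a)

    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    # Matrix 1-norm (max column sum) and inf-norm (max row sum) for the seed.
    norm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norminf = max(sum(abs(a[i][j]) for j in range(n)) for i in range(n))
    x = [[a[j][i] / (norm1 * norminf) for j in range(n)] for i in range(n)]

    for _ in range(iters):
        ax = matmul(a, x)
        # Form (2I - A X) and update X <- X (2I - A X).
        residual = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                    for i in range(n)]
        x = matmul(x, residual)
    return x
```

Because the update uses only matrix multiplications, it maps well onto GPU-friendly iterative solvers, which is the kind of cost saving the abstract alludes to.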

TMLR Journal 2024 Journal Article

Beyond Loss Functions: Exploring Data-Centric Approaches with Diffusion Model for Domain Generalization

  • Sobhan Hemati
  • Mahdi Beitollahi
  • Amir Hossein Estiri
  • Bassel Al Omari
  • Soufiane Lamghari
  • Yasser H. Khalil
  • Xi Chen
  • Guojun Zhang

There has been a huge effort to tackle the Domain Generalization (DG) problem with a focus on developing new loss functions. Inspired by the image generation capabilities of the diffusion models, we pose a pivotal question: Can diffusion models function as data augmentation tools to address DG from a data-centric perspective, rather than relying on the loss functions? Our findings reveal that trivial cross-domain data augmentation (CDGA) along with the vanilla ERM using readily available diffusion models without additional finetuning outperforms state-of-the-art (SOTA) training algorithms. This paper delves into the exploration of why and how this rudimentary data generation can outperform complicated DG algorithms. With the help of domain shift quantification tools, we empirically show that CDGA reduces the domain shift between domains. We empirically reveal connections between the loss landscape, adversarial robustness, and data generation, illustrating that CDGA reduces loss sharpness and improves robustness against adversarial shifts in data. Additionally, we discuss our intuitions that CDGA along with ERM can be considered as a way to replace the pointwise kernel estimates in ERM with new density estimates in the vicinity of domain pairs which can diminish the true data estimation error of ERM under domain shift scenario. These insights advocate for further investigation into the potential of data-centric approaches in DG.

AAAI Conference 2024 Conference Paper

Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space

  • Mohsin Hasan
  • Guojun Zhang
  • Kaiyang Guo
  • Xi Chen
  • Pascal Poupart

Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client’s dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods which collect parameter samples from local posteriors, and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and product of the predictive posteriors, using a tunable parameter β. This parameter is tuned to improve the global ensemble’s calibration, before it is distilled to a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration to other baselines, even as data heterogeneity increases. Code available at https://github.com/hasanmohsin/betaPredBayesFL. Our paper's full version is at https://arxiv.org/abs/2312.09817.
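The abstract's central idea, interpolating between a mixture and a product of the clients' predictive posteriors with a tunable β, can be sketched for categorical predictions. The log-space interpolation below is one plausible reading for illustration only; the paper defines the exact rule and how β is tuned for calibration:

```python
import math

def beta_interpolated_posterior(local_preds, beta):
    """Combine per-client categorical predictive distributions.

    beta = 1 recovers the mixture (average) of the local posteriors,
    which tends to be diffuse; beta = 0 recovers their normalized
    product, which the abstract notes is systematically overconfident.
    Intermediate beta interpolates between the two in log space.
    Hypothetical sketch; not the paper's exact formulation.
    """
    n_classes = len(local_preds[0])
    # Mixture: class-wise average of the local distributions.
    mix = [sum(p[c] for p in local_preds) / len(local_preds)
           for c in range(n_classes)]
    # Product: class-wise product, renormalized to a distribution.
    prod = [math.prod(p[c] for p in local_preds) for c in range(n_classes)]
    z = sum(prod)
    prod = [v / z for v in prod]
    # Geometric interpolation between mixture and product, renormalized.
    log_comb = [beta * math.log(m) + (1 - beta) * math.log(q)
                for m, q in zip(mix, prod)]
    comb = [math.exp(v) for v in log_comb]
    z = sum(comb)
    return [v / z for v in comb]
```

For example, with two clients predicting [0.7, 0.3] and [0.6, 0.4], β = 1 returns the mixture [0.65, 0.35], while β = 0 returns the sharper normalized product.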

TMLR Journal 2024 Journal Article

DFML: Decentralized Federated Mutual Learning

  • Yasser H. Khalil
  • Amir Hossein Estiri
  • Mahdi Beitollahi
  • Nader Asadi
  • Sobhan Hemati
  • Xu Li
  • Guojun Zhang
  • Xi Chen

In the realm of real-world devices, centralized servers in Federated Learning (FL) present challenges including communication bottlenecks and susceptibility to a single point of failure. Additionally, contemporary devices inherently exhibit model and data heterogeneity. Existing work lacks a Decentralized FL (DFL) framework capable of accommodating such heterogeneity without imposing architectural restrictions or assuming the availability of additional data. To address these issues, we propose a Decentralized Federated Mutual Learning (DFML) framework that is serverless, supports nonrestrictive heterogeneous models, and avoids reliance on additional data. DFML effectively handles model and data heterogeneity through mutual learning, which distills knowledge between clients, and cyclically varying the amount of supervision and distillation signals. Extensive experimental results demonstrate consistent effectiveness of DFML in both convergence speed and global accuracy, outperforming prevalent baselines under various conditions. For example, with the CIFAR-100 dataset and 50 clients, DFML achieves a substantial increase of +17.20% and +19.95% in global accuracy under Independent and Identically Distributed (IID) and non-IID data shifts, respectively.

AAAI Conference 2024 Conference Paper

Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging

  • Fulin Luo
  • Xi Chen
  • Xiuwen Gong
  • Weiwen Wu
  • Tan Guo

The coded aperture snapshot spectral imaging (CASSI) system is an effective approach to hyperspectral snapshot compressive imaging. The core issue of CASSI is to solve the inverse problem of reconstructing the hyperspectral image (HSI). In recent years, Transformer-based methods have achieved promising performance in HSI reconstruction. However, capturing both long-range dependencies and local information while ensuring reasonable computational costs remains a challenging problem. In this paper, we propose a Transformer-based HSI reconstruction method called dual-window multiscale Transformer (DWMT), which is a coarse-to-fine process, reconstructing the global properties of HSI with long-range dependencies. In our method, we propose a novel U-Net architecture using a dual-branch encoder to refine pixel information and full-scale skip connections to fuse different features, enhancing the extraction of fine-grained features. Meanwhile, we design a novel self-attention mechanism called dual-window multiscale multi-head self-attention (DWM-MSA), which utilizes two windows of different sizes to compute self-attention, capturing long-range dependencies in a local region at different scales to improve the reconstruction performance. We also propose a novel position embedding method for Transformer, named con-abs position embedding (CAPE), which effectively enhances the positional information of the HSIs. Extensive experiments on both simulated and real data are conducted to demonstrate the superior performance, stability, and generalization ability of our DWMT. Code of this project is at https://github.com/chenx2000/DWMT.

AAAI Conference 2024 Conference Paper

Editing Language Model-Based Knowledge Graph Embeddings

  • Siyuan Cheng
  • Ningyu Zhang
  • Bozhong Tian
  • Xi Chen
  • Qingbin Liu
  • Huajun Chen

Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify after deployment without re-training. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources. Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG.

AAAI Conference 2024 Conference Paper

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

  • Xi Chen
  • Chang Gao
  • Zuowen Wang
  • Longbiao Cheng
  • Sheng Zhou
  • Shih-Chii Liu
  • Tobi Delbruck

Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.

IS Journal 2024 Journal Article

Exploring alterations of brain networks of AD patients using WTC method

  • Li Yapeng
  • Yuanyuan Qin
  • Xi Chen
  • Wei Li

Objective: To explore the influence of different frequency bands on the preprocessing of resting-state fMRI datasets used by the Wavelet Transform Coherence (WTC) method, and to study changes in the functional brain networks of AD patients. Method: Resting-state fMRI datasets of 10 AD patients and 11 healthy controls were collected in this study, and time series of 90 brain regions defined by AAL (Automated Anatomical Labeling) were extracted after preprocessing. Wavelet transformation was performed on each time series, and a functional brain network was established at different frequencies (0.125 Hz, 0.0625 Hz) using the WTC method. The topology parameters of the networks, including global efficiency, clustering coefficient, average shortest path length, and small-world property, were calculated and averaged within each group. Result: The results imply that there are significant differences in the topology parameters of networks at different frequencies. Likewise, statistical analysis of the topology parameters of AD and HC (Healthy Controls) shows that the global efficiency, clustering coefficient, and small-world properties of AD all decreased by varying degrees, while the average shortest path length of AD remained longer. Conclusion: Our research provides a theoretical basis for the choice of filter bands for data preprocessing in functional magnetic resonance imaging. The findings may serve as indicators for early diagnosis of AD patients.

TMLR Journal 2024 Journal Article

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

  • Mianchu Wang
  • Rui Yang
  • Xi Chen
  • Hao Sun
  • Meng Fang
  • Giovanni Montana

Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imagined data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.

IJCAI Conference 2024 Conference Paper

InstructEdit: Instruction-Based Knowledge Editing for Large Language Models

  • Ningyu Zhang
  • Bozhong Tian
  • Siyuan Cheng
  • Xiaozhuan Liang
  • Yi Hu
  • Kouying Xue
  • Yanjie Gou
  • Xi Chen

Knowledge editing for large language models can offer an efficient solution to alter a model’s behavior without negatively impacting the overall performance. However, current approaches encounter issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders broader applications. To address this, we take the first step to analyze the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor's adaptation to various tasks simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor's control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments involving holdout unseen tasks illustrate that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which unveils that instructions can help control the optimization direction with stronger OOD generalization.

NeurIPS Conference 2024 Conference Paper

Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking

  • Xi Chen
  • Chuan Qin
  • Chuyu Fang
  • Chao Wang
  • Chen Zhu
  • Fuzhen Zhuang
  • Hengshu Zhu
  • Hui Xiong

In a rapidly evolving job market, skill demand forecasting is crucial as it enables policymakers and businesses to anticipate and adapt to changes, ensuring that workforce skills align with market needs, thereby enhancing productivity and competitiveness. Additionally, by identifying emerging skill requirements, it directs individuals towards relevant training and education opportunities, promoting continuous self-learning and development. However, the absence of comprehensive datasets presents a significant challenge, impeding research and the advancement of this field. To bridge this gap, we present Job-SDF, a dataset designed to train and benchmark job-skill demand forecasting models. Based on millions of public job advertisements collected from online recruitment platforms, this dataset encompasses monthly recruitment demand. Our dataset uniquely enables evaluating skill demand forecasting models at various granularities, including occupation, company, and regional levels. We benchmark a range of models on this dataset, evaluating their performance in standard scenarios, in predictions focused on lower value ranges, and in the presence of structural breaks, providing new insights for further research. Our code and dataset are publicly accessible via https://github.com/Job-SDF/benchmark.

ICAPS Conference 2024 Conference Paper

More Flexible Proximity Wildcards Path Planning with Compressed Path Databases

  • Xi Chen
  • Yue Zhang
  • Yonggang Zhang

Grid-based path planning is one of the classic problems in AI, and a popular topic in application areas such as computer games and robotics. Compressed Path Databases (CPDs) are recognized as a state-of-the-art method for grid-based path planning: they can find an optimal path extremely fast without state-space search. In recent years, researchers have tended to focus on improving CPDs by reducing CPD size or improving search performance. Among various methods, proximity wildcards are one of the most proven improvements for reducing the size of CPDs. However, the proximity area is heavily restricted by complex terrain, which significantly affects pathfinding efficiency and causes additional costs. In this paper, we enhance CPDs from the perspective of improving search efficiency and reducing search costs. Our work focuses on using more flexible methods to obtain larger proximity areas, so that more heuristic information can be used to improve search performance. Experiments conducted on the Grid-Based Path Planning Competition (GPPC) benchmarks demonstrate that the two proposed methods can effectively improve search efficiency and reduce search costs by up to 3 orders of magnitude. Remarkably, our methods can further reduce the storage cost and improve the compression capability of CPDs simultaneously.

ICRA Conference 2024 Conference Paper

Open X-Embodiment: Robotic Learning Datasets and RT-X Models: Open X-Embodiment Collaboration

  • Abby O'Neill
  • Abdul Rehman
  • Abhiram Maddukuri
  • Abhishek Gupta 0004
  • Abhishek Padalkar
  • Abraham Lee
  • Acorn Pooley
  • Agrim Gupta

Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a "generalist" X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. The project website is robotics-transformer-x.github.io.

ICLR Conference 2024 Conference Paper

PolyVoice: Language Models for Speech to Speech Translation

  • Qianqian Dong
  • Zhiying Huang
  • Qi Tian 0001
  • Chen Xu 0008
  • Tom Ko
  • Yunlong Zhao 0004
  • Siyuan Feng
  • Tang Li 0001

With the huge success of GPT models in natural language processing, there is a growing interest in applying language modeling approaches to speech tasks. Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese→English and English→Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality. Speech samples are available at https://polyvoice.github.io.

IJCAI Conference 2024 Conference Paper

Pre-DyGAE: Pre-training Enhanced Dynamic Graph Autoencoder for Occupational Skill Demand Forecasting

  • Xi Chen
  • Chuan Qin
  • Zhigaoyuan Wang
  • Yihang Cheng
  • Chao Wang
  • Hengshu Zhu
  • Hui Xiong

Occupational skill demand (OSD) forecasting seeks to predict dynamic skill demand specific to occupations, which is beneficial for employees and employers to grasp occupational nature and maintain a competitive edge in the rapidly evolving labor market. Although recent research has proposed data-driven techniques for forecasting skill demand, the focus has remained predominantly on overall trends rather than occupational granularity. In this paper, we propose a novel Pre-training Enhanced Dynamic Graph Autoencoder (Pre-DyGAE), forecasting skill demand from an occupational perspective. Specifically, we aggregate job descriptions (JDs) by occupation and segment them into several timestamps. Subsequently, in the initial timestamps, we pre-train a graph autoencoder (GAE), consisting of a semantically-aware cross-attention enhanced uncertainty-aware encoder and decoders for link prediction and edge regression, to achieve graph reconstruction. In particular, we utilize contrastive learning on skill co-occurrence clusters to address data sparsity, and a unified Tweedie and ranking loss for predicting the imbalanced distribution. Afterward, we incorporate an adaptive temporal encoding unit and a temporal shift module into the GAE to achieve a dynamic GAE (DyGAE). Furthermore, we fine-tune the DyGAE with a two-stage optimization strategy and infer future representations. Extensive experiments on four real-world datasets validate the effectiveness of Pre-DyGAE compared with state-of-the-art baselines.

ICML Conference 2024 Conference Paper

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

  • Fangyun Wei
  • Xi Chen
  • Lin Luo

Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called “Real-world questions” (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards like Alpaca Eval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.

NeurIPS Conference 2024 Conference Paper

Slicing Vision Transformer for Flexible Inference

  • Yitian Zhang
  • Huseyin Coskun
  • Xu Ma
  • Huan Wang
  • Ke Ma
  • Xi Chen
  • Derek H. Hu
  • Yun Fu

Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit in an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically the sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViT to vary in width. Concretely, Scala activates several subnets during training, introduces Isolated Activation to disentangle the smallest sub-network from other subnets, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validations on different tasks demonstrate that with only one-shot training, Scala learns slimmable representation without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.

ICLR Conference 2024 Conference Paper

Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets

  • Yihuan Mao
  • Chengjie Wu
  • Xi Chen
  • Hao Hu 0006
  • Ji Jiang
  • Tianze Zhou
  • Tangjie Lv
  • Changjie Fan

Previous literature on policy diversity in reinforcement learning (RL) either focuses on the online setting or ignores the policy performance. In contrast, offline RL, which aims to learn high-quality policies from batched data, has yet to fully leverage the intrinsic diversity of the offline dataset. Addressing this dichotomy and aiming to balance quality and diversity poses a significant challenge to extant methodologies. This paper introduces a novel approach, termed Stylized Offline RL (SORL), which is designed to extract high-performing, stylistically diverse policies from a dataset characterized by distinct behavioral patterns. Drawing inspiration from the venerable Expectation-Maximization (EM) algorithm, SORL innovatively alternates between policy learning and trajectory clustering, a mechanism that promotes policy diversification. To further augment policy performance, we introduce advantage-weighted style learning into the SORL framework. Experimental evaluations across multiple environments demonstrate the significant superiority of SORL over previous methods in extracting high-quality policies with diverse behaviors. A case in point is that SORL successfully learns strong policies with markedly distinct playing patterns from a real-world human dataset of a popular basketball video game "Dunk City Dynasty."

NeurIPS Conference 2024 Conference Paper

SyncVIS: Synchronized Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite harvesting remarkable progress, existing works follow asynchronous designs, which model video sequences via either video-level queries only or adopting query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former attempts to promote the mutual learning of frame- and video-level embeddings with each other and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022, and OVIS benchmarks, and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.

IS Journal 2024 Journal Article

UCRI: A Unified Conversational Recommender System Based on Item-Guided Conditional Generation

  • Xi Chen
  • Yuehai Wang
  • Jianyi Yang

In recent years, great efforts have been made to develop a conversational recommender system (CRS). However, existing works always ignore the incorporation of the recommended items and the generated replies. This causes the performance of the recommendation to degrade in the conversations. To solve this problem, we propose a novel framework called unified conversational recommender system based on item-guided conditional generation (UCRI) to fuse the recommender module and the dialogue module seamlessly. UCRI captures the semantic similarity between the recommended items and the candidate words to realize the item-guided conditional generation. Besides, we further design the weight control mechanism and the recommender gating mechanism to make accurate recommendations in the conversations. Our approach can explicitly generate the recommended items in the replies and encourage the model to generate the related context for the items. Extensive experiments on the benchmark dataset REcommendations through DIALog show that our model achieves the best performance on both item recommendation and reply generation tasks.

TMLR Journal 2024 Journal Article

Understanding the Role of Layer Normalization in Label-Skewed Federated Learning

  • Guojun Zhang
  • Mahdi Beitollahi
  • Alex Bie
  • Xi Chen

Layer normalization (LN) is a widely adopted deep learning technique especially in the era of foundation models. Recently, LN has been shown to be surprisingly effective in federated learning (FL) with non-i.i.d. data. However, exactly why and how it works remains mysterious. In this work, we reveal the profound connection between layer normalization and the label shift problem in federated learning. To understand layer normalization better in FL, we identify the key contributing mechanism of normalization methods in FL, called feature normalization (FN), which applies normalization to the latent feature representation before the classifier head. Although LN and FN do not improve expressive power, they control feature collapse and local overfitting to heavily skewed datasets, and thus accelerates global training. Empirically, we show that normalization leads to drastic improvements on standard benchmarks under extreme label shift. Moreover, we conduct extensive ablation studies to understand the critical factors of layer normalization in FL. Our results verify that FN is an essential ingredient inside LN to significantly improve the convergence of FL while remaining robust to learning rate choices, especially under extreme label shift where each client has access to few classes.

ICML Conference 2024 Conference Paper

Understanding the Training Speedup from Sampling with Approximate Losses

  • Rudrajit Das
  • Xi Chen
  • Bertram Ieong
  • Parikshit Bansal
  • Sujay Sanghavi

It is well known that selecting samples with large losses/gradients can significantly reduce the number of training steps. However, the selection overhead is often too high to yield any meaningful gains in terms of overall training time. In this work, we focus on the greedy approach of selecting samples with large approximate losses instead of exact losses in order to reduce the selection overhead. For smooth convex losses, we show that such a greedy strategy can converge to a constant factor of the minimum value of the average loss in fewer iterations than the standard approach of random selection. We also theoretically quantify the effect of the approximation level. We then develop SIFT, which uses early exiting to obtain approximate losses with an intermediate layer’s representations for sample selection. We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT base model, and show significant gains (in terms of training hours and number of backpropagation steps) without any optimized implementation over vanilla training. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes ~43 hours compared to ~57 hours of vanilla training.

NeurIPS Conference 2024 Conference Paper

Untrained Neural Nets for Snapshot Compressive Imaging: Theory and Algorithms

  • Mengyu Zhao
  • Xi Chen
  • Xin Yuan
  • Shirin Jalali

Snapshot compressive imaging (SCI) recovers high-dimensional (3D) data cubes from a single 2D measurement, enabling diverse applications like video and hyperspectral imaging to go beyond standard techniques in terms of acquisition speed and efficiency. In this paper, we focus on SCI recovery algorithms that employ untrained neural networks (UNNs), such as deep image prior (DIP), to model source structure. Such UNN-based methods are appealing as they have the potential of avoiding the computationally intensive retraining required for different source models and different measurement scenarios. We first develop a theoretical framework for characterizing the performance of such UNN-based methods. The theoretical framework, on the one hand, enables us to optimize the parameters of data-modulating masks, and on the other hand, provides a fundamental connection between the number of data frames that can be recovered from a single measurement to the parameters of the untrained NN. We also employ the recently proposed bagged-deep-image-prior (bagged-DIP) idea to develop SCI Bagged Deep Video Prior (SCI-BDVP) algorithms that address the common challenges faced by standard UNN solutions. Our experimental results show that in video SCI our proposed solution achieves state-of-the-art among UNN methods, and in the case of noisy measurements, it even outperforms supervised solutions. Code is publicly available at https://github.com/Computational-Imaging-RU/SCI-BDVP.

AAAI Conference 2024 Conference Paper

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

  • Xuesong Nie
  • Yunfeng Yan
  • Siyuan Li
  • Cheng Tan
  • Xi Chen
  • Haoyuan Jin
  • Zhihang Zhu
  • Stan Z. Li

Spatiotemporal predictive learning is a paradigm that empowers models to learn spatial and temporal patterns by predicting future frames from past frames in an unsupervised manner. This method typically uses recurrent units to capture long-term dependencies, but these units often come with high computational costs and limited performance in real-world scenes. This paper presents an innovative Wavelet-based SpatioTemporal (WaST) framework, which extracts and adaptively controls both low and high-frequency components at image and feature levels via 3D discrete wavelet transform for faster processing while maintaining high-quality predictions. We propose a Time-Frequency Aware Translator uniquely crafted to efficiently learn short- and long-range spatiotemporal information by individually modeling spatial frequency and temporal variations. Meanwhile, we design a wavelet-domain High-Frequency Focal Loss that effectively supervises high-frequency variations. Extensive experiments across various real-world scenarios, such as driving scene prediction, traffic flow prediction, human motion capture, and weather forecasting, demonstrate that our proposed WaST achieves state-of-the-art performance over various spatiotemporal prediction methods.

NeurIPS Conference 2024 Conference Paper

Zero-shot Image Editing with Reference Imitation

  • Xi Chen
  • Yutong Feng
  • Mengting Chen
  • Yiyang Wang
  • Shilong Zhang
  • Yu Liu
  • Yujun Shen
  • Hengshuang Zhao

Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe how the edited image should look. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., related pictures they come across online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.

ICML Conference 2023 Conference Paper

2D-Shapley: A Framework for Fragmented Data Valuation

  • Zhihong Liu
  • Hoang Anh Just
  • Xiangyu Chang
  • Xi Chen
  • Ruoxi Jia 0001

Data valuation—quantifying the contribution of individual data sources to certain predictive behaviors of a model—is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources with a shared feature or sample space. How to valuate fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
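
As background for the fragmented 2-D setting, the classic (1-D) data Shapley value averages each source's marginal contribution over all subsets. The sketch below computes it by exact enumeration with a hypothetical toy utility function; it is not the paper's 2D-Shapley construction, which additionally handles fragments spanning both the sample and feature axes:

```python
# Exact Shapley values over data sources by subset enumeration.
# `utility` is a toy stand-in for model performance on a subset of sources.
from itertools import combinations
from math import factorial

def shapley_values(sources, utility):
    n = len(sources)
    values = {s: 0.0 for s in sources}
    for s in sources:
        others = [t for t in sources if t != s]
        for k in range(n):
            for subset in combinations(others, k):
                # Weight of a size-k coalition in the Shapley formula.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                values[s] += weight * (utility(set(subset) | {s}) - utility(set(subset)))
    return values

# Toy utility: each source contributes its own value; "a" and "b" synergize.
def utility(subset):
    base = {"a": 3.0, "b": 1.0, "c": 2.0}
    bonus = 2.0 if {"a", "b"} <= subset else 0.0
    return sum(base[s] for s in subset) + bonus

vals = shapley_values(["a", "b", "c"], utility)
# Efficiency axiom: the values sum to the utility of the full dataset.
assert abs(sum(vals.values()) - utility({"a", "b", "c"})) < 1e-9
```

Source "c" participates in no synergy, so its value equals its standalone contribution of 2.0, while the bonus is split between "a" and "b"; the 2-D extension must additionally define what "removing" a fragment of a matrix even means.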

JMLR Journal 2023 Journal Article

Boosting Multi-agent Reinforcement Learning via Contextual Prompting

  • Yue Deng
  • Zirui Wang
  • Xi Chen
  • Yin Zhang

Multi-agent reinforcement learning (MARL) has gained increasing attention due to its ability to enable multiple agents to learn policies simultaneously. However, the bootstrapping error arises from the difference between the estimated Q value and the real discounted return and accumulates backward through dynamic programming iterations. This error can become even larger as the number of agents increases, due to the exponential growth of agent interactions, resulting in infeasible learning time and incorrect actions during early training steps. To address this challenge, we observe that previously collected trajectories are useful contexts, model them using a contextual predictor to yield the next action and observation, and use the contextual predictor to replace the Q value function or utility function during the early training phase. Furthermore, we employ a joint-action sampling mechanism to restrict the action space and dynamically select policies from the vanilla utility network and those from the contextual trajectory predictor to perform rollout processes. By reasonably constraining the action space and rollout process, we can significantly accelerate the algorithm training process. Our framework applies to various value-based MARL methods in both centralized training decentralized execution (CTDE) and non-CTDE scenarios where agents are accessible (non-accessible) to global states during the training process. Experimental results on three tasks, Spread, Tag, and Reference, from the Particle World Environment (PWE) show that our framework significantly accelerates the training process of existing state-of-the-art CTDE and non-CTDE MARL methods, while also competing with or outperforming their original versions.

AAAI Conference 2023 Conference Paper

FreeEnricher: Enriching Face Landmarks without Additional Cost

  • Yangyu Huang
  • Xi Chen
  • Jongyoo Kim
  • Hao Yang
  • Chong Li
  • Jiaolong Yang
  • Dong Chen

Recent years have witnessed significant growth in face alignment. Though dense facial landmarks are in high demand in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich landmark density using existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. Then, we propose a weakly-supervised idea of learning the refinement ability on original sparse landmarks and adapting this ability to enriched dense landmarks. Meanwhile, several operators are devised and organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to existing face alignment networks. To evaluate our method, we manually label the dense landmarks on the 300W testset. Our method yields state-of-the-art accuracy not only on the newly-constructed dense 300W testset but also on the original sparse 300W and WFLW testsets without additional cost.

FOCS Conference 2023 Conference Paper

Memory-Query Tradeoffs for Randomized Convex Optimization

  • Xi Chen
  • Binghui Peng

We show that any randomized first-order algorithm which minimizes a d-dimensional, 1-Lipschitz convex function over the unit ball must either use $\Omega\left(d^{2-\delta}\right)$ bits of memory or make $\Omega\left(d^{1+\delta / 6-o(1)}\right)$ queries, for any constant $\delta \in(0, 1)$ and when the precision $\epsilon$ is quasipolynomially small in d. Our result implies that cutting plane methods, which use $\tilde{O}\left(d^{2}\right)$ bits of memory and $\tilde{O}(d)$ queries, are Pareto-optimal among randomized first-order algorithms, and quadratic memory is required to achieve optimal query complexity for convex optimization.

FOCS Conference 2023 Conference Paper

New Lower Bounds for Adaptive Tolerant Junta Testing

  • Xi Chen
  • Shyamal Patel

We prove a $k^{-\Omega\left(\log \left(\varepsilon_{2}-\varepsilon_{1}\right)\right)}$ lower bound for adaptively testing whether a Boolean function is $\varepsilon_{1}$-close to or $\varepsilon_{2}$-far from k-juntas. Our results provide the first superpolynomial separation between tolerant and non-tolerant testing for a natural property of Boolean functions under the adaptive setting. Furthermore, our techniques generalize to show that adaptively testing whether a function is $\varepsilon_{1}$-close to a k-junta or $\varepsilon_{2}$-far from $(k+o(k))$-juntas cannot be done with $\mathrm{poly}(k, (\varepsilon_{2}-\varepsilon_{1})^{-1})$ queries. This is in contrast to an algorithm by Iyer, Tal and Whitmeyer [CCC 2021] which uses $\mathrm{poly}(k, (\varepsilon_{2}-\varepsilon_{1})^{-1})$ queries to test whether a function is $\varepsilon_{1}$-close to a k-junta or $\varepsilon_{2}$-far from $O(k /(\varepsilon_{2}-\varepsilon_{1})^{2})$-juntas.

TMLR Journal 2023 Journal Article

Proportional Fairness in Federated Learning

  • Guojun Zhang
  • Saber Malekmohammadi
  • Xi Chen
  • Yaoliang Yu

With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL, and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.

AAAI Conference 2023 Conference Paper

Supervised Contrastive Few-Shot Learning for High-Frequency Time Series

  • Xi Chen
  • Cheng Ge
  • Ming Wang
  • Jin Wang

Significant progress has been made in representation learning, especially with the recent success of self-supervised contrastive learning. However, for time series with less intuitive or semantic meaning, sampling bias may be inevitably encountered in unsupervised approaches. Although supervised contrastive learning has shown superior performance by leveraging label information, it may also suffer from class collapse. In this study, we consider a realistic scenario in industry with limited annotation information available. A supervised contrastive framework is developed for high-frequency time series representation and classification, wherein a novel variant of the supervised contrastive loss is proposed to include multiple augmentations while inducing spread within each class. Experiments on four mainstream public datasets as well as a series of sensitivity and ablation analyses demonstrate that the learned representations are effective and robust compared with direct supervised learning and self-supervised learning, notably under the minimal few-shot situation.

NeurIPS Conference 2023 Conference Paper

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

  • Rongkun Zheng
  • Lu Qi
  • Xi Chen
  • Yi Wang
  • Kun Wang
  • Yu Qiao
  • Hengshuang Zhao

Training on large-scale datasets can boost the performance of video instance segmentation, but the annotated datasets for VIS are hard to scale up due to the high labor cost. What we possess are numerous isolated field-specific datasets; thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity. However, due to the heterogeneity in category space, as mask precision increases with the data volume, simply utilizing multiple datasets will dilute the attention of models across different taxonomies. Thus, increasing the data scale and enriching the taxonomy space while improving classification precision is important. In this work, we analyze that providing extra taxonomy information can help models concentrate on specific taxonomy, and propose our model named Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this vital challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all these benchmarks. These appealing and encouraging results demonstrate the effectiveness and generality of our proposed approach. The code and trained models will be publicly available.

NeurIPS Conference 2023 Conference Paper

Uni3DETR: Unified 3D Detection Transformer

  • Zhenyu Wang
  • Ya-Li Li
  • Xi Chen
  • Hengshuang Zhao
  • Shengjin Wang

Existing point cloud based 3D detectors are designed for particular scenes, either indoor or outdoor ones. Because of the substantial differences in object distribution and point density within point clouds collected from various environments, coupled with the intricate nature of 3D metrics, there is still a lack of a unified network architecture that can accommodate diverse scenes. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ the detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and is resistant to discrepancies in the data. We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the $xy$ and $z$ space. Extensive experiments validate that Uni3DETR exhibits excellent performance consistently on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on some particular datasets but suffer a substantial degradation on different scenes, Uni3DETR demonstrates strong generalization ability under heterogeneous conditions (Fig. 1).

JMLR Journal 2022 Journal Article

Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling

  • Xi Chen
  • Bo Jiang
  • Tianyi Lin
  • Shuzhong Zhang

In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our approach to solving this model uses the framework of cubic regularization of Newton's method. As is well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon such a sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $O(\epsilon^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods of Jiang et al. (2020) using the full Hessian information. Interestingly, we also show that in the worst-case scenario our algorithm still achieves an $O(\epsilon^{-5/6}\log(\epsilon^{-1}))$ iteration complexity bound. The proof techniques are new to our knowledge and can be of independent interest. Experimental results on regularized logistic regression problems demonstrate a clear effect of acceleration on several real data sets.
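
The sub-sampling idea itself is simple to illustrate. Below is a minimal 1-D toy (hypothetical quadratic components, not the paper's algorithm): when the objective is a sum of n components, the rescaled second derivative of a uniform sub-sample approximates the full second derivative:

```python
# Hessian sub-sampling in 1-D. Each component f_i(x) = 0.5 * a_i * x^2
# has constant second derivative a_i, so the full Hessian (a scalar here)
# is sum(a_i), and a uniform sub-sample of size m, rescaled by n/m, is an
# unbiased estimate of it.
import random

random.seed(0)
n = 10_000
a = [random.uniform(0.5, 1.5) for _ in range(n)]
full_hessian = sum(a)

m = 500  # sub-sample size
sample = random.sample(range(n), m)
estimate = (n / m) * sum(a[i] for i in sample)

# The rescaled sub-sampled Hessian is close to the full one.
assert abs(estimate - full_hessian) / full_hessian < 0.1
```

In the matrix case the same estimator is used inside each cubic-regularized Newton step, trading an O(n) Hessian evaluation for an O(m) one at the cost of a controllable approximation error.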

SODA Conference 2022 Conference Paper

Distribution-free Testing for Halfspaces (Almost) Requires PAC Learning

  • Xi Chen
  • Shyamal Patel

It is well known that halfspaces over ℝⁿ and {0, 1}ⁿ are PAC-learnable with Θ(n) samples. Recently Blais et al. [4] showed that even the easier task of distribution-free sample-based testing requires Ω(n/log n) samples for halfspaces. In this work we study the distribution-free testing of halfspaces with queries, for which we show that the complexity remains to be. Indeed we prove the following stronger tradeoff result: any distribution-free testing algorithm for halfspaces over {0, 1}ⁿ that receives k samples must make queries on the input function, when k satisfies n^{0.99} ≤ k ≤ O(n/log³ n). For halfspaces over ℝⁿ we show that any algorithm that makes a finite number of queries must draw Ω(n/log n) samples.

IJCAI Conference 2022 Conference Paper

Dynamic Car Dispatching and Pricing: Revenue and Fairness for Ridesharing Platforms

  • Zishuo Zhao
  • Xi Chen
  • Xuefeng Zhang
  • Yuan Zhou

A major challenge for ridesharing platforms is to guarantee profit and fairness simultaneously, especially in the presence of misaligned incentives of drivers and riders. We focus on the dispatching-pricing problem to maximize the total revenue while keeping both drivers and riders satisfied. We study the computational complexity of the problem, provide a novel two-phased pricing solution with revenue and fairness guarantees, extend it to stochastic settings, and develop a dynamic (a.k.a. learning-while-doing) algorithm that actively collects data to learn the demand distribution during the scheduling process. We also conduct extensive experiments to demonstrate the effectiveness of our algorithms.

AILAW Journal 2022 Journal Article

How to justify a backing’s eligibility for a warrant: the justification of a legal interpretation in a hard case

  • Shiyang Yu
  • Xi Chen

The Toulmin model has been proved useful in law and argumentation theory. This model describes the basic process in justifying a claim, which comprises six elements, i.e., claim (C), data (D), warrant (W), backing (B), qualifier (Q), and rebuttal (R). Specifically, in justifying a claim, one must put forward 'data' and a 'warrant', whereas the latter is authorized by 'backing'. The force of the 'claim' being justified is represented by the 'qualifier', and the condition under which the claim cannot be justified is represented as the 'rebuttal'. To further improve the model, (Goodnight, Informal Logic 15: 41–52, 1993) points out that the selection of a backing needs justification, which he calls legitimation justification. However, how such justification is constituted has not yet been clarified. To identify legitimation justification, we separate it into two parts. One justifies a backing's eligibility (legitimation justification 1; LJ1); the other justifies its superiority over other eligible backings (legitimation justification 2; LJ2). In this paper, we focus on LJ1 and apply it to the legal justification (of judgements) in hard cases for illustration purposes. We submit that LJ1 refers to the justification of the legal interpretation of a norm by its backing, which can be further separated into several orderable subjustifications. Taking the subjustification of a norm's existence as an example, we show how it would be influenced by different positions in the philosophy of law. Taking the position of the theory of natural law, such subjustification is presented and evaluated. This paper aims not only to inform ongoing theoretical efforts to apply the Toulmin model in the legal field, but also to clarify the process of justifying legal judgments in hard cases. It also offers background information for the possible construction of related AI systems. In our future work, LJ2 and other subjustifications of LJ1 will be discussed.

TMLR Journal 2022 Journal Article

Interpretable Node Representation with Attribute Decoding

  • Xiaohui Chen
  • Xi Chen
  • Liping Liu

Variational Graph Autoencoders (VGAEs) are powerful models for unsupervised learning of node representations from graph data. In this work, we make a systematic analysis of modeling node attributes in VGAEs and show that attribute decoding is important for node representation learning. We further propose a new learning model, interpretable NOde Representation with Attribute Decoding (NORAD). The model encodes node representations in an interpretable way: node representations capture community structures in the graph and the relationship between communities and node attributes. We further propose a rectifying procedure to refine the representations of isolated nodes, which improves the quality of the representations of these nodes. Our empirical results demonstrate the advantage of the proposed model when learning graph data in an interpretable manner.

NeurIPS Conference 2022 Conference Paper

LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning

  • Xi Chen
  • Ali Ghadirzadeh
  • Tianhe Yu
  • Jianhao Wang
  • Alex Yuan Gao
  • Wenzhe Li
  • Liang Bin
  • Chelsea Finn

Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new samples. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets often contain action distributions with multiple modes and, in some cases, lack a sufficient number of high-reward trajectories, which renders offline policy training inefficient. To address this challenge, we propose to leverage a latent-variable generative model to represent high-advantage state-action pairs, leading to better adherence to the data distributions that contribute to solving the task, while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.

JMLR Journal 2022 Journal Article

No Weighted-Regret Learning in Adversarial Bandits with Delays

  • Ilai Bistritz
  • Zhengyuan Zhou
  • Xi Chen
  • Nicholas Bambos
  • Jose Blanchet

Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have “no weighted-regret” in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
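
To make the delayed-feedback setting concrete, here is a toy EXP3 loop with a fixed delay d (hypothetical parameters; a simplified sketch without the exploration mixing or the paper's doubling trick, and with i.i.d. rather than adversarial losses): the importance-weighted loss of round t only updates the weights at round t + d.

```python
# EXP3 with delayed bandit feedback (toy illustration).
import math, random

random.seed(1)
K, T, d, eta = 3, 3000, 10, 0.05
mean_loss = [0.2, 0.5, 0.6]  # arm 0 has the lowest expected loss
weights = [1.0] * K
pending = []  # (round at which feedback arrives, arm, importance-weighted loss)

def probs():
    z = sum(weights)
    return [w / z for w in weights]

for t in range(T):
    p = probs()
    arm = random.choices(range(K), weights=p)[0]
    loss = 1.0 if random.random() < mean_loss[arm] else 0.0
    # Feedback for this round is only revealed d rounds later.
    pending.append((t + d, arm, loss / p[arm]))
    # Apply every pending update whose delay has elapsed.
    while pending and pending[0][0] <= t:
        _, a, est = pending.pop(0)
        weights[a] *= math.exp(-eta * est)

assert probs()[0] > 0.9  # play still concentrates on the best arm
```

The delay only postpones the multiplicative updates; since the loss estimates remain unbiased, the weight of the best arm eventually dominates, which is the intuition behind regret bounds that degrade gracefully in the total delay $D$.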

JBHI Journal 2022 Journal Article

Understanding Patient Query With Weak Supervision From Doctor Response

  • Xiaoming Shi
  • Sendong Zhao
  • Yuxuan Wang
  • Xi Chen
  • Ziheng Zhang
  • Yefeng Zheng
  • Wanxiang Che

Currently, the need for high-quality dialogue systems that assist users in conducting self-diagnosis is rapidly increasing. Slot filling for automatic diagnosis, which converts medical queries into structured representations, plays an important role in diagnostic dialogue systems. However, the lack of high-quality datasets limits the performance of slot filling. Meanwhile, medical communities like AskAPatient usually contain multiple rounds of diagnostic dialogue with colloquial input and professional responses from doctors. Therefore, the diagnostic dialogue data in medical communities can be utilized to address the main challenges in slot filling. This paper proposes a two-step training framework to make full use of these unlabeled dialogue data in medical communities. To promote further research, we provide a Chinese dataset with 2,652 annotated samples and a large amount of unlabeled samples. Experimental results on the dataset demonstrate the effectiveness of the proposed method, with an increase of 6.32% in Micro F1 and 8.20% in Macro F1 on average over strong baselines.

ICML Conference 2021 Conference Paper

Adversarial Combinatorial Bandits with General Non-linear Reward Functions

  • Yanjun Han
  • Yining Wang
  • Xi Chen

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde\Theta_{d}(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d< K$, and $\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound $O(\sqrt{\mathrm{poly}(N, K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the $\binom{N}{K}$ assortments as independent.

NeurIPS Conference 2021 Conference Paper

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

  • Nan Ding
  • Xi Chen
  • Tomer Levinboim
  • Sebastian Goodman
  • Radu Soricut

Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.

IJCAI Conference 2021 Conference Paper

Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining

  • Ningyu Zhang
  • Shumin Deng
  • Xu Cheng
  • Xi Chen
  • Yichi Zhang
  • Wei Zhang
  • Huajun Chen

Previous research has demonstrated the power of leveraging prior knowledge to improve the performance of deep models in natural language processing. However, traditional methods neglect the fact that redundant and irrelevant knowledge exists in external knowledge bases. In this study, we launched an in-depth empirical investigation into downstream tasks and found that knowledge-enhanced approaches do not always exhibit satisfactory improvements. To this end, we investigate the fundamental reasons for ineffective knowledge infusion and present selective injection for language pretraining, which constitutes a model-agnostic method and is readily pluggable into previous approaches. Experimental results on benchmark datasets demonstrate that our approach can enhance state-of-the-art knowledge injection methods.

NeurIPS Conference 2021 Conference Paper

Generalized DataWeighting via Class-Level Gradient Manipulation

  • Can Chen
  • Shuhao Zheng
  • Xi Chen
  • Erqun Dong
  • Xue (Steve) Liu
  • Hao Liu
  • Dejing Dou

Label noise and class imbalance are two major issues coexisting in real-world datasets. To alleviate the two issues, state-of-the-art methods reweight each instance by leveraging a small amount of clean and unbiased data. Yet, these methods overlook class-level information within each instance, which can be further utilized to improve performance. To this end, in this paper, we propose Generalized Data Weighting (GDW) to simultaneously mitigate label noise and class imbalance by manipulating gradients at the class level. To be specific, GDW unrolls the loss gradient to class-level gradients by the chain rule and reweights the flow of each gradient separately. In this way, GDW achieves remarkable performance improvement on both issues. Aside from the performance gain, GDW efficiently obtains class-level weights without introducing any extra computational cost compared with instance weighting methods. Specifically, GDW performs a gradient descent step on class-level weights, which only relies on intermediate gradients. Extensive experiments in various settings verify the effectiveness of GDW. For example, GDW outperforms state-of-the-art methods by $2.56\%$ under the $60\%$ uniform noise setting in CIFAR10. Our code is available at https://github.com/GGchen1997/GDW-NIPS2021.
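
The "unroll the loss gradient to class-level gradients" idea can be illustrated on a single example with softmax cross-entropy, whose gradient w.r.t. the logits decomposes per class as (p_c - y_c). The sketch below uses hypothetical fixed class weights rather than GDW's learned ones, and scales each class component separately:

```python
# Class-level gradient reweighting (toy, single example).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def class_weighted_grad(logits, target, class_weights):
    p = softmax(logits)
    y = [1.0 if c == target else 0.0 for c in range(len(logits))]
    # Standard cross-entropy gradient per class is (p_c - y_c);
    # scale each class's component of the gradient flow by its weight.
    return [w * (pc - yc) for w, pc, yc in zip(class_weights, p, y)]

logits, target = [2.0, 0.5, 0.1], 0
uniform = class_weighted_grad(logits, target, [1.0, 1.0, 1.0])
downweighted = class_weighted_grad(logits, target, [1.0, 0.1, 1.0])
# Down-weighting class 1 shrinks only its gradient component.
assert abs(downweighted[1]) < abs(uniform[1])
assert downweighted[0] == uniform[0] and downweighted[2] == uniform[2]
```

Instance weighting would scale all three components by one scalar; the class-level decomposition is what lets a noisy or over-represented class be suppressed without dampening the signal for the other classes.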

JMLR Journal 2021 Journal Article

Shape-Enforcing Operators for Generic Point and Interval Estimators of Functions

  • Xi Chen
  • Victor Chernozhukov
  • Ivan Fernandez-Val
  • Scott Kostyshak
  • Ye Luo

A common problem in econometrics, statistics, and machine learning is to estimate and make inference on functions that satisfy shape restrictions. For example, distribution functions are nondecreasing and range between zero and one, height growth charts are nondecreasing in age, and production functions are nondecreasing and quasi-concave in input quantities. We propose a method to enforce these restrictions ex post on generic unconstrained point and interval estimates of the target function by applying functional operators. The interval estimates could be either frequentist confidence bands or Bayesian credible regions. If an operator has reshaping, invariance, order-preserving, and distance-reducing properties, the shape-enforced point estimates are closer to the target function than the original point estimates and the shape-enforced interval estimates have greater coverage and shorter length than the original interval estimates. We show that these properties hold for six different operators that cover commonly used shape restrictions in practice: range, convexity, monotonicity, monotone convexity, quasi-convexity, and monotone quasi-convexity, with the latter two restrictions being of paramount importance. The main attractive property of the post-processing approach is that it works in conjunction with any generic initial point or interval estimate, obtained using any of parametric, semi-parametric or nonparametric learning methods, including recent methods that are able to exploit either smoothness, sparsity, or other forms of structured parsimony of target functions. The post-processed point and interval estimates automatically inherit and provably improve these properties in finite samples, while also enforcing qualitative shape restrictions brought by scientific reasoning. We illustrate the results with two empirical applications to the estimation of a height growth chart for infants in India and a production function for chemical firms in China. 
© JMLR 2021.

IJCAI Conference 2021 Conference Paper

Unsupervised Knowledge Graph Alignment by Probabilistic Reasoning and Semantic Embedding

  • Zhiyuan Qi
  • Ziheng Zhang
  • Jiaoyan Chen
  • Xi Chen
  • Yuejia Xiang
  • Ningyu Zhang
  • Yefeng Zheng

Knowledge Graph (KG) alignment aims to discover the mappings (i.e., equivalent entities, relations, and others) between two KGs. Existing methods can be divided into embedding-based models and conventional systems based on reasoning and lexical matching. The former compute the similarity of entities via their cross-KG embeddings, but they usually rely on an ideal supervised learning setting for good performance and lack appropriate reasoning to avoid logically wrong mappings; the latter address the reasoning issue but are poor at exploiting KG graph structures and entity contexts. In this study, we aim to combine the two solutions and propose an iterative framework named PRASE, based on probabilistic reasoning and semantic embedding. It learns KG embeddings via entity mappings from a probabilistic reasoning system named PARIS, and feeds the resulting entity mappings and embeddings back into PARIS for augmentation. The PRASE framework is compatible with different embedding-based models, and our experiments on multiple datasets demonstrate its state-of-the-art performance.

JMLR Journal 2021 Journal Article

Variance Reduced Median-of-Means Estimator for Byzantine-Robust Distributed Inference

  • Jiyuan Tu
  • Weidong Liu
  • Xiaojun Mao
  • Xi Chen

This paper develops an efficient distributed inference algorithm that is robust against a moderate fraction of Byzantine nodes, namely arbitrary and possibly adversarial machines in a distributed learning system. In robust statistics, the median-of-means (MOM) has been a popular approach to hedge against Byzantine failures due to its ease of implementation and computational efficiency. However, the MOM estimator falls short in statistical efficiency. The first main contribution of the paper is to propose a variance-reduced median-of-means (VRMOM) estimator, which improves statistical efficiency over the vanilla MOM estimator while remaining computationally as efficient as the MOM. Based on the proposed VRMOM estimator, we develop a general distributed inference algorithm that is robust against Byzantine failures. Theoretically, our distributed algorithm achieves a fast convergence rate with only a constant number of rounds of communication. We also provide an asymptotic normality result for the purpose of statistical inference. To the best of our knowledge, this is the first normality result in the setting of Byzantine-robust distributed learning. Simulation results are presented to illustrate the effectiveness of our method.
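For reference, the vanilla median-of-means estimator that VRMOM improves upon can be sketched in a few lines (a toy single-machine illustration, not the paper's distributed algorithm):

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Vanilla MOM: split the samples into blocks, average each block,
    and take the median of the block means. A minority of corrupted
    (Byzantine) blocks cannot move the median far."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=1000)
x[:20] = 1e6                                # adversarially corrupted samples
robust = median_of_means(x, n_blocks=10)    # close to the true mean 1.0
naive = x.mean()                            # ruined by the corruption
```

The statistical inefficiency the paper targets is visible here too: each block mean uses only n/k samples, which is the slack VRMOM recovers via variance reduction.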

IJCAI Conference 2020 Conference Paper

Bayesian Decision Process for Budget-efficient Crowdsourced Clustering

  • Xiaozhou Wang
  • Xi Chen
  • Qihang Lin
  • Weidong Liu

The performance of clustering depends on an appropriately defined similarity between two items. When similarity is measured based on human perception, human workers are often employed to estimate similarity scores between items to support clustering, a procedure called crowdsourced clustering. Assuming a monetary reward is paid to a worker for each similarity score, and that the similarities between pairs and the reliability of workers vary widely, it is critical under a limited budget to wisely assign pairs of items to different workers to optimize the clustering result. We model this budget allocation problem as a Markov decision process in which item pairs are dynamically assigned to workers based on the historical similarity scores they provided. We propose an optimistic knowledge gradient policy in which the assignment of items at each stage is based on the minimum-weight K-cut defined on a similarity graph. We provide simulation studies and real data analysis to demonstrate the performance of the proposed method.

JMLR Journal 2020 Journal Article

Distributed High-dimensional Regression Under a Quantile Loss Function

  • Xi Chen
  • Weidong Liu
  • Xiaojun Mao
  • Zhuoyi Yang

This paper studies distributed estimation and support recovery for high-dimensional linear regression models with heavy-tailed noise. To deal with heavy-tailed noise whose variance can be infinite, we adopt the quantile regression loss function instead of the commonly used squared loss. However, the non-smooth quantile loss poses new challenges to high-dimensional distributed estimation in both computation and theoretical development. To address the challenge, we transform the response variable and establish a new connection between quantile regression and ordinary linear regression. We then provide a distributed estimator that is both computationally and communication-efficient, where only the gradient information is communicated at each iteration. Theoretically, we show that, after a constant number of iterations, the proposed estimator achieves a near-oracle convergence rate without any restriction on the number of machines. Moreover, we establish a theoretical guarantee for support recovery. Simulation analysis is provided to demonstrate the effectiveness of our method.

JMLR Journal 2020 Journal Article

Dynamic Assortment Optimization with Changing Contextual Information

  • Xi Chen
  • Yining Wang
  • Yuan Zhou

In this paper, we study the dynamic assortment optimization problem over a finite selling season of length $T$. At each time period, the seller offers an arriving customer an assortment of substitutable products under a cardinality constraint, and the customer makes the purchase among offered products according to a discrete choice model. Most existing work associates each product with a real-valued fixed mean utility and assumes a multinomial logit (MNL) choice model. In many practical applications, feature/contextual information of products is readily available. In this paper, we incorporate the feature information by assuming a linear relationship between the mean utility and the feature. In addition, we allow the feature information of products to change over time so that the underlying choice model can also be non-stationary. To solve the dynamic assortment optimization under this changing contextual MNL model, we need to simultaneously learn the underlying unknown coefficient and make the decision on the assortment. To this end, we develop an upper confidence bound (UCB) based policy and establish a regret bound on the order of $\tilde{O}(d\sqrt{T})$, where $d$ is the dimension of the feature and $\tilde{O}$ suppresses logarithmic dependence. We further establish a lower bound $\Omega(d\sqrt{T}/K)$, where $K$ is the cardinality constraint of an offered assortment, which is usually small. When $K$ is a constant, our policy is optimal up to logarithmic factors. In the exploitation phase of the UCB algorithm, we need to solve a combinatorial optimization problem for assortment optimization based on the learned information. We further develop an approximation algorithm and an efficient greedy heuristic. The effectiveness of the proposed policy is further demonstrated by our numerical studies.

NeurIPS Conference 2020 Conference Paper

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

  • Tianyi Lin
  • Nhat Ho
  • Xi Chen
  • Marco Cuturi
  • Michael Jordan

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is not totally unimodular when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem, since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast deterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named FastIBP, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and better than the bound of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from the accelerated alternating minimization or accelerated primal-dual adaptive gradient algorithms in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the FastIBP algorithm in practice.
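The plain IBP scheme that FastIBP accelerates fits in a few lines; below is a minimal sketch of vanilla IBP for the entropically regularized fixed-support barycenter with equal weights (the paper's deterministic acceleration step is not shown):

```python
import numpy as np

def ibp_barycenter(ps, C, eps=0.05, iters=300):
    """Vanilla iterative Bregman projection for the entropically
    regularized fixed-support barycenter of the rows of ps (equal
    weights), with shared cost matrix C on the common support."""
    m, n = ps.shape
    K = np.exp(-C / eps)                      # Gibbs kernel
    v = np.ones((m, n))
    for _ in range(iters):
        u = ps / (K @ v.T).T                  # match each input marginal
        phi = (K.T @ u.T).T
        q = np.exp(np.log(phi).mean(axis=0))  # geometric-mean projection
        v = q[None, :] / phi                  # match the common barycenter
    return q / q.sum()

# Barycenter of two point masses at the ends of [0, 1]: the mass
# concentrates around the midpoint of the grid.
grid = np.linspace(0.0, 1.0, 11)
C = (grid[:, None] - grid[None, :]) ** 2
ps = np.zeros((2, 11))
ps[0, 0] = 1.0
ps[1, -1] = 1.0
q = ibp_barycenter(ps, C)
```

Each iteration alternates Bregman projections onto the two marginal constraint sets, with the geometric mean coupling the $m$ transport problems through the shared barycenter.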

NeurIPS Conference 2020 Conference Paper

Hedging in games: Faster convergence of external and swap regrets

  • Xi Chen
  • Binghui Peng

We consider the setting where players run the Hedge algorithm or its optimistic variant \cite{syrgkanis2015fast} to play an $n$-action game repeatedly for $T$ rounds. 1) For two-player games, we show that the regret of optimistic Hedge decays at $\tilde{O}(1/T^{5/6})$, improving the previous bound $O(1/T^{3/4})$ of \cite{syrgkanis2015fast}. 2) In contrast, we show that the convergence rate of vanilla Hedge is no better than $\tilde{\Omega}(1/\sqrt{T})$, addressing an open question posed in \cite{syrgkanis2015fast}. For general $m$-player games, we show that the swap regret of each player decays at rate $\tilde{O}(m^{1/2}(n/T)^{3/4})$ when they combine optimistic Hedge with the classical external-to-internal reduction of Blum and Mansour \cite{blum2007external}. The algorithm can also be modified to achieve the same rate against itself and a rate of $\tilde{O}(\sqrt{n/T})$ against adversaries. Via standard connections, our upper bounds also imply faster convergence to coarse correlated equilibria in two-player games and to correlated equilibria in multiplayer games.
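A minimal sketch of the vanilla Hedge update analyzed above (the optimistic variant additionally uses the previous round's loss vector as a prediction, which is omitted here):

```python
import numpy as np

def hedge_regret(losses, eta=0.1):
    """Run Hedge (multiplicative weights) over the rows of `losses`
    (one loss vector in [0, 1]^n per round) and return the external
    regret against the best fixed action in hindsight."""
    T, n = losses.shape
    w = np.ones(n)
    cum = 0.0
    for t in range(T):
        p = w / w.sum()                # play the normalized weights
        cum += float(p @ losses[t])    # expected loss this round
        w *= np.exp(-eta * losses[t])  # multiplicative update
    return cum - float(losses.sum(axis=0).min())

# Against a fixed adversary favoring action 0, regret stays far
# below the T/2 incurred by uniform play.
losses = np.tile(np.array([0.0, 1.0]), (500, 1))
reg = hedge_regret(losses)
```

In the game setting of the abstract, each player runs this update with the loss vector induced by the opponents' mixed strategies in that round.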

NeurIPS Conference 2020 Conference Paper

Information Theoretic Counterfactual Learning from Missing-Not-At-Random Feedback

  • Zifeng Wang
  • Xi Chen
  • Rui Wen
  • Shao-Lun Huang
  • Ercan Kuruoglu
  • Yefeng Zheng

Counterfactual learning for dealing with missing-not-at-random (MNAR) data is an intriguing topic in the recommendation literature, since MNAR data are ubiquitous in modern recommender systems. In contrast, missing-at-random (MAR) data, namely randomized controlled trials (RCTs), are usually required by most previous counterfactual learning methods. However, the execution of RCTs is extraordinarily expensive in practice. To circumvent the use of RCTs, we build an information-theoretic counterfactual variational information bottleneck (CVIB) as an alternative for debiased learning without RCTs. By separating the task-aware mutual information term in the original information bottleneck Lagrangian into factual and counterfactual parts, we derive a contrastive information loss and an additional output confidence penalty, which facilitate balanced learning between the factual and counterfactual domains. Empirical evaluation on real-world datasets shows that our CVIB significantly enhances both shallow and deep models, shedding light on counterfactual learning in recommendation beyond RCTs.

JMLR Journal 2020 Journal Article

On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics

  • Xi Chen
  • Simon S. Du
  • Xin T. Tong

Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. (2017) presents an analysis of the hitting time of SGLD for first- and second-order stationary points. The proof in Zhang et al. (2017) is a two-stage procedure through bounding the Cheeger constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis of the hitting times of SGLD to first- and second-order stationary points. Our analysis is straightforward, relying only on basic tools from linear algebra and probability theory. It also leads to tighter bounds than Zhang et al. (2017) and shows the explicit dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to study several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.
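The SGLD iteration whose hitting time is analyzed above is simply a gradient step plus injected Gaussian noise; a minimal sketch on a toy quadratic (illustrative only, with an assumed constant temperature, not the paper's analysis):

```python
import numpy as np

def sgld(grad, x0, step=0.01, temperature=0.01, iters=2000, seed=0):
    """Stochastic gradient Langevin dynamics:
    x <- x - step * grad(x) + sqrt(2 * step * temperature) * N(0, I).
    The injected noise lets the iterate escape saddle points and
    explore the landscape while drifting toward stationary points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    scale = np.sqrt(2.0 * step * temperature)
    for _ in range(iters):
        x = x - step * grad(x) + scale * rng.standard_normal(x.shape)
    return x

# Quadratic f(x) = ||x||^2 / 2 with gradient x: SGLD hits a
# neighborhood of the stationary point 0 and hovers there.
x_final = sgld(lambda x: x, x0=np.array([5.0, -5.0]))
```

The hitting-time question of the paper is how many of these iterations are needed before the gradient at the iterate first becomes small.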

AAAI Conference 2020 Short Paper

Optimizing the Feature Selection Process for Better Accuracy in Datasets with a Large Number of Features (Student Abstract)

  • Xi Chen
  • Afsaneh Doryab

Most feature selection methods only perform well on datasets with a relatively small set of features. In the case of large feature sets and a small number of data points, almost none of the existing feature selection methods help in achieving high accuracy. This paper proposes a novel approach that optimizes the feature selection process through the Frequent Pattern Growth algorithm, finding sets of features that appear frequently among the top features selected by the main feature selection methods. Our experimental evaluation on two datasets, containing a small and a very large number of features respectively, shows that our approach significantly improves the accuracy results on the dataset with a very large number of features.

IJCAI Conference 2020 Conference Paper

SiamBOMB: A Real-time AI-based System for Home-cage Animal Tracking, Segmentation and Behavioral Analysis

  • Xi Chen
  • Hao Zhai
  • Danqian Liu
  • Weifu Li
  • Chaoyue Ding
  • Qiwei Xie
  • Hua Han

Biologists often need to handle numerous video-based home-cage animal behavior analysis tasks that require massive workloads. Therefore, we develop an AI-based multi-species tracking and segmentation system, SiamBOMB, for real-time and automatic home-cage animal behavioral analysis. In this system, a background-enhanced Siamese-based network with replaceable modular design ensures the flexibility and generalizability of the system, and a user-friendly interface makes it convenient to use for biologists. This real-time AI system will effectively reduce the burden on biologists.

IJCAI Conference 2020 Conference Paper

Variational Learning of Bayesian Neural Networks via Bayesian Dark Knowledge

  • Gehui Shen
  • Xi Chen
  • Zhihong Deng

Bayesian neural networks (BNNs) have received more and more attention because they can model epistemic uncertainty, which is hard for conventional neural networks. Markov chain Monte Carlo (MCMC) methods and variational inference (VI) are the two mainstream approaches to Bayesian deep learning. The former is effective, but its storage cost is prohibitive since it has to save many samples of the neural network parameters. The latter is more time- and space-efficient, but the approximate variational posterior limits its performance. In this paper, we aim to combine the advantages of the two methods by distilling MCMC samples into an approximate variational posterior. On the basis of an existing distillation technique, we first propose the variational Bayesian dark knowledge method. Moreover, we propose Bayesian dark prior knowledge, a novel distillation method that treats the MCMC posterior as the prior of a variational BNN. Both proposed methods not only reduce the space overhead of the teacher model, making them scalable, but also maintain a distilled posterior distribution capable of modeling epistemic uncertainty. Experimental results show that our methods outperform the existing distillation method in predictive accuracy and uncertainty modeling.

AAAI Conference 2019 Conference Paper

Deep Cascade Multi-Task Learning for Slot Filling in Online Shopping Assistant

  • Yu Gong
  • Xusheng Luo
  • Yu Zhu
  • Wenwu Ou
  • Zhao Li
  • Muhua Zhu
  • Kenny Q. Zhu
  • Lu Duan

Slot filling is a critical task in natural language understanding (NLU) for dialog systems. State-of-the-art approaches treat it as a sequence labeling problem and adopt models such as BiLSTM-CRF. While these models work relatively well on standard benchmark datasets, they face challenges in the context of E-commerce, where the slot labels are more informative and carry richer expressions. In this work, inspired by the unique structure of E-commerce knowledge bases, we propose a novel multi-task model with cascade and residual connections, which jointly learns segment tagging, named entity tagging, and slot filling. Experiments show the effectiveness of the proposed cascade and residual structures. Our model has a 14.6% advantage in F1 score over strong baseline methods on a new Chinese E-commerce shopping assistant dataset, while achieving competitive accuracy on a standard dataset. Furthermore, an online test deployed on this dominant E-commerce platform shows a 130% improvement in the accuracy of understanding user utterances. Our model has already gone into production on the E-commerce platform.

JBHI Journal 2019 Journal Article

Detecting Alzheimer's Disease on Small Dataset: A Knowledge Transfer Perspective

  • Wei Li
  • Yifei Zhao
  • Xi Chen
  • Yang Xiao
  • Yuanyuan Qin

Computer-aided diagnosis (CAD) is an attractive topic in Alzheimer's disease (AD) research. Many algorithms are based on a relatively large training dataset. However, small hospitals are usually unable to collect sufficient training samples for robust classification. Although data sharing is expanding in scientific research, it is unclear whether a model based on one dataset is well suited to other data sources. Using a small dataset from a local hospital and a large shared dataset from the AD Neuroimaging Initiative, we conducted a heterogeneity analysis and found that different functional magnetic resonance imaging data sources show different sample distributions in feature space. In addition, we propose an effective knowledge transfer method to diminish the disparity among different datasets and improve classification accuracy on datasets with insufficient training samples. The accuracy increased by approximately 20% compared with that of a model based only on the original small dataset. The results demonstrate that the proposed approach is a novel and effective method for CAD in hospitals with only small training datasets. It addresses the challenge of limited sample size in the detection of AD, a common issue that has received inadequate attention. Furthermore, this paper sheds new light on the effective use of multi-source data for neurological disease diagnosis.

JMLR Journal 2019 Journal Article

Distributed Inference for Linear Support Vector Machine

  • Xiaozhou Wang
  • Zhuoyi Yang
  • Xi Chen
  • Weidong Liu

The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for the linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about its inferential properties, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving a simple weighted least squares problem. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire data set in a single-machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches commonly assumed in the distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator.

ICRA Conference 2019 Conference Paper

Learning From Demonstration in the Wild

  • Feryal M. P. Behbahani
  • Kyriacos Shiarlis
  • Xi Chen
  • Vitaly Kurin
  • Sudhanshu Kasewa
  • Ciprian Stirbu
  • João Gomes
  • Supratik Paul

Learning from demonstration (LfD) is useful in settings where hand-coding behaviour or a reward function is impractical. It has succeeded in a wide range of problems but typically relies on manually generated demonstrations or specially deployed sensors and has not generally been able to leverage the copious demonstrations available in the wild: those that capture behaviours that were occurring anyway using sensors that were already deployed for another purpose, e.g., traffic camera footage capturing demonstrations of natural behaviour of vehicles, cyclists, and pedestrians. We propose video to behaviour (ViBe), a new approach to learn models of behaviour from unlabelled raw video data of a traffic scene collected from a single, monocular, initially uncalibrated camera with ordinary resolution. Our approach calibrates the camera, detects relevant objects, tracks them through time, and uses the resulting trajectories to perform LfD, yielding models of naturalistic behaviour. We apply ViBe to raw videos of a traffic intersection and show that it can learn purely from videos, without additional expert knowledge.

NeurIPS Conference 2019 Conference Paper

Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

  • Ilai Bistritz
  • Zhengyuan Zhou
  • Xi Chen
  • Nicholas Bambos
  • Jose Blanchet

Consider a player that in each of $T$ rounds chooses one of $K$ arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays $\{d_t\}$ that are unknown to the player. After picking arm $a_t$ at round $t$, the player receives the cost of playing this arm $d_t$ rounds later. In cases where $t + d_t > T$, this feedback is simply missing. We prove that the EXP3 algorithm (that uses the delayed feedback upon its arrival) achieves a regret of $O\left(\sqrt{\ln K\left(KT+\sum_{t=1}^{T}d_{t}\right)}\right)$. For the case where $\sum_{t=1}^{T}d_{t}$ and $T$ are unknown, we propose a novel doubling trick for online learning with delays and prove that this adaptive EXP3 achieves a regret of $O\left(\sqrt{\ln K\left(K^{2}T+\sum_{t=1}^{T}d_{t}\right)}\right)$. We then consider a two-player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough that players no longer enjoy the "no-regret property" (e.g., where $d_t = O(t\log t)$), the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size $\eta_t$ that is not summable but is square summable, and proving a "weighted regret bound" for this general case.
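A minimal sketch of EXP3 that applies each importance-weighted cost estimate when its delayed feedback arrives (a toy illustration with assumed constant delays; the paper's adaptive doubling trick is not shown):

```python
import numpy as np

def delayed_exp3(costs, delays, eta=0.05, seed=0):
    """EXP3 with delayed feedback: play from the exponential weights,
    buffer each observed cost until its arrival round, then apply the
    importance-weighted update with the probability used at play time."""
    rng = np.random.default_rng(seed)
    T, K = costs.shape
    w = np.ones(K)
    pending = {}            # arrival round -> [(arm, cost, play prob)]
    total = 0.0
    for t in range(T):
        p = w / w.sum()
        arm = int(rng.choice(K, p=p))
        total += costs[t, arm]
        arrival = t + int(delays[t])
        if arrival < T:     # feedback past the horizon is simply lost
            pending.setdefault(arrival, []).append((arm, costs[t, arm], p[arm]))
        for a, c, pa in pending.pop(t, []):
            w[a] *= np.exp(-eta * c / pa)    # importance-weighted cost
    return total - float(costs.sum(axis=0).min())  # regret vs. best arm

# Two arms with constant costs 0.1 and 0.9, every feedback delayed
# by 5 rounds: regret stays sublinear in T = 1000.
T = 1000
costs = np.tile(np.array([0.1, 0.9]), (T, 1))
reg = delayed_exp3(costs, delays=np.full(T, 5))
```

Storing the play-time probability with each buffered observation keeps the cost estimate unbiased even though the weights have moved on by the time the feedback arrives.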

AAAI Conference 2018 Conference Paper

HodgeRank With Information Maximization for Crowdsourced Pairwise Ranking Aggregation

  • Qianqian Xu
  • Jiechao Xiong
  • Xi Chen
  • Qingming Huang
  • Yuan Yao

Recently, crowdsourcing has emerged as an effective paradigm for human-powered large-scale problem solving in various domains. However, a task requester usually has a limited budget, so it is desirable to have a policy that wisely allocates the budget to achieve better quality. In this paper, we study the principle of information maximization for active sampling strategies in the framework of HodgeRank, an approach based on the Hodge decomposition of pairwise ranking data with multiple workers. The principle exhibits two scenarios of active sampling: Fisher information maximization, which leads to unsupervised sampling based on sequential maximization of graph algebraic connectivity without considering labels; and Bayesian information maximization, which selects samples with the largest information gain from prior to posterior, giving a supervised sampling scheme involving the collected labels. Experiments show that the proposed methods boost sampling efficiency compared to traditional sampling schemes and are thus valuable for practical crowdsourcing experiments.

NeurIPS Conference 2018 Conference Paper

Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models

  • Yining Wang
  • Xi Chen
  • Yuan Zhou

In this paper we consider the dynamic assortment selection problem under an uncapacitated multinomial logit (MNL) model. By carefully analyzing a revenue potential function, we show that a trisection-based algorithm achieves an item-independent regret bound of $O(\sqrt{T \log\log T})$, which matches information-theoretic lower bounds up to iterated logarithmic terms. Our proof technique draws tools from the unimodal/convex bandit literature as well as adaptive confidence parameters in minimax multi-armed bandit problems.
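The trisection idea at the core of the algorithm is ordinary ternary search on a unimodal function (a generic sketch, not the paper's revenue-potential version, which must additionally handle noisy evaluations via confidence bounds):

```python
def trisect_max(f, lo, hi, iters=100):
    """Ternary search: locate the maximizer of a unimodal function on
    [lo, hi] by repeatedly discarding the third of the interval that
    cannot contain the peak."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1    # peak lies to the right of m1
        else:
            hi = m2    # peak lies to the left of m2
    return (lo + hi) / 2.0

# Unimodal toy objective peaking at 0.3
peak = trisect_max(lambda x: -(x - 0.3) ** 2, 0.0, 1.0)
```

Each comparison shrinks the search interval by a factor of 2/3, which is what yields the iterated-logarithmic overhead once noise is handled adaptively.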

JMLR Journal 2016 Journal Article

Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing

  • Xi Chen
  • Kevin Jiao
  • Qihang Lin

Rank aggregation based on pairwise comparisons over a set of items has a wide range of applications. Although considerable research has been devoted to the development of rank aggregation algorithms, one basic question is how to efficiently collect a large amount of high-quality pairwise comparisons for the ranking purpose. Because of the advent of many crowdsourcing services, a crowd of workers are often hired to conduct pairwise comparisons with a small monetary reward for each pair they compare. Since different workers have different levels of reliability and different pairs have different levels of ambiguity, it is desirable to wisely allocate the limited budget for comparisons among the pairs of items and workers so that the global ranking can be accurately inferred from the comparison results. To this end, we model the active sampling problem in crowdsourced ranking as a Bayesian Markov decision process, which dynamically selects item pairs and workers to improve the ranking accuracy under a budget constraint. We further develop a computationally efficient sampling policy based on knowledge gradient as well as a moment matching technique for posterior approximation. Experimental evaluations on both synthetic and real data show that the proposed policy achieves high ranking accuracy with a lower labeling cost.

NeurIPS Conference 2016 Conference Paper

Improved Techniques for Training GANs

  • Tim Salimans
  • Ian Goodfellow
  • Wojciech Zaremba
  • Vicki Cheung
  • Alec Radford
  • Xi Chen

We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: Our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.

NeurIPS Conference 2016 Conference Paper

Improved Variational Inference with Inverse Autoregressive Flow

  • Durk Kingma
  • Tim Salimans
  • Rafal Jozefowicz
  • Xi Chen
  • Ilya Sutskever
  • Max Welling

The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

NeurIPS Conference 2016 Conference Paper

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

  • Xi Chen
  • Yan Duan
  • Rein Houthooft
  • John Schulman
  • Ilya Sutskever
  • Pieter Abbeel

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

JMLR Journal 2016 Journal Article

On Bayes Risk Lower Bounds

  • Xi Chen
  • Adityanand Guntuboyina
  • Yuchen Zhang

This paper provides a general technique for lower bounding the Bayes risk of statistical estimation, applicable to arbitrary loss functions and arbitrary prior distributions. A lower bound on the Bayes risk not only serves as a lower bound on the minimax risk, but also characterizes the fundamental limit of any estimator given the prior knowledge. Our bounds are based on the notion of $f$-informativity (Csiszár, 1972), which is a function of the underlying class of probability measures and the prior. Application of our bounds requires upper bounds on the $f$-informativity, thus we derive new upper bounds on $f$-informativity which often lead to tight Bayes risk lower bounds. Our technique leads to generalizations of a variety of classical minimax bounds (e.g., generalized Fano's inequality). Our Bayes risk lower bounds can be directly applied to several concrete estimation problems, including Gaussian location models, generalized linear models, and principal component analysis for spiked covariance models. To further demonstrate the applications of our Bayes risk lower bounds to machine learning problems, we present two new theoretical results: (1) a precise characterization of the minimax risk of learning spherical Gaussian mixture models under the smoothed analysis framework, and (2) lower bounds for the Bayes risk under a natural prior for both the prediction and estimation errors for high-dimensional sparse linear regression under an improper learning setting.

NeurIPS Conference 2016 Conference Paper

On the Recursive Teaching Dimension of VC Classes

  • Xi Chen
  • Yu Cheng
  • Bo Tang

The recursive teaching dimension (RTD) of a concept class $C \subseteq \{0, 1\}^n$, introduced by Zilles et al. [ZLHZ11], is a complexity parameter measured by the worst-case number of labeled examples needed to learn any target concept of $C$ in the recursive teaching model. In this paper, we study the quantitative relation between RTD and the well-known learning complexity measure VC dimension (VCD), and improve the best known upper and (worst-case) lower bounds on the recursive teaching dimension with respect to the VC dimension. Given a concept class $C \subseteq \{0, 1\}^n$ with $VCD(C) = d$, we first show that $RTD(C)$ is at most $d 2^{d+1}$. This is the first upper bound for $RTD(C)$ that depends only on $VCD(C)$, independent of the size of the concept class $|C|$ and its domain size $n$. Before our work, the best known upper bound for $RTD(C)$ is $O(d 2^d \log \log |C|)$, obtained by Moran et al. [MSWY15]. We remove the $\log \log |C|$ factor. We also improve the lower bound on the worst-case ratio of $RTD(C)$ to $VCD(C)$. We present a family of classes $\{ C_k \}_{k \ge 1}$ with $VCD(C_k) = 3k$ and $RTD(C_k)=5k$, which implies that the ratio of $RTD(C)$ to $VCD(C)$ in the worst case can be as large as $5/3$. Before our work, the largest ratio known was $3/2$ as obtained by Kuhlmann [Kuh99]. Since then, no finite concept class $C$ has been known to satisfy $RTD(C) > (3/2) VCD(C)$.
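The VC dimension of a small concept class can be computed by brute force, which makes the quantities in the abstract concrete (RTD needs more machinery and is omitted). A sketch, with concepts represented as bit-tuples over a finite domain:

```python
from itertools import combinations

def vc_dimension(concepts, domain_size):
    """Brute-force VC dimension of a concept class given as bit-tuples."""
    concept_set = set(concepts)

    def shattered(points):
        # every labeling of `points` must be realized by some concept
        patterns = {tuple(c[p] for p in points) for c in concept_set}
        return len(patterns) == 2 ** len(points)

    d = 0
    for k in range(1, domain_size + 1):
        if any(shattered(s) for s in combinations(range(domain_size), k)):
            d = k
        else:
            break  # shattering is monotone, so larger sets cannot work
    return d
```

For example, the class of intervals on a 4-point domain has VC dimension 2: any pair of points can be labeled in all four ways by intervals, but the labeling (1, 0, 1) of three ordered points is unrealizable.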

JMLR Journal 2016 Journal Article

Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing

  • Yuchen Zhang
  • Xi Chen
  • Dengyong Zhou
  • Michael I. Jordan

Crowdsourcing is a popular paradigm for effectively collecting labels at low cost. The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

NeurIPS Conference 2016 Conference Paper

VIME: Variational Information Maximizing Exploration

  • Rein Houthooft
  • Xi Chen
  • Yan Duan
  • John Schulman
  • Filip De Turck
  • Pieter Abbeel

Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

IJCAI Conference 2015 Conference Paper

Clustering Dynamic Spatio-Temporal Patterns in The Presence of Noise and Missing Data

  • Xi Chen
  • James H. Faghmous
  • Ankush Khandelwal
  • Vipin Kumar

Clustering has gained widespread use, especially for static data. However, the rapid growth of spatio-temporal data from numerous instruments, such as earth-orbiting satellites, has created a need for spatio-temporal clustering methods to extract and monitor dynamic clusters. Dynamic spatio-temporal clustering faces two major challenges: first, the clusters are dynamic and may change in size, shape, and statistical properties over time; second, numerous spatio-temporal data are incomplete, noisy, heterogeneous, and highly variable (over space and time). We propose a new spatio-temporal data mining paradigm to autonomously identify dynamic spatio-temporal clusters in the presence of noise and missing data. Our proposed approach is more robust than traditional clustering and image segmentation techniques in the presence of dynamic patterns, non-stationarity, heterogeneity, and missing data. We demonstrate our method's performance on a real-world application of monitoring in-land water bodies on a global scale.

JMLR Journal 2015 Journal Article

Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling

  • Xi Chen
  • Qihang Lin
  • Dengyong Zhou

It has become increasingly popular to obtain machine learning labels through commercial crowdsourcing services. The crowdsourcing workers or annotators are paid for each label they provide, but the task requester usually has only a limited amount of the budget. Since the data instances have different levels of labeling difficulty and the workers have different reliability for the labeling task, it is desirable to wisely allocate the budget among all the instances and workers such that the overall labeling quality is maximized. In this paper, we formulate the budget allocation problem as a Bayesian Markov decision process (MDP), which simultaneously conducts learning and decision making. The optimal allocation policy can be obtained by using the dynamic programming (DP) recurrence. However, DP quickly becomes computationally intractable when the size of the problem increases. To solve this challenge, we propose a computationally efficient approximate policy which is called optimistic knowledge gradient. Our method applies to both pull crowdsourcing marketplaces with homogeneous workers and push marketplaces with heterogeneous workers. It can also incorporate the contextual information of instances when they are available. The experiments on both simulated and real data show that our policy achieves a higher labeling quality than other existing policies at the same budget level.

IS Journal 2014 Journal Article

A Network Evolution Model for Chinese Traditional Acquaintance Networks

  • Xi Chen
  • Lan Zhang
  • Wei Li

The evolution model of Chinese traditional acquaintance relationship networks described in this article emphasizes individual heterogeneity and social culture. The model incorporates three distinct mechanisms that affect acquaintance network evolution and formation: heredity linking, variation linking, and similarity-based disconnection. The authors found that the degree distribution of Chinese traditional acquaintance networks is manifested in a piecewise approximation that combines a power-law form with an exponential cutoff and an exponential distribution. Numerical results indicate that individuals maintaining a medium amount of connections far outweigh others, reflecting the characteristics of Guanxi-centered society. The formation of acquaintance relationship networks is greatly affected by the special Chinese kinship culture. The authors' findings are supported by sociological statistical conclusions and offer a rational explanation for the nature of Chinese kinship networks. Their work provides an adequate framework for further research on dynamic human complex behaviors such as epidemic spreading and rumor propagation.

ICRA Conference 2014 Conference Paper

An inertial-based human motion tracking system with twists and exponential maps

  • Xi Chen
  • Jie Zhang 0074
  • William R. Hamel
  • Jindong Tan

Wearable inertial tracking is well accepted due to its convenience for free-style motion tracking with high accuracy. Traditionally, complicated high-order calculations for human kinematic modeling and inaccurate estimation of sensor placement interfere with the efficiency of real-time tracking. To tackle these challenges, a wearable human motion tracking system is developed by applying twists and exponential maps. When the body segments are articulated by products of exponential maps, joint positions are continuously updated and their rotational angles are represented individually within the global frame, making it more efficient to achieve real-time motion tracking with low-order calculations. Meanwhile, a well-designed calibration procedure makes it convenient to estimate a sensor's position and orientation without prior knowledge of its placement. This paper presents our approach and exemplifies the assessment of the proposed motion tracking system through several tests of limb and full-body motion tracking. Comparisons with the Vicon and OptiTrack motion capture systems verify satisfactorily high accuracy.
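The twist-and-exponential-map machinery the abstract refers to boils down to Rodrigues' formula plus a translation term. A numpy sketch following the product-of-exponentials convention of Murray, Li, and Sastry (an illustration of the mathematics, not the authors' implementation):

```python
import numpy as np

def hat(w):
    """so(3) hat operator: 3-vector -> skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_twist(w, v, theta):
    """Exponential map of a unit twist (w, v), |w| = 1, as a 4x4 transform.

    Rodrigues' formula gives the rotation; the standard translation term
    is (I - R)(w x v) + w w^T v theta.
    """
    W = hat(w)
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)
    p = (np.eye(3) - R) @ (W @ v) + np.outer(w, w) @ v * theta
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g
```

Chaining such transforms, one per joint, gives the low-order forward-kinematics updates the abstract describes.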

NeurIPS Conference 2014 Conference Paper

Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing

  • Yuchen Zhang
  • Xi Chen
  • Dengyong Zhou
  • Michael Jordan

The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

IROS Conference 2013 Conference Paper

Case studies of a robot enhanced walker for training of children with cerebral palsy

  • Sunil K. Agrawal
  • Jiyeon Kang
  • Xi Chen
  • Mi Jung Kim
  • Youngmyung Lee
  • Sang Won Kong
  • Gyung-Jin Park

Cerebral palsy (CP) is a disorder of movement and posture in children caused by non-progressive insult of the immature brain. The characteristic features are weakness, spasticity, muscle contractures, and poor motor coordination. The gait patterns of children with CP are slow, uncoordinated, and unstable. Our hypothesis is that these impaired children will benefit from robot enhanced walkers to improve their balance, coordination, and speed during gait. In addition, this experience will also impact their clinical scores that relate to their functional performance and caregiver assistance. In this study, we used a specially-designed robotic walker which children used to perform a series of walking tasks, in increasing order of difficulty. This study was performed in 30 training sessions over a period of 3 months. Each training session lasted for 20 minutes. The outcome measures were variables recorded by the robot such as travel distance, average speed, and clinical measured variables that characterize their disability profiles.

UAI Conference 2013 Conference Paper

Evaluating computational models of explanation using human judgments

  • Michael Pacer
  • Joseph Jay Williams
  • Xi Chen
  • Tania Lombrozo
  • Thomas L. Griffiths 0001

We evaluate four computational models of explanation in Bayesian networks by comparing model predictions to human judgments. In two experiments, we present human participants with causal structures for which the models make divergent predictions and either solicit the best explanation for an observed event (Experiment 1) or have participants rate provided explanations for an observed event (Experiment 2). Across two versions of two causal structures and across both experiments, we find that the Causal Explanation Tree and Most Relevant Explanation models provide better fits to human data than either Most Probable Explanation or Explanation Tree models. We identify strengths and shortcomings of these models and what they can reveal about human explanation. We conclude by suggesting the value of pursuing computational and psychological investigations of explanation in parallel.

ICML Conference 2013 Conference Paper

Optimistic Knowledge Gradient Policy for Optimal Budget Allocation in Crowdsourcing

  • Xi Chen
  • Qihang Lin
  • Dengyong Zhou

In real crowdsourcing applications, each label from a crowd usually comes with a certain cost. Given a prefixed amount of budget, since different tasks have different ambiguities and different workers have different levels of expertise, we want to find an optimal way to allocate the budget among instance-worker pairs such that the overall label quality can be maximized. To address this issue, we start from the simplest setting in which all workers are assumed to be perfect. We formulate the problem as a Bayesian Markov Decision Process (MDP). Using the dynamic programming (DP) algorithm, one can obtain the optimal allocation policy for a given budget. However, DP is computationally intractable. To solve the computational challenge, we propose a novel approximate policy which is called optimistic knowledge gradient. It is practically efficient, and its consistency can be guaranteed theoretically. We then extend the MDP framework to deal with inhomogeneous workers and tasks with contextual information available. The experiments on both simulated and real data demonstrate the superiority of our method.
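The defining idea of the optimistic knowledge gradient, scoring each instance by its best-case (rather than expected) one-label improvement, can be sketched for binary labeling with Beta posteriors. The stage reward h(a, b) = max(a, b) / (a + b) below is an illustrative stand-in for the paper's exact reward function.

```python
import numpy as np

def okg_pick(a, b):
    """Optimistic knowledge gradient: pick the instance whose best-case
    one-label improvement is largest.

    a, b: per-instance Beta posterior counts (e.g. prior Beta(1, 1) plus
    observed positive/negative labels).
    """
    a, b = np.asarray(a, float), np.asarray(b, float)

    def h(a, b):
        # illustrative stage reward: accuracy of the majority-label decision
        return np.maximum(a, b) / (a + b)

    gain_pos = h(a + 1, b) - h(a, b)        # improvement if next label is 1
    gain_neg = h(a, b + 1) - h(a, b)        # improvement if next label is 0
    index = np.maximum(gain_pos, gain_neg)  # optimistic, not expected, gain
    return int(np.argmax(index))
```

Under this index an ambiguous instance (counts 5/5) is queried before a confident one (counts 9/1), which matches the intended budget-allocation behavior.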

NeurIPS Conference 2013 Conference Paper

Variance Reduction for Stochastic Gradient Optimization

  • Chong Wang
  • Xi Chen
  • Alexander Smola
  • Eric Xing

Stochastic gradient optimization is a class of widely used algorithms for training machine learning models. To optimize an objective, it uses the noisy gradient computed from the random data samples instead of the true gradient computed from the entire dataset. However, when the variance of the noisy gradient is large, the algorithm might spend much time bouncing around, leading to slower convergence and worse performance. In this paper, we develop a general approach of using control variate for variance reduction in stochastic gradient. Data statistics such as low-order moments (pre-computed or estimated online) are used to form the control variate. We demonstrate how to construct the control variate for two practical problems using stochastic gradient optimization. One is convex---the MAP estimation for logistic regression, and the other is non-convex---stochastic variational inference for latent Dirichlet allocation. On both problems, our approach shows faster convergence and better performance than the classical approach.
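The control-variate recipe is easy to demonstrate on a toy estimator: subtract a correlated quantity with a known low-order moment, keeping the mean unchanged while shrinking the variance. The integrand, control variate, and distribution below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: estimate E[g(X)] for g(x) = x**2 with X ~ N(1, 1).
# Control variate h(x) = x has known mean E[h] = 1 (a low-order moment,
# as in the paper's recipe) and is positively correlated with g.
x = rng.normal(1.0, 1.0, 100_000)
g = x ** 2
h = x
a = np.cov(g, h)[0, 1] / np.var(h)  # optimal coefficient a* = Cov(g,h)/Var(h)
g_cv = g - a * (h - 1.0)            # same expectation, lower variance
```

Analytically, Var(g) = 6 and the control variate removes Cov(g, h)^2 / Var(h) = 4 of it, so the corrected estimator has roughly one third of the original variance.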

NeurIPS Conference 2012 Conference Paper

Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

  • Zhirong Yang
  • Tele Hao
  • Onur Dikmen
  • Xi Chen
  • Erkki Oja

Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis. However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples. Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using random walk. Our method can thus accommodate farther relationships between data samples. Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering. The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix. Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity.
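The "smoothed version using random walk" can be illustrated with a discounted-walk kernel: replace one-step similarities by a geometric sum of multi-step transition probabilities. The particular kernel below, (1 - alpha)(I - alpha P)^{-1}, is an illustrative personalized-PageRank-style choice, not necessarily the exact smoothing used in the paper.

```python
import numpy as np

def random_walk_smooth(A, alpha=0.8):
    """Smooth a similarity matrix by discounted random-walk reachability.

    P is the row-stochastic transition matrix of A; the geometric series
    (1 - alpha) * sum_t alpha^t P^t = (1 - alpha) * inv(I - alpha P)
    lets similarity flow along the manifold instead of only between
    immediate neighbours.
    """
    d = A.sum(axis=1)
    P = A / d[:, None]
    n = A.shape[0]
    return (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)
```

On a 3-node chain graph, the smoothed matrix assigns positive similarity to the two endpoints even though they share no edge, which is exactly the "farther relationships" the abstract mentions; each row remains a probability distribution.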

NeurIPS Conference 2012 Conference Paper

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

  • Xi Chen
  • Qihang Lin
  • Javier Pena

This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of $O(\frac{1}{N}+\frac{1}{N^2})$ for $N$ iterations, which improves the best known rate $O(\frac{\log N }{N})$ of previous stochastic dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., $\ell_1$-norm), it has the advantage of encouraging sparser solutions. We further develop a multi-stage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate $O(\frac{1}{N}+\exp\{-N\})$ for strongly convex loss.
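The base method this paper builds on, $\ell_1$-regularized dual averaging (due to Xiao), has a simple closed-form update that shows why dual averaging encourages exact sparsity. A minimal sketch for the plain convex case (not the paper's strongly convex or multi-stage variants):

```python
import numpy as np

def rda_l1(grad_fn, dim, n_iter=500, lam=0.1, gamma=1.0):
    """l1-regularized dual averaging (minimal sketch, convex case).

    Each step solves
      w_{t+1} = argmin_w <gbar_t, w> + lam*||w||_1 + (gamma/sqrt(t))*||w||^2/2
              = -(sqrt(t)/gamma) * soft_threshold(gbar_t, lam),
    where gbar_t is the running average of all past (stochastic) gradients.
    Coordinates whose average gradient stays below lam are exactly zero.
    """
    w = np.zeros(dim)
    gbar = np.zeros(dim)
    for t in range(1, n_iter + 1):
        g = grad_fn(w)
        gbar += (g - gbar) / t                 # running gradient average
        shrunk = np.sign(gbar) * np.maximum(np.abs(gbar) - lam, 0.0)
        w = -(np.sqrt(t) / gamma) * shrunk     # closed-form proximal step
    return w
```

On the deterministic toy loss f(w) = ||w - c||^2 / 2 with c = (2, 0.01), the weak coordinate stays exactly zero while the strong one approaches the soft-thresholded optimum 1.9, illustrating the "sparser solutions from the proximal mapping" point in the abstract.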

ICRA Conference 2011 Conference Paper

A magnetic thin film microrobot with two operating modes

  • Wuming Jing
  • Xi Chen
  • Sean Lyttle
  • Zhenbo Fu
  • Yong Shi
  • David J. Cappelleri

Magnetic principles have proved successful for untethered submillimeter microrobotics, although challenges still exist in areas of propulsion and control. This paper presents the design, analysis, and performance results for a bimorph thin film magnetic microrobot utilizing the magnetostrictive principle as a secondary oscillating operation mode. The microrobot is no larger than 580 μm in its planar dimension and its total thickness is less than 5 μm. As a robot with magnetic material, it can be operated in a pushing/pulling mode in orthogonal directions for movement in a plane, while it is powered by an external magnetic field as low as 1 mT. For the secondary oscillating operation mode utilizing the magnetostrictive principle, in-plane strain is induced, resulting in bending and blocking forces on the robot. These forces are theoretically calculated to prove that enough drive force can be generated in this mode. The design is further abstracted and translated into a piezoelectric cantilever FEM model to confirm the theoretical results. Microrobot fabrication and test-bed development based on this analysis are shown, which enabled us to participate in the final competition of the 2010 NIST Mobile Microrobot Challenge, with good performance in the dash and freestyle events. Finally, we discuss the testing results in various dry and fluid environments along with recommendations for future investigation and improvements. Keywords: microrobot, magnetostrictive, bimorph.

ICRA Conference 2011 Conference Paper

Pedestrian positioning with physical activity classification for indoors

  • Xi Chen
  • Sheng Hu
  • Zhenzhou Shao
  • Jindong Tan

This paper presents a wearable Inertial Measurement Unit pedestrian positioning system for indoors. A Hidden Markov Model (HMM) is introduced to pre-process the sensor data and classify common activities; the HMM is complemented by local minimum angular-rate values for capturing the onset/end of each step. A ZUPT algorithm is implemented to correct the walking velocity during the stance phase of each step when errors exist. A novel acceleration-based approach combined with gyroscope data is developed to achieve better heading estimation. The proposed method reduces drift errors from gyroscopes and avoids electromagnetic disturbance to magnetometers when estimating the subject's position. Experimental results show that the positioning system achieves approximately 99% accuracy.
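The ZUPT correction itself fits in a few lines: integrate acceleration to velocity and reset the accumulated drift whenever the foot is detected in stance. The threshold-based stance detector below is a simplified stand-in for the HMM-based detection described in the abstract, and the 1-D setup is a toy assumption.

```python
import numpy as np

def zupt_velocity(accel, gyro_norm, dt=0.01, gyro_thresh=0.5):
    """Zero-velocity update (ZUPT) on a 1-D velocity estimate.

    accel:     (n,) forward acceleration samples (gravity already removed)
    gyro_norm: (n,) angular-rate magnitude; below `gyro_thresh` ~ stance
    """
    v = np.zeros(len(accel))
    for t in range(1, len(accel)):
        v[t] = v[t - 1] + accel[t] * dt  # dead-reckoning integration
        if gyro_norm[t] < gyro_thresh:   # stance detected: reset the drift
            v[t] = 0.0
    return v
```

With a constant accelerometer bias, the velocity drifts during swing and is clamped back to zero during stance, which is exactly the error-correction mechanism the abstract describes.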

NeurIPS Conference 2010 Conference Paper

Graph-Valued Regression

  • Han Liu
  • Xi Chen
  • Larry Wasserman
  • John Lafferty

Undirected graphical models encode in a graph $G$ the dependency structure of a random vector $Y$. In many applications, it is of interest to model $Y$ given another random vector $X$ as input. We refer to the problem of estimating the graph $G(x)$ of $Y$ conditioned on $X=x$ as "graph-valued regression". In this paper, we propose a semiparametric method for estimating $G(x)$ that builds a tree on the $X$ space just as in CART (classification and regression trees), but at each leaf of the tree estimates a graph. We call the method "Graph-optimized CART", or Go-CART. We study the theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle inequalities on risk minimization and tree partition consistency. We also demonstrate the application of Go-CART to a meteorological dataset, showing how graph-valued regression can provide a useful tool for analyzing complex data.

AAAI Conference 2010 Conference Paper

Learning Spatial-Temporal Varying Graphs with Applications to Climate Data Analysis

  • Xi Chen
  • Yan Liu
  • Han Liu
  • Jaime Carbonell

An important challenge in understanding climate change is to uncover the dependency relationships between various climate observations and forcing factors. Graphical lasso, a recently proposed $\ell_1$-penalty-based structure learning algorithm, has been proven successful for learning underlying dependency structures for data drawn from a multivariate Gaussian distribution. However, climatological data often turn out to be non-Gaussian, e.g., cloud cover, precipitation, etc. In this paper, we examine nonparametric learning methods to address this challenge. In particular, we develop a methodology to learn dynamic graph structures from spatial-temporal data so that the graph structures at adjacent times or locations are similar. Experimental results demonstrate that our method not only recovers the underlying graph well but also captures the smooth variation properties on both synthetic data and climate data.

NeurIPS Conference 2010 Conference Paper

Multivariate Dyadic Regression Trees for Sparse Learning Problems

  • Han Liu
  • Xi Chen

We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of $(\alpha, C)$-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristic. The superior performance of MDRTs is demonstrated on both synthetic and real datasets.

ICRA Conference 2010 Conference Paper

Training special needs infants to drive mobile robots using force-feedback joystick

  • Sunil K. Agrawal
  • Xi Chen
  • James C. Galloway

In typically developing infants, the onset of crawling and walking is associated with changes across developmental domains such as cognition and perception ([1], [2]). Currently, infants born with significant mobility impairments do not use powered wheelchairs until three years of age [3]. This potentially limits their development in the early growth years. The goal of this research is to train infants with impairments to safely and purposefully drive a mobile robot indoors while being seated on it. We anticipate that these impaired infants will benefit from early mobility in their early years, similar to their healthy peers.

IROS Conference 2009 Conference Paper

An adaptive mobile robots tethering algorithm in constrained environments

  • Xi Chen
  • Jindong Tan

This paper presents an adaptive and decentralized robotic cooperation algorithm for controlling mobile sensors to form a chained network and maintain the communication links. Single-layer and double-layer chain tethering algorithms are developed for exploring open and constrained environments with mobile robots. A comprehensive metric for finding the optimal communication range is introduced; with these measurements, mobile robots can be organized into an optimal chained form for tethering. The tethering algorithm can detect failed nodes and reconfigure the system, offering an adaptive solution to broken communication links.

NeurIPS Conference 2009 Conference Paper

Nonparametric Greedy Algorithms for the Sparse Learning Problem

  • Han Liu
  • Xi Chen

This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate regression, we propose an algorithm called generalized forward regression. Both of them simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-of-the-art competitors, including the LASSO, a nonparametric version of the LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called FoBa. Some theoretical justifications are also provided.
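Forward greedy selection, in its simplest parametric form, can be sketched as follows; each step fits ordinary least squares rather than the paper's per-variable nonparametric smoothers, so this is an illustrative parametric analogue of additive forward regression, not the paper's algorithm.

```python
import numpy as np

def forward_regression(X, y, k):
    """Forward greedy variable selection (parametric sketch).

    At each step, add the variable that most reduces the residual sum of
    squares of a least-squares refit on the selected set.
    """
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```

On well-conditioned synthetic data with a sparse linear signal, two greedy steps recover the true support.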