Arrow Research search

Author name cluster

Zexu Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

AAAI Conference 2026 Conference Paper

Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models

  • Jinman Zhao
  • Erxue Min
  • Hui Wu
  • Ziheng Li
  • Zexu Sun
  • Hengyi Cai
  • Shuaiqiang Wang
  • Xu Chen

Large language models (LLMs) have shown impressive capabilities in natural language tasks, yet they continue to struggle with multi-step mathematical reasoning, where correctness depends on a precise chain of intermediate steps. Preference optimization methods such as Direct Preference Optimization (DPO) have improved answer-level alignment, but they often overlook the reasoning process itself, providing little supervision over intermediate steps that are critical for complex problem-solving. Existing fine-grained approaches typically rely on strong annotators or reward models to assess the quality of individual steps. However, reward models are vulnerable to reward hacking. To address this, we propose ISLA, a reward-model-free framework that constructs step-level preference data directly from SFT gold traces. ISLA also introduces a self-improving pruning mechanism that identifies informative steps based on two signals: their marginal contribution to final accuracy (relative accuracy) and the model’s uncertainty, inspired by the concept of information gain. Empirically, ISLA achieves better performance than DPO while using only 12% of the training tokens, demonstrating that careful step-level selection can significantly improve both reasoning accuracy and training efficiency.
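To make the selection rule the abstract sketches more concrete (a step's marginal contribution to final accuracy combined with the model's uncertainty), here is a minimal illustrative scoring function. The field names, combination weights, and keep ratio are hypothetical assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only (not the authors' code): score reasoning steps by the two
# signals the abstract names -- relative accuracy and model uncertainty -- and keep
# the highest-scoring ones. Field names and hyperparameters are assumptions.
from dataclasses import dataclass, field

@dataclass
class Step:
    text: str
    acc_with: float                 # estimated final accuracy when the step is kept
    acc_without: float              # estimated final accuracy when the step is removed
    token_logprobs: list = field(default_factory=list)  # model log-probs of the step's tokens

def step_score(step: Step, alpha: float = 1.0, beta: float = 1.0) -> float:
    relative_accuracy = step.acc_with - step.acc_without   # marginal contribution to accuracy
    # Mean negative log-likelihood of the step's tokens as a crude uncertainty proxy.
    uncertainty = -sum(step.token_logprobs) / max(len(step.token_logprobs), 1)
    return alpha * relative_accuracy + beta * uncertainty

def prune_steps(steps: list, keep_ratio: float = 0.5) -> list:
    """Keep the top-scoring fraction of steps; the ratio is a tunable assumption."""
    k = max(1, int(len(steps) * keep_ratio))
    return sorted(steps, key=step_score, reverse=True)[:k]
```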

ICML Conference 2025 Conference Paper

Invariant Deep Uplift Modeling for Incentive Assignment in Online Marketing via Probability of Necessity and Sufficiency

  • Zexu Sun
  • Qiyu Han
  • Hao Yang 0045
  • Anpeng Wu
  • Minqin Zhu
  • Dugang Liu
  • Chen Ma 0001
  • Yunpeng Weng

In online platforms, incentives (e.g., discounts, coupons) are used to boost user engagement and revenue. Uplift modeling methods are developed to estimate user responses from observational data, often incorporating distribution balancing to address selection bias. However, these methods are limited to in-distribution testing data, which mirrors the training data distribution. In reality, user features change continuously due to time, geography, and other factors, especially on complex online marketing platforms. Thus, an effective uplift modeling method for out-of-distribution data is crucial. To address this, we propose a novel uplift modeling method, Invariant Deep Uplift Modeling (IDUM), which uses invariant learning to enhance out-of-distribution generalization by identifying causal factors that remain consistent across domains. IDUM further refines these features into necessary and sufficient factors and employs a masking component to reduce computational costs by selecting the most informative invariant features. A balancing discrepancy component is also introduced to mitigate selection bias in observational data. We conduct extensive experiments on public and real-world datasets to demonstrate IDUM's effectiveness in both in-distribution and out-of-distribution scenarios in online marketing. Furthermore, we provide theoretical analysis and proofs to support IDUM's generalizability.
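As a toy sketch of the ingredients the abstract names (a learnable mask over features, separate treated and control outcome heads, and a balancing discrepancy between treated and control representations), the snippet below shows one way these pieces could fit together. The module layout and the simple mean-difference discrepancy are illustrative assumptions, not the IDUM architecture.

```python
# Toy sketch of the ingredients the abstract names (learnable feature mask,
# uplift heads, mean-difference balancing penalty); not the IDUM architecture.
import torch
import torch.nn as nn

class ToyUpliftNet(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(in_dim))   # learnable feature mask
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_treated = nn.Linear(hidden, 1)                # outcome under treatment
        self.head_control = nn.Linear(hidden, 1)                # outcome under control

    def forward(self, x):
        masked = x * torch.sigmoid(self.mask_logits)            # soft feature selection
        z = self.encoder(masked)
        return self.head_treated(z), self.head_control(z), z

def balancing_discrepancy(z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Crude balancing proxy: distance between the mean representations
    of treated (t == 1) and control (t == 0) units."""
    return (z[t == 1].mean(dim=0) - z[t == 0].mean(dim=0)).pow(2).sum()
```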

NeurIPS Conference 2025 Conference Paper

Learning to Focus: Causal Attention Distillation via Gradient‐Guided Token Pruning

  • Yiju Guo
  • Wenkai Yang
  • Zexu Sun
  • Ning Ding
  • Zhiyuan Liu
  • Yankai Lin

Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model's capacity to infer authentic causal instruction–response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF), which leverages intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student's attention with the teacher's focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves absolute improvements across mathematical reasoning, code generation, and multi-hop question answering benchmarks, but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
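As a rough illustration of the first-stage idea (comparing gradient-based token importance between a student and a teacher, then pruning the divergent tokens from the distillation context), consider the sketch below. The saliency inputs and the ratio threshold are assumptions; the paper's actual criterion is not reproduced here.

```python
# Illustrative sketch: flag tokens whose gradient-based importance diverges between
# a teacher and a student, then drop them from the distillation context.
# The threshold rule below is an assumption, not LeaF's exact criterion.
def confounding_token_mask(student_saliency, teacher_saliency, ratio=2.0, floor=1e-8):
    """Saliency lists hold per-token gradient norms of the loss w.r.t. token embeddings.
    A token is marked confounding if the student weights it much more than the teacher."""
    mask = []
    for s, t in zip(student_saliency, teacher_saliency):
        mask.append(s / max(t, floor) > ratio)   # True => treat this token as confounding
    return mask

def prune_tokens(tokens, mask):
    """Remove the flagged tokens before computing the distillation loss."""
    return [tok for tok, confounding in zip(tokens, mask) if not confounding]
```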

ICML Conference 2025 Conference Paper

Rethinking Causal Ranking: A Balanced Perspective on Uplift Model Evaluation

  • Minqin Zhu
  • Zexu Sun
  • Ruoxuan Xiong
  • Anpeng Wu
  • Baohong Li
  • Caizhi Tang
  • Jun Zhou 0011
  • Fei Wu 0001

Uplift modeling is crucial for identifying individuals likely to respond to a treatment in applications like marketing and customer retention, but evaluating these models is challenging due to the inaccessibility of counterfactual outcomes in real-world settings. In this paper, we identify a fundamental limitation in existing evaluation metrics, such as the uplift and Qini curves, which fail to rank individuals with binary negative outcomes accurately. This can lead to biased evaluations, where biased models receive higher curve values than unbiased ones, resulting in suboptimal model selection. To address this, we propose the Principled Uplift Curve (PUC), a novel evaluation metric that assigns equal curve values to individuals with positive and negative binary outcomes, offering a more balanced and unbiased assessment. We then derive the Principled Uplift Loss (PUL) function from the PUC and integrate it into a new uplift model, the Principled Treatment and Outcome Network (PTONet), to reduce bias during uplift model training. Experiments on both simulated and real-world datasets demonstrate that the PUC provides less biased evaluations, while PTONet outperforms existing methods. The source code is available at: https://github.com/euzmin/PUC.
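For context on the kind of metric the paper critiques, here is a standard Qini-curve computation (the conventional ranking-based formula, not the proposed PUC):

```python
# Conventional Qini curve: rank by predicted uplift, then compare cumulative treated
# and control positives at each cutoff. Shown only as background for the metric the
# paper argues is biased; this is not the proposed PUC.
import numpy as np

def qini_curve(uplift_scores, treatment, outcome):
    """uplift_scores: predicted uplift; treatment: 1 treated / 0 control; outcome: binary."""
    order = np.argsort(-np.asarray(uplift_scores))
    t = np.asarray(treatment)[order]
    y = np.asarray(outcome)[order]
    n_t = np.cumsum(t)               # treated count in the top-k
    n_c = np.cumsum(1 - t)           # control count in the top-k
    y_t = np.cumsum(y * t)           # treated positives in the top-k
    y_c = np.cumsum(y * (1 - t))     # control positives in the top-k
    ratio = np.where(n_c > 0, n_t / np.maximum(n_c, 1), 0.0)
    return y_t - y_c * ratio         # one curve value per cutoff k
```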

ICLR Conference 2025 Conference Paper

Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback

  • Zexu Sun
  • Yiju Guo
  • Yankai Lin 0001
  • Xu Chen 0017
  • Qi Qi 0003
  • Xing Tang 0007
  • Xiuqiang He 0001
  • Ji-Rong Wen

Reinforcement Learning from Human Feedback (RLHF) has emerged as a standard and effective approach for training large language models (LLMs) with human preferences. In this framework, a learned reward model approximates human preferences and guides policy optimization, making it crucial to develop an accurate reward model. However, without the "true" reward function, challenges arise when the reward model is an imperfect proxy for human preference. Because policy optimization continuously shifts the distribution away from the human preference training dataset, a fixed reward model suffers from this off-distribution problem, especially for on-policy methods. While collecting new preference data can mitigate this issue, it is costly and challenging to optimize. Thus, reusing policy interaction samples becomes a possible way to further refine the reward model. To tackle these challenges, we introduce a novel method, Uncertainty-Gradient based Data Augmentation (UGDA), to enhance reward modeling by leveraging policy samples to maintain on-distribution performance. Specifically, UGDA selects interaction samples based on the uncertainty of the reward ensembles and the gradient-based influence of policy optimization. After relabeling the rewards of the selected samples, we use supervised learning to refine the reward ensembles and then retrain the policy. Extensive experiments demonstrate that by leveraging UGDA to select a few samples, without costly human preference data collection, we can improve the policy and surpass state-of-the-art methods.
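A minimal sketch of the selection step the abstract describes, ranking policy interaction samples by reward-ensemble disagreement combined with a gradient-based influence proxy; the combination weight and the influence proxy are assumptions, not the paper's exact formula:

```python
# Sketch of the sample-selection idea: pick the interaction samples where the reward
# ensemble disagrees most and the influence on policy optimization is largest.
# The weighting and the influence proxy are illustrative assumptions.
import numpy as np

def select_samples(ensemble_rewards, influence_scores, budget, weight=0.5):
    """ensemble_rewards: shape (n_models, n_samples), each reward model's score per sample;
    influence_scores: per-sample influence proxy (e.g., gradient norms); budget: how many to pick."""
    uncertainty = np.std(ensemble_rewards, axis=0)             # disagreement across the ensemble
    score = weight * uncertainty + (1 - weight) * np.asarray(influence_scores)
    return np.argsort(-score)[:budget]                         # indices of samples to relabel
```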

NeurIPS Conference 2023 Conference Paper

Offline Imitation Learning with Variational Counterfactual Reasoning

  • Zexu Sun
  • Bowei He
  • Jinxin Liu
  • Xu Chen
  • Chen Ma
  • Shuai Zhang

In offline imitation learning (IL), an agent aims to learn an optimal expert behavior policy without additional online environment interactions. However, in many real-world scenarios, such as robotics manipulation, the offline dataset is collected from suboptimal behaviors without rewards. Due to the scarcity of expert data, agents usually suffer from simply memorizing poor trajectories and are vulnerable to variations in the environment, lacking the capability to generalize to new environments. To automatically generate high-quality expert data and improve the generalization ability of the agent, we propose a framework named Offline Imitation Learning with Counterfactual data Augmentation (OILCA) that performs counterfactual inference. In particular, we leverage an identifiable variational autoencoder to generate counterfactual samples for expert data augmentation. We theoretically analyze the influence of the generated expert data and the improvement in generalization. Moreover, we conduct extensive experiments to demonstrate that our approach significantly outperforms various baselines on both the DeepMind Control Suite benchmark for in-distribution performance and the CausalWorld benchmark for out-of-distribution generalization.
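As a loose conceptual sketch of counterfactual data augmentation with a pre-trained variational autoencoder (encode an expert sample, intervene on part of the latent, decode), the snippet below illustrates the idea. The latent split and intervention noise are illustrative assumptions and do not reproduce OILCA's identifiable-VAE construction.

```python
# Conceptual sketch: encode an expert sample, intervene on the designated latent
# dimensions, and decode counterfactual variants. Assumes a pre-trained encoder/decoder;
# the split of the latent and the noise scale are hypothetical choices.
import torch

@torch.no_grad()
def counterfactual_augment(encoder, decoder, x, exo_dims, n_samples=4, scale=1.0):
    """encoder(x) -> latent code; decoder(z) -> reconstructed sample; exo_dims: latent dims to perturb."""
    z = encoder(x)
    samples = []
    for _ in range(n_samples):
        z_cf = z.clone()
        noise = scale * torch.randn_like(z_cf[..., exo_dims])
        z_cf[..., exo_dims] += noise          # intervene on the chosen latent dimensions
        samples.append(decoder(z_cf))         # decoded counterfactual sample
    return samples
```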