Arrow Research search

Author name cluster

Yuhao Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers

8

AAAI Conference 2026 Conference Paper

AdaMCoT: Rethinking Cross-Lingual Factual Reasoning Through Adaptive Multilingual Chain-of-Thought

  • Zheng Weihua
  • Xin Huang
  • Zhengyuan Liu
  • Tarun Kumar Vangani
  • Bowei Zou
  • Xiyan Tao
  • Yuhao Wu
  • AiTi Aw

Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. While these models show strong reasoning abilities, their performance varies significantly across languages due to imbalanced training data distribution. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce **AdaMCoT** (Adaptive Multilingual Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary “thinking languages” before generating target-language responses. AdaMCoT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model’s hidden states and semantic space further elucidates the underlying mechanism of our method. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high- and low-resource languages while maintaining cultural and linguistic nuances.
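The adaptive routing idea in the abstract can be caricatured as picking the highest-reward "thinking language" per query. The sketch below is a hypothetical stand-in, not the paper's implementation: `toy_reward` and the language weights are invented for illustration.

```python
# Hypothetical sketch of reward-based "thinking language" routing, in the
# spirit of AdaMCoT's adaptive pathway selection. reward_model is a stand-in
# scoring function, not the paper's actual reward mechanism.

def route_thinking_language(question, candidates, reward_model):
    """Pick the intermediary reasoning language with the highest reward."""
    scores = {lang: reward_model(question, lang) for lang in candidates}
    return max(scores, key=scores.get)

# Toy reward model: prefer high-resource languages for a low-resource query.
def toy_reward(question, lang):
    resource_weight = {"en": 1.0, "zh": 0.8, "sw": 0.3}
    return resource_weight.get(lang, 0.1)

best = route_thinking_language("Ni nani rais wa kwanza wa Tanzania?",
                               ["en", "zh", "sw"], toy_reward)
print(best)  # highest-reward candidate: "en"
```

In the paper the reward signal is learned rather than hard-coded; the point here is only the argmax-over-pathways control flow.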

AAAI Conference 2026 Conference Paper

NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation

  • Maoqi Liu
  • Quan Fang
  • Yuhao Wu
  • Can Zhao
  • Yang Yang
  • Kaiquan Cai

Accurate interpretation of Notices To Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to "Shallow Parsing," failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as "Deep Parsing," a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a Large Language Model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a crucial closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state-of-the-art on the task of structured NOTAM interpretation.
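To make the "Shallow Parsing" baseline concrete: ICAO-format NOTAMs carry lettered items such as A) location, B) start, C) end, and E) free text. The minimal regex extractor below illustrates that surface-level extraction only; it is not the NOTAM-Evolve pipeline, and the sample NOTAM is invented.

```python
import re

# Minimal "shallow parsing" of ICAO NOTAM items A), B), C), E): pull each
# item's value up to the next lettered item (or end of message). This is
# the surface extraction the abstract contrasts with "Deep Parsing".
FIELD_RE = re.compile(r"\b([ABCE])\)\s*(.*?)(?=\s+[ABCDEFG]\)|$)", re.S)

def shallow_parse(notam):
    return {key: value.strip() for key, value in FIELD_RE.findall(notam)}

sample = ("A1234/24 NOTAMN "
          "A) WSSS B) 2401150800 C) 2401152000 "
          "E) RWY 02L/20R CLSD DUE TO MAINT")
fields = shallow_parse(sample)
print(fields["A"], fields["E"])  # WSSS RWY 02L/20R CLSD DUE TO MAINT
```

Note what shallow parsing cannot do: deciding whether the closure actually affects a given flight requires the dynamic grounding and schema-based inference the paper formalizes as Deep Parsing.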

AAAI Conference 2026 Conference Paper

State-Derivative-Aware Neural Controlled Differential Equations for Multivariate Time Series Anomaly Detection and Diagnosis

  • Xin Sun
  • Heng Zhou
  • Yuhao Wu
  • Chao Li

Multivariate time series anomaly detection is crucial in real-world applications but challenging due to complex temporal dependencies and system dynamics. Reconstruction-based methods have made great strides in recent years. However, these methods suffer from an issue: when performing anomaly detection they primarily measure deviations at the time points themselves while ignoring changes in the dynamic properties of the system. In such cases they cannot produce sufficient reconstruction error to detect anomalies, so potentially abnormal time points caused by the dynamic evolution of the system are missed. To address this problem, we propose a novel method, SDA2D, which models system dynamics via the time derivative of the state vector derived from a neural controlled differential equation (NCDE), enabling reconstruction deviation and system evolution to be learned jointly. Our experimental results show that SDA2D achieves noticeable improvements on four benchmark datasets, and its visualizations further aid anomaly diagnosis by helping locate the sources of anomalies.

ICLR Conference 2025 Conference Paper

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

  • Yuhao Wu
  • Ming Shan Hee
  • Zhiqiang Hu
  • Roy Ka-Wei Lee

Current benchmarks like *Needle-in-a-Haystack* (*NIAH*), *Ruler*, and *Needlebench* focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences—a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce *LongGenBench*, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, *LongGenBench* evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on *Ruler*, all models struggled with long text generation on *LongGenBench*, particularly as text length increased. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation. We open-source *LongGenBench* to promote comprehensive evaluation and improvement in this critical area, with code and data available at {anonymousurl}.
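A benchmark of this kind needs a checker that verifies required events appear where the instructions demanded them. The sketch below is a schematic assumption — the equal-length segmenting and substring matching are invented for illustration and are not LongGenBench's actual evaluation protocol.

```python
# Schematic instruction-following check: split the generated text into
# equal-length segments and verify each required event appears in its
# designated segment. Returns the fraction of constraints satisfied.

def check_constraints(generated, constraints, num_segments):
    """constraints: {segment_index: required_event_string}."""
    step = max(1, len(generated) // num_segments)
    segments = [generated[i * step:(i + 1) * step]
                for i in range(num_segments)]
    hits = sum(1 for seg_idx, event in constraints.items()
               if event in segments[seg_idx])
    return hits / len(constraints)

text = "Week 1: team kickoff. " * 3 + "Week 2: design review. " * 3
score = check_constraints(text, {0: "kickoff", 1: "design review"}, 2)
print(score)  # 1.0 — both events land in their segments
```

Substring matching is the crudest possible judge; a real long-form benchmark would also need to score coherence across segments, which is the harder part the abstract highlights.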

IJCAI Conference 2025 Conference Paper

Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs

  • Ziyu Ge
  • Yuhao Wu
  • Daniel Wai Kit Chin
  • Roy Ka-Wei Lee
  • Rui Cao

Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce CONFACT (Conflicting Evidence for Fact-Checking), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
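One of the strategies the abstract investigates — folding source credibility into conflict resolution — can be sketched as a weighted vote over retrieved evidence. The credibility scores, domains, and voting rule below are illustrative assumptions, not the paper's method or the CONFACT data.

```python
# Toy credibility-weighted aggregation over conflicting evidence.
# evidence: list of (source, stance) pairs, stance +1 = supports the claim,
# -1 = refutes it. Sources missing from the credibility table get 0.5.

def credibility_weighted_verdict(evidence, credibility):
    total = sum(credibility.get(source, 0.5) * stance
                for source, stance in evidence)
    return "SUPPORTED" if total >= 0 else "REFUTED"

cred = {"reuters.com": 0.9, "randomblog.example": 0.2}
evidence = [("reuters.com", -1),
            ("randomblog.example", +1),
            ("randomblog.example", +1)]
print(credibility_weighted_verdict(evidence, cred))
# One high-credibility refutation outweighs two low-credibility supports.
```

The example shows why naive majority voting fails on conflicting evidence: two low-credibility sources would outvote one reliable one unless the weights intervene.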

ICML Conference 2024 Conference Paper

Mitigating Label Noise on Graphs via Topological Sample Selection

  • Yuhao Wu
  • Jiangchao Yao
  • Xiaobo Xia
  • Jun Yu 0001
  • Ruxin Wang 0002
  • Bo Han 0003
  • Tongliang Liu

Despite success on carefully annotated benchmarks, the effectiveness of existing graph neural networks (GNNs) can be considerably impaired in practice when real-world graph data is noisily labeled. Prior work has demonstrated sample selection as an effective approach to robust learning with noisy labels; however, conventional studies focus on i.i.d. data, and two notable challenges remain when moving to non-i.i.d. graph data and GNNs: (1) nodes located near topological class boundaries are very informative for classification but cannot be successfully distinguished by heuristic sample selection, and (2) no available measure considers graph topological information to promote sample selection in a graph. To address this dilemma, we propose a *Topological Sample Selection* (TSS) method that boosts the informative sample selection process in a graph by utilizing topological information. We theoretically prove that our procedure minimizes an upper bound of the expected risk under the target clean distribution, and experimentally show the superiority of our method over state-of-the-art baselines.

ICML Conference 2024 Conference Paper

Unraveling the Impact of Heterophilic Structures on Graph Positive-Unlabeled Learning

  • Yuhao Wu
  • Jiangchao Yao
  • Bo Han 0003
  • Lina Yao 0001
  • Tongliang Liu

While Positive-Unlabeled (PU) learning is vital in many real-world scenarios, its application to graph data remains under-explored. We unveil that a critical challenge for PU learning on graphs lies in edge heterophily, which directly violates the *irreducibility assumption* of *Class-Prior Estimation* (the class prior is essential for building PU learning algorithms) and degrades latent label inference on unlabeled nodes during classifier training. In response to this challenge, we introduce a new method, *Graph PU Learning with Label Propagation Loss* (GPL). Specifically, GPL learns from PU nodes alongside an intermediate heterophily reduction, which helps mitigate the negative impact of the heterophilic structure. We formulate this procedure as a bilevel optimization that reduces heterophily in the inner loop and efficiently learns a classifier in the outer loop. Extensive experiments across a variety of datasets show that GPL significantly outperforms baseline methods, confirming its effectiveness and superiority.