
NeurIPS 2025

PHANTOM: A Benchmark for Hallucination Detection in Financial Long-Context QA

Conference Paper · Datasets and Benchmarks Track · Artificial Intelligence · Machine Learning

Abstract

While Large Language Models (LLMs) show great promise, their tendency to hallucinate poses significant risks in high-stakes domains like finance, especially when they are used for regulatory reporting and decision-making. Existing hallucination detection benchmarks fail to capture the complexities of financial documents, which demand high numerical precision, a nuanced understanding of the language of finance, and the ability to handle long contexts. To address this, we introduce PHANTOM, a novel benchmark dataset for evaluating hallucination detection in long-context financial QA. Our approach first generates a seed dataset of high-quality "query-answer-document (chunk)" triplets, each containing either a hallucinated or a correct answer, which are validated by human annotators and subsequently expanded to cover varying context lengths and information placements. We demonstrate how PHANTOM enables fair comparison of hallucination detection models and provides insights into LLM performance, offering a valuable resource for improving hallucination detection in financial applications. Further, our benchmarking results highlight the severe challenges out-of-the-box models face in detecting real-world hallucinations on long-context data, and establish promising directions towards alleviating these challenges by fine-tuning open-source LLMs using PHANTOM.
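The abstract's construction pipeline (labeled seed triplets, then expansion across context lengths and evidence placements) can be sketched as follows. This is a minimal illustration, not the authors' actual code: the `QATriplet` schema, the `expand_context` helper, and all example strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QATriplet:
    """One hypothetical 'query-answer-document (chunk)' example."""
    query: str
    answer: str
    chunk: str              # evidence passage the answer should be grounded in
    is_hallucinated: bool   # True if the answer is not supported by the chunk

def expand_context(triplet: QATriplet, fillers: list[str],
                   n_chunks: int, position: str) -> QATriplet:
    """Embed the evidence chunk among distractor chunks to vary both
    context length (n_chunks total) and information placement."""
    distractors = fillers[: n_chunks - 1]
    idx = {"start": 0,
           "middle": len(distractors) // 2,
           "end": len(distractors)}[position]
    context = distractors[:idx] + [triplet.chunk] + distractors[idx:]
    return QATriplet(triplet.query, triplet.answer,
                     "\n\n".join(context), triplet.is_hallucinated)

# Illustrative seed triplet with a correct (non-hallucinated) answer.
seed = QATriplet("What was FY2023 revenue?", "$4.2B",
                 "Total revenue for FY2023 was $4.2B.", False)
long_example = expand_context(
    seed, [f"Unrelated filing paragraph {i}." for i in range(10)], 8, "middle")
```

Sweeping `n_chunks` and `position` over a grid would yield the length/placement variants the abstract describes, while the label carries over unchanged from the validated seed.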


Context

Venue
Annual Conference on Neural Information Processing Systems
Paper id
124803797271885142