AAAI 2026
FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification
Abstract
We introduce FIXME, the first end-to-end, large-scale benchmark for evaluating Large Language Models (LLMs) in hardware design functional verification (FV). Comprising 747 tasks derived from real-world hardware designs, FIXME spans five core FV sub-tasks: specification comprehension, reference model generation, testbench generation, assertion design, and RTL debugging. To ensure high data quality, we developed an AI-human collaborative framework for agile data curation and annotation. This process yielded 25,000 lines of verified RTL, 35,000 lines of enhanced testbenches, and over 1,200 SystemVerilog Assertions. Furthermore, through expert-guided optimization within the multi-agent flow, we achieved a 45.57% improvement in average functional coverage, underscoring the benchmark's robustness. By evaluating state-of-the-art LLMs such as GPT-4.1, FIXME identifies key limitations and provides actionable insights, advancing LLM-driven automation in hardware design functional verification.
Context
- Venue
- AAAI Conference on Artificial Intelligence