LLM evaluation and benchmark reliability
FrontierAudit: Stability Audits for Cost-Aware LLM Benchmark Claims
Abstract
Cost-aware LLM benchmark papers often present a single frontier winner as if the best policy were exact and stable. We argue that this style of reporting can overstate what the evidence supports when frontier comparisons are close, cost schedules are partly conventional and subgroup or family structure is hidden by pooled summaries. We introduce FrontierAudit, a reporting-oriented audit protocol for calibrating claim strength in cost-aware routing and evidence-acquisition evaluations. FrontierAudit separates exact winner promotion from weaker but still useful conclusions by combining a threshold-based taxonomy, an empirical exact-promotion certificate and conservative follow-on checks for near ties, subgroup stability, transfer and budget semantics. Across audited routing-style and biomedical case studies, we find that many apparent exact winners do not support exact headline language once these checks are applied; they instead justify weaker claim forms such as cost-profile dependence, partial support or transfer-limited conclusions. At the same time, the audit identifies the narrower conditions under which exact promotion is genuinely supported. The main contribution is not a new router or benchmark release but a reusable protocol for matching benchmark claims to the strength of the available evidence.
Research Contribution
We designed an audit protocol for cost-aware LLM benchmark claims and applied it to routing-style and biomedical case studies. The goal was to test whether the evidence supports the strength of the claim being made.
Hypothesis and Significance
Many benchmark winners are less stable than they look. Changes in costs, subgroups, families or budget assumptions can make exact headline language too strong.
For LLM evaluation, the main contribution is a stricter connection between empirical evidence and the claims made from it. The protocol fits a broader shift toward agent and model evaluation that weighs reliability, robustness and claim strength, not leaderboard position alone.
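The hypothesis that a change in cost assumptions can flip an apparent winner can be illustrated with a small sketch. This is not the FrontierAudit protocol itself, whose thresholds and certificate are not specified in this summary. The function name `audit_winner`, the linear utility `accuracy - weight * cost`, the `tie_margin` default and the three output labels are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def audit_winner(
    policies: Dict[str, Tuple[float, float]],
    cost_weights: List[float],
    tie_margin: float = 0.01,
) -> str:
    """Classify a winner claim across a sweep of cost weights.

    Hypothetical sketch: `policies` maps a policy name to (accuracy, cost),
    and utility is assumed to be accuracy - weight * cost.
    Requires at least two policies.
    """
    winners = set()
    near_tie = False
    for weight in cost_weights:
        # Rank policies by utility under this cost weight, best first.
        ranked = sorted(
            ((acc - weight * cost, name) for name, (acc, cost) in policies.items()),
            reverse=True,
        )
        (best_utility, best_name), (second_utility, _) = ranked[0], ranked[1]
        winners.add(best_name)
        # Flag any cost weight where the top two are closer than the margin.
        if best_utility - second_utility < tie_margin:
            near_tie = True

    if len(winners) > 1:
        return "cost-profile dependent"
    if near_tie:
        return "near tie"
    return "exact winner supported"


# Illustrative numbers only, not results from the case studies.
policies = {"small": (0.80, 1.0), "large": (0.90, 5.0)}
print(audit_winner(policies, [0.0, 0.01, 0.05]))
```

With these made-up numbers, "large" wins at low cost weights and "small" wins at the highest one, so the sketch reports a cost-profile-dependent claim instead of an exact winner. Restricting the sweep to the two lowest weights would leave "large" as the sole winner by a clear margin.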