AI Benchmarks Need a New Standard
AI benchmarks are flawed. The standard methods used to test AI capabilities diverge from how AI is applied in the real world, and new methods are needed to replace them.
Existing benchmarks compare AI models against individual humans on isolated problems with clear-cut answers, such as chess, advanced mathematics, coding, and essay writing. Because this approach is easy to standardize, compare, and turn into rankings and scores, it has been widely adopted across both industry and academia.
The problem lies in the gap between how AI is actually used and how benchmarks measure it. Real-world AI does not operate in isolation. It interacts with multiple people in complex, uncertain environments, and its true performance only becomes apparent over extended use.
Observations of FDA-approved radiology AI models in hospital radiology departments in California, USA, and London, UK, illustrate this. On benchmarks, these models interpreted medical images faster and more accurately than professional radiologists. In actual hospital settings, however, interpreting the AI outputs in line with hospital-specific reporting standards and country-specific regulatory requirements took additional time. AI lauded as a productivity tool on benchmarks ended up causing workflow delays in practice.
As this discrepancy between benchmark performance and real-world performance keeps recurring, the need has emerged for new standards that evaluate how AI operates within human teams, workflows, and organizations over extended periods. The 'HAIC benchmark (Human–AI, Context-Specific Evaluation)' has been proposed as an alternative, based on research into real-world AI deployments since 2022 at small and medium-sized enterprises (SMEs) and at healthcare, humanitarian, non-profit, and higher education institutions in the UK, US, and Asia. This approach shifts from measuring performance on a single task to comprehensively assessing how AI interacts with the various stakeholders within a real organization and what outcomes it produces.