AI Benchmarks Need a New Standard

Reporter 박세미 · 3/31/2026, 10:46:53 PM · Updated 5/12/2026, 5:09:56 AM

AI benchmarks are flawed. The standard methods used to evaluate AI capabilities diverge from how AI is used in practice, and new methods are needed to replace them.

Existing benchmarks compare AI models against individual humans on isolated problems with clear-cut answers, in areas such as chess, advanced mathematics, coding, and essay writing. Because this approach is easy to standardize and compare, and readily yields scores and rankings, it has been widely adopted across both industry and academia.

The issue arises from the gap between how AI is actually used and how benchmarks measure it. Real-world AI does not operate in isolation. It interacts with multiple people in complex, uncertain environments, and its performance is only truly revealed through extended usage.

One case study observed FDA-approved radiology AI models in the radiology departments of hospitals in California, USA, and London, UK. On benchmarks, these models interpreted medical images faster and more accurately than professional radiologists. In actual hospital settings, however, more time was needed to interpret the AI outputs in line with hospital-specific reporting standards and country-specific regulatory requirements. AI lauded as a productivity tool in benchmarks ended up causing workflow delays in practice.

As the gap between benchmark performance and real-world performance keeps recurring, there is a growing need for new standards that evaluate how AI operates within human teams, workflows, and organizations over extended periods. The 'HAIC benchmark (Human–AI, Context-Specific Evaluation)' has been proposed as an alternative, based on research since 2022 into actual AI deployments at SMEs and at healthcare, humanitarian, non-profit, and higher education institutions in the UK, US, and Asia. Rather than measuring performance on a single task, this approach comprehensively assesses how AI interacts with various stakeholders within a real organization and what outcomes it produces.


