Designing Benchmarks for Knowledge Work in AI: Bridging Performance and Real-World Application

This paper proposes a three-step framework for designing benchmarks for LLM agents engaged in knowledge work. It targets the gap between what traditional NLP evaluations measure and real-world competence by making three things explicit: the work activity, the test setting, and the scoring.

The development of LLM agents for complex knowledge work (such as coding, research, and healthcare) requires rigorous evaluation methods that move beyond traditional NLP tasks. This research introduces a three-step approach to ensure benchmark performance reliably reflects real-world capabilities:

1. Defining the work activity under evaluation.
2. Specifying the tested environment, including materials, tools, roles, and constraints.
3. Scoring the resulting work product.
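
As a rough illustration of the three steps above, a benchmark description could be written down as a small record and checked for parts left unspecified. This is a minimal sketch and not the authors' tooling: the class names, field names, the `unspecified` helper, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Step 2: the tested setting (field names are hypothetical)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkSpec:
    """Hypothetical record tying the three steps together."""
    work_activity: str        # Step 1: the activity under evaluation
    environment: Environment  # Step 2: materials, tools, roles, constraints
    scoring: str              # Step 3: how the work product is scored

    def unspecified(self) -> list[str]:
        """Return the names of the parts of the spec that were left empty."""
        gaps = []
        if not self.work_activity.strip():
            gaps.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.environment, name):
                gaps.append(f"environment.{name}")
        if not self.scoring.strip():
            gaps.append("scoring")
        return gaps


# Illustrative values only; not taken from the paper.
spec = BenchmarkSpec(
    work_activity="Review a pull request for defects",
    environment=Environment(
        materials=["repository snapshot", "pull request diff"],
        tools=["test runner"],
        roles=["reviewer"],
    ),
    scoring="Rubric applied to the written review",
)
print(spec.unspecified())  # ['environment.constraints']
```

Here the example spec names its materials, tools, and roles but no constraints, so the check reports that one part of the setting as undefined.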

The authors draw on work studies, noting that knowledge work is structured around roles, materials, tools, and artifacts. They translate these concerns into benchmark design guidance, defining how tasks should map to work activities and how scores should focus on the system's delivered output.

To operationalize this, the approach derives an inventory of 18 work activities from the O*NET occupational task database. The approach is demonstrated through case studies including GDPval (occupational deliverables), OfficeQA Pro (document analysis), and APEX-SWE (software engineering). These show how benchmark design choices determine the work claim a score can support, and they expose gaps between the task, the setting, the work product, and the overall work claim.

Source

arXiv – AI · arxiv.org

