This paper proposes a three-step framework for designing benchmarks for LLM agents engaged in knowledge work. It addresses the gap left by traditional NLP evaluations, which fail to measure real-world competence, by requiring benchmark designers to define three things explicitly: the work activities covered, the setting in which they are tested, and how the results are scored.
The development of LLM agents for complex knowledge work (such as coding, research, and healthcare) requires rigorous evaluation methods that move beyond traditional NLP tasks. This research introduces a three-step approach intended to ensure that benchmark performance reliably reflects real-world capabilities.
The authors draw on work studies, noting that knowledge work is structured around roles, materials, tools, and artifacts. They translate these concerns into benchmark design guidance, defining how tasks should map to work activities and how scores should focus on the system's delivered output.
To operationalize this, the approach derives an inventory of 18 work activities from the O*NET occupational task database. The methodology is demonstrated through case studies of GDPval (occupational deliverables), OfficeQA Pro (document analysis), and APEX-SWE (software engineering). These show how benchmark design choices directly determine which claims about work a score can support, and they reveal critical gaps between the task, the setting, the product, and the overall work claim.
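As a rough illustration of the gap analysis described above, the following Python sketch records what a benchmark tests alongside what its score is claimed to support, then lists the claims that no task covers. This is not the authors' tooling: the `BenchmarkSpec` class, the `unsupported_claims` function, and the activity names are all hypothetical, and the activity names are not drawn from the paper's O*NET-derived inventory.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkSpec:
    """Hypothetical record of what a benchmark tests and what it claims."""
    name: str
    tested_activities: set[str]    # activities the tasks actually exercise
    claimed_activities: set[str]   # activities the score is said to support
    setting: str                   # materials and tools available to the agent
    scored_product: str            # the delivered output that the score grades


def unsupported_claims(spec: BenchmarkSpec) -> set[str]:
    """Return claimed activities that no task in the benchmark exercises."""
    return spec.claimed_activities - spec.tested_activities


# Illustrative values only; not taken from any real benchmark.
example = BenchmarkSpec(
    name="toy-benchmark",
    tested_activities={"analyze documents"},
    claimed_activities={"analyze documents", "coordinate with others"},
    setting="static document set, no external tools",
    scored_product="written answer",
)

print(sorted(unsupported_claims(example)))
```

A set difference is the simplest possible check and only covers the activity dimension. Comparing settings or scored products would need a richer representation than the free-text fields assumed here.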