AI scores 64% on $500K knowledge work benchmark spanning law, medicine and more

Mercor, an AI data company, has released the AI Productivity Index (APEX), a comprehensive benchmark that tests whether AI models can perform high-value knowledge work across law, medicine, finance, and management consulting. The benchmark marks a shift away from abstract AI testing toward directly measuring models’ ability to complete economically valuable tasks that professionals typically handle.

What you should know: APEX consists of 200 carefully designed tasks created by experienced professionals from top-tier firms, with input from former McKinsey executives, Harvard Business School leadership, and Harvard Law professors.

  • Tasks include diagnosing patients based on multimedia evidence, providing legal advice on estate planning, and conducting financial valuations of healthcare technology companies.
  • The benchmark was developed at a cost of over $500,000, contracting white-collar professionals averaging 7.25 years of experience from Goldman Sachs, JPMorgan, McKinsey, Boston Consulting Group, and other prestigious firms.
  • Mercor pays these domain experts competitively, with rates averaging $81 per hour and reaching over $200 per hour for senior experts—equivalent to roughly $400,000 annually.

How current AI models performed: OpenAI’s latest models show dramatic improvement but still fall short of human-level performance on complex knowledge work.

  • GPT-4o, released in May 2024, scored 35.9% on the benchmark.
  • GPT-5, released just over a year later, achieved 64.2%—the highest score recorded.
  • However, GPT-5 achieved perfect scores on only two of the 200 tasks, both involving “basic reasoning, simple calculations, and a lot of basic information searching.”
  • Work that doesn’t hit 100% accuracy “might be effectively useless,” according to the paper’s authors.

The big picture: This benchmark reflects the evolution of AI testing from abstract puzzles to real-world professional tasks, mirroring how AI capabilities have advanced.

  • Earlier AI benchmarks relied on crowdworker services paying a few dollars per hour, while current tests require highly skilled professionals earning hundreds of dollars hourly.
  • “AI got its Ph.D.,” says Brendan Foody, Mercor’s 22-year-old CEO. “Now it’s starting to enter the job market.”
  • The shift parallels AI’s progression in other fields—games like Go were conquered by 2016, software engineering benchmarks emerged in 2023, and now white-collar professional work is being systematically tested.

Current limitations: APEX acknowledges several constraints that prevent it from fully replicating human professional work.

  • The benchmark focuses on “well scoped deliverables” rather than open-ended tasks that might have multiple correct solutions.
  • AI outputs are entirely text-based, not testing models’ ability to use computers as human workers do.
  • Task descriptions require lengthy, detailed prompts that “would be more tedious than just doing it yourself,” according to finance task creator Matt Seck.

Why this matters: The benchmark arrives as AI models increasingly compete with human professionals across knowledge-intensive industries.

  • A separate OpenAI benchmark published Thursday showed expert human evaluators preferred AI work to human work 47.6% of the time across 220 tasks.
  • OpenAI’s models more than doubled their “win rate” against humans between June 2024 and September 2025.
  • The development suggests AI is transitioning from academic curiosity to practical workforce competition, with potential implications for employment in high-skilled professions.

What they’re saying: Industry experts emphasize the significance of measuring AI’s economic utility rather than abstract capabilities.

  • “Getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to,” explains Osvald Nitski, one of the paper’s authors.
  • “It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former Bank of America investment banking analyst now contracted by Mercor.