AI scores 64% on $500K knowledge work benchmark spanning law, medicine and more

Mercor, an AI data company, has released the AI Productivity Index (APEX), a comprehensive benchmark that tests whether AI models can perform high-value knowledge work across law, medicine, finance, and management consulting. The benchmark marks a shift away from abstract AI testing toward directly measuring models’ ability to complete economically valuable tasks that professionals typically handle.

What you should know: APEX consists of 200 carefully designed tasks created by experienced professionals from top-tier firms, with input from former McKinsey executives, Harvard Business School leadership, and Harvard Law professors.

  • Tasks include diagnosing patients based on multimedia evidence, providing legal advice on estate planning, and conducting financial valuations of healthcare technology companies.
  • The benchmark was developed at a cost of over $500,000, contracting white-collar professionals averaging 7.25 years of experience from Goldman Sachs, JPMorgan, McKinsey, Boston Consulting Group, and other prestigious firms.
  • Mercor pays these domain experts competitively, with rates averaging $81 per hour and reaching over $200 per hour for senior experts—equivalent to roughly $400,000 annually.

How current AI models performed: OpenAI’s latest models show dramatic improvement but still fall short of human-level performance on complex knowledge work.

  • GPT-4o, released in May 2024, scored 35.9% on the benchmark.
  • GPT-5, released just over a year later, achieved 64.2%—the highest score recorded.
  • However, GPT-5 achieved perfect scores on only two of the 200 tasks, both involving “basic reasoning, simple calculations, and a lot of basic information searching.”
  • Work that doesn’t hit 100% accuracy “might be effectively useless,” according to the paper’s authors.

The big picture: This benchmark reflects the evolution of AI testing from abstract puzzles to real-world professional tasks, mirroring how AI capabilities have advanced.

  • Earlier AI benchmarks relied on crowdworker services paying a few dollars per hour, while current tests require highly skilled professionals earning hundreds of dollars hourly.
  • “AI got its Ph.D.,” says Brendan Foody, Mercor’s 22-year-old CEO. “Now it’s starting to enter the job market.”
  • The shift parallels AI’s progression in other fields—games like Go were conquered by 2016, software engineering benchmarks emerged in 2023, and now white-collar professional work is being systematically tested.

Current limitations: APEX acknowledges several constraints that prevent it from fully replicating human professional work.

  • The benchmark focuses on “well scoped deliverables” rather than open-ended tasks that might have multiple correct solutions.
  • AI outputs are entirely text-based, not testing models’ ability to use computers as human workers do.
  • Task descriptions require lengthy, detailed prompts that “would be more tedious than just doing it yourself,” according to finance task creator Matt Seck.

Why this matters: The benchmark arrives as AI models increasingly compete with human professionals across knowledge-intensive industries.

  • A separate OpenAI benchmark published Thursday showed expert human evaluators preferred AI work to human work 47.6% of the time across 220 tasks.
  • OpenAI’s models more than doubled their “win rate” against humans between June 2024 and September 2025.
  • The development suggests AI is transitioning from academic curiosity to practical workforce competition, with potential implications for employment in high-skilled professions.

What they’re saying: Industry experts emphasize the significance of measuring AI’s economic utility rather than abstract capabilities.

  • “Getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to,” explains Osvald Nitski, one of the paper’s authors.
  • “It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former Bank of America investment banking analyst now contracted by Mercor.