News

A new study from researchers at Amazon, Stanford, MIT, and other institutions reveals major flaws in AI agent benchmarks, finding that they can misestimate performance by up to 100%.