Key facts
- Crosby, a legal tech startup, has launched the Redline Bench to evaluate AI model performance on legal tasks.
- The benchmark specifically assesses AI's ability in contract review, a critical function for law firms.
- The tool uses a rubric created by experienced lawyers to judge AI-generated contract edits.
- Initial results placed ChatGPT 5.5 at the top with a 50.5% score, followed by Gemini 3.5 Flash and Claude Opus 4.8.
- Crosby intends to release the Redline Bench publicly and publish comparative reports on major AI models.
The legal industry is grappling with how to assess the quality of work produced by artificial intelligence tools, a challenge distinct from coding, where performance is more easily quantified. Crosby, a startup that combines legal services with technology development, has launched the Redline Bench to address this gap.
The Redline Bench is designed to measure how well AI models perform real-world legal tasks, beginning with contract review. According to Crosby founder Ryan Daniels, defining 'good' legal work is more ambiguous than defining 'good' code, which can either run or break. Legal contract edits, for instance, can be defensible in multiple ways, leading to differing opinions among lawyers.
To create the benchmark, Crosby's team of engineers and lawyers simulated software deals. Senior lawyers marked crucial contract changes at each negotiation stage, which were then translated into weighted criteria. When testing AI models, Crosby provides them with contracts and asks them to make edits. A panel of three judges then compares these AI-generated redlines against the lawyer-built rubric, voting on whether each edit meets the prioritized criteria. The final score reflects how often models made edits that lawyers deemed important.
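The scoring procedure described above can be illustrated with a minimal Python sketch. It is not Crosby's implementation. The criterion names, the weights, the simple-majority rule for the three judges, and the weighted-fraction formula are all assumptions made for illustration, since the article does not specify how votes or weights are combined.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item: an edit that senior lawyers marked as important."""
    name: str      # hypothetical label, not taken from the real rubric
    weight: float  # hypothetical weight reflecting the edit's priority


def criterion_met(votes: list[bool]) -> bool:
    """Assumption: a criterion counts as met when a majority of judges vote yes."""
    return sum(votes) > len(votes) / 2


def redline_score(rubric: list[Criterion], votes: dict[str, list[bool]]) -> float:
    """Assumption: the score is the weighted share of criteria met, as a percentage."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if criterion_met(votes[c.name]))
    return 100.0 * earned / total


# Illustrative rubric and judge votes for one model's redline (made-up data).
rubric = [
    Criterion("cap liability", 3.0),
    Criterion("narrow indemnity", 2.0),
    Criterion("fix notice period", 1.0),
]
votes = {
    "cap liability": [True, True, False],     # 2 of 3 judges: met
    "narrow indemnity": [False, False, True],  # 1 of 3 judges: not met
    "fix notice period": [True, True, True],   # 3 of 3 judges: met
}

score = redline_score(rubric, votes)
print(f"{score:.1f}%")  # prints 66.7%
```

Under these assumptions, a model that misses one heavily weighted edit loses more than a model that misses several minor ones, which is one plausible reading of why the lawyers' criteria were weighted.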
Crosby plans to make the Redline Bench publicly available for any AI lab to use and will regularly release reports comparing major models. In the initial release, ChatGPT 5.5 scored 50.5%, meaning its edits matched roughly half of the edits the lawyers had prioritized. Gemini 3.5 Flash followed with 45.1%, and Claude Opus 4.8 scored 44.4%. Crosby also tested Anthropic's Fable 5, which scored 47.3% before being temporarily withdrawn from the market.
Other companies, including Harvey, Anthropic, and OpenAI, also develop benchmarks, but Daniels suggests that internal benchmarks may be less trustworthy because labs can tune their systems to perform well on their own tests. Reliable AI benchmarks matter because billions in investment are tied to the promise that AI will reduce legal costs. Lawyers' adoption of these tools hinges on trust, which Crosby aims to foster with its transparent evaluation system.