Crosby launches benchmark to assess AI legal task performance

Created at 17 Jun · 9:41 AM · 1 source · Market-relevant

IN SHORT

Legal tech startup Crosby has released the Redline Bench, a new tool designed to evaluate how well artificial intelligence models perform real-world legal tasks, starting with contract review. The benchmark aims to provide a standardized measure for AI performance in the legal sector.

Key Numbers

50.5% · ChatGPT 5.5 score on Redline Bench

45.1% · Gemini 3.5 Flash score on Redline Bench

44.4% · Claude Opus 4.8 score on Redline Bench

47.3% · Anthropic's Fable 5 score on Redline Bench

Who's Involved

Crosby

Legal tech startup that developed the Redline Bench

Ryan Daniels

Founder of Crosby and former in-house lawyer

John Sarihan

Co-founder of Crosby

Sharan Ramjee

Engineer at Crosby, formerly at Stripe

Ross Weiser

Lawyer at Crosby, formerly at Sullivan & Cromwell

Micro1

Company partnered with Crosby to recruit expert lawyers

ChatGPT 5.5

AI model tested by Redline Bench

Gemini 3.5 Flash

AI model tested by Redline Bench

Key facts

Crosby, a legal tech startup, has launched the Redline Bench to evaluate AI model performance on legal tasks.

The benchmark specifically assesses AI's ability in contract review, a critical function for law firms.

The tool uses a rubric created by experienced lawyers to judge AI-generated contract edits.

Initial results placed ChatGPT 5.5 at the top with a 50.5% score, followed by Gemini 3.5 Flash (45.1%) and Claude Opus 4.8 (44.4%).

Crosby intends to release the Redline Bench publicly and publish comparative reports on major AI models.

The legal industry is grappling with how to assess the quality of work produced by artificial intelligence tools, a challenge distinct from coding, where performance is more easily quantified. Crosby, a startup that combines legal services with technology development, has launched the Redline Bench to address this gap.

The Redline Bench is designed to measure how well AI models perform real-world legal tasks, beginning with contract review. According to Crosby founder Ryan Daniels, defining 'good' legal work is more ambiguous than defining 'good' code, which either runs or breaks. A contract edit, by contrast, can be defensible in multiple ways, so lawyers can reasonably disagree about it.

To create the benchmark, Crosby's team of engineers and lawyers simulated software deals. Senior lawyers marked crucial contract changes at each negotiation stage, which were then translated into weighted criteria. When testing AI models, Crosby provides them with contracts and asks them to make edits. A panel of three judges then compares these AI-generated redlines against the lawyer-built rubric, voting on whether each edit meets the prioritized criteria. The final score reflects how often models made edits that lawyers deemed important.
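The scoring described above can be illustrated with a minimal sketch. Crosby has not published its exact formula in this account, so the details here are assumptions: a criterion counts as met when a strict majority of the three judges votes for it, and the final score is the weighted share of criteria met. The names `Criterion`, `criterion_passes`, and `rubric_score`, along with the sample rubric and votes, are hypothetical and are not Crosby data.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item: an edit that senior lawyers marked as important."""
    description: str
    weight: float  # hypothetical importance weight


def criterion_passes(votes: list[bool]) -> bool:
    """Assumed aggregation: pass on a strict majority of judge votes."""
    return sum(votes) > len(votes) / 2


def rubric_score(criteria: list[Criterion], votes: list[list[bool]]) -> float:
    """Weighted share of criteria the judges accepted, as a percentage."""
    if len(criteria) != len(votes):
        raise ValueError("need one list of votes per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(c.weight for c, v in zip(criteria, votes) if criterion_passes(v))
    return 100.0 * earned / total


# Illustrative, made-up rubric and judge votes (not Crosby data).
criteria = [
    Criterion("Cap liability at fees paid", 3.0),
    Criterion("Narrow the indemnification clause", 2.0),
    Criterion("Fix the governing-law reference", 1.0),
]
votes = [
    [True, True, False],   # passes 2-1
    [False, False, True],  # fails 1-2
    [True, True, True],    # passes 3-0
]
score = rubric_score(criteria, votes)
print(f"{score:.1f}%")  # prints 66.7%
```

In this toy case the model earns 4 of 6 weighted points. Under the same assumptions, a score of 50.5% would mean a model captured about half of the weighted edits that lawyers prioritized.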

Crosby plans to make the Redline Bench publicly available for any AI lab to use and will regularly release reports comparing major models. In the initial release, ChatGPT 5.5 scored 50.5%, meaning its edits matched roughly half of the changes that lawyers had prioritized. Gemini 3.5 Flash followed with 45.1%, and Claude Opus 4.8 scored 44.4%. Crosby also tested Anthropic's Fable 5, which scored 47.3%, the second-highest result, before the model was temporarily withdrawn from the market.

While other companies such as Harvey, Anthropic, and OpenAI also develop benchmarks, Daniels suggests that internal benchmarks may be less trustworthy because labs can tune their systems to perform well on their own tests. Reliable AI benchmarks matter because billions in investment are tied to the promise of AI reducing legal costs. Lawyers' adoption of these tools hinges on trust, which Crosby aims to foster with its transparent evaluation system.

Frequently asked questions

What is the Redline Bench?

The Redline Bench is a tool developed by Crosby to measure how well artificial intelligence models perform real-world legal tasks, starting with contract review.

Why is it difficult to measure AI performance in law?

Unlike coding, where 'good' or 'bad' is often binary, legal work like contract editing can be defensible in multiple ways, making objective evaluation challenging.

How does the Redline Bench work?

Senior lawyers create a rubric of important contract changes. AI models edit contracts, and judges compare these edits against the rubric, voting pass or fail on criteria.

Which AI models performed best in the initial test?

In the first release, ChatGPT 5.5 scored highest at 50.5%, followed by Gemini 3.5 Flash at 45.1% and Claude Opus 4.8 at 44.4%.


PiQ Daily


