Web3 AI benchmark finds no top models ready for high-stakes tasks

Created at 31 May · 22:03 UTC · 1 source · Market-relevant

IN SHORT

A new benchmark test of 31 AI models, including GPT-5, Claude, and Gemini, found that no system is currently prepared for the most critical tasks in Web3 AI. The study, which involved 3,543 expert questions, was presented at KDD 2026.

Key Numbers

31 top AI models tested

3,543 expert questions

Who's Involved

DMind AI

Quantified the gap in Web3 AI safety

KDD 2026

Conference where the findings were presented

GPT-5, Claude, Gemini

Models included in the benchmark tests

Key facts

A peer-reviewed benchmark tested 31 top AI models, including GPT-5, Claude, and Gemini.
The tests covered 3,543 expert questions related to Web3 AI.
No AI system tested was deemed ready for the highest-stakes tasks in Web3 AI.
The findings were presented at KDD 2026.

Specialized fields such as medicine and finance already have dedicated AI benchmarks (MedQA and FinBen, respectively), and AI development in those fields has progressed significantly. The application of AI in the Web3 space, which often involves high-stakes financial and security-related tasks, has lagged because of safety and reliability concerns. This benchmark aims to quantify the current capabilities and limitations of leading AI models in this domain.


FREQUENTLY ASKED

The benchmark is the first peer-reviewed test of whether AI models are ready for high-stakes tasks in the Web3 domain. It evaluates 31 top models across 3,543 expert questions.

The benchmark included 31 top models, such as GPT-5, Claude, and Gemini.

The study concluded that no AI system tested is currently ready for the most critical tasks within Web3 AI.

The results of the benchmark were presented at KDD 2026.


PiQ Daily
