Key facts
- A peer-reviewed benchmark tested 31 top AI models, including GPT-5, Claude, and Gemini.
- The tests covered 3,543 expert questions related to Web3 AI.
- No AI system tested was deemed ready for the highest-stakes tasks in Web3 AI.
- The findings were presented at KDD 2026.
Artificial intelligence has progressed significantly in specialized fields such as medicine and finance, where benchmarks like MedQA and FinBen measure model performance. However, its application in Web3, which often involves high-stakes financial and security-related tasks, has lagged because of safety and reliability concerns. This benchmark aims to quantify the current capabilities and limitations of leading AI models in this domain.