Key facts
- DeepReinforce released Ornith-1.0, a family of open-source LLMs for AI coding agents.
- Models range from 9 billion to 397 billion parameters and are MIT licensed.
- The 397B model achieved 82.4 on SWE-bench Verified, outperforming Claude Opus 4.7.
- The 9B model scored 69.4 on SWE-bench Verified, outperforming Gemma 4-31B.
- Ornith-1.0 is optimized for agentic coding tasks and may underperform on general conversations.
DeepReinforce has released Ornith-1.0, a new family of open-source large language models specifically engineered for AI coding agents. Available in four sizes, ranging from 9 billion to a flagship 397 billion parameters, these models are designed to operate within real terminal and repository environments, performing tasks autonomously without constant human guidance.
The Ornith models are built with an 'agentic' approach, meaning they are trained to take actions and complete multi-step coding tasks, such as fixing bugs and refining code, by developing their own strategies. This contrasts with traditional conversational AI models. DeepReinforce emphasizes that Ornith-1.0 is not intended for general-purpose AI conversations or tasks like document summarization, as its performance may be suboptimal outside of developer pipelines.
Benchmark results highlight Ornith-1.0's capabilities. The 397-billion-parameter model scored 82.4 on SWE-bench Verified, ahead of Claude Opus 4.7 (80.8) and DeepSeek-V4-Pro (80.6). On Terminal Bench 2.1, the 397B model scored 77.5 to Claude Opus 4.7's 70.3. Even the 9-billion-parameter model performed strongly, scoring 69.4 on SWE-bench Verified, which is competitive with larger models like Qwen 3.5-35B and 17.4 points above Google's Gemma 4-31B (52.0).
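To make the size of these gaps concrete, the reported scores can be tabulated and compared directly. The snippet below is a minimal sketch that uses only the figures quoted in this article; the dictionary keys and the `margin` helper are illustrative names, not part of any DeepReinforce tooling.

```python
# Scores exactly as quoted in the article. Models without a quoted score
# (for example Qwen 3.5-35B) are left out rather than guessed.
SWE_BENCH_VERIFIED = {
    "Ornith-1.0 397B": 82.4,
    "Claude Opus 4.7": 80.8,
    "DeepSeek-V4-Pro": 80.6,
    "Ornith-1.0 9B": 69.4,
    "Gemma 4-31B": 52.0,
}
TERMINAL_BENCH_2_1 = {
    "Ornith-1.0 397B": 77.5,
    "Claude Opus 4.7": 70.3,
}


def margin(scores, model, baseline):
    """Return how many points `model` leads `baseline` by, to one decimal."""
    return round(scores[model] - scores[baseline], 1)


if __name__ == "__main__":
    pairs = [
        (SWE_BENCH_VERIFIED, "Ornith-1.0 397B", "Claude Opus 4.7"),
        (SWE_BENCH_VERIFIED, "Ornith-1.0 397B", "DeepSeek-V4-Pro"),
        (SWE_BENCH_VERIFIED, "Ornith-1.0 9B", "Gemma 4-31B"),
        (TERMINAL_BENCH_2_1, "Ornith-1.0 397B", "Claude Opus 4.7"),
    ]
    for scores, model, baseline in pairs:
        print(f"{model} vs {baseline}: +{margin(scores, model, baseline)}")
```

On these figures, the flagship's lead over the closed models on SWE-bench Verified is under two points, while its Terminal Bench 2.1 lead and the 9B model's lead over Gemma 4-31B are considerably wider.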
DeepReinforce has implemented defenses against reward hacking, a potential issue with self-improving models. These include immutable environments, deterministic monitors, and a frozen judge model, all intended to ensure the AI's actions are genuine and do not exploit the training process. While Ornith-1.0 shows impressive results on coding-specific benchmarks, the company notes that Anthropic's latest flagship, Claude Opus 4.8, scores higher, and that Ornith's main competitive advantage lies among open-source models of comparable parameter counts on agentic coding tasks.
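The article does not say how DeepReinforce implements these defenses. As a purely illustrative sketch of the general idea behind an immutable environment paired with a deterministic check, the snippet below hashes a set of protected files (such as a test suite) before an agent runs and flags any that changed or disappeared afterwards. The function names `snapshot` and `tampered`, and the choice of SHA-256 file hashing, are assumptions made for this example and are not taken from DeepReinforce.

```python
import hashlib
from pathlib import Path


def _digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(root, protected):
    """Record a digest for each protected file (paths relative to root)."""
    return {rel: _digest(Path(root) / rel) for rel in protected}


def tampered(root, baseline):
    """Return the protected files that were modified or removed.

    Hypothetical check: an episode whose result is non-empty would be
    rejected, since the agent may have edited the tests instead of the code.
    """
    changed = []
    for rel, digest in baseline.items():
        path = Path(root) / rel
        if not path.is_file() or _digest(path) != digest:
            changed.append(rel)
    return sorted(changed)
```

The check is deterministic in the sense that the same files always produce the same verdict, so an agent cannot earn reward by rewriting the tests it is graded against without being detected.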
