AI entrepreneurs are increasingly turning their attention to 'world models,' a concept aimed at equipping artificial intelligence with a deeper understanding of the physical world, moving beyond the capabilities of current large language models (LLMs).
Computer scientist Louis Castricato spent years on LLM research, came to see it as having reached a plateau, and founded Overworld to develop AI that can navigate and comprehend physical environments. The shift is driven by the belief that real intelligence requires more than processing text: an AI system also needs to grasp spatial and temporal dynamics, physics, and object interactions.
Prominent figures in AI, including Fei-Fei Li and Yann LeCun, are championing the development of world models. Li describes them as learning the statistical structure of space and time, enabling AI to understand phenomena like light, perspective, and physics. LeCun views them as tools for AI agents to predict the outcomes of their actions.
While LLMs excel at tasks like generating text, code, and images, proponents of world models argue that LLMs fall short on tasks requiring physical interaction, such as manipulating objects. Martial Hebert, dean of computer science at Carnegie Mellon University, emphasizes that understanding geometry, movement, and physical contact is far more complex than predicting the next word.
World models are seen as a crucial step toward 'physical AI' and embodied robotics, letting AI work more like a human brain, which relies on general models for balance and adaptation. Applications extend beyond robotics to video game development, where Overworld is creating interactive virtual worlds, and to weather prediction, which Causal Labs is pursuing.
Venture capitalists such as Steve Jang of Kindred Ventures are actively investing in this emerging field, backing companies like Overworld and Causal Labs as well as specialized chip developers like Extropic. Li, meanwhile, is working to clarify the concept by developing a taxonomy of world models that separates approaches by what they prioritize, such as 'renderers,' which prioritize visual fidelity.