
Fei-Fei Li: World Models and the Multiverse
What if the next leap in artificial intelligence isn’t about better language—but better understanding of space? In this episode, a16z General Partner Erik Torenberg moderates a conversation with Fei-Fei Li, cofounder and CEO of World Labs, and a16z General Partner Martin Casado, an early investor in the company. Together, they dive into the concept of world models—AI systems that can understand and reason about the 3D, physical world, not just generate text. Often called the “godmother of AI,” Fei...
🌌 The Vision: Spatial Intelligence as AI's Missing Piece
The conversation opens with a bold declaration about the next frontier of artificial intelligence. While the tech world has been captivated by large language models and their text-based capabilities, there's a fundamental dimension missing from our AI systems: spatial intelligence. This isn't just about processing 3D coordinates or geometric data—it's about understanding the rich, complex, physical world that surrounds us.
Spatial intelligence represents our ability to comprehend and navigate three-dimensional space, both in the physical world and in our mind's eye. It's what allows us to visualize how objects relate to each other, how spaces connect, and how we might move through and manipulate our environment. For AI systems, developing this capability could unlock entirely new possibilities.
"Space, the 3D space, the space out there, the space in your mind's eye: spatial intelligence is a critical part of intelligence. Suddenly we can actually create infinite universes. Some are for robots, some are for creativity, some are for socialization, some are for travel, some are for storytelling. It suddenly will enable us to live in the multiverse." - Fei-Fei Li
The implications stretch far beyond technical advancement. With spatial intelligence, AI systems could generate infinite virtual universes tailored for different purposes—environments designed specifically for robotic training, creative exploration, social interaction, virtual travel experiences, or immersive storytelling. This technology could fundamentally change how we interact with digital spaces and potentially enable us to truly live in a multiverse of experiences.
👑 The Godmother of AI: Fei-Fei Li's Revolutionary Contributions
Fei-Fei Li's impact on artificial intelligence extends far beyond any single breakthrough—she fundamentally transformed how the field approaches one of its most critical components: data. While many researchers focused on refining neural network architectures and algorithms, Li recognized that the real challenge lay in something more fundamental.
Her background spans both industry leadership and academic excellence. She has served on Twitter's board of directors, held executive positions at Google, and now leads World Labs as founder and CEO. But her most transformative contribution came through her recognition that artificial intelligence's progress was fundamentally limited not by computational power or algorithmic sophistication, but by the quality and scale of training data.
"Fei-Fei really singularly brought in data to the equation which now we're recognizing is actually probably the bigger problem, the more interesting one, and so she truly is the godmother of AI as everybody calls her." - Martin Casado
This insight proved prescient. Today's most successful AI systems—from large language models to computer vision applications—depend critically on massive, high-quality datasets. Li's work in creating ImageNet, a comprehensive visual database that enabled the deep learning revolution in computer vision, exemplifies this data-first approach. Her recognition that data quality and quantity could be more important than algorithmic novelty has shaped how the entire field approaches AI development.
The title "godmother of AI" reflects not just her technical contributions, but her role in nurturing an entire generation of AI researchers and setting foundational principles that continue to guide the field's development.
🦄 Finding the Unicorn Investor: More Than Money
Building a deep tech company requires more than capital—it demands partners who can navigate both the technical complexities and business challenges of bringing revolutionary technology to market. For Fei-Fei Li, this meant finding what she calls her "unicorn investor"—someone who could serve not just as a funding source, but as an intellectual partner throughout the journey.
Li's relationship with Martin Casado spans over a decade, beginning when she joined Stanford as a young assistant professor in 2009 while Casado was completing his PhD. This long-standing relationship provided a foundation of mutual respect and understanding that proved crucial when Li began formulating the ideas that would become World Labs.
The criteria for her ideal investor went far beyond financial capability. She needed someone with deep technical expertise as a computer scientist and AI researcher, combined with practical experience in product development, market strategy, and go-to-market execution. Most importantly, she sought someone who could engage as an intellectual peer in exploring uncharted technical territory.
"I was also particularly looking for an intellectual partner, because what we are doing at World Labs is very deep tech. We are trying to do something no one else has done. We know with a lot of conviction it will change the world, literally, but I need someone who is a computer scientist, who is a student of AI, who understands product, market, go-to-market, consumers, and just can be on the phone or in person with me every moment of the day as an intellectual partner." - Fei-Fei Li
This partnership model reflects the unique challenges of deep tech entrepreneurship, where the path from research breakthrough to market application requires navigating both technical unknowns and business complexities. The need for constant intellectual collaboration becomes even more critical when attempting to solve problems that no one has solved before.
💡 The "Aha" Moment: Recognizing What's Missing
Sometimes the most profound insights emerge from recognizing what everyone else is overlooking. The genesis of World Labs came through one of those crystallizing moments that can only happen when two minds with complementary perspectives align on a fundamental truth about the future of technology.
The setting was one of Mark's elegant dinners, filled with AI researchers and entrepreneurs, all buzzing with excitement about the latest breakthroughs in large language models. The conversation naturally centered on the impressive capabilities these systems were demonstrating with text and language processing. But both Li and Casado had independently begun to sense that something crucial was missing from this narrative.
Casado's background in image-focused investing had led him to question whether language-based AI represented the complete picture. Meanwhile, Li had been developing deeper intuitions about what AI systems would need to truly navigate and understand the world. The moment of connection came when Li leaned over during the dinner discussion.
"She leaned over and said, 'You know what we're missing?' And I said, 'What are we missing?' She said, 'We're missing a world model.' And I'm like, 'Yes.'" - Martin Casado recounting the conversation
This exchange revealed that both had arrived at similar conclusions through different paths. Li had spent months, perhaps years, developing a comprehensive vision of what AI needed to progress beyond language. Casado had developed a high-level intuition that language models, impressive as they were, couldn't represent the end of the AI story.
The concept of a "world model"—an AI system that truly understands the three-dimensional structure, physics, and relationships of the physical world—became the unifying framework that brought their perspectives together. This wasn't just about adding spatial capabilities to existing AI; it was about building a fundamentally different kind of intelligence.
🧪 The Litmus Test: Finding Someone Who Actually Gets It
One of the greatest challenges in pioneering technology is distinguishing between genuine understanding and polite acknowledgment. When you're working on concepts that don't yet exist, most conversations involve people nodding along without truly grasping the implications or technical depth of what you're proposing.
Li experienced this frustration repeatedly as she discussed her world model concept with various technologists, investors, and potential partners. The pattern was consistent: people would nod when she mentioned "world models," but she could sense the politeness behind their responses. They weren't connecting with the deeper technical vision or understanding why this represented such a fundamental shift in AI development.
This led to a crucial test. Li invited Casado to Stanford for coffee with a specific agenda—she wanted to hear him define "world model" in his own terms. This wasn't just about confirming interest; it was about validating whether he genuinely understood the technical depth and implications of what they were discussing.
"Can you define your world model to me? I really wanted to hear if Martin actually meant it, and the way he defined it, an AI model that truly understands the 3D structure, shape, and compositionality of the world, was exactly what I was talking about." - Fei-Fei Li
The test was successful. Casado's definition aligned precisely with Li's vision: an AI model capable of truly understanding three-dimensional structure, spatial relationships, and the compositional nature of physical reality. This wasn't surface-level agreement about a buzzword—it was deep technical alignment on a complex, multifaceted challenge.
"Wow, he's the only person so far I've talked to who actually meant it. It's not just nodding." - Fei-Fei Li
This moment of recognition became the foundation for their partnership, confirming that they shared not just enthusiasm but genuine technical understanding of the problem they wanted to solve.
🔮 The Surprises That Shaped AI's Journey
Looking back across a decade of extraordinary progress in artificial intelligence, even the field's pioneers find themselves surprised by how events unfolded. The path from academic research to transformative technology rarely follows predictable trajectories, and the recent AI revolution has been no exception.
For Li, one of the most surprising developments has been the sheer effectiveness of data-driven approaches. Despite being instrumental in bringing data-centric thinking to AI research, she continues to be amazed by how far these methods have progressed and the sophisticated behaviors that emerge from them.
The irony isn't lost on her—as the person who emphasized the importance of large-scale datasets in AI development, she remains emotionally surprised by just how powerful data-hungry models have become. The emergence of genuine reasoning capabilities, creative problem-solving, and sophisticated linguistic understanding from statistical learning approaches continues to feel remarkable, even to someone who helped enable these breakthroughs.
"It's ironic to say, because as Martin said, I was the person who brought data into the AI world, but I still continue to be so surprised emotionally that the data-hungry models, the data-driven AI, can come this far and genuinely have incredible emergent behaviors of a thinking machine." - Fei-Fei Li
This ongoing sense of surprise speaks to the unpredictable nature of scientific progress. Even experts who contribute foundational insights can be amazed by how those insights ultimately manifest in real-world applications. The emergent behaviors of modern AI systems—their ability to engage in complex reasoning, generate creative content, and demonstrate what appears to be understanding—continue to surprise even those who helped create the conditions for these capabilities to emerge.
🧭 Following the North Star: Problem-Driven Innovation
True innovation doesn't begin with business plans or market analysis—it starts with identifying fundamental problems that demand solutions. Li's approach to research and entrepreneurship exemplifies this principle, driven not by commercial opportunities but by deep intellectual conviction about what needs to be solved.
Her decision to start World Labs wasn't motivated by the desire to build another foundation model company or compete in the crowded LLM space. Instead, it emerged from years of contemplating a specific, profound limitation in current AI systems: their inability to truly understand and reason about the three-dimensional physical world.
"My intellectual journey is not about a company or papers; it's about finding the north star problem. It's not like I woke up and said I have to do a company. I woke up every day, day after day, for the past few years thinking that there is so much more than language." - Fei-Fei Li
Language, while incredibly powerful for encoding thoughts and information, represents only a fraction of how intelligent beings understand and interact with reality. It's fundamentally limited as a representation of the rich, complex, three-dimensional world that all living creatures inhabit and navigate.
The physical world presents challenges that language cannot adequately capture. Spatial relationships, physical interactions, visual understanding, and embodied reasoning require different forms of intelligence than those developed through text-based training. Animals and humans have evolved sophisticated perceptual and spatial reasoning capabilities over millions of years—capabilities that current AI systems largely lack.
"Language is a lossy way to capture the world... the entire physical perceptual visual world is there, and animals' entire evolutionary history is built upon so much perceptual and eventually embodied intelligence." - Fei-Fei Li
This recognition of language's limitations, combined with understanding that biological intelligence evolved primarily through interaction with the physical world, forms the core motivation behind World Labs' mission to develop spatial intelligence in AI systems.
🏗️ Building Civilization: The Power of Physical Intelligence
Human achievement extends far beyond communication and abstract reasoning—our greatest accomplishments have come through our ability to manipulate, construct, and reshape the physical world around us. This fundamental aspect of intelligence represents a crucial gap in current AI systems that focus primarily on language and text-based reasoning.
Throughout evolutionary history, intelligence developed primarily through interaction with the physical environment. Animals navigate complex three-dimensional spaces, manipulate objects, build structures, and adapt their physical surroundings to meet their needs. This embodied intelligence forms the foundation for higher-order cognitive capabilities.
Human civilization itself represents the ultimate expression of this physical intelligence. We don't just survive in the world—we actively transform it. Cities, infrastructure, technology, art, and architecture all represent humanity's capacity to envision physical possibilities and bring them into reality through construction and manipulation of materials and spaces.
"Humans not only survive, live, work, but we build civilization upon constructing the world and changing the world." - Fei-Fei Li
This insight reveals a critical limitation in current AI development. While language models can discuss construction techniques, architectural principles, or engineering concepts, they cannot actually understand the spatial relationships, physical properties, and three-dimensional reasoning required to design and build structures in the real world.
The transition from academic research to industrial application represents Li's recognition that solving these challenges requires more than theoretical exploration. The development of spatial intelligence in AI systems demands the concentrated effort, computational resources, and engineering focus that only industry-scale initiatives can provide.
"The time has come that concentrated, industry-grade effort, focused effort in terms of compute, data, and talent, is really the answer to bringing this to life." - Fei-Fei Li
💎 Key Insights
- Spatial intelligence represents a critical missing component in current AI systems, potentially enabling the creation of infinite virtual universes for different applications
- Data quality and quantity often matter more than algorithmic sophistication in AI development—a principle that shaped the entire field
- Deep tech entrepreneurship requires intellectual partners who can navigate both technical complexity and business challenges
- The most profound innovations often come from recognizing what everyone else is overlooking, rather than following current trends
- Language is fundamentally limited as a representation of the three-dimensional physical world that intelligent beings must navigate
- Human civilization is built upon our ability to construct and modify the physical world, not just communicate about it
- Transitioning breakthrough research into real-world applications requires industry-scale resources and focused effort
📚 References
People:
- Nick McKeown - Martin Casado's PhD advisor and mutual connection between Casado and Li
- Mark - Host of the dinner where Li and Casado had their "world model" breakthrough conversation
Concepts:
- ImageNet - Li's visual database that enabled the deep learning revolution in computer vision
- Large Language Models (LLMs) - Current AI systems focused on text-based reasoning and generation
- World Models - AI systems that understand three-dimensional structure, physics, and spatial relationships
- Spatial Intelligence - The ability to understand and reason about three-dimensional physical space
- Embodied Intelligence - Intelligence that develops through physical interaction with the environment
Companies:
- World Labs - Li's company focused on developing spatial intelligence in AI systems
- Twitter - Where Li served on the board of directors
- Google - Where Li held executive positions
- Stanford University - Where Li joined as assistant professor in 2009
🔍 The Blindfold Test: Why Language Fails in Physical Space
Understanding the fundamental limitations of language becomes crystal clear through a simple thought experiment that highlights the vast difference between linguistic description and spatial perception. This exercise reveals why current AI systems, despite their impressive language capabilities, struggle with real-world navigation and manipulation tasks.
Imagine being placed in a room while blindfolded, then trying to complete a task based solely on verbal descriptions. Someone might tell you there's a cup ten feet in front of you, with various objects positioned to your left and right. The inadequacy of this approach becomes immediately apparent—language simply cannot convey the precise spatial relationships, distances, orientations, and physical properties needed to navigate and interact with the environment effectively.
"If I put you in a room and I blindfolded you and I just described the room, and then I asked you to do a task, the chances of you being able to do it are very little. I'm like, 'Oh, ten feet in front of you is a cup, and on the left is this.' It's just this very inaccurate way to convey reality, because reality is so complex and it's so exact." - Martin Casado
The contrast becomes stark when the blindfold is removed. Visual perception allows the brain to instantly reconstruct the three-dimensional structure of the space, understanding precise spatial relationships, object properties, and potential interactions. This enables immediate navigation, manipulation, and task completion that would be nearly impossible through language alone.
This fundamental limitation explains why language-based AI systems, regardless of their sophistication, cannot fully address problems requiring spatial reasoning, physical manipulation, or embodied intelligence. The complexity and precision of physical reality demands direct spatial understanding rather than linguistic approximation.
🚗 The Unexpected Path: Why Language Conquered First
The sequence of AI breakthroughs has unfolded in a surprising order that reveals important insights about the relative difficulty of different types of intelligence. For decades, the AI community expected spatial reasoning and robotics to achieve major breakthroughs before language processing, but reality followed the opposite trajectory.
The autonomous vehicle industry exemplifies this challenge. Despite massive investment—approximately $100 billion—and decades of effort since Sebastian Thrun's team won the DARPA Grand Challenge in 2005, autonomous driving remains a partially solved problem. This is, at its core, a two-dimensional navigation challenge, yet it has proven extraordinarily difficult to solve reliably across diverse real-world conditions.
Meanwhile, large language models emerged seemingly from nowhere and achieved remarkable success almost immediately. These systems became economically viable quickly, solving complex language problems that had stumped researchers for decades. This unexpected development forced a reconsideration of which aspects of intelligence are actually most challenging to replicate artificially.
"It's that language went first because we've worked so hard on robotics right... this is like a 2D problem and so that was the path we were going on... and then out of nowhere comes these LLMs and they solve all of these language problems like basically immediately." - Martin Casado
The explanation lies in evolutionary biology. The parts of the brain responsible for language processing are relatively recent developments, making humans somewhat inefficient at language tasks. Computers can therefore match or exceed human language capabilities more easily. In contrast, spatial navigation and reasoning capabilities have been refined over hundreds of millions of years of evolution, making them far more sophisticated and difficult to replicate.
This realization suggests that while language AI achieved rapid success, spatial intelligence represents a much deeper and more fundamental challenge that will require entirely different approaches to solve.
🧠 Unrolling Evolution: The Ancient Roots of Spatial Intelligence
The development of artificial intelligence is following a fascinating pattern that mirrors the reverse of biological evolution. Understanding this progression provides crucial insights into why different types of intelligence present varying levels of difficulty for AI systems to master.
Language capabilities, while impressive in humans, represent a relatively recent evolutionary development. The neural structures supporting complex language processing evolved much more recently than the fundamental spatial reasoning systems that govern navigation, object manipulation, and environmental understanding. This evolutionary timeline explains why computers can achieve remarkable language performance relatively quickly.
Spatial intelligence, by contrast, has been refined through hundreds of millions of years of evolutionary pressure. From the earliest organisms navigating three-dimensional environments to complex animals building structures and manipulating objects, spatial reasoning has undergone extensive optimization across countless generations of organisms whose survival depended on accurate spatial understanding.
"We're actually pretty inefficient at language... but the part of the brain that actually does the navigation, the spatial has been around... 500 million years... we're unrolling evolution right like so the language part is actually very very important for high level concepts and like the laptop class type work which is what it's impacting right now but when it comes to space... you have to solve this problem." - Martin Casado
Current AI applications reflect this evolutionary hierarchy. Language models excel at knowledge work, analysis, and communication—tasks that align with humanity's relatively recent linguistic capabilities. However, any application requiring physical construction, manipulation, or navigation encounters the much deeper challenge of spatial intelligence.
The success of generative AI in language domains provides both inspiration and methodology for tackling spatial intelligence. The breakthrough techniques that enabled large language models offer potential pathways for developing the spatial reasoning capabilities that have proven so elusive in robotics and embodied AI applications.
👁️ Vision First: A Different Journey to Spatial Intelligence
While the broader AI community has been surprised by the sequence of breakthroughs, some researchers have maintained consistent focus on visual and spatial intelligence throughout their careers. This perspective provides unique insights into why spatial intelligence represents such a fundamental component of intelligence.
For researchers deeply embedded in computer vision, the importance of spatial reasoning has never been in question. Years of working with visual data, three-dimensional reconstruction, and image understanding have reinforced the central role that spatial intelligence plays in genuine understanding of the world.
The success of language models, rather than diminishing the importance of spatial intelligence, actually validates the potential for foundational model approaches across different domains. The breakthrough techniques that enabled ChatGPT and other language models demonstrate that similar architectural and training innovations could unlock spatial intelligence capabilities.
"My journey is very different because I've always been in vision, right? So I feel like I didn't need LLMs to convince me LWMs are important. I do want to say we're not here bashing language. I'm just so excited; in fact, seeing ChatGPT and LLMs and these foundation models having such breakthrough success inspires us to realize the moment is closer for world models." - Fei-Fei Li
This perspective emphasizes that the development of spatial intelligence isn't meant to compete with or replace language models, but to address the vast range of intelligent behaviors that extend beyond linguistic communication. Spatial intelligence enables capabilities that language alone cannot provide, from basic navigation to complex physical manipulation and construction.
The success of language foundation models creates a template and provides motivation for developing similar foundational capabilities in spatial domains. The techniques, architectures, and training methodologies that proved successful for language processing offer potential pathways for achieving breakthrough results in spatial intelligence.
🧬 The DNA Discovery: When Language Isn't Enough
Some of humanity's greatest scientific breakthroughs required spatial reasoning that transcends the capabilities of language alone. The discovery of DNA's double helix structure provides a perfect example of how three-dimensional thinking enables insights that purely linguistic analysis could never achieve.
When Watson and Crick unraveled the structure of DNA, they weren't working primarily with textual descriptions or mathematical equations. Instead, they engaged in complex three-dimensional reasoning, visualizing how molecular components could fit together in space, understanding the geometric constraints of chemical bonds, and recognizing the elegant helical pattern that enables DNA's replication mechanism.
This discovery required spatial intelligence that operated beyond the reach of language. While scientists could describe DNA's components and properties linguistically, understanding how these elements assembled into a functional, self-replicating structure demanded direct spatial reasoning about three-dimensional relationships.
"Space, the 3D space, the space out there, the space in your mind's eye: the spatial intelligence that enables people to do so many things beyond language is a critical part of intelligence. It goes from ancient animals all the way to humanity's most innovative findings, such as the structure of DNA, that double helix in 3D space. There's no way you can use language alone to reason that out." - Fei-Fei Li
The DNA example illustrates a broader principle about the limitations of language-based reasoning. While language excels at communicating established knowledge, describing relationships, and conveying abstract concepts, it falls short when dealing with novel spatial configurations, geometric relationships, and three-dimensional problem-solving.
This limitation has profound implications for AI development. Systems that rely solely on language processing, regardless of their sophistication, will be fundamentally constrained in their ability to make discoveries or solve problems that require spatial reasoning. Achieving human-level intelligence in scientific discovery, engineering, and innovation will require AI systems capable of spatial thinking.
⚽ The Buckminsterfullerene: Beauty in Molecular Architecture
Scientific discovery often reveals the profound beauty and elegance of spatial structures in nature. The buckminsterfullerene molecule, commonly known as the "Bucky Ball," represents another compelling example of how three-dimensional understanding leads to breakthrough insights that language alone could never provide.
This carbon molecule structure demonstrates the sophisticated geometric principles that govern molecular architecture. The Bucky Ball's unique spherical arrangement of carbon atoms creates a stable, beautiful structure that exhibits remarkable properties. Understanding this molecule requires spatial reasoning about how sixty carbon atoms can arrange themselves in a perfectly symmetrical pattern that maximizes stability while creating a hollow sphere.
The discovery and understanding of buckminsterfullerene involved researchers visualizing complex three-dimensional relationships, understanding geometric constraints, and recognizing patterns in spatial arrangements. This type of molecular architecture thinking operates entirely in the spatial domain, requiring intelligence that can manipulate and reason about three-dimensional structures.
"Another one of my favorite scientific example is Bucky Ball... carbon molecule structure that is so beautifully constructed... that kind of example shows how incredibly profound space and 3D world is." - Fei-Fei Li
The elegance of the Bucky Ball structure illustrates why spatial intelligence represents such a fundamental aspect of understanding reality. Nature operates according to spatial principles, creating structures and systems that can only be fully comprehended through three-dimensional reasoning.
For AI systems to truly understand and interact with the physical world—whether in scientific discovery, engineering design, or basic navigation—they must develop the spatial reasoning capabilities that enable this type of three-dimensional thinking. Language can describe these structures, but only spatial intelligence can truly understand and manipulate them.
💎 Key Insights
- Language is fundamentally inadequate for conveying precise spatial relationships and enabling real-world task completion
- AI development has followed the reverse order of evolution—mastering recent language capabilities before ancient spatial intelligence
- Autonomous vehicles demonstrate how even 2D navigation problems remain challenging despite massive investment
- Language models succeeded quickly because human language processing is evolutionarily recent and relatively inefficient
- Spatial intelligence has been refined over 500 million years of evolution, making it far more sophisticated than language processing
- Scientific breakthroughs like DNA structure discovery require spatial reasoning that transcends language capabilities
- The success of language foundation models provides both inspiration and methodology for developing spatial intelligence
- Current AI applications are limited to "laptop class" knowledge work that doesn't require physical spatial reasoning
📚 References
People:
- Sebastian Thrun - Leader of the Stanford team that won the DARPA Grand Challenge in 2005, early autonomous vehicle pioneer
Scientific Examples:
- DNA Double Helix Structure - Watson and Crick's discovery requiring three-dimensional spatial reasoning
- Buckminsterfullerene (Bucky Ball) - Carbon molecule structure demonstrating elegant molecular architecture
Technologies:
- DARPA Grand Challenge - 2005 competition that marked early progress in autonomous vehicles
- ChatGPT - Example of successful language model that inspired confidence in foundation model approaches
- Autonomous Vehicles (AV) - Industry that invested ~$100 billion over decades with limited success
Concepts:
- Spatial Intelligence - Ancient evolutionary capability for three-dimensional reasoning and navigation
- Language Processing - Relatively recent evolutionary development that computers can master more easily
- Foundation Models - Architectural approach that achieved breakthrough success in language domains
- World Models - AI systems that understand three-dimensional structure and spatial relationships
🎨 The Visual Nature of Creativity
Creativity across industries is fundamentally rooted in visual and spatial thinking, making it a prime domain for spatial intelligence applications. From design studios to movie sets, from architectural firms to industrial manufacturing, creative work requires sophisticated understanding of three-dimensional relationships, visual aesthetics, and spatial composition.
The creative industries represent far more than entertainment—they encompass productivity tools, machinery design, industrial applications, and countless other domains where visual thinking drives innovation. Designers working on everything from consumer products to complex industrial systems rely on spatial reasoning to envision, iterate, and refine their creations.
Current creative workflows often involve translating spatial ideas through limited two-dimensional interfaces or cumbersome three-dimensional modeling tools. Spatial intelligence could revolutionize these processes by enabling direct manipulation of three-dimensional concepts, allowing creators to work more intuitively with spatial relationships and visual compositions.
"Creativity is very visual... we have creators from design to movie to architecture to industry design. Creativity is not just only for entertainment, it could be for productivity, for machinery, for many things. That alone is a highly visual perceptual spatial area or areas of work." - Fei-Fei Li
The implications extend beyond individual creative projects to entire industries built around visual and spatial problem-solving. Architecture, film production, product design, and industrial engineering all depend on spatial intelligence that current AI systems cannot adequately support.
By developing AI systems that truly understand three-dimensional space, visual relationships, and spatial composition, we could unlock new levels of creative capability and productivity across these visually-driven industries.
🤖 Embodied Machines: Beyond Humanoids and Cars
Robotics represents a vast spectrum of embodied machines that extends far beyond the humanoid robots and autonomous vehicles that dominate popular imagination. The field encompasses countless applications where machines must understand and navigate three-dimensional space, often in collaboration with humans.
Every embodied machine, regardless of its form factor or application, faces the fundamental challenge of spatial reasoning. Whether it's a manufacturing robot assembling components, a drone navigating complex environments, or a service robot operating in human spaces, these systems must develop sophisticated understanding of their three-dimensional environment.
The collaborative aspect adds another layer of complexity. Many robotic applications require machines to work alongside humans, understanding not just the physical space but also human intentions, movements, and spatial behaviors. This collaborative spatial intelligence represents a particularly challenging and important frontier.
"Robotics to me is any embodied machines. It's not just humanoids or cars, there's so much in between, but all of them have to somehow figured out the 3D space it lives in, have to be trained to understand the 3D space and have to do things sometimes even collaboratively with humans and that needs spatial intelligence." - Fei-Fei Li
The breadth of this challenge explains why robotics has remained difficult despite decades of research and investment. Each type of embodied machine operates in different spatial contexts, with different constraints, capabilities, and objectives. However, they all share the need for fundamental spatial intelligence.
Solving spatial intelligence could unlock progress across this entire spectrum of embodied machines, enabling more capable robots in manufacturing, healthcare, service industries, exploration, and countless other applications where machines need to understand and navigate the physical world.
🌍 Breaking Free from Single Reality: The Multiverse Vision
Throughout human history, our species has been constrained to experience life within a single three-dimensional reality—the physical Earth. While a few astronauts have ventured to the Moon, the vast majority of humanity has lived and worked within the bounds of our planet's physical space. This represents a fundamental limitation that spatial intelligence technology could transform.
The development of sophisticated spatial AI opens the possibility of creating infinite virtual universes, each designed for specific purposes and experiences. These wouldn't be simple video game environments or basic virtual reality spaces, but rich, complex three-dimensional worlds that could serve diverse human needs and applications.
Different virtual universes could be optimized for different purposes: some designed as training environments for robots, others as creative spaces for artists and designers, social environments for human interaction, travel experiences that transport people to impossible places, or storytelling worlds that immerse audiences in narrative experiences.
"For the entirety of human civilization we all collectively as people lived in one 3D world and that is the physical earth 3D world... but that's what makes the digital virtual world incredible with this technology... we can actually create infinite universes... some are for robots some are for creativity some are for socialization some are for travel some are for storytelling... it suddenly will enable us to live in the multiverse." - Fei-Fei Li
This vision represents more than technological advancement—it suggests a fundamental expansion of human experience. Instead of being limited to the physical constraints of Earth, people could inhabit multiple virtual worlds, each offering unique possibilities for work, creativity, learning, and social interaction.
The concept of living in a multiverse through spatial intelligence technology could redefine how humans experience space, interact with environments, and explore possibilities that physical reality cannot provide.
🔄 From 2D Views to Complete 3D Understanding
The practical capabilities of spatial intelligence become concrete when examining how these systems could transform a simple two-dimensional image into a complete three-dimensional understanding. This represents a fundamental leap from current computer vision capabilities that primarily analyze flat images to systems that truly comprehend spatial relationships.
Starting with just a single 2D photograph or view, spatial intelligence systems could reconstruct the complete three-dimensional scene, including areas not visible in the original image. This means understanding what lies behind objects, how spaces connect, and what the full spatial structure looks like from any perspective.
The reconstructed 3D representation becomes a manipulable digital environment where users can move objects, measure distances, stack items, and perform any spatial operation that would be possible in physical space. This creates a bridge between 2D visual input and full 3D spatial reasoning.
"With these models you can take a view of the world like a 2D view of the world... and then you could actually create a 3D full representation including what you're not seeing like the back of the table for example... you can manipulate it you can move it you can measure it you can stack it so anything that you would do a space you could do." - Martin Casado
The generative aspect extends this capability even further. Beyond reconstructing what exists, these systems could generate completely new spatial elements, creating 360-degree environments from limited input and filling in spatial details that were never captured in the original data.
This capability has immediate applications across multiple industries: architecture and design, where professionals could quickly prototype and iterate on spatial concepts; video games, where developers could rapidly create rich 3D environments; robotics, where machines could better understand and navigate their surroundings; and countless other domains requiring spatial reasoning.
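The geometric core of the 2D-to-3D step described above can be illustrated with a classical pinhole-camera sketch (this is textbook geometry, not World Labs' actual model): given per-pixel depth, each pixel lifts to a 3D point, and spatial operations like measuring become ordinary vector math. The camera parameters here (`fx`, `fy`, `cx`, `cy`) and the toy depth map are assumptions for illustration.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a 2D depth map into 3D camera-space points (pinhole model).

    depth: (H, W) array of per-pixel depths Z.
    fx, fy: focal lengths in pixels; cx, cy: principal point.
    Returns an (H, W, 3) array of (X, Y, Z) points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1)

# A toy 4x4 "image" where every pixel is 2 m away, with unit focal
# length and the principal point at the image center.
pts = backproject(np.full((4, 4), 2.0), fx=1.0, fy=1.0, cx=1.5, cy=1.5)

# Once pixels live in 3D, measuring is just Euclidean distance --
# the kind of spatial operation Casado describes above.
d = np.linalg.norm(pts[0, 0] - pts[0, 3])
```

The hard part, of course, is what this sketch assumes away: inferring depth (and the unseen back of the table) from a single image is exactly what the generative world models aim to do.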
🌳 The Wisdom of a Six-Year-Old: Why Trees Don't Have Eyes
Sometimes profound insights about intelligence and perception come from the simplest observations. A conversation with a six-year-old about why trees don't have eyes reveals fundamental principles about the relationship between movement, perception, and spatial intelligence that have shaped the evolution of life on Earth.
Trees don't need eyes because they don't move. This simple observation illuminates a crucial principle: perception and spatial intelligence evolved as responses to the need for movement and interaction with the environment. Stationary organisms can survive and thrive without sophisticated sensory systems, but any creature that moves must develop the ability to perceive and navigate three-dimensional space.
This principle explains why spatial intelligence represents such a fundamental aspect of animal life. The entire evolutionary history of mobile creatures has been shaped by the need to navigate, hunt, escape, build, and interact within three-dimensional environments. Every aspect of animal cognition, from basic navigation to complex problem-solving, builds upon this foundation of spatial reasoning.
"I had a conversation with my six-year-old years ago about why trees don't have eyes... trees don't move they don't need eyes so the fact that the entire basis of animal life is moving and doing things and interacting gives life to perception and spatial intelligence." - Fei-Fei Li
The implications for artificial intelligence are profound. If we want to create AI systems that can truly interact with and understand the world, they must develop the spatial intelligence that evolution has refined over hundreds of millions of years. This isn't just about adding another capability to AI systems—it's about developing one of the most fundamental aspects of intelligence itself.
Spatial intelligence will likely transform work and life as dramatically as language models have transformed knowledge work, but across the vast domain of physical interaction and spatial reasoning that governs so much of human activity.
💎 Key Insights
- Creativity across industries is fundamentally visual and spatial, from entertainment to industrial design and productivity tools
- Robotics encompasses all embodied machines, not just humanoids and cars, all requiring sophisticated spatial understanding
- Spatial intelligence could enable creation of infinite virtual universes designed for specific purposes like robot training, creativity, and social interaction
- For the first time in human history, we could transcend living in a single 3D reality and inhabit multiple virtual worlds
- Current systems can transform 2D images into complete manipulable 3D representations including unseen areas
- The generative capabilities extend beyond reconstruction to creating entirely new spatial environments
- Trees don't need eyes because they don't move—spatial intelligence evolved as a response to movement and interaction
- Spatial intelligence represents a horizontal technology platform that could transform work across multiple industries
- Like LLMs, spatial intelligence applications span from practical tools to creative expression and self-actualization
📚 References
Applications:
- Design Industries - Visual creative work spanning entertainment to industrial applications
- Movie Production - Visual storytelling requiring sophisticated spatial understanding
- Architecture - Building design requiring three-dimensional spatial reasoning
- Industrial Design - Product and machinery design with spatial components
- Video Games - Interactive environments requiring 3D spatial generation
- Robotics Training - Specialized virtual environments for machine learning
Concepts:
- Embodied Machines - Any physical system that must navigate and understand 3D space
- Multiverse Living - Ability to inhabit multiple virtual worlds designed for different purposes
- 2D to 3D Reconstruction - Technology that creates complete spatial understanding from flat images
- Generative 3D - Systems that can create new spatial content beyond what was originally captured
- Horizontal Technology Platform - Foundational capability that enables applications across multiple industries
- 360-Degree Environments - Complete spatial representations viewable from any perspective
Examples:
- Trees Without Eyes - Six-year-old's observation about the relationship between movement and perception
- The Moon - Limited example of humans experiencing alternate 3D environments
- Table Reconstruction - Example of inferring unseen spatial elements (back of table) from partial views
🎯 The Fundamental 3D Problem: Why 2D Isn't Enough
The physical world operates according to three-dimensional principles that cannot be adequately represented through two-dimensional abstractions. While humans can mentally reconstruct 3D understanding from 2D inputs like videos or photographs, machines lack this intuitive spatial reasoning capability, creating a fundamental limitation for AI systems operating in physical environments.
Physics, interaction, navigation, and composition all occur in three-dimensional space. When a robot needs to navigate behind objects, measure distances, or manipulate items in the physical world, it requires explicit three-dimensional information that 2D representations simply cannot provide. The Z-axis—representing depth and distance—becomes crucial for any spatial task.
For human observers, 2D video works because our brains automatically reconstruct the missing spatial dimension based on evolutionary programming and learned spatial understanding. We can watch a flat screen and intuitively understand the three-dimensional relationships, distances, and spatial arrangements being depicted.
"Physics happens in 3D and interaction happens in 3D, navigating behind the back of the table needs to happen in 3D, composing the world whether physically digitally needs to happen in 3D... fundamentally the problem is a 3D problem." - Martin Casado
However, when machines attempt to perform spatial tasks using only 2D information, they lack the essential depth information needed for navigation, manipulation, and interaction. A robot trying to grab an object or measure distances cannot succeed without understanding the complete three-dimensional spatial relationships.
"If you need a robot that has the output of the model, if that's 2D and then you ask the robot to do distance or to grab something, that information is missing... you've got the XYZ plane, the Z plane just isn't there at all." - Martin Casado
This limitation explains why 2D computer vision, despite its impressive advances, cannot adequately support embodied AI applications that must interact with the physical world.
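The "missing Z" point can be made concrete with a few lines of standard projective geometry (an illustrative sketch, not code from the episode): projection divides by depth and then throws it away, so distinct 3D points on the same viewing ray land on the same pixel, and no amount of 2D analysis can recover which one the camera saw.

```python
import numpy as np

def project(point, f=1.0):
    """Pinhole projection: (X, Y, Z) -> image-plane (x, y). Z is discarded."""
    X, Y, Z = point
    return np.array([f * X / Z, f * Y / Z])

near = np.array([1.0, 1.0, 2.0])   # a point 2 m away
far  = np.array([2.0, 2.0, 4.0])   # a point 4 m away, on the same viewing ray

# Both project to the identical pixel: from the image alone, a robot
# cannot tell whether the object is within reach or twice as far.
same_pixel = np.allclose(project(near), project(far))
```

This is the formal version of Casado's point: a model whose output lives only in the image plane has literally lost the coordinate a robot needs for grasping and distance.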
👁️ The Vision Scientist's Experiment: Life Without Stereo Vision
Sometimes the most profound insights about human perception come from experiencing its absence. Five years ago, a cornea injury temporarily robbed Fei-Fei Li of her stereo vision, creating an unexpected natural experiment that illuminated the critical importance of three-dimensional visual understanding.
Living with monocular vision for several months provided Li with firsthand experience of the challenges that machines face when trying to navigate the world without true depth perception. Despite a lifetime of spatial learning and her expertise as a vision scientist, the loss of stereo vision created immediate and frightening limitations in her daily life.
The most striking impact came when attempting to drive. Even though Li retained perfect knowledge of her car's dimensions, understood the size of neighboring parked cars, and knew her neighborhood roads intimately, the lack of depth perception made driving treacherous. Simple tasks like judging the distance between her car and parked vehicles became nearly impossible.
"I was frightened to drive... first of all I couldn't get on highway that speed... but I was just driving in my own neighborhood and I realized I don't have a good distance measure between my car and the parked car on a local small road even though I have perfect understanding of how big is my car, almost how big is the neighbors' parked cars, I know the roads for years and years." - Fei-Fei Li
The experience forced her to drive at extremely slow speeds—around 10 miles per hour—to avoid scratching vehicles, despite having complete conceptual knowledge of all spatial relationships involved. This demonstrated how depth perception provides information that cannot be substituted by other forms of understanding.
"I had to be so slow like almost 10 miles an hour so that I don't scratch the cars and that was exactly why we needed a stereo vision." - Fei-Fei Li
This personal experience reinforced the technical understanding that machines operating without true 3D spatial intelligence face similar fundamental limitations, regardless of their other capabilities or programmed knowledge.
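What two eyes buy you can be written down as the classical stereo triangulation formula, Z = f·B/d: depth comes from the disparity d between the two views. The numbers below are assumptions for illustration (they are not from the episode), with a baseline near human interpupillary distance; with one eye, disparity simply does not exist, which is the geometric version of Li's driving experience.

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Classical stereo triangulation: Z = f * B / d.

    f_px: focal length in pixels.
    baseline_m: separation between the two cameras (or eyes) in metres.
    disparity_px: horizontal shift of the same point between views, in pixels.
    """
    return f_px * baseline_m / disparity_px

# Illustrative, assumed numbers: 700 px focal length, 6.5 cm baseline.
z_near = depth_from_disparity(700, 0.065, 20)   # large disparity -> close (2.275 m)
z_far  = depth_from_disparity(700, 0.065, 2)    # small disparity -> far (22.75 m)
```

Note how fast depth precision degrades as disparity shrinks, and how it vanishes entirely at zero disparity: monocular vision must fall back on learned cues, which, as Li found, are no substitute for metric depth at driving speeds.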
🔬 Building on Giants: The Research Foundation
The development of spatial intelligence at World Labs builds upon significant breakthroughs that have emerged from academic research over recent years. While spatial AI represents a newer area compared to language models, important foundational work has been developing in computer vision and 3D reconstruction technologies.
One of the most significant advances came through Neural Radiance Fields (NeRF), a revolutionary approach to 3D reconstruction using deep learning. This breakthrough, developed by World Labs co-founder Ben Mildenhall and his colleagues at Berkeley, transformed how researchers approach the challenge of creating three-dimensional representations from two-dimensional inputs.
NeRF technology enabled unprecedented quality in 3D scene reconstruction, allowing systems to generate photorealistic views of three-dimensional spaces from limited input data. This work, which gained significant attention about four years ago, provided crucial foundational techniques for spatial intelligence development.
Another important advancement came through Gaussian Splatting representation, pioneered in part by World Labs co-founder Christoph Lassner. This approach offers efficient methods for representing complex three-dimensional volumetric data, providing the computational foundations needed for real-time spatial reasoning and manipulation.
"One important revolution that has happened in 3D computer vision was neural radiance field or NeRF and that was done by our co-founder Ben Mildenhall and his colleagues at Berkeley... that was a way to do 3D reconstruction using deep learning that was really taking the world by storm about four years ago." - Fei-Fei Li
The team also includes Justin Johnson, Li's former student and another World Labs co-founder, who contributed foundational work in image generation and style transfer during the pre-transformer era when GANs (Generative Adversarial Networks) were the primary approach for visual content generation.
These individual breakthroughs in academia and industry created the technical building blocks that make comprehensive spatial intelligence systems possible, but no single organization had focused on integrating them into a unified approach to world modeling.
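To make the NeRF breakthrough less abstract, here is a minimal sketch of its core volume-rendering step from the original paper: a network predicts density and color at samples along a camera ray, and alpha compositing turns those into a pixel. Everything below (the toy densities, colors, and spacings) is illustrative; real NeRF replaces the hand-set values with a learned MLP.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """NeRF-style volume rendering along a single camera ray.

    sigmas: (N,) densities at N samples along the ray.
    colors: (N, 3) RGB values predicted at those samples.
    deltas: (N,) spacing between consecutive samples.
    Returns the composited RGB for this ray's pixel.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # light surviving to each sample
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# Toy ray: empty space, then an opaque red surface, then blue behind it.
sigmas = np.array([0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rgb = render_ray(sigmas, colors, np.ones(3) * 0.5)
# The dense red sample absorbs nearly all the light, so the occluded
# blue sample contributes almost nothing to the pixel.
```

Because this compositing is differentiable, the densities and colors can be optimized directly from 2D photographs, which is what made NeRF's 3D reconstruction quality such a leap.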
🎯 All-In on the North Star: Concentrating World-Class Talent
Solving spatial intelligence requires more than incremental research progress—it demands concentrated effort from world-class experts across multiple disciplines working toward a unified goal. World Labs represents a deliberate assembly of top talent from computer vision, AI, graphics, and optimization, all focused on the singular challenge of creating spatial intelligence.
The complexity of spatial intelligence spans multiple technical domains that must work together seamlessly. Success requires deep expertise in AI and machine learning for developing the core reasoning capabilities, computer graphics for representing and manipulating 3D information efficiently, optimization techniques for handling the computational complexity, and data science for training these systems effectively.
This interdisciplinary approach reflects the fundamental challenge of spatial intelligence—it cannot be solved by expertise in any single domain. The breakthrough requires integration across fields that have traditionally operated somewhat independently, bringing together the best minds from each area to work on a unified technical vision.
"At World Labs we just have the conviction that we're gonna be all in on this one singular big north star problem, concentrating on the world's smartest people in computer vision, in diffusion models, in graphics, computer graphics, in optimization, in AI, all of in data, all of them coming into this one team and try to make this work and to productize this." - Fei-Fei Li
The productization aspect adds another layer of complexity beyond research. Creating spatial intelligence systems that work reliably in real-world applications requires not just technical breakthroughs but also engineering excellence, scalable infrastructure, and robust implementation that can handle the demands of practical deployment.
Martin Casado emphasizes that solving spatial intelligence requires this special combination of expertise: deep AI knowledge for data and model architecture, combined with graphics expertise for representing complex 3D information efficiently in computer memory and on screens. This unique combination of skills makes assembling the right team particularly challenging and important.
💎 Key Insights
- Physics, interaction, and navigation all fundamentally occur in 3D space, making 2D representations inadequate for spatial tasks
- Humans can reconstruct 3D understanding from 2D inputs, but machines need explicit three-dimensional information
- Losing stereo vision even temporarily reveals how crucial depth perception is for basic spatial tasks like driving
- Years of spatial knowledge and expertise cannot compensate for missing depth information in real-world navigation
- Spatial intelligence research builds on significant breakthroughs like NeRF and Gaussian Splatting from academic research
- Solving spatial intelligence requires expertise across multiple domains: AI, computer graphics, optimization, and data science
- World Labs represents an "all-in" approach, concentrating world-class talent from different fields on a single north star problem
- Productizing spatial intelligence requires both research breakthroughs and robust engineering for real-world deployment
- The complexity of representing 3D information efficiently in computer memory and on screens requires specialized graphics expertise
📚 References
People:
- Ben Mildenhall - World Labs co-founder who developed Neural Radiance Fields (NeRF) at Berkeley
- Christoph Lassner - World Labs co-founder whose work contributed to Gaussian Splatting representation
- Justin Johnson - Former student of Fei-Fei Li and World Labs co-founder, pioneered early image generation and style transfer
Technologies:
- Neural Radiance Fields (NeRF) - Revolutionary 3D reconstruction approach using deep learning that gained prominence ~4 years ago
- Gaussian Splatting - Efficient method for representing complex three-dimensional volumetric data
- GANs (Generative Adversarial Networks) - Pre-transformer approach for image generation
- Style Transfer - Early technique for transforming visual content, foundational to current generative approaches
Technical Domains:
- Computer Vision - Field focused on enabling machines to understand visual information
- Diffusion Models - Current state-of-the-art approach for generative AI systems
- Computer Graphics - Discipline focused on representing and rendering 3D information efficiently
- Optimization - Mathematical techniques for solving complex computational problems
- XYZ Coordinate System - Three-dimensional spatial representation with Z-axis representing depth
Medical/Personal:
- Stereo Vision - Depth perception capability enabled by using both eyes together
- Cornea Injury - Medical condition that temporarily affected Fei-Fei Li's depth perception
- Monocular Vision - Seeing with only one eye, eliminating natural depth perception