
François Chollet: The ARC Prize & How We Get to AGI
François Chollet on June 16, 2025 at AI Startup School in San Francisco. François Chollet is a leading voice in AI. He's the creator of the Keras library, author of Deep Learning with Python, and the founder of the ARC Prize, a global competition aimed at measuring true general intelligence. He's spent years thinking deeply about what intelligence actually is, and why scaling up today's AI models isn't enough to reach it. In this talk, he walks through the limits of pretraining and memorized...
📉 Why Has AI Progress Been So Predictable for Decades?
The Fundamental Driver Behind AI's Exponential Growth
The Most Important Chart in Technology:
- Exponential Decline: Compute costs have fallen by two orders of magnitude every decade since 1940
- Consistent Pattern: This trend shows no signs of stopping anytime soon
- AI Breakthrough Catalyst: In the 2010s, abundant GPU compute + large datasets finally made deep learning work
The 2010s Deep Learning Revolution:
- Computer Vision: Previously intractable problems suddenly became solvable
- Natural Language Processing: Major breakthroughs across language understanding
- Self-Supervised Learning: Text modeling began working at scale
- Scaling Laws: Predictable improvements with larger models and more data



🚀 What Made Everyone Believe Scaling Was Everything?
The Seductive Promise of LLM Scaling Laws
The Scaling Obsession Era:
- Predictable Results: Same architecture + same training process = consistent improvements
- Benchmark Dominance: Scaling up crushed almost all AI benchmarks
- Power-Law Relationship: Loss fell smoothly and predictably as model size and training data grew (sketched in the formula below)
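For reference, these laws took a power-law form. A sketch in the spirit of the published scaling-law results (the symbols and constants are illustrative, not from this talk):

```latex
% Test loss falls as a power law in parameter count N and dataset size D;
% N_c, D_c, and the exponents alpha are fitted constants.
\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
\]
```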
The Emergent Intelligence Hypothesis:
- Popular Belief: General intelligence would spontaneously emerge from bigger models
- More Data = More Intelligence: The field became obsessed with this simple formula
- Universal Solution: Many believed more scale was all needed to solve everything
The Critical Flaw:
Confusion About Benchmark Meaning: The AI community misunderstood what these benchmark results actually represented



🧠 What's the Real Difference Between Skills and Intelligence?
Why Memorized Performance Isn't True Intelligence
The Fundamental Distinction:
- Memorized Skills: Static, task-specific abilities that can be recalled
- Fluid Intelligence: The ability to understand something completely new on the fly
- Critical Gap: There's a massive difference between these two capabilities
The ARC Benchmark Revolution (2019):
- Purpose: Designed to highlight the difference between memorization and genuine reasoning
- Focus: Not about regurgitating memorized skills, but making sense of novel problems
- Human Performance: Any person in the room would score well above 95%
The Scaling Reality Check:
50,000x Scale-Up Results:
- 2019 Baseline: 0% accuracy on ARC benchmark
- GPT-4 Era: Only reached roughly 10% accuracy
- Conclusion: Massive scaling didn't translate to fluid intelligence



🔄 What Changed Everything in 2024?
The Paradigm Shift from Pre-training to Test-Time Adaptation
The Revolutionary Pivot:
- New Pattern Emergence: AI research community shifted to test-time adaptation
- Dynamic State Changes: Models that could modify their own state during inference
- Adaptive Learning: Moving beyond querying pre-loaded knowledge
Test-Time Adaptation Breakthrough:
- Real-Time Learning: Ability to learn and adapt during inference time
- ARC Progress: Suddenly seeing significant progress on the benchmark
- Fluid Intelligence Signs: AI showing genuine signs of adaptive reasoning
The OpenAI o3 Milestone:
December 2024 Achievement:
- Human-Level Performance: First time achieving human-level results on ARC
- Fine-Tuned Approach: Specifically optimized for the benchmark
- Paradigm Confirmation: Validated the test-time adaptation approach



🎯 How Do Models Actually Adapt in Real-Time?
The Technical Reality of Test-Time Adaptation
Core Adaptation Mechanisms:
- Dynamic Behavior Modification: Models change their processing based on specific inference data
- Self-Reprogramming: Attempting to reprogram themselves for each task
- Universal Adoption: Every successful ARC approach now uses these techniques
Key Adaptation Techniques:
- Test-Time Training: Continued learning during inference (see the sketch after this list)
- Program Synthesis: Generating new code/logic for specific problems
- Chain of Thought Synthesis: Building reasoning paths dynamically
- Behavioral Plasticity: Modifying response patterns based on context
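To make the first of these techniques concrete, here is a minimal sketch of test-time training, assuming PyTorch and a toy fixed-size task; `ttt_solve` and the toy regression task are illustrative stand-ins, not the actual systems discussed in the talk:

```python
# A minimal sketch of test-time training (TTT): clone a pretrained model,
# fine-tune the clone on the task's own demonstration pairs, then predict.
# The setup is a toy; real TTT systems work on ARC grids, not regression.
import copy
import torch
import torch.nn as nn

def ttt_solve(base_model, demo_pairs, test_x, steps=200, lr=0.1):
    """Adapt a copy of the model to one task at inference time."""
    model = copy.deepcopy(base_model)        # never mutate the base weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):                   # a few gradient steps per task
        for x, y in demo_pairs:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():
        return model(test_x)                 # prediction from adapted weights

# Toy usage: the "task" is y = 3x, shown via two demonstration pairs.
base = nn.Linear(1, 1)
demos = [(torch.tensor([[1.0]]), torch.tensor([[3.0]])),
         (torch.tensor([[2.0]]), torch.tensor([[6.0]]))]
print(ttt_solve(base, demos, torch.tensor([[4.0]])))   # moves toward 12.0
```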
The Current State (2025):
Complete Paradigm Shift:
- Era Transition: Moved fully from pre-training scaling to test-time adaptation
- Performance Requirements: No competitive ARC performance without adaptation
- New Standard: Adaptation techniques now essential for fluid intelligence



🤔 What Are the Three Critical Questions About AGI?
The Framework for Understanding Our Current AI Moment
The Essential Questions:
- Historical Analysis: Why didn't pre-training scaling get us to AGI?
- Current Assessment: Does test-time adaptation actually get us to AGI this time?
- Future Roadmap: What comes next beyond test-time adaptation?
The Dogma Shift Context:
- Two Years Ago: Pre-training scaling was standard belief across the field
- Universal Acceptance: "Everybody was saying this" - it was the dominant paradigm
- Today's Reality: "Almost no one believes this anymore" - complete reversal
The Fundamental Question:
What Is Intelligence?: Before answering the three questions, we need to understand what we're actually trying to build
The Stakes:
- AGI Claims: Some people believe AGI is already here
- Industry Impact: Understanding these questions shapes the future of AI development
- Scientific Clarity: Getting clear definitions drives better research directions



💡 What Are the Two Competing Definitions of Intelligence?
The Fundamental Philosophical Divide in AI
The Minsky Style View:
- Task-Focused Definition: AI is about making machines capable of performing human tasks
- Corporate Alignment: Echoes mainstream corporate AGI definitions
- Quantitative Threshold: Often quoted as performing 80% of economically valuable tasks
The McCarthy Style View:
- Novelty-Focused Definition: AI is about getting machines to handle unprepared problems
- Adaptation Emphasis: Focuses on dealing with completely new situations
- Process Over Product: Intelligence as capability, not just performance
Chollet's Intelligence Framework:
Process vs. Output Distinction:
- Intelligence: The process itself - the ability to generate solutions
- Skill: The output of that process - specific capabilities
- Critical Error: Confusing skills with intelligence itself
The Road Network Analogy:
- Road Network: Connects predefined points A to B (skills)
- Road Building Company: Can connect new A's and B's as needs evolve (intelligence)
- Key Insight: Intelligence is about building new roads, not just using existing ones



🔬 How Do We Formally Define Intelligence?
The Mathematical Framework for Understanding Intelligence
The Formal Definition:
Intelligence = Conversion Ratio
- Input: Information you have (past experience + developer-imparted priors)
- Output: Operational area over potential future situations
- Key Factors: High novelty and uncertainty in future situations
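Written as a formula, this is roughly (a hedged paraphrase; Chollet's paper "On the Measure of Intelligence" gives a more careful algorithmic-information-theoretic version):

```latex
% Intelligence as a conversion ratio: operational area over novel,
% uncertain future situations, per unit of information consumed
% (developer-imparted priors plus past experience).
\[
\text{Intelligence} \;\propto\;
\frac{\text{operational area over novel, uncertain situations}}
     {\text{priors} + \text{experience}}
\]
```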
The Efficiency Metric:
- Operationalization: How efficiently you convert knowledge into capabilities
- Novel Situations: Focus on previously unseen scenarios
- Uncertainty Handling: Ability to function without complete information
The Category Error Problem:
Crystallized vs. Fluid Intelligence:
- Crystallized Behavior: Pre-programmed skills and responses
- Fluid Intelligence: Real-time problem-solving and adaptation
- Common Mistake: Attributing intelligence to crystallized programs
The Process vs. Product Distinction:
- The Process: The mechanism that creates solutions
- The Product: The specific solutions created
- Fatal Confusion: Mistaking the road for the road-building process



💎 Key Insights
Essential Insights:
- Compute Cost Decline: The exponential decrease in compute costs (2 orders of magnitude per decade since 1940) has been the primary driver of AI progress, not algorithmic breakthroughs alone
- Scaling Paradigm Failure: Despite 50,000x scale-up from 2019 to GPT-4 era, ARC benchmark performance only improved from 0% to 10%, proving that scaling alone doesn't create fluid intelligence
- 2024 Paradigm Shift: The AI field completely pivoted from pre-training scaling to test-time adaptation, with every successful ARC approach now using dynamic adaptation techniques
Actionable Insights:
- Focus on Adaptation: When evaluating AI systems, look for test-time adaptation capabilities rather than just benchmark performance on memorized tasks
- Redefine Intelligence Metrics: Distinguish between crystallized skills (road networks) and fluid intelligence (road-building companies) when assessing AI progress
- Embrace Novelty Testing: Use benchmarks like ARC that test reasoning on completely new problems rather than pattern matching on familiar data
📚 References
People Mentioned:
- François Chollet - AI researcher, creator of Keras, founder of ARC Prize, discussing fundamental questions about intelligence and AGI
- Jared - Referenced speaker who discussed scaling laws in a previous presentation
Companies & Products:
- ARC Prize - Competition and foundation built around the ARC benchmark, co-founded by François Chollet
- Ndea - A new AI research lab co-founded by François Chollet
- OpenAI - Released the o3 model that achieved human-level performance on the ARC benchmark in December 2024
- Keras - Deep learning library created by François Chollet
- GPT-4 - OpenAI's language model used as example of scaled pre-training approach
Technologies & Tools:
- GPU-based Compute - Hardware that enabled the deep learning revolution in the 2010s
- ARC Benchmark - Abstraction and Reasoning Corpus, designed to test fluid intelligence rather than memorized skills
- Test-Time Training - Technique allowing models to continue learning during inference
- Program Synthesis - Method for generating new code/logic for specific problems
- Chain of Thought Synthesis - Approach for building reasoning paths dynamically
Concepts & Frameworks:
- Scaling Laws - Mathematical relationships between model size, data, and performance
- Test-Time Adaptation - Paradigm where models modify their behavior dynamically during inference
- Fluid vs. Crystallized Intelligence - Distinction between adaptive reasoning and memorized skills
- Self-Supervised Learning - Training approach that became dominant in the 2010s
- Abstraction and Reasoning Corpus (ARC) - Benchmark specifically designed to measure genuine fluid intelligence
📊 Why Are Human Exams Terrible for Measuring AI Intelligence?
The Fundamental Flaw in Current AI Benchmarking
The Exam Problem:
- Wrong Design Purpose: Human exams were designed to measure task-specific skills, not intelligence
- Flawed Assumptions: Built on assumptions that make sense for humans but not machines
- Memorization Loophole: Most exams assume you haven't memorized all questions and answers beforehand
Intelligence as Efficiency:
- Core Definition: Intelligence is an efficiency ratio - how well you operationalize past information to deal with the future
- Benchmark Limitation: Exam-like benchmarks can't tell us how close we are to AGI
- Measurement Problem: They measure crystallized knowledge, not fluid reasoning
The AGI Distance Problem:
Why Current Metrics Fail:
- Static vs. Dynamic: Exams test static recall rather than dynamic adaptation
- Known vs. Novel: Focus on familiar patterns rather than unprecedented challenges
- Skill vs. Intelligence: Confuse demonstrated competence with reasoning capability



🎯 What Are the Three Key Concepts for Measuring True Intelligence?
The Framework for Defining and Measuring Real AI Intelligence
1. Static Skills vs. Fluid Intelligence:
The Spectrum of Capability:
- Static Programs: Collection of pre-built solutions for known problems
- Fluid Synthesis: Ability to create brand new programs for unseen challenges
- Not Binary: Exists on a spectrum between these two extremes
2. Operational Area for Skills:
Scope of Application:
- Narrow Scope: Only skilled in situations very close to training examples
- Broad Scope: Skilled across wide range of scenarios within domain
- Transfer Example: Learning to drive in San Jose, then successfully driving in Sacramento
The Driving Analogy:
- Local Competence: Can only drive in specific geofenced area
- General Competence: Can drive in any city after learning in one location
- Intelligence Indicator: Broader operational area suggests higher intelligence
3. Information Efficiency:
Learning Resource Requirements:
- Data Needs: How much information required to acquire a skill
- Practice Requirements: Amount of training needed for competence
- Efficiency = Intelligence: Higher information efficiency indicates higher intelligence



⚖️ Why Does Our Definition of Intelligence Shape Everything We Build?
The Measurement-Building Feedback Loop
The Engineering Principle:
- Core Rule: "We can only build what we measure"
- Definition Impact: How we define intelligence reflects our understanding of cognition
- Scope Determination: Definitions determine what questions we ask and answers we get
The Feedback Signal Problem:
- Goal Direction: Measurements drive us toward specific objectives
- Blind Spots: What we don't measure gets ignored in development
- Understanding Reflection: Our metrics reveal our grasp of the problem
The Shortcut Rule Phenomenon:
Universal Engineering Pattern:
- Single Metric Focus: Optimizing for one measure of success
- Unintended Consequences: Success comes at expense of unmeasured factors
- Target vs. Point: Hit the target but miss the actual point
Classic Examples:
- Kaggle Competitions: Winners often create solutions too complex for real-world use
- Netflix Prize: Winning system was extremely accurate but never deployed in production



♟️ What Did AI Chess Teach Us About Missing the Point?
The Chess Paradox: Achieving Goals While Learning Nothing
The Chess AI Journey:
- Original Intent: 1970s AI community wanted to understand human intelligence through chess
- Success Achievement: Deep Blue beat world champion Kasparov decades later
- Learning Outcome: "We had really learned nothing about intelligence"
The Pattern Recognition:
- Goal Achievement: Successfully created superhuman chess-playing AI
- Knowledge Gap: Process taught nothing about general intelligence
- Fundamental Mismatch: Task-specific optimization vs. intelligence understanding
The Broader Implication:
Decades of Misdirection:
- Task-Specific Focus: AI has chased individual skills because that was our intelligence definition
- Automation Result: This approach only leads to automation systems
- Current Reality: "Exactly the kind of system that we have today"
What We Actually Want:
- Beyond Automation: Not just automating known tasks
- Autonomous Invention: AI capable of tackling humanity's most difficult challenges
- Scientific Acceleration: Systems that can accelerate scientific progress



🚀 What's the Difference Between Automation and Invention?
Two Paths to AGI with Radically Different Outcomes
Path 1: Automation-Focused AGI
Task-Specific Intelligence Definition:
- Primary Benefit: Increases economic productivity significantly
- Obvious Value: Extremely valuable for known task completion
- Potential Downside: May increase unemployment
- Limitation: Only handles predefined problems
Path 2: Invention-Focused AGI
Fluid Intelligence Definition:
- Core Capability: Unlocks autonomous invention
- Scientific Impact: Accelerates the timeline of scientific discovery
- Innovation Potential: Tackles humanity's most difficult challenges
- Adaptive Nature: Can face unprecedented problems
The Target Problem:
Need for New Direction:
- Current Focus: Decades of chasing task-specific skills
- Required Shift: Target fluid intelligence itself
- Key Abilities: Adaptation and invention capabilities
The Measurement Imperative:
Progress Through Better Metrics:
- Better Target: Focus on what we actually care about
- Better Feedback: Signals that drive toward true intelligence
- Progress Mechanism: "It's by measuring what you really care about that we'll be able to make progress"



🧩 How Does ARC Actually Test Intelligence Instead of Memory?
The Revolutionary Approach to AI Intelligence Measurement
ARC1 Design Principles:
- IQ Test for Machines: Released in 2019 as intelligence benchmark for both AI and humans
- 1,000 Unique Tasks: Each task is completely unique - no pattern repetition
- No Cramming Possible: Must figure out each task on the fly using general intelligence
The Anti-Memorization Design:
- Unique Problems: Cannot memorize patterns because each task is novel
- On-the-Fly Reasoning: Must use fluid intelligence rather than recalled knowledge
- General Intelligence Required: Success depends on reasoning, not memory
Explicit Knowledge Priors:
Core Knowledge Foundation:
- Objectness: Understanding of discrete objects and their properties
- Elementary Physics: Basic cause-and-effect relationships
- Basic Geometry: Spatial relationships and transformations
- Topology: Understanding of connectivity and boundaries
- Counting: Numerical concepts and quantity
The Four-Year-Old Standard:
- Accessibility: Concepts any four-year-old child has mastered
- Non-Specialized: Very little specialized knowledge required
- No Preparation: Don't need to study or prepare for ARC
- Universal Foundation: Built on truly general cognitive building blocks
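For concreteness, here is what an ARC-style task looks like as data. The JSON layout below matches the public ARC dataset format; the mirror rule itself is a made-up example, far easier than real ARC tasks:

```python
# Each ARC task is a JSON object: a few "train" input/output grid pairs
# plus a "test" pair. Grids are lists of lists of integers 0-9 (colors).
task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]},
        {"input": [[5, 0, 6]],      "output": [[6, 0, 5]]},
    ],
    "test": [{"input": [[7, 8, 9]], "output": [[9, 8, 7]]}],
}

def mirror(grid):
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver must infer the rule from the train pairs alone,
# then apply it to the test input.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))   # [[9, 8, 7]]
```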



🔍 Why Do Humans Excel at ARC While AI Struggles?
The Intelligence Gap That Reveals Missing AI Capabilities
The Performance Paradox:
- Human Performance: Children can perform really well on ARC tasks
- AI Performance: Most sophisticated AI models struggle significantly
- Red Flag Signal: This gap indicates we're missing fundamental capabilities
What Makes ARC Unique:
- Pure Reasoning: Cannot be solved by memorizing patterns
- Fluid Intelligence Required: Must demonstrate genuine reasoning
- Contrast with Other Benchmarks: Most benchmarks target fixed, known tasks that can be "hacked" via memorization
The Diagnostic Value:
ARC as Research Tool:
- Not AGI Test: Won't tell you if a system is already AGI
- Bottleneck Identifier: Directs attention to most important unsolved problems
- Research Direction: Acts as arrow pointing toward critical missing pieces
The Navigation Metaphor:
- Not the Destination: Solving ARC isn't the ultimate goal
- Directional Tool: "Really just an arrow pointing in the right direction"
- Progress Indicator: Shows when we're making real advances in fluid intelligence
Historical Resistance:
50,000x Scale-Up Results:
- Performance Stagnation: ARC performance stayed near zero despite massive scaling
- Decisive Conclusion: Fluid intelligence does not emerge from pre-training scaling alone
- Test Adaptation Necessity: "You absolutely need test adaptation to demonstrate genuine fluid intelligence"



📈 What Made ARC the Only Benchmark to Detect the 2024 Paradigm Shift?
Why ARC Uniquely Signaled the Test-Time Adaptation Revolution
The Benchmark Landscape Problem:
- Saturated Benchmarks: Other benchmarks couldn't distinguish between true intelligence gains and brute force scaling
- Clear Signal Provider: ARC was the only benchmark providing clear signal about the profound shift
- IQ vs. Scaling: Could differentiate between genuine intelligence increase and computational brute force
The Timing Advantage:
- Test Adaptation Arrival: When test-time adaptation emerged in 2024
- Unique Detection: ARC alone could measure the qualitative difference
- Research Validation: Confirmed that new approaches were fundamentally different
The Current Saturation Question:
ARC1 Performance Plateau:
- Visible Saturation: Graph shows ARC1 is now saturating as well
- Critical Question: "Does that mean we have human level AI now?"
- Next Phase Implications: Need to understand what saturation actually means
The Evolution Challenge:
- Benchmark Evolution: As AI capabilities advance, benchmarks need updating
- Measurement Adaptation: Tools must evolve to continue providing meaningful signals
- Progress Tracking: Need to maintain ability to distinguish real from apparent progress



💎 Key Insights
Essential Insights:
- Benchmark Design Flaw: Human exams are fundamentally unsuitable for measuring AI intelligence because they assume you haven't memorized all questions and answers - the exact opposite of how AI systems work
- The Shortcut Rule: Engineering teams inevitably optimize for single metrics at the expense of unmeasured factors, leading to solutions that "hit the target but miss the point" (like Netflix Prize winners being too complex for production)
- Two AGI Paths: There are two fundamentally different definitions of AGI - one focused on automation (economic productivity) and one focused on invention (scientific acceleration) - and the path we choose determines the kind of AI we build
Actionable Insights:
- Measure Fluid Intelligence: Focus on benchmarks that test reasoning on novel problems rather than pattern matching on familiar data
- Avoid Single-Metric Optimization: When building AI systems, explicitly measure and optimize for multiple dimensions of intelligence to avoid the shortcut rule
- Target Information Efficiency: Evaluate AI systems based on how much data they need to acquire new skills, not just their final performance levels
📚 References
People Mentioned:
- Garry Kasparov - World chess champion who was defeated by Deep Blue, illustrating how task-specific AI success doesn't advance general intelligence understanding
Companies & Products:
- Deep Blue - IBM's chess-playing computer that beat Kasparov but taught nothing about intelligence
- Kaggle - Platform referenced as example of optimization leading to impractical solutions
- Netflix Prize - Competition where winning system was too complex for production use, exemplifying the shortcut rule
Technologies & Tools:
- ARC Benchmark - Abstraction and Reasoning Corpus containing 1,000 unique tasks designed to measure fluid intelligence
- Test-Time Adaptation - Paradigm shift technique that ARC uniquely detected in 2024
Concepts & Frameworks:
- Static Skills vs. Fluid Intelligence - Fundamental distinction between memorized capabilities and adaptive reasoning
- Operational Area - Concept measuring the breadth of situations where a skill applies effectively
- Information Efficiency - Metric of how much data is needed to acquire a skill, with higher efficiency indicating higher intelligence
- The Shortcut Rule - Engineering phenomenon where optimizing for single metrics leads to missing the broader point
- Core Knowledge Priors - Basic concepts like objectness, physics, geometry, topology, and counting that four-year-olds master
🎯 Why Is ARC1 Suddenly Not Enough to Measure Intelligence?
The Binary Test Problem and the Need for More Granular Measurement
The ARC1 Limitation:
- Binary Nature: Only provides two possible modes of performance
- Minimal Intelligence Test: Was a minimal reproduction of fluid intelligence
- Sharp Performance Cliff: Either near-zero (like baseline models) or very high (like o3)
The Saturation Problem:
- Human Performance: Everyone in the room would score within noise distance of 100%
- Saturation Point: ARC1 saturates way below human-level fluid intelligence
- Limited Bandwidth: Can't distinguish between different levels of capability above threshold
The Need for Evolution:
Better Tool Requirements:
- More Sensitivity: Need tool that provides more useful bandwidth
- Better Comparison: Enable meaningful comparison with human intelligence levels
- Granular Evaluation: Distinguish between different AI system capabilities
The Intelligence Spectrum:
- Beyond Binary: Intelligence exists on a spectrum, not just on/off
- Measurement Gap: Current tools can't capture the full range of capabilities
- Progress Tracking: Need to measure incremental improvements in reasoning



🆕 How Does ARC2 Challenge Current Test-Time Reasoning Systems?
The Evolution from Pattern Matching to Compositional Reasoning
ARC2 Design Philosophy:
- 2019 vs. 2024 Focus: ARC1 challenged deep learning patterns; ARC2 challenges reasoning systems
- Test-Time Adaptation Target: Specifically designed to test current paradigm approaches
- Same Format, Higher Sophistication: Maintains familiar structure but requires deeper thinking
Compositional Reasoning Focus:
- Greater Complexity: Much more sophisticated tasks than ARC1
- Compositional Generalization: Probes ability to combine concepts in new ways
- Anti-Brute Force: Cannot be easily solved through computational brute force
The Deliberate Thinking Requirement:
Cognitive Load Comparison:
- ARC1: Many tasks could be solved instantly without much thinking
- ARC2: All tasks require some level of deliberate, conscious reasoning
- Human Feasibility: Tasks remain very doable for humans despite increased complexity
The Brute Force Resistance:
- Pattern Recognition Failure: Cannot be solved through memorization alone
- Reasoning Necessity: Requires genuine understanding and problem-solving
- Test-Time Adaptation Requirement: Only systems using TTA score meaningfully above zero



👥 What Did Testing 400 Real People Reveal About ARC2?
The San Diego Human Intelligence Baseline Study
The Diverse Testing Pool:
- Random Recruitment: Not physics PhDs or specialists - just regular people
- Broad Demographics: Uber drivers, UCSD students, unemployed individuals
- Motivation: People looking to make money on the side, no special training
The Comprehensive Results:
- Universal Solvability: All tasks were solved by at least two people who saw them
- Statistical Robustness: Each task seen by average of seven people
- Crowd Intelligence: Group of 10 random people with majority voting would score 100%
The Key Finding:
Complete Human Feasibility:
- No Prior Training: Tasks doable by regular folks without preparation
- Universal Accessibility: Confirms tasks are within normal human cognitive range
- Validation Success: Proves ARC2 targets human-level reasoning, not specialized expertise
The Testing Methodology:
- In-Person Validation: Tested firsthand over several days in San Diego
- Real-World Sample: Truly representative of general population
- Rigorous Standards: Multiple validators per task ensure reliability



🤖 How Badly Do Current AI Models Fail at ARC2?
The Stark Performance Gap Between Humans and AI
Baseline Model Performance:
- Complete Failure: GPT-4, Claude, Llama 4 get 0% on ARC2
- Memorization Impossibility: Simply no way to solve tasks via memorization alone
- Pattern Recognition Breakdown: Traditional approaches completely ineffective
Static Reasoning Systems:
- Single Chain of Thought: Systems using one reasoning chain per task
- Minimal Improvement: Score only 1-2%, within noise distance of zero
- Static Limitation: Fixed reasoning approaches prove insufficient
Test-Time Adaptation Requirements:
The Performance Hierarchy:
- Baseline Models: 0% (complete failure)
- Static Reasoning: 1-2% (essentially zero)
- Test-Time Adaptation: Only approaches scoring meaningfully above zero
- Still Sub-Human: Even TTA systems far below human performance
The o3 Reality Check:
- Best Current Performance: o3 and similar systems still not quite human-level
- Granular Evaluation: ARC2 enables precise measurement of advanced systems
- Gap Visibility: Makes clear how far even the best AI is from human reasoning
The AGI Distance Metric:
Chollet's AGI Test:
- Easy Human Tasks: As long as we can create tasks that any ordinary human can do
- AI Failure: But that AI cannot figure out regardless of how much compute it spends
- No AGI Yet: We don't have AGI until creating such tasks becomes impossible



🎮 What Revolutionary Approach Will ARC3 Take to Test Intelligence?
From Input-Output Pairs to Interactive Agency Assessment
The Paradigm Shift:
- Format Departure: Significant departure from input-output pair format of ARC1 and ARC2
- Agency Assessment: Testing the ability to explore, learn interactively, and set goals
- Autonomous Goal Achievement: AI must figure out objectives and methods independently
The Interactive Challenge:
- Unknown Environment: AI dropped into brand new environment
- No Instructions: Doesn't know what controls do or what the goal is
- Discovery Required: Must figure out gameplay mechanics from scratch
- Starting Question: "What is it even supposed to do in the game?"
The Design Principles:
Core Knowledge Foundation:
- Unique Games: Every single game is entirely unique
- Familiar Building Blocks: Built on core knowledge priors like ARC1 and ARC2
- Hundreds of Tasks: Will feature hundreds of interactive reasoning scenarios
Efficiency as Central Metric:
- Beyond Success: Models graded not just on whether they solve tasks
- How Efficiently: Focus on how efficiently they solve problems
- Action Limits: Strict limits on number of actions models can take
- Human Baseline: Targeting same level of action efficiency as humans
The Timeline:
Development and Release Schedule:
- Launch: Early 2026 for full release
- Developer Preview: July 2025 (the month after this talk) for early access
- Continuous Evolution: Not stopping at ARC3 - development continuing beyond



🔬 What's the Kaleidoscope Hypothesis About Universal Patterns?
Why Nothing Is Ever Truly Novel and Intelligence Is Pattern Mining
The Novelty Paradox:
- Apparent Novelty: Future seems completely different from past experience
- Common Ground Necessity: If truly nothing in common, couldn't make sense regardless of intelligence
- Universal Similarity: Everything in universe shares fundamental similarities
The Universal Isomorphisms:
- Tree Similarities: One tree similar to another tree, also similar to neurons
- Force Analogies: Electromagnetism similar to hydrodynamics, also similar to gravity
- Surrounded by Patterns: We live in a world of recurring structural relationships
The Kaleidoscope Metaphor:
Endless Recombination:
- Apparent Complexity: Experience seems to feature never-ending novelty and complexity
- Limited Atoms: Number of unique "atoms of meaning" needed to describe everything is actually very small
- Recombination Principle: Everything around us is recombination of these fundamental atoms
Intelligence as Pattern Mining:
- Experience Mining: Intelligence is ability to mine experience for reusable patterns
- Atom Identification: Identifying atoms of meaning that work across different situations
- Cross-Task Transfer: Finding principles that apply to many different contexts
The Abstraction Process:
Building Blocks of Understanding:
- Invariance Detection: Identifying structure and principles that repeat
- Abstract Building Blocks: These reusable atoms are called abstractions
- On-the-Fly Recombination: Making sense of new situations by combining existing abstractions



💎 Key Insights
Essential Insights:
- Binary Limitation: ARC1 was a binary test that could only distinguish between "no intelligence" and "some intelligence," saturating far below human-level capability and requiring more granular measurement tools
- Human Universality: Testing 400 random people (Uber drivers, students, unemployed individuals) in San Diego proved that ARC2 tasks are solvable by any regular person, with 10 random people reaching 100% accuracy through majority voting
- The Kaleidoscope Hypothesis: Nothing is truly novel - the universe consists of recurring patterns and isomorphisms, with intelligence being the ability to mine experience for reusable "atoms of meaning" that can be recombined across different situations
Actionable Insights:
- Test Agency, Not Just Reasoning: Future AI evaluation should focus on interactive agency (exploration, goal-setting, autonomous achievement) rather than just input-output pattern matching
- Use Human Efficiency Baselines: When measuring AI progress, compare not just accuracy but efficiency - how many actions needed to solve problems compared to human performance
- Look for Abstraction Transfer: Evaluate AI systems based on their ability to identify and reuse patterns across different domains, not just performance on isolated tasks
📚 References
People Mentioned:
- Uber Drivers - Part of diverse testing pool for ARC2 human validation study in San Diego
- UCSD Students - University of California San Diego students who participated in ARC2 testing
- Random Folks - Unemployed individuals and people looking to make money on the side who validated ARC2 accessibility
Companies & Products:
- OpenAI - Company behind the o3 model that achieved high performance on ARC1 but still struggles with ARC2
- GPT-4 - Baseline model that scores 0% on ARC2 tasks
- Claude - AI model that fails completely on ARC2 (0% performance)
- Llama 4 - Meta's language model that also scores 0% on ARC2
Technologies & Tools:
- ARC1 (Abstraction Reasoning Corpus) - Original 2019 benchmark that became a binary test for fluid intelligence
- ARC2 - March 2025 release focusing on compositional reasoning and test-time adaptation challenges
- ARC3 - Upcoming 2026 interactive benchmark testing agency and autonomous goal achievement
- Test-Time Adaptation (TTA) - Required approach for any meaningful performance above zero on ARC2
Concepts & Frameworks:
- Binary Test Problem - Limitation where benchmarks only distinguish between "no intelligence" and "some intelligence"
- Compositional Generalization - Ability to combine concepts in new ways, central focus of ARC2
- Interactive Agency - Capability to explore, learn interactively, and set goals autonomously (ARC3 focus)
- The Kaleidoscope Hypothesis - Theory that apparent novelty comes from recombination of limited "atoms of meaning"
- Abstractions - Reusable building blocks of understanding that can transfer across different situations
- Action Efficiency - Metric comparing how many actions AI takes versus humans to solve the same problem
🔧 What Are the Two Key Components of Intelligence Implementation?
The Fundamental Architecture for Building Intelligent Systems
The Two-Part Intelligence Framework:
- Abstraction Acquisition: Efficiently extract reusable abstractions from past experience and data feeds
- On-the-Fly Recombination: Efficiently select and recombine building blocks into models fit for current situation
The Critical Efficiency Factor:
- Not Just Capability: Intelligence isn't determined by whether you can do something
- Efficiency Focus: How efficiently you acquire abstractions and recombine them for novelty
- Data Efficiency: How much experience needed to acquire simple skills
- Compute Efficiency: How much processing required to deploy skills
Intelligence as Efficiency Metrics:
Examples of Inefficiency:
- Skill Acquisition: Needing hundreds of thousands of hours to acquire simple skill = low intelligence
- Chess Example: Enumerating every single move to find best move = low intelligence
- Real Intelligence: High skill demonstration through efficient acquisition and deployment
The Dual Efficiency Requirements:
- Data Efficiency: Learning from minimal examples
- Compute Efficiency: Solving problems with reasonable computational resources
- Both Required: True intelligence needs efficiency in both dimensions



🎯 Why Didn't Bigger Models and More Data Lead to AGI?
The Two Critical Missing Pieces in Current AI Systems
Missing Component #1: On-the-Fly Recombination
- Training vs. Testing Mismatch: Models learned abstractions during training but were static at test time
- Template Fetching: Could only retrieve and apply pre-recorded templates
- No Dynamic Adaptation: Lacked ability to recombine knowledge for new situations
Test-Time Adaptation as Solution:
- Recombination Capabilities: TTA adds the missing on-the-fly recombination abilities
- Huge Step Forward: Gets us much closer to AGI by enabling dynamic adaptation
- Critical Problem Addressed: Solves the static inference limitation
Missing Component #2: Incredible Inefficiency
Gradient Descent Limitations:
- Vast Data Requirements: Needs massive amounts of data to distill simple abstractions
- Order of Magnitude Gap: 3-4 orders of magnitude more data than humans need
- Simple Abstractions: Even basic concepts require enormous training datasets
Recombination Inefficiency:
- Expensive Computation: Latest TTA techniques need thousands of dollars of compute
- ARC1 Performance: Just to solve ARC1 at human level requires massive resources
- Scaling Failure: Doesn't even scale to ARC2 problems
The Fundamental Issue:
Missing Compositional Generalization:
- Deep Learning Gap: Models lack ability to compositionally combine learned elements
- ARC2 Target: What the benchmark specifically tries to measure
- Core Problem: Can't efficiently create new combinations from existing knowledge



🧠 What Are the Two Fundamental Types of Abstraction?
The Dual Nature of How Intelligence Processes Information
The Universal Abstraction Process:
- Compare Instances: Look at different examples of things
- Merge into Templates: Find common patterns across instances
- Eliminate Details: Drop specific details that don't matter for the pattern
- Create Abstraction: Left with reusable template that captures essence
The Key Distinction:
Domain Differences:
- Type 1: Operates over continuous domain (values, measurements, gradients)
- Type 2: Operates over discrete domain (programs, graphs, structures)
- Mirror Processes: Both follow same fundamental comparison and merging approach
Type 1: Value-Centric Abstraction
Continuous Distance Functions:
- Comparison Method: Things compared via continuous distance function
- Applications: Perception, pattern recognition, intuition
- Modern ML: What current machine learning systems excel at
- Transformer Strength: What makes transformers a major AI breakthrough
Type 1 Capabilities:
- Perception: Visual and sensory processing
- Intuition: Gut feelings and rapid pattern recognition
- Pattern Cognition: Recognizing similar structures across examples
Type 2: Program-Centric Abstraction
Discrete Program Comparison:
- Comparison Method: Comparing discrete programs (graphs)
- Structure Matching: Looking for exact isomorphisms and subgraph isomorphisms
- Human Reasoning: Underlying much of logical thought processes
- Software Engineering: What programmers do when refactoring code
The Programming Analogy:
- Software Engineer Abstraction: When engineers talk about abstraction, they mean Type 2
- Code Refactoring: Finding common patterns in discrete program structures
- Exact Matching: Unlike continuous distances, requires precise structural alignment
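A toy sketch of that exact-matching idea in Python (names are illustrative, and real program-centric abstraction would also match subgraphs up to renaming rather than only literal subtrees):

```python
# Type 2, program-centric abstraction in miniature: compare two expression
# trees (nested tuples) and extract the exactly-matching subtrees. The
# shared structure is the reusable "template".
def subtrees(t):
    """Yield every subtree of a nested-tuple expression."""
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:           # t[0] is the operator name
            yield from subtrees(child)

expr_a = ("add", ("mul", "x", "x"), ("mul", "y", "y"))   # x*x + y*y
expr_b = ("sub", ("mul", "x", "x"), "c")                 # x*x - c

# Exact structural matching: unlike Type 1's continuous distances,
# a subtree either matches perfectly or not at all.
common = set(subtrees(expr_a)) & set(subtrees(expr_b))
print(common)   # {('mul', 'x', 'x'), 'x'} (order may vary): the shared abstraction
```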



🔄 How Do These Two Types of Abstraction Create All Cognition?
The Left Brain-Right Brain Integration Model
The Cognitive Integration:
- All Cognition: Arises from combination of Type 1 and Type 2 abstraction
- Complementary Processes: Both driven by analogy-making but in different domains
- Value vs. Program Analogy: Different approaches to finding similarity and patterns
The Brain Hemisphere Metaphor:
- Left Brain: Type 2 - reasoning, planning, rigor, logical structure
- Right Brain: Type 1 - perception, intuition, pattern recognition
- Integration Required: Full intelligence needs both working together
Transformer Capabilities and Limitations:
Type 1 Excellence:
- Natural Fit: Transformers excel at value-centric abstraction
- Strong Performance: Perception, intuition, pattern cognition all work well
- Major Breakthrough: Represents significant advance in Type 1 capabilities
Type 2 Struggles:
Simple Task Failures:
- Sorting Lists: Struggle with basic sorting when provided as token sequences
- Adding Digits: Difficulty with arithmetic on digit sequences
- Sequential Logic: Problems with discrete logical operations
The Type 2 Gap:
What's Missing:
- Discrete Program Search: Need different approach than continuous optimization
- Structural Reasoning: Can't handle exact structure matching requirements
- Compositional Logic: Missing ability to combine discrete elements systematically



🔍 Why Is Discrete Program Search the Key to Invention?
Moving Beyond Automation to True Creative Capability
Search vs. Gradient Descent:
- Invention Requirement: Discrete program search unlocks invention beyond automation
- All Creative AI: Known AI systems capable of invention rely on discrete search
- Historical Evidence: Even 1990s systems used search for antenna design creativity
The Creative AI Examples:
- 1990s Antenna Design: Gigantic search spaces for novel antenna configurations
- AlphaGo Move 37: Famous creative move came from discrete search process
- AlphaEvolve System: DeepMind's recent creative system also uses discrete search
The Fundamental Principle:
Deep Learning vs. Search:
- Deep Learning: Doesn't invent, only interpolates within learned patterns
- Search: Enables genuine invention and creative leaps
- Invention Mechanism: Search can discover truly novel combinations
Search as Creative Engine:
- Novel Discovery: Can find solutions not present in training data
- Combinatorial Exploration: Explores space of possible program combinations
- Creative Leaps: Enables moves beyond interpolation of existing patterns
Discrete Program Search Definition:
Technical Framework:
- Combinatorial Search: Search over graphs of operators
- Language-Based: Operators taken from some Domain Specific Language (DSL)
- Graph Structures: Working with discrete symbolic graphs rather than continuous functions
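A minimal sketch of discrete program search under these assumptions: a tiny hypothetical list-manipulation DSL, programs as operator chains, and brute-force enumeration checked against input/output examples:

```python
# Discrete program search in miniature. Programs are chains of DSL
# operators; search enumerates compositions by increasing depth and
# returns the first one consistent with every example. All names here
# are illustrative.
from itertools import product

DSL = {
    "reverse":    lambda xs: xs[::-1],
    "sort":       sorted,
    "drop_first": lambda xs: xs[1:],
    "double":     lambda xs: [2 * x for x in xs],
}

def run(program, xs):
    for op in program:
        xs = DSL[op](xs)
    return xs

def search(examples, max_depth=3):
    """Enumerate operator chains up to max_depth."""
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):   # combinatorial blow-up
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

# Two examples suffice to pin down a program (data-efficient, but
# compute-hungry: cost grows as |DSL| ** depth).
examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(search(examples))   # ('sort', 'double')
```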



⚖️ How Does Program Synthesis Compare to Machine Learning?
The Fundamental Differences in Model Creation and Learning
Model Representation:
- Machine Learning: Model is a differentiable parametric function (a curve)
- Program Synthesis: Model is a discrete graph of symbolic operators from a language
- Fundamental Difference: Continuous vs. discrete representation of knowledge
Learning Engine Comparison:
- ML Learning Engine: Gradient descent - very computationally efficient
- Program Synthesis Learning: Search algorithms - extremely computationally inefficient
- Efficiency Trade-off: Fast learning vs. slow but more powerful discovery
Gradient Descent Advantages:
Computational Efficiency:
- Fast Model Finding: Can find models that fit data very quickly
- Efficient Process: Computationally efficient optimization process
- Rapid Convergence: Quick convergence to solutions within continuous space
Search Algorithm Challenges:
- Computational Cost: Extremely compute inefficient compared to gradient descent
- Exhaustive Exploration: Must explore combinatorial spaces of possible programs
- Scaling Issues: Computational requirements grow rapidly with problem complexity
The Key Obstacles:
Machine Learning Challenge:
- Data Hunger: Primary obstacle is massive data requirements
- Sample Efficiency: Needs many examples to learn patterns
- Generalization: Struggle to generalize beyond training distribution
Program Synthesis Challenge:
- Compute Expense: Extremely high computational requirements
- Search Space: Vast combinatorial spaces to explore
- Efficiency Gap: Orders of magnitude more expensive than gradient descent
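Both engines can recover the same toy function; what differs is what each consumes. A contrived side-by-side sketch using only the standard library:

```python
# Toy task: recover f(x) = 2x + 1 with each learning engine.
import random

# --- Machine learning: gradient descent on a parametric curve ---------
# Compute-efficient per step, but wants a dense sample of (x, y) pairs.
data = [(x, 2 * x + 1) for x in [random.uniform(-5, 5) for _ in range(200)]]
w, b, lr = 0.0, 0.0, 0.01
for _ in range(500):
    for x, y in data:
        err = (w * x + b) - y          # gradient of squared error
        w -= lr * err * x
        b -= lr * err
print(round(w, 2), round(b, 2))        # approximately 2.0, 1.0

# --- Program synthesis: search over a discrete expression space -------
# Data-efficient (two examples pin the answer), but the cost explodes
# with the size of the space: here only 11 * 11 integer candidates.
examples = [(1, 3), (4, 9)]
solutions = [(a, b) for a in range(-5, 6) for b in range(-5, 6)
             if all(a * x + b == y for x, y in examples)]
print(solutions)                       # [(2, 1)]
```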



💎 Key Insights
Essential Insights:
- Intelligence Has Two Components: Real intelligence requires both abstraction acquisition (learning reusable patterns) and on-the-fly recombination (adapting those patterns to new situations) - current AI systems excel at the first but lack the second
- Efficiency Defines Intelligence: Intelligence isn't about whether you can do something, but how efficiently you can do it - needing hundreds of thousands of hours for simple skills or thousands of dollars of compute for human-level performance indicates low intelligence
- Two Types of Abstraction: All cognition comes from combining Type 1 (continuous/value-centric for perception and intuition) and Type 2 (discrete/program-centric for reasoning and planning) - transformers excel at Type 1 but struggle with simple Type 2 tasks like sorting lists
Actionable Insights:
- Focus on Program Search: To achieve true invention and creativity, AI systems need discrete program search capabilities rather than just continuous optimization through gradient descent
- Measure Efficiency, Not Just Accuracy: When evaluating AI progress, prioritize data efficiency and compute efficiency rather than raw performance on benchmarks
- Combine Both Abstraction Types: Build AI systems that integrate both continuous pattern recognition and discrete structural reasoning rather than focusing solely on transformer-style approaches
📚 References
Companies & Products:
- DeepMind - Referenced for the AlphaEvolve system that uses discrete search for creative problem-solving
- AlphaGo - DeepMind's Go-playing system that used discrete search for creative moves like the famous Move 37
Technologies & Tools:
- Transformers - Neural network architecture that excels at Type 1 (value-centric) abstraction but struggles with Type 2 (program-centric) tasks
- Gradient Descent - Machine learning optimization technique that is computationally efficient but requires vast amounts of data
- Test-Time Adaptation (TTA) - Approach that adds on-the-fly recombination capabilities to AI systems
- ARC1 - Benchmark that requires thousands of dollars of compute for human-level performance with current TTA techniques
- ARC2 - More advanced benchmark that current systems cannot scale to solve efficiently
Concepts & Frameworks:
- Abstraction Acquisition - Process of efficiently extracting reusable patterns from past experience and data
- On-the-Fly Recombination - Ability to efficiently select and combine building blocks for current situations
- Type 1 Abstraction - Value-centric abstraction operating over continuous domains (perception, intuition, pattern cognition)
- Type 2 Abstraction - Program-centric abstraction operating over discrete domains (reasoning, planning, rigor)
- Compositional Generalization - Missing capability in current deep learning models that ARC2 attempts to measure
- Discrete Program Search - Combinatorial search over graphs of operators from domain-specific languages
- Domain Specific Language (DSL) - Specialized programming language providing operators for program synthesis
- Left Brain vs. Right Brain Metaphor - Conceptual framework distinguishing between reasoning/planning and perception/intuition
⚖️ What's the Fundamental Trade-off Between ML and Program Synthesis?
The Data vs. Compute Efficiency Paradox
Machine Learning Characteristics:
- Data Density Requirement: Need dense sampling of the data manifold to fit models
- High Data Needs: Requires massive amounts of training data
- Compute Efficient: Gradient descent is very computationally efficient for learning
Program Synthesis Characteristics:
- Extreme Data Efficiency: Can fit a program using only 2-3 examples
- Vast Search Space: Must sift through enormous space of potential programs
- Combinatorial Explosion: Search space grows combinatorially with problem complexity
The Inverse Relationship:
Opposite Strengths and Weaknesses:
- ML: Data hungry but compute efficient
- Program Synthesis: Data efficient but compute expensive
- Fundamental Trade-off: Can't have both efficiency types simultaneously with current approaches
The Scaling Wall:
- Combinatorial Explosion: Program synthesis hits computational limits quickly
- Search Space Growth: Exponential growth in complexity makes search intractable
- Practical Limitation: Prevents scaling to complex real-world problems



🧠 Why Must We Combine Both Types of Abstraction for True Intelligence?
The Human Intelligence Integration Model
The All-In Problem:
- Type 1 Only: Going all-in on continuous abstraction won't unlock full potential
- Type 2 Only: Focusing solely on discrete abstraction also limits capabilities
- Combination Necessity: Must combine both types to achieve real intelligence
Human Intelligence Excellence:
- What Makes Us Special: We combine perception/intuition with explicit step-by-step reasoning
- Universal Integration: Use both forms of abstraction in all thoughts and actions
- Natural Fusion: Seamlessly blend continuous and discrete processing
The Chess Example:
Type 2 Calculation:
- Step-by-Step Analysis: Calculate potential moves sequentially in mind
- Limited Scope: Can't analyze every possible move (too many options)
- Selective Analysis: Only examine a few promising options (knight, queen, etc.)
Type 1 Guidance:
- Intuitive Filtering: Use pattern recognition to narrow down options
- Board Pattern Recognition: Unconscious pattern matching from experience
- Experience Mining: Extract patterns from past games automatically
The Tractability Solution:
Making Type 2 Feasible:
- Intuition Guides Logic: Type 1 intuition makes Type 2 calculation tractable
- Pattern-Guided Search: Use continuous patterns to focus discrete search
- Efficiency Through Integration: Combination enables what neither can do alone
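A toy sketch of that division of labor: a cheap heuristic score stands in for Type 1 intuition and shortlists candidate moves, and exact depth-limited lookahead (Type 2) runs only on the shortlist. The names and the integer "game" are illustrative:

```python
# Intuition-guided search: prune the branching factor with a fast,
# approximate ranking, then calculate exactly on the survivors.
import heapq

def lookahead(state, depth, moves, apply_move, intuition, exact_value, k=3):
    """Depth-limited search restricted to the k most promising moves."""
    if depth == 0:
        return exact_value(state)
    # Type 1: fast, approximate ranking prunes the candidate moves.
    promising = heapq.nlargest(k, moves(state),
                               key=lambda m: intuition(state, m))
    # Type 2: slow, exact calculation runs only on the shortlist.
    return max(lookahead(apply_move(state, m), depth - 1, moves,
                         apply_move, intuition, exact_value, k)
               for m in promising)

# Toy domain: states are integers, a move adds a delta, value is the state.
best = lookahead(0, depth=3,
                 moves=lambda s: range(-9, 10),
                 apply_move=lambda s, m: s + m,
                 intuition=lambda s, m: m,     # "gut feeling": bigger is better
                 exact_value=lambda s: s,
                 k=3)
print(best)   # 27: found by examining 3**3 lines instead of all 19**3
```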



🗺️ How Can We Use "Map Drawing" to Solve the Combinatorial Explosion?
The Revolutionary Approach to Making Program Search Tractable
The System Integration Strategy:
- Type 2 Technique: Discrete search over program space (hits combinatorial explosion)
- Type 1 Technique: Curve fitting and interpolation on continuous manifolds
- Integration Solution: Use fast approximate judgments to fight combinatorial explosion
The Continuous Embedding Approach:
- Fast Approximation: Take lots of data and embed on interpolating manifold
- Approximate Judgments: Enable fast but approximate decisions about target space
- Explosion Control: Use these judgments to make program search tractable
The Map Drawing Analogy:
From Discrete to Continuous:
- Discrete Objects: Start with space of discrete objects with discrete relationships
- Normally Requires Search: Would typically need combinatorial search (like subway pathfinding)
- Embedding Strategy: Embed objects into latent space with continuous distance functions
The Pathfinding Example:
- Subway System: Discrete stations with discrete connections
- Search Problem: Finding paths requires exploring connection combinations
- Continuous Approximation: Map to continuous space where distance approximates relationships
The Technical Implementation:
Hybrid Architecture:
- Latent Space Embedding: Transform discrete program space into continuous representations
- Distance Functions: Use continuous metrics to approximate discrete relationships
- Guided Search: Fast approximations guide expensive discrete search
- Explosion Prevention: Keep combinatorial explosion in check during search
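A minimal sketch of the map-drawing idea, assuming a four-station toy "subway" whose hand-made 2-D coordinates stand in for a learned embedding; straight-line distance in that space is the fast approximate judgment guiding exact A* search:

```python
# Embed discrete stations in a continuous space and use straight-line
# distance as the heuristic for exact discrete search (A*).
import heapq, math

coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}   # embedding
edges  = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def dist(u, v):
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x1 - x2, y1 - y2)

def astar(start, goal):
    """Exact search over the discrete graph; the continuous embedding
    supplies the heuristic that keeps the frontier small."""
    frontier = [(dist(start, goal), 0.0, start, [start])]
    seen = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in edges[node]:
            g = cost + dist(node, nxt)          # exact edge cost
            heapq.heappush(frontier,
                           (g + dist(nxt, goal), g, nxt, path + [nxt]))
    return None

print(astar("A", "D"))   # ['A', 'B', 'C', 'D']
```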



👨‍💻 What Will the Next Generation of AI Look Like?
The Programmer-Like Meta-Learner Vision
The Fundamental Shift:
- From Static Models: Move away from fixed, pre-trained systems
- To Dynamic Programmers: AI systems that write software for each new task
- On-the-Fly Synthesis: Generate custom programs adapted to specific situations
The Meta-Learner Architecture:
- Task-Specific Programs: Synthesize programs tailored for each new challenge
- Hybrid Modules: Blend deep learning and algorithmic components
- Adaptive Assembly: Dynamically combine different types of processing
The Module Integration:
Deep Learning Submodules:
- Type 1 Problems: Handle perception and pattern recognition tasks
- Continuous Processing: Leverage transformer-style capabilities for intuitive tasks
Algorithmic Modules:
- Type 2 Problems: Handle logical reasoning and discrete processing
- Structured Computation: Perform step-by-step logical operations
- Symbolic Manipulation: Work with discrete symbolic representations
The Assembly System:
Guided Program Search:
- Search System: Discrete program search assembles the overall system
- Deep Learning Guidance: DL-based intuition guides search through program space
- Structure Understanding: Intuitive knowledge about what program structures work
The Intelligence Integration:
- Best of Both Worlds: Combines continuous intuition with discrete reasoning
- Dynamic Architecture: Each task gets custom-built solution
- Efficient Search: Intuition makes combinatorial search feasible



📚 How Will the Global Abstraction Library Work?
The Evolving Knowledge Repository for AI Systems
The Library Concept:
- Global Repository: Shared library of reusable building blocks and abstractions
- Constantly Evolving: Library grows and improves as it learns from incoming tasks
- Not From Scratch: Search process leverages existing knowledge rather than starting over
The Learning Cycle:
- New Problem Appears: System searches library for relevant building blocks
- Synthesis Process: While solving problems, creates new building blocks
- Upload Back: New abstractions get added to the global library
- Collective Growth: Library becomes richer with each solved problem
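A toy sketch of this loop, reusing the enumeration idea from the earlier program-search sketch; the seed blocks and the `solve` helper are illustrative:

```python
# Global abstraction library in miniature: search for a program built
# from known blocks, and when a composite solves a new task, register
# it back into the library as a single reusable abstraction.
from itertools import product

library = {                      # seed abstractions
    "reverse": lambda xs: xs[::-1],
    "sort":    sorted,
}

def solve(examples, max_depth=2):
    """Find a chain of library blocks consistent with the examples,
    then upload the winning composite back into the library."""
    for depth in range(1, max_depth + 1):
        for chain in product(list(library), repeat=depth):
            def composite(xs, chain=chain):
                for name in chain:
                    xs = library[name](xs)
                return xs
            if all(composite(i) == o for i, o in examples):
                library["+".join(chain)] = composite   # grow the library
                return chain
    return None

# Solving one task adds 'sort+reverse' as a single block for later reuse.
print(solve([([3, 1, 2], [3, 2, 1])]))    # ('sort', 'reverse')
print("sort+reverse" in library)          # True
```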
The Software Engineering Analogy:
GitHub-Like Sharing:
- Individual Development: Software engineer develops useful library for their work
- Community Sharing: Upload to GitHub for others to reuse
- Collective Benefit: Everyone benefits from shared abstractions
The Reusability Principle:
- Abstraction Reuse: Previously solved patterns help with new problems
- Knowledge Transfer: Solutions from one domain apply to another
- Cumulative Intelligence: System gets smarter by building on past work
The Ultimate Goal:
Human-Like Problem Solving:
- New Situation Response: AI faces completely new challenges
- Rich Library Access: Leverages extensive abstraction repository
- Quick Assembly: Rapidly creates working models from existing components
- Software Engineer Parallel: Similar to how humans use existing tools and libraries
Continuous Improvement:
- Library Expansion: Constantly growing collection of abstractions
- Intuition Refinement: Improving understanding of program space structure
- Self-Improvement: System becomes more capable over time



🏢 What Is Ndea and Why Was It Created?
The New Research Lab Building the Future of AI
The Mission:
- Scientific Progress Acceleration: Dramatically accelerate scientific progress through AI
- Independent Invention: Need AI capable of independent invention and discovery
- Knowledge Frontier Expansion: AI that expands frontiers of knowledge, not just operates within them
The Vision Gap:
- Current Limitation: Existing AI operates within known boundaries
- Required Capability: Need systems that push beyond current knowledge limits
- Discovery Focus: Emphasis on genuine discovery rather than just automation
The Technical Approach:
Deep Learning Guided Program Search:
- Hybrid Method: Combines deep learning guidance with program search
- Programmer-Like Meta-Learner: Building the system described in previous cards
- Scientific Focus: Specifically designed for scientific discovery applications
Beyond Automation:
- Deep Learning Strength: Great at automation tasks
- Scientific Requirement: Discovery requires something more than automation
- New Form Needed: Belief that new AI form is key to acceleration
The First Milestone:
ARC Benchmark Challenge:
- Starting Point: System begins knowing nothing about ARC
- Complete Learning: Must learn to solve ARC from scratch
- Progress Validation: Use ARC performance to test system capabilities
The Ultimate Application:
- Science Empowerment: Leverage system to empower human researchers
- Timeline Acceleration: Help accelerate the timeline of scientific discovery
- Human Partnership: AI-human collaboration for scientific breakthroughs
The Founding Motivation:
Why Start Ndea:
- Belief in Necessity: Conviction that dramatic acceleration requires new AI form
- Independent Discovery: Focus on AI that can make genuine discoveries
- Scientific Impact: Goal to transform how science progresses



💎 Key Insights
Essential Insights:
- The Efficiency Paradox: Machine learning is data-hungry but compute-efficient, while program synthesis is data-efficient (2-3 examples) but compute-expensive due to combinatorial explosion - the key breakthrough is combining both approaches
- Human Intelligence Integration: Humans excel because we seamlessly combine Type 1 (intuitive pattern recognition) with Type 2 (step-by-step reasoning) - like using chess intuition to focus logical calculation on promising moves only
- The Programmer AI Vision: Next-generation AI will work like programmers, writing custom software for each task by combining deep learning modules (for perception) with algorithmic modules (for reasoning), guided by deep learning intuition about program space
Actionable Insights:
- Build Hybrid Systems: Create AI architectures that combine continuous optimization with discrete program search, using the strengths of each to compensate for the other's weaknesses
- Develop Global Abstraction Libraries: Build systems that accumulate and share reusable building blocks across tasks, enabling knowledge transfer and cumulative learning like software engineers sharing code on GitHub
- Focus on Scientific Discovery: Target AI development toward expanding knowledge frontiers rather than just automating known tasks, as this requires genuine invention capabilities beyond current deep learning
📚 References
People Mentioned:
- Software Engineers - Used as analogy for how the global abstraction library will work, with AI systems sharing building blocks like developers share code on GitHub
Companies & Products:
- GitHub - Referenced as model for how AI systems will share reusable abstractions and building blocks in a global library
- Ndea - François Chollet's new AI research lab focused on building programmer-like meta-learners for scientific discovery
Technologies & Tools:
- ARC Benchmark - First milestone for Ndea's system, which must learn to solve ARC starting from knowing nothing about it
- Gradient Descent - Machine learning technique that is computationally efficient but requires dense data sampling
- Program Synthesis - Approach that is extremely data-efficient (2-3 examples) but computationally expensive due to combinatorial search
Concepts & Frameworks:
- Combinatorial Explosion - The exponential growth in search space complexity that makes program synthesis computationally intractable
- Data Manifold - The mathematical space that machine learning models need to densely sample to fit data effectively
- Programmer-Like Meta-Learner - Vision for next-generation AI that writes custom software for each task, combining deep learning and algorithmic modules
- Global Abstraction Library - Evolving repository of reusable building blocks that AI systems can leverage and contribute to
- Deep Learning Guided Program Search - Hybrid approach using continuous intuition to make discrete program search tractable
- Latent Space Embedding - Technique for representing discrete objects in continuous space to enable fast approximate judgments
- Scientific Discovery vs. Automation - Distinction between AI that operates within knowledge boundaries versus AI that expands them