Prompt Engineering Advice From Top AI Startups

At first, prompting seemed to be a temporary workaround for getting the most out of large language models. But over time, it's become critical to the way we interact with AI. On the Lightcone, Garry, Harj, Diana, and Jared break down what they've learned from working with hundreds of founders building with LLMs: why prompting still matters, where it breaks down, and how teams are making it more reliable in production. They share real examples of prompts that failed, how companies are testing...

May 30, 2025 · 31:26

🚀 Introduction: The New Frontier of Prompt Engineering

Welcome to the Lightcone Podcast, where YC Partners Garry, Harj, Diana, and Jared dive into what's actually happening inside the best AI startups when it comes to prompt engineering. They've surveyed more than a dozen companies to pull back the curtain on practical techniques from the frontier of building AI products.

The conversation sets up the reality that while prompting may have seemed like a temporary workaround initially, it has become critical to how we interact with AI systems effectively.

Timestamp: [0:00-0:58]

🎯 Real-World Example: Parahelp's Production Prompt

Jared shares an exclusive look at a production prompt from Parahelp, an AI customer support company that powers support for major AI companies like Perplexity, Replit, and Bolt. This represents a rare opportunity to see the "crown jewels" of a vertical AI agent company's intellectual property.

The Parahelp team graciously agreed to open-source their actual prompt that powers their AI agent, providing unprecedented insight into how professional-grade AI customer support actually works behind the scenes. When you email a customer support ticket to Perplexity, what's responding is actually Parahelp's AI agent using this sophisticated prompt structure.

This example demonstrates the level of sophistication required for AI agents operating in production environments where reliability and consistency are paramount.

Timestamp: [0:58-1:44]

📋 Anatomy of a Professional Prompt: Six Pages of Precision

Diana walks through the detailed structure of Parahelp's production prompt, revealing it to be six pages long with very specific architectural decisions. The prompt demonstrates several key principles that separate professional-grade prompts from amateur attempts.

The prompt begins by establishing the LLM's role as "a manager of a customer service agent" and breaks down responsibilities into clear bullet points. It then defines the specific task of approving or rejecting tool calls, since the system orchestrates agent calls from multiple other agents.

The structure follows a step-by-step approach with numbered steps (one through five) and includes important constraints about what kinds of tools it should not call. Output formatting is meticulously specified because agents need to integrate with other agents, requiring precise API-like interactions.

The prompt uses markdown-style formatting with clear headings and sub-bullet sections, making it easier for LLMs to parse and follow. It includes three major sections covering planning methodology, step creation processes, and high-level planning examples.
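
To make this structure concrete, here is a minimal sketch of a prompt skeleton that follows the pattern described: a role, bulleted responsibilities, numbered steps, explicit constraints, and a strictly specified output format. It is an illustration only, not Parahelp's actual prompt; every name and rule in it is invented.

```python
# Illustrative skeleton only -- not Parahelp's actual prompt. It mirrors the
# structure described above: role, bulleted responsibilities, numbered steps,
# explicit constraints, and a strictly specified output format.
MANAGER_PROMPT = """\
# Role
You are a manager of a customer service agent. Your responsibilities:
- Review each tool call proposed by the agent you supervise.
- Approve it, reject it, or ask for more information.

# Task
Decide whether to approve or reject the tool call in the conversation below.

# Steps
1. Read the customer's ticket and the full conversation history.
2. Read the tool call the agent is proposing.
3. Check the call against the constraints listed below.
4. If required information is missing, reject and explain what is missing.
5. Output your decision in exactly the format specified.

# Constraints
- Never approve tool calls that issue refunds above the configured limit.
- Never approve tool calls that modify another customer's account.

# Output format
Respond with exactly one of:
APPROVE: <one-sentence justification>
REJECT: <one-sentence justification>
"""
```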

Timestamp: [1:44-4:15]

🛠️ The Programming-Like Nature of Modern Prompts

The conversation reveals how sophisticated prompts have evolved to look more like programming than natural English writing. The Parahelp prompt uses XML tag formatting to specify plans and structure, which has proven more effective than traditional prose approaches.

This technical approach stems from understanding how LLMs were trained - many were post-trained with XML-type input during their RLHF (Reinforcement Learning from Human Feedback) process, making them naturally better at parsing structured, tag-based instructions.

The hosts note that what they're seeing is just the general system prompt, with customer-specific examples and workflows handled in subsequent stages of the pipeline. This separation allows for scalability while maintaining customization.
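
A hedged sketch of what XML-tag plan formatting might look like in practice follows; the tag and attribute names are assumptions, since the actual schema isn't shown in the episode.

```python
# Hypothetical sketch of XML-style plan formatting; the tag and attribute
# names are assumptions, not Parahelp's actual schema.
PLAN_FORMAT_INSTRUCTIONS = """\
When you produce a plan, wrap it in <plan> tags and express each action as a
<step> element naming the tool to call and its arguments, for example:

<plan>
  <step tool="lookup_order" args='{"order_id": "..."}' />
  <step tool="draft_reply" args='{"tone": "apologetic"}' />
</plan>

Do not emit any text outside the <plan> element.
"""
```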

Timestamp: [3:34-4:44]

🏗️ Prompt Architecture: System, Developer, and User Layers

The hosts break down the emerging architecture of professional prompt systems into three distinct layers, each serving different purposes in the AI application stack.

System Prompt: Defines the high-level API of how the company operates. The Parahelp example represents a pure system prompt with nothing customer-specific - it establishes the fundamental operating principles and capabilities of the AI agent.

Developer Prompt: Contains all the customer-specific context and workflows. For Parahelp, this layer would include specific instructions for handling Perplexity's FAQ questions differently from Bolt's technical support needs. This is where customization happens without rebuilding the entire system.

User Prompt: Contains the end-user input. For products like Replit or Cursor, this would be where users type requests like "generate me a site that has these buttons." Parahelp doesn't have a user prompt layer since their product isn't consumed directly by end users.

This architectural approach addresses a critical challenge for vertical AI agent companies: how to build flexible, general-purpose products without becoming consulting companies that build custom prompts for every customer. The layered approach allows for systematic scaling while maintaining necessary customization.
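
One way to picture the layering is as a small assembly step before each model call. The sketch below is a minimal illustration, assuming a chat-style API with message roles; the placeholder contents are invented, and whether the developer layer maps to a dedicated role or is appended to the system message depends on the provider.

```python
# Minimal sketch of the three-layer split (contents are invented placeholders).
# Some providers expose a dedicated "developer" role; here the developer layer
# is passed as a second system message, which is one common workaround.
SYSTEM_PROMPT = "You are a customer support agent manager. <general operating rules go here>"

def build_messages(developer_prompt: str, user_input: str | None) -> list[dict]:
    """Combine the system, developer, and optional user layers into one request."""
    messages = [
        # System layer: company-wide behavior, nothing customer-specific.
        {"role": "system", "content": SYSTEM_PROMPT},
        # Developer layer: per-customer workflows, FAQs, and worked examples.
        {"role": "system", "content": "Customer-specific instructions:\n" + developer_prompt},
    ]
    if user_input is not None:
        # User layer: the end user's request (absent for products like Parahelp
        # that are not consumed directly by end users).
        messages.append({"role": "user", "content": user_input})
    return messages

# Same system layer, different developer layers for different customers.
perplexity_msgs = build_messages("Handle subscription and billing FAQs.", None)
bolt_msgs = build_messages("Handle technical support for build failures.", None)
```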

Timestamp: [5:02-6:04]

🔧 The Automation Opportunity in Prompt Engineering

The conversation identifies significant startup opportunities in building tooling around prompt engineering, particularly for automatically generating and optimizing worked examples - a critical component for improving AI output quality.

The hosts envision an ideal scenario where an agent automatically extracts the best examples from customer datasets and seamlessly integrates them into the appropriate pipeline layer without manual intervention. Currently, companies like Parahelp need high-quality worked examples specific to each customer, but this process requires significant manual effort.

This automation challenge represents a natural segue into meta-prompting, where AI systems help improve their own prompting strategies. The need for better tooling around example selection, prompt optimization, and pipeline integration suggests a rich ecosystem of potential solutions for teams building AI products at scale.
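
As a rough sketch of the automation idea, the snippet below ranks a customer's historical tickets against an incoming one and splices the best matches into the developer-prompt layer as worked examples. It is a toy: the token-overlap scoring and the data shapes are assumptions standing in for real embedding search and quality filtering.

```python
# Toy sketch of automated example selection: rank a customer's historical
# tickets against the incoming one and format the best matches as worked
# examples for the developer-prompt layer. Token overlap stands in for real
# embedding search; the data shapes are assumptions.
def overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def select_worked_examples(ticket: str, history: list[dict], k: int = 3) -> str:
    """history items assumed to look like {"ticket": ..., "resolution": ...}."""
    ranked = sorted(history, key=lambda h: overlap(ticket, h["ticket"]), reverse=True)
    blocks = [
        "<example>\n<ticket>{}</ticket>\n<resolution>{}</resolution>\n</example>".format(
            h["ticket"], h["resolution"]
        )
        for h in ranked[:k]
    ]
    return "\n".join(blocks)
```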

Timestamp: [6:11-6:55]

🔄 Meta-Prompting: AI Helping AI Get Better

Garry introduces meta-prompting through the example of Tropir, a YC startup that helps companies debug and understand multi-stage AI workflows. They've developed "prompt folding" - a technique where one prompt dynamically generates better versions of itself.

The concept works by taking an existing prompt that may have failed or underperformed, feeding it to an LLM along with examples of where it went wrong, and asking the AI to improve the prompt rather than manually rewriting it. This approach leverages the fact that LLMs understand themselves surprisingly well.

A practical example involves classifier prompts that generate specialized prompts based on previous queries, creating a dynamic optimization loop where the AI system continuously improves its own performance based on real-world usage patterns.
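
Below is a minimal sketch of what a prompt-folding loop might look like, assuming a generic `call_llm` client: the current prompt and its failure cases are handed back to a model, which returns an improved prompt. The template wording is illustrative, not Tropir's actual implementation.

```python
# Hedged sketch of a prompt-folding loop: hand the current prompt and its
# failure cases to a model and ask for a rewrite. `call_llm` is a stand-in
# for whatever client you use; the template wording is illustrative.
IMPROVE_PROMPT_TEMPLATE = """\
You are an expert prompt engineer. Below is a prompt we run in production and
examples where it produced wrong or low-quality output.

<current_prompt>
{prompt}
</current_prompt>

<failures>
{failures}
</failures>

Rewrite the prompt so these failures would not recur, without breaking the
behaviors it currently gets right. Return only the new prompt.
"""

def fold_prompt(call_llm, prompt: str, failures: list[str]) -> str:
    filled = IMPROVE_PROMPT_TEMPLATE.format(prompt=prompt, failures="\n---\n".join(failures))
    return call_llm(filled)
```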

Timestamp: [6:55-8:02]

📚 Complex Tasks: Learning from Expert Examples

For particularly complex tasks, Diana discusses how companies like Jasberry use sophisticated example-based training. Jasberry builds automatic bug-finding tools for code, which requires the AI to identify subtle issues that even expert programmers find challenging.

Their approach involves feeding the AI numerous examples of complex bugs that only expert programmers could typically identify. For instance, detecting N+1 query problems requires understanding both database optimization and code structure patterns that are difficult to describe in prose alone.

This pattern of using examples instead of trying to write detailed prose instructions works particularly well because it helps LLMs reason around complicated tasks and provides concrete steering mechanisms. The approach resembles unit testing in programming - it's like test-driven development for LLM behavior.

When tasks are too complex to parameterize exactly, showing the AI what good and bad outputs look like becomes more effective than trying to describe the nuances in natural language.
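
For illustration, here is the kind of worked example such a prompt might carry, assuming a simple dict shape for few-shot pairs: a snippet exhibiting an N+1 query pattern alongside the expert verdict and explanation the model should learn to reproduce. It is not one of Jasberry's actual examples.

```python
# Illustrative worked example (not one of Jasberry's actual examples): an N+1
# query pattern paired with the expert verdict and explanation, in a simple
# dict shape suitable for few-shot prompting.
N_PLUS_ONE_EXAMPLE = {
    "code": """\
orders = db.query("SELECT id FROM orders WHERE user_id = ?", user_id)
items = []
for order in orders:
    # One extra query per order: the N+1 pattern.
    items += db.query("SELECT * FROM order_items WHERE order_id = ?", order.id)
""",
    "verdict": "bug",
    "explanation": (
        "Issues one query per order inside the loop (N+1). Fetch all items in "
        "a single query with WHERE order_id IN (...) or a JOIN instead."
    ),
}
```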

Timestamp: [8:02-9:00]

⚠️ The Hallucination Trap: When AI Tries Too Hard to Help

Tropir discovered a critical insight about LLM behavior: models are so eager to help that they'll fabricate responses rather than admit uncertainty. When asked for output in a specific format, LLMs will often generate plausible-looking responses even when they lack sufficient information.

The solution involves providing LLMs with explicit "escape hatches" - clear instructions to stop and ask for clarification rather than making up answers. This requires telling the AI that if it doesn't have enough information to make a determination, it should pause and request additional context rather than generating a potentially incorrect response.

This insight challenges the common approach of being overly prescriptive about output formats without considering the model's tendency to comply even when inappropriate. Building in explicit uncertainty handling becomes crucial for reliable AI systems in production environments.
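
A minimal escape-hatch clause might look like the sketch below; the wording is an assumption rather than a quote from any production prompt.

```python
# Minimal escape-hatch clause; the wording is an assumption, not a quote from
# any company's production prompt.
ESCAPE_HATCH = """\
If you do not have enough information to make a determination, do NOT guess
and do NOT invent values to satisfy the output format. Instead respond with:
NEED_CLARIFICATION: <a specific question describing what is missing>
"""
```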

Timestamp: [9:06-9:41]

🔧 YC's Debug Info Innovation: AI That Reports Its Own Problems

Harj describes an inventive approach developed at YC for giving LLMs an escape hatch through structured debugging information. Instead of just asking the AI to stop when confused, they built a systematic way for the AI to report issues back to developers.

Their response format includes a dedicated "debug info" parameter where the LLM can essentially file complaints about confusing or underspecified information it receives. This creates a feedback loop where the AI actively helps developers identify problems with their prompts and workflows.

The system runs in production with real user data, allowing developers to review outputs and extract actionable feedback. The debug info parameter becomes a to-do list for agent developers, with the AI itself identifying specific areas that need improvement.

This approach transforms the AI from a passive tool into an active participant in improving the development process, creating a collaborative dynamic between human developers and AI systems.
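
The sketch below shows what such a response format might look like, assuming a JSON output with an extra field for the model's complaints; the field names and the helper that turns them into a to-do list are illustrative, not YC's actual format.

```python
# Sketch of a response format with a debug-info escape hatch. The field names
# are assumptions, not YC's actual schema; the helper turns production outputs
# into a to-do list for the prompt's developers.
RESPONSE_FORMAT_INSTRUCTIONS = """\
Return a single JSON object with exactly these keys:
{
  "answer": "<your response to the ticket>",
  "confidence": "high | medium | low",
  "debug_info": "<anything in the instructions or ticket that was ambiguous,
                  contradictory, or missing; empty string if nothing>"
}
"""

def collect_debug_todos(responses: list[dict]) -> list[str]:
    """Gather non-empty debug_info fields from production outputs for review."""
    return [r["debug_info"] for r in responses if r.get("debug_info")]
```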

Timestamp: [9:41-10:47]

🎓 Getting Started: Simple Meta-Prompting for Everyone

Harj provides practical advice for hobbyists and developers interested in experimenting with meta-prompting techniques. The approach is surprisingly accessible and follows the same structural principles used by professional teams.

The simple method involves giving the AI a role as an expert prompt engineer who provides detailed critiques and improvement advice. You then feed it your existing prompt and ask for feedback and enhancement suggestions. This creates an iterative improvement loop that often yields significantly better results.

The process works surprisingly well and can be repeated multiple times, with each iteration potentially improving the prompt further. This democratizes access to sophisticated prompt optimization techniques that were previously available only to teams with extensive AI expertise.
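
The starter recipe can be as simple as a reusable critique template like the one below; the exact wording is just one reasonable variant.

```python
# The simple starter recipe as a reusable template; any phrasing in this
# spirit works.
CRITIQUE_TEMPLATE = """\
You are an expert prompt engineer. Critique the prompt below: point out
ambiguities, missing constraints, and formatting problems, then propose an
improved version.

<prompt>
{prompt}
</prompt>
"""
```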

Timestamp: [10:47-11:13]

⚡ Production Optimization: Big Models Train Small Models

Companies frequently use meta-prompting with large, powerful models to create optimized prompts that can then run efficiently on smaller, faster models. This approach balances quality with performance requirements, particularly important for applications requiring low latency.

The typical workflow involves using models with hundreds of billions of parameters (like Claude 3.5 or GPT-4) to perform meta-prompting and generate highly refined prompts. These optimized prompts are then deployed on smaller, faster models that can respond quickly enough for real-time applications.

This pattern is especially common among voice AI agent companies, where response latency is critical for maintaining the illusion of natural conversation. If there's too much pause before the agent responds, humans can detect that something is artificial, breaking the conversational flow.

The result is a two-stage optimization process: use powerful models for prompt development, then deploy optimized prompts on fast models for production use. This allows companies to achieve both high quality and low latency in their AI applications.
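
A hedged sketch of the two-stage pattern follows, assuming placeholder model names and a generic `call_llm(model, prompt)` client: meta-prompting runs offline on the large model, and only the refined prompt ships to the small, latency-sensitive model.

```python
# Hedged sketch of the two-stage pattern. Model names are placeholders and
# `call_llm(model, prompt)` stands in for your provider's client; the point is
# the division of labor, not the specific models.
BIG_MODEL = "large-frontier-model"   # used offline; latency does not matter
SMALL_MODEL = "small-fast-model"     # serves production traffic; latency-critical

def refine_prompt_offline(call_llm, draft_prompt: str) -> str:
    """Meta-prompt on the big model to produce a tighter production prompt."""
    meta = ("You are an expert prompt engineer. Tighten and improve the prompt "
            "below. Return only the improved prompt.\n\n" + draft_prompt)
    return call_llm(BIG_MODEL, meta)

def answer_in_production(call_llm, refined_prompt: str, user_turn: str) -> str:
    """Serve a real-time turn (e.g., a voice exchange) on the small, fast model."""
    return call_llm(SMALL_MODEL, refined_prompt + "\n\nUser: " + user_turn)
```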

Timestamp: [11:13-12:05]

💎 Key Insights

  • Modern prompt engineering resembles programming more than natural language writing, with XML tags and structured formatting proving more effective than prose
  • Professional prompts follow a three-layer architecture: system prompts (general operations), developer prompts (customer-specific context), and user prompts (end-user input)
  • Meta-prompting allows AI systems to improve their own prompts through iterative feedback loops, with LLMs surprisingly effective at self-optimization
  • Complex tasks benefit more from expert examples than detailed written instructions, similar to test-driven development approaches
  • LLMs need explicit "escape hatches" to avoid hallucinating responses when they lack sufficient information
  • Production systems increasingly use large models to optimize prompts that then run on smaller, faster models for latency-sensitive applications
  • The biggest opportunity lies in automating the extraction and integration of worked examples from customer datasets

Timestamp: [0:00-12:05]

📚 References

Companies:

  • Parahelp - AI customer support company powering Perplexity, Replit, and Bolt
  • Tropir - YC startup helping companies debug multi-stage AI workflows
  • Jasberry - Company building automatic bug-finding tools for code
  • Perplexity - AI search company using Parahelp for customer support
  • Replit - Online coding platform using Parahelp for customer support
  • Bolt - Development platform using Parahelp for customer support
  • YC (Y Combinator) - Startup accelerator mentioned as context for examples

Technical Concepts:

  • Meta-prompting - Using AI to improve its own prompts through iterative feedback
  • Prompt folding - Technique where prompts dynamically generate better versions of themselves
  • RLHF (Reinforcement Learning from Human Feedback) - Training process that makes LLMs better at parsing XML-structured input
  • N+1 query problems - Database optimization issues that expert programmers must identify

AI Models:

  • Claude 3.5 - Large language model used for meta-prompting optimization
  • GPT-4 - Large language model used for meta-prompting optimization

Timestamp: [0:00-12:05]

📝 Managing Long Prompts: Documentation and Iteration Strategies

As prompts grow into large working documents spanning multiple pages, managing their evolution becomes critical. The panel discusses practical strategies for tracking improvements and managing complex prompt development cycles.

One effective approach involves maintaining a Google Doc to note down specific issues with outputs or areas for improvement. Rather than trying to fix everything immediately, developers can collect observations about where the AI isn't performing as expected and batch these notes for systematic improvement.

This documentation-driven approach allows teams to systematically collect feedback and then leverage AI tools to suggest specific improvements rather than making ad-hoc changes that might introduce new problems.

Timestamp: [12:12-12:42]

🔍 Thinking Traces: The Hidden Debug Information

Gemini 2.5 Pro's thinking traces provide unprecedented insight into how AI models process prompts and make decisions. These traces reveal the internal reasoning process, showing exactly where prompts succeed or fail in guiding the model's behavior.

The thinking traces function as critical debug information that was previously unavailable through API access. This capability has recently been added to the API, allowing developers to integrate this debugging information directly into their development tools and workflows.

The long context windows in Gemini Pro enable a particularly effective debugging approach where developers can process prompts example by example, watching the reasoning trace in real time to understand how to better steer the model toward desired outcomes.

Timestamp: [12:42-13:38]

🛠️ REPL-Style Debugging with Large Context Windows

Gemini Pro's extensive context window enables a REPL (Read-Eval-Print Loop) style debugging approach where developers can interactively test and refine prompts. This method involves putting a prompt alongside one example and literally watching the reasoning trace in real time.

YC's software team has built specialized workbenches for debugging AI workflows, but the panel notes that sometimes direct interaction through gemini.google.com proves more effective. The platform allows developers to drag and drop JSON files directly into the interface without requiring special containers or complex setup.

This approach democratizes access to sophisticated debugging capabilities, making advanced prompt optimization techniques available even to teams without custom tooling infrastructure.
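
A rough sketch of the example-by-example loop is shown below, assuming a generic `call_llm` client and a JSON file of labeled cases; models that expose their reasoning or thinking output would let you print that trace alongside each answer.

```python
import json

# Rough sketch of the example-by-example loop; `call_llm` is a stand-in client
# and the file is assumed to hold a list of {"input": ..., "expected": ...}.
# Models that expose reasoning/thinking output would let you print that trace
# alongside each answer.
def debug_prompt(call_llm, prompt: str, examples_path: str) -> None:
    with open(examples_path) as f:
        examples = json.load(f)
    for ex in examples:
        output = call_llm(prompt + "\n\nInput:\n" + ex["input"])
        print("INPUT:   ", ex["input"][:120])
        print("EXPECTED:", ex["expected"])
        print("GOT:     ", output)
        print("-" * 60)
        input("Press Enter for the next example...")  # REPL-style pacing
```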

Timestamp: [13:27-14:02]

🏆 Evals: The True Crown Jewels of AI Companies

While prompts often get attention as the core intellectual property of AI companies, the panel reveals that evaluations (evals) represent the true crown jewels. Parahelp's willingness to open-source their prompt stems from their belief that the real value lies in their evaluation systems.

Evals provide the critical context for understanding why specific prompt decisions were made and enable systematic improvement over time. Without comprehensive evaluation systems, even the most sophisticated prompts become difficult to maintain, debug, or enhance as requirements evolve.

This insight reframes the competitive landscape for AI companies, suggesting that sustainable advantages come from evaluation capabilities rather than prompt engineering alone.
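
To make the point concrete, a minimal eval harness can be as small as the sketch below, which scores a prompt against labeled cases; exact-match grading and the case shape are simplifying assumptions, since real evals typically use rubrics or model graders.

```python
# Minimal eval-harness sketch: score a prompt against labeled cases. Exact-match
# grading and the case shape are simplifying assumptions; real evals typically
# use rubrics or model graders.
def run_eval(call_llm, prompt: str, cases: list[dict]) -> float:
    """cases assumed to look like [{"input": ..., "expected": ...}]; returns accuracy."""
    passed = 0
    for case in cases:
        output = call_llm(prompt + "\n\n" + case["input"]).strip()
        passed += int(output == case["expected"])
    return passed / len(cases)
```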

Timestamp: [14:24-14:53]

🚜 The Nebraska Principle: Deep Domain Knowledge as Competitive Moat

YC funds numerous vertical AI and SaaS companies, but the panel emphasizes that true competitive advantage comes from intimate understanding of specific industries and workflows. This requires founders to literally sit side-by-side with domain experts to understand their real-world needs.

The process involves taking in-person interactions from places like Nebraska and codifying that knowledge into very specific evaluations. For example, understanding how a particular user wants outcomes handled when an invoice comes in and they need to decide whether to honor a tractor warranty.

This deep domain expertise addresses concerns about AI companies being mere "wrappers" by establishing defensible competitive advantages through superior understanding of specific user needs and workflows.

Timestamp: [14:59-16:10]

🎯 The Founder Profile: Technical Excellence Meets Domain Obsession

The panel describes the core competency required of successful AI startup founders today: maniacal obsession with the details of specific user workflows combined with technical sophistication. This represents a unique intersection of skills that creates significant barriers to entry.

The challenge lies in finding founders who are simultaneously great engineers and technologists while also understanding parts of the world that very few people understand. This creates a narrow but valuable opportunity space for those who can bridge both domains.

The example of Ryan Peterson from Flexport illustrates this principle perfectly - someone who understands software development but also became the third biggest importer of medical hot tubs for an entire year, giving him unique insights into logistics and import/export workflows.

Timestamp: [16:10-17:23]

💎 Key Insights

  • Long prompts require systematic documentation and iteration strategies, with Google Docs serving as effective tracking mechanisms for improvement opportunities
  • Thinking traces in Gemini 2.5 Pro provide critical debug information that was previously unavailable, now accessible through API integration
  • Large context windows enable REPL-style debugging where developers can watch reasoning traces in real time for immediate feedback
  • Evaluations, not prompts, represent the true intellectual property and competitive advantage for AI companies
  • Sustainable competitive moats require deep domain expertise gained through direct interaction with end users in their actual work environments
  • Successful AI startup founders need the rare combination of technical excellence and obsessive understanding of specific industry workflows
  • The "weirder" the domain knowledge a technical founder possesses, the greater the startup opportunity potential

Timestamp: [12:12-17:23]

📚 References

People:

  • Eric Bacon - YC's head of data who has helped with meta-prompting and using Gemini Pro 2.5 as a REPL
  • Ryan Peterson - Founder of Flexport, example of technical founder with deep domain expertise (was third biggest importer of medical hot tubs)

Companies:

  • Flexport - Logistics company founded by Ryan Peterson, example of technical founder with unique domain knowledge
  • Parahelp - AI customer support company mentioned for their approach to prompts vs. evals as intellectual property

Technologies:

  • Gemini Pro 2.5 - Google's large language model with long context windows and thinking traces
  • Thinking traces - Debug information showing AI model reasoning process, recently added to Gemini API

Concepts:

  • REPL (Read-Eval-Print Loop) - Interactive debugging approach enabled by large context windows
  • Evals - Evaluation systems that represent the true crown jewels of AI companies
  • Nebraska Principle - The idea that competitive advantage comes from deep understanding of specific geographic/industry domains

Timestamp: [12:12-17:23]

🚀 The Forward Deployed Engineer: Palantir's Revolutionary Approach

Garry explains how the concept of Forward Deployed Engineers (FDE) originated at Palantir and why it's become the essential model for AI startup founders today. The term traces back to Palantir's core recognition that Fortune 500 companies and government agencies lacked technologists who truly understood computer science at the highest level.

Palantir's founders - Peter Thiel, Alex Karp, Stephen Cohen, Joe Lonsdale, and Nathan Gettings - identified that these organizations faced multi-billion and sometimes trillion-dollar problems but had no one in the room who could apply cutting-edge technology to solve them.

Before AI became mainstream, these organizations were drowning in data - giant databases of people, things, and transactions - with no idea how to extract value. Palantir's insight was to deploy the world's best technologists directly into these environments to build software that could make sense of petabytes of data and find needles in haystacks.

Timestamp: [17:29-19:26]

👮 Inside the FBI: Turning File Cabinets into Software

The Forward Deployed Engineer role involved literally sitting next to FBI agents investigating domestic terrorism, observing their actual workflows in their offices. This immersive approach revealed the stark reality of how critical work was being done with primitive tools.

Forward deployed engineers would observe these file cabinet and fax machine workflows, then convert them into clean, powerful software. The goal was ambitious: make investigation work at three-letter agencies as easy as posting a photo to Instagram.

This direct observation and rapid iteration approach has proven so effective that many former Palantir Forward Deployed Engineers have become some of the most successful founders in YC's current portfolio.

Timestamp: [19:33-20:36]

🔨 Engineers vs. Salespeople: A Fundamental Difference in Approach

Palantir's breakthrough came from sending engineers instead of traditional salespeople to engage with clients. While other companies deployed relationship-focused sales teams with lengthy sales cycles, Palantir revolutionized the process by putting technical builders directly in front of decision-makers.

Traditional enterprise sales involved charismatic salespeople with "hair and teeth" taking clients to steakhouses, building relationships over months or years, trying to secure seven-figure contracts through personality and promises. The timeline could stretch from 6 weeks to 5 years, and often the software would never actually work as promised.

Palantir's approach was radically different: put an engineer in the room with Palantir Foundry (their core data visualization and mining suite), and instead of the next meeting being about reviewing contracts or specifications, it would be about demonstrating working software.

Timestamp: [20:47-22:03]

🥊 David vs. Goliath: How Engineers Beat Enterprise Giants

The Forward Deployed Engineer model provides a blueprint for how small startups can compete against enterprise giants like Salesforce, Oracle, and Booz Allen. The key isn't trying to out-sell the big companies with their fancy offices and strong handshakes - it's showing something revolutionary.

Success requires engineers who can combine technical excellence with deep empathy and design thinking. The goal is to create software so powerful that when clients see something that makes them feel truly understood, they want to buy it immediately.

The strategy works because it cuts through traditional enterprise sales friction with demonstrable value. Instead of lengthy relationship-building cycles, engineers can create immediate "wow" moments that translate directly into business outcomes.

This approach represents the biggest opportunity for startup founders today - the ability to compete not on sales process but on actual problem-solving capability delivered through superior technology.

Timestamp: [22:09-22:47]

👥 Founders as Forward Deployed Engineers: The Non-Delegatable Advantage

The panel emphasizes that founders must personally embody the Forward Deployed Engineer role - this critical function cannot be outsourced or delegated. Technical founders need to become the ethnographer, designer, and product person all in one.

The goal is to create such a compelling demonstration in the second meeting that prospects have never seen anything like it. This requires founders to personally observe user workflows, understand pain points, and rapidly build solutions that address real needs.

This hands-on approach ensures that founders maintain direct connection to user needs and can iterate quickly based on real-world feedback. It's the difference between building what you think users want versus building what they actually need based on firsthand observation.

The Forward Deployed Engineer mindset transforms founders from distant product managers into embedded problem-solvers who understand their users' worlds better than anyone else.

Timestamp: [22:47-23:12]

💎 Key Insights

  • Forward Deployed Engineers originated at Palantir to bridge the gap between world-class technologists and organizations with trillion-dollar problems
  • The most critical work in major institutions often relies on primitive tools like Word documents and Excel spreadsheets, creating massive optimization opportunities
  • Sending engineers instead of traditional salespeople fundamentally changes the sales process from relationship-building to value demonstration
  • Small startups can beat enterprise giants by showing revolutionary capabilities rather than competing on traditional sales processes
  • Founders must personally serve as their company's Forward Deployed Engineers - this role cannot be delegated or outsourced
  • Success requires combining technical excellence with empathy, design thinking, and ethnographic observation skills
  • The goal is creating immediate "wow" moments where prospects say "take my money" after seeing demonstrations built from direct user observation

Timestamp: [17:29-23:12]

📚 References

People:

  • Peter Thiel - Co-founder of Palantir
  • Alex Karp - Co-founder of Palantir
  • Stephen Cohen - Co-founder of Palantir
  • Joe Lonsdale - Co-founder of Palantir
  • Nathan Gettings - Co-founder of Palantir

Companies:

  • Palantir - Data analytics company that pioneered the Forward Deployed Engineer model
  • Salesforce - Enterprise software company mentioned as competition
  • Oracle - Enterprise database company mentioned as competition
  • Booz Allen - Consulting firm mentioned as competition
  • Meta (formerly Facebook) - Social media company referenced for comparison
  • Google - Technology company referenced for comparison

Technologies/Products:

  • Palantir Foundry - Palantir's core data visualization and data mining suite

Concepts:

  • Forward Deployed Engineer (FDE) - Palantir's model of embedding engineers directly with clients
  • Data mining - What machine learning was called before AI became mainstream

Timestamp: [17:29-23:12]

🤖 Vertical AI Agents: The FDE Model Accelerated

Vertical AI agents are successfully leveraging the Forward Deployed Engineer model to close unprecedented deals with large enterprises. The combination of FDE methodology with AI capabilities creates a powerful acceleration effect that's transforming enterprise sales cycles.

These companies can meet with end buyers and champions at big enterprises, capture that context, and immediately integrate it into their prompts. What previously required teams of engineers and longer development cycles can now be accomplished by just two founders working rapidly.

The speed advantage is remarkable - while Palantir might have taken longer with a full engineering team, AI-enabled startups can iterate overnight and return with working demonstrations. This has enabled them to close six and seven-figure deals with large enterprises, something that was previously impossible for small teams.

This model represents a fundamental shift in how enterprise software can be built and sold, with AI serving as a force multiplier for the Forward Deployed Engineer approach.

Timestamp: [23:19-24:03]

🎤 Giga ML: Engineering Excellence Meets Forward Deployment

Giga ML exemplifies how talented software engineers can succeed by forcing themselves into the Forward Deployed Engineer role, even when they're not natural salespeople. The company specializes in customer support, particularly voice support, and has closed significant deals through technical demonstration rather than traditional sales approaches.

The founders physically go on-site following the Palantir model - after closing deals, they sit with customer support teams to continuously tune and optimize their LLM performance. However, their real innovation comes in the demo phase, where they win deals through superior technical capabilities.

Their competitive advantage lies in RAG pipeline innovations that enable voice responses to be both accurate and extremely low latency - a technically challenging combination that creates impressive demonstrations. This technical differentiation allows them to win against incumbents in ways that weren't possible before LLMs.

Timestamp: [24:03-25:04]

⚡ The Demo Differentiation Advantage

The current LLM landscape has created unprecedented opportunities for technical differentiation in the demo phase of enterprise sales. Previously, it was nearly impossible to beat established players like Salesforce with incremental improvements to CRM interfaces or user experience.

Now, because AI technology evolves rapidly and achieving the last 5-10% of performance is extremely difficult, Forward Deployed Engineers can create dramatic competitive advantages. The process involves meeting with prospects, rapidly tweaking systems for their specific needs, and returning with demonstrations that create "wow" moments.

This represents a fundamental shift where technical excellence in AI implementation can directly translate to sales success, bypassing traditional enterprise sales processes through superior product demonstration.

Timestamp: [25:04-25:35]

📞 Happy Robot: Seven-Figure Success in Logistics

Happy Robot demonstrates the scalability of the Forward Deployed Engineer model with AI voice agents in the logistics industry. They've achieved remarkable success by selling seven-figure contracts to the top three largest logistics brokers in the world, showcasing how quickly this approach can scale to major enterprise deals.

The company builds AI voice agents specifically for logistics brokers and follows the FDE model by engaging directly with CIOs and decision-makers. Their success stems from rapid product iteration and extremely quick turnaround times based on direct customer feedback.

Their trajectory demonstrates the acceleration possible with this model - starting with six-figure deals and progressing to seven-figure contracts within just a couple of months. This rapid scaling illustrates how sophisticated prompt engineering combined with the FDE approach can compress traditional enterprise sales timelines dramatically.

Timestamp: [25:35-26:11]

🎭 The Personalities of Different LLMs

Each large language model exhibits distinct personalities and characteristics that make them suitable for different types of tasks and interactions. Understanding these personalities has become crucial for founders who need to select the right model for specific use cases.

Claude is recognized as the more approachable and human-steerable model, making it easier to work with for applications requiring nuanced human-like interactions. Its personality lends itself well to tasks where empathy and natural communication are important.

In contrast, Llama models require significantly more steering and feel more like interacting with a developer. This could be an artifact of having less Reinforcement Learning from Human Feedback (RLHF) training, making them more challenging to work with but potentially more powerful for users skilled in advanced prompting techniques.

This personality understanding helps founders choose the right model for their specific applications and user interactions, similar to selecting different team members for different types of projects.

Timestamp: [26:11-27:06]

📊 LLMs for Investment Decision Scoring

YC has begun using LLMs internally to help founders evaluate potential investors using structured scoring rubrics. This represents a practical application of AI for high-stakes decision-making where clear, quantifiable guidance is essential.

The system uses a straightforward 0-100 scoring rubric where 0 represents "never ever take their money" and 100 means "take their money right away - they help you so much that you'd be crazy not to take their money." Anchoring both ends of the scale this explicitly helps founders make critical funding decisions with more objective analysis.

The approach demonstrates how LLMs can be applied to complex business decisions that traditionally relied on intuition and experience. By systematizing investor evaluation, YC can help founders make more informed decisions about whose money to accept - a choice that can significantly impact startup trajectories.

This application showcases the versatility of LLMs beyond customer-facing applications, extending into internal business operations and strategic decision-making processes.
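
A sketch of what such a scoring prompt might look like is shown below: the 0 and 100 anchors paraphrase the episode, while the criteria in between are invented for illustration and the investor notes would be appended after the preamble.

```python
# Sketch of a 0-100 scoring prompt in the spirit of the rubric described above.
# The 0 and 100 anchors paraphrase the episode; the criteria in between are
# invented for illustration. Append the investor notes after this preamble.
INVESTOR_RUBRIC_PROMPT = """\
Score this investor from 0 to 100 for the founder described below.
- 0   means: never ever take their money.
- 100 means: take their money right away; they help you so much you'd be crazy not to.
Apply the rubric, and explain any judgment calls you make:
- Responsiveness: replies quickly, never ghosts founders.
- Track record: has materially helped comparable portfolio companies.
- References: what founders who worked with them say.

Return a JSON object: {"score": <0-100>, "rationale": "<2-3 sentences>"}
"""
```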

Timestamp: [27:06-27:23]

💎 Key Insights

  • Vertical AI agents can compress enterprise sales cycles from months to days by integrating customer context directly into prompts
  • The combination of Forward Deployed Engineer methodology with AI creates unprecedented acceleration in enterprise deal-making
  • Technical differentiation in AI demos can now overcome traditional enterprise sales advantages that incumbents previously held
  • Engineers without natural sales skills can succeed by forcing themselves into Forward Deployed Engineer roles and leading with technical excellence
  • Different LLMs have distinct personalities requiring different interaction approaches - Claude is more human-steerable while Llama needs more technical steering
  • Startups are achieving six to seven-figure enterprise deals within months using FDE+AI methodology
  • LLMs can be effectively applied to internal business decisions like investor evaluation using structured scoring rubrics

Timestamp: [23:19-27:23]

📚 References

Companies:

  • Giga ML - Customer support company specializing in voice support that closed deals with Zepto using FDE model
  • Zepto - Company that became a customer of Giga ML
  • Happy Robot - AI voice agent company that sold seven-figure contracts to top logistics brokers
  • Salesforce - Enterprise CRM company mentioned as incumbent competition

AI Models:

  • Claude - LLM described as more approachable and human-steerable
  • Llama - LLM that requires more steering and feels like talking to a developer

Technical Concepts:

  • RAG pipeline - Retrieval Augmented Generation pipeline that Giga ML innovated for voice responses
  • RLHF (Reinforcement Learning from Human Feedback) - Training process that affects LLM personality and steerability

Industries:

  • Logistics brokers - Industry where Happy Robot achieved seven-figure contracts with top three largest companies
  • Customer support - Industry focus for Giga ML's voice support solutions

Timestamp: [23:19-27:23]

📊 Rubrics: The Foundation for Numerical Scoring

Best practices for LLM prompting include providing detailed rubrics, especially when seeking numerical outputs. Rubrics help models understand how to evaluate and differentiate between score levels, such as distinguishing between an 80 versus a 90 rating.

However, rubrics are inherently imperfect tools with inevitable exceptions and edge cases. The key insight is that different models handle these limitations in dramatically different ways, revealing distinct approaches to following guidelines versus exercising judgment.

This fundamental tension between structured guidance and flexible judgment becomes particularly important when evaluating complex, nuanced scenarios where strict adherence to rules might miss important contextual factors.

The effectiveness of rubric-based prompting depends heavily on understanding how different models interpret and apply scoring frameworks.

Timestamp: [27:29-27:53]

⚖️ Model Personalities: Soldier vs. High-Agency Employee

Comparing O3 and Gemini 2.5 Pro reveals fascinating differences in how models approach rubric interpretation. These differences reflect distinct "personalities" that affect their suitability for different types of evaluation tasks.

O3 demonstrates rigid adherence to provided rubrics, functioning like a disciplined soldier who follows instructions precisely. It heavily penalizes anything that doesn't fit the established criteria, prioritizing consistency and strict rule-following over contextual interpretation.

In contrast, Gemini 2.5 Pro exhibits flexibility and judgment, behaving more like a high-agency employee. While it applies the rubric as guidance, it can reason through exceptions and adjust scores based on contextual factors that the rubric might not fully capture.

Timestamp: [27:53-28:57]

💰 Real-World Investor Evaluation Examples

The investor scoring system reveals how LLM personality differences play out in practical applications. Some investors clearly merit immediate acceptance based on their exceptional processes and track records.

Top-tier firms like Benchmark and Thrive represent the high end of the scale - investors whose processes are so polished that founders should "take their money right away." These investors never ghost potential portfolio companies, respond to emails faster than most founders, and maintain consistently impressive operational standards.

However, many situations involve nuanced judgment calls where exceptional investors might have operational weaknesses. Some investors have outstanding track records and genuinely care about their portfolio companies but struggle with time management, leading to slow responses and accidental ghosting despite good intentions.

These scenarios demonstrate exactly why LLMs are valuable - they can process complex, contradictory signals and provide nuanced scores (like 91 instead of 89) based on comprehensive evaluation rather than simple rule application.

Timestamp: [29:03-29:49]

🧠 The Art of Communication: Managing AI Like People

The hosts reflect on prompt engineering as fundamentally similar to managing and communicating with people. The process requires clearly conveying information needed for good decision-making and establishing transparent evaluation criteria.

This management analogy extends to ensuring AI systems understand how they'll be evaluated and scored, similar to setting clear expectations for human employees. The communication challenge involves translating complex requirements into clear, actionable guidance that enables consistent performance.

The parallel between AI prompting and people management suggests that many traditional management and communication skills translate directly to working with AI systems, making this a more intuitive process than purely technical approaches might suggest.

Timestamp: [30:15-30:37]

🏭 Kaizen: The Meta-Prompting Manufacturing Philosophy

The podcast concludes with a powerful analogy connecting meta-prompting to Kaizen, the Japanese manufacturing philosophy that revolutionized car production in the 1990s. This principle holds that the people actually doing the work are best positioned to improve the process.

Just as Japanese automakers achieved superior quality by empowering front-line workers to continuously refine manufacturing processes, meta-prompting enables AI systems to improve their own performance through iterative refinement. The practitioners using the prompts daily are positioned to identify improvements and optimize performance.

This philosophical framework positions meta-prompting not as a technical hack but as a systematic approach to continuous improvement that mirrors proven manufacturing excellence methodologies. It suggests that the future of AI development lies in creating systems that can self-improve through structured feedback loops.

Timestamp: [30:37-30:58]

🎬 Conclusion: A Brave New World

The hosts wrap up by acknowledging they're operating in uncharted territory - a "brave new world" where the tools and best practices are still emerging. Despite being early in this frontier, the principles and techniques discussed provide a foundation for the rapidly evolving field of prompt engineering.

The episode concludes with encouragement for listeners to experiment with these concepts and develop their own prompting innovations, recognizing that the community is collectively building the knowledge base for this new discipline.

Timestamp: [30:58-31:04]

💎 Key Insights

  • Rubrics are essential for numerical scoring but must account for exceptions and edge cases that strict rule-following might miss
  • Different AI models exhibit distinct personalities: O3 acts like a rigid soldier while Gemini 2.5 Pro behaves like a high-agency employee with judgment
  • Real-world evaluation scenarios often require nuanced scoring that balances multiple contradictory factors
  • Prompt engineering resembles people management more than pure programming, requiring clear communication and expectation-setting
  • Meta-prompting follows Kaizen principles where practitioners are best positioned to improve the processes they use daily
  • The field is still in its early stages, comparable to coding in 1995, with tools and best practices rapidly evolving
  • Success requires experimentation and community knowledge-building as the discipline continues to mature

Timestamp: [27:29-31:04]

📚 References

AI Models:

  • O3 - OpenAI model that demonstrates rigid adherence to rubrics, described as soldier-like
  • Gemini 2.5 Pro - Google model that shows flexibility and judgment in applying rubrics

Investment Firms:

  • Benchmark - Venture capital firm cited as example of top-tier investor with impeccable process
  • Thrive - Investment firm mentioned alongside Benchmark as having exceptional operational standards

Business Concepts:

  • Kaizen - Japanese manufacturing philosophy emphasizing continuous improvement by front-line workers
  • Rubrics - Scoring frameworks used to guide LLM evaluation and decision-making

Historical Context:

  • Japanese car manufacturing in the 1990s - Example of how Kaizen principles led to superior automotive quality

Timestamp: [27:29-31:04]