We analyzed 47 Y Combinator videos featuring Andrej Karpathy, Sam Altman, Dario Amodei, and other AI leaders to extract practical guidance on which models to use and when.
The AI model landscape has exploded. In Y Combinator interviews, founders and investors consistently emphasize one thing: the model you choose matters less than how you use it. Sam Altman put it directly: "The models are converging in capability. What matters is your application layer."
That said, meaningful differences still exist. Here's the current state:
| Model | Best For | Weakness | Context (tokens) |
|---|---|---|---|
| Claude 3.5 Sonnet | Coding, extended reasoning, writing | Tool use ecosystem | 200K |
| GPT-4o | Multimodal, real-time, tool use | Reasoning depth | 128K |
| Gemini 1.5 Pro | Long context, video, Google ecosystem | Consistency | 1M+ |
| o1 | Complex reasoning, math, science | Speed, cost | 128K |
| DeepSeek R1 | Reasoning at lower cost | Ecosystem, support | 64K |
| Llama 3.1 405B | Self-hosting, privacy, customization | Infrastructure needs | 128K |
YC Partner Insight
From YC's AI talks: "Model capability is table stakes now. The winners will be those who understand their users deeply and build the right application layer on top."
In his YC talk, Andrej Karpathy laid out a framework for understanding LLMs that every founder should internalize.
Software 1.0 - traditional programming. You write explicit rules. Deterministic and predictable, but limited to problems you can specify completely.
Software 2.0 - machine learning. You provide data and an architecture, and the model learns the program. Great for pattern recognition, but narrow.
Software 3.0 - natural language programming. You describe what you want in English, and the model draws on what it "knows" from training on human knowledge. General purpose, but probabilistic.
Karpathy's key insight: LLMs are not deterministic computers. They're "vibes-based" systems. You need to treat them like you would a new hire - give them examples, iterate on instructions, and verify their work.
The Psychology of LLMs
Karpathy emphasizes that LLMs have "psychology" - they respond to social pressure in prompts, they try to please, and they sometimes hallucinate when uncertain. Understanding this is key to using them effectively.
Based on YC interviews with founders and AI leaders, here's where each model excels in practice.
Dario and Daniela Amodei's interviews on YC reveal Claude's design philosophy: safety through understanding, not restrictions. The model is trained to be genuinely helpful while avoiding harm.
Sam Altman's YC interviews emphasize OpenAI's focus on developer experience and ecosystem. GPT-4's strength is the breadth of integrations and tooling around it.
Gemini comes up less frequently in YC talks, but its 1M+ token context window makes it uniquely powerful for specific use cases.
Cursor CEO's take
In his YC interview, Cursor's CEO explains their model switching: "We use Claude for heavy lifting - the actual code generation. But the model choice matters less than the context you give it. Most of the intelligence is in how you construct the prompt."
2024-2025 saw the rise of "reasoning models" - LLMs that explicitly think through problems step-by-step before answering.
From YC's analysis of DeepSeek: "The engineering innovations are real." DeepSeek achieved similar results to frontier models with significantly less compute, using techniques like:
- FP8 training: 8-bit floating point instead of 16-bit. 2x memory efficiency, enabling larger batches and faster training.
- Mixture-of-experts (MoE): only 37B parameters active per inference pass despite 671B total. Dramatically reduces inference cost.
- Multi-head latent attention (MLA): compresses the KV cache for faster inference without quality loss.
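To make those techniques concrete, here's a quick back-of-envelope calculation. The parameter counts are DeepSeek's published figures; the rest is our own illustrative arithmetic, not a benchmark:

```python
# Back-of-envelope arithmetic on the efficiency techniques above.
total_params = 671e9   # total parameters (DeepSeek's published figure)
active_params = 37e9   # parameters activated per token via MoE routing

# FP8 halves memory per weight relative to FP16/BF16.
fp16_gb = total_params * 2 / 1e9
fp8_gb = total_params * 1 / 1e9
print(f"Weights: {fp16_gb:.0f} GB in FP16 vs {fp8_gb:.0f} GB in FP8")

# Sparse activation is what cuts per-token inference cost.
print(f"Active fraction per token: {active_params / total_params:.1%}")
```

Only about 5.5% of the network does work on any given token, which is why inference ends up so much cheaper than the headline parameter count suggests.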
YC's dedicated scaling laws episode breaks down why AI capabilities keep improving predictably.
Loss = A × (Compute)^(-0.050) × (Data)^(-0.095) × (Parameters)^(-0.076)
In plain English: model performance improves predictably as you increase compute, data, or parameters (the exponents above come from Kaplan et al.'s original scaling-law fits). The relationship is a power law: you need roughly 10x more resources for each incremental improvement in loss.
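As a quick sanity check on the 10x rule, here's the arithmetic using the compute exponent from the formula above:

```python
# What does 10x more compute buy, holding data and parameters fixed?
compute_exponent = 0.050

loss_ratio = 10 ** -compute_exponent   # loss multiplier after 10x compute
print(f"10x compute -> loss x {loss_ratio:.3f} (~{1 - loss_ratio:.0%} lower)")
```

An ~11% loss reduction sounds small, but in practice small improvements in loss tend to compound into noticeable capability gains.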
No ceiling in sight. GPT-5, Claude 4, etc. will be meaningfully more capable than current models. Build for this - don't over-engineer around current limitations.
Every 18 months, the same capability gets 10x cheaper. What costs $1 today will cost $0.10 in 18 months. Price accordingly.
If your product is just "GPT-4 + a wrapper," you have no moat. The defensibility comes from data, distribution, and user workflows - not model access.
The GPT Wrapper Myth
YC's analysis shows that "GPT wrapper" companies CAN build real businesses. The key is building something that gets better with use - whether through proprietary data, user feedback loops, or workflow integration that creates switching costs.
From YC's prompting masterclass, here are the techniques that actually move the needle.
Don't say "return JSON." Say "Return a JSON object with keys: name (string), score (integer 0-100), reasoning (string, 2-3 sentences)." The more specific, the more reliable.
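For example, an output spec like this (illustrative wording, not taken from the masterclass) is far more reliable than "return JSON":

```
Return a JSON object with exactly these keys:
- name: string
- score: integer from 0 to 100
- reasoning: string, 2-3 sentences

Output only the JSON object - no markdown fences, no commentary.
```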
Show 2-3 examples of the exact input/output format you want. This works better than any amount of explanation for most tasks.
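A minimal few-shot prompt might look like this (a hypothetical sentiment-labeling task):

```
Classify each review as POSITIVE, NEGATIVE, or MIXED.

Review: "Shipping was fast and it works great."
Sentiment: POSITIVE

Review: "Broke after two days. Waste of money."
Sentiment: NEGATIVE

Review: "Love the design, but battery life is disappointing."
Sentiment: MIXED

Review: "Setup took an hour, but it's been flawless since."
Sentiment:
```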
For complex tasks, explicitly ask the model to "think step by step" or "explain your reasoning before giving the final answer." This dramatically improves accuracy on multi-step problems.
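In practice, that can be as simple as appending something like this to your prompt (illustrative wording):

```
Think step by step. Work through the problem before answering,
then put your conclusion on a final line that starts with "ANSWER:".
```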
Tell the model what to do when it's uncertain: "If you're not sure, respond with 'UNSURE: ' followed by your best guess and why you're uncertain."
The Temperature Setting
For deterministic tasks (extraction, classification), use temperature=0. For creative tasks (writing, brainstorming), use 0.7-1.0. Most startups should default to temperature=0 and only increase when they want variety.
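Here's what that looks like in code - a minimal sketch using the OpenAI Python SDK, with illustrative model choices:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Deterministic task (extraction): temperature=0 for repeatable output
extraction = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user",
               "content": "Extract the company name: 'Acme Corp raised $5M.'"}],
)

# Creative task (brainstorming): higher temperature for variety
ideas = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.9,
    messages=[{"role": "user",
               "content": "Brainstorm five taglines for a coffee startup."}],
)

print(extraction.choices[0].message.content)
print(ideas.choices[0].message.content)
```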
YC interviews increasingly discuss when to use open source models like Llama, Mistral, or DeepSeek.
The practical advice from YC founders: Start with APIs (Claude or GPT), validate your product, then consider open source for specific high-volume use cases. Don't prematurely optimize for cost.
Based on patterns across 47 YC videos, here's a practical decision tree.
For coding: use Claude via Cursor or Claude Code. Multiple YC founders cite Claude as their primary coding assistant.
Fallback: GPT-4 if you need specific integrations or tool use.
For complex reasoning, math, or science: use o1 or DeepSeek R1. The extended thinking time is worth it for problems that require multi-step reasoning.
Cost tip: Start with DeepSeek R1 for testing, use o1 for production if quality matters.
For agents and tool use: use GPT-4. The function calling and tool use ecosystem is the most mature. Reliability matters more than marginal capability differences.
Consider: Claude for heavy lifting, GPT-4 for orchestration.
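For reference, the tool-use pattern looks roughly like this with the OpenAI SDK; the get_weather tool is a hypothetical example:

```python
from openai import OpenAI

client = OpenAI()

# Declare a tool the model is allowed to call; the JSON schema is
# what makes function calling reliable.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# The model returns a structured tool call instead of prose; your
# orchestration code executes it and feeds the result back.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)
```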
For long documents or entire codebases: use Gemini 1.5 Pro. The 1M+ token context window is unmatched for stuffing entire codebases into context.
Alternative: Claude 200K is often sufficient and more consistent.
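The naive version of "stuff the codebase into context" is just concatenation with file markers - a sketch under that assumption; real tools add filtering and ranking:

```python
from pathlib import Path

def build_codebase_prompt(root: str, question: str, exts=(".py", ".md")) -> str:
    """Concatenate source files into one prompt, with path markers
    so the model can cite where things live."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"=== {path} ===\n{path.read_text(errors='ignore')}")
    context = "\n\n".join(parts)
    return f"{context}\n\nQuestion about this codebase: {question}"

prompt = build_codebase_prompt("./my_project", "Where is auth handled?")
print(f"~{len(prompt) // 4} tokens")  # rough 4-chars-per-token estimate
```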
For voice applications: use GPT-4o. Native voice mode is still ahead. Combine with LiveKit for production voice apps.
Cost tip: Use speech-to-text pipeline instead of real-time mode for significant savings.
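The pipeline version is transcribe, reason, then synthesize - a minimal sketch assuming OpenAI's Whisper and TTS endpoints, with illustrative model choices:

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text
with open("user_audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Text -> response (use whatever chat model fits your task)
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text -> speech
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("reply.mp3")
```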
For high-volume, simple tasks: use Claude Haiku or GPT-3.5. Don't overpay for capabilities you don't need. These are 10-50x cheaper than frontier models.
Rule: If Haiku works 95% of the time, use Haiku and handle edge cases separately.
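A sketch of that cascade, assuming Anthropic's Python SDK and model aliases; the UNSURE escape hatch ties back to the prompting section above:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(text: str) -> str:
    """Try the cheap model first; escalate only when it punts."""
    for model in ("claude-3-5-haiku-latest", "claude-3-5-sonnet-latest"):
        message = client.messages.create(
            model=model,
            max_tokens=100,
            messages=[{"role": "user",
                       "content": "Classify this support ticket as BILLING, "
                                  f"BUG, or OTHER. If unsure, say UNSURE.\n\n{text}"}],
        )
        answer = message.content[0].text.strip()
        if not answer.startswith("UNSURE"):
            return answer  # the cheap model handled it
    return answer  # escalated result, even if still unsure
```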
Taffy lets you analyze transcripts and comments from any YouTube channel. Find out what AI tools founders are discussing, what problems they're solving, and what's working.
Should I use Claude or GPT-4?
It depends on your use case. Claude excels at extended thinking, coding, and complex reasoning. GPT-4 has stronger tool use and real-time capabilities. For most startups, the recommendation is: use Claude for heavy cognitive tasks, GPT-4 for production apps with many integrations.
What are scaling laws and why do they matter?
Scaling laws predict model performance based on compute, data, and parameters. They matter because they show capabilities improve predictably with scale. This helps you plan which features to build now vs. wait for future models to enable.
When should I use open source models?
Start with APIs (Claude, GPT) for speed. Consider open source when you have: privacy requirements, high volume making APIs expensive, or need for fine-tuning. Don't prematurely optimize - validate your product first.
What are reasoning models and when should I use them?
Reasoning models like o1 and DeepSeek R1 use chain-of-thought to solve complex problems. Use them for math, coding challenges, and multi-step reasoning. They're slower and more expensive, so reserve them for tasks where accuracy matters more than speed.
How do I reduce LLM costs?
Use model cascading (start with cheap models, escalate if needed), clean your prompts to reduce tokens, use smaller models for simple tasks (Haiku, GPT-3.5), and implement caching for repeated queries. YC founders report 78%+ cost reductions with these techniques.
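Of these, caching is usually the quickest win. A minimal in-process sketch (production systems would reach for Redis or the providers' native prompt caching; call_llm stands in for your actual API call):

```python
import hashlib

_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    """Stable key for an exact-match response cache."""
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def cached_complete(model: str, prompt: str, call_llm) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:          # only pay for novel queries
        _cache[key] = call_llm(model, prompt)
    return _cache[key]
```

Exact-match caching alone can eliminate a surprising share of spend when many users ask the same questions.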