AI/ML Guide

AI Model Selection for Startups: Claude vs GPT vs Gemini

We analyzed 47 Y Combinator videos featuring Andrej Karpathy, Sam Altman, Dario Amodei, and other AI leaders to extract practical guidance on which models to use and when.

20 min read · Updated January 2025
[Image: visual comparison of Claude, GPT, and Gemini]
At a glance: 47 YC videos analyzed · 78% cost reduction possible · 6 frontier model providers · 10x context window growth in two years

The 2025 Model Landscape

The AI model landscape has exploded. In Y Combinator interviews, founders and investors consistently emphasize one thing: the model you choose matters less than how you use it. Sam Altman put it directly: "The models are converging in capability. What matters is your application layer."

That said, meaningful differences still exist. Here's the current state:

| Model | Best For | Weakness | Context Window |
|---|---|---|---|
| Claude 3.5 Sonnet | Coding, extended reasoning, writing | Tool use ecosystem | 200K |
| GPT-4o | Multimodal, real-time, tool use | Reasoning depth | 128K |
| Gemini 1.5 Pro | Long context, video, Google ecosystem | Consistency | 1M+ |
| o1 | Complex reasoning, math, science | Speed, cost | 128K |
| DeepSeek R1 | Reasoning at lower cost | Ecosystem, support | 64K |
| Llama 3.1 405B | Self-hosting, privacy, customization | Infrastructure needs | 128K |

YC Partner Insight

From YC's AI talks: "Model capability is table stakes now. The winners will be those who understand their users deeply and build the right application layer on top."

Karpathy's Software 3.0 Framework

In his YC talk, Andrej Karpathy laid out a framework for understanding LLMs that every founder should internalize.

Software 1.0: Explicit Code

Traditional programming. You write explicit rules. Deterministic, predictable, but limited to problems you can specify completely.

Software 2.0: Neural Networks

Machine learning. You provide data and architecture. The model learns the program. Great for pattern recognition but narrow.

Software 3.0: LLMs

Natural language programming. You describe what you want in English, and the model draws on what it "knows" from training on human knowledge. General purpose but probabilistic.
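To make the contrast concrete, here is the same (invented) spam-detection task expressed in each paradigm:

```python
# Software 1.0 - an explicit, deterministic rule written by hand:
def is_spam_v1(email: str) -> bool:
    return "free money" in email.lower()

# Software 2.0 - a learned classifier: you supply labeled data and an
# architecture; training produces the "program" (the weights). Sketched
# as an opaque callable you'd get back from a training run:
#   is_spam_v2 = train_classifier(labeled_emails)   # hypothetical helper

# Software 3.0 - the "program" is a natural-language prompt to an LLM:
SPAM_PROMPT = "Is the following email spam? Answer exactly YES or NO.\n\nEmail: {email}"
```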

Karpathy's key insight: LLMs are not deterministic computers. They're "vibes-based" systems. You need to treat them like you would a new hire - give them examples, iterate on instructions, and verify their work.

The Psychology of LLMs

Karpathy emphasizes that LLMs have "psychology" - they respond to social pressure in prompts, they try to please, and they sometimes hallucinate when uncertain. Understanding this is key to using them effectively.

Claude vs GPT vs Gemini: Practical Differences

Based on YC interviews with founders and AI leaders, here's where each model excels in practice.

Claude (Anthropic)

Dario Amodei's YC interviews reveal Claude's design philosophy: safety through understanding, not restrictions. The model is trained to be genuinely helpful while avoiding harm.

Strengths

  • Extended thinking for complex reasoning
  • Coding (particularly with Claude Code)
  • Long, nuanced writing
  • Following complex instructions
  • 200K context with strong recall

Best Use Cases

  • AI-assisted coding (Cursor, Claude Code)
  • Document analysis
  • Technical writing
  • Research synthesis

GPT-4/4o (OpenAI)

Sam Altman's YC interviews emphasize OpenAI's focus on developer experience and ecosystem. GPT-4's strength is the breadth of integrations and tooling around it.

Strengths

  • Best-in-class tool use and function calling
  • Real-time voice (GPT-4o)
  • Vision capabilities
  • Massive ecosystem
  • Consistent API reliability

Best Use Cases

  • Production apps with complex workflows
  • Voice assistants
  • Multi-modal applications
  • Agent systems with many tools

Gemini (Google)

Gemini comes up less often in YC talks, but its 1M+ token context window makes it uniquely powerful for specific use cases.

Strengths

  • Massive context window (1M+ tokens)
  • Native video understanding
  • Google ecosystem integration
  • Competitive pricing

Best Use Cases

  • Entire codebases in context
  • Long video analysis
  • Google Workspace integration
  • Long document processing

Cursor CEO's take

In his YC interview, Cursor's CEO explains their model switching: "We use Claude for heavy lifting - the actual code generation. But the model choice matters less than the context you give it. Most of the intelligence is in how you construct the prompt."

Reasoning Models: o1 vs DeepSeek R1

2024-2025 saw the rise of "reasoning models" - LLMs that explicitly think through problems step-by-step before answering.

OpenAI o1

  • Proprietary chain-of-thought reasoning
  • Excels at math, coding, science
  • Slower but more accurate for hard problems
  • Premium pricing ($15/1M input, $60/1M output)

DeepSeek R1

  • Open weights (you can run it locally)
  • Comparable reasoning to o1
  • Much lower cost
  • Chinese company - consider data residency

From YC's analysis of DeepSeek: "The engineering innovations are real." DeepSeek achieved similar results to frontier models with significantly less compute, using techniques like:

FP8 Training

8-bit floating point instead of 16-bit. 2x memory efficiency, enabling larger batches and faster training.

Mixture of Experts (MoE)

Only ~37B of the 671B total parameters are active per token, dramatically reducing inference cost (see the toy routing sketch after this list).

Multi-head Latent Attention

Compresses KV cache for faster inference without quality loss.
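
Of these, Mixture of Experts is the easiest to see in code. Below is a toy NumPy routing sketch of the idea, not DeepSeek's actual architecture; the expert count and dimensions are tiny placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 8, 2, 16   # 8 experts, only 2 run per token

experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D, N_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token to its top-k experts; only those experts compute."""
    logits = x @ router                       # (N_EXPERTS,) routing scores
    top = np.argsort(logits)[-TOP_K:]         # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the winners only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D)
print(moe_forward(token).shape)   # (16,) - computed with 2 of 8 expert FFNs
```

This is why a 671B-parameter MoE model can serve tokens at roughly the cost of a much smaller dense model: most parameters sit idle on any given forward pass.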

Scaling Laws: What Every Founder Should Know

YC's dedicated scaling laws episode breaks down why AI capabilities keep improving predictably.

The Core Equation

Loss ∝ Compute^(-0.050), Loss ∝ Data^(-0.095), Loss ∝ Parameters^(-0.076) (approximate empirical exponents; each law holds when the other resources aren't the bottleneck)

In plain English: model performance improves predictably as you scale compute, data, or parameters. These are power laws, not linear relationships - each further incremental improvement in loss costs roughly 10x more resources.
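
A quick numerical sketch of what that implies. The compute exponent is the published approximate value; the reference loss and compute units are made-up placeholders, not numbers from the talk.

```python
# Sketch of the compute power law above. ALPHA_C is the approximate
# published exponent; REF_LOSS and the compute units are invented.
ALPHA_C = 0.050        # approximate compute exponent
REF_LOSS = 3.0         # hypothetical loss at a reference compute budget

def predicted_loss(compute_multiple: float) -> float:
    """Loss under a pure compute power law: loss ∝ compute^(-alpha)."""
    return REF_LOSS * compute_multiple ** -ALPHA_C

for scale in (1, 10, 100, 1000):
    print(f"{scale:>5}x compute -> predicted loss {predicted_loss(scale):.3f}")
# Each extra 10x of compute multiplies loss by the same factor
# (10^-0.05 ≈ 0.89) - hence "10x more resources per incremental improvement".
```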

1. Models Will Keep Getting Better

No ceiling in sight. GPT-5, Claude 4, etc. will be meaningfully more capable than current models. Build for this - don't over-engineer around current limitations.

2. Inference Costs Will Drop

Every 18 months, the same capability gets 10x cheaper. What costs $1 today will cost $0.10 in 18 months. Price accordingly.
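
As a back-of-envelope check on that claim (the 18-month/10x trend is the article's; the $1.00 starting price is a placeholder):

```python
def projected_cost(cost_today: float, months: float) -> float:
    """Cost after `months`, assuming capability gets 10x cheaper every 18 months."""
    return cost_today * 0.1 ** (months / 18)

for m in (0, 9, 18, 36):
    print(f"month {m:>2}: ${projected_cost(1.00, m):.3f}")
# month 18 -> $0.100, month 36 -> $0.010 for what costs $1.00 today.
```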

3. The Moat Is Not the Model

If your product is just "GPT-4 + a wrapper," you have no moat. The defensibility comes from data, distribution, and user workflows - not model access.

The GPT Wrapper Myth

YC's analysis shows that "GPT wrapper" companies CAN build real businesses. The key is building something that gets better with use - whether through proprietary data, user feedback loops, or workflow integration that creates switching costs.

State-of-the-Art Prompting for AI Agents

From YC's prompting masterclass, here are the techniques that actually move the needle.

1. Be Specific About Format

Don't say "return JSON." Say "Return a JSON object with keys: name (string), score (integer 0-100), reasoning (string, 2-3 sentences)." The more specific, the more reliable.
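
A minimal sketch of this technique with an invented support-ticket task; the prompt wording and validation are illustrative, not from the talk.

```python
import json

# Illustrative prompt: the exact keys, types, and ranges are spelled out.
PROMPT = """Evaluate the following support ticket.

Return ONLY a JSON object with exactly these keys:
- "name": string, the customer's name
- "score": integer from 0 to 100, urgency of the ticket
- "reasoning": string, 2-3 sentences explaining the score

Ticket: {ticket}"""

def parse_response(raw: str) -> dict:
    """Check the model actually honored the promised shape."""
    obj = json.loads(raw)
    assert isinstance(obj["name"], str)
    assert isinstance(obj["score"], int) and 0 <= obj["score"] <= 100
    assert isinstance(obj["reasoning"], str)
    return obj
```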

2. Give Examples (Few-Shot)

Show 2-3 examples of the exact input/output format you want. This works better than any amount of explanation for most tasks.
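
For example, a sentiment task expressed as a few-shot prompt (the reviews below are invented):

```python
FEW_SHOT = """Classify the sentiment of each review as POSITIVE, NEGATIVE, or MIXED.

Review: "Shipping was fast and the build quality is great."
Sentiment: POSITIVE

Review: "Broke after two days. Support never replied."
Sentiment: NEGATIVE

Review: "Love the screen, but the battery barely lasts a morning."
Sentiment: MIXED

Review: "{review}"
Sentiment:"""
```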

3. Use Chain-of-Thought

For complex tasks, explicitly ask the model to "think step by step" or "explain your reasoning before giving the final answer." This dramatically improves accuracy on multi-step problems.
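
One way to wire this up: append a reasoning instruction, then parse out only the final line (the wording and the "ANSWER:" convention are assumptions, not prescribed in the talk).

```python
COT_SUFFIX = (
    "\n\nThink step by step. Explain your reasoning first, then give the "
    "final answer on its own line, prefixed with 'ANSWER:'."
)

def extract_answer(raw: str) -> str:
    """Pull out the final line, ignoring the reasoning trace."""
    for line in reversed(raw.splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return raw.strip()   # fallback: model ignored the format
```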

4. Define Escape Hatches

Tell the model what to do when it's uncertain: "If you're not sure, respond with 'UNSURE: ' followed by your best guess and why you're uncertain."
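
The payoff is that your code can branch on the escape hatch instead of trusting a low-confidence guess; the routing below is an assumed pattern.

```python
def escalate_to_human(guess: str) -> str:
    """Placeholder: in a real app, queue the item for human review."""
    return f"[needs review] {guess}"

def handle(raw: str) -> str:
    if raw.startswith("UNSURE:"):
        # Low-confidence path: don't silently trust the guess.
        return escalate_to_human(raw.removeprefix("UNSURE:").strip())
    return raw
```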

The Temperature Setting

For deterministic tasks (extraction, classification), use temperature=0. For creative tasks (writing, brainstorming), use 0.7-1.0. Most startups should default to temperature=0 and only increase when they want variety.
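
Concretely, using the OpenAI Python SDK as one example (the same temperature knob exists in most chat APIs; model names were current as of this writing):

```python
from openai import OpenAI

client = OpenAI()

# Deterministic task (extraction/classification) -> temperature=0
extraction = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": "Extract the invoice total: ..."}],
)

# Creative task (brainstorming) -> temperature around 0.7-1.0
brainstorm = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.9,
    messages=[{"role": "user", "content": "Give me 10 taglines for ..."}],
)
```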

Open Source vs Closed Models

YC interviews increasingly discuss when to use open source models like Llama, Mistral, or DeepSeek.

Use Open Source When

  • Privacy/data residency is critical
  • You need to fine-tune for your domain
  • High volume makes API costs prohibitive
  • You need full control over the model
  • Latency requirements are extreme

Use Closed APIs When

  • Speed to market matters most
  • You're still iterating on product-market fit
  • Volume is low to moderate
  • You need the latest capabilities
  • You don't want to manage infrastructure

The practical advice from YC founders: Start with APIs (Claude or GPT), validate your product, then consider open source for specific high-volume use cases. Don't prematurely optimize for cost.

Decision Framework: Which Model to Use

Based on patterns across 47 YC videos, here's a practical decision tree.

For Coding/Development

Use Claude via Cursor or Claude Code. Multiple YC founders cite Claude as their primary coding assistant.

Fallback: GPT-4 if you need specific integrations or tool use.

For Complex Reasoning/Math

Use o1 or DeepSeek R1. The extended thinking time is worth it for problems that require multi-step reasoning.

Cost tip: Start with DeepSeek R1 for testing, use o1 for production if quality matters.

For Production Apps with Tools

Use GPT-4. The function calling and tool use ecosystem is most mature. Reliability matters more than marginal capability differences.

Consider: Claude for heavy lifting, GPT-4 for orchestration.

For Long Context (Entire Codebases)

Use Gemini 1.5 Pro. The 1M+ token context window is unmatched for stuffing entire codebases into context.

Alternative: Claude's 200K window is often sufficient and more consistent.

For Voice/Real-Time

Use GPT-4o. Native voice mode is still ahead. Combine with LiveKit for production voice apps.

Cost tip: Use a speech-to-text pipeline instead of real-time voice mode for significant savings.

For Simple Tasks (Classification, Extraction)

Use Claude Haiku or GPT-3.5. Don't overpay for capabilities you don't need. These are 10-50x cheaper than frontier models.

Rule: If Haiku works 95% of the time, use Haiku and handle edge cases separately.
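
A minimal cascade sketch of that rule using the Anthropic SDK. The model IDs were current at the time of writing, and the "UNSURE:" confidence signal is an assumption - in practice you might use log-probs or a validator, and your prompt must instruct the cheap model to emit that prefix.

```python
import anthropic

client = anthropic.Anthropic()

CHEAP, STRONG = "claude-3-5-haiku-20241022", "claude-3-5-sonnet-20241022"

def ask(prompt: str) -> str:
    """Try the cheap model first; escalate only if it signals low confidence."""
    text = ""
    for model in (CHEAP, STRONG):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        if not text.startswith("UNSURE:"):   # cheap model handled it - stop here
            return text
    return text   # best effort from the strong model
```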

Want to research AI and startup channels?

Taffy lets you analyze transcripts and comments from any YouTube channel. Find out what AI tools founders are discussing, what problems they're solving, and what's working.


Frequently Asked Questions

Should I use Claude or GPT for my startup?

It depends on your use case. Claude excels at extended thinking, coding, and complex reasoning. GPT-4 has stronger tool use and real-time capabilities. For most startups, the recommendation is: use Claude for heavy cognitive tasks, GPT-4 for production apps with many integrations.

What are scaling laws and why do they matter?

Scaling laws predict model performance based on compute, data, and parameters. They matter because they show capabilities improve predictably with scale. This helps you plan which features to build now vs. wait for future models to enable.

Should startups use open source LLMs like Llama or DeepSeek?

Start with APIs (Claude, GPT) for speed. Consider open source when you have: privacy requirements, high volume making APIs expensive, or need for fine-tuning. Don't prematurely optimize - validate your product first.

What are reasoning models and when should I use them?

Reasoning models like o1 and DeepSeek R1 use chain-of-thought to solve complex problems. Use them for math, coding challenges, and multi-step reasoning. They're slower and more expensive, so reserve them for tasks where accuracy matters more than speed.

How do I reduce LLM costs without sacrificing quality?

Use model cascading (start with cheap models, escalate if needed), clean your prompts to reduce tokens, use smaller models for simple tasks (Haiku, GPT-3.5), and implement caching for repeated queries. YC founders report 78%+ cost reductions with these techniques.
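
As one concrete lever, a minimal in-memory cache for repeated queries (swap in Redis or provider-side prompt caching for production; `call_model` here is a placeholder for your API call):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_model) -> str:
    """Return a cached answer for identical prompts; only pay for misses."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # API call happens only on a miss
    return _cache[key]
```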
