AI Model Selection for Startups: Claude vs GPT vs Gemini
Andrej Karpathy uses Claude for writing and GPT-4 for code. Sam Altman says most startups pick the wrong model for their use case. Across 47 Y Combinator talks, a clear framework emerged for matching models to tasks -- and most founders are leaving performance (and money) on the table.
What Does the 2025 AI Model Landscape Look Like?
The 2025 AI model landscape features six frontier providers with converging capabilities, where the model you choose matters less than how you use it. Sam Altman put it directly: "The models are converging in capability. What matters is your application layer." That said, meaningful differences still exist.
Here is the current state:
| Model | Best For | Weakness | Context |
|---|---|---|---|
| Claude 3.5 Sonnet | Coding, extended reasoning, writing | Tool use ecosystem | 200K |
| GPT-4o | Multimodal, real-time, tool use | Reasoning depth | 128K |
| Gemini 1.5 Pro | Long context, video, Google ecosystem | Consistency | 1M+ |
| o1 | Complex reasoning, math, science | Speed, cost | 128K |
| DeepSeek R1 | Reasoning at lower cost | Ecosystem, support | 64K |
| Llama 3.1 405B | Self-hosting, privacy, customization | Infrastructure needs | 128K |
YC Partner Insight
From YC's AI talks: "Model capability is table stakes now. The winners will be those who understand their users deeply and build the right application layer on top."
What Is Karpathy's Software 3.0 Framework?
Karpathy's Software 3.0 framework describes LLMs as the third era of programming: Software 1.0 is explicit code, 2.0 is neural networks trained on data, and 3.0 is natural language programming where you describe what you want and the model executes. Every founder should internalize this framework.
Software 1.0: Explicit Code
Traditional programming. You write explicit rules. Deterministic, predictable, but limited to problems you can specify completely.
Software 2.0: Neural Networks
Machine learning. You provide data and architecture. The model learns the program. Great for pattern recognition but narrow.
Software 3.0: LLMs
Natural language programming. You describe what you want in English. The model "knows" from training on human knowledge. General purpose but probabilistic.
Karpathy's key insight: LLMs are not deterministic computers. They're "vibes-based" systems. You need to treat them like you would a new hire - give them examples, iterate on instructions, and verify their work.
The Psychology of LLMs
Karpathy emphasizes that LLMs have "psychology" - they respond to social pressure in prompts, they try to please, and they sometimes hallucinate when uncertain. Understanding this is key to using them effectively.
Key Takeaway
The most expensive model is rarely the right choice. YC founders running production AI products consistently use tiered model architectures -- cheap models for classification and routing, mid-tier models for most generation tasks, and frontier models only for complex reasoning. One founder cut their AI costs 80% by switching from GPT-4 to Haiku for simple tasks.
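The tiered architecture described above can be sketched as a simple router. The tier names, model names, prices, and routing heuristic below are illustrative assumptions, not any specific founder's production setup:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    model: str
    cost_per_1m_input: float  # USD, illustrative prices

TIERS = {
    "cheap": Tier("claude-haiku", 0.25),
    "mid": Tier("claude-sonnet", 3.00),
    "frontier": Tier("o1", 15.00),
}

def route(task_type: str) -> Tier:
    """Send classification/routing to the cheap tier, complex reasoning
    to the frontier tier, and everything else to the mid tier."""
    if task_type in ("classify", "route", "extract"):
        return TIERS["cheap"]
    if task_type in ("reason", "math", "plan"):
        return TIERS["frontier"]
    return TIERS["mid"]

print(route("classify").model)  # classification lands on the cheap tier
print(route("math").model)      # hard reasoning lands on the frontier tier
```

The point is that the routing decision is one cheap function call, so the expensive model only runs when the task actually needs it.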
What Are the Practical Differences Between Claude, GPT, and Gemini?
Claude excels at coding and extended reasoning, GPT-4 leads in tool use and ecosystem maturity, and Gemini dominates long-context tasks with its 1M+ token window. Based on YC interviews with founders and AI leaders, here is where each model excels in practice.
Claude (Anthropic)
Dario Amodei's and Amanda Askell's interviews on YC reveal Claude's design philosophy: safety through understanding, not restrictions. The model is trained to be genuinely helpful while avoiding harm.
Strengths
- Extended thinking for complex reasoning
- Coding (particularly with Claude Code)
- Long, nuanced writing
- Following complex instructions
- 200K context with strong recall
Best Use Cases
- AI-assisted coding (Cursor, Claude Code)
- Document analysis
- Technical writing
- Research synthesis
GPT-4/4o (OpenAI)
Sam Altman's YC interviews emphasize OpenAI's focus on developer experience and ecosystem. GPT-4's strength is the breadth of integrations and tooling around it.
Strengths
- Best-in-class tool use and function calling
- Real-time voice (GPT-4o)
- Vision capabilities
- Massive ecosystem
- Consistent API reliability
Best Use Cases
- Production apps with complex workflows
- Voice assistants
- Multi-modal applications
- Agent systems with many tools
Gemini (Google)
Less frequently discussed in YC talks, but the 1M+ token context window makes Gemini uniquely powerful for specific use cases.
Strengths
- Massive context window (1M+ tokens)
- Native video understanding
- Google ecosystem integration
- Competitive pricing
Best Use Cases
- Entire codebases in context
- Long video analysis
- Google Workspace integration
- Long document processing
Cursor CEO's take
In his YC interview, Cursor's CEO explains their model switching: "We use Claude for heavy lifting - the actual code generation. But the model choice matters less than the context you give it. Most of the intelligence is in how you construct the prompt."
When Should You Use Reasoning Models Like o1 or DeepSeek R1?
You should use reasoning models for complex math, coding challenges, and multi-step reasoning tasks where accuracy matters more than speed. These models explicitly think through problems step-by-step before answering, with DeepSeek R1 offering comparable reasoning at lower cost. If you are building AI agents, reasoning models work best for the planning and evaluation layers.
OpenAI o1
- Proprietary chain-of-thought reasoning
- Excels at math, coding, science
- Slower but more accurate for hard problems
- Premium pricing ($15/1M input, $60/1M output)
DeepSeek R1
- Open weights (you can run it locally)
- Comparable reasoning to o1
- Much lower cost
- Chinese company - consider data residency
From YC's analysis of DeepSeek: "The engineering innovations are real." DeepSeek achieved similar results to frontier models with significantly less compute, using techniques like:
FP8 Training
8-bit floating point instead of 16-bit. 2x memory efficiency, enabling larger batches and faster training.
Mixture of Experts (MoE)
Only 37B parameters active per inference despite 671B total. Dramatically reduces inference cost.
Multi-head Latent Attention
Compresses KV cache for faster inference without quality loss.
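The MoE idea above can be shown with a toy top-k gating function. This is a pedagogical sketch in pure Python, not DeepSeek's actual implementation, and the logits are made up:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(gate_logits, k=2):
    """Pick the k experts with the highest gate scores and renormalize
    their weights, so only k experts run for this token."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

# 8 experts, but only 2 are active for this token:
weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(weights)  # only experts 1 and 4 receive nonzero weight
```

Scaled up, this is how a 671B-parameter model can run with only 37B parameters active: most experts simply never execute for a given token.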
What Should Founders Know About Scaling Laws?
Founders should know three things about scaling laws: models will keep getting better with no ceiling in sight, inference costs drop 10x every 18 months, and the moat is never the model itself but your data, distribution, and user workflows. YC's dedicated scaling laws episode breaks down why AI capabilities keep improving predictably.
The Core Equation
Loss = A × (Compute)^(-0.050) × (Data)^(-0.095) × (Parameters)^(-0.076)
In plain English: Model performance improves predictably as you increase compute, data, or parameters. The relationship is a power law, not linear - each fixed improvement in loss requires roughly 10x more resources.
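You can see what the power law implies with two lines of arithmetic. The exponent below matches the compute term in the equation above; the absolute numbers are illustrative, not a prediction for any specific model:

```python
# With loss proportional to compute^(-0.05), a 10x compute increase
# multiplies loss by 10^(-0.05), roughly an 11% reduction.
def loss_ratio(resource_multiplier: float, exponent: float = -0.05) -> float:
    return resource_multiplier ** exponent

print(round(loss_ratio(10), 3))   # 10x compute  -> ~0.891x loss
print(round(loss_ratio(100), 3))  # 100x compute -> ~0.794x loss
```

This is why each generation of frontier models needs an order of magnitude more resources: the returns are real but multiplicatively small.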
Models Will Keep Getting Better
No ceiling in sight. GPT-5, Claude 4, etc. will be meaningfully more capable than current models. Build for this - don't over-engineer around current limitations.
Inference Costs Will Drop
Every 18 months, the same capability gets 10x cheaper. What costs $1 today will cost $0.10 in 18 months. Price accordingly.
The Moat Is Not the Model
If your product is just "GPT-4 + a wrapper," you have no moat. The defensibility comes from data, distribution, and user workflows - not model access.
The GPT Wrapper Myth
YC's analysis shows that "GPT wrapper" companies CAN build real businesses. The key is building something that gets better with use - whether through proprietary data, user feedback loops, or workflow integration that creates switching costs.
Build AI Products Without Deep Technical Knowledge
Our AI Product Playbook covers finding opportunities, building moats, and pricing strategies for AI-native products.
How Do You Write State-of-the-Art Prompts for AI Agents?
You write state-of-the-art prompts by being specific about output format, providing 2-3 examples (few-shot), using chain-of-thought for complex tasks, and defining escape hatches for uncertainty. From YC's prompting masterclass, these are the techniques that actually move the needle.
Be Specific About Format
Don't say "return JSON." Say "Return a JSON object with keys: name (string), score (integer 0-100), reasoning (string, 2-3 sentences)." The more specific, the more reliable.
Give Examples (Few-Shot)
Show 2-3 examples of the exact input/output format you want. This works better than any amount of explanation for most tasks.
Use Chain-of-Thought
For complex tasks, explicitly ask the model to "think step by step" or "explain your reasoning before giving the final answer." This dramatically improves accuracy on multi-step problems.
Define Escape Hatches
Tell the model what to do when it's uncertain: "If you're not sure, respond with 'UNSURE: ' followed by your best guess and why you're uncertain."
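The four techniques above can live in one prompt template. This is an illustrative sketch - the ticket-classification task, field names, and examples are made up for the demonstration:

```python
# Template combining explicit format, few-shot examples,
# chain-of-thought, and an escape hatch.
PROMPT = """Classify the support ticket below.

Return a JSON object with keys: category (string, one of "billing",
"bug", "feature"), confidence (integer 0-100), reasoning (string,
2-3 sentences). Think step by step before giving the final answer.
If you're not sure, respond with "UNSURE: " followed by your best
guess and why you're uncertain.

Example 1:
Ticket: "I was charged twice this month."
Output: {"category": "billing", "confidence": 95, "reasoning": "..."}

Example 2:
Ticket: "The export button crashes the app."
Output: {"category": "bug", "confidence": 90, "reasoning": "..."}

Ticket: "{ticket}"
Output:"""

def build_prompt(ticket: str) -> str:
    # .replace instead of .format so the literal JSON braces survive
    return PROMPT.replace("{ticket}", ticket)

prompt = build_prompt("Password reset email never arrives.")
```

Note how each technique earns its place: the format spec constrains the output, the examples anchor it, the chain-of-thought instruction improves accuracy, and the escape hatch gives uncertainty somewhere to go besides a hallucination.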
The Temperature Setting
For deterministic tasks (extraction, classification), use temperature=0. For creative tasks (writing, brainstorming), use 0.7-1.0. Most startups should default to temperature=0 and only increase when they want variety.
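The temperature rule above is easy to encode as a default-to-zero helper. The parameter shape loosely follows common chat APIs; the model name and field names are illustrative, not a specific SDK's exact schema:

```python
def request_params(task: str) -> dict:
    """Default to temperature=0; only raise it for creative tasks."""
    deterministic = task in ("extract", "classify")
    return {
        "model": "claude-haiku",  # illustrative model name
        "temperature": 0.0 if deterministic else 0.8,
        "max_tokens": 512,
    }

print(request_params("classify"))   # temperature 0.0 for deterministic work
print(request_params("brainstorm")) # temperature 0.8 for variety
```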
Should You Use Open Source or Closed AI Models?
Start with closed APIs (Claude or GPT) to validate your product, then consider open source models like Llama, Mistral, or DeepSeek for specific high-volume use cases where privacy, cost, or fine-tuning requirements justify the infrastructure investment.
Use Open Source When
- Privacy/data residency is critical
- You need to fine-tune for your domain
- High volume makes API costs prohibitive
- You need full control over the model
- Latency requirements are extreme
Use Closed APIs When
- Speed to market matters most
- You're still iterating on product-market fit
- Volume is low to moderate
- You need the latest capabilities
- You don't want to manage infrastructure
The practical advice from YC founders: Start with APIs (Claude or GPT), validate your product, then consider open source for specific high-volume use cases. Don't prematurely optimize for cost.
Which AI Model Should You Use for Your Startup?
Use Claude for coding and heavy cognitive tasks, GPT-4 for production apps with complex tool use, Gemini for long-context processing, reasoning models for math and science, and Haiku/GPT-3.5 for simple classification tasks. Based on patterns across 47 YC videos, here is the practical decision tree. For non-technical founders, vibe coding tools abstract most of these decisions away.
For Coding/Development
Use Claude via Cursor or Claude Code. Multiple YC founders cite Claude as their primary coding assistant.
Fallback: GPT-4 if you need specific integrations or tool use.
For Complex Reasoning/Math
Use o1 or DeepSeek R1. The extended thinking time is worth it for problems that require multi-step reasoning.
Cost tip: Start with DeepSeek R1 for testing, use o1 for production if quality matters.
For Production Apps with Tools
Use GPT-4. The function calling and tool use ecosystem is most mature. Reliability matters more than marginal capability differences.
Consider: Claude for heavy lifting, GPT-4 for orchestration.
For Long Context (Entire Codebases)
Use Gemini 1.5 Pro. The 1M+ token context window is unmatched for stuffing entire codebases into context.
Alternative: Claude 200K is often sufficient and more consistent.
For Voice/Real-Time
Use GPT-4o. Native voice mode is still ahead. Combine with LiveKit for production voice apps.
Cost tip: Use speech-to-text pipeline instead of real-time mode for significant savings.
For Simple Tasks (Classification, Extraction)
Use Claude Haiku or GPT-3.5. Don't overpay for capabilities you don't need. These are 10-50x cheaper than frontier models.
Rule: If Haiku works 95% of the time, use Haiku and handle edge cases separately.
Our take
The biggest mistake we see founders make is treating model selection as a permanent decision. The landscape changes every 3-4 months. The smart play is building a model-agnostic abstraction layer from day one so you can swap providers without rewriting your application. Cursor does this internally -- they switch between Claude and GPT depending on the task. Your architecture should make model switching a config change, not a rewrite.
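A minimal sketch of that abstraction layer: providers implement one interface, and the model choice lives in config rather than application code. The provider classes below are stubs standing in for real SDK calls, not actual Anthropic or OpenAI client code:

```python
from typing import Protocol

class Provider(Protocol):
    def complete(self, prompt: str) -> str: ...

class ClaudeProvider:
    def complete(self, prompt: str) -> str:
        return f"[claude] {prompt}"  # real code would call Anthropic's SDK

class GPTProvider:
    def complete(self, prompt: str) -> str:
        return f"[gpt] {prompt}"  # real code would call OpenAI's SDK

PROVIDERS: dict[str, Provider] = {
    "claude": ClaudeProvider(),
    "gpt": GPTProvider(),
}

def complete(prompt: str, config: dict) -> str:
    # Swapping providers is a config change, not a rewrite.
    return PROVIDERS[config["provider"]].complete(prompt)

print(complete("Summarize this doc", {"provider": "claude"}))
```

With this shape, switching from Claude to GPT for a given task is one line in a config file, which is exactly the flexibility a landscape that shifts every 3-4 months demands.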
Want to research AI and startup channels?
Taffy lets you analyze transcripts and comments from any YouTube channel. Find out what AI tools founders are discussing, what problems they're solving, and what's working.
Get Started Free
Free daily channel insights. No credit card required.
Frequently Asked Questions
Should I use Claude or GPT for my startup?
It depends on your use case. Claude excels at extended thinking, coding, and complex reasoning. GPT-4 has stronger tool use and real-time capabilities. For most startups, the recommendation is: use Claude for heavy cognitive tasks, GPT-4 for production apps with many integrations.
What are scaling laws and why do they matter?
Scaling laws predict model performance based on compute, data, and parameters. They matter because they show capabilities improve predictably with scale. This helps you plan which features to build now vs. wait for future models to enable.
Should startups use open source LLMs like Llama or DeepSeek?
Start with APIs (Claude, GPT) for speed. Consider open source when you have: privacy requirements, high volume making APIs expensive, or need for fine-tuning. Don't prematurely optimize - validate your product first.
What are reasoning models and when should I use them?
Reasoning models like o1 and DeepSeek R1 use chain-of-thought to solve complex problems. Use them for math, coding challenges, and multi-step reasoning. They're slower and more expensive, so reserve them for tasks where accuracy matters more than speed.
How do I reduce LLM costs without sacrificing quality?
Use model cascading (start with cheap models, escalate if needed), clean your prompts to reduce tokens, use smaller models for simple tasks (Haiku, GPT-3.5), and implement caching for repeated queries. YC founders report 78%+ cost reductions with these techniques.
Written by
Arun Agrahri
Builder of Taffy. I spend most of my time analyzing YouTube channels to find patterns others miss. These guides are the result of processing thousands of videos and comments through our data pipeline.
Get the next guide first
We publish deep-dive research guides weekly. Be the first to know when new analysis drops.
No spam. Unsubscribe anytime.