
Kundan | Mar 29, 2025 | 6 min read
Which of the popular LLMs have the best developer experience?
I finally have some free time to share what's been bothering me lately: the developer experience of working with various LLM APIs. This isn't another post about benchmarks or technical capabilities—it's about the real-world frustrations and occasional wins when building actual products with these tools.
The Context: Resume Parsing with LLMs
For context, I've been developing TrackJobs.co, specifically focusing on the resume builder component. My challenge has been:
- Taking users' PDF resumes
- Extracting the raw text content (using libraries like PDFJS)
- Sending this content to various LLMs
- Getting structured data back according to a specific schema
- Using that data in my application
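Concretely, the extraction step is just a page-by-page text dump. Here's a minimal sketch with pdfjs-dist (worker setup and error handling omitted):

```typescript
import { getDocument } from "pdfjs-dist";

// Extract raw text from an uploaded PDF with PDFJS, page by page.
async function extractPdfText(data: Uint8Array): Promise<string> {
  const pdf = await getDocument({ data }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Text items carry their string in `str`; skip marked-content entries.
    pages.push(
      content.items.map((item) => ("str" in item ? item.str : "")).join(" ")
    );
  }
  return pages.join("\n");
}
```

The extracted text then goes to whichever LLM is annoying me least that week, with a schema attached. That provider-specific part is where everything falls apart.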
Sounds simple enough on paper. In reality? It's been a nightmare of inconsistent APIs, limited documentation, and hair-pulling debugging sessions. Let me break down my experience with each major provider.
OpenAI: The Structured Output Struggle
The OpenAI API was my starting point, and while powerful, it has some significant shortcomings:
Schema Support Issues
OpenAI introduced Zod support in their SDK, but the implementation feels half-baked (it's still in beta):
- No optional fields: You can't use Zod's optional fields with ChatGPT. Instead, you're forced to use nullables.
- The undefined vs. null problem: This creates a fundamental mismatch between how your application handles optional data and how the API returns it.
- No validators: You can't use things like `z.string().email()` or `z.string().url()`. Even if the schema validation passes, you'll get invalid emails or URLs.
- Undocumented limitations: There's no clear documentation on what Zod features are supported. It's all trial and error.
The real pain? I want a single Zod schema to do double duty: validating my forms and shaping the LLM's response. These limitations make that shared schema impossible (see the sketch below).
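To make the mismatch concrete, here's roughly what the workaround looks like with the SDK's beta zodResponseFormat helper. This is a sketch: the ResumeSchema fields are illustrative, and resumeText is the output of the extraction step above.

```typescript
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

declare const resumeText: string; // output of the PDFJS extraction step

// What I want: .optional() fields plus .email()/.url() validators, shared
// with my form validation. What the beta parser accepts: .nullable() only.
const ResumeSchema = z.object({
  name: z.string(),
  email: z.string().nullable(),   // z.string().email().optional() gets rejected
  website: z.string().nullable(), // z.string().url() gets rejected too
});

const client = new OpenAI();
const completion = await client.beta.chat.completions.parse({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `Extract the resume data:\n${resumeText}` }],
  response_format: zodResponseFormat(ResumeSchema, "resume"),
});

// The API returns null where the rest of my app expects undefined, so every
// consumer needs a null-to-undefined mapping pass before the data hits a form.
const parsed = completion.choices[0].message.parsed;
```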
Incomplete Parsing
When using GPT-4o-Mini for resume parsing (which should be straightforward), I keep hitting roadblocks:
- Even with the max completion tokens set to 16K, the model often parses only two sections (like education and work experience) and ignores everything else.
- This happens even when I've verified the complete data is being passed correctly.
Poor Error Handling
The error handling is bad:
- When a request fails due to incomplete responses or length limits, tokens are still consumed from your account.
- You can't see how many tokens were used on failed requests, making debugging nearly impossible.
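The most I've been able to do is defensive checking on responses that do come back, since truncation at least surfaces as a finish_reason. A sketch, using the plain completions API with illustrative prompt and limits:

```typescript
import OpenAI from "openai";

declare const resumeText: string; // output of the PDFJS extraction step

const client = new OpenAI();
const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  max_completion_tokens: 16384, // the 16K ceiling mentioned above
  response_format: { type: "json_object" },
  messages: [{ role: "user", content: `Return the resume as JSON:\n${resumeText}` }],
});

const choice = completion.choices[0];
if (choice.finish_reason === "length") {
  // You're billed for these tokens even though the JSON is unusable.
  // On requests that fail outright, you don't even get this much visibility.
  console.error("Truncated; tokens billed:", completion.usage?.completion_tokens);
  throw new Error("Completion hit the length limit");
}
```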
Slow Responses
Compared to competitors, OpenAI's API is painfully slow:
- While Gemini takes about 30 seconds for a resume-parsing request, OpenAI often takes a full minute.
- For users with the attention span of a goldfish (most of us these days), this is unacceptable.
Gemini: Faster But Still Frustrating
Google's Gemini API improves on some fronts but introduces its own headaches:
The 8K Token Ceiling
The most significant limitation is the output token limit:
- Even their best models (2.0 Pro, 2.0 Flash) are capped at 8K tokens for output.
- While 8K tokens seems plenty, it becomes problematic with complex documents.
- Despite using only about 30% of the token limit (around 2,500 tokens out of 8,192), I regularly hit "length limit reached" errors with no clear solution.
I've confirmed this isn't a rate limit issue. For the 2.0 Flash model:
- The limit is 15 requests per minute
- 1 million tokens per minute
- I'm making only 1-2 requests per minute, well under these thresholds
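For reference, here's roughly how my calls are configured with the @google/generative-ai SDK, and where the opaque length failure surfaces (a sketch; resumeText is the extracted text from earlier):

```typescript
import { GoogleGenerativeAI, FinishReason } from "@google/generative-ai";

declare const resumeText: string; // extracted earlier with PDFJS

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-2.0-flash",
  generationConfig: {
    maxOutputTokens: 8192, // the ceiling, even on the best models
    responseMimeType: "application/json",
  },
});

const result = await model.generateContent(
  `Return the resume as JSON:\n${resumeText}`
);
const candidate = result.response.candidates?.[0];

// This fires even when the output is ~2,500 tokens, nowhere near the cap.
if (candidate?.finishReason === FinishReason.MAX_TOKENS) {
  throw new Error("Gemini reported a length limit");
}
```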
Schema Support Limitations
While Gemini's Zod support is better than OpenAI's, it still has significant gaps:
- No support for record types
- No support for tuples
- No first-party schema conversion tools (like Zod to Gemini or Pydantic to Gemini)
- Their native schema system is severely limited with poor enforcement capabilities
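So you end up hand-rolling a converter. Below is a hypothetical, deliberately partial zodToGeminiSchema: it covers only the narrow Zod subset Gemini's schema format can express, which is exactly the problem.

```typescript
import { SchemaType } from "@google/generative-ai";
import { z } from "zod";

// Minimal local stand-in for Gemini's schema shape.
type GeminiSchema = {
  type: SchemaType;
  items?: GeminiSchema;
  properties?: Record<string, GeminiSchema>;
};

// Hypothetical converter for the Zod subset Gemini can express.
// z.record() and z.tuple() have no Gemini equivalent, so they can't be mapped.
function zodToGeminiSchema(schema: z.ZodTypeAny): GeminiSchema {
  if (schema instanceof z.ZodString) return { type: SchemaType.STRING };
  if (schema instanceof z.ZodNumber) return { type: SchemaType.NUMBER };
  if (schema instanceof z.ZodBoolean) return { type: SchemaType.BOOLEAN };
  if (schema instanceof z.ZodArray) {
    return { type: SchemaType.ARRAY, items: zodToGeminiSchema(schema.element) };
  }
  if (schema instanceof z.ZodObject) {
    const properties: Record<string, GeminiSchema> = {};
    for (const [key, value] of Object.entries(schema.shape)) {
      properties[key] = zodToGeminiSchema(value as z.ZodTypeAny);
    }
    return { type: SchemaType.OBJECT, properties };
  }
  throw new Error(`No Gemini schema equivalent for ${schema._def.typeName}`);
}
```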
The Classic Google Documentation Problem
Like many Google products, Gemini suffers from fragmented, contradictory documentation:
- Different pages showing different information for the same topics
- Contradictory pricing information across Google domains
- Confusion between Gemini and Vertex AI that leaves you wondering what product you're actually using
Unclear Error Messages
The error messages are often useless:
- If you pass an invalid model name, you get a "400 no body" error rather than a clear "invalid model" message.
- JSON formatting issues: the model starts returning JSON but never finishes it, leaving a truncated response with a missing closing } or quote.
- Like OpenAI, no token usage information on failed requests
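My current mitigation is treating every response as suspect and failing loudly when the JSON doesn't close (a sketch):

```typescript
// Gemini sometimes stops mid-object, so parse defensively rather than trusting
// that responseMimeType: "application/json" guarantees valid JSON.
function parseModelJson(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    const opens = (raw.match(/[{[]/g) ?? []).length;
    const closes = (raw.match(/[}\]]/g) ?? []).length;
    throw new Error(
      opens > closes
        ? `Truncated JSON: ${opens - closes} unclosed bracket(s)`
        : "Malformed JSON from model"
    );
  }
}
```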
Anthropic (Claude): Premium Price, Premium Pain
While I haven't extensively used Claude, my findings are:
The Apple Approach
- Claude positions itself as the premium option in the LLM market
- Even their smallest models (Claude 3 Haiku) are significantly more expensive than GPT-4o-Mini or Gemini 2.0 Flash
- The pricing makes it hard to justify for simpler tasks like resume parsing
Technical Limitations
- No support for the OpenAI SDK, requiring separate integration work
- No proper schema enforcement mechanisms
- JSON support exists but lacks standardized schema definition capabilities
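The workaround most people reach for is pressing tool use into service as a schema channel: describe the output shape as a tool's input_schema and force the model to "call" it. A sketch (the schema fields are illustrative, and nothing actually validates the result against the schema):

```typescript
import Anthropic from "@anthropic-ai/sdk";

declare const resumeText: string; // extracted earlier with PDFJS

const anthropic = new Anthropic();
const message = await anthropic.messages.create({
  model: "claude-3-haiku-20240307",
  max_tokens: 4096,
  tools: [{
    name: "record_resume",
    description: "Record structured resume data",
    input_schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string" }, // no format validation, just a plain string
      },
      required: ["name"],
    },
  }],
  tool_choice: { type: "tool", name: "record_resume" }, // force the "call"
  messages: [{ role: "user", content: `Extract the resume data:\n${resumeText}` }],
});

// The structured payload comes back as tool input, not ordinary text,
// and nothing guarantees it actually conforms to input_schema.
const block = message.content.find((b) => b.type === "tool_use");
const parsed = block?.type === "tool_use" ? block.input : undefined;
```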
DeepSeek: The Almost Alternative
DeepSeek offers OpenAI SDK compatibility but has limitations:
Cost Structure Problems
- DeepSeek is affordable compared to GPT-4o, but not compared to smaller models
- There are no smaller, cost-effective models available
- The per-million output token pricing is too expensive for simple tasks like resume parsing
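The compatibility story itself is painless, to be fair: point the OpenAI SDK at DeepSeek's base URL and the rest of your code stays the same (a sketch):

```typescript
import OpenAI from "openai";

declare const resumeText: string; // extracted earlier with PDFJS

// DeepSeek speaks the OpenAI wire protocol, so the same SDK works unchanged.
const deepseek = new OpenAI({
  baseURL: "https://api.deepseek.com",
  apiKey: process.env.DEEPSEEK_API_KEY,
});

const completion = await deepseek.chat.completions.create({
  model: "deepseek-chat", // the only general-purpose tier; no cheaper fallback
  response_format: { type: "json_object" },
  messages: [{ role: "user", content: `Return the resume as JSON:\n${resumeText}` }],
});
```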
The Deeper Problem: Unpredictability
The biggest issue across all these platforms is the sheer unpredictability:
- Even with schema enforcement, there's no guarantee data will be parsed correctly every time
- Sometimes it works flawlessly, other times it fails mysteriously
- There seem to be limits to schema depth and complexity that aren't documented anywhere
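So every call ends up wrapped in a validate-and-retry loop, which is precisely the boilerplate schema enforcement was supposed to eliminate. A sketch, where callLlm stands in for any of the provider calls above:

```typescript
import { z } from "zod";

// Hypothetical wrapper: callLlm is any provider call returning raw JSON text.
async function parseWithRetry<T>(
  schema: z.ZodType<T>,
  callLlm: () => Promise<string>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = schema.safeParse(JSON.parse(await callLlm()));
      if (result.success) return result.data;
      lastError = result.error; // "enforced" schema, yet it still mismatches
    } catch (err) {
      lastError = err; // truncated or malformed JSON
    }
  }
  throw lastError;
}
```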
Companies are investing heavily in model capabilities but neglecting developer experience. They want the spotlight on their fancy new models but aren't focusing on making them practical to implement, especially for solo developers or small teams.
What Developers Actually Need
Based on my experience, here's what would dramatically improve the LLM developer experience:
- Complete schema support: First-class support for standard validation libraries like Zod and Pydantic
- Better documentation: Clear documentation on limitations and supported features
- Improved error handling: Token usage information for failed requests and clear error messages
- Predictable parsing: If a schema is enforced, the data should reliably match it
- Better structured output support: More work on making the beta features production-ready
Bottom Line: Which One Sucks Least?
If you're choosing an LLM API today, here's my assessment:
- OpenAI: Powerful but frustrating. Use it if you need the larger output token window.
- Gemini: Faster but limited. Best if you're working with smaller documents.
- Claude: Premium quality at premium prices. Potentially better for enterprise use.
- DeepSeek: Good alternative to GPT-4o, but still expensive for simple tasks since there are no mid-size or smaller models.
My verdict: go with Gemini, since it has a decent knowledge cutoff and is cheap enough for simpler tasks. But the constant patching won't go away anytime soon.
The current landscape feels like companies are rushing to release new models rather than fixing fundamental developer experience issues. Structured outputs have been in beta for 7-8 months with little improvement, while new models keep coming out every few weeks.
I'm still using a mix of these APIs depending on the specific task, but none provides a truly satisfying developer experience. OpenAI has the advantage of the larger token window, while Gemini wins on speed, but all of them make me want to tear my hair out on a regular basis.
Until these issues are addressed, we developers will continue patching together solutions and fighting with inconsistent APIs. If you've faced similar challenges or found workarounds, I'd love to hear about them.
For now, I'm back to debugging why my perfectly valid Zod schema is causing yet another cryptic error. The joys of working on the bleeding edge, I suppose.