
Kundan | Mar 29, 2025 | 6 min read
Which of the popular LLMs have the best developer experience?
I finally have some free time to share what's been bothering me lately: the developer experience of working with various LLM APIs. This isn't another post about benchmarks or technical capabilities—it's about the real-world frustrations and occasional wins when building actual products with these tools.
The Context: Resume Parsing with LLMs
For context, I've been developing TrackJobs.co, specifically focusing on the resume builder component. My challenge has been:
- Taking users' PDF resumes
- Extracting the raw text content (using libraries like PDFJS)
- Sending this content to various LLMs
- Getting structured data back according to a specific schema
- Using that data in my application
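Concretely, the extraction step is just a page-by-page text dump. Here's a minimal sketch with pdfjs-dist (worker setup and error handling omitted):

```typescript
import { getDocument } from "pdfjs-dist";

// Extract raw text from an uploaded PDF with PDFJS, page by page.
async function extractPdfText(data: Uint8Array): Promise<string> {
  const pdf = await getDocument({ data }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Text items carry their string in `str`; skip marked-content entries.
    pages.push(
      content.items.map((item) => ("str" in item ? item.str : "")).join(" ")
    );
  }
  return pages.join("\n");
}
```

The extracted text then goes to whichever LLM is annoying me least that week, with a schema attached. That provider-specific part is where everything falls apart.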
Sounds simple enough on paper. In reality? It's been a nightmare of inconsistent APIs, limited documentation, and hair-pulling debugging sessions. Let me break down my experience with each major provider.
OpenAI: The Structured Output Struggle
The OpenAI API was my starting point, and while powerful, it has some significant shortcomings:
Schema Support Issues
OpenAI introduced Zod support in their SDK, but the implementation feels half-baked (it's still in beta):
- No optional fields: You can't use Zod's optional fields with ChatGPT. Instead, you're forced to use nullables.
- The undefined vs. null problem: This creates a fundamental mismatch between how your application handles optional data and how the API returns it.
- No validators: You can't use things like `z.string().email()` or `z.string().url()`. Even if the schema validation passes, you'll get invalid emails or URLs.
- Undocumented limitations: There's no clear documentation on what Zod features are supported. It's all trial and error.
The real pain? I want a single Zod schema to do double duty: validating my forms and shaping the LLM's response. These limitations make that shared schema impossible (see the sketch below).
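To make the mismatch concrete, here's roughly what the workaround looks like with the SDK's beta zodResponseFormat helper. This is a sketch: the ResumeSchema fields are illustrative, and resumeText is the output of the extraction step above.

```typescript
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

declare const resumeText: string; // output of the PDFJS extraction step

// What I want: .optional() fields plus .email()/.url() validators, shared
// with my form validation. What the beta parser accepts: .nullable() only.
const ResumeSchema = z.object({
  name: z.string(),
  email: z.string().nullable(),   // z.string().email().optional() gets rejected
  website: z.string().nullable(), // z.string().url() gets rejected too
});

const client = new OpenAI();
const completion = await client.beta.chat.completions.parse({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `Extract the resume data:\n${resumeText}` }],
  response_format: zodResponseFormat(ResumeSchema, "resume"),
});

// The API returns null where the rest of my app expects undefined, so every
// consumer needs a null-to-undefined mapping pass before the data hits a form.
const parsed = completion.choices[0].message.parsed;
```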
Incomplete Parsing
When using GPT-4o-Mini for resume parsing (which should be straightforward), I keep hitting roadblocks:
- Even with the max completion tokens set to 16K, the model often parses only two sections (like education and work experience) and ignores everything else.
- This happens even when I've verified the complete data is being passed correctly.
Poor Error Handling
The error handling is bad:
- When a request fails due to incomplete responses or length limits, tokens are still consumed from your account.
- You can't see how many tokens were used on failed requests, making debugging nearly impossible.
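The most I've been able to do is defensive checking on responses that do come back, since truncation at least surfaces as a finish_reason. A sketch, using the plain completions API with illustrative prompt and limits:

```typescript
import OpenAI from "openai";

declare const resumeText: string; // output of the PDFJS extraction step

const client = new OpenAI();
const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  max_completion_tokens: 16384, // the 16K ceiling mentioned above
  response_format: { type: "json_object" },
  messages: [{ role: "user", content: `Return the resume as JSON:\n${resumeText}` }],
});

const choice = completion.choices[0];
if (choice.finish_reason === "length") {
  // You're billed for these tokens even though the JSON is unusable.
  // On requests that fail outright, you don't even get this much visibility.
  console.error("Truncated; tokens billed:", completion.usage?.completion_tokens);
  throw new Error("Completion hit the length limit");
}
```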
Slow Responses
Compared to competitors, OpenAI's API is painfully slow:
- While Gemini takes about 30 seconds for a resume-parsing request, OpenAI often takes a full minute.
- For users with the attention span of a goldfish (most of us these days), this is unacceptable.
Gemini: Faster But Still Frustrating
Google's Gemini API improves on some fronts but introduces its own headaches:
The 8K Token Ceiling
The most significant limitation is the output token limit:
- Even their best models (2.0 Pro, 2.0 Flash) are capped at 8K tokens for output.
- While 8K tokens seems plenty, it becomes problematic with complex documents.
- Despite using only about 30% of the token limit (around 2,500 tokens out of 8,192), I regularly hit "length limit reached" errors with no clear solution.
I've confirmed this isn't a rate limit issue. For the 2.0 Flash model:
- The limit is 15 requests per minute
- 1 million tokens per minute
- I'm making only 1-2 requests per minute, well under these thresholds
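For reference, here's roughly how my calls are configured with the @google/generative-ai SDK, and where the opaque length failure surfaces (a sketch; resumeText is the extracted text from earlier):

```typescript
import { GoogleGenerativeAI, FinishReason } from "@google/generative-ai";

declare const resumeText: string; // extracted earlier with PDFJS

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-2.0-flash",
  generationConfig: {
    maxOutputTokens: 8192, // the ceiling, even on the best models
    responseMimeType: "application/json",
  },
});

const result = await model.generateContent(
  `Return the resume as JSON:\n${resumeText}`
);
const candidate = result.response.candidates?.[0];

// This fires even when the output is ~2,500 tokens, nowhere near the cap.
if (candidate?.finishReason === FinishReason.MAX_TOKENS) {
  throw new Error("Gemini reported a length limit");
}
```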
Schema Support Limitations
While Gemini's Zod support is better than OpenAI's, it still has significant gaps:
- No support for record types
- No support for tuples
- No first-party schema conversion tools (like Zod to Gemini or Pydantic to Gemini)
- Their native schema system is severely limited with poor enforcement capabilities
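So you end up hand-rolling a converter. Below is a hypothetical, deliberately partial zodToGeminiSchema: it covers only the narrow Zod subset Gemini's schema format can express, which is exactly the problem.

```typescript
import { SchemaType } from "@google/generative-ai";
import { z } from "zod";

// Minimal local stand-in for Gemini's schema shape.
type GeminiSchema = {
  type: SchemaType;
  items?: GeminiSchema;
  properties?: Record<string, GeminiSchema>;
};

// Hypothetical converter for the Zod subset Gemini can express.
// z.record() and z.tuple() have no Gemini equivalent, so they can't be mapped.
function zodToGeminiSchema(schema: z.ZodTypeAny): GeminiSchema {
  if (schema instanceof z.ZodString) return { type: SchemaType.STRING };
  if (schema instanceof z.ZodNumber) return { type: SchemaType.NUMBER };
  if (schema instanceof z.ZodBoolean) return { type: SchemaType.BOOLEAN };
  if (schema instanceof z.ZodArray) {
    return { type: SchemaType.ARRAY, items: zodToGeminiSchema(schema.element) };
  }
  if (schema instanceof z.ZodObject) {
    const properties: Record<string, GeminiSchema> = {};
    for (const [key, value] of Object.entries(schema.shape)) {
      properties[key] = zodToGeminiSchema(value as z.ZodTypeAny);
    }
    return { type: SchemaType.OBJECT, properties };
  }
  throw new Error(`No Gemini schema equivalent for ${schema._def.typeName}`);
}
```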
The Classic Google Documentation Problem
Like many Google products, Gemini suffers from fragmented, contradictory documentation:
- Different pages showing different information for the same topics
- Contradictory pricing information across Google domains
- Confusion between Gemini and Vertex AI that leaves you wondering what product you're actually using
Unclear Error Messages
The error messages are often useless:
- If you pass an invalid model name, you get a "400 no body" error rather than a clear "invalid model" message.
- JSON formatting issues: the model starts returning JSON but never finishes it, leaving a truncated response with a missing closing } or quote.
- Like OpenAI, no token usage information on failed requests
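My current mitigation is treating every response as suspect and failing loudly when the JSON doesn't close (a sketch):

```typescript
// Gemini sometimes stops mid-object, so parse defensively rather than trusting
// that responseMimeType: "application/json" guarantees valid JSON.
function parseModelJson(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    const opens = (raw.match(/[{[]/g) ?? []).length;
    const closes = (raw.match(/[}\]]/g) ?? []).length;
    throw new Error(
      opens > closes
        ? `Truncated JSON: ${opens - closes} unclosed bracket(s)`
        : "Malformed JSON from model"
    );
  }
}
```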
Anthropic (Claude): Premium Price, Premium Pain
While I haven't extensively used Claude, my findings are:
The Apple Approach
- Claude positions itself as the premium option in the LLM market
- Even their smallest models (Claude 3 Haiku) are significantly more expensive than GPT-4o-Mini or Gemini 2.0 Flash
- The pricing makes it hard to justify for simpler tasks like resume parsing
Technical Limitations
- No support for the OpenAI SDK, requiring separate integration work
- No proper schema enforcement mechanisms
- JSON support exists but lacks standardized schema definition capabilities
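The workaround most people reach for is pressing tool use into service as a schema channel: describe the output shape as a tool's input_schema and force the model to "call" it. A sketch (the schema fields are illustrative, and nothing actually validates the result against the schema):

```typescript
import Anthropic from "@anthropic-ai/sdk";

declare const resumeText: string; // extracted earlier with PDFJS

const anthropic = new Anthropic();
const message = await anthropic.messages.create({
  model: "claude-3-haiku-20240307",
  max_tokens: 4096,
  tools: [{
    name: "record_resume",
    description: "Record structured resume data",
    input_schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string" }, // no format validation, just a plain string
      },
      required: ["name"],
    },
  }],
  tool_choice: { type: "tool", name: "record_resume" }, // force the "call"
  messages: [{ role: "user", content: `Extract the resume data:\n${resumeText}` }],
});

// The structured payload comes back as tool input, not ordinary text,
// and nothing guarantees it actually conforms to input_schema.
const block = message.content.find((b) => b.type === "tool_use");
const parsed = block?.type === "tool_use" ? block.input : undefined;
```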
DeepSeek: The Almost Alternative
DeepSeek offers OpenAI SDK compatibility but has limitations:
Cost Structure Problems
- DeepSeek is affordable compared to GPT-4o, but not compared to smaller models
- There are no smaller, cost-effective models available
- The per-million output token pricing is too expensive for simple tasks like resume parsing
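The compatibility story itself is painless, to be fair: point the OpenAI SDK at DeepSeek's base URL and the rest of your code stays the same (a sketch):

```typescript
import OpenAI from "openai";

declare const resumeText: string; // extracted earlier with PDFJS

// DeepSeek speaks the OpenAI wire protocol, so the same SDK works unchanged.
const deepseek = new OpenAI({
  baseURL: "https://api.deepseek.com",
  apiKey: process.env.DEEPSEEK_API_KEY,
});

const completion = await deepseek.chat.completions.create({
  model: "deepseek-chat", // the only general-purpose tier; no cheaper fallback
  response_format: { type: "json_object" },
  messages: [{ role: "user", content: `Return the resume as JSON:\n${resumeText}` }],
});
```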
The Deeper Problem: Unpredictability
The biggest issue across all these platforms is the sheer unpredictability:
- Even with schema enforcement, there's no guarantee data will be parsed correctly every time
- Sometimes it works flawlessly, other times it fails mysteriously
- There seem to be limits to schema depth and complexity that aren't documented anywhere
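So every call ends up wrapped in a validate-and-retry loop, which is precisely the boilerplate schema enforcement was supposed to eliminate. A sketch, where callLlm stands in for any of the provider calls above:

```typescript
import { z } from "zod";

// Hypothetical wrapper: callLlm is any provider call returning raw JSON text.
async function parseWithRetry<T>(
  schema: z.ZodType<T>,
  callLlm: () => Promise<string>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = schema.safeParse(JSON.parse(await callLlm()));
      if (result.success) return result.data;
      lastError = result.error; // "enforced" schema, yet it still mismatches
    } catch (err) {
      lastError = err; // truncated or malformed JSON
    }
  }
  throw lastError;
}
```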
Companies are investing heavily in model capabilities but neglecting developer experience. They want the spotlight on their fancy new models but aren't focusing on making them practical to implement, especially for solo developers or small teams.
What Developers Actually Need
Based on my experience, here's what would dramatically improve the LLM developer experience:
- Complete schema support: First-class support for standard validation libraries like Zod and Pydantic
- Better documentation: Clear documentation on limitations and supported features
- Improved error handling: Token usage information for failed requests and clear error messages
- Predictable parsing: If a schema is enforced, the data should reliably match it
- Better structured output support: More work on making the beta features production-ready
Bottom Line: Which One Sucks Least?
If you're choosing an LLM API today, here's my assessment:
- OpenAI: Powerful but frustrating. Use it if you need the larger output token window.
- Gemini: Faster but limited. Best if you're working with smaller documents.
- Claude: Premium quality at premium prices. Potentially better for enterprise use.
- DeepSeek: Good alternative to GPT-4o, but still expensive for simple tasks since there are no mid-size or smaller models.
My verdict: go with Gemini, since it has a decent knowledge cutoff and is cheap enough for simpler tasks. But the constant patching won't go away anytime soon.
The current landscape feels like companies are rushing to release new models rather than fixing fundamental developer experience issues. Structured outputs have been in beta for 7-8 months with little improvement, while new models keep coming out every few weeks.
I'm still using a mix of these APIs depending on the specific task, but none provides a truly satisfying developer experience. OpenAI has the advantage of the larger token window, while Gemini wins on speed, but all of them make me want to tear my hair out on a regular basis.
Until these issues are addressed, we developers will continue patching together solutions and fighting with inconsistent APIs. If you've faced similar challenges or found workarounds, I'd love to hear about them.
For now, I'm back to debugging why my perfectly valid Zod schema is causing yet another cryptic error. The joys of working on the bleeding edge, I suppose.