AI API Token Costs Explained: How GPT, Claude & Gemini Pricing Actually Works
A practical breakdown of how LLM API pricing works, what actually drives your bill up, and how to estimate and cut costs across GPT, Claude, Gemini, and DeepSeek.
If you have ever opened your LLM provider's billing dashboard and felt a small jolt of confusion — "wait, why did this cost more than that?" — you are not alone. Token-based pricing looks simple from a distance: you pay per token, tokens map roughly to words, multiply and you're done. In practice, the bill is shaped by a handful of mechanics that are easy to miss until they have already cost you money.
This guide walks through what a token actually is, how providers structure their pricing, the specific habits that quietly inflate API bills, and the levers you actually have to pull to bring costs down — without guessing.
What a "token" actually is
A token is not a word. It is a chunk of text — sometimes a whole word, sometimes a fragment, punctuation mark, or even a single character — produced by the model's tokenizer. As a rough rule of thumb for English text, one token is about four characters, or roughly three-quarters of a word. So 1,000 tokens is somewhere around 750 words, or about a page and a half of single-spaced text.
That ratio shifts significantly by content type and language:
- Code and structured data (JSON, XML, markdown tables) tokenize less efficiently than prose — symbols, indentation, and repeated keys all consume tokens.
- Non-English languages, especially those that don't use whitespace to separate words (Japanese, Chinese, Thai), often need more tokens per character than English.
- Numbers and unusual identifiers (UUIDs, hashes, SKUs) frequently get split into several tokens each.
The practical takeaway: don't estimate token counts by counting words. Run representative samples of your actual prompts — including real user input, not placeholder text — through a tokenizer or a calculator that mirrors the target model's tokenization.
How LLM API pricing is actually structured
Nearly every major provider — OpenAI, Anthropic, Google, and DeepSeek included — prices its models on a per-million-token basis, split into separate input and output rates. Two details matter more than the headline number:
1. Output tokens cost more than input tokens
Generating a response requires the model to run a forward pass for every single token it produces, one at a time. Reading a prompt, by contrast, can be processed largely in parallel. That asymmetry is why output (completion) tokens are typically priced at three to five times the rate of input (prompt) tokens across most providers. A request with a short prompt and a long, detailed answer can end up costing more than one with a long prompt and a terse reply.
2. Model tiers carry very different price-to-capability ratios
Every major lab now ships a "frontier" flagship model alongside smaller, faster, cheaper variants (mini, flash, haiku-class models). The cheaper tier can be five to twenty times less expensive per token — but it is not always the cheaper choice per task. If a smaller model needs two retries and a longer back-and-forth to get to a usable answer, the "savings" evaporate, and you've also burned extra latency and context budget.
The formula behind every API bill
Cost = (input tokens ÷ 1,000,000 × input rate) + (output tokens ÷ 1,000,000 × output rate)
What actually drives your bill up (it's rarely the obvious thing)
Most unexpectedly large bills don't come from one big request — they come from a small inefficiency that repeats thousands of times. The usual suspects, roughly in order of impact:
Resending the full conversation history on every turn
Chat completion APIs are stateless: the model has no memory between requests. To maintain a conversation, your application must resend the entire message history — system prompt, every prior user message, and every prior assistant reply — on each new turn. In a 20-message conversation, the first exchange gets reprocessed (and rebilled) roughly twenty times. This compounding effect is the single largest hidden cost in most chat applications.
Bloated system prompts
A detailed system prompt with examples, formatting rules, and tool definitions might run 1,500–3,000 tokens. If that prompt is sent with every single request — and it has to be — it becomes a fixed tax on every interaction, whether or not it materially changes the response for that particular query.
Over-stuffed RAG context
Retrieval-augmented generation pipelines often retrieve more chunks "just in case" than the query actually needs. Each unnecessary chunk is pure input-token cost with no benefit — and in some cases it actively dilutes the signal the model needs to answer well.
Unbounded output length
Without an explicit output cap, models will sometimes produce longer responses than the use case requires — extra caveats, repeated summaries, verbose formatting. Since output tokens are the most expensive line item, this is often the fastest place to trim cost without trimming quality.
Comparing the major providers at a glance
Exact rates change frequently as providers ship new model generations, so rather than quoting figures that will be stale within months, here's how the landscape is structured — the part that tends to stay stable:
| Provider | Typical structure | Where to watch costs |
|---|---|---|
| OpenAI (GPT family) | Flagship + mini/nano tiers, separate input/output/cached-input rates | Long system prompts, function/tool call schemas resent each turn |
| Anthropic (Claude family) | Opus/Sonnet/Haiku tiers, prompt caching with steep discounts on cache hits | Whether your app actually structures prompts to take advantage of caching |
| Google (Gemini family) | Pro/Flash tiers, very large context windows, context-length-based pricing on some models | Requests that creep past pricing-tier thresholds on long-context models |
| DeepSeek | Aggressively priced relative to frontier labs, off-peak discount windows on some plans | Whether the lower per-token price actually nets out once retries are factored in |
Because rates and tiers shift often, the most reliable approach is to plug your actual prompt/response patterns into an up-to-date calculator rather than relying on a number you saw in an article six months ago — including this one.
Try it yourself
AI Token Calculator
Estimate token usage and compare API costs across GPT, Claude, Gemini, and DeepSeek side-by-side — paste a real prompt and get an instant breakdown.
Five ways to actually cut your API costs
- Cache or summarize conversation history. Instead of resending the full transcript every turn, summarize older turns into a compact running summary, or use a provider's prompt-caching feature so repeated prefixes (system prompt, tool definitions, long context) are billed at a fraction of the standard input rate on subsequent calls.
- Right-size your system prompt. Audit it for redundant instructions, unused examples, and formatting rules the model already follows by default. Every token you remove is removed from every request, forever — it compounds the opposite way that bloat does.
- Retrieve fewer, better chunks in RAG pipelines. Tune your retriever for precision over recall, and re-rank before stuffing context. Fewer, more relevant chunks usually improve answer quality while cutting input-token cost.
- Set explicit output limits. Use `max_tokens` (or the equivalent parameter) and prompt instructions that specify the expected response format and length. "Answer in two sentences" is a cost-control technique, not just a style preference.
- Match the model tier to the task, not the other way around. Route simple, high-volume tasks (classification, extraction, short rewrites) to smaller/faster models, and reserve frontier models for tasks that genuinely need their reasoning depth. Measure cost per successfully completed task, not cost per token, when comparing tiers.
Frequently asked questions
Why is my output more expensive than my input?
Most providers price output (completion) tokens at 3-5x the rate of input (prompt) tokens, because generating text requires a full forward pass per token while reading a prompt can be processed in parallel. A short prompt that produces a long answer can cost more than a long prompt that produces a one-word answer.
Does the conversation history really get billed every time?
Yes. Chat-based APIs are stateless — each request must include the full message history for the model to have any memory of the conversation. In a 20-turn conversation, the first message is sent, billed, and reprocessed roughly 20 times unless the provider offers prompt caching.
Is a 'cheaper' model always the better choice for cost control?
Not necessarily. A smaller model that needs three retries to produce a usable answer can cost more in aggregate than a stronger model that succeeds on the first try, and it also burns more of your context budget on failed attempts. Cost per task matters more than cost per token.
How do I estimate my costs before launching a feature?
Estimate average input and output token counts per request (system prompt + history + user message, and expected response length), multiply by your expected daily request volume, and apply the per-million-token rates for your chosen model. Our AI Token Calculator does this conversion for you across GPT, Claude, Gemini, and DeepSeek.
Token pricing rewards precision: the closer your estimate is to your real-world prompt and response patterns, the fewer surprises you'll see on the invoice. Run your actual system prompt, a representative chat history, and your expected output length through the AI Token Calculator above before you ship — it takes less time than reading this sentence twice.