Short answer: Claude token cost is not just the message you type. In the raw Claude API, you pay for input tokens, output tokens, tool definitions, tool results, cache writes, cache reads, and thinking tokens when thinking is enabled. In Claude Code, there is extra project context on top: conversation history, CLAUDE.md, skills, tools, MCP context, file reads, and command output.
This guide separates Claude API, Claude Code, and claude.ai so you can estimate cost correctly instead of blaming one mysterious "hidden prompt" for everything.
Quick Takeaways
- API calls are stateless: your application sends the prompt, tools, and conversation history that you want Claude to see.
- Claude Code is heavier than a simple API call: it carries coding context, project instructions, tool output, and session history.
- Prompt caching saves money, not context: cached tokens still occupy the context window.
- Thinking tokens are billed as output: Opus and Sonnet reasoning can be useful, but it is not free.
- MCP behavior has changed: Claude Code now defers MCP tool definitions by default, so unused servers are still worth disabling, but the old "all schemas always load" claim is too broad.
What Is a Token?
A token is the unit of text a model processes. It is usually a word fragment, not a full word. Anthropic's pricing docs use the rough rule that 1 token is about 4 characters or 0.75 English words, but exact counts vary by model and content.
# Approximate token counts
Hello ~1 token
Hello, world! ~4 tokens
authentication ~3 tokens
1 paragraph ~50-100 tokens
1 page of prose ~400-500 tokens
1,000 lines of code ~3,000-5,000 tokens
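The 4-characters-per-token rule above can be turned into a quick budgeting helper. This is a rough sketch built only on that heuristic, not a real tokenizer; for exact counts, use the Token Counting API shown later in this article.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule.

    This is a budgeting heuristic only; real counts vary by model,
    language, and content (code often tokenizes differently than prose).
    """
    return max(1, round(len(text) / 4))

# A 2,000-character page of prose lands near 500 tokens by this rule,
# which matches the rough ranges above only loosely. Always verify
# with the Token Counting API before committing to a budget.
print(estimate_tokens("x" * 2000))
```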
Opus 4.7 note: Anthropic says Claude Opus 4.7 uses a new tokenizer that may use up to about 35% more tokens for the same text compared with previous models. If you move a workflow from Opus 4.6 to Opus 4.7, re-run token counts before assuming the old budget still holds.
API vs Claude Code vs claude.ai
The biggest source of confusion is treating every Claude product as if it bills and behaves the same way. They do not.
claude.ai is different again: the Free, Pro, Max, Team, and Enterprise plans are subscription products with usage limits. The everyday consumer plan experience is not the same as directly paying the API invoice per million tokens, so keep API pricing examples separate from subscription-plan expectations.
The Claude Token Cost Formula
For API workloads, the simplest mental model is:
total_cost =
uncached_input_tokens * input_price
+ cache_write_tokens * cache_write_price
+ cache_read_tokens * cache_read_price
+ output_tokens * output_price
+ server_tool_charges * tool_price
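The formula above can be written as a small helper. The default prices below are the Sonnet headline rates and cache multipliers quoted later in this article; treat them as placeholders and plug in current pricing before relying on the output.

```python
def api_cost(
    uncached_input: int,
    output: int,
    cache_write: int = 0,
    cache_read: int = 0,
    input_price: float = 3.00,       # $/MTok, Sonnet headline rate (placeholder)
    output_price: float = 15.00,     # $/MTok (placeholder)
    cache_write_mult: float = 1.25,  # 5-minute cache write multiplier
    cache_read_mult: float = 0.10,   # cache read multiplier
) -> float:
    """Estimate dollar cost of one API call (server-tool charges excluded)."""
    per_tok = 1 / 1_000_000
    return (
        uncached_input * input_price * per_tok
        + cache_write * input_price * cache_write_mult * per_tok
        + cache_read * input_price * cache_read_mult * per_tok
        + output * output_price * per_tok
    )

# The typical Sonnet example from this article: 12,000 input + 800 output tokens
print(round(api_cost(12_000, 800), 3))  # 0.048
```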
For Claude Code, add another layer: session context. File reads, command output, prior turns, subagent summaries, and project instructions can stay in the conversation and increase the size of later turns.
Three beginner-friendly cost examples
# 1) Typical Sonnet API call
Input: 12,000 tokens * $3 / 1,000,000 = $0.036
Output: 800 tokens * $15 / 1,000,000 = $0.012
Total: roughly $0.048
# 2) Cheap Haiku extraction job
Input: 100,000 tokens * $1 / 1,000,000 = $0.10
Output: 2,000 tokens * $5 / 1,000,000 = $0.01
Total: roughly $0.11
# 3) Opus reasoning request
Input: 80,000 tokens * $5 / 1,000,000 = $0.40
Output: 8,000 tokens * $25 / 1,000,000 = $0.20
Total: roughly $0.60
These examples are intentionally small and round. The point is to learn the shape: large input makes retrieval and coding sessions expensive, while long answers and thinking make output cost jump quickly.
Current Claude API Pricing
Last reviewed: May 8, 2026. The official Claude API pricing page lists these headline rates. Always re-check the source before publishing a pricing-sensitive article because model pricing changes quickly.
| Model | Input / MTok | Output / MTok | Context notes |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | Fast, cost-efficient workloads and smaller automations |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Strong default; includes 1M context at standard pricing |
| Claude Opus 4.7 | $5.00 | $25.00 | Hard reasoning; includes 1M context at standard pricing |
# Prompt caching multipliers
5-minute cache write = 1.25x base input price
1-hour cache write = 2.00x base input price
Cache read = 0.10x base input price
# Batch API
50% discount on input and output tokens for asynchronous bulk work
Pricing modifiers beginners miss
- 1M context: Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M-token context window at standard pricing. A 900k-token request still costs far more than a 9k-token request because it contains more tokens, but it does not trigger a separate long-context premium for those models.
- Data residency: on the first-party Claude API, US-only inference through inference_geo adds a 1.1x multiplier for Claude Opus 4.7, Opus 4.6, and newer models.
- Third-party platforms: AWS Bedrock and Google Vertex AI regional or multi-region endpoint choices can add a 10% premium for newer Claude models. Do not copy first-party API pricing directly into a Bedrock or Vertex AI budget.
- Fast mode: Opus 4.6 fast mode is a beta feature with 6x standard token pricing and is not available with the Batch API.
Sources: Claude API pricing, context windows, extended thinking, prompt caching, Claude Code costs, and Usage and Cost API.
Hidden Cost 1: System and App Instructions
Every application has instructions that shape model behavior. In a raw API call, this is the system prompt and any messages you send. In Claude Code, the product also manages a coding-oriented environment and session state around your work.
A precise number like 14,328 tokens can be useful as a snapshot of one Claude Code session, but it should not be presented as a universal cost for every Claude API request. Instead, measure your actual session with /context or inspect the usage fields in your API responses.
Hidden Cost 2: CLAUDE.md, Skills, and Project Memory
Claude Code can load project instructions such as CLAUDE.md. That is helpful, but a long file becomes a repeated cost. Anthropic's current guidance is to keep the base file focused and move specialized workflow instructions into skills so they load on demand.
# Better CLAUDE.md pattern
- Keep repo-wide rules only
- Link to docs instead of pasting long docs
- Move PR review, deployment, or database workflows into skills
- Keep examples short and delete stale notes
# Check impact
/context
/usage
Hidden Cost 3: MCP and Tool Context
Old advice often says every MCP server injects every full schema into every request. That claim is now too broad. Current Claude Code docs say MCP tool definitions are deferred by default, meaning only tool names enter context until Claude uses a specific tool.
The practical advice is still similar: disable unused MCP servers, prefer direct CLI tools such as gh, aws, or gcloud when they are more context-efficient, and run /context to see what is actually consuming space.
Hidden Cost 4: Conversation History
LLM APIs are stateless at the request boundary. Your app or client sends the relevant conversation history again so Claude can continue. That means long sessions get more expensive because later turns include more accumulated context.
- Turn 1: system instructions + user message + first answer.
- Turn 15: all useful prior context plus the new message.
- Long coding session: file reads, test output, error logs, and tool results can remain in the context until compacted or cleared.
Use /compact after a meaningful milestone and /clear when switching to an unrelated task. Do not carry yesterday's debugging context into today's documentation task.
Hidden Cost 5: Thinking Tokens
With extended or adaptive thinking, Claude can spend tokens reasoning before producing the visible answer. Anthropic documents that those thinking tokens are billed as output tokens. On Opus 4.7, adaptive thinking is the supported thinking mode; manual fixed thinking budgets are no longer accepted for Opus 4.7.
If reasoning is not needed, lower the effort level or disable thinking for simple transformations, formatting tasks, or small edits. If reasoning quality matters, keep it on and budget for it explicitly.
Hidden Cost 6: Tool Results
Tool calls are not just "actions." They create text that may enter the conversation. A 400-line stack trace, a full file read, or a large web fetch can add thousands of input tokens to later turns.
- Read exact file ranges instead of whole files.
- Filter logs before returning them to Claude.
- Use grep-like commands to locate relevant lines before reading full context.
- Summarize long command output before continuing a session.
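The filtering advice above can be as simple as trimming tool output before it re-enters the conversation. A minimal sketch, assuming plain-text logs; the keyword and line counts are arbitrary defaults, and deduplication is a simplification that a real pipeline might not want:

```python
def trim_log(log: str, keyword: str = "error", tail_lines: int = 20) -> str:
    """Keep only lines matching a keyword plus the final lines of output,
    so a tool result adds dozens of tokens instead of thousands."""
    lines = log.splitlines()
    matched = [line for line in lines if keyword.lower() in line.lower()]
    tail = lines[-tail_lines:]
    # dict.fromkeys preserves order and drops duplicate lines between the sets
    kept = list(dict.fromkeys(matched + tail))
    return "\n".join(kept)

# 402 noisy lines collapse to a handful of relevant ones
noisy = "\n".join(["INFO ok"] * 400 + ["ERROR boom", "INFO done"])
print(len(noisy.splitlines()), "->", len(trim_log(noisy).splitlines()))
```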
Hidden Cost 7: API Tool Use and Server Tools
In the Claude API, tools are part of the request. Tool names, descriptions, JSON schemas, tool_use blocks, and tool_result blocks can all add input tokens. Anthropic also includes a small tool-use system prompt when tools are enabled. For Claude 4.x models, the pricing docs list a few hundred extra tool-use system prompt tokens depending on tool choice, before counting your own tool schemas and results.
- Client-side tools: usually cost normal input and output tokens, including schemas and tool results.
- Server-side tools: can add usage-based charges on top of tokens, such as web search requests or code execution time.
- Bash and editor tools: can add fixed input-token overhead plus command output, errors, and file contents.
- Beginner rule: do not send a giant universal tool list to every request. Send the smallest useful tool set for that workflow.
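One way to follow that beginner rule is a per-workflow tool registry, so each request only ships the schemas it needs. The tool names and schemas below are hypothetical, purely for illustration; only the schema shape (name, description, input_schema) follows the API's tool format.

```python
# Hypothetical per-workflow tool registry. Tool names and fields are
# invented for illustration; substitute your own tools.
TOOLSETS = {
    "support": [{
        "name": "lookup_ticket",
        "description": "Fetch a support ticket by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    }],
    "docs": [{
        "name": "search_docs",
        "description": "Search internal documentation.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }],
}

def tools_for(workflow: str) -> list:
    """Return only the tools this workflow needs, not the whole catalog."""
    return TOOLSETS[workflow]
```

Passing `tools=tools_for("support")` to a request keeps the tool context to one schema instead of the full catalog.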
Hidden Cost 8: Claude Code Background Work and Subagents
Claude Code is not just a single chat request. It can use background tokens for features such as summaries and model-assisted session management. If you use subagents or agent teams, each agent may have its own context window. That can be excellent for parallel work, but it also means token usage scales with the number of active agents and the amount of context each one loads.
- Subagents are useful: delegate verbose searches, test runs, and log analysis so only a summary comes back to your main context.
- Subagents are not free: they still spend tokens in their own context window.
- Agent teams need budget discipline: keep team size small, keep prompts focused, and stop agents when the work is done.
- Auto-compaction helps: it can summarize long conversations when context gets large, but you should still use /compact and /clear intentionally.
Prompt Caching Break-Even
Prompt caching is usually the highest-leverage API optimization when your prompt has a repeated prefix: tool definitions, system instructions, large documents, or a stable conversation prefix.
# Example: 100,000-token reusable prefix on Opus 4.7
Normal input cost: 100,000 * $5 / 1,000,000 = $0.50
5-minute cache write: 100,000 * $6.25 / 1,000,000 = $0.625
Cache read after write: 100,000 * $0.50 / 1,000,000 = $0.05
# Break-even
One cache hit after a 5-minute write is usually enough to beat paying full price twice.
Important: cached tokens still count toward the context window. Caching reduces price and latency; it does not make the prompt smaller.
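The break-even math above generalizes: caching pays off once the savings from cheap reads outweigh the one-time write premium. A sketch using this article's Opus numbers and the standard 5-minute cache multipliers:

```python
def cache_break_even(prefix_tokens: int, input_price: float,
                     write_mult: float = 1.25, read_mult: float = 0.10):
    """Return cost functions for N uses with and without a prompt cache.

    Prices are $/MTok; multipliers are the standard 5-minute cache rates.
    """
    base = prefix_tokens * input_price / 1_000_000
    write = base * write_mult
    read = base * read_mult

    def cached(n: int) -> float:
        # first use writes the cache, later uses read it
        return write + (n - 1) * read

    def uncached(n: int) -> float:
        return n * base

    return cached, uncached

cached, uncached = cache_break_even(100_000, input_price=5.00)
print(cached(1), uncached(1))  # ~0.625 vs 0.5: one use alone costs more
print(cached(2), uncached(2))  # ~0.675 vs 1.0: break-even by the second use
```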
How to Reduce Claude Token Spend
1. Use the right model
- Haiku: extraction, classification, routing, simple rewriting, structured transformations.
- Sonnet: most coding, analysis, planning, documentation, and agentic workflows.
- Opus: high-stakes reasoning, hard architecture decisions, difficult debugging, long-horizon agent work.
2. Keep Claude Code context clean
- Use /usage to see session token and cost estimates.
- Use /context to inspect what is consuming the context window.
- Use /compact after finishing a subtask.
- Use /clear before switching to unrelated work.
3. Reduce MCP and tool overhead
- Disable MCP servers you are not actively using.
- Prefer CLI tools when they return a smaller answer than an MCP integration.
- Keep tool output focused: file ranges, filtered logs, and summarized results.
4. Cache repeated API prefixes
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# stable_system_prompt and messages are defined elsewhere in your app
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": stable_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=messages
)
5. Use batch processing for non-urgent bulk work
If the workload can wait, such as summarizing thousands of tickets, generating descriptions, or processing a large dataset, the Batch API can reduce both input and output token cost by 50%.
Real Cost Example
Here is a simplified 30-turn Claude Code-style session using Opus 4.7. Exact numbers vary, but the shape is what matters.
# Unoptimized 30-turn coding session
Project/context input: 400,000 tokens
Conversation history: 1,300,000 tokens
Tool results: 250,000 tokens
Fresh user messages: 30,000 tokens
Thinking tokens: 60,000 output tokens
Visible answer output: 30,000 output tokens
Input cost: 1,980,000 * $5 / 1,000,000 = $9.90
Output cost: 90,000 * $25 / 1,000,000 = $2.25
Total: roughly $12.15
# Optimized shape
- repeated context mostly cache reads
- shorter CLAUDE.md
- fewer broad file reads
- disabled unused MCP servers
- compacted after milestones
Typical result: 40-70% lower spend for the same work.
Monitoring Token Usage
Cost control only works when you measure at the right level. For one API request, use the response usage object. Before sending a large request, use the Token Counting API. For teams, use the Admin Usage and Cost APIs so you can group spend by model, workspace, API key, service tier, data residency, fast mode, and server-side tool usage.
Inside Claude Code, /usage is useful for the current session, but the dollar figure is an estimate computed from token counts. For authoritative API billing, use the Claude Console. For subscription users, plan usage bars are more relevant than raw API invoice math.
# Claude Code
/usage # current session usage and estimated cost
/context # context window breakdown
/compact # summarize and shrink session history
/clear # start fresh for unrelated work
# Count before sending a big API request
client.messages.count_tokens(
model="claude-opus-4-7",
messages=[{"role": "user", "content": long_prompt}]
)
# API response usage object
{
"usage": {
"input_tokens": 15234,
"output_tokens": 892,
"cache_creation_input_tokens": 14328,
"cache_read_input_tokens": 42000
}
}
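The usage fields above are enough to reconstruct the dollar cost of a request. A sketch assuming Opus headline rates and the standard 5-minute cache multipliers; the field names match the usage object shown above, while the prices are placeholders you should replace with current rates.

```python
def cost_from_usage(usage: dict,
                    input_price: float = 5.00,     # $/MTok, Opus rate (placeholder)
                    output_price: float = 25.00) -> float:
    """Reconstruct request cost from the API response usage object.

    Assumes input_tokens counts only uncached input, with cached tokens
    reported separately in the cache_* fields.
    """
    per = 1 / 1_000_000
    return (
        usage.get("input_tokens", 0) * input_price * per
        + usage.get("output_tokens", 0) * output_price * per
        + usage.get("cache_creation_input_tokens", 0) * input_price * 1.25 * per
        + usage.get("cache_read_input_tokens", 0) * input_price * 0.10 * per
    )

usage = {
    "input_tokens": 15234,
    "output_tokens": 892,
    "cache_creation_input_tokens": 14328,
    "cache_read_input_tokens": 42000,
}
print(round(cost_from_usage(usage), 4))  # ~0.209 with these placeholder rates
```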
What to monitor on a real product
- Cost per feature: chat, summarization, code generation, support bot, document analysis.
- Cost per customer or workspace: find out who drives spend before adding global limits.
- Cache hit ratio: high repeated prefixes should produce cache reads, not full input charges every time.
- Model mix: track when Opus is used, and confirm it is reserved for work that needs it.
- Tool and server-tool usage: web search, code execution, large file reads, and big tool results can hide inside aggregate spend.
FAQ
Are Claude thinking tokens billed?
Yes. Anthropic documents that thinking tokens are billed as output tokens. If thinking is summarized or omitted from the visible response, the full thinking process can still be billed.
Does prompt caching reduce context size?
No. Prompt caching reduces price and latency for repeated prompt prefixes. Cached tokens still count toward the context window.
Why does Claude Code use more tokens than a simple API call?
Claude Code is an agentic coding environment. It may include project instructions, conversation history, tool results, file reads, command output, skills, and MCP/tool context. A tiny user message can ride on top of a much larger coding session context.
Do MCP servers always load every full schema?
Not in current Claude Code guidance. MCP tool definitions are deferred by default, but tool names, selected tools, tool calls, and tool results can still affect context. Disable unused servers and measure with /context.
Does 1M context always mean premium long-context pricing?
No. Current Claude API pricing says Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M-token context window at standard pricing. You still pay for every token, so a huge request is expensive, but these models do not add a separate long-context premium for using the 1M window.
Is the Claude Code /usage dollar amount my final bill?
Not necessarily. Claude Code reports useful session token and estimated cost information, but the dollar figure is computed locally and may differ from your actual bill. For API billing, use the Claude Console. For Pro or Max subscription users, included plan usage is not the same as a per-token API invoice.
Can data residency, fast mode, or server tools change the price?
Yes. US-only inference can add a 1.1x token multiplier for supported models, Opus 4.6 fast mode is priced at 6x standard token rates, and server-side tools such as web search or code execution can add usage-based charges beyond normal token cost.
What is the fastest way to cut Claude token cost?
Use Sonnet for default coding work, cache repeated API prefixes, compact long Claude Code sessions, trim CLAUDE.md, disable unused MCP servers, monitor tool usage, and avoid dumping full files or full logs into context.
Final Checklist
- Separate API pricing from Claude Code and claude.ai subscription behavior.
- Use current model pricing and add a "last reviewed" note near pricing tables.
- Measure with /usage, /context, and API usage fields.
- Use the Token Counting API before sending large prompts or documents.
- Use the Admin Usage and Cost APIs when you need team-level cost attribution.
- Use prompt caching for stable prefixes, but remember it does not reduce context size.
- Account for data residency, regional endpoints, fast mode, Batch API, and server-side tool charges.
- Move specialized CLAUDE.md instructions into skills.
- Prefer focused tool output over broad file reads and unfiltered logs.
- Use adaptive thinking only when the task needs deeper reasoning.
- Use Batch API for non-urgent bulk processing.