Claude Code Cost Optimization: Practical Techniques to Cut Token Spend Without Losing Productivity

Claude Code is genuinely useful. It is also genuinely expensive if you are not paying attention. A few days of heavy usage on a Max plan can feel manageable, but the same workflow multiplied across a team — or left running in CI — can produce a bill that surprises everyone.

The good news is that most Claude Code cost problems are solvable at the workflow level, not the budget level. The token burn typically comes from a handful of well-understood patterns: bloated CLAUDE.md files, poor file selection, redundant context, and choosing the wrong model for the task. Fix those, and costs often drop significantly with no loss in output quality.

This guide covers the practical techniques, in order of impact.

Understanding Where Tokens Actually Go

Before optimizing, it helps to know what you are actually paying for.

Claude Code usage is measured in tokens: input tokens (what you send, including conversation history, files read, and tool outputs) and output tokens (what Claude writes back). Through the API you pay per token; on Pro and Max subscriptions the same tokens count against your usage limits. Input tokens are priced lower than output tokens, but in real Claude Code sessions input almost always dominates the total.

Here is how a typical session breaks down:

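CLAUDE.md is in context on every request. It loads at session start and is re-added after each /compact. Because each request to the model resends the current context, its tokens are billed every time (at the discounted cache-read rate when prompt caching hits). Over a session of 40 model requests, a 3,000-token CLAUDE.md accounts for 120,000 input tokens before caching, not 3,000.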

Tool outputs accumulate fast. Every bash execution, file read, and glob search returns output that goes into context. A grep -r across a large codebase can return thousands of lines. A failing test suite dumps its full output. These numbers add up.

File reads are full copies. When Claude Code reads a file, the entire file goes into the context window. If you have Claude read five large files at the start of a session to “understand the codebase,” you might have consumed 20,000–50,000 input tokens before doing anything.

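Conversation history compounds. Your messages, Claude’s responses, and the back-and-forth make the history grow linearly throughout the session. Because each request resends that history, the total input you pay for grows faster than linearly. A verbose 60-message session can carry tens of thousands of tokens of history alone.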

Understanding this breakdown tells you where to intervene. The highest-leverage changes are: shrinking what loads automatically, being surgical about what gets read, and choosing the right model for each type of task.
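The compounding is easier to see with numbers. Below is a rough Python model whose simplifying assumptions are stated in the docstring: fixed growth per turn, every request resending CLAUDE.md plus all history, and no caching or compaction. session_input_tokens and the figures are illustrative, not measurements of Claude Code.

```python
def session_input_tokens(claude_md_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Rough total of input tokens billed over one session, before caching.

    Simplifying assumptions (not measured behavior):
    - every request resends CLAUDE.md plus all history so far
    - each turn (prompt, tool output, reply) adds a fixed number of tokens
    - no compaction and no prompt caching
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn           # the context grows linearly...
        total += claude_md_tokens + history  # ...but every request pays for all of it
    return total


if __name__ == "__main__":
    for turns in (10, 20, 40):
        print(turns, session_input_tokens(3_000, 1_000, turns))
```

With a 3,000-token CLAUDE.md and 1,000 tokens of new context per turn, the model gives 85,000, 270,000, and 940,000 tokens for 10, 20, and 40 turns. Each doubling of session length more than triples the total. Prompt caching discounts much of this in practice, but the shape of the curve is why shorter sessions and lean always-loaded context pay off.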

Model Selection: The Single Biggest Cost Lever

Not all Claude Code tasks require Opus. Using the wrong model for the task is the single biggest driver of unnecessary cost.

Opus 4.6: Use for complex reasoning, architectural decisions, debugging hard multi-file problems, code review requiring contextual judgment. This is the expensive model. Use it deliberately.

Sonnet 4.5/4.6: The right choice for most everyday coding tasks — implementing a specified feature, refactoring code to a defined standard, writing tests for existing code. Quality is excellent, cost is substantially lower.

Haiku (if available in your plan): Appropriate for highly repetitive tasks, summarization, simple edits where you have already specified the change precisely.

The practical problem is that Claude Code keeps using the session’s current model for your main conversation until you change it. The fix is to switch deliberately, using the /model command mid-session or the --model flag at launch. Where your team uses Claude Code via the API, also configure model selection by task type.

In CLAUDE.md, you can instruct Claude to use a lighter model for specific categories:

## Model Guidance

For the following tasks, you may use a more efficient model if available:
- Generating repetitive boilerplate (e.g., adding the same interface to 20 files)
- Summarizing file contents for context
- Simple variable renames or formatting fixes

Use full reasoning for:
- Any architectural change
- Bug root-cause analysis
- Security-sensitive changes
- Database schema or API contract changes

This does not directly control which model runs (that depends on your client configuration), but it signals intent and helps when Claude Code is making autonomous decisions in multi-agent setups.
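For teams that drive Claude Code from scripts, task-type routing can be made explicit outside CLAUDE.md. The Python sketch below builds a headless claude -p command with the --model flag chosen from a lookup table, using the opus, sonnet, and haiku aliases. The categories and mapping are assumptions for you to define, and MODEL_FOR_TASK and claude_command are hypothetical names. Haiku may not be available on every plan.

```python
import shlex

# Hypothetical task-type -> model-alias table. The categories mirror the
# Model Guidance list above and are yours to adjust.
MODEL_FOR_TASK = {
    "boilerplate": "haiku",
    "summarize": "haiku",
    "rename": "haiku",
    "feature": "sonnet",
    "tests": "sonnet",
    "refactor": "sonnet",
    "architecture": "opus",
    "root-cause": "opus",
    "security": "opus",
    "schema": "opus",
}


def claude_command(task_type: str, prompt: str) -> list[str]:
    """Build a headless Claude Code invocation with an explicit --model.

    Unknown task types fall back to the most capable model, so a typo
    never silently downgrades a risky change.
    """
    model = MODEL_FOR_TASK.get(task_type, "opus")
    return ["claude", "--model", model, "-p", prompt]


if __name__ == "__main__":
    cmd = claude_command("tests", "Write unit tests for src/utils/user.ts")
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd) to actually execute it
```

Falling back to the strongest model is a deliberate choice. An unrecognized category costs more, but it never sends an architectural or security change to a lighter model by accident.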

Prompt Caching: Free Optimization You Might Not Be Using

Anthropic supports prompt caching, and Claude Code uses it automatically for certain content. But understanding how it works lets you design your workflows to maximize cache hits.

The way prompt caching works: if the beginning of your prompt context matches a previous request exactly, Anthropic can serve the cached portion at a reduced price. As of early 2026, cache reads cost roughly 10% of the standard input token price. Writing a new cache entry costs 25% more than a normal input token (for the default five-minute cache).

For Claude Code, this means your CLAUDE.md content is a strong caching candidate if it remains stable, because it is part of the context sent with every request. Any change to CLAUDE.md invalidates the cache from the point of the change onward.

Practical implications:

Keep the stable portions of CLAUDE.md at the top. Information that never changes (tech stack, core conventions, project structure) should come before information that changes frequently (ongoing task notes, recent decisions). Cache invalidation propagates forward: a change to line 10 invalidates caching for line 10 onwards.

Separate volatile from stable content. Consider using CLAUDE.md for permanent context and a separate CURRENT_CONTEXT.md (or similar) for session-specific notes that you load explicitly when needed, rather than on every session.

Avoid storing timestamps or auto-generated content in CLAUDE.md. A timestamp at the top of the file that changes on every run invalidates the cached copy of the entire file every time.
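The ordering advice can be illustrated with a simplified model. This Python sketch treats caching as reuse of the longest unchanged run of leading lines. Real prompt caching matches at cache breakpoints, so it is coarser than this. cacheable_prefix_lines is a hypothetical helper.

```python
def cacheable_prefix_lines(old: str, new: str) -> int:
    """Count the leading lines two versions of a file have in common.

    Simplified model of prefix caching: only the unchanged beginning of the
    prompt can be reused, so everything from the first differing line on is
    processed (and billed) again. Real caches match at cache breakpoints,
    which is coarser than per-line.
    """
    shared = 0
    for a, b in zip(old.splitlines(), new.splitlines()):
        if a != b:
            break
        shared += 1
    return shared


if __name__ == "__main__":
    stable = [f"Convention {i}" for i in range(1, 21)]  # 20 lines that rarely change

    # Volatile note first: one edit invalidates everything after it.
    top = cacheable_prefix_lines("\n".join(["Task: fix login"] + stable),
                                 "\n".join(["Task: fix checkout"] + stable))
    # Volatile note last: the stable block stays reusable.
    bottom = cacheable_prefix_lines("\n".join(stable + ["Task: fix login"]),
                                    "\n".join(stable + ["Task: fix checkout"]))
    print(top, bottom)
```

This prints 0 20. The same one-line task edit leaves nothing reusable when it sits at the top, but leaves all 20 stable lines reusable when it sits at the bottom.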

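In team setups where multiple engineers use the same CLAUDE.md, stable shared content produces significant cache savings. A 2,000-token CLAUDE.md accessed by 10 engineers 50 times per day is 1 million tokens per day. If every access were a cache hit at 10% of the input price, that component would cost a tenth as much. In practice the hit rate is lower. Cache entries expire after five minutes without use by default, and a hit also requires everything earlier in the prompt to match exactly.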
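The arithmetic above assumes every read is a hit. The Python sketch below also prices misses as cache writes. It assumes an illustrative base price of $3 per million input tokens and the 0.1x read and 1.25x write multipliers of the five-minute cache. daily_prompt_cost is a hypothetical helper; substitute your model's current prices.

```python
def daily_prompt_cost(tokens: int, engineers: int, loads_per_day: int,
                      price_per_mtok: float, hit_rate: float = 1.0,
                      read_mult: float = 0.10, write_mult: float = 1.25):
    """Return (tokens/day, uncached $/day, cached $/day) for one prompt block.

    Cache hits are billed at read_mult x the base input price; misses are
    billed as fresh cache writes at write_mult x (five-minute cache pricing).
    """
    daily_tokens = tokens * engineers * loads_per_day
    base = price_per_mtok / 1_000_000  # dollars per input token
    uncached = daily_tokens * base
    hits = daily_tokens * hit_rate
    misses = daily_tokens - hits
    cached = (hits * read_mult + misses * write_mult) * base
    return daily_tokens, uncached, cached


if __name__ == "__main__":
    # The example above, at an assumed $3 per million input tokens.
    for rate in (1.0, 0.9, 0.5):
        n, full, cached = daily_prompt_cost(2_000, 10, 50, 3.0, hit_rate=rate)
        print(f"hit rate {rate:.0%}: {n:,} tokens, ${full:.2f} uncached, ${cached:.2f} cached")
```

At a 100% hit rate the component drops from $3.00 to $0.30 a day. At a 50% hit rate the cached cost is about $2, only around a 1.5x saving, because every miss pays the write surcharge. A stable, shared CLAUDE.md matters precisely because it keeps the hit rate high.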

Designing a Lean CLAUDE.md

The most consistent recommendation from developers who have audited their Claude Code costs is: your CLAUDE.md is probably bigger than it needs to be.

Common bloat patterns:

Explaining what Claude already knows. Descriptions of how React works, how TypeScript interfaces work, what REST means — these are unnecessary. Claude Code knows these things. Your CLAUDE.md should contain information Claude cannot infer from the codebase: your specific conventions, decisions, and constraints.

Duplicating information that is in the code. If your package.json shows you use pnpm, you do not need to state “we use pnpm” in CLAUDE.md. If your directory structure is self-evident, do not describe it in prose.

Stale content. Task lists, notes from three months ago, references to libraries you removed. These add tokens every session and provide negative value (they may actively mislead).

Verbose examples where a short rule would suffice. Instead of a 200-token example showing the correct way to write a function, write: “Functions should return typed objects, never any. See src/utils/user.ts as the canonical example.” This points Claude to the real example without copying it into CLAUDE.md.

A lean CLAUDE.md for a typical project should be 500–1,500 tokens. If yours is over 3,000 tokens, audit it. For a detailed approach to this, see our guide on CLAUDE.md token budget optimization.

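CLAUDE.md can also pull in other files with @path/to/file imports, which keeps rules organized in separate files. Imported files load whenever the importing CLAUDE.md does, so imports by themselves do not save tokens. To avoid loading backend rules during frontend sessions, put those rules in a CLAUDE.md inside the backend directory. Claude Code reads CLAUDE.md files in subdirectories only when it works with files there. See our token optimization deep dive for the specifics.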

.claudeignore: Stop Paying to Read Files You Do Not Need

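.claudeignore is intended as Claude Code’s equivalent of .gitignore: files and directories listed in it are kept out of Claude Code’s view of your project. Confirm that your Claude Code version honors it. The mechanism Anthropic documents for blocking file access is a permissions.deny list of Read(...) rules in .claude/settings.json, such as Read(./node_modules/**).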

By default, Claude Code can access and read any file in your project. In a large monorepo, that includes everything from generated files to test fixtures to dependency directories that somehow got committed. When Claude runs a glob or search across the project, it traverses all of it.

A practical .claudeignore for most projects:

# Dependencies
node_modules/
vendor/
.venv/
__pycache__/

# Build outputs
dist/
build/
out/
.next/
.nuxt/

# Generated files
*.generated.ts
*.generated.py
coverage/
.nyc_output/

# Large binary or data files
*.csv
*.json.gz
*.parquet
*.db
*.sqlite

# Lock files (Claude does not need to read these)
package-lock.json
yarn.lock
pnpm-lock.yaml
poetry.lock

# Internal tooling logs
logs/
*.log

The cost impact is subtle but real: when Claude Code uses search tools to navigate your codebase, it only searches through what .claudeignore permits. Excluding node_modules alone can reduce the scope of file system operations dramatically.

For even more precision, add project-specific directories. If you have a data/fixtures/ directory with 500 test JSON files that Claude never needs to read, exclude it. If you have auto-generated documentation that Claude Code should never modify, exclude that too.
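To find those directories, it helps to see where the files actually are. The small Python sketch below counts files under each top-level directory with os.walk. files_per_top_level_dir is a hypothetical helper, not part of Claude Code.

```python
import os
from collections import Counter


def files_per_top_level_dir(root: str = ".") -> Counter:
    """Count files under each top-level entry of a project, skipping .git.

    Files directly in root are tallied under "(root files)". Counts reflect
    what is on disk, regardless of any ignore file.
    """
    counts: Counter = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            # Prune .git in place so os.walk never descends into it.
            dirnames[:] = [d for d in dirnames if d != ".git"]
            counts["(root files)"] += len(filenames)
            continue
        top = os.path.relpath(dirpath, root).split(os.sep, 1)[0]
        counts[top] += len(filenames)
    return counts
```

Printing files_per_top_level_dir().most_common(10) from the project root lists the heaviest directories first. Anything near the top that Claude never needs to read is a candidate for exclusion.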

Surgical File Reading: Only Read What You Need

One of the highest-cost patterns in casual Claude Code usage is asking for codebase understanding before giving Claude a specific task:

“Read the whole src/api/ directory and understand the structure before we start.”

This seems reasonable. It is expensive. Reading an api/ directory with 40 files at an average of 200 lines each means 8,000 lines of code entering the context window before anything happens.

The better approach: give Claude the task and let it pull what it needs. Claude Code has file reading tools precisely for this. When you describe a bug or a feature, Claude will read the relevant files. It does not need a pre-loaded full-codebase tour.

If orientation is genuinely necessary, point Claude to the most information-dense source:

Before starting: read ARCHITECTURE.md and the README. 
Do not read individual source files unless you need them for the specific task.

A two-file orientation (ARCHITECTURE.md + README) might be 1,000 tokens. A “read the whole api/ directory” is 50,000+ tokens. The task outcome is often identical.

For the same reason, prefer asking Claude to make specific targeted changes over asking it to “refactor X.” A targeted change requires reading fewer files. “Refactor X” invites reading everything X touches, and everything that touches X.

Git Worktree Patterns for Long-Running Work

For complex features that span many sessions, git worktrees offer a cost management advantage that is not immediately obvious.

Without worktrees: each new session starts from scratch. Claude needs to rebuild context about what was done in previous sessions. You either re-read a lot of files or write extensive session notes.

With worktrees: you can maintain a dedicated worktree per feature, with a feature-specific CLAUDE.md override that contains the relevant context for that feature only. Claude’s context in that worktree is focused. You avoid loading codebase-wide context when working on an isolated feature.

This is covered in detail in our git worktree guide, but the cost-relevant pattern is:

# Create a feature worktree
git worktree add ../feature-payment-refactor -b payment-refactor

# Add a worktree-local CLAUDE.md
cat > ../feature-payment-refactor/CLAUDE.md << 'EOF'
# Payment Refactor Context

## Goal
Refactor payment processing from Stripe v2 to Stripe v3 SDK.

## Relevant files
- src/payments/ (primary scope)
- src/api/checkout.ts
- tests/payments/

## Do not touch
- src/auth/ (separate scope)
- src/admin/ (separate scope)

## Progress
- [x] Updated StripeClient initialization
- [x] Migrated webhook handling
- [ ] Migrate checkout session creation
- [ ] Update error handling to v3 format
EOF

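This CLAUDE.md is roughly 200 tokens and entirely focused on the task. If the main project CLAUDE.md is committed to the repository, the new worktree starts with a copy of it. The cat command above overwrites that copy, so Claude Code loads the focused version here instead. Keep the overwritten file out of your commits on the feature branch, or it will replace the main CLAUDE.md when the branch merges. Claude Code does not need to know about the auth system, the CRM integration, or anything else in the main CLAUDE.md.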

The principle: scope the context to the task. A feature-scoped context is cheaper and often more effective than a whole-project context.

Sub-Agent Patterns for Parallel Work

If you are using sub-agents (via the --agent flag or multi-agent workflows), the context architecture matters for cost.

The expensive pattern: one main agent that accumulates all context, spawns sub-agents that return verbose summaries, which are then fed back into the main context.

The efficient pattern: sub-agents that do bounded, well-defined tasks and return minimal structured output.

Verbose sub-agent output (expensive):

Main agent: "Review each file in src/api/ and report what you find"
Sub-agent: returns 2,000-token detailed analysis of every file
→ Main agent context grows by 2,000 tokens

Focused sub-agent output (efficient):

Main agent: "Check if any file in src/api/ imports from deprecated-utils.ts. Return only the list of files."
Sub-agent: returns ["src/api/user.ts", "src/api/checkout.ts"]
→ Main agent context grows by ~20 tokens

The task specification determines how much the sub-agent reads, processes, and returns. Precise task definitions are cheaper than exploratory ones.
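On the orchestrating side, one way to keep sub-agent returns small is to refuse anything that is not the structured answer you asked for. The Python sketch below assumes the sub-agent was told to reply with a JSON array of file paths. parse_file_list and the 2,000-character cap are hypothetical choices, not part of Claude Code.

```python
import json

MAX_RESULT_CHARS = 2_000  # hypothetical cap on what may enter the main context


def parse_file_list(reply: str) -> list[str]:
    """Accept a sub-agent reply only if it is a short JSON array of paths.

    Prose, nested analysis, or oversized output raises ValueError, so the
    orchestrator can retry with a tighter instruction instead of pasting a
    verbose answer into its own context.
    """
    if len(reply) > MAX_RESULT_CHARS:
        raise ValueError(f"reply too long: {len(reply)} chars")
    data = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, list) or not all(isinstance(p, str) for p in data):
        raise ValueError("expected a JSON array of file paths")
    return data


if __name__ == "__main__":
    print(parse_file_list('["src/api/user.ts", "src/api/checkout.ts"]'))
```

Rejecting and retrying costs one extra short round-trip. Accepting a 2,000-token analysis costs those tokens on every later request in the main session.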

For a detailed treatment of sub-agent architecture, see our subagents best practices guide.

Session Length and Compaction Strategy

Longer sessions accumulate more context and cost more. This is obvious. Less obvious is that compaction is not free.

When Claude Code runs /compact, it generates a summary of the current session and starts fresh with that summary as the new context. Generating the summary costs tokens: the model reads the whole conversation and writes the summary as output. A large CLAUDE.md is also re-added to the fresh context, at full input price if its cache entry has expired.
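How quickly a compaction pays for itself can be estimated. The rough Python model below assumes output tokens cost about five times as much as input tokens (the ratio on current Claude models; check current pricing). It ignores cache-write surcharges and the fact that the context keeps growing. compaction_break_even is a hypothetical helper.

```python
import math


def compaction_break_even(context_tokens: int, summary_tokens: int,
                          output_ratio: float = 5.0, read_mult: float = 1.0) -> int:
    """Turns after /compact before the compaction has paid for itself.

    Rough model in input-token equivalents:
    - compacting reads the whole context once and writes a summary, and
      output tokens cost output_ratio x input tokens;
    - every later turn resends the summary instead of the full context;
    - read_mult < 1 models prompt-cache reads (0.1 = billed at 10%).
    """
    if summary_tokens >= context_tokens:
        raise ValueError("summary must be smaller than the context it replaces")
    cost = context_tokens * read_mult + summary_tokens * output_ratio
    saving_per_turn = (context_tokens - summary_tokens) * read_mult
    return math.ceil(cost / saving_per_turn)
```

For a 100,000-token context compacted into a 5,000-token summary, compaction_break_even(100_000, 5_000) returns 2, and compaction_break_even(100_000, 5_000, read_mult=0.1) returns 4. Compaction pays for itself within a few turns if you keep working, while compacting right before you stop is pure cost.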

Strategies for managing session cost:

Work in shorter, focused sessions. A 30-minute session on a specific task is cheaper than a 2-hour exploratory session. Not always practical, but worth defaulting to when possible.

Compact before sessions naturally drift. Rather than letting the context fill up and forcing a compaction mid-task, manually compact at natural breakpoints (end of a feature, end of a test pass). Compacting at a clean breakpoint produces a better summary, which means the next session starts with higher-quality context.

Use CURRENT_TASK.md instead of repeating context in messages. If you need to re-orient Claude at the start of a session, put the context in a file and reference it once:

Read CURRENT_TASK.md before starting.

This is cheaper than writing out the context in a long message, and it is reusable across sessions without re-typing.

Team and CI Configuration

Individual developer habits matter less than team defaults when Claude Code is used at scale.

In CI/CD pipelines: audit what Claude Code is actually reading. A CI job that reads the entire repo before running a focused task is burning money unnecessarily. Use .claudeignore to limit scope, and write CI-specific CLAUDE.md overrides for the tasks CI handles.

In shared codebases: centralize CLAUDE.md. Engineers running divergent CLAUDE.md files cannot share cache entries with each other. A shared, stable CLAUDE.md that all team members use gives their requests the same prefix and the best chance of cache hits.

Usage visibility: if your team uses Claude Code via the API (rather than individual subscriptions), instrument your usage. Track tokens per session, tokens per developer, tokens per task type. You cannot optimize what you cannot see. The Anthropic API returns token usage in every response — log it.

Rate limiting: for autonomous agents running in CI or background jobs, set explicit rate limits or session token budgets. An agent that loops unexpectedly can consume large amounts of tokens before anyone notices.
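Usage logging and a session budget can share one piece of code. The Python sketch below assumes you call the Messages API yourself and pass each response's usage to it as a dict. The usage fields are input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens, and the cache fields may be null. UsageTracker, TokenBudgetExceeded, and the budget figure are hypothetical. The budget counts raw tokens; weight each field by its price if you need a dollar cap.

```python
from collections import defaultdict


class TokenBudgetExceeded(RuntimeError):
    """Raised when a session's running token total passes its budget."""


class UsageTracker:
    """Accumulate API token usage per (developer, task type) and enforce a cap."""

    FIELDS = ("input_tokens", "output_tokens",
              "cache_creation_input_tokens", "cache_read_input_tokens")

    def __init__(self, session_budget: int):
        self.session_budget = session_budget
        self.session_total = 0
        # One counter dict per (developer, task_type), created on first use.
        self.by_key = defaultdict(lambda: dict.fromkeys(self.FIELDS, 0))

    def record(self, developer: str, task_type: str, usage: dict) -> None:
        bucket = self.by_key[(developer, task_type)]
        for field in self.FIELDS:
            n = usage.get(field) or 0  # absent or null cache fields count as zero
            bucket[field] += n
            self.session_total += n
        # Record first, then stop the agent loop once the cap is passed.
        if self.session_total > self.session_budget:
            raise TokenBudgetExceeded(
                f"{self.session_total} tokens used, budget {self.session_budget}")
```

Catching TokenBudgetExceeded in the agent loop turns a runaway job into a logged failure instead of a surprise on the bill. by_key gives the per-developer and per-task-type breakdown described above.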

Practical Audit: Finding Your Biggest Cost Driver

If you want to identify your specific cost problems quickly, follow this process:

Add logging to a session. Run Claude Code with debug output enabled (for example, the --debug flag) for a few real sessions, and check /cost at the end of each. Review the output to see which operations produced the largest tool outputs.
Measure your CLAUDE.md size. wc -w CLAUDE.md gives you the word count. Multiply by ~1.3 for a rough token estimate. If it is over 2,000 tokens, audit it line by line.
Check your .claudeignore coverage. Run find . -type f -not -path './.git/*' | wc -l in your project root to count files. If it returns more than a few thousand and you do not have a .claudeignore, you probably have significant unnecessary file system traversal.
Review your most frequent prompts. The prompts you send dozens of times per day are worth optimizing. A 100-token reduction in a prompt you send 50 times per day saves 5,000 tokens per day, per developer.
Look at sub-agent return sizes. If you use sub-agents, check what they are actually returning. Verbose returns that get injected into main context are a common cost leak.
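The CLAUDE.md check is easy to script. The Python sketch below applies the same words-times-1.3 heuristic and the 2,000-token threshold from the list above. estimate_tokens and audit_claude_md are hypothetical names, and the ratio is only an approximation; for exact counts, the Anthropic API offers a token-counting endpoint.

```python
from pathlib import Path

TOKENS_PER_WORD = 1.3  # rough heuristic for prose; code-heavy text usually runs higher
AUDIT_THRESHOLD = 2_000


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count, like wc -w."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def audit_claude_md(path: str = "CLAUDE.md") -> str:
    """Summarize a CLAUDE.md's estimated size and whether it needs an audit."""
    tokens = estimate_tokens(Path(path).read_text(encoding="utf-8"))
    verdict = "audit line by line" if tokens > AUDIT_THRESHOLD else "ok"
    return f"{path}: ~{tokens} tokens ({verdict})"
```

Calling audit_claude_md() from the project root checks the main file. Passing the paths of nested CLAUDE.md files checks those too.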

The Trade-Off: Cost vs. Thoroughness

There is a real tension here that is worth naming. Some of the cost optimizations above — reading fewer files, using shorter CLAUDE.md, working in shorter sessions — can reduce how well Claude Code understands your codebase. That reduced understanding can lead to worse output: more mistakes, more iterations, more back-and-forth.

The right optimization target is cost per working outcome, not cost per session. If a slightly longer CLAUDE.md produces first-pass code that does not need three rounds of correction, the higher token count might be cheaper overall.
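A simple way to reason about this is expected cost per accepted result. The Python sketch below rests on a deliberately crude assumption: each attempt costs the same and succeeds independently with the same probability, so the expected number of attempts is 1/p. The function name and the figures are hypothetical, and real retries also carry growing history.

```python
def expected_tokens_per_outcome(tokens_per_attempt: float, first_pass_success: float) -> float:
    """Expected tokens to reach one accepted result.

    Simplification: attempts are independent, equally expensive, and each
    succeeds with probability first_pass_success, so the expected number of
    attempts is 1 / p (a geometric distribution).
    """
    if not 0 < first_pass_success <= 1:
        raise ValueError("success probability must be in (0, 1]")
    return tokens_per_attempt / first_pass_success


if __name__ == "__main__":
    lean = expected_tokens_per_outcome(40_000, 0.5)   # smaller context, more rework
    rich = expected_tokens_per_outcome(55_000, 0.9)   # larger context, fewer retries
    print(round(lean), round(rich))
```

With these hypothetical figures, the lean setup expects 80,000 tokens per accepted result and the richer one about 61,000. The bigger context is cheaper per outcome despite costing more per attempt.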

The practical test: optimize one thing at a time and observe whether your output quality changes. If it does not, the optimization was safe. If it does, decide whether the quality loss is acceptable or not.

In practice, most teams find that the first 20–30% of potential cost reductions come with no quality loss at all. The bloat in CLAUDE.md, the unnecessary file reads, the wrong model for routine tasks — these are pure inefficiency. Addressing them is straightforward.

The harder optimizations — reducing context for complex tasks, shortening sessions on work that benefits from continuity — require judgment calls. Make them with data, not assumptions.

For more on managing context specifically, see the Claude Code context management guide. For the CLAUDE.md side of this, the token budget optimization guide goes deeper into prompt structure. For teams running parallel agents, the subagents best practices guide covers the cost architecture for multi-agent setups.

