Free AI Research with Ollama + Claude Code: Local LLMs for Developers
Claude Code transformed how I write software. The agentic workflow, file editing capabilities, and multi-step reasoning made it indispensable. But after a month of heavy usage, I noticed something uncomfortable: my API bills were growing faster than my codebase.
The thing is, not every task needs the full power of Claude Opus or Sonnet. When I'm exploring a new library, prototyping an idea, or just experimenting with different approaches, I don't need the most capable model - I need a model that can help me think through problems without watching my token counter tick upward.
That's where Ollama changed my workflow completely.
The Hidden Cost of AI-Assisted Development
Let me be direct about the economics. Claude Code charges per token. When you're in deep exploration mode - asking follow-up questions, regenerating code, having Claude read through documentation - tokens add up quickly. I tracked my usage for two weeks:
- Exploration/research sessions: 60% of my Claude Code usage
- Production code writing: 40% of my usage
- Exploration cost efficiency: Often low (many iterations, dead ends, experiments)
- Production cost efficiency: High (targeted changes, fewer iterations)
The insight was clear: I was paying premium prices for exploratory work that didn't require premium models.
Ollama's Anthropic-Compatible API
Here's what makes this integration possible: starting with Ollama v0.14.0, the team added an Anthropic Messages API compatibility layer. Claude Code doesn't know the difference between talking to Anthropic's servers and talking to your local Ollama instance.
No MCP server setup. No proxy configuration. No custom plugins. Just environment variables.
The architecture works like this:
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Claude Code   │─────▶│  Ollama Server  │─────▶│   Local Model   │
│   (CLI Agent)   │      │   (localhost)   │      │  (qwen3-coder)  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                         │
         └─────────────────────────┘
            Anthropic Messages API
                (same protocol)
Claude Code sends requests using Anthropic's API format. Ollama receives them, translates them for the local model, and returns responses in the expected format. From Claude Code's perspective, it's just talking to another Anthropic-compatible endpoint.
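If you want to verify the translation layer before pointing Claude Code at it, you can talk to the endpoint directly. The sketch below is an assumption-laden example: it presumes the compatibility layer lives at Anthropic's usual /v1/messages path on Ollama's default port (11434) and that you've already pulled qwen3-coder; the request body follows Anthropic's Messages API shape.
# Confirm the server is up (the root path replies with "Ollama is running")
curl http://localhost:11434/
# Send an Anthropic-style Messages request to the local compatibility layer
# (the /v1/messages path and the ignored x-api-key header are assumptions)
curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize what a Modelfile does."}]
  }'
If that comes back as a JSON message with content blocks, Claude Code should work against the same base URL.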
Setting Up the Integration
Prerequisites
You'll need three things: Ollama, the Claude Code CLI, and at least one local model.
Install Ollama:
# macOS (Homebrew)
brew install ollama
# macOS/Linux (direct install)
curl -fsSL https://ollama.com/install.sh | sh
# Windows (winget)
winget install Ollama.Ollama
# Windows (alternative) - Download the installer from https://ollama.com/download
Install Claude Code:
# All platforms
npm install -g @anthropic-ai/claude-code
Pulling a Code-Optimized Model
Not all models work equally well for coding tasks. I've tested several, and these perform best with Claude Code's agentic workflows:
# Best overall for code tasks (20B parameters, good balance)
ollama pull gpt-oss:20b
# Excellent for code generation and analysis
ollama pull qwen3-coder
# Fast and capable, good for quick iterations
ollama pull glm-4.7-flash:latest
★ Insight ─────────────────────────────────────
- Context length matters: Claude Code works best with models supporting 64k+ tokens. Shorter context windows cause the agent to lose track of file contents during multi-step operations.
- Tool calling support: For full agentic features (file editing, bash commands), you need Ollama 0.14.0+ which includes tool/function calling in the Anthropic compatibility layer.
─────────────────────────────────────────────────
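A quick way to check both points before committing to a model is ollama show, which on recent Ollama builds reports the context length and the model's capabilities, including tool support (treat the exact field names as version-dependent):
# Inspect context window and capabilities for a pulled model
ollama show qwen3-coder
# Look for "context length" under Model and "tools" under Capabilities;
# without tool support, Claude Code's file edits and bash calls won't work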
Configuring Environment Variables
macOS/Linux - Add to your shell configuration (~/.zshrc or ~/.bashrc):
# ~/.zshrc or ~/.bashrc
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY="" # Required but ignored
export ANTHROPIC_BASE_URL=http://localhost:11434
Then reload your shell:
source ~/.zshrc
Windows (PowerShell) - Add to your PowerShell profile or run in terminal:
# Temporary (current session only)
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
# Permanent (add to $PROFILE or set via System Properties)
[Environment]::SetEnvironmentVariable("ANTHROPIC_AUTH_TOKEN", "ollama", "User")
[Environment]::SetEnvironmentVariable("ANTHROPIC_API_KEY", "", "User")
[Environment]::SetEnvironmentVariable("ANTHROPIC_BASE_URL", "http://localhost:11434", "User")
Windows (Command Prompt):
set ANTHROPIC_AUTH_TOKEN=ollama
set ANTHROPIC_API_KEY=
set ANTHROPIC_BASE_URL=http://localhost:11434
Running Claude Code with Ollama
First, ensure Ollama's server is running:
ollama serve
Then launch Claude Code with your chosen model:
claude --model qwen3-coder
You can also run everything inline without modifying your shell config:
ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model gpt-oss:20b
Model Selection for Different Tasks
After months of experimentation, I've developed preferences for different scenarios:
For Code Exploration and Research
ollama pull qwen3-coder
claude --model qwen3-coder
Why: Qwen3-coder handles code comprehension well and can follow complex file structures. When I'm reading through an unfamiliar codebase or understanding how a library works, this model does the job without burning through my Anthropic credits.
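For one-off research questions I often skip the interactive session and use Claude Code's print mode (claude -p) against the local model. A rough example, assuming the environment variables from earlier are set; the project path is a placeholder:
# Ask a single question about the current repository and print the answer
cd ~/code/unfamiliar-project   # placeholder; use your own checkout
claude --model qwen3-coder -p "Trace how a request flows from the CLI entry point to the HTTP client in this codebase."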
For Quick Prototyping
ollama pull glm-4.7-flash:latest
claude --model glm-4.7-flash:latest
Why: Speed matters when iterating on ideas. GLM-4.7-flash generates responses quickly, which keeps my flow state intact during rapid prototyping sessions.
For Larger Context Requirements
ollama pull gpt-oss:20b
claude --model gpt-oss:20b
Why: The 20B parameter model handles longer context windows better. When I need Claude Code to read multiple files and synthesize information across them, this model maintains coherence.
★ Insight ─────────────────────────────────────
- Hardware affects model choice: A model that runs slowly on your hardware disrupts the conversational flow of Claude Code. Better to use a smaller, faster model than wait 30 seconds between responses.
- Test with realistic tasks: Run your typical workflow with each model before settling on one. Some models excel at explanation but struggle with code generation, or vice versa.
─────────────────────────────────────────────────
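A cheap way to gauge whether a model is fast enough on your hardware is to time a representative prompt before wiring it into Claude Code. ollama run with --verbose prints throughput statistics after the response; the eval rate (tokens per second) is the number that predicts how conversational the agent will feel:
# Rough throughput check; watch the eval rate in the verbose stats
ollama run qwen3-coder "Explain the difference between num_ctx and num_predict." --verbose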
The Claude Launcher Alternative
If you switch between local and cloud models frequently, Claude Launcher simplifies the workflow:
npm install -g claude-launcher
# Run with local Ollama model
claude-launcher -l
# Run with cloud Claude (when you need full capabilities)
claude-launcher
The launcher handles environment variable switching automatically, so you don't need to manage multiple shell configurations.
One-line option via Ollama's built-in launcher:
ollama launch
This presents a menu where you can select Claude Code and your preferred model interactively.
When to Use Local vs. Cloud
I've developed a simple decision framework:
Use Local Ollama When:
- Exploring new codebases: Reading through unfamiliar code, understanding architecture
- Learning new frameworks: Following tutorials, experimenting with APIs
- Early prototyping: Rough implementations where you expect many iterations
- Documentation research: Having Claude explain concepts, summarize docs
- Offline work: When internet is unreliable or you're on a plane
- Privacy-sensitive projects: Code that shouldn't leave your machine
Use Cloud Claude When:
- Production code: When quality matters and you need the most capable model
- Complex refactoring: Large-scale changes requiring nuanced understanding
- Code review: Where catching subtle bugs justifies the cost
- Writing tests: Comprehensive test coverage benefits from Claude's reasoning
- Debugging tricky issues: When you've already tried obvious solutions
// Decision helper I keep in my notes
interface CodingTask {
  isExploratory: boolean;
  expectedIterations: number;
  isLearning: boolean;
  privacySensitive: boolean;
  requiresHighAccuracy: boolean;
}

const shouldUseLocalModel = (task: CodingTask): boolean => {
  const localIndicators = [
    task.isExploratory,
    task.expectedIterations > 5,
    task.isLearning,
    task.privacySensitive,
    !task.requiresHighAccuracy
  ];
  // Two or more signals pointing local? Use the local model.
  return localIndicators.filter(Boolean).length >= 2;
};
Performance Considerations
Hardware Requirements
Local LLMs need computational resources. Here's what I've found works:
| Model Size | Minimum RAM | Recommended GPU | Response Time |
|---|---|---|---|
| 7B params | 8GB | None (CPU ok) | 2-5 seconds |
| 13B params | 16GB | 8GB VRAM | 3-8 seconds |
| 20B params | 32GB | 12GB VRAM | 5-15 seconds |
Platform-specific notes:
- macOS (Apple Silicon): M1/M2/M3 chips use unified memory efficiently. 16GB handles most models well.
- Windows/Linux (NVIDIA): Ollama automatically uses CUDA if available. RTX 3080/4080+ recommended for larger models.
- Windows/Linux (AMD): ROCm support available for compatible AMD GPUs.
- CPU-only: Works but expect 3-5x slower response times compared to GPU.
On my M2 MacBook Pro (16GB RAM), qwen3-coder runs comfortably with 3-5 second response times. Larger models work but feel sluggish.
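It's also worth confirming whether a loaded model actually fits on the GPU. ollama ps lists the models currently in memory and, on recent versions, shows how the load is split between CPU and GPU:
# Run while a model is loaded (for example, during a Claude Code session)
ollama ps
# "100% GPU" means full offload; a CPU/GPU split means the model spilled
# out of VRAM and responses will be noticeably slower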
Optimizing Context Length
Ollama models have a configurable context window, and the default is usually far smaller than Claude Code needs. For agentic workflows, extend it:
# Create a custom model with larger context
ollama create qwen3-coder-64k -f - <<EOF
FROM qwen3-coder
PARAMETER num_ctx 65536
EOF
# Use the extended model
claude --model qwen3-coder-64k
Handling Tool Calls
Some Ollama models don't fully support tool/function calling, which Claude Code relies on heavily. If you notice the agent struggling to use tools, ensure you're running Ollama 0.14.0 or later:
# Check version
ollama --version
# Update if needed (pre-release for best tool support)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh
Practical Workflow Integration
Here's how I structure my typical day:
Morning exploration (local):
- Pull latest from repos I'm contributing to
- Use local Ollama to understand changes
- Research new libraries or patterns
- Prototype solutions to problems I identified
Focused development (cloud):
- Switch to Claude Sonnet/Opus for implementation
- Write production code with higher-quality assistance
- Comprehensive code review before PRs
Evening learning (local):
- Work through tutorials and documentation
- Experiment with ideas that might not go anywhere
- No cost anxiety during open-ended exploration
My shell aliases make switching seamless:
macOS/Linux (~/.zshrc or ~/.bashrc):
alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model qwen3-coder'
alias claude-fast='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model glm-4.7-flash:latest'
alias claude-cloud='claude' # Uses default Anthropic settings
Windows (PowerShell profile) - Add to $PROFILE:
function claude-local {
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_API_KEY = ""
claude --model qwen3-coder
}
function claude-fast {
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_API_KEY = ""
claude --model glm-4.7-flash:latest
}
function claude-cloud {
Remove-Item Env:ANTHROPIC_AUTH_TOKEN -ErrorAction SilentlyContinue
Remove-Item Env:ANTHROPIC_BASE_URL -ErrorAction SilentlyContinue
claude
}
Troubleshooting Common Issues
"Connection refused" errors
Ollama server isn't running:
# macOS/Linux - Start the server
ollama serve
# macOS/Linux - Run in background
ollama serve &
# Windows - Ollama runs as a service automatically after installation
# If needed, start manually from the Ollama app in the system tray
Slow or incomplete responses
Model might be too large for your hardware:
# Try a smaller model
ollama pull codellama:7b
claude --model codellama:7b
Tool calls not working
Update to latest Ollama:
# Check current version
ollama --version
# macOS (Homebrew)
brew reinstall ollama
# Windows (winget)
winget upgrade Ollama.Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Context length exceeded
Create a model variant with larger context:
ollama create mymodel-extended -f - <<EOF
FROM qwen3-coder
PARAMETER num_ctx 32768
EOF
The Economics Revisited
After three months of this hybrid approach:
- API costs: Down 45% compared to cloud-only usage
- Exploration time: Up 30% (no cost anxiety)
- Production code quality: Maintained (still using Claude for critical work)
- Learning velocity: Noticeably improved
The real value isn't just cost savings - it's behavioral change. When there's no cost associated with asking questions, I ask more questions. I explore tangents. I try approaches that might not work. This experimentation compounds into better understanding and, eventually, better code.
For Designers Using Claude Code
If you're a designer working with code, local models are particularly valuable for:
- Understanding component libraries: Ask endless questions about how Tailwind classes work
- CSS experimentation: Generate variations without worrying about token costs
- Design system exploration: Have Claude explain design token architectures
- Prototyping interactions: Iterate on animation code rapidly
The lower stakes of local models make them perfect for learning-oriented work where you might not know the right questions to ask yet.
★ Insight ─────────────────────────────────────
- Design exploration benefits from iteration: Unlike production code where you want it right the first time, design work benefits from trying multiple approaches. Local models enable this experimentation mindset.
- Component understanding: Having Claude explain existing component code helps bridge the designer-developer gap without cost concerns.
─────────────────────────────────────────────────
Free AI assistance exists. It runs on your machine, respects your privacy, and enables the kind of exploratory work that makes AI assistants truly transformative. The setup takes ten minutes. The workflow change takes a day to internalize. The impact on how you learn and experiment with code lasts much longer.
Start with ollama pull qwen3-coder and see where curiosity takes you.