Free AI Research with Ollama + Claude Code: Local LLMs for Developers
Claude Code transformed how I write software. The agentic workflow, file editing capabilities, and multi-step reasoning made it indispensable. But after a month of heavy usage, I noticed something uncomfortable: my API bills were growing faster than my codebase.
The thing is, not every task needs the full power of Claude Opus or Sonnet. When I'm exploring a new library, prototyping an idea, or just experimenting with different approaches, I don't need the most capable model - I need a model that can help me think through problems without watching my token counter tick upward.
That's where Ollama changed my workflow completely.
The Hidden Cost of AI-Assisted Development
Let me be direct about the economics. Claude Code charges per token. When you're in deep exploration mode - asking follow-up questions, regenerating code, having Claude read through documentation - tokens add up quickly. I tracked my usage for two weeks:
- Exploration/research sessions: 60% of my Claude Code usage
- Production code writing: 40% of my usage
- Exploration cost efficiency: Often low (many iterations, dead ends, experiments)
- Production cost efficiency: High (targeted changes, fewer iterations)
The insight was clear: I was paying premium prices for exploratory work that didn't require premium models.
Ollama's Anthropic-Compatible API
Here's what makes this integration possible: starting with Ollama v0.14.0, the team added an Anthropic Messages API compatibility layer. Claude Code doesn't know the difference between talking to Anthropic's servers and talking to your local Ollama instance.
No MCP server setup. No proxy configuration. No custom plugins. Just environment variables.
The architecture works like this:
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Claude Code   │─────▶│  Ollama Server  │─────▶│   Local Model   │
│   (CLI Agent)   │      │   (localhost)   │      │  (qwen3-coder)  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                         │
         └─────────────────────────┘
            Anthropic Messages API
                (same protocol)
Claude Code sends requests using Anthropic's API format. Ollama receives them, translates them for the local model, and returns responses in the expected format. From Claude Code's perspective, it's just talking to another Anthropic-compatible endpoint.
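If you want to verify the translation layer before pointing Claude Code at it, you can talk to the endpoint directly. The sketch below is an assumption-laden example: it presumes the compatibility layer lives at Anthropic's usual /v1/messages path on Ollama's default port (11434) and that you've already pulled qwen3-coder; the request body follows Anthropic's Messages API shape.
# Confirm the server is up (the root path replies with "Ollama is running")
curl http://localhost:11434/
# Send an Anthropic-style Messages request to the local compatibility layer
# (the /v1/messages path and the ignored x-api-key header are assumptions)
curl http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize what a Modelfile does."}]
  }'
If that comes back as a JSON message with content blocks, Claude Code should work against the same base URL.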
Setting Up the Integration
Prerequisites
You'll need three things: Ollama, the Claude Code CLI, and at least one local model.
Install Ollama:
# macOS (Homebrew)
brew install ollama
# macOS/Linux (direct install)
curl -fsSL https://ollama.com/install.sh | sh
# Windows (winget)
winget install Ollama.Ollama
# Windows (alternative) - Download the installer from https://ollama.com/download
Install Claude Code:
# All platforms
npm install -g @anthropic-ai/claude-code
Pulling a Code-Optimized Model
Not all models work equally well for coding tasks. I've tested several, and these perform best with Claude Code's agentic workflows:
# Best overall for code tasks (20B parameters, good balance)
ollama pull gpt-oss:20b
# Excellent for code generation and analysis
ollama pull qwen3-coder
# Fast and capable, good for quick iterations
ollama pull glm-4.7-flash:latest
★ Insight ─────────────────────────────────────
- Context length matters: Claude Code works best with models supporting 64k+ tokens. Shorter context windows cause the agent to lose track of file contents during multi-step operations.
- Tool calling support: For full agentic features (file editing, bash commands), you need Ollama 0.14.0+ which includes tool/function calling in the Anthropic compatibility layer.
─────────────────────────────────────────────────
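A quick way to check both points before committing to a model is ollama show, which on recent Ollama builds reports the context length and the model's capabilities, including tool support (treat the exact field names as version-dependent):
# Inspect context window and capabilities for a pulled model
ollama show qwen3-coder
# Look for "context length" under Model and "tools" under Capabilities;
# without tool support, Claude Code's file edits and bash calls won't work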
Configuring Environment Variables
macOS/Linux - Add to your shell configuration (~/.zshrc or ~/.bashrc):
# ~/.zshrc or ~/.bashrc
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY="" # Required but ignored
export ANTHROPIC_BASE_URL=http://localhost:11434
Then reload your shell:
source ~/.zshrc
Windows (PowerShell) - Add to your PowerShell profile or run in terminal:
# Temporary (current session only)
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
# Permanent (add to $PROFILE or set via System Properties)
[Environment]::SetEnvironmentVariable("ANTHROPIC_AUTH_TOKEN", "ollama", "User")
[Environment]::SetEnvironmentVariable("ANTHROPIC_API_KEY", "", "User")
[Environment]::SetEnvironmentVariable("ANTHROPIC_BASE_URL", "http://localhost:11434", "User")
Windows (Command Prompt):
set ANTHROPIC_AUTH_TOKEN=ollama
set ANTHROPIC_API_KEY=
set ANTHROPIC_BASE_URL=http://localhost:11434
Running Claude Code with Ollama
First, ensure Ollama's server is running:
ollama serve
Then launch Claude Code with your chosen model:
claude --model qwen3-coder
You can also run everything inline without modifying your shell config:
ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model gpt-oss:20b
Model Selection for Different Tasks
After months of experimentation, I've developed preferences for different scenarios:
For Code Exploration and Research
ollama pull qwen3-coder
claude --model qwen3-coder
Why: Qwen3-coder handles code comprehension well and can follow complex file structures. When I'm reading through an unfamiliar codebase or understanding how a library works, this model does the job without burning through my Anthropic credits.
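For one-off research questions I often skip the interactive session and use Claude Code's print mode (claude -p) against the local model. A rough example, assuming the environment variables from earlier are set; the project path is a placeholder:
# Ask a single question about the current repository and print the answer
cd ~/code/unfamiliar-project   # placeholder; use your own checkout
claude --model qwen3-coder -p "Trace how a request flows from the CLI entry point to the HTTP client in this codebase."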
For Quick Prototyping
ollama pull glm-4.7-flash:latest
claude --model glm-4.7-flash:latest
Why: Speed matters when iterating on ideas. GLM-4.7-flash generates responses quickly, which keeps my flow state intact during rapid prototyping sessions.
For Larger Context Requirements
ollama pull gpt-oss:20b
claude --model gpt-oss:20b
Why: The 20B parameter model handles longer context windows better. When I need Claude Code to read multiple files and synthesize information across them, this model maintains coherence.
★ Insight ─────────────────────────────────────
- Hardware affects model choice: A model that runs slowly on your hardware disrupts the conversational flow of Claude Code. Better to use a smaller, faster model than wait 30 seconds between responses.
- Test with realistic tasks: Run your typical workflow with each model before settling on one. Some models excel at explanation but struggle with code generation, or vice versa.
─────────────────────────────────────────────────
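A cheap way to gauge whether a model is fast enough on your hardware is to time a representative prompt before wiring it into Claude Code. ollama run with --verbose prints throughput statistics after the response; the eval rate (tokens per second) is the number that predicts how conversational the agent will feel:
# Rough throughput check; watch the eval rate in the verbose stats
ollama run qwen3-coder "Explain the difference between num_ctx and num_predict." --verbose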
The Claude Launcher Alternative
If you switch between local and cloud models frequently, Claude Launcher simplifies the workflow:
npm install -g claude-launcher
# Run with local Ollama model
claude-launcher -l
# Run with cloud Claude (when you need full capabilities)
claude-launcher
The launcher handles environment variable switching automatically, so you don't need to manage multiple shell configurations.
One-line option via Ollama's built-in launcher:
ollama launch
This presents a menu where you can select Claude Code and your preferred model interactively.
When to Use Local vs. Cloud
I've developed a simple decision framework:
Use Local Ollama When:
- Exploring new codebases: Reading through unfamiliar code, understanding architecture
- Learning new frameworks: Following tutorials, experimenting with APIs
- Early prototyping: Rough implementations where you expect many iterations
- Documentation research: Having Claude explain concepts, summarize docs
- Offline work: When internet is unreliable or you're on a plane
- Privacy-sensitive projects: Code that shouldn't leave your machine
Use Cloud Claude When:
- Production code: When quality matters and you need the most capable model
- Complex refactoring: Large-scale changes requiring nuanced understanding
- Code review: Where catching subtle bugs justifies the cost
- Writing tests: Comprehensive test coverage benefits from Claude's reasoning
- Debugging tricky issues: When you've already tried obvious solutions
// Decision helper I keep in my notes
interface CodingTask {
  isExploratory: boolean;
  expectedIterations: number;
  isLearning: boolean;
  privacySensitive: boolean;
  requiresHighAccuracy: boolean;
}

const shouldUseLocalModel = (task: CodingTask): boolean => {
  const localIndicators = [
    task.isExploratory,
    task.expectedIterations > 5,
    task.isLearning,
    task.privacySensitive,
    !task.requiresHighAccuracy
  ];
  // Two or more signals pointing local? Use the local model.
  return localIndicators.filter(Boolean).length >= 2;
};
Performance Considerations
Hardware Requirements
Local LLMs need computational resources. Here's what I've found works:
| Model Size | Minimum RAM | Recommended GPU | Response Time |
|---|---|---|---|
| 7B params | 8GB | None (CPU ok) | 2-5 seconds |
| 13B params | 16GB | 8GB VRAM | 3-8 seconds |
| 20B params | 32GB | 12GB VRAM | 5-15 seconds |
Platform-specific notes:
- macOS (Apple Silicon): M1/M2/M3 chips use unified memory efficiently. 16GB handles most models well.
- Windows/Linux (NVIDIA): Ollama automatically uses CUDA if available. RTX 3080/4080+ recommended for larger models.
- Windows/Linux (AMD): ROCm support available for compatible AMD GPUs.
- CPU-only: Works but expect 3-5x slower response times compared to GPU.
On my M2 MacBook Pro (16GB RAM), qwen3-coder runs comfortably with 3-5 second response times. Larger models work but feel sluggish.
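It's also worth confirming whether a loaded model actually fits on the GPU. ollama ps lists the models currently in memory and, on recent versions, shows how the load is split between CPU and GPU:
# Run while a model is loaded (for example, during a Claude Code session)
ollama ps
# "100% GPU" means full offload; a CPU/GPU split means the model spilled
# out of VRAM and responses will be noticeably slower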
Optimizing Context Length
Ollama models have a configurable context window, and the default is usually far smaller than Claude Code needs. For agentic workflows, extend it:
# Create a custom model with larger context
ollama create qwen3-coder-64k -f - <<EOF
FROM qwen3-coder
PARAMETER num_ctx 65536
EOF
# Use the extended model
claude --model qwen3-coder-64k
Handling Tool Calls
Some Ollama models don't fully support tool/function calling, which Claude Code relies on heavily. If you notice the agent struggling to use tools, ensure you're running Ollama 0.14.0 or later:
# Check version
ollama --version
# Update if needed (pre-release for best tool support)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh
Practical Workflow Integration
Here's how I structure my typical day:
Morning exploration (local):
- Pull latest from repos I'm contributing to
- Use local Ollama to understand changes
- Research new libraries or patterns
- Prototype solutions to problems I identified
Focused development (cloud):
- Switch to Claude Sonnet/Opus for implementation
- Write production code with higher-quality assistance
- Comprehensive code review before PRs
Evening learning (local):
- Work through tutorials and documentation
- Experiment with ideas that might not go anywhere
- No cost anxiety during open-ended exploration
My shell aliases make switching seamless:
macOS/Linux (~/.zshrc or ~/.bashrc):
alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model qwen3-coder'
alias claude-fast='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model glm-4.7-flash:latest'
alias claude-cloud='claude' # Uses default Anthropic settings
Windows (PowerShell profile) - Add to $PROFILE:
function claude-local {
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_API_KEY = ""
claude --model qwen3-coder
}
function claude-fast {
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_API_KEY = ""
claude --model glm-4.7-flash:latest
}
function claude-cloud {
Remove-Item Env:ANTHROPIC_AUTH_TOKEN -ErrorAction SilentlyContinue
Remove-Item Env:ANTHROPIC_BASE_URL -ErrorAction SilentlyContinue
claude
}
Troubleshooting Common Issues
"Connection refused" errors
Ollama server isn't running:
# macOS/Linux - Start the server
ollama serve
# macOS/Linux - Run in background
ollama serve &
# Windows - Ollama runs as a service automatically after installation
# If needed, start manually from the Ollama app in the system tray
Slow or incomplete responses
Model might be too large for your hardware:
# Try a smaller model
ollama pull codellama:7b
claude --model codellama:7b
Tool calls not working
Update to latest Ollama:
# Check current version
ollama --version
# macOS (Homebrew)
brew reinstall ollama
# Windows (winget)
winget upgrade Ollama.Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Context length exceeded
Create a model variant with larger context:
ollama create mymodel-extended -f - <<EOF
FROM qwen3-coder
PARAMETER num_ctx 32768
EOF
The Economics Revisited
After three months of this hybrid approach:
- API costs: Down 45% compared to cloud-only usage
- Exploration time: Up 30% (no cost anxiety)
- Production code quality: Maintained (still using Claude for critical work)
- Learning velocity: Noticeably improved
The real value isn't just cost savings - it's behavioral change. When there's no cost associated with asking questions, I ask more questions. I explore tangents. I try approaches that might not work. This experimentation compounds into better understanding and, eventually, better code.
For Designers Using Claude Code
If you're a designer working with code, local models are particularly valuable for:
- Understanding component libraries: Ask endless questions about how Tailwind classes work
- CSS experimentation: Generate variations without worrying about token costs
- Design system exploration: Have Claude explain design token architectures
- Prototyping interactions: Iterate on animation code rapidly
The lower stakes of local models make them perfect for learning-oriented work where you might not know the right questions to ask yet.
★ Insight ─────────────────────────────────────
- Design exploration benefits from iteration: Unlike production code where you want it right the first time, design work benefits from trying multiple approaches. Local models enable this experimentation mindset.
- Component understanding: Having Claude explain existing component code helps bridge the designer-developer gap without cost concerns.
─────────────────────────────────────────────────
Free AI assistance exists. It runs on your machine, respects your privacy, and enables the kind of exploratory work that makes AI assistants truly transformative. The setup takes ten minutes. The workflow change takes a day to internalize. The impact on how you learn and experiment with code lasts much longer.
Start with ollama pull qwen3-coder and see where curiosity takes you.