Use a Local Model as OpenClaw Router to Cut API Costs by 80%
Most OpenClaw users run every LLM call through a single cloud model like Claude or GPT-4o. That works, but it is expensive. Around 80% of those calls are routing decisions that a tiny local model can handle for free. This guide shows you how to set up a dual-model architecture where a small local model handles planning and routing, while your expensive API model only fires for the work that actually needs it. The result: the same output quality at a fraction of the cost.
What is a Router Model and Why Does it Matter?
In a multi-agent OpenClaw setup, not every LLM call does the same kind of work. There are two fundamentally different types of calls happening behind the scenes.
The first type is routing. When a message comes in, the system needs to figure out which agent should handle it, what tools to invoke, and whether the task needs to be broken into subtasks. This is the planner or router layer. It reads the incoming message, looks at the available agents and their SOUL.md definitions, and makes a decision. The output of a routing call is short: an agent name, a tool name, or a task breakdown. It does not generate paragraphs of text or complex analysis.
The second type is execution. Once the router decides where a task goes, the assigned agent actually does the work. A writer agent drafts a blog post. An SEO agent analyzes keyword data. A DevOps agent generates a deployment script. These calls need a powerful model because the output quality directly impacts the final result.
Here is the key insight: routing calls do not need a powerful model. A 3-billion-parameter model running locally on your machine can classify tasks and pick the right agent as accurately as GPT-4o for the vast majority of requests. But GPT-4o costs $2.50 per million input tokens. A local model costs nothing beyond the hardware it already runs on.
In a typical multi-agent pipeline, routing calls account for roughly 80% of all LLM invocations: the agent selection call, the tool planning call, the subtask decomposition call, and the validation call all happen before the actual execution. By moving these to a free local model, you eliminate 80% of your API spend without touching output quality.
Why a Cheap Local Model Works for Routing
Routing is a classification problem, not a generation problem. The router looks at an incoming message and answers a simple question: which agent handles this? That is fundamentally different from writing a 2,000-word blog post or analyzing a complex dataset.
Consider what the router actually needs to do. It receives a message like "write a blog post about Docker security best practices." It looks at the available agents: a writer agent, an SEO agent, a DevOps agent, and a PM agent. It needs to decide that the writer agent should handle this, possibly with input from the SEO agent for keywords. That decision does not require GPT-4o-level reasoning. A small model that understands the SOUL.md descriptions can make this call correctly.
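To make the classification framing concrete, here is a minimal Python sketch of the two pure pieces of a routing call: building the prompt and parsing the model's reply. The agent roster, prompt wording, and "pm" fallback are illustrative assumptions, not OpenClaw's actual internals; in a real setup the prompt would be sent to the local model via Ollama's HTTP API.

```python
import json

# Hypothetical agent roster; in a real OpenClaw setup these one-line
# descriptions would come from each agent's SOUL.md file.
AGENTS = {
    "writer": "Blog posts, articles, social media copy, email drafts",
    "seo": "Keyword research, SERP analysis, meta tag optimization",
    "devops": "Deployment scripts, CI/CD, infrastructure monitoring",
    "pm": "Task planning, prioritization, status summaries",
}

def build_routing_prompt(message: str) -> str:
    """Build a one-shot classification prompt asking for exactly one agent."""
    roster = "\n".join(f"- {name}: {desc}" for name, desc in AGENTS.items())
    return (
        "You are a router. Pick exactly one agent for the message below.\n"
        f"Available agents:\n{roster}\n"
        f"Message: {message}\n"
        'Reply with JSON only: {"agent": "<name>"}'
    )

def parse_routing_reply(reply: str) -> str:
    """Extract the chosen agent; fall back to 'pm' on malformed output."""
    try:
        agent = json.loads(reply).get("agent", "")
    except json.JSONDecodeError:
        return "pm"
    return agent if agent in AGENTS else "pm"

print(parse_routing_reply('{"agent": "writer"}'))  # writer
```

The fallback matters: small models occasionally emit malformed JSON, and a deterministic default keeps the pipeline moving instead of crashing on a routing call.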
Local models also have a massive advantage in latency for routing. A 3B model running on your machine responds in 50-200 milliseconds. A cloud API call takes 500-2000 milliseconds when you factor in network round trips, queue times, and rate limiting. Your routing decisions happen 5-10x faster with a local model, which means your agents start working sooner.
The combination of zero cost and lower latency makes local routing a clear win for any multi-agent setup that processes more than a handful of tasks per day.
Best Small Models for OpenClaw Routing
Not all small models are equal for routing tasks. You want a model that is fast, accurate at classification, good at following structured output formats, and light enough to run alongside your other services. Here are the three best options as of March 2026.
Qwen 2.5 3B - Best Overall Router
Qwen 2.5 3B from Alibaba is the sweet spot for most OpenClaw routing setups. It runs on 8GB RAM, handles structured JSON output reliably, and has strong instruction-following capabilities. It understands agent role descriptions well and makes accurate delegation decisions. Response time is typically 80-150ms on modern hardware. Install it with: ollama pull qwen2.5:3b
Gemma 2 2B - Lightest Option
Google's Gemma 2 2B is the best choice if you are running on constrained hardware like a Raspberry Pi 5 or a VPS with 4GB RAM. At 2 billion parameters, it is small enough to leave plenty of room for your agents and other services. Routing accuracy is slightly lower than Qwen 2.5 for complex multi-step planning, but for straightforward agent selection it performs well. Install it with: ollama pull gemma2:2b
Phi-3 Mini 3.8B - Best Reasoning
Microsoft's Phi-3 Mini at 3.8 billion parameters offers the best reasoning capability among small models. If your routing involves complex task decomposition where a vague request needs to be broken into 4-5 ordered subtasks assigned to different agents, Phi-3 handles that better than the smaller alternatives. It needs 16GB RAM to run comfortably alongside other services. Install it with: ollama pull phi3:mini
| Model | Parameters | RAM Needed | Response Time | Best For |
|---|---|---|---|---|
| Qwen 2.5 3B | 3B | 8GB | 80-150ms | General routing, most setups |
| Gemma 2 2B | 2B | 4GB | 50-100ms | Low-resource hardware, Raspberry Pi |
| Phi-3 Mini | 3.8B | 16GB | 100-200ms | Complex task decomposition |
How to Configure Dual-Model Routing in OpenClaw
Setting up a dual-model architecture in OpenClaw requires two things: a local Ollama instance running your router model, and a configuration change that tells OpenClaw to use different models for different purposes.
Step 1: Install Ollama and Pull Your Router Model
If you do not have Ollama installed yet, grab it from ollama.com. Then pull your chosen router model. For this guide, we will use Qwen 2.5 3B.
```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the router model
ollama pull qwen2.5:3b

# Verify the model is available
ollama list
```
Step 2: Configure OpenClaw for Dual-Model Setup
OpenClaw lets you set different models for the router (planner) and the execution layer. Use the config set command to point routing to your local Ollama instance while keeping your API model for execution.
```shell
# Set the router model to local Ollama
openclaw config set router.provider ollama
openclaw config set router.model qwen2.5:3b
openclaw config set router.endpoint http://localhost:11434

# Set the execution model to your API provider
openclaw config set execution.provider anthropic
openclaw config set execution.model claude-sonnet-4-20250514

# Verify your configuration
openclaw config list
```
Step 3: Define Clear Agent Boundaries in SOUL.md
The router model works best when each agent has a clearly defined role in its SOUL.md file. Ambiguous roles lead to routing mistakes regardless of model size. Make sure every agent has a distinct purpose, explicit boundaries, and a clear list of what it does and does not handle.
```
# Example: agents/writer/SOUL.md
# Role: Content Writer
# Handles: Blog posts, articles, social media copy, email drafts
# Does NOT handle: SEO analysis, keyword research, deployment, monitoring

# Example: agents/seo/SOUL.md
# Role: SEO Analyst
# Handles: Keyword research, SERP analysis, meta tag optimization, content audits
# Does NOT handle: Writing content, deploying code, monitoring infrastructure
```
When agent roles are clearly separated, even a 2B model can route correctly because the decision becomes straightforward pattern matching rather than nuanced reasoning.
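To illustrate why this becomes pattern matching, here is a small Python sketch that pulls the Handles / Does NOT handle lists out of SOUL.md-style text. The parsing is a hypothetical approximation of the format shown above, not OpenClaw's loader; the point is that clean, non-overlapping lists give the router an almost mechanical decision.

```python
def parse_soul(text: str) -> dict:
    """Pull the Handles / Does NOT handle lists out of SOUL.md-style text."""
    fields = {"handles": [], "avoids": []}
    for raw in text.splitlines():
        line = raw.strip().lstrip("# ").strip()
        if line.lower().startswith("handles:"):
            fields["handles"] = [t.strip().lower() for t in line.split(":", 1)[1].split(",")]
        elif line.lower().startswith("does not handle:"):
            fields["avoids"] = [t.strip().lower() for t in line.split(":", 1)[1].split(",")]
    return fields

writer = parse_soul(
    "# Role: Content Writer\n"
    "# Handles: Blog posts, articles, social media copy\n"
    "# Does NOT handle: SEO analysis, deployment\n"
)
print(writer["handles"][0])  # blog posts
```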
Step 4: Test the Setup
Run a few test messages through your agents and check that routing decisions are correct. The gateway logs will show which model handled each call.
```shell
# Send a test message
openclaw agent --agent orion --message "Write a blog post about Docker security"

# Check the gateway logs to verify routing
# You should see: router=qwen2.5:3b -> agent=writer -> model=claude-sonnet-4-20250514
openclaw gateway logs --tail 20
```
Cost Savings: Real Numbers
Let us calculate the actual savings with concrete numbers. We will use a realistic multi-agent setup: 5 agents (PM, Writer, SEO, DevOps, Support) processing 100 tasks per day.
Single Model Setup (Before)
With a single cloud model handling everything, every routing decision and every execution call goes through the API.
| Call Type | Calls/Day | Avg Tokens | Monthly Cost |
|---|---|---|---|
| Routing/Planning | 400 | ~500 tokens | $45/month |
| Execution | 100 | ~2,000 tokens | $15/month |
| Total | 500 | - | $60/month |
Dual Model Setup (After)
With a local router model, the 400 daily routing calls cost nothing. Only the 100 execution calls hit the API.
| Call Type | Calls/Day | Model | Monthly Cost |
|---|---|---|---|
| Routing/Planning | 400 | Qwen 2.5 3B (local) | $0/month |
| Execution | 100 | Claude Sonnet (API) | $12/month |
| Total | 500 | - | $12/month |
That is $60/month down to $12/month. An 80% reduction. Over a year, you save $576. The only cost is the electricity to run the local model, which is negligible on modern hardware.
If you scale up to 500 tasks per day, the savings become even more dramatic. A single-model setup at that volume costs roughly $300/month. The dual-model approach brings it down to $60/month. That is $2,880 saved per year.
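The arithmetic is easy to sanity-check. Using the article's own figures (illustrative examples, not live API pricing):

```python
# Figures from the worked example above (illustrative, not live API pricing)
before = 60.0   # $/month: single cloud model handles routing + execution
after = 12.0    # $/month: routing runs locally for free; only execution hits the API

savings_pct = round(100 * (before - after) / before)
annual_savings = (before - after) * 12

# Scaling linearly to 500 tasks/day, as in the paragraph above
annual_at_500 = (300.0 - 60.0) * 12

print(savings_pct, annual_savings, annual_at_500)  # 80 576.0 2880.0
```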
Architecture Diagram: How the Dual-Model Flow Works
Here is what happens when a message enters your OpenClaw system with dual-model routing configured:
Message arrives
User sends a task through Telegram, Slack, or the CLI.
Router model classifies
Qwen 2.5 3B (local, free) reads the message and available agent SOUL.md files. Decides which agent handles it. Takes 80-150ms.
Router plans subtasks
If the task is complex, the router breaks it into ordered subtasks and assigns each to the appropriate agent. Still local, still free.
Execution model generates
The assigned agent uses Claude or GPT (API, paid) to produce the actual output: the blog post, the analysis, the deployment script.
Result returned
The output goes back to the user through the same channel. The user sees the same quality output. The bill is 80% lower.
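The flow above can be sketched end to end in a few lines. Both functions here are hypothetical stubs standing in for the real calls: `route_locally()` for the Ollama router, `execute_via_api()` for the cloud model.

```python
def route_locally(message: str) -> str:
    """Steps 2-3: the free local router classifies the task (stubbed here)."""
    return "writer" if "blog post" in message.lower() else "pm"

def execute_via_api(agent: str, message: str) -> str:
    """Step 4: the paid API model produces the actual output (stubbed here)."""
    return f"[{agent}] draft for: {message}"

def handle(message: str) -> str:
    agent = route_locally(message)          # local, free, ~100 ms
    return execute_via_api(agent, message)  # one paid API call per task

print(handle("Write a blog post about Docker security"))
```

The structural point is that `handle()` makes exactly one paid call per task, no matter how many routing decisions precede it.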
Limitations and When to Skip Routing
Dual-model routing is not always the right choice. There are specific scenarios where you should stick with a single powerful model for everything.
Single-agent setups
If you only run one agent, there is nothing to route. Every message goes to the same agent. Adding a router model adds latency without saving anything. Dual-model routing only makes sense with 2 or more agents.
Ambiguous agent boundaries
If your agents have overlapping responsibilities and the correct routing depends on subtle context, a 3B model will make more mistakes than a large model. Fix this by tightening your SOUL.md definitions rather than upgrading the router model. Clear boundaries solve 90% of routing errors.
Very low task volume
If you process fewer than 10 tasks per day, the cost savings are minimal. At 10 tasks/day with a cloud model, you might spend $3-5/month on API calls. Setting up Ollama and maintaining a local model is not worth the effort to save $2-4/month. The break-even point is roughly 30 tasks per day.
Complex multi-step reasoning chains
If your routing requires the planner to reason through a 6-step chain where each step depends on the previous one, small models can lose track. For these workflows, use a mid-tier API model like Claude Haiku or GPT-4o-mini as the router instead of a local model. You still save money compared to using your top-tier model, just not as much.
Hardware constraints
Running a local model requires available RAM and CPU. If your server is already maxed out running your agents, databases, and other services, adding an Ollama instance might cause performance issues. A 3B model needs at least 4GB of free RAM. Check your resources before adding a local router.
Advanced: Cascading Model Strategy
Once you have dual-model routing working, you can take it further with a three-tier cascade. This approach uses different models at each level of complexity.
Tier 1 is the local router (Qwen 2.5 3B) handling all planning and delegation. Tier 2 is a mid-range API model like Claude Haiku or GPT-4o-mini for routine execution tasks like summarization, data extraction, and simple responses. Tier 3 is your premium model like Claude Sonnet or GPT-4o, reserved only for tasks that need the highest quality output: long-form writing, complex analysis, and code generation.
```shell
# Three-tier cascade configuration
openclaw config set router.provider ollama
openclaw config set router.model qwen2.5:3b

# Default execution model (mid-tier, cheaper)
openclaw config set execution.provider anthropic
openclaw config set execution.model claude-haiku

# Premium model for specific agents
openclaw config set agents.writer.model claude-sonnet-4-20250514
openclaw config set agents.analyst.model claude-sonnet-4-20250514
```
With a three-tier cascade, you can push savings even further. Routine tasks that make up 60-70% of execution calls go to the cheaper mid-tier model, and only the 30-40% of tasks that truly need premium quality use the expensive model. Combined with free local routing, total API costs can drop by 90% compared to running everything through a single premium model.
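The rough math behind that 90% figure, with the routine/premium split and the mid-tier price discount treated as explicit assumptions:

```python
baseline = 60.0        # $/month: everything on the single premium model
routing_share = 0.8    # routing calls, now moved to the free local router
exec_cost = baseline * (1 - routing_share)   # ~12.0 still on the premium model

# Assumptions: 65% of execution calls are routine, and the mid-tier model
# costs roughly 10% of the premium model's per-token price.
routine_frac, mid_tier_discount = 0.65, 0.10
cascade = exec_cost * (routine_frac * mid_tier_discount + (1 - routine_frac))
total_drop = 1 - cascade / baseline

print(round(cascade, 2), round(total_drop, 2))  # 4.98 0.92
```

Under these assumptions the monthly bill lands near $5, a roughly 92% drop, which is consistent with the "can drop by 90%" claim above.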
Monitoring Your Dual-Model Setup
After switching to dual-model routing, monitor two things: routing accuracy and cost reduction.
For routing accuracy, check the gateway logs for misrouted tasks. A misrouted task is one where the router sends a message to the wrong agent, and the agent either fails or produces an off-topic response. If your misroute rate is above 5%, your SOUL.md definitions probably have overlapping responsibilities. Tighten the boundaries.
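A misroute rate is simple to compute once you have (expected, actual) pairs. How you extract those pairs from your gateway logs depends on the log format, so the data here is a hypothetical sample:

```python
# Hypothetical sample of (expected, actual) routing decisions; in practice
# you would extract these pairs from your gateway logs.
decisions = [
    ("writer", "writer"),
    ("seo", "seo"),
    ("devops", "writer"),   # misroute: a deployment task went to the writer
    ("writer", "writer"),
]

misroutes = sum(1 for expected, actual in decisions if expected != actual)
rate = misroutes / len(decisions)
print(f"misroute rate: {rate:.0%}")  # misroute rate: 25%
```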
```shell
# Check routing decisions in logs
openclaw gateway logs --filter router

# Monitor API costs (only execution calls should show up)
openclaw gateway logs --filter provider=anthropic --tail 50
```
For cost tracking, compare your API provider dashboard before and after the switch. You should see an immediate drop in total API calls. The calls that remain should be fewer but with higher average token counts, because only execution calls (which generate longer outputs) are hitting the API.
The Bottom Line
Running a single expensive model for every LLM call in your multi-agent setup is like hiring a senior engineer to sort your mail. It works, but it is a waste of talent and money.
A local router model handles the planning and delegation work for free, while your premium API model focuses on what it does best: generating high-quality output. The setup takes 15 minutes, saves 80% on API costs, and often improves response times because local routing is faster than cloud API calls.
Start with Qwen 2.5 3B as your router, keep Claude or GPT for execution, and tighten your SOUL.md files so the router has clear boundaries to work with. Your agents will produce the same quality output. Your API bill will not.
Frequently Asked Questions
What is a router model in OpenClaw?
A router model (also called a planner model) is a small, fast LLM that decides which agent should handle a task, what tools to call, and how to break down complex requests into steps. It does not generate the final output. It only makes routing decisions. This means it does not need to be a powerful model. A 3B parameter local model running on Ollama can handle routing just as well as GPT-4o for most workloads.
Can I use a local router model with a cloud execution model?
Yes, this is exactly the dual-model setup described in this guide. You run a small model like Phi-3 Mini or Qwen 2.5 3B locally through Ollama as your router, and use Claude or GPT-4o through their APIs as your execution model. The router handles task planning and delegation at zero cost, while the expensive API model only runs when it needs to generate high-quality output.
Which local model is best for routing in OpenClaw?
For most setups, Qwen 2.5 3B offers the best balance of speed and accuracy for routing decisions. It runs comfortably on 8GB RAM and handles multi-agent delegation well. If you have limited hardware (4GB RAM), Gemma 2 2B is lighter but still capable. If you have 16GB or more, Phi-3 Mini 3.8B gives slightly better reasoning for complex routing scenarios. All three work well as OpenClaw routers.
How much can I actually save with a local router model?
In a typical multi-agent setup processing 100 tasks per day, roughly 80% of LLM calls are routing decisions (task classification, agent selection, tool planning) and only 20% are execution calls that generate the final output. By moving routing to a free local model, you eliminate 80% of your API costs. For a setup that costs $60/month with a single cloud model, the dual-model approach brings that down to approximately $12/month.
Will routing quality drop if I use a small local model?
For straightforward multi-agent setups with clear agent roles, no. Small models handle if-then routing decisions very well. Where quality can drop is in ambiguous scenarios where the task could reasonably go to multiple agents, or in complex multi-step planning where the router needs to decompose a vague request into an ordered sequence of subtasks. If your agents have clearly defined SOUL.md boundaries, a 3B model routes accurately over 95% of the time.