Use a Local Model as OpenClaw Router to Cut API Costs by 80%
Most OpenClaw users run every LLM call through a single cloud model like Claude or GPT-4o. That works, but it is expensive. Around 80% of those calls are routing decisions that a tiny local model can handle for free. This guide shows you how to set up a dual-model architecture where a small local model handles planning and routing, while your expensive API model only fires for the work that actually needs it. The result: the same output quality at a fraction of the cost.
What is a Router Model and Why Does it Matter?
In a multi-agent OpenClaw setup, not every LLM call does the same kind of work. There are two fundamentally different types of calls happening behind the scenes.
The first type is routing. When a message comes in, the system needs to figure out which agent should handle it, what tools to invoke, and whether the task needs to be broken into subtasks. This is the planner or router layer. It reads the incoming message, looks at the available agents and their SOUL.md definitions, and makes a decision. The output of a routing call is short: an agent name, a tool name, or a task breakdown. It does not generate paragraphs of text or complex analysis.
The second type is execution. Once the router decides where a task goes, the assigned agent actually does the work. A writer agent drafts a blog post. An SEO agent analyzes keyword data. A DevOps agent generates a deployment script. These calls need a powerful model because the output quality directly impacts the final result.
Here is the key insight: routing calls do not need a powerful model. A 3-billion-parameter model running locally on your machine can classify tasks and pick the right agent as accurately as GPT-4o for the vast majority of requests. But GPT-4o costs $2.50 per million input tokens. A local model costs nothing beyond the hardware it already runs on.
In a typical multi-agent pipeline, routing calls account for roughly 80% of all LLM invocations: the agent selection call, the tool planning call, the subtask decomposition call, and the validation call all happen before the actual execution. By moving these to a free local model, you eliminate 80% of your API spend without touching output quality.
Why a Cheap Local Model Works for Routing
Routing is a classification problem, not a generation problem. The router looks at an incoming message and answers a simple question: which agent handles this? That is fundamentally different from writing a 2,000-word blog post or analyzing a complex dataset.
Consider what the router actually needs to do. It receives a message like "write a blog post about Docker security best practices." It looks at the available agents: a writer agent, an SEO agent, a DevOps agent, and a PM agent. It needs to decide that the writer agent should handle this, possibly with input from the SEO agent for keywords. That decision does not require GPT-4o-level reasoning. A small model that understands the SOUL.md descriptions can make this call correctly.
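To make the classification framing concrete, here is a minimal Python sketch of the two pure pieces of a routing call: building the prompt and parsing the model's reply. The agent roster, prompt wording, and "pm" fallback are illustrative assumptions, not OpenClaw's actual internals; in a real setup the prompt would be sent to the local model via Ollama's HTTP API.

```python
import json

# Hypothetical agent roster; in a real OpenClaw setup these one-line
# descriptions would come from each agent's SOUL.md file.
AGENTS = {
    "writer": "Blog posts, articles, social media copy, email drafts",
    "seo": "Keyword research, SERP analysis, meta tag optimization",
    "devops": "Deployment scripts, CI/CD, infrastructure monitoring",
    "pm": "Task planning, prioritization, status summaries",
}

def build_routing_prompt(message: str) -> str:
    """Build a one-shot classification prompt asking for exactly one agent."""
    roster = "\n".join(f"- {name}: {desc}" for name, desc in AGENTS.items())
    return (
        "You are a router. Pick exactly one agent for the message below.\n"
        f"Available agents:\n{roster}\n"
        f"Message: {message}\n"
        'Reply with JSON only: {"agent": "<name>"}'
    )

def parse_routing_reply(reply: str) -> str:
    """Extract the chosen agent; fall back to 'pm' on malformed output."""
    try:
        agent = json.loads(reply).get("agent", "")
    except json.JSONDecodeError:
        return "pm"
    return agent if agent in AGENTS else "pm"

print(parse_routing_reply('{"agent": "writer"}'))  # writer
```

The fallback matters: small models occasionally emit malformed JSON, and a deterministic default keeps the pipeline moving instead of crashing on a routing call.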
Local models also have a massive advantage in latency for routing. A 3B model running on your machine responds in 50-200 milliseconds. A cloud API call takes 500-2000 milliseconds when you factor in network round trips, queue times, and rate limiting. Your routing decisions happen 5-10x faster with a local model, which means your agents start working sooner.
The combination of zero cost and lower latency makes local routing a clear win for any multi-agent setup that processes more than a handful of tasks per day.
Best Small Models for OpenClaw Routing
Not all small models are equal for routing tasks. You want a model that is fast, accurate at classification, good at following structured output formats, and light enough to run alongside your other services. Here are the three best options as of March 2026.
Qwen 2.5 3B - Best Overall Router
Qwen 2.5 3B from Alibaba is the sweet spot for most OpenClaw routing setups. It runs on 8GB RAM, handles structured JSON output reliably, and has strong instruction-following capabilities. It understands agent role descriptions well and makes accurate delegation decisions. Response time is typically 80-150ms on modern hardware. Install it with: ollama pull qwen2.5:3b
Gemma 2 2B - Lightest Option
Google's Gemma 2 2B is the best choice if you are running on constrained hardware like a Raspberry Pi 5 or a VPS with 4GB RAM. At 2 billion parameters, it is small enough to leave plenty of room for your agents and other services. Routing accuracy is slightly lower than Qwen 2.5 for complex multi-step planning, but for straightforward agent selection it performs well. Install it with: ollama pull gemma2:2b
Phi-3 Mini 3.8B - Best Reasoning
Microsoft's Phi-3 Mini at 3.8 billion parameters offers the best reasoning capability among small models. If your routing involves complex task decomposition where a vague request needs to be broken into 4-5 ordered subtasks assigned to different agents, Phi-3 handles that better than the smaller alternatives. It needs 16GB RAM to run comfortably alongside other services. Install it with: ollama pull phi3:mini
| Model | Parameters | RAM Needed | Response Time | Best For |
|---|---|---|---|---|
| Qwen 2.5 3B | 3B | 8GB | 80-150ms | General routing, most setups |
| Gemma 2 2B | 2B | 4GB | 50-100ms | Low-resource hardware, Raspberry Pi |
| Phi-3 Mini | 3.8B | 16GB | 100-200ms | Complex task decomposition |
How to Configure Dual-Model Routing in OpenClaw
Setting up a dual-model architecture in OpenClaw requires two things: a local Ollama instance running your router model, and a configuration change that tells OpenClaw to use different models for different purposes.
Step 1: Install Ollama and Pull Your Router Model
If you do not have Ollama installed yet, grab it from ollama.com. Then pull your chosen router model. For this guide, we will use Qwen 2.5 3B.
```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the router model
ollama pull qwen2.5:3b

# Verify the model is available
ollama list
```
Step 2: Configure OpenClaw for Dual-Model Setup
OpenClaw lets you set different models for the router (planner) and the execution layer. Use the config set command to point routing to your local Ollama instance while keeping your API model for execution.
```shell
# Set the router model to local Ollama
openclaw config set router.provider ollama
openclaw config set router.model qwen2.5:3b
openclaw config set router.endpoint http://localhost:11434

# Set the execution model to your API provider
openclaw config set execution.provider anthropic
openclaw config set execution.model claude-sonnet-4-20250514

# Verify your configuration
openclaw config list
```
Step 3: Define Clear Agent Boundaries in SOUL.md
The router model works best when each agent has a clearly defined role in its SOUL.md file. Ambiguous roles lead to routing mistakes regardless of model size. Make sure every agent has a distinct purpose, explicit boundaries, and a clear list of what it does and does not handle.
```
# Example: agents/writer/SOUL.md
# Role: Content Writer
# Handles: Blog posts, articles, social media copy, email drafts
# Does NOT handle: SEO analysis, keyword research, deployment, monitoring

# Example: agents/seo/SOUL.md
# Role: SEO Analyst
# Handles: Keyword research, SERP analysis, meta tag optimization, content audits
# Does NOT handle: Writing content, deploying code, monitoring infrastructure
```
When agent roles are clearly separated, even a 2B model can route correctly because the decision becomes straightforward pattern matching rather than nuanced reasoning.
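To illustrate why this becomes pattern matching, here is a small Python sketch that pulls the Handles / Does NOT handle lists out of SOUL.md-style text. The parsing is a hypothetical approximation of the format shown above, not OpenClaw's loader; the point is that clean, non-overlapping lists give the router an almost mechanical decision.

```python
def parse_soul(text: str) -> dict:
    """Pull the Handles / Does NOT handle lists out of SOUL.md-style text."""
    fields = {"handles": [], "avoids": []}
    for raw in text.splitlines():
        line = raw.strip().lstrip("# ").strip()
        if line.lower().startswith("handles:"):
            fields["handles"] = [t.strip().lower() for t in line.split(":", 1)[1].split(",")]
        elif line.lower().startswith("does not handle:"):
            fields["avoids"] = [t.strip().lower() for t in line.split(":", 1)[1].split(",")]
    return fields

writer = parse_soul(
    "# Role: Content Writer\n"
    "# Handles: Blog posts, articles, social media copy\n"
    "# Does NOT handle: SEO analysis, deployment\n"
)
print(writer["handles"][0])  # blog posts
```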
Step 4: Test the Setup
Run a few test messages through your agents and check that routing decisions are correct. The gateway logs will show which model handled each call.
```shell
# Send a test message
openclaw agent --agent orion --message "Write a blog post about Docker security"

# Check the gateway logs to verify routing
# You should see: router=qwen2.5:3b -> agent=writer -> model=claude-sonnet-4-20250514
openclaw gateway logs --tail 20
```
Cost Savings: Real Numbers
Let us calculate the actual savings with concrete numbers. We will use a realistic multi-agent setup: 5 agents (PM, Writer, SEO, DevOps, Support) processing 100 tasks per day.
Single Model Setup (Before)
With a single cloud model handling everything, every routing decision and every execution call goes through the API.
| Call Type | Calls/Day | Avg Tokens | Monthly Cost |
|---|---|---|---|
| Routing/Planning | 400 | ~500 tokens | $45/month |
| Execution | 100 | ~2,000 tokens | $15/month |
| Total | 500 | - | $60/month |
Dual Model Setup (After)
With a local router model, the 400 daily routing calls cost nothing. Only the 100 execution calls hit the API.
| Call Type | Calls/Day | Model | Monthly Cost |
|---|---|---|---|
| Routing/Planning | 400 | Qwen 2.5 3B (local) | $0/month |
| Execution | 100 | Claude Sonnet (API) | $12/month |
| Total | 500 | - | $12/month |
That is $60/month down to $12/month. An 80% reduction. Over a year, you save $576. The only cost is the electricity to run the local model, which is negligible on modern hardware.
If you scale up to 500 tasks per day, the savings become even more dramatic. A single-model setup at that volume costs roughly $300/month. The dual-model approach brings it down to $60/month. That is $2,880 saved per year.
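The arithmetic is easy to sanity-check. Using the article's own figures (illustrative examples, not live API pricing):

```python
# Figures from the worked example above (illustrative, not live API pricing)
before = 60.0   # $/month: single cloud model handles routing + execution
after = 12.0    # $/month: routing runs locally for free; only execution hits the API

savings_pct = round(100 * (before - after) / before)
annual_savings = (before - after) * 12

# Scaling linearly to 500 tasks/day, as in the paragraph above
annual_at_500 = (300.0 - 60.0) * 12

print(savings_pct, annual_savings, annual_at_500)  # 80 576.0 2880.0
```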
Architecture Diagram: How the Dual-Model Flow Works
Here is what happens when a message enters your OpenClaw system with dual-model routing configured:
Message arrives
User sends a task through Telegram, Slack, or the CLI.
Router model classifies
Qwen 2.5 3B (local, free) reads the message and available agent SOUL.md files. Decides which agent handles it. Takes 80-150ms.
Router plans subtasks
If the task is complex, the router breaks it into ordered subtasks and assigns each to the appropriate agent. Still local, still free.
Execution model generates
The assigned agent uses Claude or GPT (API, paid) to produce the actual output: the blog post, the analysis, the deployment script.
Result returned
The output goes back to the user through the same channel. The user sees the same quality output. The bill is 80% lower.
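The flow above can be sketched end to end in a few lines. Both functions here are hypothetical stubs standing in for the real calls: `route_locally()` for the Ollama router, `execute_via_api()` for the cloud model.

```python
def route_locally(message: str) -> str:
    """Steps 2-3: the free local router classifies the task (stubbed here)."""
    return "writer" if "blog post" in message.lower() else "pm"

def execute_via_api(agent: str, message: str) -> str:
    """Step 4: the paid API model produces the actual output (stubbed here)."""
    return f"[{agent}] draft for: {message}"

def handle(message: str) -> str:
    agent = route_locally(message)          # local, free, ~100 ms
    return execute_via_api(agent, message)  # one paid API call per task

print(handle("Write a blog post about Docker security"))
```

The structural point is that `handle()` makes exactly one paid call per task, no matter how many routing decisions precede it.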
Limitations and When to Skip Routing
Dual-model routing is not always the right choice. There are specific scenarios where you should stick with a single powerful model for everything.
Single-agent setups
If you only run one agent, there is nothing to route. Every message goes to the same agent. Adding a router model adds latency without saving anything. Dual-model routing only makes sense with 2 or more agents.
Ambiguous agent boundaries
If your agents have overlapping responsibilities and the correct routing depends on subtle context, a 3B model will make more mistakes than a large model. Fix this by tightening your SOUL.md definitions rather than upgrading the router model. Clear boundaries solve 90% of routing errors.
Very low task volume
If you process fewer than 10 tasks per day, the cost savings are minimal. At 10 tasks/day with a cloud model, you might spend $3-5/month on API calls. Setting up Ollama and maintaining a local model is not worth the effort to save $2-4/month. The break-even point is roughly 30 tasks per day.
Complex multi-step reasoning chains
If your routing requires the planner to reason through a 6-step chain where each step depends on the previous one, small models can lose track. For these workflows, use a mid-tier API model like Claude Haiku or GPT-4o-mini as the router instead of a local model. You still save money compared to using your top-tier model, just not as much.
Hardware constraints
Running a local model requires available RAM and CPU. If your server is already maxed out running your agents, databases, and other services, adding an Ollama instance might cause performance issues. A 3B model needs at least 4GB of free RAM. Check your resources before adding a local router.
Advanced: Cascading Model Strategy
Once you have dual-model routing working, you can take it further with a three-tier cascade. This approach uses different models at each level of complexity.
Tier 1 is the local router (Qwen 2.5 3B) handling all planning and delegation. Tier 2 is a mid-range API model like Claude Haiku or GPT-4o-mini for routine execution tasks like summarization, data extraction, and simple responses. Tier 3 is your premium model like Claude Sonnet or GPT-4o, reserved only for tasks that need the highest quality output: long-form writing, complex analysis, and code generation.
```shell
# Three-tier cascade configuration
openclaw config set router.provider ollama
openclaw config set router.model qwen2.5:3b

# Default execution model (mid-tier, cheaper)
openclaw config set execution.provider anthropic
openclaw config set execution.model claude-haiku

# Premium model for specific agents
openclaw config set agents.writer.model claude-sonnet-4-20250514
openclaw config set agents.analyst.model claude-sonnet-4-20250514
```
With a three-tier cascade, you can push savings even further. Routine tasks that make up 60-70% of execution calls go to the cheaper mid-tier model, and only the 30-40% of tasks that truly need premium quality use the expensive model. Combined with free local routing, total API costs can drop by 90% compared to running everything through a single premium model.
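The rough math behind that 90% figure, with the routine/premium split and the mid-tier price discount treated as explicit assumptions:

```python
baseline = 60.0        # $/month: everything on the single premium model
routing_share = 0.8    # routing calls, now moved to the free local router
exec_cost = baseline * (1 - routing_share)   # ~12.0 still on the premium model

# Assumptions: 65% of execution calls are routine, and the mid-tier model
# costs roughly 10% of the premium model's per-token price.
routine_frac, mid_tier_discount = 0.65, 0.10
cascade = exec_cost * (routine_frac * mid_tier_discount + (1 - routine_frac))
total_drop = 1 - cascade / baseline

print(round(cascade, 2), round(total_drop, 2))  # 4.98 0.92
```

Under these assumptions the monthly bill lands near $5, a roughly 92% drop, which is consistent with the "can drop by 90%" claim above.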
Monitoring Your Dual-Model Setup
After switching to dual-model routing, monitor two things: routing accuracy and cost reduction.
For routing accuracy, check the gateway logs for misrouted tasks. A misrouted task is one where the router sends a message to the wrong agent, and the agent either fails or produces an off-topic response. If your misroute rate is above 5%, your SOUL.md definitions probably have overlapping responsibilities. Tighten the boundaries.
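A misroute rate is simple to compute once you have (expected, actual) pairs. How you extract those pairs from your gateway logs depends on the log format, so the data here is a hypothetical sample:

```python
# Hypothetical sample of (expected, actual) routing decisions; in practice
# you would extract these pairs from your gateway logs.
decisions = [
    ("writer", "writer"),
    ("seo", "seo"),
    ("devops", "writer"),   # misroute: a deployment task went to the writer
    ("writer", "writer"),
]

misroutes = sum(1 for expected, actual in decisions if expected != actual)
rate = misroutes / len(decisions)
print(f"misroute rate: {rate:.0%}")  # misroute rate: 25%
```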
```shell
# Check routing decisions in logs
openclaw gateway logs --filter router

# Monitor API costs (only execution calls should show up)
openclaw gateway logs --filter provider=anthropic --tail 50
```
For cost tracking, compare your API provider dashboard before and after the switch. You should see an immediate drop in total API calls. The calls that remain should be fewer but with higher average token counts, because only execution calls (which generate longer outputs) are hitting the API.
The Bottom Line
Running a single expensive model for every LLM call in your multi-agent setup is like hiring a senior engineer to sort your mail. It works, but it is a waste of talent and money.
A local router model handles the planning and delegation work for free, while your premium API model focuses on what it does best: generating high-quality output. The setup takes 15 minutes, saves 80% on API costs, and often improves response times because local routing is faster than cloud API calls.
Start with Qwen 2.5 3B as your router, keep Claude or GPT for execution, and tighten your SOUL.md files so the router has clear boundaries to work with. Your agents will produce the same quality output. Your API bill will not.
Frequently Asked Questions
What is a router model in OpenClaw?
A router model (also called a planner model) is a small, fast LLM that decides which agent should handle a task, what tools to call, and how to break down complex requests into steps. It does not generate the final output. It only makes routing decisions. This means it does not need to be a powerful model. A 3B parameter local model running on Ollama can handle routing just as well as GPT-4o for most workloads.
Can I use a local router model with a cloud execution model?
Yes, this is exactly the dual-model setup described in this guide. You run a small model like Phi-3 Mini or Qwen 2.5 3B locally through Ollama as your router, and use Claude or GPT-4o through their APIs as your execution model. The router handles task planning and delegation at zero cost, while the expensive API model only runs when it needs to generate high-quality output.
Which local model is best for routing in OpenClaw?
For most setups, Qwen 2.5 3B offers the best balance of speed and accuracy for routing decisions. It runs comfortably on 8GB RAM and handles multi-agent delegation well. If you have limited hardware (4GB RAM), Gemma 2 2B is lighter but still capable. If you have 16GB or more, Phi-3 Mini 3.8B gives slightly better reasoning for complex routing scenarios. All three work well as OpenClaw routers.
How much can I actually save with a local router model?
In a typical multi-agent setup processing 100 tasks per day, roughly 80% of LLM calls are routing decisions (task classification, agent selection, tool planning) and only 20% are execution calls that generate the final output. By moving routing to a free local model, you eliminate 80% of your API costs. For a setup that costs $60/month with a single cloud model, the dual-model approach brings that down to approximately $12/month.
Will routing quality drop if I use a small local model?
For straightforward multi-agent setups with clear agent roles, no. Small models handle if-then routing decisions very well. Where quality can drop is in ambiguous scenarios where the task could reasonably go to multiple agents, or in complex multi-step planning where the router needs to decompose a vague request into an ordered sequence of subtasks. If your agents have clearly defined SOUL.md boundaries, a 3B model routes accurately over 95% of the time.