Local Models (Ollama, vLLM, LocalAI)
Run docker-agent with locally hosted models for privacy, offline use, or cost savings.
Overview
docker-agent can connect to any OpenAI-compatible local model server. This guide covers the most popular options:
- Ollama — Easy-to-use local model runner
- vLLM — High-performance inference server
- LocalAI — OpenAI-compatible API for various backends
TipDocker Model Runner
For the easiest local model experience, consider Docker Model Runner which is built into Docker Desktop and requires no additional setup.
Ollama
Ollama is a popular tool for running LLMs locally. docker-agent includes a built-in ollama alias for easy configuration.
Setup
Install Ollama from ollama.ai
Pull a model:
ollama pull llama3.2 ollama pull qwen2.5-coderStart the Ollama server (usually runs automatically):
ollama serve
Configuration
Use the built-in ollama alias:
agents:
root:
model: ollama/llama3.2
description: Local assistant
instruction: You are a helpful assistant.The ollama alias automatically uses:
- Base URL:
http://localhost:11434/v1 - API Type: OpenAI-compatible
- No API key required
Custom Port or Host
If Ollama runs on a different host or port:
models:
my_ollama:
provider: ollama
model: llama3.2
base_url: http://192.168.1.100:11434/v1
agents:
root:
model: my_ollama
description: Remote Ollama assistant
instruction: You are a helpful assistant.Popular Ollama Models
| Model | Size | Best For |
|---|---|---|
llama3.2 | 3B | General purpose, fast |
llama3.1 | 8B | Better reasoning |
qwen2.5-coder | 7B | Code generation |
mistral | 7B | General purpose |
codellama | 7B | Code tasks |
deepseek-coder | 6.7B | Code generation |
vLLM
vLLM is a high-performance inference server optimized for throughput.
Setup
# Install vLLM
pip install vllm
# Start the server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--port 8000Configuration
providers:
vllm:
api_type: openai_chatcompletions
base_url: http://localhost:8000/v1
agents:
root:
model: vllm/meta-llama/Llama-3.2-3B-Instruct
description: vLLM-powered assistant
instruction: You are a helpful assistant.LocalAI
LocalAI provides an OpenAI-compatible API that works with various backends.
Setup
# Run with Docker
docker run -p 8080:8080 --name local-ai \
-v ./models:/models \
localai/localai:latest-cpuConfiguration
providers:
localai:
api_type: openai_chatcompletions
base_url: http://localhost:8080/v1
agents:
root:
model: localai/gpt4all-j
description: LocalAI assistant
instruction: You are a helpful assistant.Generic Custom Provider
For any OpenAI-compatible server:
providers:
my_server:
api_type: openai_chatcompletions
base_url: http://localhost:8000/v1
# token_key: MY_API_KEY # if auth required
agents:
root:
model: my_server/model-name
description: Custom server assistant
instruction: You are a helpful assistant.Performance Tips
NoteLocal Model Considerations
- Memory: Larger models need more RAM/VRAM. A 7B model typically needs 8-16GB RAM.
- GPU: GPU acceleration dramatically improves speed. Check your server's GPU support.
- Context length: Local models often have smaller context windows than cloud models.
- Tool calling: Not all local models support function/tool calling. Test your model's capabilities.
Example: Offline Development Agent
agents:
developer:
model: ollama/qwen2.5-coder
description: Offline code assistant
instruction: |
You are a software developer working offline.
Focus on code quality and clear explanations.
max_iterations: 20
toolsets:
- type: filesystem
- type: shell
- type: think
- type: todoTroubleshooting
Connection Refused
Ensure your model server is running and accessible:
curl http://localhost:11434/v1/models # Ollama
curl http://localhost:8000/v1/models # vLLMModel Not Found
Verify the model is downloaded/available:
ollama list # List available Ollama modelsSlow Responses
- Check if GPU acceleration is enabled
- Try a smaller model
- Reduce
max_tokensin your config