Local Models (Ollama, vLLM, LocalAI)

Table of contents

Run docker-agent with locally hosted models for privacy, offline use, or cost savings.

Overview

docker-agent can connect to any OpenAI-compatible local model server. This guide covers the most popular options:

Ollama — Easy-to-use local model runner
vLLM — High-performance inference server
LocalAI — OpenAI-compatible API for various backends

Tip
Docker Model Runner
For the easiest local model experience, consider Docker Model Runner which is built into Docker Desktop and requires no additional setup.

Ollama

Ollama is a popular tool for running LLMs locally. docker-agent includes a built-in ollama alias for easy configuration.

Setup

Install Ollama from ollama.ai

Pull a model:

ollama pull llama3.2
ollama pull qwen2.5-coder

Start the Ollama server (usually runs automatically):
ollama serve

Configuration

Use the built-in ollama alias:

agents:
  root:
    model: ollama/llama3.2
    description: Local assistant
    instruction: You are a helpful assistant.

The ollama alias automatically uses:

Base URL: http://localhost:11434/v1
API Type: OpenAI-compatible
No API key required

Custom Port or Host

If Ollama runs on a different host or port:

models:
  my_ollama:
    provider: ollama
    model: llama3.2
    base_url: http://192.168.1.100:11434/v1

agents:
  root:
    model: my_ollama
    description: Remote Ollama assistant
    instruction: You are a helpful assistant.

Popular Ollama Models

Model	Size	Best For
`llama3.2`	3B	General purpose, fast
`llama3.1`	8B	Better reasoning
`qwen2.5-coder`	7B	Code generation
`mistral`	7B	General purpose
`codellama`	7B	Code tasks
`deepseek-coder`	6.7B	Code generation

vLLM

vLLM is a high-performance inference server optimized for throughput.

Setup

# Install vLLM
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000

Configuration

providers:
  vllm:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1

agents:
  root:
    model: vllm/meta-llama/Llama-3.2-3B-Instruct
    description: vLLM-powered assistant
    instruction: You are a helpful assistant.

LocalAI

LocalAI provides an OpenAI-compatible API that works with various backends.

Setup

# Run with Docker
docker run -p 8080:8080 --name local-ai \
  -v ./models:/models \
  localai/localai:latest-cpu

Configuration

providers:
  localai:
    api_type: openai_chatcompletions
    base_url: http://localhost:8080/v1

agents:
  root:
    model: localai/gpt4all-j
    description: LocalAI assistant
    instruction: You are a helpful assistant.

Generic Custom Provider

For any OpenAI-compatible server:

providers:
  my_server:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1
    # token_key: MY_API_KEY  # if auth required

agents:
  root:
    model: my_server/model-name
    description: Custom server assistant
    instruction: You are a helpful assistant.

Performance Tips

Note
Local Model Considerations
Memory: Larger models need more RAM/VRAM. A 7B model typically needs 8-16GB RAM.
GPU: GPU acceleration dramatically improves speed. Check your server's GPU support.
Context length: Local models often have smaller context windows than cloud models.
Tool calling: Not all local models support function/tool calling. Test your model's capabilities.

Example: Offline Development Agent

agents:
  developer:
    model: ollama/qwen2.5-coder
    description: Offline code assistant
    instruction: |
      You are a software developer working offline.
      Focus on code quality and clear explanations.
    max_iterations: 20
    toolsets:
      - type: filesystem
      - type: shell
      - type: think
      - type: todo

Troubleshooting

Connection Refused

Ensure your model server is running and accessible:

curl http://localhost:11434/v1/models  # Ollama
curl http://localhost:8000/v1/models   # vLLM

Model Not Found

Verify the model is downloaded/available:

ollama list  # List available Ollama models

Slow Responses

Check if GPU acceleration is enabled
Try a smaller model
Reduce max_tokens in your config

What can I help you with?

Local Models (Ollama, vLLM, LocalAI)