
Inference engines

Docker Model Runner supports three inference engines: llama.cpp, vLLM, and Diffusers. Each engine has different strengths, supported platforms, and model format requirements. This guide helps you choose the right engine and configure it for your use case.

Engine comparison

| Feature | llama.cpp | vLLM | Diffusers |
|---|---|---|---|
| Model formats | GGUF | Safetensors, HuggingFace | DDUF |
| Platforms | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
| GPU support | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
| CPU inference | Yes | No | No |
| Quantization | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
| Memory efficiency | High (with quantization) | Moderate | Moderate |
| Throughput | Good | High (with batching) | Good |
| Best for | Local development, resource-constrained environments | Production, high throughput | Image generation |
| Use case | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |

llama.cpp

llama.cpp is the default inference engine in Docker Model Runner. It's designed for efficient local inference and supports a wide range of hardware configurations.

Platform support

| Platform | GPU support | Notes |
|---|---|---|
| macOS (Apple Silicon) | Metal | Automatic GPU acceleration |
| Windows (x64) | NVIDIA CUDA | Requires NVIDIA drivers 576.57+ |
| Windows (ARM64) | Adreno OpenCL | Qualcomm 6xx series and later |
| Linux (x64) | NVIDIA, AMD, Vulkan | Multiple backend options |
| Linux | CPU only | Works on any x64/ARM64 system |

Model format: GGUF

llama.cpp uses the GGUF format, which supports efficient quantization for reduced memory usage without significant quality loss.

Quantization levels

| Quantization | Bits per weight | Memory usage | Quality |
|---|---|---|---|
| Q2_K | ~2.5 | Lowest | Reduced |
| Q3_K_M | ~3.5 | Very low | Acceptable |
| Q4_K_M | ~4.5 | Low | Good |
| Q5_K_M | ~5.5 | Moderate | Excellent |
| Q6_K | ~6.5 | Higher | Excellent |
| Q8_0 | 8 | High | Near-original |
| F16 | 16 | Highest | Original |
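
As a rough estimate (ignoring KV cache and runtime overhead, so treat it as a lower bound), weight memory is approximately parameter count × bits per weight ÷ 8. For example, a 3B-parameter model at Q4_K_M (~4.5 bits per weight) needs roughly 3 × 10⁹ × 4.5 ÷ 8 ≈ 1.7 GB for its weights, versus about 6 GB at F16.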

Recommended: Q4_K_M offers the best balance of quality and memory usage for most use cases.

Pulling quantized models

Models on Docker Hub often include quantization in the tag:

$ docker model pull ai/llama3.2:3B-Q4_K_M
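
To confirm the download and check the on-disk size of the quantization you pulled, list your local models (exact output varies by version):

$ docker model list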

Using llama.cpp

llama.cpp is the default engine. No special configuration is required:

$ docker model run ai/smollm2

To explicitly specify llama.cpp when running models:

$ docker model run ai/smollm2 --backend llama.cpp

llama.cpp API endpoints

When using llama.cpp, API calls use the llama.cpp engine path:

POST /engines/llama.cpp/v1/chat/completions

Or, without specifying an engine (Model Runner selects one automatically):

POST /engines/v1/chat/completions
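
For example, a minimal chat completion request against the llama.cpp path, assuming the default host-side TCP port 12434 (the same endpoint used in the Diffusers example later on this page) and the ai/smollm2 model pulled above:

$ curl -s -X POST http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Write a one-line haiku about containers."}
    ]
  }'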

vLLM

vLLM is a high-performance inference engine optimized for production workloads with high throughput requirements.

Platform support

| Platform | GPU | Support status |
|---|---|---|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Windows with WSL2 | NVIDIA CUDA | Supported (Docker Desktop 4.54+) |
| macOS | - | Not supported |
| Linux ARM64 | - | Not supported |
| AMD GPUs | - | Not supported |

Important

vLLM requires an NVIDIA GPU with CUDA support. It does not support CPU-only inference.

Model format: Safetensors

vLLM works with models in Safetensors format, which is the standard format for HuggingFace models. These models typically use more memory than quantized GGUF models but may offer better quality and faster inference on powerful hardware.

Setting up vLLM

Docker Engine (Linux)

Install the Model Runner with vLLM backend:

$ docker model install-runner --backend vllm --gpu cuda

Verify the installation:

$ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: c22473b
vllm: running vllm version: 0.11.0

Docker Desktop (Windows with WSL2)

  1. Ensure you have:

    • Docker Desktop 4.54 or later
    • NVIDIA GPU with updated drivers
    • WSL2 enabled
  2. Install vLLM backend:

    $ docker model install-runner --backend vllm --gpu cuda
    

Running models with vLLM

vLLM models are typically tagged with a -vllm suffix:

$ docker model run ai/smollm2-vllm

To specify the vLLM backend explicitly:

$ docker model run ai/model --backend vllm

vLLM API endpoints

When using vLLM, specify the engine in the API path:

POST /engines/vllm/v1/chat/completions
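
As with llama.cpp, requests use the OpenAI-compatible chat completions format. A sketch, assuming the default port 12434 and the ai/smollm2-vllm model from above:

$ curl -s -X POST http://localhost:12434/engines/vllm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2-vllm",
    "messages": [
      {"role": "user", "content": "Summarize what vLLM is in one sentence."}
    ]
  }'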

vLLM configuration

HuggingFace overrides

Use --hf_overrides to pass model configuration overrides:

$ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm

Common vLLM settings

| Setting | Description | Example |
|---|---|---|
| max_model_len | Maximum context length | 8192 |
| gpu_memory_utilization | Fraction of GPU memory to use | 0.9 |
| tensor_parallel_size | GPUs for tensor parallelism | 2 |
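
These settings can be combined into a single JSON object passed to --hf_overrides, as in the example above. Which keys are accepted this way may depend on your vLLM version, so treat this as a sketch:

$ docker model configure --hf_overrides '{"max_model_len": 8192, "gpu_memory_utilization": 0.9}' ai/model-vllm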

vLLM and llama.cpp performance comparison

| Scenario | Recommended engine |
|---|---|
| Single user, local development | llama.cpp |
| Multiple concurrent requests | vLLM |
| Limited GPU memory | llama.cpp (with quantization) |
| Maximum throughput | vLLM |
| CPU-only system | llama.cpp |
| Apple Silicon Mac | llama.cpp |
| Production deployment | vLLM (if hardware supports it) |

Diffusers

Diffusers is an inference engine for image generation models, including Stable Diffusion. Unlike llama.cpp and vLLM, which focus on text generation with LLMs, Diffusers lets you generate images from text prompts.

Platform support

| Platform | GPU | Support status |
|---|---|---|
| Linux x86_64 | NVIDIA CUDA | Supported |
| Linux ARM64 | NVIDIA CUDA | Supported |
| Windows | - | Not supported |
| macOS | - | Not supported |

Important

Diffusers requires an NVIDIA GPU with CUDA support. It does not support CPU-only inference.

Setting up Diffusers

Install the Model Runner with Diffusers backend:

$ docker model reinstall-runner --backend diffusers --gpu cuda

Verify the installation:

$ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: vLLM binary not found
diffusers: running diffusers version: 0.36.0

Pulling Diffusers models

Pull a Stable Diffusion model:

$ docker model pull stable-diffusion:Q4

Generating images with Diffusers

Diffusers uses an image generation API endpoint. To generate an image:

$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion:Q4",
    "prompt": "A picture of a nice cat",
    "size": "512x512"
  }' | jq -r '.data[0].b64_json' | base64 -d > image.png

This command:

  1. Sends a POST request to the Diffusers image generation endpoint
  2. Specifies the model, prompt, and output image size
  3. Extracts the base64-encoded image from the response
  4. Decodes it and saves it as image.png

Diffusers API endpoint

When using Diffusers, specify the engine in the API path:

POST /engines/diffusers/v1/images/generations

Supported parameters

| Parameter | Type | Description |
|---|---|---|
| model | string | Required. The model identifier (e.g., stable-diffusion:Q4). |
| prompt | string | Required. The text description of the image to generate. |
| size | string | Image dimensions in WIDTHxHEIGHT format (e.g., 512x512). |

Running multiple engines

You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes requests to the appropriate engine based on the model or explicit engine selection.

Check which engines are running:

$ docker model status
Docker Model Runner is running

Status:
llama.cpp: running llama.cpp version: 34ce48d
mlx: not installed
sglang: sglang package not installed
vllm: running vllm version: 0.11.0
diffusers: running diffusers version: 0.36.0

Engine-specific API paths

| Engine | API path | Use case |
|---|---|---|
| llama.cpp | /engines/llama.cpp/v1/chat/completions | Text generation |
| vLLM | /engines/vllm/v1/chat/completions | Text generation |
| Diffusers | /engines/diffusers/v1/images/generations | Image generation |
| Auto-select | /engines/v1/chat/completions | Text generation (auto-selects engine) |
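
For example, a request to the auto-select path looks the same as the engine-specific requests above but omits the engine name, leaving Model Runner to route it based on the model. A sketch, assuming port 12434 and the ai/smollm2 model:

$ curl -s -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'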

Managing inference engines

Install an engine

$ docker model install-runner --backend <engine> [--gpu <type>]

Options:

  • --backend: llama.cpp, vllm, or diffusers
  • --gpu: cuda, rocm, vulkan, or metal (depends on platform)
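
For example, to install the llama.cpp backend with Vulkan acceleration on a Linux x64 machine without an NVIDIA GPU (assuming your distribution's Vulkan drivers are installed):

$ docker model install-runner --backend llama.cpp --gpu vulkan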

Reinstall an engine

$ docker model reinstall-runner --backend <engine>

Check engine status

$ docker model status

View engine logs

$ docker model logs

Packaging models for each engine

Package a GGUF model (llama.cpp)

$ docker model package --gguf ./model.gguf --push myorg/mymodel:Q4_K_M

Package a Safetensors model (vLLM)

$ docker model package --safetensors ./model/ --push myorg/mymodel-vllm

Troubleshooting

vLLM won't start

  1. Verify NVIDIA GPU is available:

    $ nvidia-smi
    
  2. Check Docker has GPU access:

    $ docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
    
  3. Verify you're on a supported platform (Linux x86_64 or Windows WSL2).

llama.cpp is slow

  1. Ensure GPU acceleration is working (check logs for Metal/CUDA messages).

  2. Try a more aggressive quantization:

    $ docker model pull ai/model:Q4_K_M
    
  3. Reduce context size:

    $ docker model configure --context-size 2048 ai/model
    

Out of memory errors

  1. Use a smaller quantization (Q4 instead of Q8).
  2. Reduce context size.
  3. For vLLM, adjust gpu_memory_utilization:
    $ docker model configure --hf_overrides '{"gpu_memory_utilization": 0.8}' ai/model
    
