Configuration options

Docker Model Runner provides several configuration options to tune model behavior, memory usage, and inference performance. This guide covers the key settings and how to apply them.

Context size (context length)

The context size determines the maximum number of tokens a model can process in a single request, including both the input prompt and generated output. This is one of the most important settings affecting memory usage and model capabilities.

Default context size

By default, Docker Model Runner uses a context size that balances capability with resource efficiency:

| Engine | Default behavior |
|-----------|------------------|
| llama.cpp | 4096 tokens |
| vLLM | Uses the model's maximum trained context size |
Note

The effective default also depends on the model. Most models default to between 2,048 and 8,192 tokens, while some newer models support 32K, 128K, or even larger contexts.

Configure context size

You can adjust context size per model using the docker model configure command:

$ docker model configure --context-size 8192 ai/qwen2.5-coder

Or in a Compose file:

models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 8192
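
A service in the same Compose file can then reference the model by name; Compose wires the two together and injects connection details as environment variables (see Environment-based configuration below). A minimal sketch, where the service name and image are placeholders:

services:
  app:
    image: my-app        # placeholder; your application image
    models:
      - llm              # binds the model defined above to this service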

Context size guidelines

| Context size | Typical use case | Memory impact |
|--------------|------------------|---------------|
| 2,048 | Simple queries, short code snippets | Low |
| 4,096 | Standard conversations, medium code files | Moderate |
| 8,192 | Long conversations, larger code files | Higher |
| 16,384+ | Extended documents, multi-file context | High |
Important

Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory errors, reduce the context size. As a rough guide, each additional 1,000 tokens requires approximately 100-500 MB of additional memory, depending on the model size.
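
For example, stepping a model down from a 16,384-token context to 8,192 removes roughly 8,000 tokens, which by the rough guide above frees somewhere on the order of 0.8-4 GB of memory:

$ docker model configure --context-size 8192 ai/qwen2.5-coder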

Check a model's maximum context

To see a model's details, including its maximum supported context length:

$ docker model inspect ai/qwen2.5-coder
Note

The docker model inspect command shows the model's maximum supported context length (for example, gemma3.context_length), not the configured context size. The configured context size is the value you set with docker model configure --context-size; it is the actual limit used during inference and should be less than or equal to the model's maximum supported context length.
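
For example, to pull just the context-related field out of the inspect output (gemma3.context_length is illustrative; the exact field name depends on the model architecture):

$ docker model inspect ai/gemma3 | grep -i context_length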

Runtime flags

Runtime flags let you pass parameters directly to the underlying inference engine. This provides fine-grained control over model behavior.

Using runtime flags

Runtime flags can be set either in a Compose file or with the docker model configure command:

Using Docker Compose

In a Compose file:

models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"

Using Command Line

With the docker model configure command:

$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder

Common llama.cpp parameters

These are the most commonly used llama.cpp parameters; for typical use cases they cover what you need without consulting the llama.cpp documentation.

Sampling parameters

| Flag | Description | Default | Range |
|------|-------------|---------|-------|
| --temp | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 |
| --top-k | Limit sampling to top K tokens. Lower = more focused | 40 | 1-100 |
| --top-p | Nucleus sampling threshold. Lower = more focused | 0.9 | 0.0-1.0 |
| --min-p | Minimum probability threshold | 0.05 | 0.0-1.0 |
| --repeat-penalty | Penalty for repeating tokens | 1.1 | 1.0-2.0 |

Example: Deterministic output (for code generation)

runtime_flags:
  - "--temp"
  - "0"
  - "--top-k"
  - "1"

Example: Creative output (for storytelling)

runtime_flags:
  - "--temp"
  - "1.2"
  - "--top-p"
  - "0.95"

Performance parameters

| Flag | Description | Default | Notes |
|------|-------------|---------|-------|
| --threads | CPU threads for generation | Auto | Set to number of performance cores |
| --threads-batch | CPU threads for batch processing | Auto | Usually same as --threads |
| --batch-size | Batch size for prompt processing | 512 | Higher = faster prompt processing |
| --mlock | Lock model in memory | Off | Prevents swapping, requires sufficient RAM |
| --no-mmap | Disable memory mapping | Off | May improve performance on some systems |

Example: Optimized for multi-core CPU

runtime_flags:
  - "--threads"
  - "8"
  - "--batch-size"
  - "1024"

GPU parameters

| Flag | Description | Default | Notes |
|------|-------------|---------|-------|
| --n-gpu-layers | Layers to offload to GPU | All (if GPU available) | Reduce if running out of VRAM |
| --main-gpu | GPU to use for computation | 0 | For multi-GPU systems |
| --split-mode | How to split across GPUs | layer | Options: none, layer, row |

Example: Partial GPU offload (limited VRAM)

runtime_flags:
  - "--n-gpu-layers"
  - "20"

Advanced parameters

| Flag | Description | Default |
|------|-------------|---------|
| --rope-scaling | RoPE scaling method | Auto |
| --rope-freq-base | RoPE base frequency | Model default |
| --rope-freq-scale | RoPE frequency scale | Model default |
| --no-prefill-assistant | Disable assistant pre-fill | Off |
| --reasoning-budget | Token budget for reasoning models | 0 (disabled) |

vLLM parameters

When using the vLLM backend, different parameters are available.

Use --hf_overrides to pass Hugging Face model configuration overrides as JSON:

$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm

Configuration presets

Here are complete configuration examples for common use cases.

Code completion (fast, deterministic)

models:
  coder:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.1"
      - "--top-k"
      - "1"
      - "--batch-size"
      - "1024"

Chat assistant (balanced)

models:
  assistant:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"
      - "--repeat-penalty"
      - "1.1"

Creative writing (high temperature)

models:
  writer:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "1.2"
      - "--top-p"
      - "0.95"
      - "--repeat-penalty"
      - "1.0"

Long document analysis (large context)

models:
  analyzer:
    model: ai/qwen2.5-coder:14B
    context_size: 32768
    runtime_flags:
      - "--mlock"
      - "--batch-size"
      - "2048"

Low memory system

models:
  efficient:
    model: ai/smollm2:360M-Q4_K_M
    context_size: 2048
    runtime_flags:
      - "--threads"
      - "4"

Environment-based configuration

You can also configure models via environment variables in containers:

| Variable | Description |
|----------|-------------|
| LLM_URL | Auto-injected URL of the model endpoint |
| LLM_MODEL | Auto-injected model identifier |

See Models and Compose for details on how these are populated.
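
To check that these variables reach a service container, list its environment from the running stack (app is a hypothetical service name that declares the model):

$ docker compose exec app env | grep LLM_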

Reset configuration

Configuration set via docker model configure persists until the model is removed. To reset configuration:

$ docker model configure --context-size -1 ai/qwen2.5-coder

Using -1 resets to the default value.

What's next