Configuration options
Docker Model Runner provides several configuration options to tune model behavior, memory usage, and inference performance. This guide covers the key settings and how to apply them.
Context size (context length)
The context size determines the maximum number of tokens a model can process in a single request, including both the input prompt and generated output. This is one of the most important settings affecting memory usage and model capabilities.
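For example, with a context size of 8,192 tokens, a 6,000-token prompt leaves roughly 2,200 tokens of room for the generated output before the limit is reached.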
Default context size
By default, Docker Model Runner uses a context size that balances capability with resource efficiency:
| Engine | Default behavior |
|---|---|
| llama.cpp | 4,096 tokens |
| vLLM | Uses the model's maximum trained context size |
Note: The actual default varies by model. Most models support between 2,048 and 8,192 tokens by default. Some newer models support 32K, 128K, or even larger contexts.
Configure context size
You can adjust context size per model using the docker model configure command:
$ docker model configure --context-size 8192 ai/qwen2.5-coder
Or in a Compose file:
models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 8192
Context size guidelines
| Context size | Typical use case | Memory impact |
|---|---|---|
| 2,048 | Simple queries, short code snippets | Low |
| 4,096 | Standard conversations, medium code files | Moderate |
| 8,192 | Long conversations, larger code files | Higher |
| 16,384+ | Extended documents, multi-file context | High |
Important: Larger context sizes require more memory (RAM/VRAM). If you experience out-of-memory errors, reduce the context size. As a rough guide, each additional 1,000 tokens requires approximately 100-500 MB of additional memory, depending on the model size.
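As a worked example under that rough guide: raising the context size from 4,096 to 16,384 tokens adds about 12,000 tokens, which translates to roughly 1.2-6 GB of additional memory (12 × 100-500 MB) on top of what the model weights already consume.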
Check a model's maximum context
To see a model's configuration including context size:
$ docker model inspect ai/qwen2.5-coder
Note: The docker model inspect command shows the model's maximum supported context length (for example, gemma3.context_length), not the configured context size. The configured context size is what you set with docker model configure --context-size; it is the actual limit used during inference and should be less than or equal to the model's maximum supported context length.
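For instance, to pull just the maximum context length out of the inspect output (a sketch that assumes JSON output containing a field like the gemma3.context_length mentioned above; the model name is illustrative), you could filter for it:
$ docker model inspect ai/gemma3 | grep -i context_length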
Runtime flags
Runtime flags let you pass parameters directly to the underlying inference engine. This provides fine-grained control over model behavior.
Using runtime flags
Runtime flags can be provided through multiple mechanisms:
Using Docker Compose
In a Compose file:
models:
  llm:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"
Using the command line
With the docker model configure command:
$ docker model configure --runtime-flag "--temp" --runtime-flag "0.7" --runtime-flag "--top-p" --runtime-flag "0.9" ai/qwen2.5-coder
Common llama.cpp parameters
These are the most commonly used llama.cpp parameters. For typical use cases, they cover what you need without consulting the llama.cpp documentation.
Sampling parameters
| Flag | Description | Default | Range |
|---|---|---|---|
| --temp | Temperature for sampling. Lower = more deterministic, higher = more creative | 0.8 | 0.0-2.0 |
| --top-k | Limit sampling to top K tokens. Lower = more focused | 40 | 1-100 |
| --top-p | Nucleus sampling threshold. Lower = more focused | 0.9 | 0.0-1.0 |
| --min-p | Minimum probability threshold | 0.05 | 0.0-1.0 |
| --repeat-penalty | Penalty for repeating tokens | 1.1 | 1.0-2.0 |
Example: Deterministic output (for code generation)
runtime_flags:
  - "--temp"
  - "0"
  - "--top-k"
  - "1"
Example: Creative output (for storytelling)
runtime_flags:
  - "--temp"
  - "1.2"
  - "--top-p"
  - "0.95"
Performance parameters
| Flag | Description | Default | Notes |
|---|---|---|---|
| --threads | CPU threads for generation | Auto | Set to number of performance cores |
| --threads-batch | CPU threads for batch processing | Auto | Usually same as --threads |
| --batch-size | Batch size for prompt processing | 512 | Higher = faster prompt processing |
| --mlock | Lock model in memory | Off | Prevents swapping, requires sufficient RAM |
| --no-mmap | Disable memory mapping | Off | May improve performance on some systems |
Example: Optimized for multi-core CPU
runtime_flags:
  - "--threads"
  - "8"
  - "--batch-size"
  - "1024"
GPU parameters
| Flag | Description | Default | Notes |
|---|---|---|---|
| --n-gpu-layers | Layers to offload to GPU | All (if GPU available) | Reduce if running out of VRAM |
| --main-gpu | GPU to use for computation | 0 | For multi-GPU systems |
| --split-mode | How to split across GPUs | layer | Options: none, layer, row |
Example: Partial GPU offload (limited VRAM)
runtime_flags:
  - "--n-gpu-layers"
  - "20"
Advanced parameters
| Flag | Description | Default |
|---|---|---|
| --rope-scaling | RoPE scaling method | Auto |
| --rope-freq-base | RoPE base frequency | Model default |
| --rope-freq-scale | RoPE frequency scale | Model default |
| --no-prefill-assistant | Disable assistant pre-fill | Off |
| --reasoning-budget | Token budget for reasoning models | 0 (disabled) |
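These flags are passed the same way as the sampling and performance flags above. As an illustrative sketch (the flag chosen here is arbitrary), disabling assistant pre-fill in a Compose file would look like:
models:
  llm:
    model: ai/qwen2.5-coder
    runtime_flags:
      - "--no-prefill-assistant"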
vLLM parameters
When using the vLLM backend, different parameters are available.
Use --hf_overrides to pass HuggingFace model config overrides as JSON:
$ docker model configure --hf_overrides '{"rope_scaling": {"type": "dynamic", "factor": 2.0}}' ai/model-vllm
Configuration presets
Here are complete configuration examples for common use cases.
Code completion (fast, deterministic)
models:
  coder:
    model: ai/qwen2.5-coder
    context_size: 4096
    runtime_flags:
      - "--temp"
      - "0.1"
      - "--top-k"
      - "1"
      - "--batch-size"
      - "1024"
Chat assistant (balanced)
models:
  assistant:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "0.7"
      - "--top-p"
      - "0.9"
      - "--repeat-penalty"
      - "1.1"
Creative writing (high temperature)
models:
  writer:
    model: ai/llama3.2
    context_size: 8192
    runtime_flags:
      - "--temp"
      - "1.2"
      - "--top-p"
      - "0.95"
      - "--repeat-penalty"
      - "1.0"
Long document analysis (large context)
models:
  analyzer:
    model: ai/qwen2.5-coder:14B
    context_size: 32768
    runtime_flags:
      - "--mlock"
      - "--batch-size"
      - "2048"
Low memory system
models:
  efficient:
    model: ai/smollm2:360M-Q4_K_M
    context_size: 2048
    runtime_flags:
      - "--threads"
      - "4"
Environment-based configuration
You can also configure models via environment variables in containers:
| Variable | Description |
|---|---|
| LLM_URL | Auto-injected URL of the model endpoint |
| LLM_MODEL | Auto-injected model identifier |
See Models and Compose for details on how these are populated.
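As a minimal sketch (the service name and image are illustrative, and the variable prefix is assumed to follow the model's key in the Compose file, llm in this case), a service that declares a model dependency receives these variables automatically:
services:
  app:
    image: my-app:latest
    models:
      - llm
models:
  llm:
    model: ai/qwen2.5-coder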
Reset configuration
Configuration set via docker model configure persists until the model is removed.
To reset configuration:
$ docker model configure --context-size -1 ai/qwen2.5-coder
Using -1 resets to the default value.
What's next
- Inference engines - Learn about llama.cpp and vLLM
- API reference - API parameters for per-request configuration
- Models and Compose - Configure models in Compose applications