Seccomp security profiles for Docker

Secure computing mode (seccomp) is a Linux kernel feature that restricts the system calls a process can make. The seccomp() system call operates on the seccomp state of the calling process. With Docker, you can use this feature to restrict the actions available within the container.

This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check if your kernel supports seccomp:

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
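
On some systems the /boot/config-* file is not installed. The sketch below wraps the check above in a small function and, as an assumption about the host, falls back to /proc/config.gz, which exists only on kernels built with CONFIG_IKCONFIG_PROC. The function name kernel_has_seccomp is illustrative and not part of Docker.

```shell
# Return 0 if CONFIG_SECCOMP=y is found, 1 if it is not,
# and 2 if no readable kernel config could be located.
# kernel_has_seccomp is a hypothetical helper name.
kernel_has_seccomp() {
    config="${1:-/boot/config-$(uname -r)}"
    if [ -r "$config" ]; then
        grep -q '^CONFIG_SECCOMP=y' "$config"
    elif [ -r /proc/config.gz ]; then
        # Only present when the kernel was built with CONFIG_IKCONFIG_PROC.
        zcat /proc/config.gz | grep -q '^CONFIG_SECCOMP=y'
    else
        return 2
    fi
}

status=0
kernel_has_seccomp || status=$?
case "$status" in
    0) echo "CONFIG_SECCOMP=y found" ;;
    1) echo "CONFIG_SECCOMP=y not found" ;;
    *) echo "no readable kernel config" ;;
esac
```

This covers only the kernel side. Whether the Docker daemon itself was built with seccomp is a separate question; the Security Options line in the output of docker info is one place to look.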

Pass a profile for a container

The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default profile is published as part of the Docker Engine (Moby) source code.

In effect, the profile is an allowlist: it denies access to system calls by default and then allows specific ones. It does this by setting a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. A call that falls through to SCMP_ACT_ERRNO does not run; it fails with an error, typically reported as "Operation not permitted". The profile then lists the system calls that are fully allowed, because their action is overridden to SCMP_ACT_ALLOW. Finally, some rules apply to individual system calls, such as personality, and allow only variants of those calls with specific arguments.
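
To make that structure concrete, the following sketch writes a deliberately tiny profile with the same three parts: a default action of SCMP_ACT_ERRNO, a list of fully allowed calls, and one argument-filtered rule for personality. The file name and the choice of allowed syscalls are assumptions made for illustration. This is not an excerpt of the real default profile, and a profile this small is almost certainly too restrictive to start an actual container.

```shell
# Write an illustrative seccomp profile. The path is a placeholder.
profile="${TMPDIR:-/tmp}/example-seccomp.json"

cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 0, "value": 0, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
EOF

echo "wrote $profile"
```

The first rule allows its calls unconditionally. The second allows personality only when argument 0 equals 0, so any other variant of the call falls back to the default action and fails.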

Seccomp is instrumental for running Docker containers with least privilege. Changing the default seccomp profile is not recommended.

When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a policy:

$ docker run --rm \
             -it \
             --security-opt seccomp=/path/to/seccomp/profile.json \
             hello-world

Significant syscalls blocked by the default profile

Docker's default seccomp profile is an allowlist which specifies the calls that are allowed. The table below lists the significant (but not all) syscalls that are effectively blocked because they are not on the allowlist, together with the reason each syscall is blocked rather than allowed.

Syscall             Description
acct                Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key             Prevent containers from using the kernel keyring, which is not namespaced.
bpf                 Deny loading potentially persistent BPF programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime       Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clock_settime       Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clone               Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER.
create_module       Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE.
delete_module       Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_module        Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms     Deny retrieval of exported kernel and module symbols. Obsolete.
get_mempolicy       Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_module         Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm              Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl                Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp                Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
kexec_file_load     Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT.
kexec_load          Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT.
keyctl              Prevent containers from using the kernel keyring, which is not namespaced.
lookup_dcookie      Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN.
mbind               Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
mount               Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages          Syscall that modifies kernel memory and NUMA settings.
nfsservctl          Deny interaction with the kernel NFS daemon. Obsolete since Linux 3.1.
open_by_handle_at   Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open     Tracing/profiling syscall, which could leak a lot of information on the host.
personality         Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulnerabilities.
pivot_root          Deny pivot_root, should be privileged operation.
process_vm_readv    Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
process_vm_writev   Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
ptrace              Tracing/profiling syscall. Blocked in Linux kernel versions before 4.8 to avoid seccomp bypass. Tracing/profiling arbitrary processes is already blocked by dropping CAP_SYS_PTRACE, because it could leak a lot of information on the host.
query_module        Deny manipulation and functions on kernel modules. Obsolete.
quotactl            Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot              Don't let containers reboot the host. Also gated by CAP_SYS_BOOT.
request_key         Prevent containers from using the kernel keyring, which is not namespaced.
set_mempolicy       Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns               Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday        Time/date is not namespaced. Also gated by CAP_SYS_TIME.
stime               Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon              Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff             Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs               Obsolete syscall.
_sysctl             Obsolete, replaced by /proc/sys.
umount              Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2             Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
unshare             Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user.
uselib              Older syscall related to shared libraries, unused for a long time.
userfaultfd         Userspace page fault handling, largely needed for process migration.
ustat               Obsolete syscall.
vm86                In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old             In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.

Run without the default seccomp profile

You can pass unconfined to run a container without the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined debian:latest \
    unshare --map-root-user --user sh -c whoami
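
One way to confirm whether a filter is actually in effect is the Seccomp field of /proc/<pid>/status, which the kernel reports as 0 (disabled), 1 (strict), or 2 (filter). The helpers below are a minimal sketch for decoding that field; the function names are hypothetical, and the docker commands in the comments are shown as a usage idea rather than captured output.

```shell
# Read /proc/<pid>/status-style text on stdin and print the numeric
# seccomp mode. Matching the exact field name avoids also picking up
# the Seccomp_filters line that newer kernels print.
seccomp_mode_from_status() {
    awk '$1 == "Seccomp:" { print $2; exit }'
}

# Translate the numeric mode into a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Report the mode of the current shell when /proc is available.
if [ -r /proc/self/status ]; then
    mode=$(seccomp_mode_from_status < /proc/self/status)
    echo "current process: $(seccomp_mode_name "$mode")"
fi

# Usage idea inside containers (not executed here):
#   docker run --rm debian:latest cat /proc/self/status | seccomp_mode_from_status
#   docker run --rm --security-opt seccomp=unconfined debian:latest \
#       cat /proc/self/status | seccomp_mode_from_status
```

With the default profile, the first command would be expected to report mode 2, and the unconfined run mode 0.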