Container Sandboxing

In this lesson we examine how to harden containers by applying sandboxing techniques that reduce the kernel attack surface and limit what containerized processes can do. We contrast container isolation with virtual machines, show practical examples (including PID namespaces), and present common sandboxing controls and advanced alternatives that provide stronger isolation.

Quick refresher: virtual machines vs containers

Virtual machines provide strong isolation because each VM runs a full guest operating system and its own kernel on top of a hypervisor. Containers, by contrast, share the host kernel and isolate processes using kernel mechanisms such as namespaces and cgroups. That architecture difference is critical when evaluating attack surfaces and escape risks.

A diagram titled "Container Sandboxing" comparing virtual machines and containers. It shows layered stacks (application, libs, deps, guest OS) with VMs using a hypervisor and separate guest OSes, while containers share Docker and the host OS over the hardware.

Isolation Aspect	Virtual Machines	Containers
Kernel per workload	Yes — dedicated guest kernel	No — shared host kernel
Isolation mechanism	Hypervisor-based	Namespaces, cgroups, LSMs
Typical use case	Strong multi-tenant isolation	Lightweight microservices, higher density
Escape risk if kernel compromised	Lower (guest kernel separation)	Higher (shared kernel)

PID namespaces: a simple example

Containers map process IDs into a PID namespace. Inside the container, a process can appear as PID 1, while on the host it has a different PID. This demonstrates logical process isolation but also shows that the host can still observe and terminate the underlying host PID. Example (run a BusyBox container that sleeps for 1000 seconds):

# Run a container that sleeps
$ docker run -d --name sleeping-container busybox sleep 1000
e2fd5090c9a51eb7cc91a466cf2e18c5468871f84adbb55c2e6c1cf4ea0028a8

# Inside the container: PID 1 is the sleep process
$ docker exec -ti sleeping-container ps -ef
PID   USER     TIME  COMMAND
1     root     0:00  sleep 1000
11    root     0:00  ps -ef

# On the host you can also see the sleep process with a different PID
$ ps -ef | grep sleep | grep -vi grep
root     7902  7871  0 21:39 ?        00:00:00 sleep 1000

Because the host-level process exists, killing that host PID terminates the container process. This shows that namespaces provide isolation at the user-space level, but the shared kernel remains the ultimate control plane.

Containers share the host kernel. If the kernel has a vulnerability (for example, a local privilege escalation like Dirty COW), a compromised container process can potentially exploit the kernel and affect the host and other containers.

How containerized processes interact with the kernel

Applications (in containers or on bare OS) run in user space and make system calls to access hardware and privileged services. Since containers use the host kernel, restricting system call access and other kernel-visible actions is a key hardening strategy. Two widely used kernel-level sandboxing controls are seccomp and AppArmor (or SELinux).

Seccomp: restricts the set of system calls a process may invoke.
AppArmor: enforces path- and capability-based access controls for files and other resources.

Both tools reduce the risk of a kernel exploit being used from within a container by reducing what code inside the container can ask the kernel to do.

Example sandboxing configurations

Example seccomp profile (whitelist-style — deny by default, allow a small syscall set):

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "execve",
        "brk",
        "access",
        "capset",
        "clone"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Example AppArmor profile snippet (blacklist-style rule denying writes to /proc):

profile apparmor-deny-write flags=(attach_disconnected) {
    #include <abstractions/base>

    # Allow typical reads and library access (adjust paths for your program)
    /usr/bin/your-binary ixr,
    /lib/** r,
    /usr/lib/** r,
    /etc/** r,

    # Deny all write access to /proc
    deny /proc/** w,
}

Whitelist vs blacklist approaches

Pattern	Description	Strengths	Trade-offs
Whitelist (seccomp)	Default deny; explicitly allow required syscalls	Minimal kernel surface, strong security	Requires profiling/application knowledge
Blacklist (AppArmor snippet)	Default allow; block specific actions or paths	Easier to implement for diverse apps	May miss attack vectors; less strict

When feasible, prefer whitelist-based restrictions (e.g., seccomp profiles) to minimize the kernel functionality exposed to containerized applications. Use blacklists when you need broader compatibility and then complement them with other controls.

Practical guidance for production workloads

Small, homogeneous fleets: Create strict, minimal seccomp and AppArmor/SELinux profiles for each service (for example, many Nginx or MySQL instances). This gives strong protection with manageable maintenance.
Large, heterogeneous fleets: Use a layered approach — namespaces + cgroups + capability drops + seccomp + LSMs (AppArmor/SELinux) — and focus on automation to generate and roll out profiles.
Follow the principle of least privilege: drop Linux capabilities your process does not need and restrict filesystem and network access.
Monitor and iterate: use runtime observability and profiling to generate accurate whitelists and to find false positives/negatives before enforcing strict policies.

Advanced sandboxing / microVM alternatives

If the shared-kernel model is unacceptable for your threat model, consider technologies that provide stronger kernel isolation by running containers inside lightweight VMs or alternative kernels:

Technology	Description	Use case
gVisor	User-space kernel that intercepts syscalls and emulates kernel behavior	Improve isolation without heavy VMs
Kata Containers	Runs container workloads inside lightweight VMs managed by a runtime	Stronger isolation with VM-level boundaries
Firecracker	MicroVMs designed for minimal overhead and fast startup	Serverless and multi-tenant isolation at VM level

These projects trade some density and complexity for stronger separations between workloads and the host kernel.

Introduction

Understanding the Kubernetes Attack Surface

Cluster Setup and Hardening

System Hardening

Minimize Microservice Vulnerabilities

Supply Chain Security

Monitoring, Logging and Runtime Security

Container Sandboxing

Quick refresher: virtual machines vs containers

PID namespaces: a simple example

How containerized processes interact with the kernel

Example sandboxing configurations

Whitelist vs blacklist approaches

Practical guidance for production workloads

Advanced sandboxing / microVM alternatives

Further reading and references

Watch Video

Introduction

Understanding the Kubernetes Attack Surface

Cluster Setup and Hardening

System Hardening

Minimize Microservice Vulnerabilities

Supply Chain Security

Monitoring, Logging and Runtime Security

Documentation Index

​Quick refresher: virtual machines vs containers

​PID namespaces: a simple example

​How containerized processes interact with the kernel

​Example sandboxing configurations

​Whitelist vs blacklist approaches

​Practical guidance for production workloads

​Advanced sandboxing / microVM alternatives

​Further reading and references

Watch Video

Quick refresher: virtual machines vs containers

PID namespaces: a simple example

How containerized processes interact with the kernel

Example sandboxing configurations

Whitelist vs blacklist approaches

Practical guidance for production workloads

Advanced sandboxing / microVM alternatives

Further reading and references