How the Linux kernel "copy fail" vulnerability impacts Kubernetes: what you need to know and what you can do

copy fail in kubernetes: when your pod escapes to the host with four bytes

if you thought containers were a security boundary, cve-2026-31431 ("copy fail") has some unfortunate news for you. discovered by xint, this linux kernel vulnerability lets an unprivileged local user overwrite four controlled bytes in the page cache of any readable file—and yes, that includes binaries inside your containers. worse: because the page cache is a host-wide resource, corruption in one container can silently propagate to another. the result? a fully unprivileged pod can achieve node-level code execution.

this isn't a theoretical "what if." a public 732-byte python proof-of-concept demonstrates container escape on every major kubernetes distribution by exploiting shared image layers between an attacker-controlled pod and a privileged daemonset like kube-proxy. if your cluster runs linux kernels built between 2017 and april 2026, you should probably stop reading and start patching.

the container escape primitive: shared page cache, shared fate

the core vulnerability lives in the kernel's algif_aead subsystem, where improper handling of scatter-gather lists during in-place aead decryption allows a controlled 4-byte write into the page cache. the exploit chain is elegantly brutal:

# simplified exploit flow (full PoC: https://github.com/Percivalll/Copy-Fail-CVE-2026-31431-Kubernetes-PoC)
import os, socket

# 1. open AF_ALG socket to the vulnerable crypto template
s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET)
s.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
# ... set key, accept request socket ...

# 2. splice target file (e.g., /usr/sbin/ipset) into the crypto operation
# os.splice(src, dst, count, offset_src=...) requires Python 3.10+
os.splice(target_fd, pipe_wr, 4096, offset_src=chosen_offset)
os.splice(pipe_rd, alg_fd, auth_tag_size)

# 3. trigger decrypt -> kernel writes 4 controlled bytes into the page cache
req_socket.recv(1)  # HMAC verification fails, but the corruption persists

the magic—and the danger—lies in how linux manages file i/o. when a container reads a file from a shared image layer, the kernel serves it from the same physical page cache pages across all containers on that node. this is a performance optimization, not a bug. but when combined with copy fail, it becomes an escape hatch.
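this sharing is easy to observe from userspace: two independent opens of the same file see each other's un-synced writes immediately, because both are backed by the same physical page cache pages. a minimal, standalone illustration (standard library only, unrelated to the exploit itself):

```python
import mmap
import os
import tempfile

# write a 4 KiB file, then map it read-only through one descriptor
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4096)
    path = f.name

ro = open(path, "rb")
view = mmap.mmap(ro.fileno(), 4096, access=mmap.ACCESS_READ)  # MAP_SHARED, PROT_READ

# write 4 bytes through a *different* descriptor, without any fsync:
# the data lives only in the page cache at this point
wr = os.open(path, os.O_WRONLY)
os.pwrite(wr, b"XXXX", 0)

# the read-only mapping observes the change at once, because both
# opens are served from the same page cache pages
print(view[:4])  # b'XXXX'
```

this is exactly the coherence property that makes the exploit cross container boundaries: the "different descriptor" can just as well belong to a different container on the same node.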

why overlay filesystems make this worse

container runtimes like containerd and cri-o use overlayfs to implement copy-on-write semantics. when multiple pods reference the same image layer:

host page cache
├── lowerdir: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<layer>/usr/sbin/ipset
├── upperdir: (container-specific, empty for read-only files)
└── merged view: served from shared page cache pages

if an unprivileged pod corrupts /usr/sbin/ipset in the page cache, every pod on that node that reads the same file from the same layer sees the corrupted in-memory version—without any cross-container communication, without touching disk, and without triggering traditional file integrity monitors.
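whether two paths share page cache pages can be approximated from userspace: if they resolve to the same (device, inode) pair, reads are served from the same cache pages. a hypothetical helper, using hard links as a stand-in for two container mounts over one overlayfs lowerdir:

```python
import os
import tempfile

def same_cache_pages(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same underlying inode --
    meaning the kernel serves reads from the same page cache pages."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)

# hard links stand in for two containers sharing one image-layer file
d = tempfile.mkdtemp()
layer_file = os.path.join(d, "ipset")        # hypothetical shared binary
with open(layer_file, "wb") as f:
    f.write(b"\x7fELF")
pod_a_view = os.path.join(d, "pod-a-ipset")  # hypothetical second mount path
os.link(layer_file, pod_a_view)

print(same_cache_pages(layer_file, pod_a_view))  # True
```

on a real node you would stat the same binary through two containers' merged overlayfs views; a matching inode pair means corruption in one is visible in the other.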

the kube-proxy attack vector: a privileged daemonset waiting to happen

the public kubernetes poc targets /usr/sbin/ipset, a binary used by kube-proxy to manage iptables/ipset rules. here's why this is a perfect storm:

| characteristic | why it matters |
| --- | --- |
| kube-proxy runs as a privileged daemonset | executes with hostNetwork: true, full capabilities, and root uid |
| ipset is invoked periodically | the corrupted binary gets executed automatically, no user interaction needed |
| image layer is shared across nodes | the same base image (registry.k8s.io/kube-proxy:v1.35.2) means the same page cache mapping |
| binary is readable by unprivileged users | satisfies the "any readable file" prerequisite for copy fail |

the attack sequence:

  1. attacker deploys an unprivileged pod with the poc script (no special capabilities required)
  2. poc corrupts the page cache for /usr/sbin/ipset in the shared image layer
  3. kube-proxy on the same node executes the corrupted binary during its next reconciliation loop
  4. attacker-controlled shellcode runs with kube-proxy's privileges: root on the node, access to host namespaces, and full cluster control via the node's service account

this isn't a "maybe." the poc has been tested and confirmed working on ubuntu, amazon linux, rhel, and suse kernels spanning versions 6.12 through 6.18.


kubernetes-specific mitigations: patch first, architect second

immediate actions (today)

disable the vulnerable kernel module (temporary)

# node-level mitigation via DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: disable-algif-aead }
spec:
  selector: { matchLabels: { app: disable-algif-aead } }
  template:
    metadata: { labels: { app: disable-algif-aead } }
    spec:
      hostPID: true
      containers:
      - name: mitigator
        image: alpine:latest
        securityContext: { privileged: true }  # required to unload a module on the host
        command: ["/bin/sh", "-c"]
        args:
        - |
          echo "install algif_aead /bin/false" > /host/etc/modprobe.d/disable-algif.conf
          chroot /host rmmod algif_aead 2>/dev/null || true
          sleep infinity  # keep the pod running so the DaemonSet doesn't crash-loop
        volumeMounts:
        - name: host-root
          mountPath: /host
      volumes:
      - name: host-root
        hostPath: { path: /, type: Directory }
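after rolling out a mitigation like this, it is worth verifying on each node that the module is actually gone. a small hypothetical checker that parses the /proc/modules format (module name is the first whitespace-separated field on each line):

```python
def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Check whether a module name appears in /proc/modules contents."""
    return any(
        line.split(maxsplit=1)[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )

# on a live node: module_loaded("algif_aead", open("/proc/modules").read())
sample = (
    "algif_aead 16384 0 - Live 0x0000000000000000\n"
    "xt_nat 16384 2 - Live 0x0000000000000000\n"
)
print(module_loaded("algif_aead", sample))      # True
print(module_loaded("algif_skcipher", sample))  # False
```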

block af_alg at the runtime level
use seccomp profiles to prevent af_alg socket creation in untrusted pods:

# pod securityContext with seccomp
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/block-af-alg.json

and the profile itself (note: JSON does not allow comments; AF_ALG is address family 38 on linux):

profiles/block-af-alg.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [{
    "names": ["socket"],
    "action": "SCMP_ACT_ERRNO",
    "args": [{"index": 0, "value": 38, "op": "SCMP_CMP_EQ"}]
  }]
}
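a typo in the arg filter silently turns this profile into a no-op, so a quick sanity check that the blocked value matches the real AF_ALG constant (38 on linux) can catch mistakes before rollout. a minimal sketch:

```python
import json
import socket

profile = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [{
    "names": ["socket"],
    "action": "SCMP_ACT_ERRNO",
    "args": [{"index": 0, "value": 38, "op": "SCMP_CMP_EQ"}]
  }]
}
""")

rule = profile["syscalls"][0]
blocked = rule["args"][0]["value"]

# socket.AF_ALG is Linux-only, so fall back to the literal 38 elsewhere
assert blocked == getattr(socket, "AF_ALG", 38)
assert rule["action"] == "SCMP_ACT_ERRNO"
print("profile blocks AF_ALG sockets")
```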

patch your nodes
apply a kernel containing the upstream fix (commit a664bf3d603d). for managed kubernetes services:

# EKS: trigger node group update
aws eks update-nodegroup-config --cluster-name my-cluster --nodegroup-name my-ng \
  --launch-template id=$LAUNCH_TEMPLATE_ID,version=$NEW_VERSION

# GKE: enable auto-upgrade or manually upgrade nodes
gcloud container clusters upgrade my-cluster --node-pool default-pool \
  --cluster-version=1.35.2-gke.100

architectural hardening (this quarter)

  • isolate image layers for privileged workloads
    use distinct base images for daemonsets like kube-proxy that aren't shared with user workloads. this breaks the page-cache propagation path.
  • adopt pod security admission (psa) or gatekeeper policies
    enforce that pods cannot request hostpath volumes, privileged mode, or af_alg-capable seccomp exemptions.

restrict pod placement with node affinity
prevent untrusted workloads from scheduling on nodes running privileged daemonsets with shared base images:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/control-plane
          operator: DoesNotExist
        - key: workload-trust-level
          operator: In
          values: ["untrusted"]

enforce read-only root filesystems
while copy fail bypasses on-disk checks, a read-only rootfs limits post-exploitation persistence options:

securityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false

detection strategies for kubernetes environments

copy fail is stealthy by design: the corrupted page is never marked dirty, so on-disk checksums remain valid. detection requires behavioral signals:

  1. monitor for anomalous kube-proxy behavior
    corrupted ipset execution may cause:
    • unexpected iptables rule modifications
    • kube-proxy crash loops with unusual stack traces
    • auth.log entries with missing invoking usernames (see original advisory)
  2. watch for poc network artifacts
    non-stealthy attackers may fetch exploit code from https://copy.fail/exp. alert on egress to this domain from cluster pods.

correlate pod scheduling with kernel version
flag any unprivileged pod scheduled on a node running an unpatched kernel:

# quick cluster audit: flag nodes on kernels in the confirmed-affected 6.12-6.18 range
kubectl get nodes -o json | jq -r '.items[] |
  select(.status.nodeInfo.kernelVersion | test("^6\\.1[2-8]\\.")) |
  .metadata.name'
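the same check can be embedded in an admission webhook or a periodic audit job. a hypothetical helper that parses node kernelVersion strings and flags the 6.12-6.18 range confirmed affected by the public poc (earlier kernels may also be vulnerable, so treat this as a lower bound, not a clean bill of health):

```python
import re

def possibly_vulnerable(kernel_version: str) -> bool:
    """Flag kernels in the 6.12-6.18 range confirmed affected by the PoC.
    Earlier kernels may also carry the vulnerable code path."""
    m = re.match(r"(\d+)\.(\d+)", kernel_version)
    if not m:
        return False
    major, minor = int(m.group(1)), int(m.group(2))
    return major == 6 and 12 <= minor <= 18

print(possibly_vulnerable("6.12.0-1021-aws"))  # True
print(possibly_vulnerable("6.19.2"))           # False
print(possibly_vulnerable("5.15.0-generic"))   # False
```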

audit af_alg socket creation
use auditd or ebpf-based tracing to alert on unexpected socket(AF_ALG, ...) calls from containerized processes:

# ebpf trace example (bpftrace)
tracepoint:syscalls:sys_enter_socket /args->family == 38/ {
  printf("AF_ALG socket from pid %d (%s)\n", pid, comm);
}

the uncomfortable truth about container "isolation"

copy fail exposes a fundamental tension in container security: performance optimizations (shared page cache, overlayfs) directly conflict with isolation guarantees. the linux kernel was never designed with multi-tenant container workloads as a primary threat model—and it shows.

this isn't a call to abandon containers. it's a reminder that "isolation" is a spectrum, not a binary. defense-in-depth means:

  • assuming local privesc vulnerabilities will exist
  • minimizing the blast radius when they do
  • treating kernel patch latency as a first-order risk metric

because when four bytes can buy you the entire node, your pod security policy just became a suggestion.


references

source: adapted from wiz.io blog post by amitai cohen, merav bar, and shahar dorfman (may 1, 2026) and xint code research (april 29, 2026)
