copy fail in kubernetes: when your pod escapes to the host with four bytes
if you thought containers were a security boundary, cve-2026-31431 ("copy fail") has some unfortunate news for you. discovered by xint, this linux kernel vulnerability lets an unprivileged local user overwrite four controlled bytes in the page cache of any readable file—and yes, that includes binaries inside your containers. worse: because the page cache is a host-wide resource, corruption in one container can silently propagate to another. the result? a fully unprivileged pod can achieve node-level code execution.
this isn't a theoretical "what if." a public 732-byte python proof-of-concept demonstrates container escape on every major kubernetes distribution by exploiting shared image layers between an attacker-controlled pod and a privileged daemonset like kube-proxy. if your cluster runs linux kernels built between 2017 and april 2026, you should probably stop reading and start patching.
the container escape primitive: shared page cache, shared fate
the core vulnerability lives in the kernel's algif_aead subsystem, where improper handling of scatter-gather lists during in-place aead decryption allows a controlled 4-byte write into the page cache. the exploit chain is elegantly brutal:
```python
# simplified exploit flow (full PoC: https://github.com/Percivalll/Copy-Fail-CVE-2026-31431-Kubernetes-PoC)
import os
import socket

# 1. open an AF_ALG socket bound to the vulnerable crypto template
s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET)
s.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
# ... set key via setsockopt(SOL_ALG, ALG_SET_KEY, ...), accept() a request socket ...

# 2. splice the target file (e.g., /usr/sbin/ipset) into the crypto operation;
#    target_fd, the pipe fds, alg_fd, chosen_offset, and auth_tag_size come
#    from setup elided above
os.splice(target_fd, pipe_wr, auth_tag_size, offset_src=chosen_offset)
os.splice(pipe_rd, alg_fd, auth_tag_size)

# 3. trigger decrypt → kernel writes 4 controlled bytes into the page cache
req_socket.recv(1)  # hmac verification fails, but the corruption persists
```
the magic—and the danger—lies in how linux manages file i/o. when a container reads a file from a shared image layer, the kernel serves it from the same physical page cache pages across all containers on that node. this is a performance optimization, not a bug. but when combined with copy fail, it becomes an escape hatch.
why overlay filesystems make this worse
container runtimes like containerd and cri-o use overlayfs to implement copy-on-write semantics. when multiple pods reference the same image layer:
```
host page cache
├── lowerdir: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<layer>/usr/sbin/ipset
├── upperdir: (container-specific, empty for read-only files)
└── merged view: served from shared page cache pages
```
if an unprivileged pod corrupts /usr/sbin/ipset in the page cache, every pod on that node that reads the same file from the same layer sees the corrupted in-memory version—without any cross-container communication, without touching disk, and without triggering traditional file integrity monitors.
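you can check this sharing from the node itself: if the same layer file resolves to one inode, the kernel backs every reader with the same page cache pages. a minimal sketch, assuming containerd's overlayfs snapshotter; the snapshot ids below are hypothetical, so resolve real ones with `ctr` or `crictl` on your node:

```python
#!/usr/bin/env python3
# sketch: check whether two overlayfs snapshot paths resolve to one inode,
# i.e. one set of page cache pages. snapshot ids 42/57 are made-up examples.
import os

SNAP = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"
paths = [
    f"{SNAP}/42/fs/usr/sbin/ipset",  # layer as seen by pod A
    f"{SNAP}/57/fs/usr/sbin/ipset",  # layer as seen by pod B
]

idents = {(s.st_dev, s.st_ino) for s in map(os.stat, paths)}
if len(idents) == 1:
    print("shared inode -> shared page cache pages:", idents.pop())
else:
    print("distinct inodes -> layers are not shared:", idents)
```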
the kube-proxy attack vector: a privileged daemonset waiting to happen
the public kubernetes poc targets /usr/sbin/ipset, a binary used by kube-proxy to manage iptables/ipset rules. here's why this is a perfect storm:
| characteristic | why it matters |
|---|---|
| kube-proxy runs as a privileged daemonset | executes with `hostNetwork: true`, full capabilities, and root uid |
| ipset is invoked periodically | corrupted binary gets executed automatically, no user interaction needed |
| image layer is shared across nodes | same base image (`registry.k8s.io/kube-proxy:v1.35.2`) means same page cache mapping |
| binary is readable by unprivileged users | satisfies the "any readable file" prerequisite for copy fail |
the attack sequence:
- attacker deploys an unprivileged pod with the poc script (no special capabilities required)
- poc corrupts the page cache for `/usr/sbin/ipset` in the shared image layer
- `kube-proxy` on the same node executes the corrupted binary during its next reconciliation loop
- attacker-controlled shellcode runs with kube-proxy's privileges: root on the node, access to host namespaces, and full cluster control via the node's service account
this isn't a "maybe." the poc has been tested and confirmed working on ubuntu, amazon linux, rhel, and suse kernels spanning versions 6.12 through 6.18.
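before patching, it helps to map your exposure: any node where kube-proxy is co-scheduled with workloads you don't trust is a candidate escape path. a minimal sketch using the official kubernetes python client; treating everything outside `kube-system` as untrusted is an assumption you should adapt to your own trust model:

```python
#!/usr/bin/env python3
# sketch: list nodes where non-kube-system pods share a kernel (and thus a
# page cache) with kube-proxy. "untrusted" here is a stand-in for your policy.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
pods = client.CoreV1Api().list_pod_for_all_namespaces(watch=False).items

nodes = defaultdict(lambda: {"proxy": False, "untrusted": 0})
for p in pods:
    if not p.spec.node_name:
        continue  # skip unscheduled pods
    entry = nodes[p.spec.node_name]
    if p.metadata.namespace == "kube-system" and p.metadata.name.startswith("kube-proxy"):
        entry["proxy"] = True
    elif p.metadata.namespace != "kube-system":
        entry["untrusted"] += 1

for name, e in sorted(nodes.items()):
    if e["proxy"] and e["untrusted"]:
        print(f"{name}: {e['untrusted']} untrusted pod(s) share a kernel with kube-proxy")
```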
kubernetes-specific mitigations: patch first, architect second
immediate actions (today)
disable the vulnerable kernel module (temporary)
```yaml
# node-level mitigation via DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-algif-aead
spec:
  selector:
    matchLabels: { app: disable-algif-aead }
  template:
    metadata:
      labels: { app: disable-algif-aead }
    spec:
      hostPID: true
      containers:
        - name: mitigator
          image: alpine:latest
          securityContext:
            privileged: true  # rmmod on the host requires CAP_SYS_MODULE
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "install algif_aead /bin/false" > /host/etc/modprobe.d/disable-algif.conf
              chroot /host rmmod algif_aead 2>/dev/null || true
              sleep infinity  # keep the pod alive so the DaemonSet doesn't crash-loop
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath: { path: /, type: Directory }
```
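after the daemonset rolls out, verify each node actually dropped the module. a minimal sketch, meant to run on the host (or against the same `/host` mount), assuming the modprobe.d path written above:

```python
#!/usr/bin/env python3
# sketch: confirm algif_aead is unloaded and blacklisted on this node.
from pathlib import Path

loaded = any(
    line.split()[0] == "algif_aead"
    for line in Path("/proc/modules").read_text().splitlines()
)
conf = Path("/etc/modprobe.d/disable-algif.conf")
blacklisted = conf.exists() and "install algif_aead /bin/false" in conf.read_text()

print(f"algif_aead loaded:     {loaded}")       # want: False
print(f"modprobe override set: {blacklisted}")  # want: True
```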
block af_alg at the runtime level
use seccomp profiles to prevent af_alg socket creation in untrusted pods:
```yaml
# pod securityContext with seccomp
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/block-af-alg.json
```
the profile itself, at `profiles/block-af-alg.json` (json forbids comments, so for reference: `AF_ALG` is address family 38, the first argument to `socket()`):

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "args": [{ "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }]
    }
  ]
}
```
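once the profile is applied, verify it from inside a pod: the `socket()` call should fail before the kernel's crypto code is ever reached. a quick check, assuming the profile's default errno (EPERM):

```python
#!/usr/bin/env python3
# sketch: run inside a pod that has the seccomp profile applied.
# socket(AF_ALG, ...) should fail with EPERM (SCMP_ACT_ERRNO's default).
import socket

try:
    socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET)
    print("NOT BLOCKED: AF_ALG socket created; check the profile is loaded")
except OSError as e:
    print(f"blocked as expected: {e}")
```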
patch your nodes
apply a kernel containing the upstream fix (commit a664bf3d603d). for managed kubernetes services:
```bash
# EKS: roll the node group onto a patched AMI via a new launch template version
aws eks update-nodegroup-config --cluster-name my-cluster --nodegroup-name my-ng \
  --launch-template name=my-template,version=$NEW_VERSION

# GKE: enable auto-upgrade or manually upgrade nodes
gcloud container clusters upgrade my-cluster --node-pool default-pool \
  --cluster-version=1.35.2-gke.100
```
architectural hardening (this quarter)
- isolate image layers for privileged workloads. use distinct base images for daemonsets like `kube-proxy` that aren't shared with user workloads; this breaks the page-cache propagation path.
- adopt pod security admission (psa) or gatekeeper policies. enforce that pods cannot request `hostPath` volumes, `privileged` mode, or `af_alg`-capable seccomp exemptions.
- restrict pod placement with node affinity. prevent untrusted workloads from scheduling on nodes running privileged daemonsets with shared base images:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist
            - key: workload-trust-level
              operator: In
              values: ["untrusted"]
```
enforce read-only root filesystems
while copy fail bypasses on-disk checks, a read-only rootfs limits post-exploitation persistence options:
```yaml
securityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
```
detection strategies for kubernetes environments
copy fail is stealthy by design: the corrupted page is never marked dirty, so on-disk checksums remain valid. detection requires behavioral signals:
- monitor for anomalous `kube-proxy` behavior. corrupted `ipset` execution may cause:
  - unexpected iptables rule modifications
  - `kube-proxy` crash loops with unusual stack traces
  - auth.log entries with missing invoking usernames (see original advisory)
- watch for poc network artifacts. non-stealthy attackers may fetch exploit code from https://copy.fail/exp. alert on egress to this domain from cluster pods (see the sketch below).
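a minimal sketch of that egress check, assuming the coredns `log` plugin is enabled (adjust the label selector and lookback window for your cluster):

```python
#!/usr/bin/env python3
# sketch: scan CoreDNS query logs for lookups of the PoC host copy.fail.
# a quick triage signal, not a substitute for real egress policy.
import re
import subprocess

logs = subprocess.run(
    ["kubectl", "logs", "-n", "kube-system", "-l", "k8s-app=kube-dns",
     "--tail=10000", "--prefix"],
    capture_output=True, text=True, check=True,
).stdout

for line in logs.splitlines():
    if re.search(r"\bcopy\.fail\b", line):
        print("possible PoC fetch:", line)
```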
correlate pod scheduling with kernel version
flag any unprivileged pod scheduled on a node running an unpatched kernel:
```bash
# quick cluster audit: flag nodes on the kernel series confirmed vulnerable
# (6.12-6.18); adjust for your distro's backported patches
kubectl get nodes -o json | jq -r '.items[] |
  select(.status.nodeInfo.kernelVersion | test("^6\\.1[2-8]\\.")) |
  .metadata.name'
```
audit af_alg socket creation
use auditd or ebpf-based tracing to alert on unexpected socket(AF_ALG, ...) calls from containerized processes:
```
# ebpf trace example (bpftrace)
tracepoint:syscalls:sys_enter_socket /args->family == 38/ {
  printf("AF_ALG socket from pid %d (%s)\n", pid, comm);
}
```
the uncomfortable truth about container "isolation"
copy fail exposes a fundamental tension in container security: performance optimizations (shared page cache, overlayfs) directly conflict with isolation guarantees. the linux kernel was never designed with multi-tenant container workloads as a primary threat model—and it shows.
this isn't a call to abandon containers. it's a reminder that "isolation" is a spectrum, not a binary. defense-in-depth means:
- assuming local privesc vulnerabilities will exist
- minimizing the blast radius when they do
- treating kernel patch latency as a first-order risk metric
because when four bytes can buy you the entire node, your pod security policy just became a suggestion.
references
- wiz.io: copy fail vulnerability advisory
- xint technical writeup: copy fail
- kubernetes poc: cve-2026-31431 container escape
- upstream kernel fix: commit a664bf3d603d
- kubernetes pod security admission docs
- seccomp profiles for kubernetes
source: adapted from wiz.io blog post by amitai cohen, merav bar, and shahar dorfman (may 1, 2026) and xint code research (april 29, 2026)