Notes to self, 2026
2026-05-24 - containerd / kubernetes / open file limits
After upgrading a customer's Kubernetes cluster running on Ubuntu/Jammy, we ran into a snag. The vernemq instances wouldn't start completely; instead they gave us EMFILE errors: Too many open files. This came as a surprise. After all, we hadn't touched any limits.
But, it turned out containerd
v1.8 changed the LimitNOFILE setting from a very
permissive infinity to the systemd default. The result?
Processes inside Kubernetes would get a measly max 1024 file descriptors
by default.
containerd upgrades in Ubuntu
Normally Ubuntu is super conservative when migrating versions: once a stable LTS is released, you need to move heaven and earth to get a patch in. This time however, they bumped the containerd package from 1.7 to 2.2 within Jammy (22.04) and Noble (24.04) — in fact, Jammy has even seen two version bumps, as it started with version 1.5.
For such a big version jump, you might expect there to be impacting changes — Canonical did not, or at least didn't see any impact.
LimitNOFILE and systemd
For systemd started processes, systemd sets the
nofile ulimit to the values set in the service
file for the specific application. For example:
[Service]
LimitNOFILE=infinity  # value of /proc/sys/fs/nr_open for soft/hard
Or the default:
[Service]
LimitNOFILE=1024:524288  # 1024 soft, 524288 hard
The systemd
exec manual recommends leaving the soft limit at 1024 because of
old processes still using select(2), which cannot handle file descriptors numbered 1024 or higher.
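To see which values a process actually ended up with, it can ask the kernel directly. The following is a minimal sketch, assuming Linux with /proc mounted; the function names are made up for this illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def kernel_nr_open(path="/proc/sys/fs/nr_open"):
    """Return the kernel's per-process ceiling, or None if it cannot be read."""
    try:
        with open(path) as fp:
            return int(fp.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} nr_open={kernel_nr_open()}")
```

Run inside a container, this shows whether the process inherited the 1024 soft limit or something more generous.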
containerd ulimit history
At containerd they have had a hard time deciding what an appropriate value is:
| Year | Ver | Limit | Commit | PR |
|---|---|---|---|---|
| 2017 | v1.0 | LimitNOFILE=1048576 | b009642 | #1846 |
| 2018 | v1.2 | LimitNOFILE=infinity | 4972e3f | #2601 |
| 2019 | v1.3 | LimitNOFILE=1048576 | 1a1f8f1 | #3202 |
| 2020 | v1.5 | LimitNOFILE=infinity | c691c36 | #4475 |
| 2023 | v1.8 | # unset | 3ca39ef | #8924 |
The maximum number of open file descriptors available to a
process has been alternating between 1048576:1048576 and
infinity, which in practice meant the same thing (except with
pre-2019 systemd, where infinity meant 65536:65536).
Depending on which systemd version you were using, and which
containerd defaults, you got between 65536 and
1048576 as the soft and hard limit.
But now, after 2023, since containerd 1.8 and higher, you get a
default of 1024 and 524288 for soft and hard
limits respectively.
The earlier adjustments had no practical effect, but this last change did.
65536 is in many cases enough for everyone, but
1024 definitely isn't.
Impact on workload
If your containerd instances are starting your Kubernetes containers you may suddenly notice that you're running out of file descriptors. Example:
!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
Exec: /vernemq/bin/../erts-11.1.8/bin/erlexec -boot /vernemq/bin/../releases/1.13.0/vernemq ...
11:27:23.787 [error] File operation error: emfile. Target: /vernemq/bin/../lib/mongodb-3.4.4/ebin/vmq_ql_query.beam. Function: get_file. Process: code_server.
For some processes, like this old vernemq here, this is fatal. For other processes, this results in degraded performance when cache files or database tables have to be closed more quickly than strictly necessary.
The systemd default value of 1024 fixes a real problem — but only for very old software. For Kubernetes it creates a problem to which few applications have an answer.
Solutions
The best solution is for every application to assess its file descriptor needs beforehand, and to raise its soft limit to an appropriate value. However, not all applications actually do this.
Programmatically, one has to call setrlimit(RLIMIT_NOFILE, ...).
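As an illustration of that call, here is a minimal Python sketch that raises the soft limit at startup without touching the hard limit. The function name and the 65536 default are assumptions for this example, not something taken from any of the applications discussed here.

```python
import resource


def raise_nofile_soft_limit(wanted=65536):
    """Raise the soft RLIMIT_NOFILE to `wanted`, capped at the hard limit.

    Returns the (soft, hard) pair that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_soft_limit())
```

The limit is never lowered: if the inherited soft limit is already at or above the wanted value, the function leaves it alone.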
Or, you could start your application from a shell and call ulimit first.
For instance, for this old vernemq statefulset
I had to extract the ENTRYPOINT from the image and then
manually set this in the container spec:
command: ["/bin/sh"]
args: ["-c", "ulimit -n 131072; exec /usr/sbin/start_vernemq"]
(Setting ulimit -n like that would typically be done from a
Dockerfile entrypoint shell script because
Kubernetes does not provide any means to set it from a spec.)
Alternatively, we could set LimitNOFILE=infinity on the
containerd daemon ourselves via a systemd drop-in, or
at least raise it above 1024.
The question: do we change every application, or do we change the defaults? And to what?
Checking current workload
Before deploying containerd 2.2 everywhere, we'll have to decide what to do. And because we have more than one Kubernetes cluster to examine for this particular issue, we'll use a script to get the details: find_ulimit_nofile.py
The script checks running processes, counts open file descriptors and reports the processes if (a) the soft limit appears unchanged, and (b) the open file descriptor count is getting close to or over 1024.
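The core of such a check can be sketched as follows. This is not the actual find_ulimit_nofile.py, just a minimal illustration assuming Linux /proc and root privileges to read other processes' fd directories. The 768 threshold is an assumption, and the "soft limit appears unchanged" test is left out.

```python
import os


def parse_soft_nofile(limits_text):
    """Extract the soft 'Max open files' value from /proc/PID/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            value = line.split()[3]
            return None if value == "unlimited" else int(value)
    return None


def scan(proc="/proc", threshold=768):
    """Return (pid, fds, soft, comm) for processes with many open fds."""
    try:
        entries = os.listdir(proc)
    except OSError:
        return []
    rows = []
    for entry in entries:
        if not entry.isdigit():
            continue
        base = os.path.join(proc, entry)
        try:
            fds = len(os.listdir(os.path.join(base, "fd")))
            with open(os.path.join(base, "limits")) as fp:
                soft = parse_soft_nofile(fp.read())
            with open(os.path.join(base, "comm")) as fp:
                comm = fp.read().strip()
        except OSError:
            continue  # process exited, or we lack permission
        if fds >= threshold:
            rows.append((int(entry), fds, soft, comm))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for pid, fds, soft, comm in scan():
        print(pid, fds, soft, comm)
```

The output shown next comes from the real script, not from this sketch.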
When run on a few containerd 1.7 nodes, we get output like:
# python3 ./find_ulimit_nofile.py
PID FDs SOFT COMM EXE
867893 16831 1048576 beam.smp /vernemq/erts-11.1.8/bin/beam.smp
1843653 3064 1048576 argocd-applicat /usr/local/bin/argocd
4080120 2920 1048576 java /usr/lib/jvm/java-17-openjdk-17.0.11.0.9-2.el8.x86_64/bin/java
2391 1881 1048576 containerd /usr/bin/containerd
2610690 1157 1048576 mysqld /opt/bitnami/mariadb/sbin/mariadbd
2703298 1078 1048576 redis-server /usr/local/bin/redis-server
1046409 990 1048576 postgres /usr/lib/postgresql/16/bin/postgres
1046484 989 1048576 postgres /usr/lib/postgresql/16/bin/postgres
1771834 972 1048576 dd-ipc-helper /memfd:spawn_worker_trampoline (deleted)
1771390 834 1048576 dd-ipc-helper /memfd:spawn_worker_trampoline (deleted)
1975742 777 1048576 nginx /usr/local/nginx/sbin/nginx
1906067 777 1048576 nginx /usr/local/nginx/sbin/nginx
767917 771 1048576 java /usr/share/elasticsearch/jdk/bin/java
This script output shows the LimitNOFILE=infinity
situation, before moving to the heavily capped
LimitNOFILE=1024:524288 situation. For each of these
processes, we can check whether they are prepared for, and capable of,
raising their own limits.
As one of the few examples, MariaDB actually does set
RLIMIT_NOFILE dynamically at startup, according to the
source. Postgres does not, but does read the value and
dynamically limits how many tables it can keep open.
Containerd sets its own limit (hard = soft), but does
not propagate it to its children.
Those two nginx ingress controllers? Yes, they set it, provided
worker_rlimit_nofile is configured, which it normally is.
Elasticsearch? I doubt it sets anything from Java,
but maybe an entrypoint.sh script does. That other
Java process? Kafka this time. No idea.
After investigating a bunch of different processes you can
probably tell my enthusiasm is wearing off.
Conclusions
I think we can draw two conclusions here.
One: trying to figure out whether every workload sets its own soft open-file limit is a losing battle. Only if you run a well-defined, limited set of applications would it make sense to raise the limit per application.
Two: seeing that the maximum number of open file descriptors doesn't exceed 16k and usually not even 4k, we can instead raise the default limits to something generous but not ridiculous: 65536 or 131072.
The choice is a trade-off:
- We accept that old-style select(2)-using applications might not work: we can fix them with a custom entrypoint ulimit if needed.
- We could set the soft limit to the hard limit and give everyone 512K or 1024K limits. But keeping them slightly more modest gives us earlier detection of file descriptor leaks (in broken applications) and better behaviour if the process tries to close all possible file descriptors. (Examples of that and further reading at rsyslog doing 100% CPU trying to close a billion file descriptors, CPython adding close_range(2) support and a containerd summary of the rationale to remove LimitNOFILE=infinity.)
We'll settle for this:
[Service]
LimitNOFILE=131072:524288
2026-05-08 - linux / local root exploit / module vetting
Recently, we were greeted with the Copy Fail Linux kernel vulnerability. Mitigating this was a matter of denylisting a module. But, only eight days later, there was another exploit, also (ab)using AF_ALG and kernel module autoloading. I'm betting this is not the last, now that the kernel is scrutinized using AI models that keep getting more advanced.
Luckily, we had our machine inventory up to date. So when CVE-2026-31431 ("Copy Fail") came along, deploying a mitigation was a matter of:
- Creating /etc/modprobe.d/cve-2026-31431.conf everywhere, with:
install algif_aead /bin/false
- checking our loaded module inventory (the os.kernel
GoCollect collector collects this for us) to see if
af_alg, algif_aead or authencesn was already loaded anywhere;
- and lastly, testing that the specific exploit is now mitigated:
$ python -c 'from socket import *;s=socket(AF_ALG,SOCK_SEQPACKET);s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));print("metsys elbarenluv a si siht ,tihs"[::-1])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory
An error is good: it means the exploit won't work.
Locking down autoloading
I remembered we had been discussing locking down the kernels further, and specifically locking down the loading of (normally) unused modules. Because we expect more bugs to be found in other modules, we'd rather stay ahead of the game and reject them beforehand.
Right now, there seem to be two ways to handle auto-loading:
- Disabling all explicit module loading — using the kernel.modules_disabled sysctl;
- Allowing all module loading, including implicit module loading by
unprivileged users: any user calling e.g.
socket(AF_ALG) can get certain modules loaded into the kernel.
Obviously, disabling all unneeded modules or disabling module loading altogether seems like the most secure fix. But, we never got around to the tedious work of figuring out which modules we actually need.
And, locking down module loading once the system is up is nice.
But do you really know when it is fully up? Maybe your
Ceph daemonsets inside your Kubernetes cluster hadn't
started yet, and now you've locked down the modules before loading
ceph.
Disallowing at least non-root users from (implicit) module loading sounds like a useful mitigation, but the kernel does not support any modules_autoload_mode. Apparently Linus decided against it. And maybe it is too hard to reason about these permissions when there are also namespaces at play.
So, is there another middle ground?
Module vetting
Can we allowlist modules without loading them beforehand?
Yes, we can. If we put install * /bin/false in
/etc/modprobe.d/zz-denylist.conf, that gets loaded last
and rejects anything that is not previously allowed.
Allowlisting modules is then a matter of adding many, many lines of this:
install foo /sbin/modprobe --ignore-install foo
install bar /sbin/modprobe --ignore-install bar
install baz /sbin/modprobe --ignore-install baz
Make sure they are loaded earlier, by using a lexicographically earlier filename,
like /etc/modprobe.d/00-allowlist.conf.
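Writing those lines by hand gets old quickly. As a minimal sketch, the allowlist file could be generated from the modules currently loaded on a machine. The helper names are made up for this illustration, and treating "everything loaded right now" as the allowlist is an assumption you would want to review.

```python
def loaded_modules(path="/proc/modules"):
    """Return the names of the currently loaded modules (first column)."""
    try:
        with open(path) as fp:
            return [line.split()[0] for line in fp if line.strip()]
    except OSError:
        return []


def allowlist_lines(modules):
    """Turn module names into modprobe.d 'install' allowlist lines."""
    # Loaded-module names use underscores; normalize so input from
    # filenames (which may contain dashes) ends up in the same form.
    names = sorted({m.strip().replace("-", "_") for m in modules if m.strip()})
    return [f"install {m} /sbin/modprobe --ignore-install {m}" for m in names]


if __name__ == "__main__":
    # Redirect this output to e.g. /etc/modprobe.d/00-allowlist.conf
    print("\n".join(allowlist_lines(loaded_modules())))
```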
The hard part
The hard part is knowing which modules we need. As mentioned above,
we get os.kernel loaded-module info from
GoCollect, so we have a good idea which modules we probably need.
Figuring out which modules we need is a tedious task, but if we simply look at the currently loaded modules on our fleet, we see that there are fewer than 600 modules loaded total on all machines, of differing types. In the most pessimistic scenario, a single machine would still only use 10% of the total modules available. So, allowing them, while denying the rest cuts down the available modules to attack by a great deal.
Assuming we now covered which modules we need, can we make it smarter?
kernel.modprobe
Yes, instead of hardcoding the list in configuration files, we can put them in a script. By using the kernel.modprobe sysctl setting, we can create a wrapper that does the vetting for our allowlist.
This wrapper script denies auto-load of certain modules: it does not disable insmod or (explicit) modprobe directly. This way it exactly targets the nonprivileged users we're trying to block, while still allowing the admin to load additional modules by hand if needed.
When the kernel tries to auto-load a module, it doesn't
necessarily call /sbin/modprobe. It calls the executable
in the kernel.modprobe sysctl — which we override as
/usr/local/sbin/vetted-modprobe. That script gets called
with arguments -q -- some_module and it can decide whether
to honour the request or not.
Note that the kernel calls the script. You cannot decide which process or user gets permissions, but you can choose which module is allowed.
/usr/local/sbin/vetted-modprobe
Instead of doing many lines in
/etc/modprobe.d/00-allowlist.conf, we create a
/usr/local/sbin/vetted-modprobe wrapper:
#!/bin/sh
# Requires: sysctl kernel.modprobe=/usr/local/sbin/vetted-modprobe
# See: https://www.osso.nl/blog/2026/linux-local-root-exploit-module-vetting/
set -u
log() {
if test -t 2; then echo "$0: $*" >&2; fi
logger -t vetted-modprobe -p auth.notice "$*"
}
# We assume we're called as "-q -- MODULE_LIST" (from the kernel).
# Some other services might omit "-q --".
# Process them one by one.
if test $# -ge 3 && test "$1" = '-q' && test "$2" = '--'; then
shift; shift
fi
# This may either give us an error:
# - modprobe: FATAL: Module foobar not found in directory /lib/modules/6.8.0...
# Or one or more suggested modules to load:
# - insmod /lib/modules/6.8.0-87-generic/kernel/crypto/af_alg.ko.zst
# - insmod /lib/modules/6.8.0-87-generic/kernel/crypto/algif_aead.ko.zst
plan=$(/sbin/modprobe -n -v -- "$@" 2>&1)
ret=$?
if test $ret -ne 0; then
log "modprobe -n failed for '$*': $plan"
exit $ret
fi
if test -z "$plan"; then
exit 0
fi
unvetted=$(printf '%s\n' "$plan" | while read action filename; do
test "$action" = insmod || continue
# Strip dirname and trailing ".ko" or ".ko.zst".
filename=${filename##*/}; filename=${filename%.ko*}
# OBSERVE: If we take the list from lsmod, we see filenames with only
# underscores. But the files themselves might contain dashes. E.g.:
# "nls_iso8859-1.ko.zst" shows up as "nls_iso8859_1" in lsmod.
# Design choice: use the lsmod output, because it's so much easier to
# obtain than the dash/underscore mixing filenames.
# Normalize all dashes to underscores in pure POSIX shell:
while test "${filename#*-}" != "$filename"; do
filename=${filename%%-*}_${filename#*-}
done
case "$filename" in
# NOTE: Any aliases have been resolved (like net-pf-38 => af_alg).
#
# vv-------- EXAMPLES HERE --------vv
#
# Some modules:
allowed_module1|allowed_module2);;
# More modules:
mod_foo|mod_bar|mod_baz);;
#
# Explicitly _not_ allowed:
#
# CVE-2026-31431: "Copy Fail"
#algif_aead) echo "$filename";;
# CVE-2026-43284+CVE-2026-43500: "Dirty Frag"
#esp4|esp6|rxrpc) echo "$filename";;
#
# ^^-------- EXAMPLES HERE --------^^
#
# NOTE: The unmatched (unvetted) modules are echoed.
*) echo "$filename";;
esac
done)
if test -n "$unvetted"; then
log "deny 'modprobe -q -- $*'; because unvetted '$unvetted'"
exit 1
fi
exec /sbin/modprobe -q -- "$@"
That's the gist of the script. Only auto-loading of the modules in
the case statement is allowed. If you try to load an
unvetted module, it gets rejected with the following log
message:
$ sudo journalctl -t vetted-modprobe --facility auth
deny 'modprobe -q -- algif-skcipher'; because unvetted 'algif_skcipher'
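To test an allowlist without touching a live system, the vetting decision can be modelled outside the shell. The sketch below mirrors the logic of the script in Python: take the "modprobe -n -v" plan, reduce each insmod line to a normalized module name, and report what is not on the list. The function names are hypothetical; the script itself remains the reference.

```python
import posixpath


def module_names(plan):
    """Extract normalized module names from 'modprobe -n -v' output."""
    names = []
    for line in plan.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != "insmod":
            continue
        filename = posixpath.basename(parts[1])
        # Strip ".ko", ".ko.zst" and similar suffixes.
        stem = filename.rpartition(".ko")[0] or filename
        # lsmod shows underscores where filenames may have dashes.
        names.append(stem.replace("-", "_"))
    return names


def unvetted(plan, allowlist):
    """Return the planned modules that are not on the allowlist."""
    return [name for name in module_names(plan) if name not in allowlist]


if __name__ == "__main__":
    plan = (
        "insmod /lib/modules/6.8.0-87-generic/kernel/crypto/af_alg.ko.zst\n"
        "insmod /lib/modules/6.8.0-87-generic/kernel/crypto/algif_aead.ko.zst\n"
    )
    print(unvetted(plan, {"af_alg", "algif_rng"}))
```

A request is honoured only when the returned list is empty, which matches the "deny ... because unvetted" behaviour logged above.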
Which modules are used?
As mentioned, the hard part is deciding which modules to allow. The script itself is easy. The list I compiled today has fewer than 600 modules in it (including modules that are not available in all kernels), so it cuts down the amount of allowed modules by a big margin.
The following list goes as contents of the case statement above.
You should tweak this to your liking:
the allowlisted modules are matched without action.
The rest gets the echo "$filename" treatment and gets
rejected.
CAVEAT EMPTOR: These modules are NOT necessarily safe from exploits. But they are actively in use (in our systems), and they account for less than 10% of total modules, so we massively cut down the attack space.
#
# ---------------- VETTED BELOW ----------------
#
# NOTE: Any aliases have been resolved (like net-pf-38 => af_alg).
# Seen everywhere:
8250_dw|acpi_ipmi|acpi_pad|acpi_power_meter|acpi_tad|aesni_intel);;
af_packet_diag|ahci|amd64_edac|ast|autofs4|binfmt_misc|bonding);;
br_netfilter|bridge|btrfs|ccp|cdc_ether|cec|cfg80211|coretemp);;
crc32_pclmul|crct10dif_pclmul|cryptd|crypto_simd|dmi_sysfs);;
drm|drm_kms_helper|drm_ttm_helper|drm_vram_helper|edac_mce_amd);;
ee1004|efi_pstore|failover|fb_sys_fops|floppy|ghash_clmulni_intel);;
hid|hid_generic|i2c_algo_bit|i2c_i801|i2c_piix4|i2c_smbus);;
ib_core|ib_uverbs|icp|idma64|ie31200_edac|inet_diag|input_leds);;
intel_cstate|intel_lpss|intel_lpss_pci|intel_pch_thermal);;
intel_powerclamp|intel_rapl_common|intel_rapl_msr|intel_tcc_cooling);;
ip6_tables|ip6_udp_tunnel|ip6t_REJECT|ip6table_filter);;
ip6table_mangle|ip6table_raw|ip_set|ip_set_hash_ip|ip_set_hash_net);;
ip_tables|ipmi_devintf|ipmi_msghandler|ipmi_si|ipmi_ssif|ipt_REJECT);;
ipt_rpfilter|iptable_filter|iptable_mangle|iptable_nat|iptable_raw);;
irqbypass|joydev|k10temp|kvm|kvm_amd|kvm_intel|libahci|libcrc32c|llc);;
mac_hid|macsec|mei|mei_me|mii);;
mlx5_core|mlx5_dpll|mlx5_ib|mlxfw|mptcp_diag);;
net_failover|netlink_diag|nf_conntrack|nf_conntrack_netlink);;
nf_defrag_ipv4|nf_defrag_ipv6|nf_log_syslog|nf_nat|nf_reject_ipv4);;
nf_reject_ipv6|nf_socket_ipv4|nf_socket_ipv6|nf_tables);;
nf_tproxy_ipv4|nf_tproxy_ipv6|nfnetlink|nfnetlink_acct|nfnetlink_log);;
nft_chain_nat|nft_compat|nft_counter|nls_iso8859_1);;
nvme|nvme_auth|nvme_core|nvme_fabrics|nvme_keyring|overlay);;
pci_hyperv_intf|pinctrl_cannonlake|polyval_clmulni|polyval_generic);;
psample|psmouse|ptdma|raid6_pq|rapl|raw_diag|rc_core|rndis_host);;
sch_fq_codel|serio_raw|sha1_ssse3|sha256_ssse3|spl|stp);;
syscopyarea|sysfillrect|sysimgblt|tcp_diag|tls|ttm);;
udp_diag|udp_tunnel|unix_diag|usbhid|usbnet|veth|video|wmi|wmi_bmof);;
x86_pkg_temp_thermal|x_tables|xfrm_algo|xfrm_user);;
xhci_pci|xhci_pci_renesas|xor|xsk_diag|zavl|zcommon);;
zfs|zlua|znvpair|zunicode|zzstd);;
# Seen on many systems (30+):
8021q|amdgpu|amdxcp|async_memcpy|async_pq|async_raid6_recov|async_tx);;
async_xor|blake2b_generic|bnxt_en|bochs|bpfilter|chacha_x86_64);;
cls_bpf|cmdlinepart|curve25519_x86_64|dca|drm_buddy);;
drm_display_helper|drm_exec|drm_suballoc_helper|dummy);;
ebtable_filter|ebtables|garp|glue_helper|gpu_sched|igb);;
intel_pmc_core|intel_uncore_frequency|intel_uncore_frequency_common);;
intel_vsec|ioatdma|ip6table_nat|ip_tunnel|ipip|jc42|libceph);;
libchacha|libchacha20poly1305|libcurve25519_generic|linear);;
lp|lpc_ich|mrp|mtd|multipath|nbd|nfit|nft_limit|nft_log);;
parport|pata_acpi|pcspkr|pmt_class|pmt_telemetry|poly1305_x86_64);;
qemu_fw_cfg|raid0|raid1|raid10|raid456|rbd|sb_edac|sch_ingress);;
sctp|skx_edac_common|softdog|spi_intel|spi_intel_pci|spi_nor|sunrpc);;
tap|tunnel4|usbmouse|vga16fb|vgastate|vhost|vhost_iotlb|vhost_net);;
vmgenid|vxlan|wireguard|xfs|xhci_hcd);;
# Seen on GPU systems:
drm_gpuvm|nvidia|nvidia_drm|nvidia_modeset|nvidia_uvm);;
# Seen on mgmt/storage systems:
aufs|authenc|bluetooth|bochs_drm|ceph|cpuid|crc8|ecc|ecdh_generic);;
ftdi_sio|fscache|gnss|i40e|ice|intel_qat|irdma|isci|isst_if_common);;
libsas|mgag200|msr|netfs|qat_c62x);;
scsi_transport_iscsi|scsi_transport_sas|ses|usbserial|vmd);;
iommufd|pl2303|pnd2_edac|qat_c3xxx|vfio|vfio_iommu_type1);;
vfio_pci|vfio_pci_core);;
# Seen on storage systems:
cxl_acpi|cxl_core|cxl_port|dax_hmem|enclosure|iaa_crypto);;
idxd|idxd_bus|intel_ifs|intel_sdsi|mpt3sas|pfr_telemetry|pfr_update);;
pinctrl_emmitsburg|qat_4xxx);;
# Seen on NAT gateways or load balancers:
cls_matchall|cls_u32|tcp_bbr);;
# Seen on ci-runners (why?):
af_alg|algif_rng);;
# Seen on older Cumulus switches (common):
ablk_helper|accton_as7326_56x_platform|acpi_cpufreq|aes_x86_64|at24);;
cpr4011|crc32c_intel|cumulus_platform|dm_mod|ebt_police|ebt_setclass);;
eeprom_class|efivarfs|efivars|fuse|gf128mul|gpio_ich|hwmon);;
i2c_core|i2c_dev|i2c_ismt|i2c_mux|i2c_mux_pca954x);;
iTCO_vendor_support|iTCO_wdt|ixgbe|kernel_bde|knet|lm75);;
loop|lrw|mdio|mfd_core|mpls_iptunnel|mpls_router);;
nf_conntrack_ipv4|nf_nat_ipv4|pmbus_core|sff_8436_eeprom|shpchp|tg3);;
tpm|tpm_tis|tun|user_bde|vrf);;
# Seen on older Cumulus switches (rare):
accton_as7726_32x_platform|arp_tables|arptable_filter);;
delta_ag5648v1_platform|delta_ag9032v2_platform|dps460|emc2305);;
gpio_pca953x|ipmi_poweroff|quanta_ix7_cpld|quanta_ix7_platform);;
quanta_ix8_cpld|quanta_ix8_platform|quanta_ly4r_platform|thermal);;
tpm_crb|vhwmon);;
# Seen on IPsec:
# NOTE: Add 'esp4' in modules-load.d explicitly for ipsec.
echainiv|nf_conntrack_ftp|tunnel6);;
xfrm6_tunnel|xfrm_interface|xt_policy);;
# Seen on PVEs:
act_police|amd_atl|bnxt_re|cls_basic|drm_panel_backlight_quirks);;
drm_shmem_helper|ehci_hcd|ehci_pci|fwctl|i10nm_edac|isst_if_mbox_pci);;
iscsi_tcp|isst_if_mmio);;
libiscsi|libiscsi_tcp|mlx5_fwctl|nvme_common|raid_class|ramoops);;
scsi_common|scsi_dh_alua|scsi_dh_emc|scsi_dh_rdac|scsi_mod);;
sch_htb|sctp_diag|sdhci|sdhci_pci|sdhci_uhs2);;
sg|simplefb|skx_edac|spd5118);;
usbkbd|xt_connmark|xt_mac);;
# Seen on older systems:
reed_solomon|zstd_compress);;
pstore_blk|pstore_zone);;
# Seen on VPN:
ovpn);;
# iptables (heavy use)
xt_CT|xt_LOG|xt_MASQUERADE|xt_NFLOG|xt_POLICE|xt_SETCLASS|xt_TPROXY);;
xt_addrtype|xt_comment|xt_conntrack|xt_hashlimit|xt_length|xt_limit);;
xt_mark|xt_multiport|xt_nat|xt_nfacct|xt_physdev|xt_recent);;
xt_set|xt_socket|xt_state|xt_statistic|xt_tcpudp);;
# iptables (rare)
ip_set_bitmap_port|ip_set_hash_ipport|ip_set_hash_ipportip);;
ip_set_hash_ipportnet);;
xt_CHECKSUM|xt_REDIRECT|xt_hl|xt_owner|xt_string|xt_tcpmss|xt_u32);;
# netfilter (rare)
nf_conntrack_pptp);; # only rs420 tunnel
nf_log_common|nf_log_ipv4|nf_log_ipv6|nf_nat_ftp|nf_nat_ipv6);;
nft_masq);;
# virtio (common)
virtio_blk|virtio_net|virtio_scsi);;
# virtio (rare)
virtio|virtio_balloon|virtio_pci|virtio_pci_legacy_dev);;
virtio_pci_modern_dev|virtio_ring|virtio_rng);;
# Other (very rare.. leftovers):
aacraid|amd64_edac_mod|apex|ata_generic|ata_piix);;
button|cdrom|configfs|cqhci);;
crc16|crc32c_generic|crc64|crc64_rocksoft|crc_t10dif);;
crct10dif_common|crct10dif_generic|dm_multipath|e1000e);;
ebtable_nat|einj|evdev|ext4|gasket|geneve|hfs|hfsplus|hpilo);;
ib_cm|ib_iser|intel_pmc_ssram_telemetry);;
intel_pmt|intel_th|intel_th_gth|intel_th_pci);;
ip6t_rt|ip_vs|ip_vs_rr|ip_vs_sh|ip_vs_wrr|iw_cm|jbd2|jfs);;
kheaders|libata|mbcache|megaraid_sas|minix|msdos|mxm_wmi);;
nouveau|ntfs|pmbus|pmt_discovery|qnx4|qrtr|rdma_cm|regmap_i2c);;
rfkill|sd_mod|sfc|sha512_generic|sha512_ssse3|spi_intel_platform);;
sr_mod|t10_pi|ts_bm|uas|ufs|uhci_hcd);;
usb_common|usb_storage|usbcore);;
vhost_vsock|vmw_vsock_virtio_transport_common|vmwgfx);;
vsock|vsock_diag);;
# Seen on desktop systems, do not include these:
#algif_hash|algif_skcipher|amd_pmc|amd_pmf|amd_sfh|amdtee);;
#amdxdna|asus_wmi|auth_rpcgss|bnep|btbcm|btintel|btmtk|btrtl|btusb);;
#cdc_acm|cmac|cp210x|cros_ec|cros_ec_chardev|cros_ec_debugfs);;
#cros_ec_dev|cros_ec_hwmon|cros_ec_lpcs|cros_ec_proto|cros_ec_sysfs);;
#dm_crypt|eeepc_wmi|gpio_cros_ec|gpio_keys|grace|i915);;
#led_class_multicolor|leds_cros_ec|ledtrig_audio|libarc4|lockd);;
#mac80211|mei_hdcp|mei_pxp|mfd_aaeon|mt76|mt76_connac_lib);;
#mt7925_common|mt7925e|mt792x_lib|nfs_acl|nfsd);;
#parport_pc|platform_profile|ppdev|r8169|realtek|rfcomm);;
#snd|snd_hda_codec|snd_hda_codec_alc269|snd_hda_codec_atihdmi);;
#snd_hda_codec_generic|snd_hda_codec_hdmi|snd_hda_codec_realtek);;
#snd_hda_codec_realtek_lib);;
#snd_hda_core|snd_hda_intel|snd_hda_scodec_component|snd_hrtimer);;
#snd_hwdep|snd_intel_dspcfg|snd_intel_sdw_acpi|snd_pcm|snd_rawmidi);;
#snd_seq|snd_seq_device|snd_seq_dummy);;
#snd_seq_midi|snd_seq_midi_event);;
#snd_timer|soc_button_array|soundcore|sparse_keymap|tee|thunderbolt);;
#typec|typec_ucsi|ucsi_acpi|uhid);;
#
# ---------------- UNVETTED BELOW ----------------
#
# CVE-2026-31431: "Copy Fail"
#algif_aead) echo "$filename";;
# CVE-2026-43284+CVE-2026-43500: "Dirty Frag"
#esp4|esp6|rxrpc) echo "$filename";;
*) echo "$filename";;
The full script can be downloaded from vetted-modprobe.
Don't forget to make /usr/local/sbin/vetted-modprobe executable,
to put kernel.modprobe=/usr/local/sbin/vetted-modprobe in
/etc/sysctl.d/92-vetted-modprobe.conf, and to apply
it with sysctl -p /etc/sysctl.d/92-vetted-modprobe.conf.
And, because it is a script, you can complicate it all you want, with
includes and excludes and auto-updates and whatever floats your boat.
Maybe you only want to allow esp4 if ipsec is
in the hostname. The possibilities are endless.
Summarizing
If you can set kernel.modules_disabled=1, then please do.
If you can't, then maybe try the vetted-modprobe above.
Update 2026-05-12
The post has been edited to add the CVE numbers to Dirty Frag, and to fix a bug with the lsmod-provided values. The included modules all had underscores, but some filenames have dashes. All filenames are now normalized to underscores before checking the allowlist.
Update 2026-05-18
Additionally, the IRC nat/conntrack modules have been removed from
the examples. And the vetted-modprobe now also accepts modules without
-q -- as first two arguments. The reason being that some processes
(Proxmox components instead of the kernel?) call vetted-modprobe
instead of modprobe directly.
2026-02-04 - macos tahoe / ecn / slow downloads
Recently, a customer reported an intermittent but frustrating issue: since upgrading to macOS 26 (Tahoe), downloads from our servers were occasionally crawling. Not always, and not for everyone. The culprit turned out to be a combination of Explicit Congestion Notification (ECN), NIC offloading limitations, and the way classic congestion control algorithms react to "well-intentioned" signals.
The macOS Tahoe ECN lottery
The first mystery was the intermittency. We noticed that some connections were fast, while others to the same server appeared throttled.
The customer had already done a ton of investigating for us — this was great:
- Change: the issues started after upgrading their Macs to version 26 (Tahoe).
- Toggle: they told us that disabling ECN using
net.inet.tcp.ecn_initiate_out=0 restored full speed.
- Debug: they set up a while loop which downloaded the same file every 15 seconds, and provided us with the timestamps when the downloads were slow.
Looking at the traffic with tcpdump revealed that Explicit Congestion Notification (ECN) was indeed the trigger for the slowness. It also revealed that ECN was not applied to all outgoing connections, only to about 5% of them.
Connections with ECN and connections without
When a connection was using ECN, it looked like this in tcpdump:
14:24:46.778055 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [SEW], seq 276182108, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 936789772 ecr 0,sackOK,eol], length 0
14:24:46.778088 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [S.E], seq 1390193927, ack 276182109, win 65160, options [mss 1460,sackOK,TS val 1827830359 ecr 936789772,nop,wscale 7], length 0
14:24:46.781656 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.], ack 1, win 2059, options [nop,nop,TS val 936789775 ecr 1827830359], length 0
14:24:46.782139 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [P.], seq 1:334, ack 1, win 2059, options [nop,nop,TS val 936789775 ecr 1827830359], length 333
14:24:46.782155 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.E], ack 334, win 507, options [nop,nop,TS val 1827830363 ecr 936789775], length 0
14:24:46.784860 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [P.E], seq 1:3168, ack 334, win 507, options [nop,nop,TS val 1827830365 ecr 936789775], length 3167
...
14:24:50.112764 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.EW], seq 3071982:3073430, ack 615, win 506, options [nop,nop,TS val 1827833693 ecr 936793106], length 1448
14:24:50.112782 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.E], ack 3071982, win 2004, options [nop,nop,TS val 936793106 ecr 1827833691], length 0
14:24:50.112793 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.E], seq 3073430:3074878, ack 615, win 506, options [nop,nop,TS val 1827833693 ecr 936793106], length 1448
14:24:50.115426 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.E], ack 3073430, win 2026, options [nop,nop,TS val 936793109 ecr 1827833693], length 0
The E and W in the initial packet refer to
the ECN-Echo (ECE) and Congestion Window Reduced
(CWR) bits. In the handshake, these flags are used to negotiate
whether both ends support the feature.
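For reference, the letters tcpdump prints map onto bits of the TCP flags byte (offset 13 in the TCP header). The sketch below decodes that byte the way tcpdump displays it and checks the handshake rule from RFC 3168: the client offers ECN with ECE and CWR set on the SYN, and the server accepts with only ECE set on the SYN-ACK. The function names are made up for this illustration.

```python
ECE = 0x40  # ECN-Echo, shown as "E"
CWR = 0x80  # Congestion Window Reduced, shown as "W"

# Bit values of TCP header byte 13, in the order tcpdump prints them.
TCP_FLAGS = (
    (0x01, "F"), (0x02, "S"), (0x04, "R"), (0x08, "P"),
    (0x10, "."), (0x20, "U"), (ECE, "E"), (CWR, "W"),
)


def tcpdump_flags(byte13):
    """Render a TCP flags byte like tcpdump does, e.g. 0xC2 -> 'SEW'."""
    return "".join(char for bit, char in TCP_FLAGS if byte13 & bit)


def ecn_negotiated(syn_flags, synack_flags):
    """True if the SYN offered ECN and the SYN-ACK accepted it."""
    client_offers = (syn_flags & (ECE | CWR)) == (ECE | CWR)
    server_accepts = (synack_flags & (ECE | CWR)) == ECE
    return client_offers and server_accepts


if __name__ == "__main__":
    print(tcpdump_flags(0xC2), tcpdump_flags(0x52), ecn_negotiated(0xC2, 0x52))
```

A pcap filter like 'tcp[13] & 64 != 0' selects packets with the ECE bit (0x40) set.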
When a connection was not using ECN, we saw this:
14:25:05.190884 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [S], seq 1706341773, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1632358482 ecr 0,sackOK,eol], length 0
14:25:05.190949 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [S.], seq 3548265341, ack 1706341774, win 65160, options [mss 1460,sackOK,TS val 1827848772 ecr 1632358482,nop,wscale 7], length 0
14:25:05.194954 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [.], ack 1, win 2059, options [nop,nop,TS val 1632358485 ecr 1827848772], length 0
14:25:05.196046 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [P.], seq 1:334, ack 1, win 2059, options [nop,nop,TS val 1632358486 ecr 1827848772], length 333
...
14:25:05.321982 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 2993200:3014920, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 21720
14:25:05.322002 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 3014920:3059808, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 44888
14:25:05.322031 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 3059808:3080080, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 20272
14:25:05.322045 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [.], ack 2884600, win 6777, options [nop,nop,TS val 1632358613 ecr 1827848897], length 0
Three things stood out here:
- The ECN connections required about 4000 packets (in total);
- the packets in those connections appeared to be limited to 1448 octets in length;
- the total download time was significantly slower.
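The packet count follows from the segment size. As a rough back-of-the-envelope check, using the highest sequence number visible in the ECN trace (3074878) as the download size, and the 1448 and 44888 byte lengths seen in the two traces:

```python
import math

DOWNLOAD_BYTES = 3074878  # highest sequence number in the ECN trace excerpt
SEGMENT_BYTES = 1448      # 1460 MSS minus 12 bytes of TCP timestamp option
SUPER_PACKET_BYTES = 44888  # largest captured length in the non-ECN trace


def packets_needed(total_bytes, bytes_per_packet):
    """Number of data packets needed to carry total_bytes."""
    return math.ceil(total_bytes / bytes_per_packet)


if __name__ == "__main__":
    print(packets_needed(DOWNLOAD_BYTES, SEGMENT_BYTES))       # 2124
    print(packets_needed(DOWNLOAD_BYTES, SUPER_PACKET_BYTES))  # 69
```

About 2124 data segments, each acknowledged more or less individually by the client as the trace excerpt suggests, lands in the neighbourhood of the 4000 packets observed. The non-ECN capture needs far fewer packets because the server-side tcpdump sees the large coalesced chunks before they are segmented.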
Analyzing a larger bunch of traffic
Because we were sniffing live data, and everything was encrypted with TLS, it was slightly more work to get the pcaps sorted. Luckily, the hostname used for the tests was rarely accessed outside of the test loop. This hostname is sent unencrypted as part of Server Name Indication (SNI), so that was a convenient way to filter out the necessary packets.
# Collect the remote ports used
tcpdump -nnr big.pcap |
sed -e 's/.*1[.]1[.]1[.]1[.]//;s/:\? .*//' |
sort -u >remote-ports
# Split the pcaps, if they contain the hostname (in SNI)
for port in $(cat remote-ports); do
tcpdump -Annr big.pcap port $port 2>/dev/null |
grep -q THE_UNIQUE_HOSTNAME &&
tcpdump -nnr big.pcap port $port -w samples/$port.pcap 2>/dev/null
done
Now I had a bunch of pcaps, all sized 3 MiB (because the test download was about that large):
# ls -lh samples/
total 155M
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb 4 14:22 52323.pcap
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb 4 14:22 52325.pcap
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb 4 14:22 52328.pcap
...
Running some stats on these was easy:
# for fn in *.pcap; do
packets=$(tcpdump -nnr $fn 2>/dev/null | wc -l);
has_ecn=$(tcpdump -nnr $fn 'tcp[13] & 64 != 0' 2>/dev/null | grep -q '' && echo X || echo -);
lastline=$(tcpdump -nnr $fn 2>/dev/null | sed -ne '1p;$p' | wtimediff | tail -n1);
duration=$(echo $lastline | awk '{print $1}');
when=$(echo $lastline | awk '{print $2}');
echo $duration $when $fn ecn=$has_ecn packets=$packets
done | sort -k2V
That yielded the following (truncated) output:
+0.116948 14:21:59.860106 51642.pcap ecn=- packets=234
+0.135041 14:22:15.031269 51646.pcap ecn=- packets=237
...
+0.170701 14:24:16.486446 51698.pcap ecn=- packets=168
+0.229468 14:24:31.737016 51760.pcap ecn=- packets=639
+3.355901 14:24:50.133956 51939.pcap ecn=X packets=4211
+0.142125 14:25:05.333009 52002.pcap ecn=- packets=271
+0.168694 14:25:20.534368 52073.pcap ecn=- packets=357
...
+0.137638 14:29:23.340260 52303.pcap ecn=- packets=240
+0.122912 14:29:38.504557 52305.pcap ecn=- packets=266
+3.066842 14:29:56.607659 52308.pcap ecn=X packets=4055
+0.194032 14:30:11.863645 52314.pcap ecn=- packets=496
+0.128605 14:30:27.062803 52320.pcap ecn=- packets=229
Like clockwork, for every 20 opened connections — or once every 5 minutes — one connection was tried with ECN. That one would last for 3 seconds or more, and require about 4000 IP packets instead of the usual 100-500.
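The 'tcp[13] & 64 != 0' filter in the stats loop looks at byte 13 of the TCP header, which holds the flag bits; 64 (0x40) is the ECE bit. As a rough illustration of what that byte contains, the sketch below decodes a flags value into names. decode_tcp_flags is a made-up helper, not part of tcpdump, and the three sample values are the standard flag combinations for an ECN-setup SYN, an ECN-setup SYN-ACK and a plain SYN-ACK.

```shell
# Print the names of the TCP flags set in a flags byte (tcp[13]).
decode_tcp_flags() {
    byte=$1
    out=""
    for pair in 1:FIN 2:SYN 4:RST 8:PSH 16:ACK 32:URG 64:ECE 128:CWR; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( byte & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "$out"
}

decode_tcp_flags 194   # 0xC2: SYN with ECE and CWR, an ECN-setup SYN
decode_tcp_flags 82    # 0x52: SYN-ACK with ECE, an ECN-setup SYN-ACK
decode_tcp_flags 18    # 0x12: plain SYN-ACK, no ECN
```

Because the ECN-setup SYN already carries ECE, the filter flags a capture as soon as the client merely offers ECN.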
In fact, once we checked sysctl values configured on their systems, this was totally explainable. The connection count was not relevant, but the 5 minutes were:
# Max ECN setup percentage
net.inet.tcp.ecn_setup_percentage = 100
# Initial minutes to wait before re-trying ECN
net.inet.tcp.ecn_timeout = 5
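The "20 connections" and the "5 minutes" can be tied together from the capture timestamps alone. The sketch below is only a sanity check: to_seconds and gap_seconds are hypothetical helpers, they truncate to whole seconds, and they expect the two-digit HH:MM:SS fields that tcpdump prints.

```shell
# Convert HH:MM:SS[.fraction] to whole seconds since midnight.
to_seconds() {
    hms=${1%%.*}
    h=${hms%%:*}
    rest=${hms#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values like 08 are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Whole seconds between two timestamps on the same day.
gap_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

gap_seconds 14:29:23.340260 14:29:38.504557   # two consecutive test downloads
gap_seconds 14:24:50.133956 14:29:56.607659   # the two ECN attempts
```

The test loop fires roughly every 15 seconds and the ECN attempts are 306 seconds apart: the 300-second ecn_timeout plus the wait for the next download. 300 / 15 gives the 20 connections.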
Rather than enabling ECN globally for every flow — which could lead to connectivity issues on broken middleboxes — macOS Tahoe attempts to negotiate ECN on a per-connection basis, favoring it for a subset of outgoing flows to "test the waters."
"Middleboxes" are the firewalls, load balancers, and routers that sit between the client and server. Historically, some of these devices were configured to drop any packet with "unknown" or "reserved" bits set in the IP header — which included ECN. If a device simply discards ECN-marked packets, the connection hangs or times out.
Now that the cause for the flakiness was explained, on to the real problem: why did enabling ECN cause slowness?
Hardware offloading and the 1448 cap
When ECN was negotiated, we observed that the server stopped sending large, coalesced segments. Usually, when we run a tcpdump locally on the server, we see "super-packets" — massive chunks of data (often 20KB to 60KB) being pushed toward the NIC. This is the result of Generic Segmentation Offload (GSO) or TCP Segmentation Offload (TSO).
Checking the NIC capabilities with ethtool:
# ethtool -k enp2s0np0 | grep segmentation
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
The hardware supports standard offloading, but not for ECN.
When ECN is active, the tx-tcp-ecn-segmentation: off [fixed]
flag tells the Linux kernel: "I don't know how to handle ECN bits in
giant chunks, you'll have to segment them yourself."
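When checking several machines, it helps to pull out just that one feature. A minimal sketch, assuming the usual "name: state" layout of ethtool -k output; feature_state is a hypothetical helper name and the sample line is copied from the output above.

```shell
# Print the state of one offload feature from `ethtool -k` output on stdin.
feature_state() {
    awk -v feat="$1" -F': ' '
        { name = $1; sub(/^[ \t]+/, "", name) }   # drop leading indentation
        name == feat { print $2 }
    '
}

# Sample line copied from the ethtool output above.
printf 'tx-tcp-ecn-segmentation: off [fixed]\n' |
    feature_state tx-tcp-ecn-segmentation
```

On a live system this would be fed with ethtool -k enp2s0np0 instead of printf. The [fixed] suffix means the driver does not allow the setting to be changed with ethtool -K.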
In theory, the kernel should still use GSO to handle this in software, keeping the packets large until the very last moment (after tcpdump has seen them). But in practice, our captures showed the server was sending individual 1448-byte segments (MTU minus headers).
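The 1448 figure follows from simple header arithmetic. The sketch below assumes a 1500-byte MTU, a 20-byte IPv4 header, a 20-byte TCP header and 12 bytes of TCP timestamp option; those values are assumptions about this path, not something read from the captures. The function names are made up for illustration.

```shell
# Payload bytes per segment: MTU minus IP header, TCP header and TCP options.
segment_payload() {
    echo $(( $1 - $2 - $3 - $4 ))
}

# Number of segments needed for a transfer of $1 bytes at $2 bytes each,
# rounded up.
segments_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

payload=$(segment_payload 1500 20 20 12)
echo "payload per segment: $payload"
echo "segments for 3 MiB: $(segments_needed $(( 3 * 1024 * 1024 )) "$payload")"
```

That comes to 1448 bytes per segment and 2173 data segments for a 3 MiB download, before counting ACKs in the other direction, which is in the same ballpark as the roughly 4000 packets seen in the slow captures.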
Disabling TSO and GSO using ethtool -K iface gso off tso off did not change any symptoms:
- normal traffic was still fast and showed up as large packets in tcpdump,
- ECN traffic was still slow and capped at 1448 octets.
Firmware upgrades
Since we had some issues with offloading and VLANs in 2025, I thought it was worth a try to check for updates.
It probably was, because 14.17.2020 is from before 2017, so almost 10 years old now.
# ./mlxup
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX4LX
Part Number: Super_Micro_AOC-C25G-m1S
Description: ConnectX-4 Lx EN adapter card; 25GbE single-port SFP28; PCIe3.0 x8
PSID: SM_2011000xxxxxx
PCI Device Name: 0000:02:00.0
Base MAC: 0cc47axxxxxx
Versions: Current Available
FW 14.17.2020 N/A
PXE 3.4.0903 N/A
UEFI 14.11.0028 N/A
Status: No matching image found
Those firmware versions look surprisingly similar to the official
Mellanox firmware versions. There the latest version for a
ConnectX4LX device is 14.32.1908. But for this
SuperMicro specific device — with a Micro-LP form factor
— no updates could be found.
On the SuperMicro site, there is a "web download" (wdl) link to an empty directory for the Super_Micro_AOC-C25G-I2S. But for the Super_Micro_AOC-C25G-m1S there's not even that.
Getting a new firmware for this device was not going to be the easy route.
Disabling ECN on the server
We could disable ECN on the server side, using the net.ipv4.tcp_ecn=0 sysctl.
Now every connection initiated by the macOS client (after the ecn_timeout had passed) tried ECN but it was ignored by the server.
This did solve the problem for them.
While this was a solution to the problem, it would be a regression in network tech, so we're not doing that.
Changing congestion control algorithm
After we had exhausted all other options, Google Gemini suggested tcp_congestion_control=bbr.
By default this is cubic on a generic Ubuntu or Debian system.
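Before switching, it is worth checking that the kernel actually offers bbr. A minimal sketch: the procfs path is the standard Linux location, cc_available is a hypothetical helper, and the live check is skipped when the file is not readable (for example on a non-Linux machine).

```shell
# Succeeds if algorithm $1 appears in the space-separated list $2.
cc_available() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

avail_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail_file" ]; then
    if cc_available bbr "$(cat "$avail_file")"; then
        echo "bbr is available"
    else
        echo "bbr is not listed; try: modprobe tcp_bbr"
    fi
fi
```

Note that sysctl -w does not survive a reboot; a persistent change would need an entry under /etc/sysctl.d/ as well.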
This is not a setting I touch generally, but it sounded safe enough to try. And it did exactly what we wanted. The packet summaries now looked like this:
...
+0.100778 14:53:34.630292 52764.pcap ecn=- packets=154
+0.126388 14:53:49.789472 52781.pcap ecn=- packets=308
+0.114749 14:54:04.954907 52786.pcap ecn=X packets=388
+0.105739 14:54:20.126535 52789.pcap ecn=- packets=168
+0.109139 14:54:35.287600 52792.pcap ecn=- packets=159
...
+0.147384 14:58:37.988716 52866.pcap ecn=- packets=166
+0.112264 14:58:53.137022 52881.pcap ecn=- packets=174
+0.107939 14:59:08.281076 52890.pcap ecn=X packets=417
+0.105624 14:59:23.453946 52900.pcap ecn=- packets=161
+0.154969 14:59:38.667600 52901.pcap ecn=- packets=218
...
A single download with ECN was now possible with fewer than 500 packets. The packet captures with and without ECN now looked roughly the same.
A few notes:
- The ECN flows consistently needed more than 350 packets, whereas the non-ECN flows could go as low as 150.
- The macOS client still only tried ECN once every 5 minutes, even though it wasn't slower than its non-ECN counterpart. I speculate that it only flags endpoints for consistent ECN if flow improves with it.
- It appears that hardware offloading had no real effect here. In fact, one could even speculate that offloaded ECN flows might suffer from a poor default congestion control as well. Maybe we'll need to disable hardware ECN offloading at some point when the NIC does support it.
The root cause
The root cause of the throughput collapse was the way CUBIC interprets network signals. In classic congestion control, an ECN-Echo (ECE) mark is treated as a "hard" congestion signal, identical to a dropped packet.
Our servers were caught in a "death spiral":
- Because the NIC could not offload ECN-marked traffic (TSO off), the kernel had to process all packets in software.
- Every individual packet marked with Congestion Experienced (CE) by a router surfaced immediately in the TCP stack as a distinct ECN-Echo signal.
- CUBIC reacts to these marks by aggressively reducing its Congestion Window (cwnd), treating each CE mark as a loss-equivalent congestion signal.
- Generic Segmentation Offload (GSO) could still have coalesced packets, but the constant "braking" from the shrinking cwnd prevented sufficient batching, resulting in a stream of MTU-sized (1448-byte) drips instead of high-speed bursts.
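To get a feel for how quickly repeated reductions starve the batching, here is a toy model. It is not the kernel's logic: it assumes a multiplicative decrease factor of 0.7 (the value CUBIC is specified with), a starting window of 100 segments and a floor of 2 segments, and it ignores any window growth between reductions. cwnd_after is a made-up name.

```shell
# Apply a multiplicative decrease of 0.7 to a window (in segments), n times.
cwnd_after() {
    cwnd=$1
    n=$2
    while [ "$n" -gt 0 ]; do
        cwnd=$(( cwnd * 7 / 10 ))
        [ "$cwnd" -lt 2 ] && cwnd=2   # assumed floor for this toy model
        n=$(( n - 1 ))
    done
    echo "$cwnd"
}

for marks in 0 1 3 6 10; do
    echo "after $marks reductions: cwnd=$(cwnd_after 100 "$marks") segments"
done
```

Under these assumptions six reductions in a row take the window from 100 segments down to 11, and with only a handful of segments allowed in flight there is little left to coalesce into a large GSO packet.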
BBR congestion control as the fix
Switching to BBR (Bottleneck Bandwidth and Round-trip propagation time) resolved the issue by changing the server's reaction to those ECN marks. Unlike CUBIC, BBR is model-based. It ignores the "panic" of individual ECN marks and focuses instead on the actual measured delivery rate and RTT of the path.
By maintaining a high-speed flow based on the actual capacity of the "pipe," BBR allowed the Linux kernel to effectively use GSO. Even though the NIC hardware still couldn't do the segmentation, the CPU was now able to bundle data into large "super-packets" before handing them to the driver.
This explains the dramatic drop from 4000+ packets to under 500: BBR restored the flow efficiency that CUBIC had traded away for caution.
Conclusion
If you run into slow downloads and see ECN in the mix, look at these options:
- sysctl -w net.inet.tcp.ecn_initiate_out=0 on a macOS client (or the Linux equivalent);
- sysctl -w net.ipv4.tcp_ecn=0 on the server (to confirm the hypothesis);
- ethtool -K iface tx-tcp-ecn-segmentation on on the server (if supported by your hardware);
- sysctl -w net.ipv4.tcp_congestion_control=bbr on the server (as the robust, modern fix). You may need to modprobe tcp_bbr.
One question remains: why doesn't the macOS client settle on ECN once it finds that the ECN flow works just as well as the non-ECN flow?