Notes to self (wjd.nu)

Notes to self, 2026


2026-05-24 - containerd / kubernetes / open file limits

After upgrading a customer's Kubernetes cluster running on Ubuntu/Jammy, we ran into a snag. The vernemq instances wouldn't start completely; instead, they gave us EMFILE errors: Too many open files. This came as a surprise. After all, we hadn't touched any limits.

But, it turned out containerd v1.8 changed the LimitNOFILE setting from a very permissive infinity to the systemd default. The result? Processes inside Kubernetes would get a measly max 1024 file descriptors by default.

containerd upgrades in Ubuntu

Normally Ubuntu is super conservative when migrating versions: once a stable LTS is released, you need to move heaven and earth to get a patch in. This time however, they bumped the containerd package from 1.7 to 2.2 within Jammy (22.04) and Noble (24.04) — in fact, Jammy has even seen two version bumps, as it started with version 1.5.

For such a big version jump, you might expect there to be impacting changes — Canonical did not, or at least didn't see any impact.

`LimitNOFILE` and systemd

For systemd-started processes, systemd sets the nofile ulimit to the value set in the service file for the specific application. For example:

[Service]
# infinity: value of /proc/sys/fs/nr_open for both soft and hard
LimitNOFILE=infinity

Or the default:

[Service]
# 1024 soft, 524288 hard
LimitNOFILE=1024:524288

The systemd exec manual recommends leaving the soft limit at 1024, because old programs that still use select(2) cannot handle file descriptors numbered 1024 or higher.
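To make the notation concrete, here is a minimal Python sketch of how a LimitNOFILE= value maps to a soft/hard pair. It follows the semantics described above (a single value sets both limits, soft:hard sets them separately, infinity resolves to fs.nr_open). The function name parse_limit_nofile and the nr_open default of 1048576 are assumptions for illustration; this is not systemd's actual parser.

```python
def parse_limit_nofile(value, nr_open=1048576):
    """Return (soft, hard) for a LimitNOFILE= value.

    nr_open stands in for /proc/sys/fs/nr_open; 1048576 is an assumed default.
    """
    def one(part):
        part = part.strip()
        # "infinity" is not really unlimited for nofile: it means nr_open.
        return nr_open if part == "infinity" else int(part)

    soft, sep, hard = value.partition(":")
    if not sep:
        # A single value sets both the soft and the hard limit.
        hard = soft
    soft, hard = one(soft), one(hard)
    if soft > hard:
        raise ValueError("soft limit exceeds hard limit")
    return soft, hard
```

With these assumptions, "infinity" and "1048576" yield the same pair, while "1024:524288" yields a soft limit of 1024.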

containerd ulimit history

At containerd they have had a hard time deciding what an appropriate value is:

Year	Ver	Limit	commit	PR
2017	v1.0	`LimitNOFILE=1048576`	b009642	#1846
2018	v1.2	`LimitNOFILE=infinity`	4972e3f	#2601
2019	v1.3	`LimitNOFILE=1048576`	1a1f8f1	#3202
2020	v1.5	`LimitNOFILE=infinity`	c691c36	#4475
2023	v1.8	`# unset`	3ca39ef	#8924

The maximum number of open file descriptors available to a process has been alternating between 1048576:1048576 and infinity, which has in fact meant the same (except for pre-2019 systemd when infinity meant 65536:65536). Depending on which systemd version you were using, and which containerd defaults, you got between 65536 and 1048576 as the soft and hard limit.

But now, after 2023, since containerd 1.8 and higher, you get a default of 1024 and 524288 for soft and hard limits respectively. The earlier adjustments had no practical effect, but this last change did. 65536 is in many cases enough for everyone, but 1024 definitely isn't.

Impact on workload

If your containerd instances are starting your Kubernetes containers you may suddenly notice that you're running out of file descriptors. Example:

!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
Exec:  /vernemq/bin/../erts-11.1.8/bin/erlexec -boot /vernemq/bin/../releases/1.13.0/vernemq
...
11:27:23.787 [error] File operation error: emfile. Target: /vernemq/bin/../lib/mongodb-3.4.4/ebin/vmq_ql_query.beam. Function: get_file. Process: code_server.

For some processes, like this old vernemq here, this is fatal. For other processes, this results in degraded performance when cache files or database tables have to be closed more quickly than strictly necessary.

The systemd default value of 1024 fixes a real problem — but only for very old software. For Kubernetes it creates a problem to which few applications have an answer.

Solutions

The best solution is for every application to assess its file descriptor needs beforehand, and raise its soft limit to an appropriate value. However, not all applications actually do this.

Programmatically, one has to call setrlimit(RLIMIT_NOFILE, ...). Or, you could start your application from a shell and call ulimit first. For instance, for this old vernemq statefulset I had to extract the ENTRYPOINT from the image and then manually set this in the container spec:

command: ["/bin/sh"]
args: ["-c", "ulimit -n 131072; exec /usr/sbin/start_vernemq"]

(Setting ulimit -n like that would typically be done from a Dockerfile entrypoint shell script because Kubernetes does not provide any means to set it from a spec.)
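For applications that can be changed, raising the soft limit at startup takes only a few lines. A minimal Python sketch, assuming the standard resource module on Linux; the helper name raise_nofile and the 131072 target are illustrative choices, not taken from any particular application:

```python
import resource


def raise_nofile(wanted=131072):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, never above the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # nothing to raise
    # An unprivileged process may raise its soft limit up to the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft >= target:
        return soft, hard  # already high enough; never lower it
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The hard limit is left untouched, so with the new 1024:524288 default this would end up at 131072:524288.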

Alternatively, we could set LimitNOFILE=infinity on the containerd daemon ourselves via a systemd drop-in, or at least raise it above 1024.

The question: do we change every application, or do we change the defaults? And to what?

Checking current workload

Before deploying containerd 2.2 everywhere, we'll have to decide what to do. And because we have more than one Kubernetes cluster to examine for this particular issue, we'll use a script to get the details: find_ulimit_nofile.py (view)

The script checks running processes, counts open file descriptors and reports the processes if (a) the soft limit appears unchanged, and (b) the open file descriptor count is getting close to or over 1024.

When run on a few containerd 1.7 nodes, we get output like:

# python3 ./find_ulimit_nofile.py
    PID    FDs        SOFT  COMM            EXE
 867893  16831     1048576  beam.smp        /vernemq/erts-11.1.8/bin/beam.smp
1843653   3064     1048576  argocd-applicat /usr/local/bin/argocd
4080120   2920     1048576  java            /usr/lib/jvm/java-17-openjdk-17.0.11.0.9-2.el8.x86_64/bin/java
   2391   1881     1048576  containerd      /usr/bin/containerd
2610690   1157     1048576  mysqld          /opt/bitnami/mariadb/sbin/mariadbd
2703298   1078     1048576  redis-server    /usr/local/bin/redis-server
1046409    990     1048576  postgres        /usr/lib/postgresql/16/bin/postgres
1046484    989     1048576  postgres        /usr/lib/postgresql/16/bin/postgres
1771834    972     1048576  dd-ipc-helper   /memfd:spawn_worker_trampoline (deleted)
1771390    834     1048576  dd-ipc-helper   /memfd:spawn_worker_trampoline (deleted)
1975742    777     1048576  nginx           /usr/local/nginx/sbin/nginx
1906067    777     1048576  nginx           /usr/local/nginx/sbin/nginx
 767917    771     1048576  java            /usr/share/elasticsearch/jdk/bin/java

This script output shows the LimitNOFILE=infinity situation, before moving to the heavily capped LimitNOFILE=1024:524288 situation. For each of these processes, we can check whether it is able to raise its own limit.
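The check itself needs nothing beyond /proc. The following is a rough Python sketch of the idea (count the entries in /proc/PID/fd and read the soft limit from /proc/PID/limits). It is not the find_ulimit_nofile.py script mentioned above; the function names and the 768 threshold are made up for illustration.

```python
import os


def soft_nofile(pid):
    """Return the soft 'Max open files' limit from /proc/PID/limits, or None."""
    with open(f"/proc/{pid}/limits") as fp:
        for line in fp:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def open_fds(pid):
    """Count the open file descriptors of a process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def busy_processes(threshold=768):
    """Yield (pid, fds, soft) for processes with at least `threshold` open FDs."""
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            pid = int(name)
            fds, soft = open_fds(pid), soft_nofile(pid)
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
        if fds >= threshold:
            yield pid, fds, soft
```

Iterating over busy_processes() as root lists the candidates on a node; as an unprivileged user, only your own processes are visible.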

As one of the few examples, MariaDB actually does set RLIMIT_NOFILE dynamically at startup, according to the source. Postgres does not, but does read the value and dynamically limits how many tables it can keep open. Containerd sets its own limit (hard = soft), but does not propagate it to its children.

Those two nginx ingress controllers? Yes, they set it, provided worker_rlimit_nofile is configured, which it normally is.

Elasticsearch? I doubt it sets anything from Java, but maybe an entrypoint.sh script does. That other Java process? Kafka this time. No idea. After investigating a bunch of different processes you can probably tell my enthusiasm is wearing off.

Conclusions

I think we can draw two conclusions here.

One: trying to figure out whether every workload sets its own soft open file limit is a losing battle. Raising it for individual applications only makes sense if you run a well-defined, limited set of applications.

Two: seeing that the maximum number of open file descriptors doesn't exceed 17k (16831 for vernemq above) and usually not even 4k, we can instead raise the default limits to something generous but not ridiculous: 65536 or 131072.

The choice is a trade-off:

We accept that old-style select(2)-using applications might not work: we can fix them with a custom entrypoint ulimit if needed.
We could set the soft limit to the hard limit and give everyone 512K or 1024K limits. But keeping them slightly more modest gives us earlier detection of file descriptor leaks (in broken applications) and better behaviour if the process tries to close all possible file descriptors. (Examples of that and further reading at rsyslog doing 100% CPU trying to close a billion file descriptors, CPython adding close_range(2) support and a containerd summary of the rationale to remove LimitNOFILE=infinity.)

We'll settle for this:

[Service]
LimitNOFILE=131072:524288

2026-05-08 - linux / local root exploit / module vetting

Recently, we were greeted with the Copy Fail Linux kernel vulnerability. Mitigating this was a matter of denylisting a module. But, only eight days later, there was another exploit, also (ab)using AF_ALG and kernel module autoloading. I'm betting this is not the last, now that the kernel is scrutinized using AI models that keep getting more advanced.

Luckily, we had our machine inventory up to date. So when CVE-2026-31431 ("Copy Fail") came along, deploying a mitigation was a matter of:

Creating /etc/modprobe.d/cve-2026-31431.conf everywhere, with:
```
install algif_aead /bin/false
```
checking our loaded module inventory (the os.kernel GoCollect collector collects this for us) to see if af_alg, algif_aead or authencesn was already loaded anywhere;

and lastly, testing that the specific exploit is now mitigated:

$ python -c 'from socket import *;s=socket(AF_ALG,SOCK_SEQPACKET);s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));print("metsys elbarenluv a si siht ,tihs"[::-1])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory

An error is good: it means the exploit won't work.

Locking down autoloading

I remembered we had been discussing locking down the kernels further, and specifically locking down the loading of (normally) unused modules. Because we expect more bugs to be found in other modules, we'd rather stay ahead of the game and reject them beforehand.

Right now, there seem to be two ways to handle auto-loading:

Disabling all module loading, explicit and implicit, using the kernel.modules_disabled sysctl;
Allowing all module loading, including implicit module loading by unprivileged users — any user calling e.g. socket(AF_ALG) can get certain modules loaded into the kernel.

Obviously, disabling all unneeded modules or disabling module loading altogether seems like the most secure fix. But, we never got around to the tedious work of figuring out which modules we actually need.

And, locking down module loading once the system is up is nice. But do you really know when it is fully up? Maybe your Ceph daemonsets inside your Kubernetes cluster hadn't started yet, and now you've locked down the modules before loading ceph.

Disallowing at least non-root users from (implicit) module loading sounds like a useful mitigation, but the kernel does not support any modules_autoload_mode. Apparently Linus decided against it. And maybe it is too hard to reason about these permissions when there are also namespaces at play.

So, is there another middle ground?

Module vetting

Can we allowlist modules without loading them beforehand?

Yes, we can. If we put install * /bin/false in /etc/modprobe.d/zz-denylist.conf, that gets loaded last and rejects anything that is not previously allowed.

Allowlisting modules is then a matter of adding many, many lines of this:

install foo /sbin/modprobe --ignore-install foo
install bar /sbin/modprobe --ignore-install bar
install baz /sbin/modprobe --ignore-install baz

Make sure they are loaded earlier, by using a lexicographically earlier filename, like /etc/modprobe.d/00-allowlist.conf.
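Writing those lines by hand gets old quickly. A small Python sketch that generates them from a list of module names, for example the first column of lsmod output; the function name allowlist_lines is hypothetical, and the output format is simply the install line shown above:

```python
def allowlist_lines(modules):
    """Return sorted modprobe.d 'install' lines that allow the given modules."""
    # Normalize to the underscore spelling that lsmod uses, and deduplicate.
    names = {m.strip().replace("-", "_") for m in modules}
    return [
        f"install {name} /sbin/modprobe --ignore-install {name}"
        for name in sorted(names)
        if name
    ]
```

Joining the result with newlines gives the contents for a file like /etc/modprobe.d/00-allowlist.conf.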

The hard part

The hard part is knowing which modules we need. As mentioned above, we get os.kernel loaded-module info from GoCollect, so we have a good idea which modules we probably need.

Figuring out which modules we need is a tedious task, but if we simply look at the currently loaded modules on our fleet, we see that there are fewer than 600 distinct modules loaded across all machines, of differing types. In the most pessimistic scenario, a single machine would still only use 10% of the total modules available. So allowing those while denying the rest greatly reduces the number of modules available to attack.

Assuming we now covered which modules we need, can we make it smarter?

`kernel.modprobe`

Yes, instead of hardcoding the list in configuration files, we can put them in a script. By using the kernel.modprobe sysctl setting, we can create a wrapper that does the vetting for our allowlist.

This wrapper script denies auto-load of certain modules: it does not disable insmod or (explicit) modprobe directly. This way it exactly targets the nonprivileged users we're trying to block, while still allowing the admin to load additional modules by hand if needed.

When the kernel tries to auto-load a module, it doesn't necessarily call /sbin/modprobe. It calls the executable in the kernel.modprobe sysctl — which we override as /usr/local/sbin/vetted-modprobe. That script gets called with arguments -q -- some_module and it can decide whether to honour the request or not.

Note that the kernel calls the script. You cannot decide which process or user gets permissions, but you can choose which module is allowed.

`/usr/local/sbin/vetted-modprobe`

Instead of doing many lines in /etc/modprobe.d/00-allowlist.conf, we create a /usr/local/sbin/vetted-modprobe wrapper:

#!/bin/sh
# Requires: sysctl kernel.modprobe=/usr/local/sbin/vetted-modprobe
# See: https://www.osso.nl/blog/2026/linux-local-root-exploit-module-vetting/
set -u

log() {
    if test -t 2; then echo "$0: $*" >&2; fi
    logger -t vetted-modprobe -p auth.notice "$*"
}

# We assume we're called as "-q -- MODULE_LIST" (from the kernel).
# Some other services might omit "-q --".
# Process them one by one.
if test $# -ge 3 && test "$1" = '-q' && test "$2" = '--'; then
    shift; shift
fi

# This may either give us an error:
# - modprobe: FATAL: Module foobar not found in directory /lib/modules/6.8.0...
# Or one or more suggested modules to load:
# - insmod /lib/modules/6.8.0-87-generic/kernel/crypto/af_alg.ko.zst
# - insmod /lib/modules/6.8.0-87-generic/kernel/crypto/algif_aead.ko.zst
plan=$(/sbin/modprobe -n -v -- "$@" 2>&1)
ret=$?

if test $ret -ne 0; then
    log "modprobe -n failed for '$*': $plan"
    exit $ret
fi

if test -z "$plan"; then
    exit 0
fi

unvetted=$(printf '%s\n' "$plan" | while read -r action filename; do
    test "$action" = insmod || continue
    # Strip dirname and trailing ".ko" or ".ko.zst".
    filename=${filename##*/}; filename=${filename%.ko*}
    # OBSERVE: If we take the list from lsmod, we see filenames with only
    # underscores. But the files themselves might contain dashes. E.g.:
    # "nls_iso8859-1.ko.zst" shows up as "nls_iso8859_1" in lsmod.
    # Design choice: use the lsmod output, because it's so much easier to
    # obtain than the dash/underscore mixing filenames.
    # Normalize all dashes to underscores in pure POSIX shell:
    while test "${filename#*-}" != "$filename"; do
        filename=${filename%%-*}_${filename#*-}
    done

    case "$filename" in
        # NOTE: Any aliases have been resolved (like net-pf-38 => af_alg).
        #
        # vv-------- EXAMPLES HERE --------vv
        #
        # Some modules:
        allowed_module1|allowed_module2);;
        # More modules:
        mod_foo|mod_bar|mod_baz);;
        #
        # Explicitly _not_ allowed:
        #
        # CVE-2026-31431: "Copy Fail"
        #algif_aead) echo "$filename";;
        # CVE-2026-43284+CVE-2026-43500: "Dirty Frag"
        #esp4|esp6|rxrpc) echo "$filename";;
        #
        # ^^-------- EXAMPLES HERE --------^^
        #
        # NOTE: The unmatched (unvetted) modules are echoed.
        *) echo "$filename";;
    esac
done)

if test -n "$unvetted"; then
    log "deny 'modprobe -q -- $*'; because unvetted '$unvetted'"
    exit 1
fi

exec /sbin/modprobe -q -- "$@"

That's the gist of the script. Only auto-loading of the modules in the case statement is allowed. If you try to load an unvetted module, it gets rejected with the following log message:

$ sudo journalctl -t vetted-modprobe --facility auth
deny 'modprobe -q -- algif-skcipher'; because unvetted 'algif_skcipher'
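To test an allowlist without touching kernel.modprobe, the name handling can be replayed offline. A minimal Python sketch of the same parsing steps (take the insmod lines from modprobe -n -v output, strip the directory and the .ko suffix, normalize dashes); unvetted_modules is a hypothetical helper, not part of the script above:

```python
def unvetted_modules(plan, allowed):
    """Return module names from `modprobe -n -v` output that are not allowed."""
    rejected = []
    for line in plan.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != "insmod":
            continue
        filename = parts[1].rsplit("/", 1)[-1]  # strip dirname
        name = filename.rsplit(".ko", 1)[0]     # strip ".ko", ".ko.zst", ...
        name = name.replace("-", "_")           # match lsmod naming
        if name not in allowed:
            rejected.append(name)
    return rejected
```

An empty result means the wrapper would hand the request on to the real modprobe.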

Which modules are used?

As mentioned, the hard part is deciding which modules to allow. The script itself is easy. The list I compiled today has fewer than 600 modules in it (including modules that are not available in all kernels), so it cuts down the number of loadable modules by a big margin.

The following list goes as contents of the case statement above. You should tweak this to your liking: the allowlisted modules are matched without action. The rest gets the echo "$filename" treatment and gets rejected.

CAVEAT EMPTOR: These modules are NOT necessarily safe from exploits. But they are actively in use (in our systems), and they account for less than 10% of total modules, so we massively cut down the attack space.

        #
        # ---------------- VETTED BELOW ----------------
        #
        # NOTE: Any aliases have been resolved (like net-pf-38 => af_alg).
        # Seen everywhere:
        8250_dw|acpi_ipmi|acpi_pad|acpi_power_meter|acpi_tad|aesni_intel);;
        af_packet_diag|ahci|amd64_edac|ast|autofs4|binfmt_misc|bonding);;
        br_netfilter|bridge|btrfs|ccp|cdc_ether|cec|cfg80211|coretemp);;
        crc32_pclmul|crct10dif_pclmul|cryptd|crypto_simd|dmi_sysfs);;
        drm|drm_kms_helper|drm_ttm_helper|drm_vram_helper|edac_mce_amd);;
        ee1004|efi_pstore|failover|fb_sys_fops|floppy|ghash_clmulni_intel);;
        hid|hid_generic|i2c_algo_bit|i2c_i801|i2c_piix4|i2c_smbus);;
        ib_core|ib_uverbs|icp|idma64|ie31200_edac|inet_diag|input_leds);;
        intel_cstate|intel_lpss|intel_lpss_pci|intel_pch_thermal);;
        intel_powerclamp|intel_rapl_common|intel_rapl_msr|intel_tcc_cooling);;
        ip6_tables|ip6_udp_tunnel|ip6t_REJECT|ip6table_filter);;
        ip6table_mangle|ip6table_raw|ip_set|ip_set_hash_ip|ip_set_hash_net);;
        ip_tables|ipmi_devintf|ipmi_msghandler|ipmi_si|ipmi_ssif|ipt_REJECT);;
        ipt_rpfilter|iptable_filter|iptable_mangle|iptable_nat|iptable_raw);;
        irqbypass|joydev|k10temp|kvm|kvm_amd|kvm_intel|libahci|libcrc32c|llc);;
        mac_hid|macsec|mei|mei_me|mii);;
        mlx5_core|mlx5_dpll|mlx5_ib|mlxfw|mptcp_diag);;
        net_failover|netlink_diag|nf_conntrack|nf_conntrack_netlink);;
        nf_defrag_ipv4|nf_defrag_ipv6|nf_log_syslog|nf_nat|nf_reject_ipv4);;
        nf_reject_ipv6|nf_socket_ipv4|nf_socket_ipv6|nf_tables);;
        nf_tproxy_ipv4|nf_tproxy_ipv6|nfnetlink|nfnetlink_acct|nfnetlink_log);;
        nft_chain_nat|nft_compat|nft_counter|nls_iso8859_1);;
        nvme|nvme_auth|nvme_core|nvme_fabrics|nvme_keyring|overlay);;
        pci_hyperv_intf|pinctrl_cannonlake|polyval_clmulni|polyval_generic);;
        psample|psmouse|ptdma|raid6_pq|rapl|raw_diag|rc_core|rndis_host);;
        sch_fq_codel|serio_raw|sha1_ssse3|sha256_ssse3|spl|stp);;
        syscopyarea|sysfillrect|sysimgblt|tcp_diag|tls|ttm);;
        udp_diag|udp_tunnel|unix_diag|usbhid|usbnet|veth|video|wmi|wmi_bmof);;
        x86_pkg_temp_thermal|x_tables|xfrm_algo|xfrm_user);;
        xhci_pci|xhci_pci_renesas|xor|xsk_diag|zavl|zcommon);;
        zfs|zlua|znvpair|zunicode|zzstd);;
        # Seen on many systems (30+):
        8021q|amdgpu|amdxcp|async_memcpy|async_pq|async_raid6_recov|async_tx);;
        async_xor|blake2b_generic|bnxt_en|bochs|bpfilter|chacha_x86_64);;
        cls_bpf|cmdlinepart|curve25519_x86_64|dca|drm_buddy);;
        drm_display_helper|drm_exec|drm_suballoc_helper|dummy);;
        ebtable_filter|ebtables|garp|glue_helper|gpu_sched|igb);;
        intel_pmc_core|intel_uncore_frequency|intel_uncore_frequency_common);;
        intel_vsec|ioatdma|ip6table_nat|ip_tunnel|ipip|jc42|libceph);;
        libchacha|libchacha20poly1305|libcurve25519_generic|linear);;
        lp|lpc_ich|mrp|mtd|multipath|nbd|nfit|nft_limit|nft_log);;
        parport|pata_acpi|pcspkr|pmt_class|pmt_telemetry|poly1305_x86_64);;
        qemu_fw_cfg|raid0|raid1|raid10|raid456|rbd|sb_edac|sch_ingress);;
        sctp|skx_edac_common|softdog|spi_intel|spi_intel_pci|spi_nor|sunrpc);;
        tap|tunnel4|usbmouse|vga16fb|vgastate|vhost|vhost_iotlb|vhost_net);;
        vmgenid|vxlan|wireguard|xfs|xhci_hcd);;
        # Seen on GPU systems:
        drm_gpuvm|nvidia|nvidia_drm|nvidia_modeset|nvidia_uvm);;
        # Seen on mgmt/storage systems:
        aufs|authenc|bluetooth|bochs_drm|ceph|cpuid|crc8|ecc|ecdh_generic);;
        ftdi_sio|fscache|gnss|i40e|ice|intel_qat|irdma|isci|isst_if_common);;
        libsas|mgag200|msr|netfs|qat_c62x);;
        scsi_transport_iscsi|scsi_transport_sas|ses|usbserial|vmd);;
        iommufd|pl2303|pnd2_edac|qat_c3xxx|vfio|vfio_iommu_type1);;
        vfio_pci|vfio_pci_core);;
        # Seen on storage systems:
        cxl_acpi|cxl_core|cxl_port|dax_hmem|enclosure|iaa_crypto);;
        idxd|idxd_bus|intel_ifs|intel_sdsi|mpt3sas|pfr_telemetry|pfr_update);;
        pinctrl_emmitsburg|qat_4xxx);;
        # Seen on NAT gateways or load balancers:
        cls_matchall|cls_u32|tcp_bbr);;
        # Seen on ci-runners (why?):
        af_alg|algif_rng);;
        # Seen on older Cumulus switches (common):
        ablk_helper|accton_as7326_56x_platform|acpi_cpufreq|aes_x86_64|at24);;
        cpr4011|crc32c_intel|cumulus_platform|dm_mod|ebt_police|ebt_setclass);;
        eeprom_class|efivarfs|efivars|fuse|gf128mul|gpio_ich|hwmon);;
        i2c_core|i2c_dev|i2c_ismt|i2c_mux|i2c_mux_pca954x);;
        iTCO_vendor_support|iTCO_wdt|ixgbe|kernel_bde|knet|lm75);;
        loop|lrw|mdio|mfd_core|mpls_iptunnel|mpls_router);;
        nf_conntrack_ipv4|nf_nat_ipv4|pmbus_core|sff_8436_eeprom|shpchp|tg3);;
        tpm|tpm_tis|tun|user_bde|vrf);;
        # Seen on older Cumulus switches (rare):
        accton_as7726_32x_platform|arp_tables|arptable_filter);;
        delta_ag5648v1_platform|delta_ag9032v2_platform|dps460|emc2305);;
        gpio_pca953x|ipmi_poweroff|quanta_ix7_cpld|quanta_ix7_platform);;
        quanta_ix8_cpld|quanta_ix8_platform|quanta_ly4r_platform|thermal);;
        tpm_crb|vhwmon);;
        # Seen on IPsec:
        # NOTE: Add 'esp4' in modules-load.d explicitly for ipsec.
        echainiv|nf_conntrack_ftp|tunnel6);;
        xfrm6_tunnel|xfrm_interface|xt_policy);;
        # Seen on PVEs:
        act_police|amd_atl|bnxt_re|cls_basic|drm_panel_backlight_quirks);;
        drm_shmem_helper|ehci_hcd|ehci_pci|fwctl|i10nm_edac|isst_if_mbox_pci);;
        iscsi_tcp|isst_if_mmio);;
        libiscsi|libiscsi_tcp|mlx5_fwctl|nvme_common|raid_class|ramoops);;
        scsi_common|scsi_dh_alua|scsi_dh_emc|scsi_dh_rdac|scsi_mod);;
        sch_htb|sctp_diag|sdhci|sdhci_pci|sdhci_uhs2);;
        sg|simplefb|skx_edac|spd5118);;
        usbkbd|xt_connmark|xt_mac);;
        # Seen on older systems:
        reed_solomon|zstd_compress);;
        pstore_blk|pstore_zone);;
        # Seen on VPN:
        ovpn);;
        # iptables (heavy use)
        xt_CT|xt_LOG|xt_MASQUERADE|xt_NFLOG|xt_POLICE|xt_SETCLASS|xt_TPROXY);;
        xt_addrtype|xt_comment|xt_conntrack|xt_hashlimit|xt_length|xt_limit);;
        xt_mark|xt_multiport|xt_nat|xt_nfacct|xt_physdev|xt_recent);;
        xt_set|xt_socket|xt_state|xt_statistic|xt_tcpudp);;
        # iptables (rare)
        ip_set_bitmap_port|ip_set_hash_ipport|ip_set_hash_ipportip);;
        ip_set_hash_ipportnet);;
        xt_CHECKSUM|xt_REDIRECT|xt_hl|xt_owner|xt_string|xt_tcpmss|xt_u32);;
        # netfilter (rare)
        nf_conntrack_pptp);;  # only rs420 tunnel
        nf_log_common|nf_log_ipv4|nf_log_ipv6|nf_nat_ftp|nf_nat_ipv6);;
        nft_masq);;
        # virtio (common)
        virtio_blk|virtio_net|virtio_scsi);;
        # virtio (rare)
        virtio|virtio_balloon|virtio_pci|virtio_pci_legacy_dev);;
        virtio_pci_modern_dev|virtio_ring|virtio_rng);;
        # Other (very rare.. leftovers):
        aacraid|amd64_edac_mod|apex|ata_generic|ata_piix);;
        button|cdrom|configfs|cqhci);;
        crc16|crc32c_generic|crc64|crc64_rocksoft|crc_t10dif);;
        crct10dif_common|crct10dif_generic|dm_multipath|e1000e);;
        ebtable_nat|einj|evdev|ext4|gasket|geneve|hfs|hfsplus|hpilo);;
        ib_cm|ib_iser|intel_pmc_ssram_telemetry);;
        intel_pmt|intel_th|intel_th_gth|intel_th_pci);;
        ip6t_rt|ip_vs|ip_vs_rr|ip_vs_sh|ip_vs_wrr|iw_cm|jbd2|jfs);;
        kheaders|libata|mbcache|megaraid_sas|minix|msdos|mxm_wmi);;
        nouveau|ntfs|pmbus|pmt_discovery|qnx4|qrtr|rdma_cm|regmap_i2c);;
        rfkill|sd_mod|sfc|sha512_generic|sha512_ssse3|spi_intel_platform);;
        sr_mod|t10_pi|ts_bm|uas|ufs|uhci_hcd);;
        usb_common|usb_storage|usbcore);;
        vhost_vsock|vmw_vsock_virtio_transport_common|vmwgfx);;
        vsock|vsock_diag);;
        # Seen on desktop systems, do not include these:
        #algif_hash|algif_skcipher|amd_pmc|amd_pmf|amd_sfh|amdtee);;
        #amdxdna|asus_wmi|auth_rpcgss|bnep|btbcm|btintel|btmtk|btrtl|btusb);;
        #cdc_acm|cmac|cp210x|cros_ec|cros_ec_chardev|cros_ec_debugfs);;
        #cros_ec_dev|cros_ec_hwmon|cros_ec_lpcs|cros_ec_proto|cros_ec_sysfs);;
        #dm_crypt|eeepc_wmi|gpio_cros_ec|gpio_keys|grace|i915);;
        #led_class_multicolor|leds_cros_ec|ledtrig_audio|libarc4|lockd);;
        #mac80211|mei_hdcp|mei_pxp|mfd_aaeon|mt76|mt76_connac_lib);;
        #mt7925_common|mt7925e|mt792x_lib|nfs_acl|nfsd);;
        #parport_pc|platform_profile|ppdev|r8169|realtek|rfcomm);;
        #snd|snd_hda_codec|snd_hda_codec_alc269|snd_hda_codec_atihdmi);;
        #snd_hda_codec_generic|snd_hda_codec_hdmi|snd_hda_codec_realtek);;
        #snd_hda_codec_realtek_lib);;
        #snd_hda_core|snd_hda_intel|snd_hda_scodec_component|snd_hrtimer);;
        #snd_hwdep|snd_intel_dspcfg|snd_intel_sdw_acpi|snd_pcm|snd_rawmidi);;
        #snd_seq|snd_seq_device|snd_seq_dummy);;
        #snd_seq_midi|snd_seq_midi_event);;
        #snd_timer|soc_button_array|soundcore|sparse_keymap|tee|thunderbolt);;
        #typec|typec_ucsi|ucsi_acpi|uhid);;
        #
        # ---------------- UNVETTED BELOW ----------------
        #
        # CVE-2026-31431: "Copy Fail"
        #algif_aead) echo "$filename";;
        # CVE-2026-43284+CVE-2026-43500: "Dirty Frag"
        #esp4|esp6|rxrpc) echo "$filename";;
        *) echo "$filename";;

The full script can be downloaded from vetted-modprobe.

Don't forget to make /usr/local/sbin/vetted-modprobe executable, to put kernel.modprobe=/usr/local/sbin/vetted-modprobe in /etc/sysctl.d/92-vetted-modprobe.conf, and to apply it with sysctl -p /etc/sysctl.d/92-vetted-modprobe.conf.

And, because it is a script, you can complicate it all you want, with includes and excludes and auto-updates and whatever floats your boat. Maybe you only want to allow esp4 if ipsec is in the hostname. The possibilities are endless.

Summarizing

If you can set kernel.modules_disabled=1, then please do. If you can't, then maybe try the vetted-modprobe above.

Update 2026-05-12

The post has been edited to add the CVE numbers to Dirty Frag, and to fix a bug with the lsmod-provided values. The included modules all had underscores, but some filenames have dashes. All filenames are now normalized to underscores before checking the allowlist.

Update 2026-05-18

Additionally, the IRC nat/conntrack modules have been removed from the examples. And the vetted-modprobe now also accepts modules without -q -- as first two arguments, because some callers other than the kernel (Proxmox components?) run the configured kernel.modprobe binary themselves and omit those arguments.

2026-02-04 - macos tahoe / ecn / slow downloads

Recently, a customer reported an intermittent but frustrating issue: since upgrading to macOS 26 (Tahoe), downloads from our servers were occasionally crawling. Not always, and not for everyone. The culprit turned out to be a combination of Explicit Congestion Notification (ECN), NIC offloading limitations, and the way classic congestion control algorithms react to "well-intentioned" signals.

The macOS Tahoe ECN lottery

The first mystery was the intermittency. We noticed that some connections were fast, while others to the same server appeared throttled.

The customer had already done a ton of investigating for us — this was great:

Change: the issues started after upgrading their Macs to version 26 (Tahoe).
Toggle: they told us that disabling ECN using net.inet.tcp.ecn_initiate_out=0 restored full speed.
Debug: they set up a while loop which downloaded the same file every 15 seconds — and provided us with the timestamps when the downloads were slow.

Looking at the traffic with tcpdump revealed that Explicit Congestion Notification (ECN) was indeed the trigger for the slowness. It also revealed that ECN was not applied to all outgoing connections, only to about 5% of them.

Connections with ECN and connections without

When a connection was using ECN, it looked like this in tcpdump:

14:24:46.778055 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [SEW], seq 276182108, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 936789772 ecr 0,sackOK,eol], length 0
14:24:46.778088 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [S.E], seq 1390193927, ack 276182109, win 65160, options [mss 1460,sackOK,TS val 1827830359 ecr 936789772,nop,wscale 7], length 0
14:24:46.781656 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.], ack 1, win 2059, options [nop,nop,TS val 936789775 ecr 1827830359], length 0
14:24:46.782139 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [P.], seq 1:334, ack 1, win 2059, options [nop,nop,TS val 936789775 ecr 1827830359], length 333
14:24:46.782155 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.E], ack 334, win 507, options [nop,nop,TS val 1827830363 ecr 936789775], length 0
14:24:46.784860 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [P.E], seq 1:3168, ack 334, win 507, options [nop,nop,TS val 1827830365 ecr 936789775], length 3167
...
14:24:50.112764 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.EW], seq 3071982:3073430, ack 615, win 506, options [nop,nop,TS val 1827833693 ecr 936793106], length 1448
14:24:50.112782 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.E], ack 3071982, win 2004, options [nop,nop,TS val 936793106 ecr 1827833691], length 0
14:24:50.112793 IP 2.2.2.2.443 > 1.1.1.1.51939: Flags [.E], seq 3073430:3074878, ack 615, win 506, options [nop,nop,TS val 1827833693 ecr 936793106], length 1448
14:24:50.115426 IP 1.1.1.1.51939 > 2.2.2.2.443: Flags [.E], ack 3073430, win 2026, options [nop,nop,TS val 936793109 ecr 1827833693], length 0

The E and W in the initial packet refer to the ECN-Echo (ECE) and Congestion Window Reduced (CWR) bits. In the handshake, these flags are used to negotiate whether both ends support the feature.
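As a quick reference for reading those flag strings, here is a small Python sketch that decodes the TCP flags byte (tcp[13] in pcap filter syntax) into the same letters tcpdump prints. The bit values come from the TCP header layout; the function names are made up for this example.

```python
TCP_FLAGS = (
    (0x01, "F"),  # FIN
    (0x02, "S"),  # SYN
    (0x04, "R"),  # RST
    (0x08, "P"),  # PSH
    (0x10, "."),  # ACK
    (0x20, "U"),  # URG
    (0x40, "E"),  # ECE (ECN-Echo)
    (0x80, "W"),  # CWR (Congestion Window Reduced)
)


def flag_letters(flags):
    """Render a TCP flags byte in tcpdump-style letters."""
    return "".join(letter for bit, letter in TCP_FLAGS if flags & bit)


def is_ecn_setup_syn(flags):
    """True for an initial SYN (no ACK) that has both ECE and CWR set."""
    return (flags & 0xD2) == 0xC2
```

For example, flag_letters(0xC2) gives "SEW", the ECN-setup SYN from the trace above, and flag_letters(0x52) gives "S.E", the server's reply.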

When a connection was not using ECN, we saw this:

14:25:05.190884 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [S], seq 1706341773, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1632358482 ecr 0,sackOK,eol], length 0
14:25:05.190949 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [S.], seq 3548265341, ack 1706341774, win 65160, options [mss 1460,sackOK,TS val 1827848772 ecr 1632358482,nop,wscale 7], length 0
14:25:05.194954 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [.], ack 1, win 2059, options [nop,nop,TS val 1632358485 ecr 1827848772], length 0
14:25:05.196046 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [P.], seq 1:334, ack 1, win 2059, options [nop,nop,TS val 1632358486 ecr 1827848772], length 333
...
14:25:05.321982 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 2993200:3014920, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 21720
14:25:05.322002 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 3014920:3059808, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 44888
14:25:05.322031 IP 2.2.2.2.443 > 1.1.1.1.52002: Flags [P.], seq 3059808:3080080, ack 615, win 506, options [nop,nop,TS val 1827848903 ecr 1632358612], length 20272
14:25:05.322045 IP 1.1.1.1.52002 > 2.2.2.2.443: Flags [.], ack 2884600, win 6777, options [nop,nop,TS val 1632358613 ecr 1827848897], length 0

Three things stood out here:

The ECN connections required about 4000 packets (in total);
the packets in those connections appeared to be limited to 1448 octets in length;
the total download took significantly longer.
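The first two observations are related, which a bit of arithmetic shows. The sketch below assumes a 1500-byte MTU (consistent with the `mss 1460` in the handshake), a 12-byte TCP timestamp option on every segment, and one ACK per data segment, as the ECN excerpt above suggests. The function name `estimate_packets` is invented for this example, and the byte count is simply the highest relative sequence number visible in that excerpt.

```shell
# Estimate the packet count for a download sent as single MSS-sized segments.
# estimate_packets is a hypothetical helper for illustration only.
estimate_packets() {
    bytes=$1           # payload bytes to transfer
    mtu=${2:-1500}     # assumed Ethernet MTU
    # 20 bytes IPv4 header + 20 bytes TCP header + 12 bytes timestamp option
    payload=$(( mtu - 20 - 20 - 12 ))
    data=$(( (bytes + payload - 1) / payload ))
    # Assumption: the client sends one ACK for every data segment.
    echo "payload=$payload data=$data total=$(( data * 2 ))"
}

estimate_packets 3074878   # payload=1448 data=2124 total=4248
```

That estimate lands close to the roughly 4000 packets seen in the ECN captures.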

Analyzing a larger bunch of traffic

Because we were sniffing live data, and everything was encrypted with TLS, it was slightly more work to get the pcaps sorted. Luckily, the hostname used for the tests was rarely accessed outside of the test loop. This hostname is sent unencrypted as part of Server Name Indication (SNI), so that was a convenient way to filter out the necessary packets.

# Collect the remote ports used
tcpdump -nnr big.pcap |
  sed -e 's/.*1[.]1[.]1[.]1[.]//;s/:\? .*//' |
  sort -u >remote-ports

# Split the pcaps, if they contain the hostname (in SNI)
for port in $(cat remote-ports); do
  tcpdump -Annr big.pcap port $port 2>/dev/null |
    grep -q THE_UNIQUE_HOSTNAME &&
  tcpdump -nnr big.pcap port $port -w samples/$port.pcap 2>/dev/null
done

Now I had a bunch of pcaps, all sized 3 MiB (because the test download was about that large):

# ls -lh samples/
total 155M
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb  4 14:22 52323.pcap
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb  4 14:22 52325.pcap
-rw-r--r-- 1 tcpdump tcpdump 3.0M Feb  4 14:22 52328.pcap
...

Running some stats on these was easy:

# for fn in *.pcap; do
    packets=$(tcpdump -nnr "$fn" 2>/dev/null | wc -l)
    has_ecn=$(tcpdump -nnr "$fn" 'tcp[13] & 64 != 0' 2>/dev/null | grep -q '' && echo X || echo -)
    lastline=$(tcpdump -nnr "$fn" 2>/dev/null | sed -ne '1p;$p' | wtimediff | tail -n1)
    duration=$(echo "$lastline" | awk '{print $1}')
    when=$(echo "$lastline" | awk '{print $2}')
    echo $duration $when $fn ecn=$has_ecn packets=$packets
done | sort -k2V

That yielded the following (truncated) output:

+0.116948 14:21:59.860106 51642.pcap ecn=- packets=234
+0.135041 14:22:15.031269 51646.pcap ecn=- packets=237
...
+0.170701 14:24:16.486446 51698.pcap ecn=- packets=168
+0.229468 14:24:31.737016 51760.pcap ecn=- packets=639
+3.355901 14:24:50.133956 51939.pcap ecn=X packets=4211
+0.142125 14:25:05.333009 52002.pcap ecn=- packets=271
+0.168694 14:25:20.534368 52073.pcap ecn=- packets=357
...
+0.137638 14:29:23.340260 52303.pcap ecn=- packets=240
+0.122912 14:29:38.504557 52305.pcap ecn=- packets=266
+3.066842 14:29:56.607659 52308.pcap ecn=X packets=4055
+0.194032 14:30:11.863645 52314.pcap ecn=- packets=496
+0.128605 14:30:27.062803 52320.pcap ecn=- packets=229
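
To compare the two groups at a glance, the summary lines can be aggregated per ECN state. This is a minimal sketch with awk. The function name `summarize_ecn` is made up for illustration, and it assumes the exact field layout shown above (duration, timestamp, filename, `ecn=`, `packets=`).

```shell
# Average duration and packet count per ECN state, reading summary lines like
#   +0.116948 14:21:59.860106 51642.pcap ecn=- packets=234
# on stdin. summarize_ecn is a hypothetical helper.
summarize_ecn() {
    awk '
    {
        split($4, e, "=")          # e[2] is "X" or "-"
        split($5, p, "=")          # p[2] is the packet count
        n[e[2]]++
        dur[e[2]] += $1            # "+0.116948" converts to 0.116948
        pkt[e[2]] += p[2]
    }
    END {
        for (k in n)
            printf "ecn=%s flows=%d avg_duration=%.3f avg_packets=%.0f\n",
                k, n[k], dur[k] / n[k], pkt[k] / n[k]
    }' | LC_ALL=C sort
}
```

It could be appended to the loop above in place of (or after) the final `sort -k2V`.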

Like clockwork, one in every 20 connections (that is, once every 5 minutes) was attempted with ECN. That connection would last for 3 seconds or more, and require about 4000 IP packets instead of the usual 100-500.

In fact, once we checked sysctl values configured on their systems, this was totally explainable. The connection count was not relevant, but the 5 minutes were:

# Max ECN setup percentage
net.inet.tcp.ecn_setup_percentage = 100

# Initial minutes to wait before re-trying ECN
net.inet.tcp.ecn_timeout = 5

Rather than enabling ECN globally for every flow — which could lead to connectivity issues on broken middleboxes — macOS Tahoe attempts to negotiate ECN on a per-connection basis, favoring it for a subset of outgoing flows to "test the waters."

"Middleboxes" are the firewalls, load balancers, and routers that sit between the client and server. Historically, some of these devices were configured to drop any packet with "unknown" or "reserved" bits set in the IP header — which included ECN. If a device simply discards ECN-marked packets, the connection hangs or times out.

Now that the cause for the flakiness was explained, on to the real problem: why did enabling ECN cause slowness?

Hardware offloading and the 1448 cap

When ECN was negotiated, we observed that the server stopped sending large, coalesced segments. Usually, when we run a tcpdump locally on the server, we see "super-packets" — massive chunks of data (often 20KB to 60KB) being pushed toward the NIC. This is the result of Generic Segmentation Offload (GSO) or TCP Segmentation Offload (TSO).

Checking the NIC capabilities with ethtool:

# ethtool -k enp2s0np0 | grep segmentation
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on

The hardware supports standard offloading, but not for ECN. When ECN is active, the tx-tcp-ecn-segmentation: off [fixed] flag tells the Linux kernel: "I don't know how to handle ECN bits in giant chunks, you'll have to segment them yourself."
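
When checking several hosts, the same information can be pulled out mechanically. The sketch below reads `ethtool -k` output on stdin and classifies the `tx-tcp-ecn-segmentation` line. The function name `ecn_tso_state` and the four result words are my own choices. The only thing taken from ethtool is the output format shown above.

```shell
# Classify ECN segmentation offload from `ethtool -k IFACE` output on stdin.
# ecn_tso_state is a hypothetical helper; it only parses text.
ecn_tso_state() {
    awk -F': *' '
    $1 ~ /tx-tcp-ecn-segmentation$/ {
        state = $2
        found = 1
    }
    END {
        if (!found)               print "unknown"
        else if (state ~ /^on/)   print "enabled"
        else if (state ~ /fixed/) print "unsupported"   # off and not changeable
        else                      print "disabled"      # off, but can be turned on
    }'
}

# Usage:
#   ethtool -k enp2s0np0 | ecn_tso_state
```

For the NIC in this story it would print "unsupported".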

In theory, the kernel should still use GSO to handle this in software, keeping the packets large until the very last moment (after tcpdump has seen them). But in practice, our captures showed the server was sending individual 1448-byte segments (MTU minus headers).

Disabling TSO and GSO using ethtool -K iface gso off tso off did not change any symptoms: normal traffic was still fast and showed up as large packets in tcpdump, ECN traffic was still slow and capped at 1448 octets.

Firmware upgrades

Since we had some issues with offloading and VLANs in 2025, I thought it was worth checking for firmware updates.

It probably was: the installed firmware, 14.17.2020, dates from before 2017, so it is almost 10 years old now.

# ./mlxup
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4LX
  Part Number:      Super_Micro_AOC-C25G-m1S
  Description:      ConnectX-4 Lx EN adapter card; 25GbE single-port SFP28; PCIe3.0 x8
  PSID:             SM_2011000xxxxxx
  PCI Device Name:  0000:02:00.0
  Base MAC:         0cc47axxxxxx
  Versions:         Current        Available
     FW             14.17.2020     N/A
     PXE            3.4.0903       N/A
     UEFI           14.11.0028     N/A

  Status:           No matching image found

Those firmware versions look surprisingly similar to the official Mellanox firmware versions. There the latest version for a ConnectX4LX device is 14.32.1908. But for this SuperMicro specific device — with a Micro-LP form factor — no updates could be found.

On the SuperMicro site, there is a "web download" (wdl) link to an empty directory for the Super_Micro_AOC-C25G-I2S. But for the Super_Micro_AOC-C25G-m1S there's not even that.

Getting a new firmware for this device was not going to be the easy route.

Disabling ECN on the server

We could disable ECN on the server side, using the net.ipv4.tcp_ecn=0 sysctl. With that set, every connection initiated by the macOS client (after the ecn_timeout had passed) still tried ECN, but the server ignored the request. This did solve the problem for them.

While this was a solution to the problem, it would be a regression in network tech, so we're not doing that.
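
For reference, the classic values of net.ipv4.tcp_ecn, as described in the kernel's ip-sysctl documentation, are summarised in the sketch below. The function name `describe_tcp_ecn` and the wording of the descriptions are mine. Newer kernels may define additional values that this sketch does not know about.

```shell
# Describe the classic net.ipv4.tcp_ecn values (2 is the usual Linux default).
# describe_tcp_ecn is a hypothetical helper for illustration.
describe_tcp_ecn() {
    case "$1" in
        0) echo "never use ECN" ;;
        1) echo "request ECN on outgoing connections, accept it on incoming" ;;
        2) echo "accept ECN on incoming connections only" ;;
        *) echo "unrecognised value: $1" ;;
    esac
}

# Usage on a Linux host:
#   describe_tcp_ecn "$(cat /proc/sys/net/ipv4/tcp_ecn)"
```

With the default of 2, a server does not initiate ECN itself but accepts it when a client such as macOS asks for it, which matches what we saw in the captures.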

Changing congestion control algorithm

After we had exhausted all other options, Google Gemini suggested tcp_congestion_control=bbr. By default this is cubic on a generic Ubuntu or Debian system.

This is not a setting I touch generally, but it sounded safe enough to try. And it did exactly what we wanted. The packet summaries now looked like this:

...
+0.100778 14:53:34.630292 52764.pcap ecn=- packets=154
+0.126388 14:53:49.789472 52781.pcap ecn=- packets=308
+0.114749 14:54:04.954907 52786.pcap ecn=X packets=388
+0.105739 14:54:20.126535 52789.pcap ecn=- packets=168
+0.109139 14:54:35.287600 52792.pcap ecn=- packets=159
...
+0.147384 14:58:37.988716 52866.pcap ecn=- packets=166
+0.112264 14:58:53.137022 52881.pcap ecn=- packets=174
+0.107939 14:59:08.281076 52890.pcap ecn=X packets=417
+0.105624 14:59:23.453946 52900.pcap ecn=- packets=161
+0.154969 14:59:38.667600 52901.pcap ecn=- packets=218
...

A single download with ECN was now possible with fewer than 500 packets. The packet captures with and without ECN now looked roughly the same.

A few notes:

The ECN flows consistently needed more than 350 packets, whereas the non-ECN flows could go as low as 150.
The macOS client still only tried ECN once every 5 minutes, even though it wasn't slower than its non-ECN counterpart. I speculate that it only flags endpoints for consistent ECN if flow improves with it.
It appears that hardware offloading had no real effect here. In fact, one could even speculate that offloaded ECN flows might suffer from a poor default congestion control as well. Maybe we'll need to disable hardware ECN offloading at some point when the NIC does support it.

The root cause

The root cause of the throughput collapse was the way CUBIC interprets network signals. In classic congestion control, an ECN-Echo (ECE) mark is treated as a "hard" congestion signal, identical to a dropped packet.

Our servers were caught in a "death spiral":

Because the NIC could not offload ECN-marked traffic (TSO off), the kernel had to process all packets in software.
Every individual packet marked with Congestion Experienced (CE) by a router surfaced immediately in the TCP stack as a distinct ECN-Echo signal.
CUBIC reacts to these marks by aggressively reducing its Congestion Window (cwnd), treating each CE mark as a loss-equivalent congestion signal.
Generic Segmentation Offload (GSO) could still have coalesced packets, but the constant "braking" from the shrinking cwnd prevented sufficient batching, resulting in a stream of MTU-sized (1448-byte) drips instead of high-speed bursts.
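
To get a feel for how quickly repeated reductions shrink the window, here is a deliberately simplified model. It assumes each congestion response multiplies cwnd by 0.7 (the CUBIC multiplicative decrease factor from the RFC) and ignores both window growth between reductions and any rate limiting of reductions in the real stack. The function name `cwnd_after_reductions` is made up, and the numbers are illustrative, not measured.

```shell
# cwnd (in segments) after N back-to-back reductions by factor BETA.
# Simplified model: no growth between reductions. Hypothetical helper.
cwnd_after_reductions() {
    awk -v cwnd="$1" -v n="$2" -v beta="${3:-0.7}" 'BEGIN {
        for (i = 0; i < n; i++)
            cwnd *= beta
        printf "%.1f\n", cwnd
    }'
}

cwnd_after_reductions 100 5    # 16.8
cwnd_after_reductions 100 10   # 2.8
```

In this model, a window of 100 segments is down to fewer than 3 after ten reductions, which leaves little room for batching data into large segments.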

BBR congestion control as the fix

Switching to BBR (Bottleneck Bandwidth and Round-trip propagation time) resolved the issue by changing the server's reaction to those ECN marks. Unlike CUBIC, BBR is model-based. It ignores the "panic" of individual ECN marks and focuses instead on the actual measured delivery rate and RTT of the path.

By maintaining a high-speed flow based on the actual capacity of the "pipe," BBR allowed the Linux kernel to effectively use GSO. Even though the NIC hardware still couldn't do the segmentation, the CPU was now able to bundle data into large "super-packets" before handing them to the driver.

This explains the dramatic drop from 4000+ packets to under 500: BBR restored the flow efficiency that CUBIC had traded away for caution.

Conclusion

If you run into slow downloads and see ECN in the mix, look at these options:

sysctl -w net.inet.tcp.ecn_initiate_out=0 on a macOS client (or the Linux equivalent);
sysctl -w net.ipv4.tcp_ecn=0 on the server (to confirm the hypothesis);
ethtool -K iface tx-tcp-ecn-segmentation on on the server (if supported by your hardware);
sysctl -w net.ipv4.tcp_congestion_control=bbr on the server (as the robust, modern fix). You may need to modprobe tcp_bbr.
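
The last option can be scripted roughly as follows. This is a sketch, not a tested deployment recipe: the helper name `needs_bbr_module` is invented, and the commented usage assumes a Linux server, root privileges, and that the kernel ships the tcp_bbr module.

```shell
# Decide whether tcp_bbr still has to be loaded, given the contents of
# /proc/sys/net/ipv4/tcp_available_congestion_control as the first argument.
# needs_bbr_module is a hypothetical helper.
needs_bbr_module() {
    case " $1 " in
        *" bbr "*) return 1 ;;   # bbr is already available
        *)         return 0 ;;   # bbr is missing; load the module first
    esac
}

# Usage (as root, on the server):
#   avail=$(cat /proc/sys/net/ipv4/tcp_available_congestion_control)
#   needs_bbr_module "$avail" && modprobe tcp_bbr
#   sysctl -w net.ipv4.tcp_congestion_control=bbr
```

Note that `sysctl -w` does not survive a reboot. To make the change permanent, the setting would also have to go into the sysctl configuration of the distribution in use.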

One question remains: why doesn't the macOS client settle on ECN after it finds that ECN works as well as the non-ECN flow?
