Notes to self, 2025

2025-07-06 - loopback through leaf / single nic iperf test

When bringing up new servers, especially with new or untested network interface cards (NICs) or cables, you might want to test their raw network throughput. While standard iperf tests between two different machines are common, sometimes you want to isolate the test to a single NIC, pushing traffic through it and back, effectively looping it at a nearby switch. This verifies the physical path, the SFP, and the NIC's transmit/receive capabilities without involving another server's NIC or network stack.

The challenge, however, is that an iperf server and client running on the same host will typically use the internal loopback interface (lo) for communication, never touching the physical NIC. We need to trick the Linux network stack into routing traffic out of the specific NIC and then back in to reach the iperf server on the same machine. This is where ip rule and policy-based routing come in.

The Scenario: Looping iperf on enp1s0f0np0

Imagine you have a server with a NIC named enp1s0f0np0 that you want to test. It's connected to a layer-3 switch. We'll set up a dedicated, isolated IP address on this NIC and use ip rule to ensure traffic to that address always leaves and re-enters via enp1s0f0np0.

The Initial State: Default Routing Policy

Before we begin, a quick look at the default IP routing policy on a typical Linux host:

# ip rule show
0:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

This shows the standard lookup order: first, the local table (for local addresses), then main (your primary routing table), and finally default (a fallback). Our goal is to insert rules before local that force our specific iperf traffic to use a custom route.

Step 1: Configuring the neighbour switch

This may or may not be needed, depending on your setup. If you already have a valid address configured on your host, you might get away with using that. You will need to replace the IP-addresses in the example configurations below.

On a Cumulus switch, configuring a link-local test network with two IPs might look like:

# net add interface swp1 ip address 169.254.101.0/31
# net comm

Here, swp1 is the switch port connected to enp1s0f0np0. I chose the 169.254.0.0/16 link-local subnet because these addresses are made for exactly this purpose.

We're handing out 169.254.101.1 to the host and 169.254.101.0 to the switch. You're free to choose anything you like. You can even just re-use any currently configured IP. Make sure you replace the host and gateway IPs in the commands below though.

Step 2: Configuring the Host NIC and Routing Policy

Now for the magic on the host. We'll add the corresponding IP address to our NIC and then, crucially, define a new routing table and rules to direct traffic through it.

Add the host IP to the NIC:

# ip addr add 169.254.101.1/31 dev enp1s0f0np0

Create a custom routing table (table 169) for our looped traffic. This says: to reach 169.254.101.1, go via 169.254.101.0 (the switch's IP) out of enp1s0f0np0.

# ip route add 169.254.101.1/31 via 169.254.101.0 \
    dev enp1s0f0np0 table 169

Policy-based routing rule 1: Prioritize traffic from our test IP entering enp1s0f0np0 to be handled by the local table. This is for the return traffic.

# ip rule add prio 169 from 169.254.101.1 \
    iif enp1s0f0np0 lookup local

Policy-based routing rule 2: Prioritize traffic to our test IP to use our custom table 169. This forces outgoing traffic destined for ourselves to go via the switch.

# ip rule add prio 170 to 169.254.101.1 lookup 169

Policy-based routing rule 3: This is a critical adjustment. The default from all lookup local rule (prio 0) would catch our 169.254.101.1 address and loop it internally. We need to make sure our custom rules come before it.

The trick here is to insert a new lookup local rule with a higher priority and then delete the old prio 0 rule. This effectively re-prioritizes the local lookup:

# ip rule add prio 171 from all lookup local
# ip rule del prio 0 from all lookup local

The Result: A Tailored Routing Policy

After these changes, your routing policy (ip rule show) and tables (ip route show, ip route show table 169) should look something like this:

# ip rule show
169:  from 169.254.101.1 iif enp1s0f0np0 lookup local
170:  from all to 169.254.101.1 lookup 169
171:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default
# ip route show
default via 10.20.30.1 dev enp1s0f0np0
10.20.30.0/24 dev enp1s0f0np0 proto kernel scope link src 10.20.30.44
169.254.101.0/31 dev enp1s0f0np0 proto kernel scope link src 169.254.101.1
# ip route show table 169
169.254.101.1 via 169.254.101.0 dev enp1s0f0np0

Notice how ip rule now ensures that any traffic to 169.254.101.1 goes through lookup 169, forcing it out enp1s0f0np0 via 169.254.101.0 (the switch). When it comes back, ip rule prio 169 ensures traffic from 169.254.101.1 on interface enp1s0f0np0 is handled locally. This is key to bypass the standard loopback.

The 10.20.30.44 is just a sample IP that may or may not be configured on your host.

A quick test, comparing the link-local IP to localhost:

# ping -I 169.254.101.1 169.254.101.1
PING 169.254.101.1 (169.254.101.1) from 169.254.101.1 : 56(84) bytes of data.
64 bytes from 169.254.101.1: icmp_seq=1 ttl=63 time=0.578 ms
64 bytes from 169.254.101.1: icmp_seq=2 ttl=63 time=0.538 ms
^C
# ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.025 ms
^C

The successful ping -I 169.254.101.1 169.254.101.1 with measurable latency and ttl of 63 confirms the traffic is indeed leaving and re-entering the enp1s0f0np0 interface.

Running iperf

With the routing in place, running iperf is straightforward.

On the server, start iperf binding to our test IP:

# iperf -B 169.254.101.1 -s

On the client (the same machine!), connect to the server using the test IP:

# iperf -B 169.254.101.1 -c 169.254.101.1 -t 15 -i 5

The -B flag ensures iperf binds to our chosen IP address for its traffic, which our ip rule configuration will then intercept and route correctly. You can now observe the throughput directly through your target NIC.

This method provides a robust way to isolate and thoroughly test a single NIC's performance by forcing traffic through the physical layer, giving you confidence in your new hardware or setup.

2025-05-31 - invalid elf header magic / rook-ceph / k8s

Yesterday, we noticed some previously unseen messages in the logs: Invalid ELF header magic: != \x7fELF. Of course an ELF binary should start with a valid header. Who/what is trying to run something else?

These messages appeared on Kubernetes nodes that were recently installed and not yet fully in use for production workloads. Coincidentally, these particular nodes were running Ubuntu 24.04 (Noble Numbat), while the rest of this cluster was still on older versions.

Yes, we're already well into 2025. Upgrading from Jammy (22.04) to Noble (24.04) has been less pain-free than the upgrade from Focal (20.04) to Jammy (22.04), so we are a bit behind. If others have been slow to upgrade as well, that might explain why so many new application bugs turn up this late.

In dmesg the “Invalid ELF” message was clear and readable. In journalctl it reads [33B blob data]:

# journalctl -S '2025-05-30 11:00' --grep 'ELF'
mei 30 11:00:25 noble.example.com kernel: [33B blob data]
mei 30 11:05:30 noble.example.com kernel: [33B blob data]
mei 30 11:10:41 noble.example.com kernel: [33B blob data]

journalctl makes sure no unprintable garbage goes to the screen — in this case the \x7f. But that replacement replaces everything, including the printable ASCII.

Now that I knew what log line to look for, I checked surrounding log lines. There, in the middle, there's the kernel: [33B blob data]. Before it, containerd shows that a container is started:

kubelet[1605]: I0530 11:05:30.172379    1605 scope.go:117]
  "RemoveContainer" containerID="<id>"

containerd[1717]: time="2025-05-30T11:05:30.173674804+02:00" level=info
  msg="CreateContainer within sandbox \"<id>\" for container &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,}"

containerd[1717]: time="2025-05-30T11:05:30.182002680+02:00" level=info
  msg="CreateContainer within sandbox \"<id>\" for &ContainerMetadata{Name:csi-rbdplugin,Attempt:853,} returns container id \"<id>\""

containerd[1717]: time="2025-05-30T11:05:30.182178318+02:00" level=info
  msg="StartContainer for \"<id>\""

ctre[1581]: C 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0

systemd[1]: Started cri-containerd-<id>.scope - libcontainer container <id>.

containerd[1717]: time="2025-05-30T11:05:30.218713573+02:00" level=info
  msg="StartContainer for \"<id>\" returns successfully"

ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0

kernel: [33B blob data]

systemd[1]: cri-containerd-<id>.scope: Deactivated successfully.

containerd[1717]: time="2025-05-30T11:05:30.258058437+02:00" level=info
  msg="received exit event container_id:\"<id>\"  id:\"<id>\"  pid:1942336  exit_status:255  exited_at:{seconds:1748595930  nanos:257930298}"

ctre[1581]: t 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0

systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-<id>-rootfs.mount: Deactivated successfully.

containerd[1717]: time="2025-05-30T11:05:31.547087240+02:00" level=info
  msg="shim disconnected" id=<id> namespace=k8s.io

containerd[1717]: time="2025-05-30T11:05:31.547113810+02:00" level=warning
  msg="cleaning up after shim disconnected" id=<id> namespace=k8s.io

containerd[1717]: time="2025-05-30T11:05:31.547119691+02:00" level=info
  msg="cleaning up dead shim" namespace=k8s.io

kubelet[1605]: I0530 11:05:32.203748    1605 scope.go:117] "RemoveContainer" containerID="<id>"

ctre sidenote

There's also a ctre application log in there:

ctre[1581]: T 1748595930 k8s.io quay.io/cephcsi/cephcsi:v3.8.0

ctre is our custom tool that keeps track of the starting and stopping of images in containerd.

Docker and Kubernetes tend to accumulate lots of old images in containerd. These unused images aren't just disk space hogs; they can also degrade the performance of the underlying ZFS filesystem. We needed a smarter way to prune.

While common methods like crictl prune aggressively clear everything they can (barring currently in-use images), this isn't always ideal.

Imagine a periodic cron job relying on a large image; if it's pruned every time the job isn't running, the image has to be downloaded anew for the next execution.

ctre addresses this by recording image usage timestamps. Our nightly ctre prune then selectively keeps images likely to be reused, while purging those that haven't seen recent activity.

When looking at the logs, this log line provided an additional hint as to the problem's source.

bpftrace to find the execve

The “Invalid ELF” errors were appearing in the logs regular as clockwork, so a good place to start seemed to be to look for any execve() calls. Maybe a non-zero return value would point us right to the culprit.

bpftrace deserves a blog post of its own. One thing it can do, is do strace-like functionality, but for all processes. Instead of strace -e execve on a single application, I do this:

# bpftrace -e '
tracepoint:syscalls:sys_enter_execve
{
  printf("%s.%d [%d] execve (%s) ", strftime("%H:%M:%S", nsecs), nsecs % 1000000000, pid, comm);
  join(args->argv);
}'

tar (child): xz: Cannot exec: No such file or directory
...
11:46:40.231050035 [2826375] execve (zfs-driver) zfs list -d 1 -s name -o name,guid,available,used -H -p
11:46:40.231597246 [2826376] execve (zfs) chroot /host /sbin/zfs list -d 1 -s name -o name,guid,available,used -H -p
...

This will show all execve() calls happening on the system.

You can safely ignore that xz: Cannot exec. But if you want to get rid of it, see the tips in the bpftrace-deb README:

Another thing to note is the comm — the argument between parentheses like (zfs-driver) — which looks like the parent process. And that is not because it is the parent but because the parent forked/cloned itself before becoming the new process: when your bash shell invokes a new process, it first fork()s so there is a bash parent and a bash child. The child then uses execve() to become the new process. The bpftrace log shows the moment just before the transition. That explains that comm looks like the parent.

The log messages were generally about 5 minutes apart, so I didn't have to wait long for the next message to appear. And indeed, I didn't have to wait very long. The log listed the next error:

[vr mei 30 11:46:40 2025] Invalid ELF header magic: != \x7fELF

And around that time, these suspicious looking execve calls were made.

11:46:40.231597246 [2826376] execve (sh) modprobe rbd
11:46:40.231597246 [2826376] execve (sh) umount

csi-rbdplugin in CrashLoopBackoff

That reminds me, I did see some csi-rbdplugin rook-ceph Kubernetes pods in CrashLoopBackOff state.

That was no coincidence. All failing pods were on the new Noble nodes. The pods running on Jammy were fine.

I ran modprobe rbd manually on the node and waited for the next cycle.

11:51:44.533278002 [2826376] execve (sh) modprobe nbd

I imported the nbd module as well. Now the all the pods started successfully, and the ELF error disappeared from the logs.

Old modprobe on new system

Now everything fell into place: during initialization, the ceph pods try to load rbd and nbd modules. But on the newer systems they couldn't.

First, let's look at the modprobe version on the host system:

root@noble.example.com# modprobe --version
kmod version 31
+ZSTD +XZ -ZLIB +LIBCRYPTO -EXPERIMENTAL

Next, the modprobe version inside the quay.io/cephcsi/cephcsi:v3.8.0 image:

sh-4.4# modprobe --version
kmod version 25
+XZ +ZLIB +OPENSSL -EXPERIMENTAL

Here's the critical difference. Previously, on the older Jammy systems, the kernel modules were not compressed:

root@jammy.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/5.15.0-116-generic/kernel/drivers/block/rbd.ko

But now, on the Noble systems, they were compressed with ZSTD:

root@noble.example.com# find /lib/modules/$(uname -r) -name rbd*
/lib/modules/6.8.0-60-generic/kernel/drivers/block/rbd.ko.zst

To confirm the kmod incompatibility directly from within the container, I did a quick test:

sh-4.4# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
sh-4.4# find /lib/modules/6.8.0-60-generic/ -name ipta*
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_raw.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_security.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_filter.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_nat.ko.zst
/lib/modules/6.8.0-60-generic/kernel/net/ipv4/netfilter/iptable_mangle.ko.zst
sh-4.4# modprobe iptable_nat
modprobe: ERROR: could not insert 'iptable_nat': Exec format error
[vr mei 30 12:36:09 2025] Invalid ELF header magic: != \x7fELF

Yes, that test confirms that modprobe was the culprit. The older kmod version (25) inside the container does not cope with the ZST compressed kernel modules found on the Noble host. Manually preloading the rbd and nbd modules on the nodes was the easiest fix for now.

Technical nuance

While I was initially looking to blame an execve call for directly triggering the “Invalid ELF header magic” error, it turns out that's not strictly possible. An execve call attempting to run something that is not a valid binary will typically return ENOEXEC (8) with the message “Exec format error” to the calling process. No kernel error would be logged.

The kernel error we observed is actually only triggered internally by the kernel during a module load failure. This can be seen in the kernel source code for module loading:

/*
 * Check userspace passed ELF module against our expectations, and cache
 * useful variables for further processing as we go.
...
 */
static int elf_validity_cache_copy(struct load_info *info, int flags)
{
// ...
        if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0) {
                pr_err("Invalid ELF header magic: != %s\n", ELFMAG);
                goto no_exec;

However, tracing the execve calls was still crucial as it directly pointed me to modprobe as the culprit process. So, while the initial assumption about the direct trigger mechanism was slightly off, the debugging tactic was ultimately effective.

2025-05-30 - nginx / no realip in logs / bug

Recently we were doing a vulnerability scan on an external endpoint of a website. During this time, we noticed simultaneous suspicious activity coming from inside our internal network. What's the deal?

To make a long story short: we use a proxy (like HAProxy) in front of our webservers, and it speaks the Proxy Protocol to pass along the original client's IP address. That way, when a request hits our internal nginx server, the logs show the real client's IP, not the IP of the reverse proxy that actually made the connection.

But during our scan, some failed requests showed up in the logs with the proxy's IP address — not the external client's. This made us pause: “Wait… is someone inside our network scanning this site too?”

We figured it out quickly, but it was still unsettling. And in our view, this is a bug.

Reproducing

Step one: fetch the latest nginx.

$ git clone https://github.com/nginx/nginx
...

Configure and build it:

$ ./auto/configure --with-debug --with-http_realip_module
...
$ make -j$(nproc)
...
$ ./objs/nginx -V
nginx version: nginx/1.29.0
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
configure arguments: --with-http_realip_module

Set up a basic config in ./conf/nginx_realip_bug.conf:

worker_processes  1;
error_log stderr info;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  text/plain;

    set_real_ip_from  127.0.0.0/8;
    real_ip_header    proxy_protocol;

    access_log /dev/stderr;

    server {
        listen       8080 proxy_protocol_always;
        server_name  localhost;
        location / {
            return 200 "Hello world\n";
        }
    }
}

Run nginx in the foreground:

$ ./objs/nginx -p $PWD -c conf/nginx_realip_bug.conf -g 'daemon off;'
2025/05/29 22:59:23 [notice] 154902#0: using the "epoll" event method
2025/05/29 22:59:23 [notice] 154902#0: nginx/1.29.0
2025/05/29 22:59:23 [notice] 154902#0: built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
...

Do a simple request:

$ printf '%s\r\n' \
    'PROXY TCP4 1.2.3.4 127.127.127.127 80 8080' \
    'GET /foo HTTP/1.0' '' |
    nc 127.0.0.1 8080 -w2

HTTP/1.1 200 OK
Server: nginx/1.29.0
Date: Fri, 30 May 2025 19:36:40 GMT
Content-Type: text/plain
Content-Length: 12
Connection: close

Hello world

This could use some explanation. The GET /foo HTTP/1.0 is the oldest/most basic HTTP request, followed by two sets of CR LF. The printf '%s\r\n' provides those.

We could have used curl --path-as-is --haproxy-protocol localhost:8080/foo, instead of the printf and nc combination. But curl doesn't let us set values (like 1.2.3.4), and for this example we need them to be visibly different.

The location / in the configuration accepts everything below / — like /foo — and returns a 200 OK status code and the Hello world message.

Proxy protocol

The Proxy Protocol — originally proposed by Willy Tarreau — is a simple way to preserve information about the real client when traffic passes through one or more proxies.

Shown here is the v1 version:

PROXY TCP4 1.2.3.4 127.127.127.127 80 8080

This line tells the receiving server that the original IPv4 client (1.2.3.4) is connecting to the destination IP (127.127.127.127) — from port 80 to port 8080. A reverse proxy injects this line at the start of the connection, so the upstream (or "origin") server can see who actually made the request, even though it's technically connected to the proxy.

Yes, HTTP has X-Forwarded-For, but that header is less reliable: it depends on careful trust boundaries, can be spoofed, and is harder to manage across non-HTTP protocols.

This is all explained in detail at proxy-protocol.txt.

We use reverse proxies to protect our internal network from the big bad internet. But for the machines behind those proxies, the Proxy Protocol is how they know who's really knocking.

Real IP in logs

With the provided nginx config, the first GET /foo from above shows up in the logs as:

1.2.3.4 - - [30/May/2025:21:36:40 +0200] "GET /foo HTTP/1.0" 200 12 "-" "-"

Let's do another request, with the funky, but somewhat legal, /foo/../foo path:

$ printf '%s\r\n' \
    'PROXY TCP4 1.2.3.4 127.127.127.127 80 8080' \
    'GET /foo/../foo HTTP/1.0' '' |
    nc 127.0.0.1 8080 -w2

HTTP/1.1 200 OK
Server: nginx/1.29.0
Date: Fri, 30 May 2025 19:36:57 GMT
Content-Type: text/plain
Content-Length: 12
Connection: close

Hello world

We still get a 200 status. And this log line:

1.2.3.4 - - [30/May/2025:21:36:57 +0200] "GET /foo/../foo HTTP/1.0" 200 12 "-" "-"

So far so good. Now make it worse by trying to escape from the root path with /foo/../../foo:

$ printf '%s\r\n' \
    'PROXY TCP4 1.2.3.4 127.127.127.127 80 8080' \
    'GET /foo/../../foo HTTP/1.0' '' |
    nc 127.0.0.1 8080 -w2

HTTP/1.1 400 Bad Request
Server: nginx/1.29.0
Date: Fri, 30 May 2025 19:37:11 GMT
Content-Type: text/html
Content-Length: 157
Connection: close

<html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/1.29.0</center>
</body>
</html>

Boom! Rejected. That's okay. That's good in fact.

Proxy IP in logs

And the logs?

2025/05/30 21:37:11 [info] 175878#0: *3 client sent invalid request while reading client request line, client: 127.0.0.1, server: localhost, request: "GET /foo/../../foo HTTP/1.0"
127.0.0.1 - - [30/May/2025:21:37:11 +0200] "GET /foo/../../foo HTTP/1.0" 400 157 "-" "-"

Two log lines this time. But here's the problem: they show the proxy's IP (127.0.0.1) instead of the real client IP (1.2.3.4). Not nice.

There's a simple technical reason. nginx rejects the malformed request before any modules get a chance to run — including the realip module that rewrites IP addresses.

This isn't limited to bad HTTP requests either. TLS handshake errors and other early-stage failures can also bypass the modules, leading to the same issue: logs that show the proxy's IP instead of the real one.

But just because there's an explanation doesn't mean it's okay.

Bug filed at nginx/nginx#717.

Let's see what upstream has to say.

2025-05-29 - supermicro / ikvm / expired certificate

Supermicro computers generally work very well. At our company we've been using them for ages. One thing that does keep giving us the occasional trouble however is the BMC. Today, we'll be looking at expired certificates of the iKVM interface.

Introduction

First, some definitions, which we generally use interchangeably:

The Baseboard Management Controller (BMC) is the chipset core of the Intelligent Platform Management Interface (IPMI) implementation, which we mostly use to remotely attach a Keyboard, Video and Mouse (KVM, over IP).

iKVM — for IP-KVM or IPMI-KVM? — is the Java implementation that comes with the Supermicro BMC Firmware. The Supermicro BMC Interface (or IPMI) provides access to the KVM via either a HTML interface or via its iKVM Java application.

The use case for IPMI KVM access is when we've lost, or never had, regular ssh or vpn access to a remote server. Often you'll want more than just a keyboard; you also want to attach a virtual CD-ROM or USB-drive to get access to installation/recovery tools.

The HTML interface does not provide the capability to mount a virtual drive. The Java application does. So, we prefer working with that one. We'll be calling that iKVM, and BMC will refer to the hardware, firmware and the web interface of the Supermicro IPMI.

(Note that the Java application is dated and buggy, but with the right tools (vcutil ipmikvm and xpaste) it's usable.)

As mentioned, the Supermicro BMC and iKVM have been a recurring item in my notes to self section, all the way back to 2016:

That issue with the inaccurate BMC clock was a particularly insidious one: the clock was set in the past, causing it to reject TLS connections from iKVM because it considered the TLS certificate not yet valid. But it was hard to debug because the server side did the rejection and only provided an uninformative status code.

Today, there's another “fun” one, also related to certificates and time.

Retry =10 (Connection failed)

There was a connectivity problem on a machine and we needed get KVM access.

No biggie! Just run vcutil ipmikvm.

$ ipmikvm 10.10.10.10
attempting login on '10.10.10.10' with user ADMIN
...
+ exec java -Djava.library.path=.local/lib/ipmikvm/iKVM__V1.69.42.0x0 -cp ...

connect failed sd:16
Retry =1
Retry =2
Retry =3
Retry =4
Retry =5
Retry =6
Retry =7
Retry =8
Retry =9
Retry =10
Java popup that says 'Connection Failed'

Okay, that's not good. Generally we get about one to three retries and then it works.

We could log into the web interface after clicking past the self-signed and expired certificate warning:

Web browser warning stating: 'Your connection is not private'

In the web interface we could use the HTML interface to the KVM and the power buttons. But using iKVM would consistently fail.

Expired certificate

A few moments of debugging later, and the expired certificate was assigned blame:

$ easycert -T 10.10.10.10 443
Certificate chain
 0 s: {1B:29:EF:3D:CA:7B:26:2D:A0:AE:AA:71:39:E9:5C:91:0C:C1:28:A8} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
   i: {1B:29:EF:3D:CA:7B:26:2D:A0:AE:AA:71:39:E9:5C:91:0C:C1:28:A8} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
---
Expires in -578 days

vcutil easycert is a quick and easy wrapper around the openssl CLI to check certificate chains and expiry. As you can see, it shows that the certificate is (a) self-signed and (b) expired 578 days ago.

On the https port 443, the certificate had expired almost two years ago. Doing the same check on the main iKVM port 5900:

$ easycert -T 10.10.10.10 5900
Certificate chain
 0 s: {BF:00:C8:10:AC:B2:58:E9:E7:4A:D9:7E:06:A2:DB:79:F6:3B:8F:38} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
   i: {BF:00:C8:10:AC:B2:58:E9:E7:4A:D9:7E:06:A2:DB:79:F6:3B:8F:38} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
---
Expires in -578 days

Exactly the same. Only the X509v3 Subject Key Identifier was different. (Insert foreshadowing here.)

I tried to update the certificate in the web interface. The openssl command to create fresh certificate with the same subject and encryption algorithm:

$ openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
    -keyout ipmi-priv.pem -out ipmi-pub.pem \
    -subj '/C=US/ST=California/L=San Jose/O=Super Micro Computer/OU=Software/CN=IPMI'
...

In the BMC web interface, the Date and Time were set correctly this time. And in the Configuration > Network > SSL Certificates there's a form to upload the freshly generated certificates.

After uploading them, the web certificate was not updated yet. This required a BMC reset located through Maintenance > BMC Reset > Unit Reset. The reset took more than a minute, but afterwards, the certificate on port 443 was updated to the freshly generated one. Good.

$ easycert -T 10.10.10.10 443
Certificate chain
 0 s: {B6:52:AB:9A:A1:C4:67:9C:8F:40:FF:C3:A8:75:78:8D:85:7F:6B:22} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
   i: {B6:52:AB:9A:A1:C4:67:9C:8F:40:FF:C3:A8:75:78:8D:85:7F:6B:22} [9adfe6f5] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI
---
Expires in 3649 days

But unfortunately, the certificate on port 5900 wasn't. It still showed the same expired certificate.

I tried — in vain — to run ipmikvm again, but as expected, it failed with the same Retry =10 and Connection failed messages.

Updating firmware

Next, I did some frantic Googling and ChatGPTing to find another button/option to update the certificates, but the general concensus was that on this board, the TLS certificates were baked into the BMC firmware and could not be independently updated.

Supermicro lists on its downloads page that you shouldn't update the BMC firmware unless you really really have to:

'Please do not download / upgrade the BMC/Firmware UNLESS your system has a BMC/Firmware-related issue.
Flashing the wrong BMC/Firmware can cause irreparable damage to the system.
In no event shall Supermicro be liable for direct, indirect, special, incidental, or consequential damages arising from a BMC/Firmware update.'

But, I do want iKVM to work again. Let's find the right firmware. Start out by figuring out what hardware we're dealing with:

# for x in system-manufacturer system-product-name \
           baseboard-manufacturer baseboard-product-name
  do printf '%-24s: %s\n' "$x" "$(dmidecode -s $x)"; done

system-manufacturer     : S3S
system-product-name     : AS -1114S-WN10RT
baseboard-manufacturer  : Supermicro
baseboard-product-name  : H12SSW-NTR

Google turns up this AS-1114S-WN10RT which seems plausible, especially since it speaks of the H12SSW-NTR motherboard. It has two downloads available:

BIOS / BMC / Bundle Firmware for AS-1114S-WN10RT and BIOS / BMC / Bundle Firmware for AS-1114S-WN10RT.

Oh wait, that's the same download. I want to update the BIOS even less than I want to update the BMC. And it also mentions that updating needs to be done via the EFI shell, which requires the system to be rebooted, which I also don't want.

Is there another way?

Faking the time

So, as far as I understand, the trouble is that a component in the iKVM application validates the certificate expiry time. This could be inside the bundled Java application, or in an auxiliary application like stunnel.

In any case, unlike the earlier issues where iKVM was rejected due to the BMC clock being set incorrectly, this problem looks to be client side. And on the client, we have control. Like control over the “current time”.

Mr. Chat, whip me up a quick LD_PRELOAD shared libary that fakes the time being 2 years in the past.

In mere seconds, I received an almost functional faketime.c. Fixed, and annotated with docs:

/**
 * LD_PRELOAD library to "fake" the current time. Takes a
 * FAKETIME_OFFSET=<seconds> as environment variable.
 *
 * Compile:
 *   gcc -shared -fPIC -o faketime.so faketime.c -ldl -Wall
 * Run:
 *   FAKETIME_OFFSET=$((2*365*24*60*60)) LD_PRELOAD=$PWD/faketime.so date
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

static time_t offset_seconds = 0;

time_t time(time_t *tloc) {
    static time_t (*real_time)(time_t *) = NULL;
    time_t real, fake;
    if (!real_time)
        real_time = dlsym(RTLD_NEXT, "time");
    real = real_time(NULL);
    fake = real - offset_seconds;
    if (tloc)
        *tloc = fake;
    return fake;
}

int gettimeofday(struct timeval *tv, void *tz) {
    static int (*real_gettimeofday)(struct timeval *, struct timezone *) = NULL;
    int res;
    if (!real_gettimeofday)
        real_gettimeofday = dlsym(RTLD_NEXT, "gettimeofday");
    res = real_gettimeofday(tv, tz);
    if (res == 0)
        tv->tv_sec -= offset_seconds;
    return res;
}

int clock_gettime(clockid_t clk_id, struct timespec *tp) {
    static int (*real_clock_gettime)(clockid_t, struct timespec *) = NULL;
    int res;
    if (!real_clock_gettime)
        real_clock_gettime = dlsym(RTLD_NEXT, "clock_gettime");
    res = real_clock_gettime(clk_id, tp);
    if (res == 0 && (clk_id == CLOCK_REALTIME || clk_id == CLOCK_REALTIME_COARSE))
        tp->tv_sec -= offset_seconds;
    return res;
}

__attribute__((constructor))
void init() {
    const char *env = getenv("FAKETIME_OFFSET");
    if (env) {
        offset_seconds = strtoll(env, NULL, 10);
    } else {
        /* Default to 2 years (730 days) if not set */
        offset_seconds = 730 * 24 * 60 * 60;
    }
}

A quick polish, compilation with gcc -shared -fPIC -o faketime.so faketime.c -ldl and test:

$ LD_PRELOAD=$PWD/faketime.so date
Tue May 30 11:29:26 UTC 2023
$ FAKETIME_OFFSET=$((-2*24*60*60)) LD_PRELOAD=$PWD/faketime.so date
Sat May 31 11:29:53 UTC 2025

Yup, it works, even with negative offsets for future dates.

But does it also work with iKVM? Drum rolls...

$ FAKETIME_OFFSET=$((2*365*86400)) LD_PRELOAD=$PWD/faketime.so ipmikvm 10.10.10.10
attempting login on '10.10.10.10' with user ADMIN
...
connect failed sd:16
Retry =1
Gtk-Message: 13:31:13.519: Failed to load module "canberra-gtk-module"
a singal 17 is raised
GetFileDevStr:4051 media_type = 40

Success! A working iKVM viewer. And another utility library for under my belt.

BMC upgrade anyway

When writing this post, I did fetch the BIOS firmware again, to double check some things, and I noticed that inside H12SSW-NTR_3.3_AS01.05.21_SAA1.2.0-p2.zip, there is both a BIOS zip and a and a BMC BMC_H12AST2500-ROT-2201MS_20250430_01.05.21_STDsp.zip file.

Through Maintenance > Firmware Management > Update and deselecting “Preserve SSL Certificate” the BMC firmware could be updated. (I guess that the SSL certificate update question only applies to the web certificate though.)

The update appeared to stall at Upload firmware : 75% for a while, but it completed after a few minutes. The web interface was ungrayed, but a FW Update Mode: Current system is in FW update mode, any configuration changes are not recommended notice stayed visible and the current and new BMC firmware versions were listed, along with an Update button (which you need to press, or nothing happens, pro-tip).

Updating from 01.00.25 to 01.05.21... took about 10 more minutes. After System is rebooting, Please wait! it might be done. You'll have to check yourself.

And after all that waiting, I was finally greeted with new certificates. Both for the web interface and the iKVM interface:

$ easycert -T 10.10.10.10 443
Certificate chain
 0 s: {9C:70:1A:CF:57:9E:B9:A1:CE:EB:B8:99:47:3C:A8:4D:88:1B:BD:16} [a8c7e6cd] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI, emailAddress = support@supermicro.com
   i: {9C:70:1A:CF:57:9E:B9:A1:CE:EB:B8:99:47:3C:A8:4D:88:1B:BD:16} [a8c7e6cd] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI, emailAddress = support@supermicro.com
---
Expires in 3622 days
$ easycert -T 10.10.10.10 5900
Certificate chain
 0 s: {68:6D:88:26:E2:B4:34:3F:CC:06:FC:E7:02:99:2B:6C:FC:57:61:7D} [a8c7e6cd] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI, emailAddress = support@supermicro.com
   i: {68:6D:88:26:E2:B4:34:3F:CC:06:FC:E7:02:99:2B:6C:FC:57:61:7D} [a8c7e6cd] C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI, emailAddress = support@supermicro.com
---
Expires in 3622 days

Ok. No faketime.so needed anymore.

Hardware crash

Unfortunately, the BMC firmware upgrade came with a “present”:

mlx5_core 0000:81:00.0: temp_warn:165:(pid 0): High temperature on sensors with bit set 0 1
mlx5_core 0000:81:00.1: temp_warn:165:(pid 0): High temperature on sensors with bit set 0 1
...
mlx5_core 0000:81:00.0: mlx5_crdump_collect:50:(pid 1100352): crdump: failed to lock vsc gw err -16
mlx5_core 0000:81:00.0: mlx5_health_try_recover:327:(pid 1100352): handling bad device here
mlx5_core 0000:81:00.0: mlx5_error_sw_reset:233:(pid 1100352): start
mlx5_core 0000:81:00.1: mlx5_crdump_collect:50:(pid 1031132): crdump: failed to lock vsc gw err -16
mlx5_core 0000:81:00.1: mlx5_health_try_recover:327:(pid 1031132): handling bad device here
mlx5_core 0000:81:00.1: mlx5_error_sw_reset:233:(pid 1031132): start
mlx5_core 0000:81:00.0: NIC IFC still 7 after 1000ms.
mlx5_core 0000:81:00.0: mlx5_error_sw_reset:266:(pid 1100352): end
mlx5_core 0000:81:00.1: NIC IFC still 7 after 1000ms.
mlx5_core 0000:81:00.1: mlx5_error_sw_reset:266:(pid 1031132): end
mlx5_core 0000:81:00.0: mlx5_wait_for_pages:735:(pid 1100352): Skipping wait for vf pages stage
mlx5_core 0000:81:00.1: mlx5_wait_for_pages:735:(pid 1031132): Skipping wait for vf pages stage
mlx5_core 0000:81:00.0: mlx5_health_try_recover:330:(pid 1100352): health recovery flow aborted, PCI reads still not working
mlx5_core 0000:81:00.1: mlx5_health_try_recover:330:(pid 1031132): health recovery flow aborted, PCI reads still not working

Not sure if the upgrade was the culprit, but mid-upgrade the Mellanox network devices both crashed. Unloading and reloading the kernel modules did not revive them, so a reboot was needed.

Luckily I didn't have to fight the expired certificate this time.

2025-05-05 - windows install / media driver missing / ventoy

Recently I had the misfortune of assisting someone with installing Windows 11 on their computer. I tried to push Ubuntu/Linux on them, but they wouldn't budge. So I begrudgingly prepped the USB stick with the ISO and got the installer running.

# dd if=windows11.iso of=/dev/sdx \
    bs=1M status=progress oflag=dsync
Windows installation screen prompting language selection. Blue title bar.

The ISO boots from USB. So, writing the USB was successful. And we got language selection.

But, on the second screen, we were greeted with the following error:

A media driver your computer needs is missing. This could be a DVD, USB or Hard disk driver.

Windows installation screen saying: A media driver your computer needs is missing. This could be a DVD, USB or Hard disk driver. Gray title bar.

A media driver your computer needs is missing

The message seems descriptive, but really it isn't. The BIOS had detected all the drives in this machine: an M.2, and two SSD SATA drives. What are you not detecting? SSDs on SATA that have existed for ages now. Surely Windows driver support cannot be this bad.

The two logical options would be either to get specific motherboard drivers (unexpected, but possible if Windows is really lacking in support, or if this hardware was really uncommon), or that a BIOS setting needed to be adjusted.

The Gigabyte B550M had "preinstall" drivers, which I downloaded onto a second(!) USB. One of the NVMe drivers was visible (supposedly compatible) for installation, but after some loading, it wasn't compatible/usable anyway.

For BIOS options, the internet told us to disable (or enable!) various BIOS options, related to: CSM, TPM, Secure boot, AHCI, Legacy USB. And switching from a USB 3.0 or a USB 2.0 port, or vice versa. Nothing helped.

Lastly, I reconsidered the advice that writing the USB had to be done using the Media Creation Tool or Rufus.exe. This did not make sense to me, because it booted just fine, so what could be wrong with my dd-written USB? But, we'd have to try this option too.

I run Linux on my systems, so I did not have any Windows laying around to run Media Creation Tool or Rufus with. When looking for Linux alternatives to those, the internet pointed to Ventoy.

Ventoy

Ventoy is not, like I thought it would be, a tool that writes an ISO to your USB stick, like dd(1). No, you don't tell it which ISO to write. Instead, it sets up a nifty boot loader and an empty filesystem.

Ventoy GUI showing an Install button and which USB media to write to.

Initializing the USB stick was easy with the GUI installer. Unplugging and plugging the USB back in was sufficient for Ubuntu to recognise the Ventoy filesystem on the stick. Now it was a simple matter of copying the ISO to that filesystem — using cp(1), not dd(1).

Lo and behold, when booting we were presented with something akin to this:

Ventoy boot menu, showing various ISO files to choose to boot from.

Nice! We can have a bootable USB stick with several installation options. One stick, many installs — no more last-minute flashing. (On top of that, Ventoy can leave free space for general purpose partitions.)

And the interesting part: the Windows 11 installer now looked slightly different.

Windows installation screen prompting language selection. Gray title bar.

Observe how the title bar is now gray instead of blue, how it says Collecting information at the bottom and how the rest of the design matches the following screen.

And that's not all, pressing next, greeted us with a partition selection like one would expect:

Windows installation screen prompting which partition to use for installing Windows. Gray title bar.

And the rest of the installation worked as expected.

Two morals of the story this time:

  • When creating a bootable Linux USB, using dd(1) is sufficient, but for a Windows ISO it is not.
  • I can get used to this Ventoy boot loader. Having bootable installers for multiple operating systems on a single USB stick sounds super convenient.

2025-04-15 - pveproxy / systemd journald / perl buffering

Because I was debugging — what later turned out to be — an application firewall that was messing with my connections to Proxmox pveproxy on port 8006, I wanted more debug logs. Telling pveproxy to give me more info gave me a bit of a hard time.

Connections from a remote peer would get disconnected, and it initially looked like pveproxy was to blame.

Step one: get more info from pveproxy.

pveproxy has a pveproxy start --debug=1 mode. I added that to the systemd config using an /etc/systemd/system/pveproxy.service.d/override.conf file:

[Service]
# Wipe the original ExecStart= first.
ExecStart=
# Enable debugging.
ExecStart=/usr/bin/pveproxy start --debug=1

Alas, this refused to start, because --debug runs in the foreground. Add Type=simple:

[Service]
# pveproxy --debug stays in the foreground.
Type=simple
# Wipe the original ExecStart= first.
ExecStart=
# Enable debugging.
ExecStart=/usr/bin/pveproxy start --debug=1

Okay. The pveproxy service now started successfully. But no extra logging appeared. Except.. after a restart there suddenly were logs.

I've seen this before. Output buffering? Yes.

Output buffering is an optimization where output isn't flushed after every line, but first when a larger buffer is full. Useful if you're writing lots of output. Terrible if it means that log output is saved for later. I want to see my debug output live.

For Python, there is the PYTHONUNBUFFERED=1 environment variable or the python -u flag. pveproxy is a Perl application however. ChatGPT will gladly hallucinate a PERL_UNBUFFERED=1, but that doesn't exist. You need this code instead:

select STDERR;$|=1;select STDOUT;$|=1;

Doing file descriptor modifications, just before program start, with stdbuf(1) doesn't fix things. And I did not want to install Expect just for the unbuffer(1) application.

Is there another way to inject the $|=1; without modifying the existing Perl script?

I settled for the following. (I did have some trouble with systemd escaping at first. Not sure if that was an older systemd or just a brain fart on my end.)

[Service]
# pveproxy --debug stays in the foreground.
Type=simple
# Wipe the original ExecStart= first.
ExecStart=
# Enable debugging, with PERL_UNBUFFERED=1.
#ExecStart=/usr/bin/pveproxy start --debug=1
#ExecStart=/bin/sh -c 'exec /usr/bin/perl -Te '\''select STDERR;$|=1;select STDOUT;$|=1;do "/usr/bin/pveproxy" or die "Could not run: $@ $!";'\'' -- start --debug=1'
ExecStart=/usr/bin/perl -Te 'use strict;use warnings;select STDERR;$|=1;select STDOUT;$|=1;my $s="/usr/bin/pveproxy";do $s;die "Failed to start $s: $!" if $!;' -- start --debug=1

Now pveproxy starts and the debug output appears in journald as soon as it's written.

For the record, the debug output turned out to be useless. Instead, tcpdump revealed that traffic from the client would contain data (a [P], for the TLS ClientHello), but when looking at the traffic on the receiving host, I would see a [R] (reset). The TCP traffic had been altered to tear down the connection in the outbound direction. Apparently there was an application firewall in the middle, sniffing that port 8006 traffic, and deciding that TLS was dangerous on non-443 ports. *Sigh*

2025-03-17 - sed / regular expressions / optimizing

Last week, we were looking at using sed(1) to remove line feeds from CSV files. The regular expressions in that post could use some further inspection.

I'm sorry that this is a rather dry post. I wanted to get the numbers out there because the differences are significant. But without the example runs you're just left with numbers.

Feel free to skip right to the conclusions.

To recap: we were working on multiline CSV files. Matching even and odd numbers of double quotes is relevant, because they tell us whether or not we're still working on the same row.

We start off by creating two sample files:

$ python3 -c 'for x in range(100000): print("A"*1500)' >many-alpha1500.dat
$ python3 -c 'for x in range(100000): print("\""*1500)' >many-dquotes1500.dat

many-alpha1500.dat has 100k lines with 1500 letters A: zero double quotes.

many-dquotes1500.dat has 100k lines with 1500 double quotes: basically everything is a double quote.

We want to compare the speed of a match versus a non-match, and take printing a line or not out of the equation. For sed we add !d or d to the case that matches:

$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.950s (all user time)

The astute reader will recognise d41d8cd98f00b204e9800998ecf8427e as the empty md5 hash. We want no output, because printing would affect the timing.

The regular expression from the previous post, which was quickly cobbled together, looked like this:

/^[^"]*("[^"]*"[^"]*)*$/

That is, zero-or-more non-double quotes, then a combination of two double quotes with non-double quotes around it, and that as many times as possible. Or, in plain English: an even amount of double quotes.

Reducing backtracking

Regular expression backtracking wisdom says you need to avoid optional matches (?, *, +). Rewriting this slightly to /^([^"]|"[^"]*")*$/ makes sense, both for clarity and fewer optionals. For speed it doesn't seem to matter though.

$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.901s (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.844s (all user time)
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.071s (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.098s (all user time)

The regular expression did not seem to matter, but the dataset did. For some reason, the matching double quotes take a slightly faster path.

Here it's hard to tell whether it's slow because of backtracking, or because of something else in the regex. It's not the address code itself, because replacing the regex with /./ or /(.*)/ makes it fast. And when feeding it double quotes (the "hard" case), it was in fact faster (0m4.071s/0m4.098s); maybe because of the larger match between parentheses?

What we do observe though, is if we make it not-matching (look for an uneven count of double quotes), it's suddenly fast for the A case:

$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.365s (all user time)

My instinct says those two matches should take the same amount of time on a long string of A characters. But apparently I'm wrong.

The lots-of-double-quotes case is now (expectedly) slow, even though it doesn't match either.

$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.490s (all user time)

grep and perlre

We take a quick detour using grep.

grep differentiates between our old and our new regular expression:

$ time grep -vE '^[^"]*("[^"]*"[^"]*)*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m13.411s (all user time)
$ time grep -vE '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m21.086s (all user time)

.. but in a bad way. The — in my opinion — better expression performs worse. And it's horribly slow compared to sed.

grep can also do Perl regular expressions (perlre, using the -P option). Let's see how that compares:

$ time grep -vE '^.*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.403s (all user time)
$ time grep -vP '^.*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.106s (all user time)

Using -P is twice as fast for the most basic expression. Let's try a slightly less simple, but still trivial, one:

$ time grep -vE '^[^"]*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.062s (all user time)
$ time grep -vP '^[^"]*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.113s (all user time)

The above example tells us that using even the simplest regular expressions can be slow using the old POSIX engine in grep. Perl or other modern implementations will be slightly faster in trivial cases, and insanely much faster in slightly more complicated cases:

$ time grep -vE '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m21.072s (all user time)
$ time grep -vP '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.267s (all user time)

And no, before you ask, memory usage is only slightly higher when using the Perl engine.

Summary

Here's a quick summary of the results from above, although it doesn't show all the nuance between the different inputs and matching vs. non-matching.

sed -E grep -E grep -P
^.*$2.400.400.11
^[^"]*$4.254.150.11
^([^"]|"[^"]*")*$5.0021.200.32
^[^"]*("[^"]*"[^"]*)*$5.0013.700.11

(Lower is better.)

And what about Python? Yes, there is that too. More different results. Hard to keep track of all dimensions now. And now there is even system time involved.

$ time python -c "$(printf '%s\n' \
    'import re;import sys;c=re.compile(r'\''^[^"]*("[^"]*"[^"]*)*$'\'')' \
    'for l in sys.stdin:' ' if not c.match(l):' '  print(l)')" \
    < many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m0.122s
user  0m0.086s
sys   0m0.036s
$ time python -c "$(printf '%s\n' \
    'import re;import sys;c=re.compile(r'\''^[^"]*("[^"]*"[^"]*)*$'\'')' \
    'for l in sys.stdin:' ' if not c.match(l):' '  print(l)')" \
    < many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -

real  0m4.176s
user  0m2,768s
sys   0m1.369s

I give up trying to explain any of this.

Conclusions

The conclusions I draw from the above are:

  1. you should test your regular expressions on exaggerated (larger) datasets instead of assuming, and
  2. before you try to optimize your regular expressions you may want to consider a different regular expression engine instead,
  3. and you'll probably want to avoid both grep -E and sed altogether if speed matters.

2025-03-15 - sed / remove line feeds / csv

Can you use sed(1) to remove line feeds?

Yes you can. Even though sed, the stream editor, is line based, you can use it to remove line feeds from output.

In this post we'll be exploring how that is done and look into how a slight refactoring step can provide a big speed gain. The task at hand is removing line feeds from CSV file with multiline strings. Why? Because I want to use grep and awk on it, and multiline strings would complicate that a lot. Why sed? Because I wanted to know if it's as easy as I think it should be.

Let's dig in.

sample.csv

We'll start off with a small sample.csv that we can play around with, to get us going:

idx,name,year,summary,url
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park
Eastnor
Herefordshire (UK)",https://www.emfcamp.org/schedule/2018
2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/
3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp.
Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO
4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-

As you can see, this CSV file has one header row and four CSV data rows. But because some rows are split over multiple lines, doing a simple grep, cut or awk will not work as intended.

Here's a feedble attempt to get all the URLs from the fifth column, by asking awk to split by comma:

$ awk -F, '{print $5}' sample.csv
url




https://www.osso.nl/blog/2019/recap2019/


-

As expected, awk will not work, as it works on the lines, and has no idea where a record/row begins.

Quick sidenote: using cut -d, -f5 sample.csv would produce different, and even more wrong, output. If there is no fifth field, you'll get an earlier field instead. See this example:

$ echo how | cut -d, -f2,3,4
how
$ echo how,are,you | cut -d, -f2,3,4
are,you

That's unexpected behaviour that awk avoids.

The goal

Selecting the URLs from the CSV sample would have worked fine, had it looked like this:

idx,name,year,summary,url
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018
2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/
3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO
4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-

Here we have one header and four rows, and they add up to five lines. Selecting the fifth column is easy now:

$ awk -F, '{print $5}' goal.csv
url
https://www.emfcamp.org/schedule/2018
https://www.osso.nl/blog/2019/recap2019/
https://wiki.mch2022.org/Village:OSSO
-

As you may have noticed, a comma in CSV column/string would trip us up. Ignore that for now. You can argue that using regular unix utilities to work with CSV files is the wrong approach, but I think that often times the command line is king. If you're working with new/varied data, using the command line to quickly explore and prototype is unbeatable. You can choose a real programming language later when you're done prototyping.

So, we have the input (sample.csv), and we have the output (goal.csv with md5 hash d7a5eba9e0a80a1d8c1158a1430927a3). Can we coax sed into producing goal.csv? We can. (I used the md5 hash to confirm that all the versions of goal.csv are equal.)

Attempt #1: making it work

Here's attempt one, which I have to admit took me longer to produce than I wanted. And I have to say ChatGPT was of no help at all. I can tell you its sed skills are abysmal.

#!/bin/sed -Ef

# Append line to hold space, clear pattern
# space, put hold space in pattern space.
H
z
x

# If pattern space contains an uneven number of
# double quotes, place everything back into the
# hold space and don't display.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    h
    d
}

# If pattern space contains an even number of
# double quotes, we're done collecting.
# The first hold added a linefeed to void,
# remove it now.
# Replace the others with a single space.
/^[^"]*("[^"]*"[^"]*)*$/ {
    s/^\n//
    s/\n/ /g
}

That's my first working attempt:

$ ./attempt-1.sed | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -

It works! But it has a couple of odd quirks.

Taking you through it one by one:

  1. sed has two buffers: the pattern buffer and the hold buffer. You're always working on the pattern buffer. The first H appends the input (the current pattern) to the hold buffer.
  2. The z wipes (zeroes) the pattern buffer, the x swaps them. Now the pattern buffer has all the collected input so far.
  3. The /^[^"]*("[^"]*"[^"]*)*"[^"]*$/ regex checks that there is an uneven number of double quotes: if it is uneven, the pattern buffer is replaced into the hold buffer, the line is deleted (not printed), and the cycle restarts.
  4. If we get to the next regex (/^[^"]*("[^"]*"[^"]*)*$/) we've confirmed that the quotes are even. We remove the first linefeed (which got added after the first H) and replace the rest with a single space.
  5. If we get here, we're not replacing the pattern buffer into the hold buffer (using h), so the next cycle starts clean.

So. That works. And after looking more into the sed info pages, it turns out that the d command restarts the cycle. That means that the second regex match isn't even needed. It could instead just be:

# If we get here, the pattern space contains an
# even number of double quotes. The first hold
# added a linefeed to void, remove it now.
# Replace the others with a single space.
s/^\n//
s/\n/ /g

Attempt #2: use only the pattern buffer

When working with real data, it turned out that this was rather slow. For a simple 1000 row, 1 MiB file, having the script take more than a second seemed inappropriate.

The thing that immediately stands out is the swapping between pattern and hold space. Step one is avoiding that.

#!/bin/sed -Ef

:x

# If pattern space contains an uneven number of
# double quotes, add more to the line and
# continue at x.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    N
    bx
}

# Apparently we have an even number of double
# quotes. Remove LFs and print (by ending the
# cycle).
s/\n/ /g

This looks better already. We don't need the hold buffer at all.

  1. The N reads a new line and appends it to the pattern buffer.
  2. bx jumps back to the :x label.
  3. This repeats for as long as there is an uneven amount of double quotes in the pattern buffer.
  4. Once the double quotes are even, we end up on the last line.
  5. There we replace the line feeds with spaces and let the cycle restart with the automatic printing of the pattern buffer.

Is it faster?

Well.. according to a couple of sample runs with a 1.3 MiB dataset, maybe we go from 690ms to 670ms. Definitely not earth shattering speed gains. What's next?

Attempt #3: stop doing double work

On to fix the real problem, the repetitive regular expression. The data set I'm working with at the moment looks like this:

$ wc dataset.csv
  23679  156039 1389021 dataset.csv
$ ./attempt-2.sed dataset.csv | wc
   1001  156039 1389021

A 1.3 MiB file, 1001 rows, 23679 lines. On average, that's 27 lines per row and 1388 bytes per row.

That means that on average, it has to rerun the regular expression 27 times before it doesn't match anymore (because the double quotes are even). And it has to do so on a line that grows to up to 1388 characters:

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:

Uneven amount of quotes? Yes, 3. Restart.

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park

Still uneven amount of quotes? Yes, 3. Restart.

1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Eastnor Castle Deer Park
Eastnor

Still uneven amount of quotes? Yes, 3. Restart again.

You can see how that's inefficient. It would be better if it just counted the double quotes on the latest input line.

Attempt #3 of the script fixes that:

#!/bin/sed -nEf

# Find an uneven amount of double quotes.
# If found, populate hold space, load next line
# and jump to inquote.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    h
    n
    b inquote
}

# Not found? Jump to done.
b done

# Loop here until the input line also has an
# uneven amount of quotes: uneven + uneven is
# even. In the mean time, add to the hold
# buffer and fetch the next line.
:inquote
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
    H
    x
    b done
}
H
n
b inquote

:done
s/\n/ /g
# sed -n and manual printing is needed,
# otherwise the 'n' will print the pattern
# space.
p

Admittedly, this looks a lot less like a sed oneliner, and more like an obscure programming language. But, boy, it is so much faster. With the same dataset, we go down to mere milliseconds.

It works like this:

  1. Find an uneven amount of double quotes. If not found, jump to the done label.
  2. At the done label, we replace the line feeds, and print the output. Restart the cycle.
  3. If we did find an uneven amount of double quotes, store the pattern in the hold space, get the next pattern, and go to the inquote label.
  4. At inquote we also look for an uneven amount of quotes. If not found it means that we're not done (uneven + even = uneven). We append the pattern to the hold buffer using H and load up the next pattern with n. Go back to inqoute.
  5. If we did find an uneven amount of double quotes, we're done (uneven + uneven = even). Append this line as well with H, and then exchange the hold and the pattern buffer with x so we can work on the entire line. (Using H;x is the same as x;G in this case.) Go to done (see above).

Timings

For completeness sake, here are some representative timing outputs:

$ time ./attempt-1.sed dataset.csv >/dev/null

real  0m0.696s (all user time)
$ time ./attempt-2.sed dataset.csv >/dev/null

real  0m0.664s (all user time)
$ time ./attempt-3.sed dataset.csv >/dev/null

real  0m0.033s (all user time)

As you can see, that's much better.

In the next post, we'll see if we can improve the regular expressions.