Notes to self, 2025
2025-03-17 - sed / regular expressions / optimizing
Last week, we were looking at using sed(1) to remove line feeds from CSV files. The regular expressions used could use some further inspection.
We start off by creating two sample files:
$ python3 -c 'for x in range(100000): print("A"*1500)' >many-alpha1500.dat
$ python3 -c 'for x in range(100000): print("\""*1500)' >many-dquotes1500.dat
We want to compare the speed of a match versus a non-match, and take printing a line or not out of the equation. For sed we add !d or d to the case that matches:
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.693s  (all user time)
The regular expression from the previous post, which was quickly cobbled together, looked like this:
/^[^"]*("[^"]*"[^"]*)*$/
That is, zero-or-more non-double-quotes, then a combination of two double quotes with non-double-quotes around them, repeated as many times as possible. Or, in plain English: an even number of double quotes.
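To see what that means in practice, here is the expression applied to two made-up lines (keeping only the lines that match, i.e. the ones with an even number of double quotes):
$ printf '%s\n' 'a,"even",b' 'a,"odd,b' | sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/!d'
a,"even",b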
Reducing backtracking
Regular expression backtracking wisdom says you need to avoid optional matches (?, *, +). Rewriting this slightly to /^([^"]|"[^"]*")*$/ makes sense, both for clarity and fewer optionals.
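As a quick sanity check that the old and the new expression accept the same lines (the test strings and tests.txt are made up for this check, and the process substitution needs bash):
$ printf '%s\n' 'plain' 'a "b" c' 'odd " quote' 'x ""y"" z' >tests.txt
$ diff <(sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/!d' tests.txt) \
       <(sed -Ee '/^([^"]|"[^"]*")*$/!d' tests.txt) && echo identical
identical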
For speed it doesn't seem to matter though.
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.682s  (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.665s  (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m6.647s  (all user time)
The regular expression did not seem to matter, but the dataset did.
Here it's hard to tell whether it's slow because of backtracking, or because of something else in the regex. It's not the address code itself, because replacing the regex with /./ or /(.*)/ makes it fast. And when feeding it double quotes (the "hard" case), it was in fact faster (0m6.647s); likely because of the larger match between parentheses.
What we do observe, though, is that if we make it non-matching (looking for an uneven count of double quotes instead), it's suddenly fast for the A case:
$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m0.479s  (all user time)
My instinct says those two matches should take the same amount of time on a long string of A characters. But apparently I'm wrong.
The lots-of-double-quotes case is now (expectedly) slow, even though it doesn't match either.
$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.530s  (all user time)
grep and perlre
We take a quick detour using grep.
grep differentiates between our old and our new regular expression:
$ time grep -E '^[^"]*("[^"]*"[^"]*)*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m20.917s (all user time)
$ time grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m31.418s (all user time)
… but in a bad way: the better expression performs worse.
grep can also do perl regular expressions (perlre, with -P). Let's see how that compares:
$ time grep -E '^.*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.545s  (all user time)
$ time grep -P '^.*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.243s  (all user time)
A slight speedup using -P.
$ time grep -E '^[^"]*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m6.816s (all user time)
$ time grep -P '^[^"]*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m0.244s (all user time)
Using grep tells us that even the simplest regular expressions can be slow with the old POSIX engine. Perl or other modern implementations will be slightly faster in trivial cases, and insanely faster in slightly more complicated cases:
$ time grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m31.385s  (all user time)
$ time grep -P '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.755s  (all user time)
And no, before you ask, memory usage is only slightly higher with the perl engine.
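If you want to check that yourself, GNU time's -v flag reports peak memory; look for the "Maximum resident set size" line in its output (not shown here, and the numbers will differ per system):
$ /usr/bin/time -v grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null
$ /usr/bin/time -v grep -P '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null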
Conclusions
The conclusions we draw from the above are:
- you should test your regular expressions on exaggerated (larger) datasets instead of assuming, and
- before you try to optimize your sed regular expressions you may want to consider a different regular expression engine instead.
2025-03-15 - sed / remove line feeds / csv
Can you use sed(1) to remove line feeds?
Yes you can. Even though sed, the stream editor, is line based, you can use it to remove line feeds from output.
In this post we'll explore how that is done and look into how a slight refactoring step can provide a big speed gain. The task at hand is removing line feeds from a CSV file with multiline strings. Why? Because I want to use grep and awk on it, and multiline strings would complicate that a lot. Why sed? Because I wanted to know if it's as easy as I think it should be.
Let's dig in.
sample.csv
We'll start off with a small sample.csv that we can play around with, to get us going:
idx,name,year,summary,url 1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018 2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/ 3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO 4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-
As you can see, this CSV file has one header row and four CSV data rows. But because some rows are split over multiple lines, doing a simple grep, cut or awk will not work as intended.
Here's a feeble attempt to get all the URLs from the fifth column, by asking awk to split by comma:
$ awk -F, '{print $5}' sample.csv
url
https://www.osso.nl/blog/2019/recap2019/
-
As expected, awk does not do what we want, because it operates on lines and has no idea where a record/row begins.
Quick sidenote: using cut -d, -f5 sample.csv would produce different, and even more wrong, output. If a line does not have the requested fields, you can get earlier fields (or even the whole line) instead. See this example:
$ echo how | cut -d, -f2,3,4
how
$ echo how,are,you | cut -d, -f2,3,4
are,you
That's unexpected behaviour that awk avoids.
The goal
Selecting the URLs from the CSV sample would have worked fine, if it looked like this:
idx,name,year,summary,url 1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018 2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/ 3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO 4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-
Here we have one header and four rows, and they add up to five lines. Selecting the fifth column is easy now:
$ awk -F, '{print $5}' goal.csv
url
https://www.emfcamp.org/schedule/2018
https://www.osso.nl/blog/2019/recap2019/
https://wiki.mch2022.org/Village:OSSO
-
As you may have noticed, a comma inside a CSV column/string would trip us up. Ignore that for now. You can argue that using regular unix utilities to work with CSV files is the wrong approach, but I think that often the command line is king. If you're working with new/varied data, using the command line to quickly explore and prototype is unbeatable. You can choose a real programming language later when you're done prototyping.
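For example, a quoted field that itself contains a comma (this row is made up, it is not in the sample) gets split in the wrong place:
$ echo '1,"hello, world",2025' | awk -F, '{print $3}'
 world"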
So, we have the input (sample.csv), and we have the output (goal.csv, md5sum d7a5eba9e0a80a1d8c1158a1430927a3). Can we coax sed into producing goal.csv? We can. (I used the md5sum to confirm that all the versions of goal.csv are equal.)
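That checksum is simply md5sum's output for the target file:
$ md5sum goal.csv
d7a5eba9e0a80a1d8c1158a1430927a3  goal.csv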
Attempt #1: making it work
Here's attempt one, which I have to admit took me longer to produce than I wanted. And I have to say ChatGPT was of no help at all. I can tell you its sed skills are abysmal.
#!/bin/sed -Ef
# Add line to hold space, clear pattern space, put hold space in pattern space.
H
z
x
# If pattern space contains an uneven number of double quotes, place
# everything back into the hold space and don't display.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  d
}
# If pattern space contains an even number of double quotes, we're done
# collecting. The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
/^[^"]*("[^"]*"[^"]*)*$/ {
  s/^\n//
  s/\n/ /g
}
That's my first working attempt:
$ ./attempt-1.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -
It works! But it has a couple of odd quirks.
Taking you through it one by one:
- sed has two buffers: the pattern buffer and the hold buffer. You're always working on the pattern buffer. The first H appends the input (the current pattern) to the hold buffer.
- The z zeroes the pattern buffer, the x swaps them, so now the pattern buffer has all the collected input so far.
- The /^[^"]*("[^"]*"[^"]*)*"[^"]*$/ regex checks whether there is an uneven number of double quotes: if it is uneven, the pattern buffer is copied back into the hold buffer (h), the line is deleted (not printed) and the cycle restarts.
- If we get to the next regex (/^[^"]*("[^"]*"[^"]*)*$/) we've confirmed that the quotes are even. We remove the first linefeed (which got added by the very first H; a tiny demo of that stray linefeed follows below this list) and replace the rest with a single space.
- If we get here, we're not replacing the pattern buffer into the hold buffer (using h), so the next cycle starts clean.
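Here's the promised demonstration of where that stray leading linefeed comes from; the echoed string is arbitrary, and GNU sed's l command prints the pattern space unambiguously:
$ echo hello | sed -nE 'H;z;x;l'
\nhello$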
So. That works. And after looking more into the sed info pages, it turns out that the d restarts the cycle. So that second regex match isn't even needed. It could instead just be:
# If we get here, the pattern space contains an even number of double quotes.
# The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
s/^\n//
s/\n/ /g
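Put together (just a sketch: the script from above with the second address and its braces dropped), the simplified first attempt would look like this:
#!/bin/sed -Ef
# Add line to hold space, clear pattern space, put hold space in pattern space.
H
z
x
# If pattern space contains an uneven number of double quotes, place
# everything back into the hold space and don't display. The d also
# restarts the cycle, so nothing below runs in that case.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  d
}
# If we get here, the pattern space contains an even number of double quotes.
# The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
s/^\n//
s/\n/ /g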
Attempt #2: use only the pattern buffer
When working with real data, it turned out that this was rather slow. For a simple 1000 row, 1 MiB file, having the script take more than a second seemed inappropriate.
The thing that immediately stands out is the swapping between pattern and hold space. Step one is avoiding that.
#!/bin/sed -Ef
:x
# If pattern space contains an uneven number of double quotes, add more
# to the line and continue at x.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  N
  bx
}
# Apparently we have an even number of double quotes. Remove LFs and
# print (by ending the cycle).
s/\n/ /g
This looks better already. We don't need the hold buffer at all.
- The N reads a new line and appends it to the pattern buffer.
- bx jumps back to the :x label.
- This repeats for as long as there is an uneven amount of double quotes in the pattern buffer.
- Once the double quotes are even, we end up on the last line.
- There we replace the line feeds with spaces and let the cycle restart with the automatic printing of the pattern buffer.
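On a contrived two-record input (these rows are made up; the first record spans two lines), it behaves like this:
$ printf '%s\n' '1,"line one' 'and two",x' '2,b,c' | ./attempt-2.sed
1,"line one and two",x
2,b,c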
Is it faster?
Well… according to a couple of sample runs with a 1.3 MiB dataset, maybe we go from 690ms to 670ms. Definitely not earth-shattering speed gains. What's next?
Attempt #3: stop doing double work
On to fix the real problem, the repetitive regular expression. The data set I'm working with at the moment looks like this:
$ wc dataset.csv
 23679 156039 1389021 dataset.csv
$ ./attempt-2.sed dataset.csv | wc
  1001 156039 1389021
A 1.3 MiB file, 1001 rows, 23679 lines. On average, that's roughly 24 lines per row and 1388 bytes per row.
That means that on average, it has to rerun the regular expression roughly 24 times before it doesn't match anymore (because the double quotes are even). And it has to do so on a line that grows to up to 1388 characters:
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Uneven amount of quotes? Yes, 3. Restart.
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park
Still uneven amount of quotes? Yes, 3. Restart.
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor
Still uneven amount of quotes? Yes, 3. Restart.
You can see how that's inefficient. It would be better if it just counted the double quotes on the latest input line.
Attempt #3 of the script fixes that:
#!/bin/sed -nEf
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  n
  b inquote
}
b done

:inquote
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  H
  x
  b done
}
H
n
b inquote

:done
s/\n/ /g
# sed -n and manual printing is needed,
# otherwise the 'n' will print the pattern
# space.
p
Admittedly, this looks a lot less like a sed oneliner, and more like an obscure programming language. But, oh boy, it is so much faster. With the same dataset, we go down to mere milliseconds.
It works like this:
- Find an uneven amount of double quotes. If not found, jump to the done label.
- At the done label, we replace the line feeds, and print the output. Restart the cycle.
- If we did find an uneven amount of double quotes, store the pattern in the hold space, get the next pattern, and go to the inquote label.
- At inquote we also look for an uneven amount of quotes. If not found, it means that we're not done (uneven + even = uneven). We append the pattern to the hold buffer using H and load up the next pattern with n. Go back to inquote.
- If we did find an uneven amount of double quotes, we're done (uneven + uneven = even). Append this line as well with H, and then exchange the hold and the pattern buffer with x so we can work on the entire line. (Using H;x is the same as x;G in this case.) Go to done (see above). A quick sanity check of the output follows after this list.
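As that sanity check (the post notes all versions were compared via md5sum), running this version on the small sample from earlier should reproduce the goal.csv checksum:
$ ./attempt-3.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -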
Timings
For completeness' sake, here are some representative timing outputs:
$ time ./attempt-1.sed dataset.csv >/dev/null
real    0m0,696s  (all user time)

$ time ./attempt-2.sed dataset.csv >/dev/null
real    0m0,664s  (all user time)

$ time ./attempt-3.sed dataset.csv >/dev/null
real    0m0,033s  (all user time)
As you can see, that's much better.
In the next post, we'll go and see if we need to improve the regular expressions.