Notes to self, 2021

2021-11-19 - systemd / zpool import / zfs mount / dependencies

On getting regular ZFS mount points to work with systemd dependency ordering.

ZFS on Ubuntu is nice. And so is systemd. But getting them to play nice together sometimes requires a little extra effort.

A problem we were facing was that services would get started before their respective mount points had all been made available. For example, for some setups, we have a local-storage ZFS zpool that holds the /var/lib/docker directory.

If there is no dependency ordering, Docker may start before the /var/lib/docker directory is mounted. Docker will start just fine, but it will write its files to the wrong location. This is extremely inconvenient. So we want to force docker.service to start only after /var/lib/docker has been mounted.

Luckily we can have systemd handle the dependency ordering for us. We'll have to tell the specific service, in this case docker.service, to depend on one or more mount points. That might look like this:

# /etc/systemd/system/docker.service.d/override.conf
[Unit]
RequiresMountsFor=/data/kubernetes/static-provisioner
RequiresMountsFor=/var/lib/docker

These unit file directives specify that the Docker service may only start once these two paths have both been mounted. To this end, systemd will look for definitions of data-kubernetes-static\x2dprovisioner.mount and var-lib-docker.mount.

The mount unit file format is described in systemd.mount(5). The filename must match the output of systemd-escape --path --suffix=mount MOUNTPOINT. So for the above two mount points, you might make these two unit files:

# /etc/systemd/system/data-kubernetes-static\x2dprovisioner.mount
[Unit]
Documentation=https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
After=zfs-mount.service
Requires=zfs-mount.service

[Mount]
Where=/data/kubernetes/static-provisioner
What=local-storage/kubernetes/static-provisioner
Type=zfs-non-legacy

# /etc/systemd/system/var-lib-docker.mount
[Unit]
Documentation=https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
After=zfs-mount.service
Requires=zfs-mount.service

[Mount]
Where=/var/lib/docker
What=local-storage/docker
Type=zfs-non-legacy
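
If you want to double-check the escaping (the \x2d is easy to get wrong), systemd-escape will print the exact unit names for you. For the two mount points above:

$ systemd-escape --path --suffix=mount /data/kubernetes/static-provisioner
data-kubernetes-static\x2dprovisioner.mount
$ systemd-escape --path --suffix=mount /var/lib/docker
var-lib-docker.mount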

Observe that Type=zfs-non-legacy requires some extra magic.

You might have been tempted to set Type=zfs, but that only works for so-called legacy mounts in ZFS. For those, you can do mount -t zfs tank/dataset /mountpoint. But for regular ZFS mounts, you cannot: they are handled by zfs mount and the mountpoint and canmount properties.
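
To see which kind of mount you're dealing with, inspect those properties. A quick check, using the dataset from the example above:

# Regular (non-legacy) datasets show a real path as mountpoint (and
# canmount=on); legacy datasets show mountpoint=legacy instead.
zfs get canmount,mountpoint local-storage/docker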

Sidenote: usually, you'll let zfs-mount.service mount everything. It will run zfs mount -a, which works if the zpool was also correctly/automatically imported. But because you have no guarantees there, it is nice to manually force the dependency on the specific directories you need, as done above.

To complete the magic, we add a helper script as /usr/sbin/mount.zfs-non-legacy. It takes on the burden of ensuring that the zpool is imported and that the dataset is mounted. The script basically looks like this:

#!/bin/sh
# Don't use this script, use the better version from:
# https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
name="$1"  # local-storage/docker
path="$2"  # /var/lib/docker

zpool import "${name%%/*}" || true  # import pool
zfs mount "${name}" || true

# Quick hack: mount again, it should be mounted now. If it isn't,
# we have failed and report that back to mount(8).
zfs mount "${name}" 2>&1 | grep -qF 'filesystem already mounted'
# (the status of grep is returned to the caller)

That allows systemd to call mount -t zfs-non-legacy local-storage/docker /var/lib/docker, which is then handled by the script.

A better version of this script can be found as mount.zfs-non-legacy from ossobv/vcutil. This should get you proper mount point dependencies and graceful breakage when something fails.
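
After dropping the override and the mount unit files in place, the usual systemd reload applies. A minimal sequence, using the unit names from above:

# systemctl daemon-reload
# systemctl restart docker.service
# systemctl list-dependencies --after docker.service | grep -i mount

The last command lists the units that are ordered before docker.service; the two .mount units should show up there.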

2021-10-22 - zpool import / no pools / stale zdb labels

Today, when trying to import a newly created ZFS pool, we had to supply the -d DEV argument to find the pool.

# zpool import
no pools available to import

But I know it's there.

# zpool import local-storage
cannot import 'local-storage': no such pool available

And by specifying -d with a device search path, it can be found:

# zpool import local-storage -d /dev/disk/by-id

Success!

# zpool list -oname
NAME
bpool
local-storage
rpool

Manually specifying a search path is not really convenient. It would make the boot process a lot less smooth. We'd have to alter the distribution-provided scripts, which in turn makes upgrading more painful.

The culprit — it turned out — was an older zpool that had existed on this device. This caused the zeroth label to be cleared, but the first label to be used:

# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
failed to read label 0
------------------------------------
LABEL 1
------------------------------------
    version: 5000
    name: 'local-storage'
    state: 0
...

The easy fix here is to flush the data and start from scratch:

# zpool destroy local-storage

At this point, zdb -l still lists LABEL 1 as used.

# zpool labelclear /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234

Now the labels are gone:

# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

And after recreating the pool everything works normally:

# zpool create -O compression=lz4 -O mountpoint=/data local-storage \
    /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'local-storage'
    state: 0
...
# zpool export local-storage
# zpool import
   pool: local-storage
     id: 8392074971509924158
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
...
# zpool import local-storage

All good. The pool now imports without having to specify a device search path.

2021-10-04 - letsencrypt root / certificate validation on jessie

On getting LetsEncrypt certificates to work on Debian/Jessie or Cumulus Linux 3 again.

Last Thursday the 30th, at 14:01 UTC, the old LetsEncrypt root certificate stopped working. This was a known and anticipated issue. All certificates had long been double-signed by a new root that doubles as an intermediate. Unfortunately, this does not mean that everything kept working on older platforms with OpenSSL 1.0.1 or 1.0.2.

See this Debian/Jessie box — we see similar behaviour on Cumulus Linux 3.x:

# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Everything is up to date.

# curl https://wctegeltje.nl
curl: (60) SSL certificate problem: certificate has expired
More details here: http://curl.haxx.se/docs/sslcerts.html

Yet the certificate is marked as expired.

Quickly check the chain on another box:

$ easycert -T wctegeltje.nl 443
Certificate chain
 0 s: [bb678ac6] CN = wctegeltje.nl
   i: [8d33f237] C = US, O = Let's Encrypt, CN = R3
 1 s: [8d33f237] C = US, O = Let's Encrypt, CN = R3
   i: [4042bcee] C = US, O = Internet Security Research Group, CN = ISRG Root X1
 2 s: [4042bcee] C = US, O = Internet Security Research Group, CN = ISRG Root X1
   i: [2e5ac55d] O = Digital Signature Trust Co., CN = DST Root CA X3
---
Expires in 30 days

So yeah. The root-most part here has expired, but the intermediate-root-double has not. See these:

# openssl x509 -in /etc/ssl/certs/2e5ac55d.0 -enddate -noout
notAfter=Sep 30 14:01:15 2021 GMT
# openssl x509 -in /etc/ssl/certs/4042bcee.0 -enddate -noout
notAfter=Jun  4 11:04:38 2035 GMT

How do we fix this? Easy. Just clear out the expired root:

# mv /usr/share/ca-certificates/mozilla/DST_Root_CA_X3.crt{,.old}
# sed -i -e 's#^mozilla/DST_Root_CA_X3.crt#!&#' /etc/ca-certificates.conf
# update-ca-certificates
Updating certificates in /etc/ssl/certs... 0 added, 1 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.

(That last step removes /etc/ssl/certs/2e5ac55d.0 which is a symlink to DST_Root_CA_X3.pem.)

# curl https://wctegeltje.nl
<!DOCTYPE html>
...
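
If you'd rather verify the chain without curl, openssl can check it directly against the (now cleaned up) system trust store. Something like this should report a verify code of 0:

$ echo | openssl s_client -connect wctegeltje.nl:443 \
    -servername wctegeltje.nl -CApath /etc/ssl/certs 2>/dev/null |
    grep 'Verify return code'
    Verify return code: 0 (ok)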

2021-10-03 - umount -l / needs --make-slave

The other day I learned — the hard way — that umount -l can be dangerous. Using the --make-slave mount option makes it safer.

The scenario went like this:

A virtual machine on our Proxmox VE cluster wouldn't boot. No biggie, I thought. Just mount the filesystem on the host and do a proper grub-install from a chroot:

# fdisk -l /dev/zvol/zl-pve2-ssd1/vm-215-disk-3
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 *         2048 124999679 124997632 59.6G 83 Linux
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p2      124999680 125827071    827392  404M 82 Linux swap / Solaris
# mount /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 /mnt/root
# cd /mnt/root
# for x in dev proc sys; do mount --rbind /$x $x; done
# chroot /mnt/root

There I could run the necessary commands to fix the boot procedure.

All done? Exit the chroot, unmount and start the VM:

# logout
# umount -l /mnt/root
# qm start 215

And at that point, things started failing miserably.

You see, in my laziness, I used umount -l instead of four umounts for /mnt/root/dev, /mnt/root/proc, /mnt/root/sys and lastly /mnt/root. But what I was unaware of was that there were mounts inside dev, proc and sys too, which now also got unmounted.

And that led to an array of failures:

systemd complained about binfmt_misc.automount breakage:

systemd[1]: proc-sys-fs-binfmt_misc.automount: Got invalid poll event 16 on pipe (fd=44)
systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed with result 'resources'.

pvedaemon could not bring up any VMs:

pvedaemon[32825]: <root@pam> starting task qmstart:215:root@pam:
pvedaemon[46905]: start VM 215: UPID:pve2:ID:qmstart:215:root@pam:
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
  No such file or directory
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
  No such file or directory
systemd[1]: 215.scope: Failed to add PIDs to scope's control group:
  No such file or directory
systemd[1]: 215.scope: Failed with result 'resources'.
systemd[1]: Failed to start 215.scope.
pvedaemon[46905]: start failed: systemd job failed
pvedaemon[32825]: <root@pam> end task qmstart:215:root@pam:
  start failed: systemd job failed

The root runtime dir could not get auto-created:

systemd[1]: user-0.slice: Failed to create cgroup
  /user.slice/user-0.slice: No such file or directory
systemd[1]: Created slice User Slice of UID 0.
systemd[1]: user-0.slice: Failed to create cgroup
  /user.slice/user-0.slice: No such file or directory
systemd[1]: Starting User Runtime Directory /run/user/0...
systemd[4139]: user-runtime-dir@0.service: Failed to attach to
  cgroup /user.slice/user-0.slice/user-runtime-dir@0.service:
  No such file or directory
systemd[4139]: user-runtime-dir@0.service:
  Failed at step CGROUP spawning /lib/systemd/systemd-user-runtime-dir:
  No such file or directory

The Proxmox VE replication runner failed to start:

systemd[1]: pvesr.service: Failed to create cgroup
  /system.slice/pvesr.service: No such file or directory
systemd[1]: Starting Proxmox VE replication runner...
systemd[5538]: pvesr.service: Failed to attach to cgroup
  /system.slice/pvesr.service: No such file or directory
systemd[5538]: pvesr.service: Failed at step CGROUP spawning
  /usr/bin/pvesr: No such file or directory
systemd[1]: pvesr.service: Main process exited, code=exited,
  status=219/CGROUP
systemd[1]: pvesr.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Proxmox VE replication runner.

And, worst of all, new ssh logins to the host machine failed:

sshd[24551]: pam_systemd(sshd:session):
  Failed to create session: Connection timed out
sshd[24551]: error: openpty: No such file or directory
sshd[31553]: error: session_pty_req: session 0 alloc failed
sshd[31553]: Received disconnect from 10.x.x.x port 55190:11:
  disconnected by user

As you understand by now, this was my own doing, and the failures were caused by various missing mount points.

The failing ssh? A missing /dev/pts.

The other failures? Mostly mounts missing in /sys/fs/cgroup.

Fixing

First order of business was to get this machine to behave again. Luckily I had a different machine where I could take a peek at what was supposed to be mounted.

On the other machine, I ran this one-liner:

$ mount | sed -e '/ on \/\(dev\|proc\|sys\)\//!d
   s#^\([^ ]*\) on \([^ ]*\) type \([^ ]*\) (\([^)]*\)).*#'\
'mountpoint -q \2 || '\
'( mkdir -p \2; mount -n -t \3 \1 -o \4 \2 || rmdir \2 )#' |
  sort -V

That resulted in this output that could be pasted into the one ssh shell I still had at my disposal:

mountpoint -q /dev/hugepages || ( mkdir -p /dev/hugepages; mount -n -t hugetlbfs hugetlbfs -o rw,relatime,pagesize=2M /dev/hugepages || rmdir /dev/hugepages )
mountpoint -q /dev/mqueue || ( mkdir -p /dev/mqueue; mount -n -t mqueue mqueue -o rw,relatime /dev/mqueue || rmdir /dev/mqueue )
mountpoint -q /dev/pts || ( mkdir -p /dev/pts; mount -n -t devpts devpts -o rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 /dev/pts || rmdir /dev/pts )
mountpoint -q /dev/shm || ( mkdir -p /dev/shm; mount -n -t tmpfs tmpfs -o rw,nosuid,nodev,inode64 /dev/shm || rmdir /dev/shm )
mountpoint -q /proc/sys/fs/binfmt_misc || ( mkdir -p /proc/sys/fs/binfmt_misc; mount -n -t autofs systemd-1 -o rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=45161 /proc/sys/fs/binfmt_misc || rmdir /proc/sys/fs/binfmt_misc )
mountpoint -q /sys/fs/bpf || ( mkdir -p /sys/fs/bpf; mount -n -t bpf none -o rw,nosuid,nodev,noexec,relatime,mode=700 /sys/fs/bpf || rmdir /sys/fs/bpf )
mountpoint -q /sys/fs/cgroup || ( mkdir -p /sys/fs/cgroup; mount -n -t tmpfs tmpfs -o ro,nosuid,nodev,noexec,mode=755,inode64 /sys/fs/cgroup || rmdir /sys/fs/cgroup )
mountpoint -q /sys/fs/cgroup/blkio || ( mkdir -p /sys/fs/cgroup/blkio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,blkio /sys/fs/cgroup/blkio || rmdir /sys/fs/cgroup/blkio )
mountpoint -q /sys/fs/cgroup/cpuset || ( mkdir -p /sys/fs/cgroup/cpuset; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset /sys/fs/cgroup/cpuset || rmdir /sys/fs/cgroup/cpuset )
mountpoint -q /sys/fs/cgroup/cpu,cpuacct || ( mkdir -p /sys/fs/cgroup/cpu,cpuacct; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct || rmdir /sys/fs/cgroup/cpu,cpuacct )
mountpoint -q /sys/fs/cgroup/devices || ( mkdir -p /sys/fs/cgroup/devices; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,devices /sys/fs/cgroup/devices || rmdir /sys/fs/cgroup/devices )
mountpoint -q /sys/fs/cgroup/freezer || ( mkdir -p /sys/fs/cgroup/freezer; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,freezer /sys/fs/cgroup/freezer || rmdir /sys/fs/cgroup/freezer )
mountpoint -q /sys/fs/cgroup/hugetlb || ( mkdir -p /sys/fs/cgroup/hugetlb; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,hugetlb /sys/fs/cgroup/hugetlb || rmdir /sys/fs/cgroup/hugetlb )
mountpoint -q /sys/fs/cgroup/memory || ( mkdir -p /sys/fs/cgroup/memory; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,memory /sys/fs/cgroup/memory || rmdir /sys/fs/cgroup/memory )
mountpoint -q /sys/fs/cgroup/net_cls,net_prio || ( mkdir -p /sys/fs/cgroup/net_cls,net_prio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio || rmdir /sys/fs/cgroup/net_cls,net_prio )
mountpoint -q /sys/fs/cgroup/perf_event || ( mkdir -p /sys/fs/cgroup/perf_event; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,perf_event /sys/fs/cgroup/perf_event || rmdir /sys/fs/cgroup/perf_event )
mountpoint -q /sys/fs/cgroup/pids || ( mkdir -p /sys/fs/cgroup/pids; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,pids /sys/fs/cgroup/pids || rmdir /sys/fs/cgroup/pids )
mountpoint -q /sys/fs/cgroup/rdma || ( mkdir -p /sys/fs/cgroup/rdma; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,rdma /sys/fs/cgroup/rdma || rmdir /sys/fs/cgroup/rdma )
mountpoint -q /sys/fs/cgroup/systemd || ( mkdir -p /sys/fs/cgroup/systemd; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd /sys/fs/cgroup/systemd || rmdir /sys/fs/cgroup/systemd )
mountpoint -q /sys/fs/cgroup/unified || ( mkdir -p /sys/fs/cgroup/unified; mount -n -t cgroup2 cgroup2 -o rw,nosuid,nodev,noexec,relatime /sys/fs/cgroup/unified || rmdir /sys/fs/cgroup/unified )
mountpoint -q /sys/fs/fuse/connections || ( mkdir -p /sys/fs/fuse/connections; mount -n -t fusectl fusectl -o rw,relatime /sys/fs/fuse/connections || rmdir /sys/fs/fuse/connections )
mountpoint -q /sys/fs/pstore || ( mkdir -p /sys/fs/pstore; mount -n -t pstore pstore -o rw,nosuid,nodev,noexec,relatime /sys/fs/pstore || rmdir /sys/fs/pstore )
mountpoint -q /sys/kernel/config || ( mkdir -p /sys/kernel/config; mount -n -t configfs configfs -o rw,relatime /sys/kernel/config || rmdir /sys/kernel/config )
mountpoint -q /sys/kernel/debug || ( mkdir -p /sys/kernel/debug; mount -n -t debugfs debugfs -o rw,relatime /sys/kernel/debug || rmdir /sys/kernel/debug )
mountpoint -q /sys/kernel/debug/tracing || ( mkdir -p /sys/kernel/debug/tracing; mount -n -t tracefs tracefs -o rw,relatime /sys/kernel/debug/tracing || rmdir /sys/kernel/debug/tracing )
mountpoint -q /sys/kernel/security || ( mkdir -p /sys/kernel/security; mount -n -t securityfs securityfs -o rw,nosuid,nodev,noexec,relatime /sys/kernel/security || rmdir /sys/kernel/security )

Finishing touches:

$ for x in /sys/fs/cgroup/*; do
    test -L $x &&
    echo ln -s $(readlink $x) $x
  done
ln -s cpu,cpuacct /sys/fs/cgroup/cpu
ln -s cpu,cpuacct /sys/fs/cgroup/cpuacct
ln -s net_cls,net_prio /sys/fs/cgroup/net_cls
ln -s net_cls,net_prio /sys/fs/cgroup/net_prio

Running those commands returned the system to a usable state.

The real fix

Next time, I shall refrain from doing the lazy -l umount.

But, as a better solution, I'll also be adding --make-slave to the rbind mount command. Doing that will ensure that an unmount in the bound locations does not unmount the original mount points:

# for x in dev proc sys; do
    mount --rbind --make-slave /$x $x
  done

With --make-slave a umount -l of your chroot path does not break your system.
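
If you want to check the propagation mode of such a bind mount before unmounting anything, findmnt can show it. For example, for one of the paths from the chroot above:

# Slave mounts still receive mount events from the original /sys, but
# their own (lazy) unmounts no longer propagate back to the host.
findmnt -o TARGET,PROPAGATION /mnt/root/sys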

2021-09-28 - a singal 17 is raised

When running the iKVM software on the BMC of SuperMicro machines, we regularly see an interesting "singal" typo.

(For the interested, we use a helper script to access the KVM console: ipmikvm. Without it, you need Java support enabled in your browser, and that has always given us trouble. The ipmikvm script logs on to the web interface, downloads the required Java bytecode and runs it locally.)

Connect to somewhere, wait for the KVM console to open, close it, and you might see something like this:

$ ipmikvm 10.x.x.x
attempting login on '10.x.x.x' with user USER
connect failed sd:18
Retry =1
a singal 17 is raised
GetFileDevStr:4051 media_type = 40
GetFileDevStr:4051 media_type = 45
GetFileDevStr:4051 media_type = 40
GetFileDevStr:4051 media_type = 45
GetFileDevStr:4051 media_type = 40
GetFileDevStr:4051 media_type = 45
a singal 17 is raised

Signal 17 would be SIGCHLD (see kill -l) which is the signal a process receives when one of its children exits.
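
(On x86-64 Linux, that is; signal numbers differ per architecture. A quick check:)

$ kill -l 17
CHLD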

Where does this typo come from? A quick web search does not reveal much. But a grep through the binaries does.

$ cd ~/.local/lib/ipmikvm/iKVM__V1.69.39.0x0
$ unzip iKVM__V1.69.39.0x0.jar
$ grep -r singal .
Binary file ./libiKVM64.so matches

Where exactly is this?

$ objdump -s -j .rodata libiKVM64.so | grep singal
 27690 4d697363 00612073 696e6761 6c202564  Misc.a singal %d

Okay, at 0x27695 (0x27690 + 5) there's the text. Let's see if we can find some more:

$ objdump -Cd libiKVM64.so | grep -C3 27695

0000000000016f90 <signal_handle(int)@@Base>:
   16f90: 89 fe                 mov    %edi,%esi
   16f92: 48 8d 3d fc 06 01 00  lea    0x106fc(%rip),%rdi        # 27695 <typeinfo name for RMMisc@@Base+0x8>
   16f99: 31 c0                 xor    %eax,%eax
   16f9b: e9 98 71 ff ff        jmpq   e138 <printf@plt>

So, it's indeed used in a signal handler. (Address taken from the instruction pointer 0x16f99 plus offset 0x106fc.) And at least one person got the signal spelling right.

Was this relevant? No. But it's fun to poke around in the binaries. And by now we've gotten so accustomed to this message that I hope they never fix it.

2021-09-12 - mariabackup / selective table restore

When using mariabackup (xtrabackup/innobackupex) for your MySQL/MariaDB backups, you get a snapshot of the mysql lib dir. This is faster than doing an old-style mysqldump, but it is slightly more complicated to restore. Especially if you just want access to data from a single table.

Assume you have a big database, and you're backing it up like this, using the mariadb-backup package:

# ulimit -n 16384
# mariabackup \
    --defaults-file=/etc/mysql/debian.cnf \
    --backup \
    --compress --compress-threads=2 \
    --target-dir=/var/backups/mysql \
    [--parallel=8] [--galera-info]
...
[00] 2021-09-12 15:23:52 mariabackup: Generating a list of tablespaces
[00] 2021-09-12 15:23:53 >> log scanned up to (132823770290)
[01] 2021-09-12 15:23:53 Compressing ibdata1 to /var/backups/mysql/ibdata1.qp
...
[00] 2021-09-12 15:25:40 Compressing backup-my.cnf
[00] 2021-09-12 15:25:40         ...done
[00] 2021-09-12 15:25:40 Compressing xtrabackup_info
[00] 2021-09-12 15:25:40         ...done
[00] 2021-09-12 15:25:40 Redo log (from LSN 132823770281 to 132823770290) was copied.
[00] 2021-09-12 15:25:40 completed OK!

Optionally followed by a:

# find /var/backups/mysql \
    -type f '!' -name '*.gpg' -print0 |
  sort -z |
  xargs -0 sh -exc 'for src in "$@"; do
      dst=${src}.gpg &&
        gpg --batch --yes --encrypt \
          --compression-algo none \
          --trust-model always \
          --recipient EXAMPLE_RECIPIENT \
          --output "$dst" "$src" &&
        touch -r "$src" "$dst" &&
        rm "$src" || exit $?
  done' unused_argv0

You'll end up with a bunch of qpress-compressed gpg-encrypted files, like these:

# ls -F1 /var/backups/mysql/
aria_log.00000001.qp.gpg
aria_log_control.qp.gpg
backup-my.cnf.qp.gpg
ib_buffer_pool.qp.gpg
ibdata1.qp.gpg
ib_logfile0.qp.gpg
my_project_1/
my_project_2/
my_project_3/
mysql/
performance_schema/
xtrabackup_binlog_info.qp.gpg
xtrabackup_checkpoints.qp.gpg
xtrabackup_info.qp.gpg

Let's assume we want only my_project_3.important_table restored.

Start out by figuring out which decryption key was used:

$ gpg --list-packets /var/backups/mysql/my_project_3/important_table.ibd.qp.gpg
gpg: encrypted with 4096-bit RSA key, ID 1122334455667788, created 2017-10-10
      "Example Recipient" <recipient@example.com>"
gpg: decryption failed: No secret key
# off=0 ctb=85 tag=1 hlen=3 plen=524
:pubkey enc packet: version 3, algo 1, keyid 1122334455667788
  data: [4096 bits]
# off=527 ctb=d2 tag=18 hlen=3 plen=3643 new-ctb
:encrypted data packet:
  length: 3643
  mdc_method: 2

This PGP keyid corresponds to the fingerprint of an encryption subkey:

$ gpg --list-keys --with-subkey-fingerprints recipient@example.com
pub   rsa4096 2017-10-10 [SC] [expires: 2021-10-13]
      ...some..key...
uid           [ unknown] Example Recipient <recipient@example.com>
sub   rsa4096 2017-10-10 [E] [expires: 2021-10-13]
      0000000000000000000000001122334455667788      <-- here it is!
sub   rsa4096 2017-10-10 [A] [expires: 2021-10-13]
      ...some..other..key...
sub   rsa4096 2017-10-10 [S] [expires: 2021-10-13]
      ...yet..another..key..

That matches. Good.

After ensuring you have the right credentials, it's time to select which files we actually need. They are:

backup-my.cnf.qp.gpg
ibdata1.qp.gpg
ib_logfile0.qp.gpg
my_project_3/important_table.frm.qp.gpg
my_project_3/important_table.ibd.qp.gpg
xtrabackup_binlog_info.qp.gpg
xtrabackup_checkpoints.qp.gpg
xtrabackup_info.qp.gpg

Collect the files, decrypt and decompress.

Decrypting can be done with gpg; decompressing can be done either with qpress -dov $SRC >${SRC%.qp} or with mariabackup --decompress --target-dir=.

(Yes, for --decompress and --prepare the --target-dir= setting means the backup-location, i.e. where the backups are now. Slightly confusing indeed.)

$ find . -name '*.gpg' -print0 |
  xargs -0 sh -xec 'for src in "$@"; do
      gpg --decrypt --output "${src%.gpg}" "$src" &&
        rm "$src" || exit $?
  done' unused_argv0
$ find . -name '*.qp' -print0 |
  xargs -0 sh -xec 'for src in "$@"; do
      qpress -dov "$src" >"${src%.qp}" &&
        rm "$src" || exit $?
  done' unused_argv0

Ok, we have files. Time to whip out the correct mariabackup, for example from a versioned Docker image.

$ docker run -it \
  -v `pwd`:/var/lib/mysql:rw mariadb:10.3.23 \
  bash

Inside the Docker container, we'll fetch screen (and less), which we'll be needing shortly:

# apt-get update -qq &&
    apt-get install -qqy screen less

Fix ownership, and "prepare" the mysql files:

# cd /var/lib/mysql
# chown -R mysql:mysql .
# su -s /bin/bash mysql
$ mariabackup --prepare --use-memory=20G --target-dir=.

(You may want to tweak that --use-memory=20G to your needs. For a 10GiB ib_logfile0, this setting made a world of difference: 10 minutes restore time, instead of infinite.)

(Also, mariabackup has a --databases="DB[.TABLE1][ DB.TABLE2 ...]" option that might come in handy if you're working with all files during the --prepare phase.)

mariabackup based on MariaDB server 10.3.23-MariaDB debian-linux-gnu (x86_64)
[00] 2021-09-12 16:04:30 cd to /var/lib/mysql/
...
2021-09-12 16:04:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=132823770281
2021-09-12 16:04:30 0 [Note] InnoDB: Last binlog file 'mysql-bin.000008', position 901
[00] 2021-09-12 16:04:30 Last binlog file mysql-bin.000008, position 901
[00] 2021-09-12 16:04:31 completed OK!

At this point we don't need to copy/move them to /var/lib/mysql. We're there already.

All set, fire up a screen (or tmux, or whatever) and start mysqld, explicitly ignoring mysql permissions.

$ screen
$ mysqld --skip-grant-tables 2>&1 |
    tee /tmp/mysql-boot-error.log |
    grep -vE '\[ERROR\]|Ignoring tablespace'
2021-09-12 16:05:56 0 [Note] mysqld (mysqld 10.3.23-MariaDB-1:10.3.23+maria~bionic-log) starting as process 526 ...
...
2021-09-12 16:05:56 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 50331648 bytes
2021-09-12 16:05:56 0 [Note] InnoDB: Setting log file ./ib_logfile1 size to 50331648 bytes
2021-09-12 16:05:57 0 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
2021-09-12 16:05:57 0 [Note] InnoDB: New log files created, LSN=132823770290
...

At this point the screen would get flooded with the following error messages, if it weren't for the grep -v:

2021-09-12 16:05:57 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2021-09-12 16:05:57 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2021-09-12 16:05:57 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2021-09-12 16:05:57 0 [ERROR] InnoDB: Cannot open datafile for read-only: './my_project_1/aboutconfig_item.ibd' OS error: 71
2021-09-12 16:05:57 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2021-09-12 16:05:57 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2021-09-12 16:05:57 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2021-09-12 16:05:57 0 [ERROR] InnoDB: Could not find a valid tablespace file for ``my_project_1`.`aboutconfig_item``. Please refer to https://mariadb.com/kb/en/innodb-data-dictionary-troubleshooting/ for how to resolve the issue.
2021-09-12 16:05:57 0 [Warning] InnoDB: Ignoring tablespace for `my_project_1`.`aboutconfig_item` because it could not be opened.

You'll get those for every table that you didn't include. Let's just ignore them.

Finally, when mysqld is done plowing through the (possibly big) ibdata1, it should read something like:

...
2021-09-12 16:05:57 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2021-09-12 16:05:57 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2021-09-12 16:05:57 0 [Note] Server socket created on IP: '127.0.0.1'.
2021-09-12 16:05:57 0 [Warning] Can't open and lock time zone table: Table 'mysql.time_zone_leap_second' doesn't exist trying to live without them
2021-09-12 16:05:57 7 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2021-09-12 16:05:57 0 [Note] Reading of all Master_info entries succeeded
2021-09-12 16:05:57 0 [Note] Added new Master_info '' to hash table
2021-09-12 16:05:57 0 [Note] mysqld: ready for connections.
Version: '10.3.23-MariaDB-1:10.3.23+maria~bionic-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution

At this point, you can fire up a mysql or mysqldump client and extract the needed data.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| my_project_3       |
+--------------------+
MariaDB [(none)]> select count(*) from my_project_3.important_table;
+----------+
| count(*) |
+----------+
|        6 |
+----------+

Good, we have the data. And we didn't need to decrypt/decompress everything.
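
If you want the table out as SQL rather than running ad-hoc queries, a plain mysqldump of just that one table works too (database and table names as in this example):

$ mysqldump my_project_3 important_table > important_table.sql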

Stopping the mysqld is a matter of: mysqladmin shutdown

2021-08-31 - apt / downgrading back to current release

If you're running an older Debian or Ubuntu, you may sometimes want to check out a newer version of a package, to see if a particular bug has been fixed.

I know, this is not supported, but this scheme Generally Works (*):

  • replace the current release name in /etc/apt/sources.list with the next release — e.g. from bionic to focal (see the one-liner after this list)
  • do an apt-get update
  • and an apt-get install SOME-PACKAGE
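
That first step is a one-liner. A sketch, assuming all mirror lines mention the release by name and you're going from bionic to focal:

# sed -i -e 's/\bbionic\b/focal/g' /etc/apt/sources.list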

You can then test the package; meanwhile, put back the original sources.list so the rest of the system doesn't get upgraded by accident. (Don't forget this step.)

(*) Except for the rare case when it doesn't work, because of ABI changes in a library that were not properly recorded.

Once you know whether you want this newer package or not, you can decide to get your system back into its original state. This is a matter of downgrading the appropriate package(s).

For example:

# apt-cache policy gdb
gdb:
  Installed: 9.2-0ubuntu1~20.04
  Candidate: 9.2-0ubuntu1~20.04
  Version table:
 *** 9.2-0ubuntu1~20.04 100
        100 /var/lib/dpkg/status
     8.1.1-0ubuntu1 500
        500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages
     8.1-0ubuntu3 500
        500 http://MIRROR/ubuntu bionic/main amd64 Packages
# apt-get install gdb=8.1.1-0ubuntu1
The following packages will be DOWNGRADED:
  gdb
Do you want to continue? [Y/n]

If you use apt-find-foreign you might notice there are a bunch of packages that need downgrading back to the original state:

# apt-find-foreign
Lists with corresponding package counts:
  22  (local only)
  296 http://MIRROR/ubuntu

Lists with very few packages (or local only):
  (local only)
    - binutils
    - binutils-common
    - binutils-x86-64-linux-gnu
    - gcc-10-base
    - gdb
    - libbinutils
    - libc-bin
    - libc6
    - libc6-dbg
    - libcrypt1
    - libctf-nobfd0
    - libctf0
    - libffi7
    - libgcc-s1
    - libidn2-0
    - libncursesw6
    - libpython3.8
    - libpython3.8-minimal
    - libpython3.8-stdlib
    - libreadline8
    - libtinfo6
    - locales

Looking up the right versions for all those packages we just dragged in sounds like tedious work.

Luckily we can convince apt to do this for us, using a temporary /etc/apt/preferences.d/force_downgrade_to_bionic.pref:

Package: *
Pin: release n=bionic*
Pin-Priority: 1000

With priority 1000, apt will prefer the Bionic release so much that it suggests downgrades:

# apt-get dist-upgrade
The following packages will be REMOVED:
  libcrypt1 libctf0
The following packages will be DOWNGRADED:
  binutils binutils-common binutils-x86-64-linux-gnu gdb libbinutils
  libc-bin libc6 libc6-dbg libidn2-0 libpython3.8 libpython3.8-minimal
  libpython3.8-stdlib locales
Do you want to continue? [Y/n]

Make sure you remove force_downgrade_to_bionic.pref afterwards.
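
A quick way to confirm that the pin is active (and, afterwards, that it is gone again) is to look at the priorities apt reports for the Bionic package lists:

# apt-cache policy | grep -B1 'n=bionic'

While the pref file is in place, the priority in front of those lines should read 1000 instead of the usual 500/100.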

2021-08-15 - k8s / lightweight redirect

Spinning up pods just for parked/redirect sites? I think not.

Recently, I had to HTTP(S)-redirect a handful of hostnames to elsewhere. Pointing them into our well-maintained K8S cluster was the easy thing to do. It would manage LetsEncrypt certificates automatically using cert-manager.io.

From the cluster, I could spin up a service and an nginx deployment with a bunch of redirect/302 rules.

However, spinning up one or more nginx instances just to have it do simple redirects sounds like overkill. After all, the ingress controller already runs nginx. Why not have it do the 302s?

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: http01-issuer
    nginx.ingress.kubernetes.io/server-snippet: |
      # Do not match .well-known, or we get no certs.
      location ~ ^/($|[^.]) {
        return 302 https://target-url.example.com;
        #return 302 https://target-url.example.com/$1;
        #add_header Content-Type text/plain;
        #return 200 'This domain is parked';
      }
  name: my-redirects
spec:
  rules:
  - host: this-domain-redirects.example.com
    http:
      paths:
      - backend:
          serviceName: dummy
          servicePort: 65535
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - this-domain-redirects.example.com
    secretName: this-domain-redirects.example.com--tls

Using the nginx.ingress.kubernetes.io/server-snippet annotation, we can hook in a custom nginx snippet that does the redirecting for us.

Adding the http with backend config in the rules is mandatory. But you can point it to a non-existent dummy service, as seen above.

And if you want parked domains instead of redirected domains, simply replace return 302 https://target-url.example.com; with return 200 'This domain is parked';.
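
Once the ingress is deployed, checking the behaviour from the outside is a single curl away. With the example hostnames from the manifest above, that would look something like:

$ curl -sI https://this-domain-redirects.example.com | grep -i '^location'
location: https://target-url.example.com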

2021-08-13 - move the mouse programmatically / x11

“Can I move the mouse cursor in X programmatically?”, asked my son.

Python version

Yes you can. Here's a small Python snippet that will work on X with python-xlib:

import os
import time
import Xlib.display

display = Xlib.display.Display()
# XWarpPointer won't work on Wayland.
assert os.environ.get('XDG_SESSION_TYPE', 'x11') == 'x11', os.environ
win = display.screen()['root']            # entire desktop window
#win = display.get_input_focus().focus    # or, focused app window

for i in range(0, 1000):
    win.warp_pointer(x=i, y=i)
    display.flush()
    time.sleep(0.01)

Nothing too fancy. But getting working (XFlush()!) examples of code interfacing with the X server is not always easy. Another small example on the web — this right here — helps.

In all fairness, he did wonder if this was possible in (Java) Processing. It should be.

Processing version

Indeed, in (Java) Processing it's actually less complicated, but requires the oddly named "Robot" class.

import java.awt.AWTException;
import java.awt.Robot;

Robot robby;

int mouse_x;
int mouse_y;

void setup() {
  try {
    robby = new Robot();
  } catch (AWTException e) {
    /* Catching exceptions is mandatory in Java */
    println("Robot class not supported by your system!");
    println(e);
    exit();
  }
}

void draw() {
  mouse_x += 10;
  mouse_y += 10;
  if (mouse_x > displayWidth)
    mouse_x = 0;
  if (mouse_y > displayHeight)
    mouse_y = 0;

  set_mouse_pointer(mouse_x, mouse_y);
}

void set_mouse_pointer(int x, int y) {
  robby.mouseMove(x, y);
}

Ah, that was too easy. But can we do native X11 calls?

Native X11 Processing version

Yes! Of course we can.

You'll need to fetch the Java Native Access (JNA) Platform libraries and install them into an appropriate place for Processing. It looks like one can fetch two jars from github.com/java-native-access/jna:
jna-5.8.0.jar and jna-platform-5.8.0.jar

These jars go into your .../sketchbook/libraries like this:

sketchbook/
  libraries/
    com_sun_jna/
      library/
        jna-5.8.0.jar
        jna-platform-5.8.0.jar

But, you need to have one jar here with the name com_sun_jna.jar (like the directory). Otherwise Processing will complain. Symlinking jna-platform-5.8.0.jar works nicely:
ln -s jna-platform-5.8.0.jar com_sun_jna.jar
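
Putting it all together might look like this, assuming the sketchbook lives in ~/sketchbook and the jars ended up in ~/Downloads:

$ mkdir -p ~/sketchbook/libraries/com_sun_jna/library
$ cd ~/sketchbook/libraries/com_sun_jna/library
$ mv ~/Downloads/jna-5.8.0.jar ~/Downloads/jna-platform-5.8.0.jar .
$ ln -s jna-platform-5.8.0.jar com_sun_jna.jar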

Now that that's out of the way, you should be able to fire up Processing and run the following snippet. It's a bit longer, but it has the same access that the Python version has:

import com.sun.jna.Native;
import com.sun.jna.platform.unix.X11;

/* Declare what XWarpPointer looks like in libX11.so */
interface ExtendedX11 extends X11 /* or "Library" */ {
  int XWarpPointer(
    X11.Display dpy,
    X11.Window src_win,
    X11.Window dest_win,
    int src_x,
    int src_y,
    int src_width,
    int src_height,
    int dest_x,
    int dest_y);
}

/* We could do this, and get access to XOpenDisplay and others */
//X11 x11 = X11.INSTANCE;
/* Instead, we'll get everything that X11.INSTANCE would have
 * _plus_ XWarpPointer */
ExtendedX11 x11;

X11.Display x11display;
X11.Window x11window;

int mouse_x;
int mouse_y;

void setup() {
  /* Manually import libX11.so to get at XWarpPointer */
  x11 = (ExtendedX11)Native.loadLibrary("X11", ExtendedX11.class);
  x11display = x11.XOpenDisplay(null);
  x11window = x11.XDefaultRootWindow(x11display);
}

void draw() {
  mouse_x += 10;
  mouse_y += 10;
  if (mouse_x > displayWidth)
    mouse_x = 0;
  if (mouse_y > displayHeight)
    mouse_y = 0;

  set_mouse_pointer(mouse_x, mouse_y);
}

void set_mouse_pointer(int x, int y) {
  x11.XWarpPointer(x11display, null, x11window, 0, 0, 0, 0, x, y);
  x11.XFlush(x11display);
}

Obviously this native version doesn't work on Windows. But this one was more fun than the AWT version, no?

2021-08-12 - traverse path permissions / namei

How does one traverse a long path to quickly find out where you lack permissions?

So, I wanted to test some stuff in Debian/Buster. I already had an LXC container through LXD. I just needed to get some source files to the right place.

lxd$ sudo zfs list | grep buster
data/containers/buster-builder  692M  117G  862M
  /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder
lxd$ sudo zfs mount data/containers/buster-builder

Make sure there's somewhere where I can write:

lxd$ sudo mkdir \
  /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter
lxd$ sudo chown walter \
  /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter

Awesome. Time to rsync some files there.

otherhost$ rsync -va --progress FILES \
  lxd:/var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter/
rsync: [Receiver] ERROR: cannot stat destination
  "/var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter/":
  Permission denied (13)

Drat! Missing perms.

Now comes the nifty part. Instead of doing an ls -ld on each individual component, there is a simple tool whose name I keep forgetting: namei

lxd$ namei -l \
  /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter
f: /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder/rootfs/home/osso/walter
drwxr-xr-x root   root    /
drwxr-xr-x root   root    var
drwxr-xr-x root   root    snap
drwxr-xr-x root   root    lxd
drwxr-xr-x root   root    common
drwx--x--x lxd    nogroup lxd
drwx--x--x root   root    storage-pools
drwx--x--x root   root    data
drwx--x--x root   root    containers
d--x------ 100000 root    buster-builder
                          rootfs - Permission denied

Okay. No permissions on buster-builder then.

lxd$ sudo chmod 701 \
  /var/snap/lxd/common/lxd/storage-pools/data/containers/buster-builder

Repeat the namei call, and now all is well. Time for that rsync.

2021-08-06 - migrating vm interfaces / eth0 to ens18

How about finally getting rid of eth0 and eth1 in those ancient Ubuntu VMs that you keep upgrading?

Debian and Ubuntu have been doing a good job at keeping the old names during upgrades. But it's time to move past that.

We expect ens18 and ens19 now. There's no need to hang on to the past. (And you have moved on to Netplan already, yes?)

Steps:

  • rm /etc/udev/rules.d/80-net-setup-link.rules
  • update-initramfs -u
  • rm /etc/systemd/network/50-virtio-kernel-names.link
  • Update all references in /etc to the new ens18 style names (see the grep below).
  • Reboot.
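
For the "update all references" step, a recursive grep over /etc quickly shows which files still mention the old names:

# grep -rlE '\beth[0-9]\b' /etc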

2021-08-05 - kioxia nvme / num_err_log_entries 0xc004 / smartctl

So, these new Kioxia NVMe drives were incrementing the num_err_log_entries as soon as they were inserted into the machine. But the error said INVALID_FIELD. What gives?

In contrast to the other (mostly Intel) drives, these drives started incrementing the num_err_log_entries as soon as they were plugged in:

# nvme smart-log /dev/nvme21n1
Smart Log for NVME device:nvme21n1 namespace-id:ffffffff
...
num_err_log_entries                 : 932

The relevant errors should be readable in the error-log. All 64 errors in the log looked the same:

error_count  : 932
sqid         : 0
cmdid        : 0xc
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba          : 0xffffffffffffffff
nsid         : 0x1
vs           : 0

INVALID_FIELD, what is this?

The error count kept increasing regularly — like clockwork actually. And the internet gave us no clues what this might be.

It turns out it was our monitoring. The Zabbix scripts we employ fetch drive health status values from various sources. And one of the things they do, is run smartctl -a on all drives. And for every such call, the error count was incremented.
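
This is easy to demonstrate: read the counter, poke the drive with smartctl once, and read the counter again (device name as in the example above):

# nvme smart-log /dev/nvme21n1 | grep ^num_err
# smartctl -a /dev/nvme21n1 > /dev/null
# nvme smart-log /dev/nvme21n1 | grep ^num_err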

# nvme list
Node           SN            Model                FW Rev
-------------  ------------  -------------------  --------
...
/dev/nvme20n1  PHLJ9110xxxx  INTEL SSDPE2KX010T8  VDV10131
/dev/nvme21n1  X0U0A02Dxxxx  KCD6DLUL3T84         0102
/dev/nvme22n1  X0U0A02Jxxxx  KCD6DLUL3T84         0102

If we run it on the Intel drive, we get this:

# smartctl -a /dev/nvme20n1
...
Model Number:                       INTEL SSDPE2KX010T8
...

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002
# nvme smart-log /dev/nvme20n1 | grep ^num_err
num_err_log_entries                 : 0
# nvme error-log /dev/nvme20n1 | head -n12
Error Log Entries for device:nvme20n1 entries:64
.................
 Entry[ 0]
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0

But on the Kioxias, we get this:

# smartctl -a /dev/nvme21n1
...
Model Number:                       KCD6DLUL3T84
...

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x6002
# nvme smart-log /dev/nvme21n1 | grep ^num_err
num_err_log_entries                 : 933
# nvme error-log /dev/nvme21n1 | head -n12
Error Log Entries for device:nvme21n1 entries:64
.................
 Entry[ 0]
.................
error_count  : 933
sqid         : 0
cmdid        : 0x6
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba          : 0xffffffffffffffff
nsid         : 0x1
vs           : 0

Apparently the Kioxia drive does not like what smartctl is sending.

Luckily this turned out to be an issue that smartctl claims responsibility for. And it had already been fixed.

If this works, the problem is that this drive requires that the broadcast namespace is specified if SMART/Health and Error Information logs are requested. This issue was unspecified in early revisions of the NVMe standard.

In our case, applying this fix was easy on this Ubuntu/Bionic machine:

# apt-cache policy smartmontools
smartmontools:
  Installed: 6.5+svn4324-1ubuntu0.1
  Candidate: 6.5+svn4324-1ubuntu0.1
  Version table:
     7.0-0ubuntu1~ubuntu18.04.1 100
        100 http://MIRROR/ubuntu bionic-backports/main amd64 Packages
 *** 6.5+svn4324-1ubuntu0.1 500
        500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
# apt-get install smartmontools=7.0-0ubuntu1~ubuntu18.04.1

This smartmontools update from 6.5 to 7.0 not only got rid of the new errors, it also showed more relevant health output.

Now if we could just reset the error-log count on the drives, then this would be even better...

2021-06-18 - openssl / error 42 / certificate not yet valid

In yesterday's post about not being able to connect to the SuperMicro iKVM IPMI, I wondered “why stunnel/openssl did not send error 45 (certificate_expired) for a not-yet-valid certificate.” Here's a closer examination.

Quick recap: yesterday, I got SSL alert/error 42 as response to a client certificate that was not yet valid. The server was living in 2015 and refused to accept a client certificate that would only become valid in 2016. That error 42 code could mean anything, so checking the period of validity of the certificate was not something that occurred to me immediately.

Here are the SSL alert codes in the 40s, taken from RFC 5246 7.2+:

code  identifier               description
40    handshake_failure        The sender was unable to negotiate an acceptable set of security parameters with the available options.
41    no_certificate           This alert was used in SSLv3 but not any version of TLS. (Don't use this.)
42    bad_certificate          A certificate was corrupt, contained signatures that did not verify correctly, etc.
43    unsupported_certificate  A certificate was of an unsupported type.
44    certificate_revoked      A certificate was revoked by its signer.
45    certificate_expired      A certificate has expired or is not currently valid.
46    certificate_unknown      Some other (unspecified) issue arose in processing the certificate, rendering it unacceptable.
47    illegal_parameter        A field in the handshake was out of range or inconsistent with other fields.
48    unknown_ca               A valid certificate chain or partial chain was received, but the certificate was rejected because it was not signed by a trusted CA.
49    access_denied            A valid certificate or PSK was received, but when access control was applied, the sender decided not to proceed with negotiation.

I would argue that a certificate that is not valid yet would be better off with error 45 than error 42. After all, the description from the RFC includes the phrase: “or is not currently valid.”

It turns out it was OpenSSL that opted for the more generic 42.

Testing

Here's how you generate one too old and one too new certificate:

$ CURDATE=$(date -R) &&
    sudo date --set="$(date -R -d'-2 days')" &&
    openssl req -batch -x509 -nodes -newkey rsa:4096 -days 1 \
      -keyout not-valid-anymore.key -out not-valid-anymore.crt &&
    sudo date --set="$CURDATE"
$ CURDATE=$(date -R) &&
    sudo date --set="$(date -R -d'+2 days')" &&
    openssl req -batch -x509 -nodes -newkey rsa:4096 -days 1 \
      -keyout not-yet-valid.key -out not-yet-valid.crt &&
    sudo date --set="$CURDATE"
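
The s_server invocations below also need an ordinary server certificate; any self-signed pair will do, for example:

$ openssl req -batch -x509 -nodes -newkey rsa:2048 -days 1 \
    -keyout server.key -out server.crt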

You can then use openssl s_server and openssl s_client to test the libssl behaviour:

$ openssl s_server -port 8443 \
    -cert server.crt -key server.key \
    -CAfile not-valid-anymore.crt \
    -verify_return_error -verify 1
$ openssl s_client -connect 127.0.0.1:8443 \
    -servername whatever -tls1_2 -showcerts -debug \
    -cert not-valid-anymore.crt -key not-valid-anymore.key
...
read from 0x55ca16174150 [0x55ca161797a8] (2 bytes => 2 (0x2))
0000 - 02 2d    .-
140480733312320:error:14094415:SSL routines:ssl3_read_bytes:
  sslv3 alert certificate expired:../ssl/record/rec_layer_s3.c:1543:
  SSL alert number 45

So, error 45 (certificate_expired) for the not valid anymore case.

And for the not yet valid case?

$ openssl s_server -port 8443 \
    -cert server.crt -key server.key \
    -CAfile not-yet-valid.crt \
    -verify_return_error -verify 1
$ openssl s_client -connect 127.0.0.1:8443 \
    -servername whatever -tls1_2 -showcerts -debug \
    -cert not-yet-valid.crt -key not-yet-valid.key
...
read from 0x55be814cd150 [0x55be814d27a8] (2 bytes => 2 (0x2))
0000 - 02 2a    .*
140374994916672:error:14094412:SSL routines:ssl3_read_bytes:
  sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543:
  SSL alert number 42

Ah, there's that pesky number 42 again.

Source code

Ultimately, this number is produced during the translation from internal OpenSSL X509 errors to SSL/TLS alerts. Previously in ssl_verify_alarm_type(), nowadays in ssl_x509err2alert().

Traced back to:

commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall
Date:   Mon Dec 21 10:52:47 1998 +0000

    Import of old SSLeay release: SSLeay 0.8.1b

In ssl/s3_both.c there is:

int ssl_verify_alarm_type(type)
int type;
        {
        int al;

        switch(type)
                {
        case X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT:
// ...
        case X509_V_ERR_CERT_NOT_YET_VALID:
// ...
                al=SSL3_AD_BAD_CERTIFICATE;
                break;
        case X509_V_ERR_CERT_HAS_EXPIRED:
                al=SSL3_AD_CERTIFICATE_EXPIRED;
                break;

And more recently, the translation happens in ssl/statem/statem_lib.c, via the table consulted by ssl_x509err2alert():

static const X509ERR2ALERT x509table[] = {
    {X509_V_ERR_APPLICATION_VERIFICATION, SSL_AD_HANDSHAKE_FAILURE},
    {X509_V_ERR_CA_KEY_TOO_SMALL, SSL_AD_BAD_CERTIFICATE},
// ...
    {X509_V_ERR_CERT_NOT_YET_VALID, SSL_AD_BAD_CERTIFICATE},
// ...
    {X509_V_ERR_CRL_HAS_EXPIRED, SSL_AD_CERTIFICATE_EXPIRED},
// ...
}

Apparently behaviour has been like this since 1998 and before. A.k.a. since forever. I guess we'll have to keep the following list in mind next time we encounter error 42:

X509_V_ERR_CA_KEY_TOO_SMALL
X509_V_ERR_CA_MD_TOO_WEAK
X509_V_ERR_CERT_NOT_YET_VALID
X509_V_ERR_CERT_REJECTED
X509_V_ERR_CERT_UNTRUSTED
X509_V_ERR_CRL_NOT_YET_VALID
X509_V_ERR_DANE_NO_MATCH
X509_V_ERR_EC_KEY_EXPLICIT_PARAMS
X509_V_ERR_EE_KEY_TOO_SMALL
X509_V_ERR_EMAIL_MISMATCH
X509_V_ERR_ERROR_IN_CERT_NOT_AFTER_FIELD
X509_V_ERR_ERROR_IN_CERT_NOT_BEFORE_FIELD
X509_V_ERR_ERROR_IN_CRL_LAST_UPDATE_FIELD
X509_V_ERR_ERROR_IN_CRL_NEXT_UPDATE_FIELD
X509_V_ERR_HOSTNAME_MISMATCH
X509_V_ERR_IP_ADDRESS_MISMATCH
X509_V_ERR_UNABLE_TO_DECODE_ISSUER_PUBLIC_KEY
X509_V_ERR_UNABLE_TO_DECRYPT_CERT_SIGNATURE
X509_V_ERR_UNABLE_TO_DECRYPT_CRL_SIGNATURE

That is, assuming you're talking to libssl (OpenSSL). But that's generally the case.

P.S. The GPLv2 OpenSSL replacement WolfSSL appears to do the right thing in DoCertFatalAlert(), returning certificate_expired for both ASN_AFTER_DATE_E and ASN_BEFORE_DATE_E. Yay!

2021-06-17 - supermicro / ikvm / sslv3 alert bad certificate

Today I was asked to look at a machine that disallowed iKVM IPMI console access. It allowed access through the “iKVM/HTML5”, but when connecting using the “Console Redirection” (Java client, see also ipmikvm) it would quit after 10 failed attempts.

TL;DR: The clock of the machine had been reset to a timestamp earlier than the first validity of the supplied client certificate. After changing the BMC time from 2015 to 2021, everything worked fine again.

If you're interested, here are some steps I took to debug the situation:

Debugging

My attempts at logging in started like this:

$ ipmikvm 10.1.2.3
...
Retry =1
Retry =2
Retry =3

Okay, no good. Luckily syslog had some info (some lines elided):

Service [ikvm] accepted (FD=3) from 127.0.0.1:40868
connect_blocking: connected 10.1.2.3:5900
Service [ikvm] connected remote server from 10.0.0.2:38222
Remote socket (FD=9) initialized
SNI: sending servername: 10.1.2.3
...
CERT: Locally installed certificate matched
Certificate accepted: depth=0,
  /C=US/ST=California/L=San Jose/O=Super Micro Computer
  /OU=Software/CN=IPMI/emailAddress=Email
SSL state (connect): SSLv3 read server certificate A
...
SSL state (connect): SSLv3 write client certificate A
...
SSL state (connect): SSLv3 write finished A
SSL state (connect): SSLv3 flush data
SSL alert (read): fatal: bad certificate
SSL_connect: 14094412: error:14094412:
  SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate
Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Remote socket (FD=9) closed
Local socket (FD=3) closed
Service [ikvm] finished (0 left)

The Java iKVM client uses an stunnel sidecar to take care of the TLS bits, hence the extra connection to 127.0.0.1. Right now, that's not important. What is important is the “SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate” message. Apparently someone disagrees with something in the TLS handshake.

First question: is the client or the server to blame? The log isn't totally clear on that. Let's find out who disconnects.

Rerunning ipmikvm 10.1.2.3 but with sh -x shows us how to invoke the client:

$ sh -x /usr/bin/ipmikvm 10.1.2.3
...
+ dirname /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar
+ exec java -Djava.library.path=/home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0 \
    -cp /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar \
    tw.com.aten.ikvm.KVMMain 10.1.2.3 RXBoZW1lcmFsVXNlcg== TGlrZUknZFRlbGxZb3U= \
    null 63630 63631 0 0 1 5900 623

We can rerun that one and see what it does by running it through strace and redirecting the syscalls to tmp.log. (The output is large enough not to want it on your console, trust me.)

$ strace -s 8192 -ttf \
    java -Djava.library.path=/home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0 \
    -cp /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar \
    tw.com.aten.ikvm.KVMMain 10.1.2.3 RXBoZW1lcmFsVXNlcg== TGlrZUknZFRlbGxZb3U= \
    null 63630 63631 0 0 1 5900 623 \
    2> tmp.log
connect failed sd:18
Retry =1
^C

We expect a tcp connect() to 10.1.2.3:

[pid 130553] 08:50:30.418890 connect(
  9, {sa_family=AF_INET, sin_port=htons(5900),
  sin_addr=inet_addr("10.1.2.3")}, 16)
    = -1 EINPROGRESS (Operation now in progress)

Not quite connected yet, but it's apparently non-blocking. Scrolling downward in pid 130553 we see:

[pid 130553] 08:50:30.419037 poll(
  [{fd=9, events=POLLIN|POLLOUT}], 1, 10000)
    = 1 ([{fd=9, revents=POLLOUT}])

Good, it's connected. Now following the read/write (usually recv/send or a similar syscall, but not this time) on fd 9 shows us:

[pid 130553] 08:50:30.684609 read(
  9, "\25\3\3\0\2", 5) = 5
[pid 130553] 08:50:30.684664 read(
  9, "\2*", 2) = 2
[pid 130553] 08:50:30.684775 sendto(
  6, "<31>Jun 17 08:50:30 stunnel: LOG7[130546:139728830977792]:
      SSL alert (read): fatal: bad certificate",
  99, MSG_NOSIGNAL, NULL, 0) = 99
...
[pid 130553] 08:50:30.685333 close(9)   = 0

So, the client is the one closing the connection after receiving "\2" (2) and "*" (0x2A, 42 decimal).

OpenSSL errors

We can go back in the strace output and look for certificates used:

[pid 130519] 08:50:29.203145 openat(
  AT_FDCWD, "/tmp/1623912629200/client.crt",
  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 18
[pid 130519] 08:50:29.203497 write(
  18, "-----BEGIN CERTIFICATE-----\nMIID7TCCAt...) = 1424

The sidecar setup writes a client.crt, client.key and server.crt.

If we extract their contents from the strace output and write them to a file, we can use the openssl s_client to connect directly and get additional debug info:

$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug
...
read from 0x55bbada227a0 [0x55bbada2a708]
  (2 bytes => 2 (0x2))
0000 - 02 28    .(
140555689141568:error:14094410:SSL routines:ssl3_read_bytes:
  sslv3 alert handshake failure:../ssl/record/rec_layer_s3.c:1543:
  SSL alert number 40

So, not supplying a client certificate gets us an error 40 (0x28), followed by a disconnect from the server (read returns -1). This is fine. Error 40 (handshake_failure) means that one or more security parameters were bad. In this case because we didn't supply the client certificate.

What happens if we send a self-generated client certificate?

$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug -cert CUSTOMCERT.crt -key CUSTOMCERT.key
...
read from 0x5604603d7750 [0x5604603dcd68]
  (2 bytes => 2 (0x2))
0000 - 02 30    .0
139773856281920:error:14094418:SSL routines:ssl3_read_bytes:
  tlsv1 alert unknown ca:../ssl/record/rec_layer_s3.c:1543:
  SSL alert number 48

Error 48 (unknown_ca). That makes sense, as the server does not know the CA of our custom generated certificate.

Lastly with the correct certificate, we get an error 42 (0x2A):

$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug -cert client.crt -key client.key
...
read from 0x556b27ca7cd0 [0x556b27cad2e8]
  (2 bytes => 2 (0x2))
0000 - 02 2a    .*
140701791647040:error:14094412:SSL routines:ssl3_read_bytes:
  sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543:
  SSL alert number 42

Error 42 is bad_certificate, with this description from RFC 5246 (7.2.2):

   bad_certificate
      A certificate was corrupt, contained signatures that did not
      verify correctly, etc.

We're now quite certain it's our client certificate that is being rejected. But we're no closer to the reason why. If we openssl-verify client.crt locally, it appears to be just fine.

Upgrading and inspecting the BMC firmware

This particular motherboard — X10SRD-F — already had the latest firmware according to the SuperMicro BIOS IPMI downloads: REDFISH_X10_388_20200221_unsigned.zip

As a last-ditch effort, we checked if we could upgrade to a newer version. After all, the changelog for version 3.90 (and similarly for 3.89) said:

7. Corrected issues with KVM console.

Ignoring the fact that version 3.89 was not listed for our hardware, we went ahead and upgraded to 3.89. That went smoothly, but the problem persisted.

Upgrade to 3.90 then? Or maybe there is something else we're overlooking. Let's see if we can dissect the firmware:

$ unzip ../REDFISH_X10_390_20200717_unsigned.zip
Archive:  ../REDFISH_X10_390_20200717_unsigned.zip
  inflating: REDFISH_X10_390_20200717_unsigned.bin
...
$ binwalk REDFISH_X10_390_20200717_unsigned.bin

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
103360        0x193C0         CRC32 polynomial table, little endian
4194304       0x400000        CramFS filesystem, little endian, size: 15253504, version 2, sorted_dirs, CRC 0x4148A5CC, edition 0, 8631 blocks, 1100 files
20971520      0x1400000       uImage header, header size: 64 bytes, header CRC: 0xC3B2AF42, created: 2020-07-17 09:02:52, image size: 1537698 bytes, Data Address: 0x40008000, Entry Point: 0x40008000, data CRC: 0x4ACB7660, OS: Linux, CPU: ARM, image type: OS Kernel Image, compression type: gzip, image name: "21400000"
20971584      0x1400040       gzip compressed data, maximum compression, has original file name: "linux.bin", from Unix, last modified: 2020-07-17 08:56:49
24117248      0x1700000       CramFS filesystem, little endian, size: 7446528, version 2, sorted_dirs, CRC 0x1D3A953F, edition 0, 3089 blocks, 456 files

4194304 and 24117248 are both multiples of 4096 (0x1000), as is obvious from the zeroes in the hexadecimal column. That speeds up the dd step a bit:

$ dd if=REDFISH_X10_390_20200717_unsigned.bin \
    bs=$(( 0x1000 )) skip=$(( 0x400000 / 0x1000 )) \
    count=$(( (0x1400000 - 0x400000) / 0x1000 )) \
    of=cramfs1.bin
$ dd if=REDFISH_X10_390_20200717_unsigned.bin \
    bs=$(( 0x1000 )) skip=$(( 0x1700000 / 0x1000 )) \
    of=cramfs2.bin
$ du -b cramfs*
16777216  cramfs1.bin
9437184   cramfs2.bin

We can mount these and inspect their contents:

$ sudo mkdir /mnt/cramfs{1,2}
$ sudo mount -t cramfs ./cramfs1.bin /mnt/cramfs1
$ sudo mount -t cramfs ./cramfs2.bin /mnt/cramfs2

cramfs1.bin contains a Linux filesystem with an stunnel configuration:

$ ls /mnt/cramfs1/
bin  linuxrc     proc   SMASH  var
dev  lost+found  run    sys    web
etc  mnt         sbin   tmp
lib  nv          share  usr
$ ls /mnt/cramfs1/etc/stunnel/
client.crt  server.key
server.crt  stunnel.conf

This also looks sane. The server.key matches the server.crt we already had. And the client.crt also matches what we had. And any and all validation on these just succeeds.
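
(The matching here boils down to comparing moduli and fingerprints; something along these lines, with the md5sum only there to make the long moduli easy to compare:)

$ openssl x509 -noout -modulus -in /mnt/cramfs1/etc/stunnel/server.crt | md5sum
$ openssl rsa -noout -modulus -in /mnt/cramfs1/etc/stunnel/server.key | md5sum

Identical sums mean the key belongs to the certificate.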

cramfs2.bin then?

$ ls /mnt/cramfs2/
cgi
cgi-bin
css
extract.in
iKVM__V1.69.38.0x0.jar.pack.gz
...

This looks like the webserver contents on https://10.1.2.3. (Interestingly the iKVM__V1.69.38.0x0.jar.pack.gz file differs between 3.88 and 3.89 and 3.90, but that turned out to be of no significance.)

Peering into the jar yielded no additional clues unfortunately:

$ unpack200 /mnt/cramfs2/iKVM__V1.69.38.0x0.jar.pack.gz iKVM__V1.69.38.0x0.jar
$ unzip iKVM__V1.69.38.0x0.jar
Archive:  iKVM__V1.69.38.0x0.jar
PACK200
  inflating: META-INF/MANIFEST.MF
...
$ ls res/*.crt res/*.key
res/client.crt
res/client.key
res/server.crt

Same certificates. Everything matches.

Date and Time

At this point I was about ready to give up. I had tried the Syslog option in the BMC, which had given me zero output thus far. I had tried replacing the webserver certificate. Upgrading the BMC...

Out of ideas, I was mindlessly clicking around in the web interface. That landed me at Configuration -> Date and Time. Apparently the local date was set to somewhere in the year 2015.

We might as well fix that and try one last time.

Yes! After fixing the date, connecting suddenly worked.

Immediately all pieces fit together:

$ openssl x509 -in client.crt -noout -text |
    grep Validity -A3

        Validity
            Not Before: May 19 09:46:36 2016 GMT
            Not After : May 17 09:46:36 2026 GMT
        Subject: C = US, ST = California, L = San Jose,
          O = Super Micro Computer, OU = Software,
          CN = IPMI, emailAddress = Email

Crap. The server had been looking at a “not yet valid” certificate the whole time. The certificate would be valid between 2016 and 2026, but the server was still living in the year 2015.

I wonder why stunnel/openssl did not send error 45 (certificate_expired). After all, the RFC says “a certificate has expired or is not currently valid” (my emphasis). That would've pointed us to the cause immediately.
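
In hindsight, we could have reproduced the server's point of view locally with openssl verify and its -attime flag, pretending it is still 2015. (ca.crt is a placeholder here for the issuing CA, which we would have had to extract first.)

$ openssl verify -attime $(date -d 2015-06-01 +%s) -CAfile ca.crt client.crt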

This problem was one giant time sink. But we did learn a few things about the structure of the BMC firmware. And, also important: after the 17th of May in the year 2026, the iKVM connections will stop working unless we upgrade the firmware or fiddle with the time.

Maybe set a reminder for that event, in case these servers are still around by then...

2021-05-24 - ancient acer h340 nas / quirks and fixes

A while ago, I got a leftover NAS from someone. It's ancient, and what's worse, it's headless — that is, there is no video adapter on it. So installing an OS on it, or debugging failures is not entirely trivial.

Here are some tips and tricks to get it moving along.

First, according to dmidecode it's a:

Manufacturer: Acer
Product Name: Aspire easyStore H340
Product Name: WG945GCM

The previous owner had already placed the JP3 jumper on the motherboard. This likely made it boot from USB. So, I debootstrapped a USB stick with Debian/Buster, a user and an autostarting sshd daemon so I could look around. I use the following on-boot script to get some network:

#!/bin/sh -x
#
# Called from /etc/systemd/system/networking.service.d/override.conf:
#   ExecStartPre=/usr/local/bin/auto-ifaces
#

/bin/rm /etc/network/interfaces
for iface in $(
        /sbin/ip link | /bin/sed -e '/^[0-9]/!d;s/^[0-9]*: //;s/:.*//'); do
    if test $iface = lo; then
        printf 'auto lo\niface lo inet loopback\n' \
          >>/etc/network/interfaces
    elif test $iface = enp9s0 -o $iface = enp9s0f0; then
        # physically broken interface
        # [   19.191416] sky2 0000:09:00.0: enp9s0f0: phy I/O error
        # [   19.191442] sky2 0000:09:00.0: enp9s0f0: phy I/O error
        /sbin/ip link set down $iface
    else
        # ethtool not needed with dongle anymore.
        #/sbin/ethtool -s $iface speed 100 duplex full
        printf 'auto %s\niface %s inet dhcp\n' $iface $iface \
          >>/etc/network/interfaces
    fi
done

(At first, enp9s0 wasn't broken yet. Later, I let a USB dongle take over its role.)

Note that a S.M.A.R.T. (disk) error would land it in "Press F1 to continue"-land. Attaching a keyboard and pressing F1 — or, alternatively, ejecting all disks before booting — bypassed that. Watch for the blue blinking i symbol. If there isn't any, it's not booting.

Next, it appeared that it refused to shut down on poweroff. After completing the shutdown sequence, it would reboot.

Various reports on the internet claim that disabling USB 3.0 in the BIOS works. But, I don't have access to the BIOS (no video, remember) and I'm using the USB for a network dongle.

Adding XHCI_SPURIOUS_WAKEUP (262144) | XHCI_SPURIOUS_REBOOT (8192) to the Linux command line as xhci_hcd.quirks=270336 appeared to work, but this was fake news, at least on Linux kernel 4.19.0-16-amd64.

The acpi=off command line suggestion seen elsewhere is a bit much. It will not reboot after shutdown, but it will require a manual button press. And additionally, only one CPU (thread) will be detected, and dmesg gets flooded with PCI Express errors.

The solution that did work, was the following (at shutdown) in /etc/rc0.d/K02rmmod-against-reboot:

#!/bin/sh
/usr/sbin/modprobe -r ehci_pci ehci_hcd
/bin/true

(Removing ehci_pci when it's still running breaks the USB networking, so be careful when you test.)

Lastly, you may want to install mediasmartserverd — the Linux daemon that controls the status LEDs of Acer Aspire EasyStore H340 [...]. It will get you some nice colored LEDs next to your hard disks. (Do use --brightness=6, as brightness level one is too weak.)

2021-05-20 - partially removed pve node / proxmox cluster

The case of the stale (removed but not removed) PVE node in our Proxmox cluster.

On one of our virtual machine clusters, a node — pve3 — had been removed on purpose, yet it was still visible in the GUI with a big red cross (because it was unavailable). This was not only ugly, but also caused problems for the node enumeration done by proxmove.

The node had been properly removed, according to the removing a cluster node documentation. Yet it was apparently still there.

# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pve1 (local)
         2          1 pve2
         3          1 pve4
         5          1 pve5

This listing looked fine: pve3 (nodeid 4) was absent. And all remaining nodes showed the same info.

But, a quick grep through /etc did turn up some references to pve3:

# grep pve3 /etc/* -rl
/etc/corosync/corosync.conf
/etc/pve/.version
/etc/pve/.members
/etc/pve/corosync.conf

Those two corosync.conf config files were in sync: identical to each other and to the copies on the other three nodes. But they did contain a reference to the removed node:

nodelist {
...
  node {
    name: pve3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.x.x.x
  }

The .version and .members JSON files differed slightly between the nodes, but they all included five nodes (one too many):

# cat /etc/pve/.members
{
"nodename": "pve1",
"version": 77,
"cluster": { "name": "my-clustername", "version": 6, "nodes": 5, "quorate": 1 },
"nodelist": {
  "pve1": { "id": 1, "online": 1, "ip": "10.x.x.x"},
  "pve2": { "id": 2, "online": 1, "ip": "10.x.x.x"},
  "pve3": { "id": 4, "online": 0, "ip": "10.x.x.x"},
  "pve4": { "id": 3, "online": 1, "ip": "10.x.x.x"},
  "pve5": { "id": 5, "online": 1, "ip": "10.x.x.x"}
  }
}

The document versions were all a bit different, but the cluster versions were the same between the nodes. Except for one node, on which the cluster version was 5 instead of 6.

Restarting corosync on that node fixed that problem: the cluster versions were now 6 everywhere.
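
(That restart is just the regular systemd unit, run on the node that was lagging behind:)

# systemctl restart corosync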

With that problem tackled, it was a matter of:

# pvecm expected 4
# pvecm delnode pve3
Killing node 4

All right! Even though it did not list nodeid 4 in the pvecm nodes output, delnode did find the right one. And this properly removed all traces of pve3 from the remaining files, making the cluster happy again.

2021-05-11 - enable noisy build / opensips

How do you enable the noisy build when building OpenSIPS? The one where the actual gcc invocations are not hidden.

In various projects the compilation and linking steps called by make are cleaned up, so you only see things like:

Compiling db/db_query.c
Compiling db/db_id.c
...

This looks cleaner. But sometimes you want to see (or temporarily change) the compilation/linking call:

gcc -g -O9 -funroll-loops -Wcast-align -Wall [...] -c db/db_query.c -o db/db_query.o
gcc -g -O9 -funroll-loops -Wcast-align -Wall [...] -c db/db_id.c -o db/db_id.o
...

(I elided about 800 characters per line in this example. Noisy indeed.)

The setting to disable this “clean” output and favor a “noisy” one generally exists, but there is no consensus on a standardized name.

For projects built with CMake, enabling verbose mode probably does the trick (VERBOSE=1). For other projects, the name varies. (I'm partial to the NOISY_BUILD=yes that Asterisk PBX uses.)

For OpenSIPS you can achieve this effect by setting the Q variable to empty:

$ make Q= opensips modules

Other OpenSIPS make variables

Okay. And what about parallel jobs?

Use the FASTER=1 variable, along with the -j flag:

$ make -j4 FASTER=1 opensips modules

And building only a specific module?

Generally you'll want to preset which modules to include or exclude in Makefile.conf. There you have exclude_modules?= and include_modules?= which are used unless you set them earlier (on the command line).

(Beware: touching Makefile.conf will force a rebuild of the entire project.)

For a single run, you can specify either one of them or modules on the command line, where modules takes space separated modules with a modules/ prefix:

$ make modules exclude_modules=presence
(builds all modules except presence)
$ make modules include_modules=presence
(builds the selected modules from Makefile.conf and the presence module)
$ make modules modules=modules/presence
(builds only the presence module)

This might save you some time.

2021-05-05 - missing serial / scsi / disk by-id

When you have a lot of storage devices, it's best practice to assign them to RAID arrays or ZFS pools by something identifiable. Preferably something that's also readable when the disk is outside a computer. Commonly: the disk manufacturer and the serial number.

Usually, both the disk manufacturer and the disk serial number are printed on a small label on the disk. So, if you're in the data center replacing a disk, one glance is sufficient to know you got the correct disk.

For this reason, our ZFS storage pool configurations look like this:

  NAME                                  STATE
  tank                                  ONLINE
    raidz2-0                            ONLINE
      scsi-SSEAGATE_ST10000NM0226_6351  ONLINE
      scsi-SSEAGATE_ST10000NM0226_0226  ONLINE
      scsi-SSEAGATE_ST10000NM0226_8412  ONLINE
      scsi-SSEAGATE_ST10000NM0226_...   ONLINE

Instead of this:

  NAME        STATE
  tank        ONLINE
    raidz2-0  ONLINE
      sda     ONLINE
      sdb     ONLINE
      sdc     ONLINE
      sd...   ONLINE

If you're replacing a faulty disk, you can match it to the serial number and confirm that you haven't done anything stupid.

Referencing these disks is as easy as using the symlink in /dev/disk/by-id.
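
Creating a pool like the one above is then a matter of pointing zpool create at those symlinks. A sketch, with the serials shortened as in the listing:

# zpool create tank raidz2 \
    /dev/disk/by-id/scsi-SSEAGATE_ST10000NM0226_6351 \
    /dev/disk/by-id/scsi-SSEAGATE_ST10000NM0226_0226 \
    /dev/disk/by-id/scsi-SSEAGATE_ST10000NM0226_8412 \
    ...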

No model names and serial numbers in udev?

But I don't have any serial numbers in /dev/disk/by-id, I only have these wwn- numbers.

If your /dev/disk/by-id looks like this:

# ls -1 /dev/disk/by-id/
scsi-35000c5001111138e
scsi-35000c50011111401
...
wwn-0x5000c5001111138e
wwn-0x5000c5001111140f
...

And it has no manufacturer/serial symlinks, then udev is letting you down.

Looking at udevadm info /dev/sda may reveal that you're missing some udev rules. On this particular machine I did have ID_SCSI_SERIAL, but not SCSI_VENDOR, SCSI_MODEL or SCSI_IDENT_SERIAL.
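
A quick check could look like this; no output means the properties are missing:

# udevadm info /dev/sda |
    grep -E 'ID_SCSI_SERIAL|SCSI_VENDOR|SCSI_MODEL|SCSI_IDENT_SERIAL'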

On Ubuntu/Focal, the fix was to install sg3-utils-udev which provides udev rules in 55-scsi-sg3_id.rules and 58-scsi-sg3_symlink.rules:

# apt-get install sg3-utils-udev
# udevadm trigger --action=change
# ls -1 /dev/disk/by-id/
scsi-35000c5001111138e
scsi-35000c50011111401
...
scsi-SSEAGATE_ST10000NM0226_8327
scsi-SSEAGATE_ST10000NM0226_916D
...
wwn-0x5000c5001111138e
wwn-0x5000c5001111140f
...

Awesome. Devices with serial numbers. I'm off to create a nice zpool.

2021-04-29 - smtp_domain / gitlab configuration

What is the smtp_domain in the GitLab configuration? There is also a smtp_address and smtp_user_name; so what would you put in the “domain” field?

Contrary to what the examples on GitLab Omnibus SMTP lead you to believe: smtp_domain is the HELO/EHLO domain; i.e. your hostname.
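
In /etc/gitlab/gitlab.rb terms that would look something like this (the hostnames are made-up examples):

gitlab_rails['smtp_address'] = "smtp.example.com"      # the relay we deliver to
gitlab_rails['smtp_user_name'] = "gitlab@example.com"  # the login on that relay
gitlab_rails['smtp_domain'] = "gitlab.example.com"     # the EHLO name, i.e. this host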

RFC 5321 has this to say about the HELO/EHLO parameter:

   o  The domain name given in the EHLO command MUST be either a primary
      host name (a domain name that resolves to an address RR) or, if
      the host has no name, an address literal, as described in
      Section 4.1.3 and discussed further in the EHLO discussion of
      Section 4.1.4.

So, the term [smtp_]helo_hostname would've been a lot more appropriate.

2021-04-26 - yubico otp / pam / openvpn

Quick notes on setting up pam_yubico.so with OpenVPN.

Add to OpenVPN server config:

plugin /usr/lib/x86_64-linux-gnu/openvpn/plugins/openvpn-plugin-auth-pam.so openvpn

# Use a generated token instead of user/password for up
# to 16 hours, so you'll need to re-enter your otp daily.
auth-gen-token 57600

Sign up at https://upgrade.yubico.com/getapikey/. It's really quick. Store client_id and secret (or id and key respectively). You'll need them in the config below.

Get PAM module:

# apt-get install --no-install-recommends libpam-yubico

Create /etc/pam.d/openvpn:

# This file is called /etc/pam.d/openvpn; and it is used by openvpn through:
# plugin /usr/lib/x86_64-linux-gnu/openvpn/plugins/openvpn-plugin-auth-pam.so openvpn

# Settings for pam_yubico.so
# --------------------------
# debug
#   yes, we want debugging (DISABLE when done)
# debug_file=stderr
#   stdout/stderr/somefile all go to journald;
#   but stdout will get truncated because it's not flush()ed.
# mode=client
#   client for OTP validation
# authfile=/etc/openvpn/server/authorized_yubikeys
#   the file with "USERNAME:YUBI1[:YUBI2:...]" lines
# #alwaysok
#   this is the dry-run (allow all)
# #use_first_pass/try_first_pass
#   do NOT use these for openvpn/openssh; the password is fetched
#   through PAM_CONV:
#   > pam_yubico.c:935 (pam_sm_authenticate): get password returned: (null)
# #verbose_otp
#   do NOT use this for openvpn/openssh; it will break password input
#   without any meaningful debug info:
#   > pam_yubico.c:1096 (pam_sm_authenticate): conv returned 1 bytes
#   > pam_yubico.c:1111 (pam_sm_authenticate): Skipping first 0 bytes. [...]
#   > pam_yubico.c:1118 (pam_sm_authenticate): OTP: username ID: username

# First, the username+password is checked:
auth required pam_yubico.so debug debug_file=stderr mode=client authfile=/etc/openvpn/server/authorized_yubikeys id=<client_id> key=<secret>

# Second, an account is needed: pam_sm_acct_mgmt returning PAM_SUCCESS
# (It checks the value of 'yubico_setcred_return' which was set by
# pam_sm_authenticate.) This one needs no additional config:
account required pam_yubico.so debug debug_file=stderr

As you can see in the comments above, some of that config had me puzzled for a while.

The above should be sufficient to get a second factor (2FA) for OpenVPN logins, next to your valid certificate. But, as someone immediately cleverly pointed out: if you use it like this, you have 2 x 1FA. Not 2FA.

That means that the usefulness of this is limited...

2021-04-08 - proxmox / virtio-blk / disk by-id

Why does the virtio-blk /dev/vda block device not show up in /dev/disk/by-id?

Yesterday, I wrote about how Proxmox VE attaches scsi0 and virtio0 block devices differently. That is the starting point for today's question: how come I get /dev/sda in /dev/disk/by-id while /dev/vda is nowhere to be found?

This question is relevant if you're used to referencing disks through /dev/disk/by-id (for example when setting up ZFS, using the device identifiers). The named devices can be a lot more convenient to keep track of.

If you're on a QEMU VM using virtio-scsi, the block devices do show up:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0 -> ../../sda
lrwxrwxrwx 1 10 apr  8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part1 -> ../../sda1

But if you're using virtio-blk, they do not:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1 9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0

There: no symlinks to /dev/vda, while it does exist and it does show up in /dev/disk/by-path:

# ls -l /dev/vda{,1}
brw-rw---- 1 root disk 254, 0 apr  8 14:50 /dev/vda
brw-rw---- 1 root disk 254, 1 apr  8 14:50 /dev/vda1
# ls -log /dev/disk/by-path/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 pci-0000:00:01.1-ata-2 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 pci-0000:00:0a.0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 pci-0000:00:0a.0-part1 -> ../../vda1
lrwxrwxrwx 1  9 apr  8 14:50 virtio-pci-0000:00:0a.0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 virtio-pci-0000:00:0a.0-part1 -> ../../vda1

udev rules

Who creates these? It's udev.

If you look at the udev rules in 60-persistent-storage.rules, you'll see a bunch of these:

# grep -E '"(sd|vd)' /lib/udev/rules.d/60-persistent-storage.rules
KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
KERNEL=="vd*[0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}-part%n"
...
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{vendor}=="ATA", IMPORT{program}="ata_id --export $devnode"
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{type}=="5", ATTRS{scsi_level}=="[6-9]*", IMPORT{program}="ata_id --export $devnode"
...
KERNEL=="sd*|sr*|cciss*", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}"
KERNEL=="sd*|cciss*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n"
...

So, udev is in the loop, and would create symlinks, if it matched the appropriate rules.

Comparing output from udevadm:

# udevadm info /dev/sda
P: /devices/pci0000:00/0000:00:05.0/virtio1/host2/target2:0:0/2:0:0:0/block/sda
N: sda
...
E: DEVNAME=/dev/sda
E: DEVTYPE=disk
...
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_drive-scsi0
E: ID_SERIAL_SHORT=drive-scsi0
E: ID_BUS=scsi
E: ID_PATH=pci-0000:00:05.0-scsi-0:0:0:0
...

and:

# udevadm info /dev/vda
P: /devices/pci0000:00/0000:00:0a.0/virtio1/block/vda
N: vda
...
E: DEVNAME=/dev/vda
E: DEVTYPE=disk
...
E: ID_PATH=pci-0000:00:0a.0
...

The output for /dev/vda is a lot shorter: there is no ID_BUS and no ID_SERIAL. The lack of a serial is what causes this rule to be skipped:

KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"

We could hack the udev rules, adding a default serial when it's unavailable:

KERNEL=="vd*[!0-9]", ATTRS{serial}!="?*", ENV{ID_SERIAL}="MY_SERIAL", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
# udevadm control --reload
# udevadm trigger --action=change
# ls -log /dev/disk/by-id/
lrwxrwxrwx 1 9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1 9 apr  8 14:50 virtio-MY_SERIAL -> ../../vda

But that's awkward. And it breaks things if we ever add a second disk.

Adding a serial through Proxmox

Instead, we can hand-hack the Proxmox VE QEMU configuration file and add a custom ,serial=MY_SERIAL parameter (at most 20 bytes) to the disk configuration. We'll use disk0 as the serial for now:

--- /etc/pve/qemu-server/NNN.conf
+++ /etc/pve/qemu-server/NNN.conf
@@ -10,5 +10,5 @@ ostype: l26
 scsihw: virtio-scsi-pci
 smbios1: uuid=d41e78ad-4ff6-4000-8882-c343e3233945
 sockets: 1
-virtio0: somedisk:vm-NNN-disk-0,size=32G
+virtio0: somedisk:vm-NNN-disk-0,serial=disk0,size=32G
 vmgenid: 2ffdfa16-769a-421f-91f3-71397562c6b9

Stop the VM, start it again, and voilà, the disk is matched:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 virtio-disk0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 virtio-disk0-part1 -> ../../vda1

As long as you don't create duplicate serials in the same VM, this should be fine.
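
To double-check that udev picked up the serial, something like this should do (a sketch; output trimmed to the relevant line):

# udevadm info /dev/vda | grep '^E: ID_SERIAL='
E: ID_SERIAL=disk0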

2021-04-07 - proxmox / alter default create vm parameters

The Proxmox Virtual Environment has defaults when creating a new VM, but it has no option to change those defaults. Here's a quick example of hacking in some defaults.

Why? (Changing SCSI controller does not change existing disks)

In the next post I wanted to talk about /dev/disk/by-id and why disks that use the VirtIO SCSI controller do not show up there. A confusing matter here was that creating a VM disk with one SCSI controller type and then switching controllers does not completely change how the existing disks are attached!

If you're on Proxmox VE 6.x (observed with 6.1 and 6.3) and you create a VM with the VirtIO SCSI controller, your virtual machine parameters may look like this, and you get a /dev/vda device inside your QEMU VM:

/usr/bin/kvm \
  ...
  -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa \
  -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-virtio0,format=raw

But if you create it with the (default) LSI 53C895A SCSI controller first, and then switch to VirtIO SCSI, you still keep the (ATA) /dev/sda block device name. The VM is started with these command line arguments:

/usr/bin/kvm \
  ...
  -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \
  -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-scsi0,format=raw \
  -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0

If you look at the configuration in /etc/pve/qemu-server/NNN.conf, both would have:

scsihw: virtio-scsi-pci

But the disk configuration type/name is different:

virtio0: somedisk:vm-NNN-disk-0,size=32G

vs.

scsi0: somedisk:vm-NNN-disk-0,size=32G

The virtio-scsi-pci + scsi0 combination is turned into -device virtio-scsi-pci flags, while virtio-scsi-pci + virtio0 translates to -device virtio-blk-pci.

It's not a bad thing though, that it does not change from scsi0 to virtio0. After all, if the device did change from /dev/sda to /dev/vda, your boot procedure and mounts might be impacted. But it does mean that you want the VirtIO SCSI option selected before you create any disks.

How? (Hacking defaults into pvemanagerlib.js)

In the pve-manager package, there's a /usr/share/pve-manager/js/pvemanagerlib.js that controls much of the user interface. Altering the default appears to be a matter of:

--- /usr/share/pve-manager/js/pvemanagerlib.js
+++ /usr/share/pve-manager/js/pvemanagerlib.js
@@ -21771,7 +21771,7 @@ Ext.define('PVE.qemu.OSDefaults', {
         scsi: 2,
         virtio: 1
       },
-      scsihw: ''
+      scsihw: 'virtio-scsi-pci'
     };
 
     // virtio-net is in kernel since 2.6.25

For bonus points, we can disable the firewall default, which we manage elsewhere anyway:

--- /usr/share/pve-manager/js/pvemanagerlib.js
+++ /usr/share/pve-manager/js/pvemanagerlib.js
@@ -22434,7 +22434,7 @@ Ext.define('PVE.qemu.NetworkInputPanel',
         xtype: 'proxmoxcheckbox',
         fieldLabel: gettext('Firewall'),
         name: 'firewall',
-        checked: (me.insideWizard || me.isCreate)
+        checked: false
       }
     ];
 
@@ -27909,7 +27909,7 @@ Ext.define('PVE.lxc.NetworkInputPanel',
       cdata.name = 'eth0';
       me.dataCache = {};
     }
-    cdata.firewall = (me.insideWizard || me.isCreate);
+    cdata.firewall = false;
 
     if (!me.dataCache) {
       throw "no dataCache specified";

Of course these changes will get wiped whenever you update Proxmox VE. Keeping your hacks active will be an exercise for the reader.
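
One hedged way to keep them active: save the hunks above as a diff (pvemanagerlib.diff is whatever you called that file) and reapply it after every pve-manager upgrade:

$ cd /usr/share/pve-manager/js
$ sudo patch --dry-run pvemanagerlib.js <~/pvemanagerlib.diff &&
    sudo patch pvemanagerlib.js <~/pvemanagerlib.diff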

2021-01-21 - openvpn / hardened fox-it openvpn-nl

Today, we will be evaluating OpenVPN-NL — “[a] hardened version of OpenVPN that includes as many of the security measures required to operate in a classified environment as possible” — and whether we can use it as a drop-in replacement for regular OpenVPN.

While OpenVPN allows many insecure configurations, such as turning off encryption, or the use of outdated cryptographic functions in security critical places, the goal of OpenVPN-NL — a fork created and maintained by Fox-IT — is to strip insecure configuration and verify that the distributed version is uncompromised.

We'll be answering the question of whether it's compatible and whether we want to use it.

For Ubuntu Bionic and Xenial, repositories exist. But the Bionic version works just fine on Ubuntu Focal.

                  OpenVPN                      OpenVPN-NL
repo              ubuntu default               fox-it repository
package           openvpn                      openvpn-nl
version           2.4.7-1ubuntu2               2.4.7-bionicnl1
dependencies      lzo2-2, lz4-1, pkcs11,       lzo2-2, net-tools,
                  ssl, systemd0, iproute2      (embedded) Mbed TLS 2.16.2
size              1160 KiB                     1627 KiB
binary            /usr/sbin/openvpn            /usr/sbin/openvpn-nl
systemd notify    YES                          -

As you can already see in the above list:

  • the versions are similar (2.4.7);
  • OpenVPN is linked to OpenSSL while OpenVPN-NL embeds Mbed TLS. This means that:
    • it is not affected by OpenSSL specific security issues,
    • but it will be affected by Mbed TLS issues and we'll have to rely on updates from Fox-IT, should such issues arise.
  • OpenVPN-NL can be installed alongside OpenVPN, which makes switching between the two convenient;
  • it depends on older networking tools (net-tools);
  • it does not support sd_notify — you'll have to disable Type=notify in your SystemD service files.

On to the hardening bits

The hardening done by Fox-IT appears to consist of the following changes:

  • Mbed TLS is used instead of OpenSSL:
    • if you assume that OpenSSL is riddled with flaws, then this is a good thing;
    • if you assume that any security product, including Mbed TLS will have its flaws, then a drawback is that you get fewer features (no TLS 1.3) and that you have to rely on timely patches from Fox-IT.
  • OpenVPN-NL drastically limits the allowed cryptography algorithms — both on the weak and on the strong side of the spectrum — leaving you with really no option but SHA256, RSA and AES-256;
  • it enforces a few options that you should have enabled, like certificate validation, and specifically remote-cert-tls to prevent client-to-client man in the middle attacks;
  • it removes a few options that you should not have enabled, like no-iv, client-cert-not-required or optional verify-client-cert;
  • certificates must be signed with a SHA256 hash, or the certificates will be rejected;
  • it delays startup until there is sufficient entropy on the system (it does so by reading and discarding min-platform-entropy bytes from /dev/random, which strikes me as an odd way to accomplish that) — during testing you can set min-platform-entropy 0.

Note that we're only using Linux, so we did not check any Windows build scripts/fixes that may also be done. The included PKCS#11 code — for certificates on hardware tokens — was not checked either at this point.

The available algorithms:

                  OpenVPN                                 OpenVPN-NL
--show-digests    .. lots and lots ..                     SHA256
--show-tls        .. anything that OpenSSL supports,      TLS 1.2 (only) with ciphers:
                  for TLS 1.3 and below ..                TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384
                                                          TLS-DHE-RSA-WITH-AES-256-GCM-SHA384
                                                          TLS-DHE-RSA-WITH-AES-256-CBC-SHA256
--show-ciphers    .. lots and lots ..                     AES-256-CBC
                                                          AES-256-GCM

Notable in the above list is that SHA512 is not allowed, nor are ECDSA ciphers: so no new fancy ed25519 or secp521r1 elliptic curve (EC) ciphers, but only plain old RSA large primes. (The diff between openvpn-2.4.7-1ubuntu2 and Fox-IT bionicnl1 even explicitly states that EC is disabled, except for during the Diffie-Hellman key exchange. No motivation is given.)

So, compatibility with vanilla OpenVPN is available, somewhat, if you stick to a configuration like the one below.

Server settings:

mode server
tls-server

Client settings:

client  # equals: pull + tls-client

Server and client settings:

local SERVER_IP # [server: remote SERVER_IP]
proto udp
port 1194
nobind          # [server: server VPN_NET 255.255.255.0]
dev vpn-DOMAIN  # named network devices are nice
dev-type tun

# HMAC auth, first line of defence against brute force
auth SHA256
tls-auth DOMAIN/ta.key 1  # [server: tls-auth DOMAIN/ta.key 0]
key-direction 1           # int as above, allows inline <tls-auth>

# TLS openvpn-nl compatibility config
tls-version-min 1.2
#[not necessary]#tls-version-max 1.2    # MbedTLS has no 1.3

# DH/TLS setup
# - no ECDSA for openvpn-nl
# - no TLS 1.3 for openvpn-nl
tls-cipher TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384
tls-ciphersuites TLS_AES_256_GCM_SHA384 # only for TLS 1.3
ecdh-curve secp384r1
#[only server]#dh none  # (EC)DHE, thus no permanent parameters

# TLS certificates
# Note that the certificates must be:
# - SHA-256 signed
# - using RSA 2048 or higher (choose at least 4096), and not Elliptic Curve
# - including "X509v3 Extended Key Usage" (EKU) for Server vs. Client
remote-cert-tls server  # [server: remote-cert-tls client] (EKU)
ca DOMAIN/ca.crt        # CA to validate the peer certificate against
cert DOMAIN/client-or-server.crt
key DOMAIN/client-or-server.key
#[only server]#crl-verify DOMAIN/crl.pem  # check for revoked certs

# Data channel
cipher AES-256-GCM      # or AES-256-CBC
ncp-disable             # and no cipher negotiation

# Drop privileges; keep tunnel across restarts; keepalives
# useradd -md /var/spool/openvpn -k /dev/null -r -s /usr/sbin/nologin openvpn
user openvpn
group nogroup
persist-key
persist-tun
keepalive 15 55             # ping every 15, disconnect after 55
#[only server]#opt-verify   # force compatible options

The lack of SystemD notify support is a minor annoyance. When editing the SystemD service file, set Type to simple and remove --daemon from the options. Otherwise you may end up with unmounted PrivateTmp mounts and multiple openvpn-nl daemons (which of course hold on to the listening socket your new daemon needs, causing strange client-connect errors):

# /etc/systemd/system/openvpn@server.service.d/override.conf
[Service]
ExecStart=
# Take the original ExecStart, replace "openvpn" with "openvpn-nl"
# and remove "--daemon ...":
ExecStart=/usr/sbin/openvpn-nl --status /run/openvpn/%i.status 10
    --cd /etc/openvpn --script-security 2 --config /etc/openvpn/%i.conf
    --writepid /run/openvpn/%i.pid
Type=simple

If you're okay with sticking to SHA256 and RSA for now, then OpenVPN-NL is compatible with vanilla OpenVPN. Do note that hardware acceleration in Mbed TLS is explicitly marked as disabled on the OpenVPN-NL lifecycle page. I'm not sure if this is a security decision, but it may prove to be less performant.

In conclusion: there is no immediate need to use OpenVPN-NL, but it is wise to take their changes to heart. Make sure:

  • you validate and trust packages from your software repository;
  • all your certificates are SHA256-signed;
  • remote-cert-tls is enabled (and your certificates are marked with the correct key usage, e.g. by using a recent easy-rsa to sign your keys);
  • ciphers are fixed or non-negotiable using ncp-disable;
  • auth, cipher and tls-cipher are set to something modern.

But if you stick to the above configuration, then using OpenVPN-NL is fine too..

.. although still I cannot put my finger on how discarding bytes from /dev/random would make things more secure.

Notes about RNG, min-platform-entropy and hardware support

About “how discarding bytes from /dev/random makes things more secure.”

I think the theory is that throwing away some bytes makes things more secure, because the initially seeded bytes after reboot might be guessable. And instead of working against the added code — by lowering min-platform-entropy — we can attempt to get more/better entropy.

If the rdrand processor flag is available then this might be a piece of cake:

$ grep -q '^flags.*\brdrand\b' /proc/cpuinfo && echo has rdrand
has rdrand

If it isn't, and this is a virtual machine, you'll need to (a) confirm that it's available on the VM host and (b) enable the host processor in the VM guest (-cpu host). (If you wanted AES CPU acceleration, you would have enabled host CPU support already.)
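
On a Proxmox VE host, checking and enabling that might look like this (NNN being the VM id):

# grep -q '\brdrand\b' /proc/cpuinfo && echo host has rdrand
host has rdrand
# qm set NNN --cpu host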

When the processor flag is available, you can start benefitting from host-provided entropy.

$ cat /proc/sys/kernel/random/entropy_avail
701

This old entropy depletes faster than a Coca-Cola bottle with a Mentos in it, once you start reading from /dev/random directly.

But, if you install rng-tools, you get a nice /usr/sbin/rngd that checks entropy levels and reads from /dev/hwrng, replenishing the entropy as needed.

17:32:00.784031 poll([{fd=4, events=POLLOUT}], 1, -1) = 1 ([{fd=4, revents=POLLOUT}])
17:32:00.784162 ioctl(4, RNDADDENTROPY, {entropy_count=512, buf_size=64, buf="\262\310"...}) = 0
$ cat /proc/sys/kernel/random/entropy_avail
3138

Instant replenish! Now you can consider enabling use-prediction-resistance if you're using MbedTLS (through OpenVPN-NL).

Footnotes

See also blog.g3rt.nl openvpn security tips and how to harden OpenVPN in 2020.

2021-01-15 - postgresql inside kubernetes / no space left on device

Running PostgreSQL inside Kubernetes? Getting occasional "No space left on device" errors? Know that 64MB is not enough for everyone.

With the advent of more services running inside Kubernetes, we're now running into new issues and complexities specific to the containerization. For instance, to solve the problem of regular file backups of distributed filesystems, we've resorted to using rsync wrapped inside a pod (or sidecar). And now for containerized PostgreSQL, we're running into an artificial memory limit that needs fixing.

Manifestation

The issue manifests itself like this:

ERROR: could not resize shared memory segment "/PostgreSQL.491173048"
  to 4194304 bytes: No space left on device

This shared memory that PostgreSQL speaks of, is the shared memory made available to it through /dev/shm.

On your development machine, it may look like this:

$ mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
tmpfs        16G  948M   15G   6% /dev/shm

That's fine. 16GiB is plenty of space. But in Kubernetes we get a default of a measly 64MiB and no means to change the shm size. So, inside the pod with the PostgreSQL daemon, things look like this:

$ mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
shm          64M     0   64M   0% /dev/shm

For a bunch of database operations, that is definitely too little. Any PostgreSQL database doing any serious work will quickly use up that much temporary space. (And run into this error.)

According to Thomas Munro on the postgrespro.com mailing list:

PostgreSQL creates segments in /dev/shm for parallel queries (via shm_open()), not for shared buffers. The amount used is controlled by work_mem. Queries can use up to work_mem for each node you see in the EXPLAIN plan, and for each process, so it can be quite a lot if you have lots of parallel worker processes and/or lots of tables/partitions being sorted or hashed in your query.

Basically what they're saying is: you need sufficient space in /dev/shm, period!

On the docker-library postgres page it is documented that you may want to increase the --shm-size (ShmSize). That is quite doable for direct Docker or docker-compose instantiations. But for PostgreSQL daemon pods in Kubernetes resizing shm does not seem to be possible.
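
(For completeness: with plain Docker it is a single flag; the size and image tag here are arbitrary examples.)

$ docker run --shm-size=1g postgres:13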

Any other fixes then?

Well, I'm glad you asked! /dev/shm is just one of the ways that the PostgreSQL daemon can be configured to allocate shared memory through:

dynamic_shared_memory_type (enum)
Specifies the dynamic shared memory implementation that the server should use. Possible values are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared memory allocated via shmget), windows (for Windows shared memory), and mmap (to simulate shared memory using memory-mapped files stored in the data directory). [...]

(from PostgresSQL runtime config)

When using the posix shm_open(), we're directly opening files in /dev/shm. If we however opt to use the (old fashioned) sysv shmget(), the memory allocation is not pinned to this filesystem and it is not limited (unless someone has been touching /proc/sys/kernel/shm*).
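
You can check those knobs; on recent kernels they default to effectively unlimited (these values are from one machine, yours may differ):

$ cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall
18446744073709551615
18446744073709551615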

Technical details of using System V shared memory

Using System V shared memory is a bit more convoluted than using POSIX shm. For POSIX shared memory calling shm_open() is basically the same as opening a (mmap-able) file in /dev/shm. For System V however, you're looking at an incantation like this shmdemo.c example:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB */

int main(int argc, char *argv[])
{
    key_t key;
    int shmid;
    char *data;

    if (argc > 2) {
        fprintf(stderr, "usage: shmdemo [data_to_write]\n");
        return 1;
    }
    /* The file here is used as a "pointer to memory". The key is
     * calculated based on the inode number and non-zero 8 bits: */
    if ((key = ftok("./pointer-to-memory.txt", 1 /* project_id */)) == -1) {
        fprintf(stderr, "please create './pointer-to-memory.txt'\n");
        return 2;
    }
    if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1)
        return 3;
    if ((data = shmat(shmid, NULL, 0)) == (char *)(-1)) /* attach */
        return 4;

    /* read or modify the segment, based on the command line: */
    if (argc == 2) {
        printf("writing to segment %#x: \"%s\"\n", key, argv[1]);
        strncpy(data, argv[1], SHM_SIZE);
    } else {
        printf("segment %#x contained: \"%s\"\n", key, data);
        shmctl(shmid, IPC_RMID, NULL); /* free the memory */
    }

    if (shmdt(data) == -1) /* detach */
        return 5;
    return 0;
}

(Luckily the PostgreSQL programmers concerned themselves with these awkward semantics, so we won't have to.)
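
Compiling the sample needs nothing beyond a C compiler:

$ gcc -Wall -o shmdemo shmdemo.c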

If you want to confirm that you have access to sufficient System V shared memory inside your pod, you could try the above code sample to test. Invoking it looks like:

$ ./shmdemo
please create './pointer-to-memory.txt'
$ touch ./pointer-to-memory.txt
$ ./shmdemo
segment 0x1010dd5 contained: ""
$ ./shmdemo 'please store this in shm'
writing to segment 0x1010dd5: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: ""

And if you skipped/forget the IPC_RMID, you can see the leftovers using ipcs:

$ ipcs | awk '{if(int($6)==0)print}'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x52010e16 688235     walter     644        536870912  0
0x52010e19 688238     walter     644        536870912  0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

And remove them with ipcrm:

$ ipcrm -M 0x52010e16
$ ipcrm -M 0x52010e19

But, you probably did not come here for lessons in ancient IPC. Quickly moving on to the next paragraph...

Configuring sysv dynamic_shared_memory_type in stolon

For stolon — the Kubernetes PostgreSQL manager that we're using — you can configure different parameters through the pgParameters setting. It keeps the configuration in a configMap:

$ kubectl -n NS get cm stolon-cluster-mycluster -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":...}'
    stolon-clusterdata: '{"formatVersion":1,...}'
...

Where the stolon-clusterdata holds both the configuration and current state:

{
  "formatVersion": 1,
  "changeTime": "2021-01-15T10:17:54.297700008Z",
  "cluster": {
...
    "spec": {
...
      "pgParameters": {
        "datestyle": "iso, mdy",
        "default_text_search_config": "pg_catalog.english",
        "dynamic_shared_memory_type": "posix",
...

You should not be editing this directly, but it can be educational to look at.

To edit the pgParameters you'll be using stolonctl from inside a stolon-proxy as specified in the cluster specification patching docs:

$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"dynamic_shared_memory_type": "sysv"}}'
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"shared_buffers": "6144MB"}}'

And a restart:

$ kubectl -n NS rollout restart sts stolon-keeper

And that, my friends, should get rid of that pesky 64MiB limit.

2021-01-05 - chromium snap / wrong fonts

So, a couple of weeks ago my snap-installed Chromium browser on Ubuntu Focal started acting up: suddenly it chooses the wrong fonts on some web pages. The chosen fonts are from the ~/.local/share/fonts/ directory.

[wjd.nu pages with incorrect looking font]

Look! That's not the correct font. And it's even more apparent that the font is off when seeing the source view.

[browser html source view with incorrect looking font]

Bah. That's not even a monospaced font.

A fix that appeared to work — but unfortunately only temporarily — involves moving the custom local fonts out of the way and then flushing the font cache:

$ mkdir ~/.local/share/DISABLED-fonts
$ mv ~/.local/share/fonts/* ~/.local/share/DISABLED-fonts/
$ fc-cache -rv && sudo fc-cache -rv

Restarting chromium-browser by using the about:restart took quite a while. Some patience had to be exercised.

When it finally did start, all font issues were solved.

Can we now restore our custom local fonts again?

$ mv ~/.local/share/DISABLED-fonts/* ~/.local/share/fonts/
$ fc-cache -rv && sudo fc-cache -rv

And another about:restart — which was fast as normal again — and everything was still fine. So yes, apparently, we can.

However, after half a day of work, the bug reappeared.

A semi-permanent fix is refraining from using the local fonts directory. But that's not really good enough.

Apparently there's a bug report showing that not only Chromium is affected. And while I'm not sure how to fix things yet, at least the following seems suspect:

$ grep include.*/snap/ \
    ~/snap/chromium/current/.config/fontconfig/fonts.conf
  <include ignore_missing="yes">/snap/chromium/1424/gnome-platform/etc/fonts/fonts.conf</include>

This would make sense, if current/ pointed to 1424, but current/ now points to 1444.

Here's a not-yet-merged pull request that looks promising. And here, there's someone who grew tired of hotfixing the fonts.conf and symlinked all global font conf files into ~/.local/share/fonts/. That might also be worth a try...

A more permanent solution?

$ mkdir -p ~/snap/chromium/common/.config/fontconfig
$ cat >>~/snap/chromium/common/.config/fontconfig/fonts.conf <<EOF
<fontconfig>
  <include>/etc/fonts/conf.d</include>
</fontconfig>
EOF

I settled for a combination of the linked suggestions. The above snippet looks like it works. Crosses fingers...

Three weeks later...

Or at least, for a while. It looks like a new snap-installed version of Chromium broke things again. When logging in after the weekend, I was presented with the wrong fonts again.

This time, I:

  • fixed the symlinks,
  • removed the older/unused 1444 snap revision,
  • reran the fc-cache flush, and
  • restarted Chromium.

Permanent? No!

TL;DR

(Months later by now.. still a problem.)

It feels as if I'm the only one suffering from this. At least now the following sequence appears to work reliably:

  • new Chromium snap has been silently installed;
  • fonts are suddenly broken in currently running version;
  • sudo rm /var/snap/chromium/common/fontconfig/* ;
  • shut down / kill Chromium (make sure you get them all);
  • start Chromium and reopen work with ctrl-shift-T.

(It's perhaps also worth looking into whether the default Chromium fonts are missing after snapd has been updated ticket has been resolved.)

2021-01-02 - stale apparmor config / mysql refuses to start

So, recently we had an issue with a MariaDB server that refused to start. Or, actually, it would start, but before long, SystemD would kill it. But why?

# systemctl start mariadb.service
Job for mariadb.service failed because a timeout was exceeded.
See "systemctl status mariadb.service" and "journalctl -xe" for details.

After 90 seconds, it would be killed. systemctl status mariadb.service shows the immediate cause:

# systemctl status mariadb.service
...
systemd[1]: mariadb.service: Start operation timed out. Terminating.
systemd[1]: mariadb.service: Main process exited, code=killed, status=15/TERM
systemd[1]: mariadb.service: Failed with result 'timeout'.

Ok, a start operation timeout. That is caused by the notify type: apparently the mysqld doesn't get a chance to tell SystemD that it has successfully completed startup.

First, a quickfix, so we can start at all:

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
Type=simple
EOF

That fixes things so we can start — because now SystemD won't wait for any "started" notification anymore — but it doesn't explain what is wrong.

Second, an attempt at debugging the cause:

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /usr/sbin/mysqld $MYSQLD_OPTS
EOF

Okay, that one showed EACCESS errors on the sendmsg() call on the /run/systemd/notify unix socket:

strace[55081]: [pid 55084] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 46
strace[55081]: [pid 55084] sendmsg(46, {msg_name={sa_family=AF_UNIX,
  sun_path="/run/systemd/notify"}, msg_namelen=22,
  msg_iov=[{iov_base="READY=1\nSTATUS=Taking your SQL requests now...\n", iov_len=47}],
  msg_iovlen=1, msg_controllen=0, msg_flags=0},
  MSG_NOSIGNAL) = -1 EACCES (Permission denied)

Permission denied? But why?

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /bin/sh -c 'printf "READY=1\nSTATUS=Taking your SQL requests now...\n" | \
    socat - UNIX-SENDTO:/run/systemd/notify; sleep 3600'
EOF

This worked:

strace[54926]: [pid 54931] socket(AF_UNIX, SOCK_DGRAM, 0) = 5
strace[54926]: [pid 54931] sendto(5,
  "READY=1\nSTATUS=Taking your SQL requests now...\n", 47, 0,
  {sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, 21) = 47

(Unless someone is really trying to mess with you, you can regard sendto() and sendmsg() as equivalent here. socat simply uses the other one.)

That means that there is nothing wrong with SystemD or /run/systemd/notify. So the problem must be related to /usr/sbin/mysqld.

After looking at journalctl -u mariadb.service for the nth time, I decided to peek at all of journalctl without any filters. And there it was after all: audit logs.

# journalctl -t audit
audit[1428513]: AVC apparmor="DENIED" operation="sendmsg"
  info="Failed name lookup - disconnected path" error=-13
  profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=1428513
  comm="mysqld" requested_mask="w" denied_mask="w" fsuid=104 ouid=0

(Observe the -t in the journalctl invocation above which looks for the SYSLOG_IDENTIFIER=audit key-value pair.)

Okay. And fixing it?

# aa-remove-unknown
Skipping profile in /etc/apparmor.d/disable: usr.sbin.mysqld
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Removing '/usr/sbin/mysqld'

A-ha! Stale cruft in /var/cache/apparmor.

# /etc/init.d/apparmor restart
Restarting apparmor (via systemctl): apparmor.service.

Finally we could undo the override.conf and everything started working as expected.

2021-01-01 - zfs / zvol / partition does not show up

For one of the virtual machines on our Proxmox host, I had to go into a volume to quickly fix an IP address. The volume exists on the VM host, so surely mounting it is easy. Right?

I checked in /dev/zvol/pve2-pool/ where I found the disk:

# ls -l /dev/zvol/pve2-pool/vm-125-virtio0*
total 0
lrwxrwxrwx 1 root root 10 Dec 29 15:55 vm-125-virtio0 -> ../../zd48

Good, there's a disk:

# fdisk -l /dev/zd48
Disk /dev/zd48: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0x000aec27

Device      Boot    Start       End  Sectors  Size Id Type
/dev/zd48p1 *        2048  97656831 97654784 46.6G 83 Linux
/dev/zd48p2      97656832 104855551  7198720  3.4G 82 Linux swap / Solaris

And it has partitions. Now if I could only find them, so I can mount them...

Apparently, there's a volmode on the ZFS volume that specifies how volumes should be exposed to the OS.

Setting it to full exposes volumes as fully fledged block devices, providing maximal functionality. [...] Setting it to dev hides its partitions. Volumes with property set to none are not exposed outside ZFS, but can be snapshoted, cloned, replicated, etc, that can be suitable for backup purposes.

So:

# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default
# zfs set volmode=full zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   full     local
# ls -1 /dev/zl-pve2-ssd1/
vm-122-virtio0
vm-123-virtio0
vm-124-virtio0
vm-125-virtio0
vm-125-virtio0-part1
vm-125-virtio0-part2

Yes! Partitions for vm-125-virtio0.

If that partition does not show up as expected, a call to partx -a /dev/zl-pve2-ssd1/vm-125-virtio0 might do the trick.

Quick, do some mount /dev/zl-pve2-ssd1/vm-125-virtio0-part1 /mnt/root; edit some files.

But, try to refrain from editing the volume while the VM is running. That may cause filesystem corruption.

Lastly umount and unset the volmode again:

# zfs inherit volmode zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default

And optionally update the kernel bookkeeping with: partx -d -n 1:2 /dev/zl-pve2-ssd1/vm-125-disk-0