Notes to self, 2021
2021-11-19 - systemd / zpool import / zfs mount / dependencies
On getting regular ZFS mount points to work with systemd dependency ordering.
ZFS on Ubuntu is nice. And so is systemd. But getting them to play nice together sometimes requires a little extra effort.
A problem we were facing was that services would get started before
their respective mount points had all been made available. For example,
for some setups, we have a local-storage
ZFS zpool that holds the
/var/lib/docker
directory.
If there is no dependency ordering, Docker may start before
the /var/lib/docker
directory is mounted. Docker
will start just fine, but it will start writing files in the wrong
location. This is extremely inconvenient. So we want to force
docker.service
to start only after
/var/lib/docker
has been mounted.
Luckily we can have systemd handle the dependency ordering
for us. We'll have to tell the specific service, in this case
docker.service
, to depend on one or more mount points. That
might look like this:
# /etc/systemd/system/docker.service.d/override.conf
[Unit]
RequiresMountsFor=/data/kubernetes/static-provisioner
RequiresMountsFor=/var/lib/docker
These unit file directives
specify that the Docker service may start only once these two paths
have both been mounted. To this end, systemd will
look for definitions of
data-kubernetes-static\x2dprovisioner.mount
and
var-lib-docker.mount
.
The mount unit file format is described in systemd.mount(5).
The filename must adhere to the output of systemd-escape --path --suffix=mount MOUNTPOINT
.
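For example, for the two paths above, that should yield exactly the unit names systemd looks for:
$ systemd-escape --path --suffix=mount /data/kubernetes/static-provisioner
data-kubernetes-static\x2dprovisioner.mount
$ systemd-escape --path --suffix=mount /var/lib/docker
var-lib-docker.mount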
So for the above two mount points, you might make these two unit files:
# /etc/systemd/system/data-kubernetes-static\x2dprovisioner.mount
[Unit]
Documentation=https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
After=zfs-mount.service
Requires=zfs-mount.service

[Mount]
Where=/data/kubernetes/static-provisioner
What=local-storage/kubernetes/static-provisioner
Type=zfs-non-legacy
# /etc/systemd/system/var-lib-docker.mount
[Unit]
Documentation=https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
After=zfs-mount.service
Requires=zfs-mount.service

[Mount]
Where=/var/lib/docker
What=local-storage/docker
Type=zfs-non-legacy
Observe that Type=zfs-non-legacy
requires some extra magic.
You might have been tempted to set Type=zfs
, but that only
works for so-called legacy mounts in ZFS. For those, you can
do mount -t zfs tank/dataset /mountpoint
. But for
regular ZFS mounts, you cannot. They are handled by zfs
mount
and the mountpoint
and canmount
properties.
Sidenote: usually, you'll let zfs-mount.service
mount everything.
It will zfs mount -a
, which works if the zpool was also
correctly/automatically imported. But because you have no guarantees
there, it is nice to manually force the dependency on the specific
directories you need, as done above.
To complete the magic, we add a helper script as
/usr/sbin/mount.zfs-non-legacy
. It takes on the burden of
ensuring that the zpool is imported and that the dataset is mounted.
That script then basically looks like this:
#!/bin/sh
# Don't use this script, use the better version from:
# https://github.com/ossobv/vcutil/blob/main/mount.zfs-non-legacy
name="$1"   # local-storage/docker
path="$2"   # /var/lib/docker
zpool import "${name%%/*}" || true  # import pool
zfs mount "${name}" || true
# Quick hack: mount again, it should be mounted now. If it isn't,
# we have failed and report that back to mount(8).
zfs mount "${name}" 2>&1 | grep -qF 'filesystem already mounted'
# (the status of grep is returned to the caller)
That allows systemd to call mount -t zfs-non-legacy
local-storage/docker /var/lib/docker
and that gets handled by
the script.
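For a quick sanity check by hand (a sketch, assuming the unit files above are in place), you can run:
# systemctl daemon-reload
# mount -t zfs-non-legacy local-storage/docker /var/lib/docker
# systemctl show -p Requires,After docker.service | tr ' ' '\n' | grep '\.mount$'
The two .mount units should show up in both the Requires= and After= lists of docker.service.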
A better version of this script can be found as mount.zfs-non-legacy from ossobv/vcutil. This should get you proper mount point dependencies and graceful breakage when something fails.
2021-10-22 - zpool import / no pools / stale zdb labels
Today, when trying to import a newly created ZFS pool, we had to supply the -d DEV argument to find the pool.
# zpool import no pools available to import
But I know it's there.
# zpool import local-storage cannot import 'local-storage': no such pool available
And by specifying -d
with a device search path, it can be found:
# zpool import local-storage -d /dev/disk/by-id
Success!
# zpool list -oname NAME bpool local-storage rpool
Manually specifying a search path is not really convenient. It would make the boot process a lot less smooth. We'd have to alter the distribution-provided scripts, which in turn makes upgrading more painful.
The culprit — it turned out — was an older zpool that had previously existed on this device. That left label 0 cleared while label 1 was still in use:
# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234 failed to read label 0 ------------------------------------ LABEL 1 ------------------------------------ version: 5000 name: 'local-storage' state: 0 ...
The easy fix here is to flush the data and start from scratch:
# zpool destroy local-storage
At this point, zdb -l
still lists LABEL 1 as used.
# zpool labelclear /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
Now the labels are gone:
# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234 failed to unpack label 0 failed to unpack label 1 failed to unpack label 2 failed to unpack label 3
And after recreating the pool everything works normally:
# zpool create -O compression=lz4 -O mountpoint=/data local-storage \ /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234
# zdb -l /dev/disk/by-id/nvme-Micron_9300_MTFDHAL3T8TDP_1234 ------------------------------------ LABEL 0 ------------------------------------ version: 5000 name: 'local-storage' state: 0 ...
# zpool export local-storage
# zpool import pool: local-storage id: 8392074971509924158 state: ONLINE action: The pool can be imported using its name or numeric identifier. ...
# zpool import local-storage
All good. Imports without having to specify a device.
2021-10-04 - letsencrypt root / certificate validation on jessie
On getting LetsEncrypt certificates to work on Debian/Jessie or Cumulus Linux 3 again.
Last Thursday the 30th, at 14:01 UTC, the old LetsEncrypt root certificate stopped working. This was a known and anticipated issue. All certificates had long been double-signed by a new root that doubled as intermediate. Unfortunately, this does not mean that everything kept working on older platforms with OpenSSL 1.0.1 or 1.0.2.
See this Debian/Jessie box — we see similar behaviour on Cumulus Linux 3.x:
# apt-get dist-upgrade Reading package lists... Done Building dependency tree Reading state information... Done Calculating upgrade... Done 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Everything is up to date.
# curl https://wctegeltje.nl curl: (60) SSL certificate problem: certificate has expired More details here: http://curl.haxx.se/docs/sslcerts.html
Yet the certificate is marked as expired.
Quickly check the chain on another box:
$ easycert -T wctegeltje.nl 443 Certificate chain 0 s: [bb678ac6] CN = wctegeltje.nl i: [8d33f237] C = US, O = Let's Encrypt, CN = R3 1 s: [8d33f237] C = US, O = Let's Encrypt, CN = R3 i: [4042bcee] C = US, O = Internet Security Research Group, CN = ISRG Root X1 2 s: [4042bcee] C = US, O = Internet Security Research Group, CN = ISRG Root X1 i: [2e5ac55d] O = Digital Signature Trust Co., CN = DST Root CA X3 --- Expires in 30 days
So yeah. The root-most part here has expired, but the intermediate-root-double has not. See these:
# openssl x509 -in /etc/ssl/certs/2e5ac55d.0 -enddate -noout notAfter=Sep 30 14:01:15 2021 GMT
# openssl x509 -in /etc/ssl/certs/4042bcee.0 -enddate -noout notAfter=Jun 4 11:04:38 2035 GMT
How do we fix this? Easy. Just clear out the expired root:
# mv /usr/share/ca-certificates/mozilla/DST_Root_CA_X3.crt{,.old}
# sed -i -e 's#^mozilla/DST_Root_CA_X3.crt#!&#' /etc/ca-certificates.conf
# update-ca-certificates Updating certificates in /etc/ssl/certs... 0 added, 1 removed; done. Running hooks in /etc/ca-certificates/update.d....done.
(That last step removes /etc/ssl/certs/2e5ac55d.0
which is a symlink to DST_Root_CA_X3.pem
.)
# curl https://wctegeltje.nl <!DOCTYPE html> ...
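To double-check the chain without curl, openssl s_client against the trimmed CA bundle should now report a clean verification:
$ echo QUIT | openssl s_client -connect wctegeltje.nl:443 -servername wctegeltje.nl \
    2>/dev/null | grep 'Verify return code'
Verify return code: 0 (ok)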
2021-10-03 - umount -l / needs --make-slave
The other day I learned — the hard way — that umount -l
can be dangerous.
Using the --make-slave
mount option makes it safer.
The scenario went like this:
A virtual machine on our Proxmox VE cluster wouldn't boot. No biggie, I thought. Just mount the filesystem on the host and do a proper grub-install from a chroot:
# fdisk -l /dev/zvol/zl-pve2-ssd1/vm-215-disk-3 /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 * 2048 124999679 124997632 59.6G 83 Linux /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p2 124999680 125827071 827392 404M 82 Linux swap / Solaris
# mount /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 /mnt/root
# cd /mnt/root
# for x in dev proc sys; do mount --rbind /$x $x; done
# chroot /mnt/root
There I could run the necessary commands to fix the boot procedure.
All done? Exit the chroot, unmount and start the VM:
# logout
# umount -l /mnt/root
# qm start 215
And at that point, things started failing miserably.
You see, in my laziness, I used umount -l
instead of four umounts for:
/mnt/root/dev
, /mnt/root/proc
,
/mnt/root/sys
and lastly /mnt/root
.
But what I was unaware of was that there were mounts inside
dev
, proc
and sys
too, that
now also got unmounted.
And that led to an array of failures:
systemd complained about binfmt_misc.automount
breakage:
systemd[1]: proc-sys-fs-binfmt_misc.automount: Got invalid poll event 16 on pipe (fd=44) systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed with result 'resources'.
pvedaemon could not bring up any VMs:
pvedaemon[32825]: <root@pam> starting task qmstart:215:root@pam: pvedaemon[46905]: start VM 215: UPID:pve2:ID:qmstart:215:root@pam: systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope: No such file or directory systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope: No such file or directory systemd[1]: 215.scope: Failed to add PIDs to scope's control group: No such file or directory systemd[1]: 215.scope: Failed with result 'resources'. systemd[1]: Failed to start 215.scope. pvedaemon[46905]: start failed: systemd job failed pvedaemon[32825]: <root@pam> end task qmstart:215:root@pam: start failed: systemd job failed
The root runtime dir could not get auto-created:
systemd[1]: user-0.slice: Failed to create cgroup /user.slice/user-0.slice: No such file or directory systemd[1]: Created slice User Slice of UID 0. systemd[1]: user-0.slice: Failed to create cgroup /user.slice/user-0.slice: No such file or directory systemd[1]: Starting User Runtime Directory /run/user/0... systemd[4139]: user-runtime-dir@0.service: Failed to attach to cgroup /user.slice/user-0.slice/user-runtime-dir@0.service: No such file or directory systemd[4139]: user-runtime-dir@0.service: Failed at step CGROUP spawning /lib/systemd/systemd-user-runtime-dir: No such file or directory
The Proxmox VE replication runner failed to start:
systemd[1]: pvesr.service: Failed to create cgroup /system.slice/pvesr.service: No such file or directory systemd[1]: Starting Proxmox VE replication runner... systemd[5538]: pvesr.service: Failed to attach to cgroup /system.slice/pvesr.service: No such file or directory systemd[5538]: pvesr.service: Failed at step CGROUP spawning /usr/bin/pvesr: No such file or directory systemd[1]: pvesr.service: Main process exited, code=exited, status=219/CGROUP systemd[1]: pvesr.service: Failed with result 'exit-code'. systemd[1]: Failed to start Proxmox VE replication runner.
And, worst of all, new ssh logins to the host machine failed:
sshd[24551]: pam_systemd(sshd:session): Failed to create session: Connection timed out sshd[24551]: error: openpty: No such file or directory sshd[31553]: error: session_pty_req: session 0 alloc failed sshd[31553]: Received disconnect from 10.x.x.x port 55190:11: disconnected by user
As you understand by now, this was my own doing, and it was caused by various missing mount points.
The failing ssh? A missing /dev/pts
.
The other failures? Mostly mounts missing in /sys/fs/cgroup
.
Fixing
First order of business was to get this machine to behave again. Luckily I had a different machine where I could take a peek at what was supposed to be mounted.
On the other machine, I ran this one-liner:
$ mount | sed -e '/ on \/\(dev\|proc\|sys\)\//!d
s#^\([^ ]*\) on \([^ ]*\) type \([^ ]*\) (\([^)]*\)).*#'\
'mountpoint -q \2 || '\
'( mkdir -p \2; mount -n -t \3 \1 -o \4 \2 || rmdir \2 )#' | sort -V
That resulted in this output that could be pasted into the one ssh shell I still had at my disposal:
mountpoint -q /dev/hugepages || ( mkdir -p /dev/hugepages; mount -n -t hugetlbfs hugetlbfs -o rw,relatime,pagesize=2M /dev/hugepages || rmdir /dev/hugepages ) mountpoint -q /dev/mqueue || ( mkdir -p /dev/mqueue; mount -n -t mqueue mqueue -o rw,relatime /dev/mqueue || rmdir /dev/mqueue ) mountpoint -q /dev/pts || ( mkdir -p /dev/pts; mount -n -t devpts devpts -o rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 /dev/pts || rmdir /dev/pts ) mountpoint -q /dev/shm || ( mkdir -p /dev/shm; mount -n -t tmpfs tmpfs -o rw,nosuid,nodev,inode64 /dev/shm || rmdir /dev/shm ) mountpoint -q /proc/sys/fs/binfmt_misc || ( mkdir -p /proc/sys/fs/binfmt_misc; mount -n -t autofs systemd-1 -o rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=45161 /proc/sys/fs/binfmt_misc || rmdir /proc/sys/fs/binfmt_misc ) mountpoint -q /sys/fs/bpf || ( mkdir -p /sys/fs/bpf; mount -n -t bpf none -o rw,nosuid,nodev,noexec,relatime,mode=700 /sys/fs/bpf || rmdir /sys/fs/bpf ) mountpoint -q /sys/fs/cgroup || ( mkdir -p /sys/fs/cgroup; mount -n -t tmpfs tmpfs -o ro,nosuid,nodev,noexec,mode=755,inode64 /sys/fs/cgroup || rmdir /sys/fs/cgroup ) mountpoint -q /sys/fs/cgroup/blkio || ( mkdir -p /sys/fs/cgroup/blkio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,blkio /sys/fs/cgroup/blkio || rmdir /sys/fs/cgroup/blkio ) mountpoint -q /sys/fs/cgroup/cpuset || ( mkdir -p /sys/fs/cgroup/cpuset; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset /sys/fs/cgroup/cpuset || rmdir /sys/fs/cgroup/cpuset ) mountpoint -q /sys/fs/cgroup/cpu,cpuacct || ( mkdir -p /sys/fs/cgroup/cpu,cpuacct; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct || rmdir /sys/fs/cgroup/cpu,cpuacct ) mountpoint -q /sys/fs/cgroup/devices || ( mkdir -p /sys/fs/cgroup/devices; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,devices /sys/fs/cgroup/devices || rmdir /sys/fs/cgroup/devices ) mountpoint -q /sys/fs/cgroup/freezer || ( mkdir -p /sys/fs/cgroup/freezer; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,freezer /sys/fs/cgroup/freezer || rmdir /sys/fs/cgroup/freezer ) mountpoint -q /sys/fs/cgroup/hugetlb || ( mkdir -p /sys/fs/cgroup/hugetlb; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,hugetlb /sys/fs/cgroup/hugetlb || rmdir /sys/fs/cgroup/hugetlb ) mountpoint -q /sys/fs/cgroup/memory || ( mkdir -p /sys/fs/cgroup/memory; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,memory /sys/fs/cgroup/memory || rmdir /sys/fs/cgroup/memory ) mountpoint -q /sys/fs/cgroup/net_cls,net_prio || ( mkdir -p /sys/fs/cgroup/net_cls,net_prio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio || rmdir /sys/fs/cgroup/net_cls,net_prio ) mountpoint -q /sys/fs/cgroup/perf_event || ( mkdir -p /sys/fs/cgroup/perf_event; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,perf_event /sys/fs/cgroup/perf_event || rmdir /sys/fs/cgroup/perf_event ) mountpoint -q /sys/fs/cgroup/pids || ( mkdir -p /sys/fs/cgroup/pids; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,pids /sys/fs/cgroup/pids || rmdir /sys/fs/cgroup/pids ) mountpoint -q /sys/fs/cgroup/rdma || ( mkdir -p /sys/fs/cgroup/rdma; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,rdma /sys/fs/cgroup/rdma || rmdir /sys/fs/cgroup/rdma ) mountpoint -q /sys/fs/cgroup/systemd || ( mkdir -p /sys/fs/cgroup/systemd; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 
/sys/fs/cgroup/systemd || rmdir /sys/fs/cgroup/systemd ) mountpoint -q /sys/fs/cgroup/unified || ( mkdir -p /sys/fs/cgroup/unified; mount -n -t cgroup2 cgroup2 -o rw,nosuid,nodev,noexec,relatime /sys/fs/cgroup/unified || rmdir /sys/fs/cgroup/unified ) mountpoint -q /sys/fs/fuse/connections || ( mkdir -p /sys/fs/fuse/connections; mount -n -t fusectl fusectl -o rw,relatime /sys/fs/fuse/connections || rmdir /sys/fs/fuse/connections ) mountpoint -q /sys/fs/pstore || ( mkdir -p /sys/fs/pstore; mount -n -t pstore pstore -o rw,nosuid,nodev,noexec,relatime /sys/fs/pstore || rmdir /sys/fs/pstore ) mountpoint -q /sys/kernel/config || ( mkdir -p /sys/kernel/config; mount -n -t configfs configfs -o rw,relatime /sys/kernel/config || rmdir /sys/kernel/config ) mountpoint -q /sys/kernel/debug || ( mkdir -p /sys/kernel/debug; mount -n -t debugfs debugfs -o rw,relatime /sys/kernel/debug || rmdir /sys/kernel/debug ) mountpoint -q /sys/kernel/debug/tracing || ( mkdir -p /sys/kernel/debug/tracing; mount -n -t tracefs tracefs -o rw,relatime /sys/kernel/debug/tracing || rmdir /sys/kernel/debug/tracing ) mountpoint -q /sys/kernel/security || ( mkdir -p /sys/kernel/security; mount -n -t securityfs securityfs -o rw,nosuid,nodev,noexec,relatime /sys/kernel/security || rmdir /sys/kernel/security )
Finishing touches:
$ for x in /sys/fs/cgroup/*; do
    test -L $x && echo ln -s $(readlink $x) $x
  done
ln -s cpu,cpuacct /sys/fs/cgroup/cpu ln -s cpu,cpuacct /sys/fs/cgroup/cpuacct ln -s net_cls,net_prio /sys/fs/cgroup/net_cls ln -s net_cls,net_prio /sys/fs/cgroup/net_prio
Running those commands returned the system to a usable state.
The real fix
Next time, I shall refrain from doing the lazy -l
umount.
But, as a better solution, I'll also be adding --make-slave
to the rbind mount command. Doing that will ensure that an
unmount in the bound locations does not unmount the original
mount points:
# for x in dev proc sys; do
    mount --rbind --make-slave /$x $x
  done
With --make-slave
a umount -l
of your chroot path does not break your system.
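You can check the propagation flags afterwards with findmnt, which has a PROPAGATION column; after --make-slave, the rbound trees should show up as slave instead of shared (a sketch):
# findmnt -R -o TARGET,PROPAGATION /mnt/root/dev | head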
2021-09-28 - a singal 17 is raised
When running the iKVM software on the BMC of SuperMicro machines, we regularly see an interesting "singal" typo.
(For the interested, we use a helper script to access the KVM console: ipmikvm. Without it, you need Java support enabled in your browser, and that has always given us trouble. The ipmikvm script logs on to the web interface, downloads the required Java bytecode and runs it locally.)
Connect to somewhere, wait for the KVM console to open, close it, and you might see something like this:
$ ipmikvm 10.x.x.x attempting login on '10.x.x.x' with user USER connect failed sd:18 Retry =1 a singal 17 is raised GetFileDevStr:4051 media_type = 40 GetFileDevStr:4051 media_type = 45 GetFileDevStr:4051 media_type = 40 GetFileDevStr:4051 media_type = 45 GetFileDevStr:4051 media_type = 40 GetFileDevStr:4051 media_type = 45 a singal 17 is raised
Signal 17 would be SIGCHLD
(see kill -l
)
which is the signal a process receives when one of its children exits.
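Quick confirmation in a shell (signal numbers are architecture-specific, but on x86 this is SIGCHLD):
$ kill -l 17
CHLD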
Where does this typo come from? A quick web search does not reveal much. But a grep through the binaries does.
$ cd ~/.local/lib/ipmikvm/iKVM__V1.69.39.0x0
$ unzip iKVM__V1.69.39.0x0.jar
$ grep -r singal . Binary file ./libiKVM64.so matches
Where exactly is this?
$ objdump -s -j .rodata libiKVM64.so | grep singal 27690 4d697363 00612073 696e6761 6c202564 Misc.a singal %d
Okay, at 0x27695 (0x27690 + 5) there's the text. Let's see if we can find some more:
$ objdump -Cd libiKVM64.so | grep -C3 27695 0000000000016f90 <signal_handle(int)@@Base>: 16f90: 89 fe mov %edi,%esi 16f92: 48 8d 3d fc 06 01 00 lea 0x106fc(%rip),%rdi # 27695 <typeinfo name for RMMisc@@Base+0x8> 16f99: 31 c0 xor %eax,%eax 16f9b: e9 98 71 ff ff jmpq e138 <printf@plt>
So, it's indeed used in a signal handler. (Address taken from the instruction pointer 0x16f99 plus offset 0x106fc.) And at least one person got the signal spelling right.
Was this relevant? No. But it's fun to poke around in the binaries. And by now we've gotten so accustomed to this message that I hope they never fix it.
2021-09-12 - mariabackup / selective table restore
When using mariabackup (xtrabackup/innobackupex) for your MySQL/MariaDB backups, you get a snapshot of the mysql lib dir. This is faster than doing an old-style mysqldump, but it is slightly more complicated to restore. Especially if you just want access to data from a single table.
Assume you have a big database, and you're backing it up like this, using the mariadb-backup package:
# ulimit -n 16384
# mariabackup \
    --defaults-file=/etc/mysql/debian.cnf \
    --backup \
    --compress --compress-threads=2 \
    --target-dir=/var/backups/mysql \
    [--parallel=8] [--galera-info]
... [00] 2021-09-12 15:23:52 mariabackup: Generating a list of tablespaces [00] 2021-09-12 15:23:53 >> log scanned up to (132823770290) [01] 2021-09-12 15:23:53 Compressing ibdata1 to /var/backups/mysql/ibdata1.qp ... [00] 2021-09-12 15:25:40 Compressing backup-my.cnf [00] 2021-09-12 15:25:40 ...done [00] 2021-09-12 15:25:40 Compressing xtrabackup_info [00] 2021-09-12 15:25:40 ...done [00] 2021-09-12 15:25:40 Redo log (from LSN 132823770281 to 132823770290) was copied. [00] 2021-09-12 15:25:40 completed OK!
Optionally followed by a:
# find /var/backups/mysql \
    -type f '!' -name '*.gpg' -print0 | sort -z |
  xargs -0 sh -exc 'for src in "$@"; do
    dst=${src}.gpg &&
    gpg --batch --yes --encrypt \
      --compression-algo none \
      --trust-model always \
      --recipient EXAMPLE_RECIPIENT \
      --output "$dst" "$src" &&
    touch -r "$src" "$dst" &&
    rm "$src" || exit $?
  done' unused_argv0
You'll end up with a bunch of qpress-compressed gpg-encrypted files, like these:
# ls -F1 /var/backups/mysql/ aria_log.00000001.qp.gpg aria_log_control.qp.gpg backup-my.cnf.qp.gpg ib_buffer_pool.qp.gpg ibdata1.qp.gpg ib_logfile0.qp.gpg my_project_1/ my_project_2/ my_project_3/ mysql/ performance_schema/ xtrabackup_binlog_info.qp.gpg xtrabackup_checkpoints.qp.gpg xtrabackup_info.qp.gpg
Let's assume we want only my_project_3.important_table
restored.
Start out by figuring out where the decryption key was at:
$ gpg --list-packets /var/backups/mysql/my_project_3/important_table.ibd.qp.gpg gpg: encrypted with 4096-bit RSA key, ID 1122334455667788, created 2017-10-10 "Example Recipient" <recipient@example.com>" gpg: decryption failed: No secret key # off=0 ctb=85 tag=1 hlen=3 plen=524 :pubkey enc packet: version 3, algo 1, keyid 1122334455667788 data: [4096 bits] # off=527 ctb=d2 tag=18 hlen=3 plen=3643 new-ctb :encrypted data packet: length: 3643 mdc_method: 2
This PGP keyid corresponds to the fingerprint of an encryption subkey:
$ gpg --list-keys --with-subkey-fingerprints recipient@example.com pub rsa4096 2017-10-10 [SC] [expires: 2021-10-13] ...some..key... uid [ unknown] Example Recipient <recipient@example.com> sub rsa4096 2017-10-10 [E] [expires: 2021-10-13] 0000000000000000000000001122334455667788 <-- here it is! sub rsa4096 2017-10-10 [A] [expires: 2021-10-13] ...some..other..key... sub rsa4096 2017-10-10 [S] [expires: 2021-10-13] ...yet..another..key..
That matches. Good.
After making sure you have the right credentials, it's time to select which files we actually need. They are:
backup-my.cnf.qp.gpg ibdata1.qp.gpg ib_logfile0.qp.gpg my_project_3/important_table.frm.qp.gpg my_project_3/important_table.ibd.qp.gpg xtrabackup_binlog_info.qp.gpg xtrabackup_checkpoints.qp.gpg xtrabackup_info.qp.gpg
Collect the files, decrypt and decompress.
Decrypting can be done
with gpg, decompressing can either be done using
qpress -dov $SRC >${SRC%.qp}
or mariabackup --decompress
--target-dir=.
(Yes, for --decompress
and
--prepare
the --target-dir=
setting means the
backup-location, i.e. where the backups are now. Slightly
confusing indeed.)
$ find . -name '*.gpg' -print0 |
    xargs -0 sh -xec 'for src in "$@"; do
      gpg --decrypt --output "${src%.gpg}" "$src" && rm "$src" || exit $?
    done' unused_argv0
$ find . -name '*.qp' -print0 |
    xargs -0 sh -xec 'for src in "$@"; do
      qpress -dov "$src" >"${src%.qp}" && rm "$src" || exit $?
    done' unused_argv0
Ok, we have files. Time to whip out the correct mariabackup, for example from a versioned Docker image.
$ docker run -it \
    -v `pwd`:/var/lib/mysql:rw mariadb:10.3.23 \
    bash
Inside the docker container, we'll fetch screen (and less), which we'll be needing shortly:
# apt-get update -qq && apt-get install -qqy screen less
Fix ownership, and "prepare" the mysql files:
# cd /var/lib/mysql
# chown -R mysql:mysql .
# su -s /bin/bash mysql
$ mariabackup --prepare --use-memory=20G --target-dir=.
(You may want to tweak that --use-memory=20G
to your
needs. For a 10GiB ib_logfile0, this setting made a world of
difference: 10 minutes restore time, instead of infinite.)
(Also, mariabackup has a --databases="DB[.TABLE1][ DB.TABLE2
...]"
option that might come in handy if you're working with all
files during the --prepare
phase.)
mariabackup based on MariaDB server 10.3.23-MariaDB debian-linux-gnu (x86_64) [00] 2021-09-12 16:04:30 cd to /var/lib/mysql/ ... 2021-09-12 16:04:30 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=132823770281 2021-09-12 16:04:30 0 [Note] InnoDB: Last binlog file 'mysql-bin.000008', position 901 [00] 2021-09-12 16:04:30 Last binlog file mysql-bin.000008, position 901 [00] 2021-09-12 16:04:31 completed OK!
At this point we don't need to copy/move them to
/var/lib/mysql
. We're there already.
All set, fire up a screen (or tmux, or whatever) and start mysqld, explicitly ignoring mysql permissions.
$ screen
$ mysqld --skip-grant-tables 2>&1 | tee /tmp/mysql-boot-error.log | grep -vE '\[ERROR\]|Ignoring tablespace'
2021-09-12 16:05:56 0 [Note] mysqld (mysqld 10.3.23-MariaDB-1:10.3.23+maria~bionic-log) starting as process 526 ... ... 2021-09-12 16:05:56 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 50331648 bytes 2021-09-12 16:05:56 0 [Note] InnoDB: Setting log file ./ib_logfile1 size to 50331648 bytes 2021-09-12 16:05:57 0 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0 2021-09-12 16:05:57 0 [Note] InnoDB: New log files created, LSN=132823770290 ...
At this point the screen would get flooded with the following error
messages, if it weren't for the grep -v
:
2021-09-12 16:05:57 0 [ERROR] InnoDB: Operating system error number 2 in a file operation. 2021-09-12 16:05:57 0 [ERROR] InnoDB: The error means the system cannot find the path specified. 2021-09-12 16:05:57 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them. 2021-09-12 16:05:57 0 [ERROR] InnoDB: Cannot open datafile for read-only: './my_project_1/aboutconfig_item.ibd' OS error: 71 2021-09-12 16:05:57 0 [ERROR] InnoDB: Operating system error number 2 in a file operation. 2021-09-12 16:05:57 0 [ERROR] InnoDB: The error means the system cannot find the path specified. 2021-09-12 16:05:57 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them. 2021-09-12 16:05:57 0 [ERROR] InnoDB: Could not find a valid tablespace file for ``my_project_1`.`aboutconfig_item``. Please refer to https://mariadb.com/kb/en/innodb-data-dictionary-troubleshooting/ for how to resolve the issue. 2021-09-12 16:05:57 0 [Warning] InnoDB: Ignoring tablespace for `my_project_1`.`aboutconfig_item` because it could not be opened.
You'll get those for every table that you didn't include. Let's just ignore them.
Finally, when mysqld is done plowing through the (possibly big) ibdata1
, it should read something like:
... 2021-09-12 16:05:57 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory") 2021-09-12 16:05:57 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist 2021-09-12 16:05:57 0 [Note] Server socket created on IP: '127.0.0.1'. 2021-09-12 16:05:57 0 [Warning] Can't open and lock time zone table: Table 'mysql.time_zone_leap_second' doesn't exist trying to live without them 2021-09-12 16:05:57 7 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory") 2021-09-12 16:05:57 0 [Note] Reading of all Master_info entries succeeded 2021-09-12 16:05:57 0 [Note] Added new Master_info '' to hash table 2021-09-12 16:05:57 0 [Note] mysqld: ready for connections. Version: '10.3.23-MariaDB-1:10.3.23+maria~bionic-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
At this point, you can fire up a mysql
or mysqldump
client and extract the needed data.
MariaDB [(none)]> show databases; +--------------------+ | Database | +--------------------+ | information_schema | | my_project_3 | +--------------------+
MariaDB [(none)]> select count(*) from my_project_3.important_table; +----------+ | count(*) | +----------+ | 6 | +----------+
Good, we have the data. And we didn't need to decrypt/decompress everything.
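If you'd rather have the rows as SQL than poke around interactively, a plain mysqldump from inside the same container should do (using the example table from above):
$ mysqldump --skip-lock-tables my_project_3 important_table > /tmp/important_table.sql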
Stopping the mysqld is a matter of: mysqladmin shutdown
2021-08-31 - apt / downgrading back to current release
If you're running an older Debian or Ubuntu, you may sometimes want to check out a newer version of a package, to see if a particular bug has been fixed.
I know, this is not supported, but this scheme Generally Works (*):
- replace the current release name in /etc/apt/sources.list with the next release — e.g. from bionic to focal
- do an apt-get update
- and an apt-get install SOME-PACKAGE
You can then test the package after putting the original sources.list back, so the rest of the system doesn't get upgraded by accident. (Don't forget this step.)
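As a concrete sketch of that dance (assuming a bionic to focal hop and a single sources.list):
# sed -i 's/\bbionic\b/focal/g' /etc/apt/sources.list
# apt-get update && apt-get install SOME-PACKAGE
# sed -i 's/\bfocal\b/bionic/g' /etc/apt/sources.list   # put it back right away
# apt-get update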
Once you know whether you want this newer package or not, you can decide to get your system back into original state. This is a matter of downgrading the appropriate package(s).
For example:
# apt-cache policy gdb gdb: Installed: 9.2-0ubuntu1~20.04 Candidate: 9.2-0ubuntu1~20.04 Version table: *** 9.2-0ubuntu1~20.04 100 100 /var/lib/dpkg/status 8.1.1-0ubuntu1 500 500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages 8.1-0ubuntu3 500 500 http://MIRROR/ubuntu bionic/main amd64 Packages
# apt-get install gdb=8.1.1-0ubuntu1 The following packages will be DOWNGRADED: gdb Do you want to continue? [Y/n]
If you use apt-find-foreign you might notice there are a bunch of packages that need downgrading back to the original state:
# apt-find-foreign Lists with corresponding package counts: 22 (local only) 296 http://MIRROR/ubuntu Lists with very few packages (or local only): (local only) - binutils - binutils-common - binutils-x86-64-linux-gnu - gcc-10-base - gdb - libbinutils - libc-bin - libc6 - libc6-dbg - libcrypt1 - libctf-nobfd0 - libctf0 - libffi7 - libgcc-s1 - libidn2-0 - libncursesw6 - libpython3.8 - libpython3.8-minimal - libpython3.8-stdlib - libreadline8 - libtinfo6 - locales
Looking up the right versions for all those packages we just dragged in sounds like tedious work.
Luckily we can convince
apt to do this for us, using a temporary
/etc/apt/preferences.d/force_downgrade_to_bionic.pref
:
Package: *
Pin: release n=bionic*
Pin-Priority: 1000
With priority 1000, apt will prefer the Bionic release so much that it suggests downgrades:
# apt-get dist-upgrade The following packages will be REMOVED: libcrypt1 libctf0 The following packages will be DOWNGRADED: binutils binutils-common binutils-x86-64-linux-gnu gdb libbinutils libc-bin libc6 libc6-dbg libidn2-0 libpython3.8 libpython3.8-minimal libpython3.8-stdlib locales Do you want to continue? [Y/n]
Make sure you remove force_downgrade_to_bionic.pref
afterwards.
2021-08-15 - k8s / lightweight redirect
Spinning up pods just for parked/redirect sites? I think not.
Recently, I had to HTTP(S)-redirect a handful of hostnames to elsewhere. Pointing them into our well-maintained K8S cluster was the easy thing to do. It would manage LetsEncrypt certificates automatically using cert-manager.io.
From the cluster, I could spin up a service and an nginx deployment with a bunch of redirect/302 rules.
However, spinning up one or more nginx instances just to have it do simple redirects sounds like overkill. After all, the ingress controller already runs nginx. Why not have it do the 302s?
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: http01-issuer
    nginx.ingress.kubernetes.io/server-snippet: |
      # Do not match .well-known, or we get no certs.
      location ~ ^/($|[^.]) {
        return 302 https://target-url.example.com;
        #return 302 https://target-url.example.com/$1;
        #add_header Content-Type text/plain;
        #return 200 'This domain is parked';
      }
  name: my-redirects
spec:
  rules:
  - host: this-domain-redirects.example.com
    http:
      paths:
      - backend:
          serviceName: dummy
          servicePort: 65535
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - this-domain-redirects.example.com
    secretName: this-domain-redirects.example.com--tls
Using the nginx.ingress.kubernetes.io/server-snippet
we
can hook in a custom nginx snippet that does the redirecting
for us.
Adding the http
with backend
config in the
rules
is mandatory. But you can point it to a non-existent
dummy service, as seen above.
And if you want parked domains instead of redirected
domains, simply replace return 302
https://target-url.example.com;
with return 200 'This
domain is parked';
.
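Once the Ingress is applied, verifying the redirect from the outside is a one-liner (hostname from the example above); it should print a 302 plus the Location header:
$ curl -sI https://this-domain-redirects.example.com | grep -iE '^(HTTP|location)'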
2021-08-13 - move the mouse programmatically / x11
“Can I move the mouse cursor in X programmatically?”, asked my son.
Python version
Yes you can. Here's a small Python snippet that will work on X with python-xlib:
import os
import time

import Xlib.display

display = Xlib.display.Display()

# XWarpPointer won't work on Wayland.
assert os.environ.get('XDG_SESSION_TYPE', 'x11') == 'x11', os.environ

win = display.screen()['root']            # entire desktop window
#win = display.get_input_focus().focus    # or, focused app window

for i in range(0, 1000):
    win.warp_pointer(x=i, y=i)
    display.flush()
    time.sleep(0.01)
Nothing too fancy. But getting working
(XFlush()
!) examples of code interfacing with the X server
is not always easy. Another small example on the web — this
right here — helps.
In all fairness, he did wonder if this was possible in (Java) Processing. It should be.
Processing version
Indeed, in (Java) Processing it's actually less complicated, but requires the oddly named "Robot" class.
import java.awt.AWTException;
import java.awt.Robot;

Robot robby;
int mouse_x;
int mouse_y;

void setup() {
  try {
    robby = new Robot();
  } catch (AWTException e) {
    /* Catching exceptions is mandatory in Java */
    println("Robot class not supported by your system!");
    println(e);
    exit();
  }
}

void draw() {
  mouse_x += 10;
  mouse_y += 10;
  if (mouse_x > displayWidth) mouse_x = 0;
  if (mouse_y > displayHeight) mouse_y = 0;
  set_mouse_pointer(mouse_x, mouse_y);
}

void set_mouse_pointer(int x, int y) {
  robby.mouseMove(x, y);
}
Ah, that was too easy. But can we do native X11 calls?
Native X11 Processing version
Yes! Of course we can.
You'll need to fetch the Java Native Access (JNA) Platform
libraries and install them into an appropriate place for
Processing. It looks like one can fetch two jars from github.com/java-native-access/jna:
jna-5.8.0.jar
and
jna-platform-5.8.0.jar
These jars go into your .../sketchbook/libraries
like this:
sketchbook/
  libraries/
    com_sun_jna/
      library/
        jna-5.8.0.jar
        jna-platform-5.8.0.jar
But, you need to have one jar here with the name
com_sun_jna.jar
(like the directory). Otherwise
Processing will complain. Symlinking
jna-platform-5.8.0.jar
works nicely:
ln -s jna-platform-5.8.0.jar com_sun_jna.jar
Now that that's out of the way, you should be able to fire up Processing and run the following snippet. It's a bit longer, but it has the same access that the Python version has:
import com.sun.jna.Native;
import com.sun.jna.platform.unix.X11;

/* Declare what XWarpPointer looks like in libX11.so */
interface ExtendedX11 extends X11 /* or "Library" */ {
  int XWarpPointer(
    X11.Display dpy, X11.Window src_win, X11.Window dest_win,
    int src_x, int src_y, int src_width, int src_height,
    int dest_x, int dest_y);
}

/* We could do this, and get access to XOpenDisplay and others */
//X11 x11 = X11.INSTANCE;

/* Instead, we'll get everything that X11.INSTANCE would have
 * _plus_ XWarpPointer */
ExtendedX11 x11;
X11.Display x11display;
X11.Window x11window;
int mouse_x;
int mouse_y;

void setup() {
  /* Manually import libX11.so to get at XWarpPointer */
  x11 = (ExtendedX11)Native.loadLibrary("X11", ExtendedX11.class);
  x11display = x11.XOpenDisplay(null);
  x11window = x11.XDefaultRootWindow(x11display);
}

void draw() {
  mouse_x += 10;
  mouse_y += 10;
  if (mouse_x > displayWidth) mouse_x = 0;
  if (mouse_y > displayHeight) mouse_y = 0;
  set_mouse_pointer(mouse_x, mouse_y);
}

void set_mouse_pointer(int x, int y) {
  x11.XWarpPointer(x11display, null, x11window, 0, 0, 0, 0, x, y);
  x11.XFlush(x11display);
}
Obviously this native version doesn't work on Windows. But this one was more fun than the AWT version, no?
2021-08-06 - migrating vm interfaces / eth0 to ens18
How about finally getting rid of eth0 and eth1 in those ancient Ubuntu VMs that you keep upgrading?
Debian and Ubuntu have been doing a good job at keeping the old names during upgrades. But it's time to move past that.
We expect ens18 and ens19 now. There's no need to hang on to the past. (And you have moved on to Netplan already, yes?)
Steps:
rm /etc/udev/rules.d/80-net-setup-link.rules
update-initramfs -u
rm /etc/systemd/network/50-virtio-kernel-names.link
- Update all references in /etc to the new ens18 style names (see the grep sketch below).
- Reboot.
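For the "update all references" step, a quick grep finds whatever still mentions the old names (a sketch):
# grep -rlE '\beth[0-9]\b' /etc/ 2>/dev/null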
2021-08-05 - kioxia nvme / num_err_log_entries 0xc004 / smartctl
So, these new Kioxia NVMe drives were incrementing the num_err_log_entries as soon as they were inserted into the machine. But the error said INVALID_FIELD. What gives?
In contrast to the other (mostly Intel) drives, these drives started
incrementing the num_err_log_entries
as soon as they were
plugged in:
# nvme smart-log /dev/nvme21n1 Smart Log for NVME device:nvme21n1 namespace-id:ffffffff ... num_err_log_entries : 932
The relevant errors should be readable in the error-log. All 64 errors in the log looked the same:
error_count : 932 sqid : 0 cmdid : 0xc status_field : 0xc004(INVALID_FIELD) parm_err_loc : 0x4 lba : 0xffffffffffffffff nsid : 0x1 vs : 0
INVALID_FIELD, what is this?
The error count kept increasing regularly — like clockwork actually. And the internet gave us no clues what this might be.
It turns out it was our monitoring. The Zabbix scripts we
employ fetch drive health status values from various sources.
And one of the things they do, is run smartctl -a
on all
drives.
And for every such call, the error count was incremented.
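This is easy to confirm by hand: read the counter, poke the drive with smartctl, read it again:
# nvme smart-log /dev/nvme21n1 | grep ^num_err
# smartctl -a /dev/nvme21n1 >/dev/null
# nvme smart-log /dev/nvme21n1 | grep ^num_err   # one higher than before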
# nvme list Node SN Model FW Rev ------------- ------------ ------------------- -------- ... /dev/nvme20n1 PHLJ9110xxxx INTEL SSDPE2KX010T8 VDV10131 /dev/nvme21n1 X0U0A02Dxxxx KCD6DLUL3T84 0102 /dev/nvme22n1 X0U0A02Jxxxx KCD6DLUL3T84 0102
If we run it on the Intel drive, we get this:
# smartctl -a /dev/nvme20n1 ... Model Number: INTEL SSDPE2KX010T8 ... === START OF SMART DATA SECTION === Read NVMe SMART/Health Information failed: NVMe Status 0x4002
# nvme smart-log /dev/nvme20n1 | grep ^num_err num_err_log_entries : 0
# nvme error-log /dev/nvme20n1 | head -n12 Error Log Entries for device:nvme20n1 entries:64 ................. Entry[ 0] ................. error_count : 0 sqid : 0 cmdid : 0 status_field : 0(SUCCESS) parm_err_loc : 0 lba : 0 nsid : 0 vs : 0
But on the Kioxias, we get this:
# smartctl -a /dev/nvme21n1 ... Model Number: KCD6DLUL3T84 ... === START OF SMART DATA SECTION === Read NVMe SMART/Health Information failed: NVMe Status 0x6002
# nvme smart-log /dev/nvme21n1 | grep ^num_err num_err_log_entries : 933
# nvme error-log /dev/nvme21n1 | head -n12 Error Log Entries for device:nvme21n1 entries:64 ................. Entry[ 0] ................. error_count : 933 sqid : 0 cmdid : 0x6 status_field : 0xc004(INVALID_FIELD) parm_err_loc : 0x4 lba : 0xffffffffffffffff nsid : 0x1 vs : 0
Apparently the Kioxia drive does not like what smartctl is sending.
Luckily this turned out to be an issue that smartctl claims responsibility for. And it had already been fixed.
If this works, the problem is that this drive requires that the broadcast namespace is specified if SMART/Health and Error Information logs are requested. This issue was unspecified in early revisions of the NVMe standard.
In our case, applying this fix was easy on this Ubuntu/Bionic machine:
# apt-cache policy smartmontools smartmontools: Installed: 6.5+svn4324-1ubuntu0.1 Candidate: 6.5+svn4324-1ubuntu0.1 Version table: 7.0-0ubuntu1~ubuntu18.04.1 100 100 http://MIRROR/ubuntu bionic-backports/main amd64 Packages *** 6.5+svn4324-1ubuntu0.1 500 500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages 100 /var/lib/dpkg/status
# apt-get install smartmontools=7.0-0ubuntu1~ubuntu18.04.1
This smartmontools update from 6.5 to 7.0 not only got rid of the new errors, it also showed more relevant health output.
Now if we could just reset the error-log count on the drives, then this would be even better...
2021-06-18 - openssl / error 42 / certificate not yet valid
In yesterday's post about not being able to connect to the SuperMicro iKVM IPMI, I wondered “why stunnel/openssl did not send error 45 (certificate_expired) for a not-yet-valid certificate.” Here's a closer examination.
Quick recap: yesterday, I got SSL alert/error 42 as response to a client certificate that was not yet valid. The server was living in 2015 and refused to accept a client certificate that would be valid first in 2016. That error 42 code could mean anything, so checking the period of validity of the certificate was not something that occurred to me immediately.
Here are the SSL alert codes in the 40s, taken from RFC 5246 7.2+:
code | identifier | description |
---|---|---|
40 | handshake_failure | The sender was unable to negotiate an acceptable set of security parameters with the available options. |
41 | no_certificate | This alert was used in SSLv3 but not any version of TLS. (Don't use this.) |
42 | bad_certificate | A certificate was corrupt, contained signatures that did not verify correctly, etc. |
43 | unsupported_certificate | A certificate was of an unsupported type. |
44 | certificate_revoked | A certificate was revoked by its signer. |
45 | certificate_expired | A certificate has expired or is not currently valid. |
46 | certificate_unknown | Some other (unspecified) issue arose in processing the certificate, rendering it unacceptable. |
47 | illegal_parameter | A field in the handshake was out of range or inconsistent with other fields. |
48 | unknown_ca | A valid certificate chain or partial chain was received, but the certificate was rejected because it was not signed by a trusted CA. |
49 | access_denied | A valid certificate or PSK was received, but when access control was applied, the sender decided not to proceed with negotiation. |
I would argue that a certificate that is not valid yet would be better off with error 45 than error 42. After all, the description from the RFC includes the phrase: “or is not currently valid.”
It turns out it was OpenSSL that opted for the more generic 42.
Testing
Here's how you generate one too old and one too new certificate:
$ CURDATE=$(date -R) &&
    sudo date --set="$(date -R -d'-2 days')" &&
    openssl req -batch -x509 -nodes -newkey rsa:4096 -days 1 \
      -keyout not-valid-anymore.key -out not-valid-anymore.crt &&
    sudo date --set="$CURDATE"
$ CURDATE=$(date -R) &&
    sudo date --set="$(date -R -d'+2 days')" &&
    openssl req -batch -x509 -nodes -newkey rsa:4096 -days 1 \
      -keyout not-yet-valid.key -out not-yet-valid.crt &&
    sudo date --set="$CURDATE"
You can then use openssl s_server and openssl s_client to test the libssl behaviour:
$ openssl s_server -port 8443 \
    -cert server.crt -key server.key \
    -CAfile not-valid-anymore.crt \
    -verify_return_error -verify 1
$ openssl s_client -connect 127.0.0.1:8443 \
    -servername whatever -tls1_2 -showcerts -debug \
    -cert not-valid-anymore.crt -key not-valid-anymore.key
...
read from 0x55ca16174150 [0x55ca161797a8] (2 bytes => 2 (0x2))
0000 - 02 2d                                             .-
140480733312320:error:14094415:SSL routines:ssl3_read_bytes:
    sslv3 alert certificate expired:../ssl/record/rec_layer_s3.c:1543:
SSL alert number 45
So, error 45 (certificate_expired) for the not valid anymore case.
And for the not yet valid case?
$ openssl s_server -port 8443 \
    -cert server.crt -key server.key \
    -CAfile not-yet-valid.crt \
    -verify_return_error -verify 1
$ openssl s_client -connect 127.0.0.1:8443 \
    -servername whatever -tls1_2 -showcerts -debug \
    -cert not-yet-valid.crt -key not-yet-valid.key
...
read from 0x55be814cd150 [0x55be814d27a8] (2 bytes => 2 (0x2))
0000 - 02 2a                                             .*
140374994916672:error:14094412:SSL routines:ssl3_read_bytes:
    sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543:
SSL alert number 42
Ah, there's that pesky number 42 again.
Source code
Ultimately, this number is produced during the translation from
internal OpenSSL X509 errors to SSL/TLS alerts. Previously
in ssl_verify_alarm_type()
, nowadays in ssl_x509err2alert()
.
Traced back to:
commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall
Date:   Mon Dec 21 10:52:47 1998 +0000

    Import of old SSLeay release: SSLeay 0.8.1b
In ssl/s3_both.c
there is:
int ssl_verify_alarm_type(type)
int type;
    {
    int al;

    switch(type)
        {
    case X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT:
    // ...
    case X509_V_ERR_CERT_NOT_YET_VALID:
    // ...
        al=SSL3_AD_BAD_CERTIFICATE;
        break;
    case X509_V_ERR_CERT_HAS_EXPIRED:
        al=SSL3_AD_CERTIFICATE_EXPIRED;
        break;
And more recently, in ssl/statem/statem_lib.c
translated by
ssl_x509err2alert()
:
static const X509ERR2ALERT x509table[] = {
    {X509_V_ERR_APPLICATION_VERIFICATION, SSL_AD_HANDSHAKE_FAILURE},
    {X509_V_ERR_CA_KEY_TOO_SMALL, SSL_AD_BAD_CERTIFICATE},
    // ...
    {X509_V_ERR_CERT_NOT_YET_VALID, SSL_AD_BAD_CERTIFICATE},
    // ...
    {X509_V_ERR_CRL_HAS_EXPIRED, SSL_AD_CERTIFICATE_EXPIRED},
    // ...
}
Apparently behaviour has been like this since 1998 and before. A.k.a. since forever. I guess we'll have to keep the following list in mind next time we encounter error 42:
X509_V_ERR_CA_KEY_TOO_SMALL X509_V_ERR_CA_MD_TOO_WEAK X509_V_ERR_CERT_NOT_YET_VALID X509_V_ERR_CERT_REJECTED X509_V_ERR_CERT_UNTRUSTED X509_V_ERR_CRL_NOT_YET_VALID X509_V_ERR_DANE_NO_MATCH X509_V_ERR_EC_KEY_EXPLICIT_PARAMS X509_V_ERR_EE_KEY_TOO_SMALL X509_V_ERR_EMAIL_MISMATCH X509_V_ERR_ERROR_IN_CERT_NOT_AFTER_FIELD X509_V_ERR_ERROR_IN_CERT_NOT_BEFORE_FIELD X509_V_ERR_ERROR_IN_CRL_LAST_UPDATE_FIELD X509_V_ERR_ERROR_IN_CRL_NEXT_UPDATE_FIELD X509_V_ERR_HOSTNAME_MISMATCH X509_V_ERR_IP_ADDRESS_MISMATCH X509_V_ERR_UNABLE_TO_DECODE_ISSUER_PUBLIC_KEY X509_V_ERR_UNABLE_TO_DECRYPT_CERT_SIGNATURE X509_V_ERR_UNABLE_TO_DECRYPT_CRL_SIGNATURE
That is, assuming you're talking to libssl (OpenSSL). But that's generally the case.
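So, next time alert 42 shows up, a quick look at both ends of the validity window may save some head scratching:
$ date
$ openssl x509 -in client.crt -noout -startdate -enddate
The current date must fall between notBefore and notAfter.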
P.S. The GPLv2 OpenSSL replacement WolfSSL
appears to do the right thing in DoCertFatalAlert()
,
returning certificate_expired
for both
ASN_AFTER_DATE_E
and ASN_BEFORE_DATE_E
. Yay!
2021-06-17 - supermicro / ikvm / sslv3 alert bad certificate
Today I was asked to look at a machine that disallowed iKVM IPMI console access. It allowed access through the “iKVM/HTML5”, but when connecting using the “Console Redirection” (Java client, see also ipmikvm) it would quit after 10 failed attempts.
TL;DR: The clock of the machine had been reset to a timestamp earlier than the first validity of the supplied client certificate. After changing the BMC time from 2015 to 2021, everything worked fine again.
If you're interested, here are some steps I took to debug the situation:
Debugging
My attempts at logging in started like this:
$ ipmikvm 10.1.2.3 ... Retry =1 Retry =2 Retry =3
Okay, no good. Luckily syslog had some info (some lines elided):
Service [ikvm] accepted (FD=3) from 127.0.0.1:40868 connect_blocking: connected 10.1.2.3:5900 Service [ikvm] connected remote server from 10.0.0.2:38222 Remote socket (FD=9) initialized SNI: sending servername: 10.1.2.3 ... CERT: Locally installed certificate matched Certificate accepted: depth=0, /C=US/ST=California/L=San Jose/O=Super Micro Computer /OU=Software/CN=IPMI/emailAddress=Email SSL state (connect): SSLv3 read server certificate A ... SSL state (connect): SSLv3 write client certificate A ... SSL state (connect): SSLv3 write finished A SSL state (connect): SSLv3 flush data SSL alert (read): fatal: bad certificate SSL_connect: 14094412: error:14094412: SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket Remote socket (FD=9) closed Local socket (FD=3) closed Service [ikvm] finished (0 left)
The Java iKVM client uses an stunnel sidecar to take care of the TLS bits, hence the extra connection to 127.0.0.1. Right now, that's not important. What is important is the “SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate” message. Apparently someone disagrees with something in the TLS handshake.
First question: is the client or the server to blame? The log isn't totally clear on that. Let's find out who disconnects.
Rerunning ipmikvm 10.1.2.3
but with sh -x
shows us how to invoke the client:
$ sh -x /usr/bin/ipmikvm 10.1.2.3 ... + dirname /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar + exec java -Djava.library.path=/home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0 \ -cp /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar \ tw.com.aten.ikvm.KVMMain 10.1.2.3 RXBoZW1lcmFsVXNlcg== TGlrZUknZFRlbGxZb3U= \ null 63630 63631 0 0 1 5900 623
We can rerun that one and see what it does by running it through strace
and redirecting the syscalls to tmp.log. (The output is large enough
that you don't want it on your console, trust me.)
$ strace -s 8192 -ttf \ java -Djava.library.path=/home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0 \ -cp /home/walter/.local/lib/ipmikvm/iKVM__V1.69.38.0x0/iKVM__V1.69.38.0x0.jar \ tw.com.aten.ikvm.KVMMain 10.1.2.3 RXBoZW1lcmFsVXNlcg== TGlrZUknZFRlbGxZb3U= \ null 63630 63631 0 0 1 5900 623 \ 2> tmp.log connect failed sd:18 Retry =1 ^C
We expect a tcp connect()
to 10.1.2.3
:
[pid 130553] 08:50:30.418890 connect( 9, {sa_family=AF_INET, sin_port=htons(5900), sin_addr=inet_addr("10.1.2.3")}, 16) = -1 EINPROGRESS (Operation now in progress)
Not quite connected yet, but it's apparently non-blocking. Scrolling downward in pid 130553 we see:
[pid 130553] 08:50:30.419037 poll( [{fd=9, events=POLLIN|POLLOUT}], 1, 10000) = 1 ([{fd=9, revents=POLLOUT}])
Good, it's connected. Now following the read/write
(usually
recv/send
or a similar syscall, but not this time) on fd 9 shows
us:
[pid 130553] 08:50:30.684609 read( 9, "\25\3\3\0\2", 5) = 5 [pid 130553] 08:50:30.684664 read( 9, "\2*", 2) = 2 [pid 130553] 08:50:30.684775 sendto( 6, "<31>Jun 17 08:50:30 stunnel: LOG7[130546:139728830977792]: SSL alert (read): fatal: bad certificate", 99, MSG_NOSIGNAL, NULL, 0) = 99 ... [pid 130553] 08:50:30.685333 close(9) = 0
So, the client is the one closing the connection after receiving
"\2"
(2) and "*"
(0x2A, 42 decimal).
OpenSSL errors
We can go back in the strace output and look for certificates used:
[pid 130519] 08:50:29.203145 openat( AT_FDCWD, "/tmp/1623912629200/client.crt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 18 [pid 130519] 08:50:29.203497 write( 18, "-----BEGIN CERTIFICATE-----\nMIID7TCCAt...) = 1424
The sidecar setup writes a client.crt
,
client.key
and server.crt
.
If we extract their contents from the strace output and write them to a file, we can use the openssl s_client to connect directly and get additional debug info:
$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug
...
read from 0x55bbada227a0 [0x55bbada2a708] (2 bytes => 2 (0x2))
0000 - 02 28                                             .(
140555689141568:error:14094410:SSL routines:ssl3_read_bytes:
    sslv3 alert handshake failure:../ssl/record/rec_layer_s3.c:1543:
SSL alert number 40
So, not supplying a client certificate gets us an error 40 (0x28), followed by a disconnect from the server (read returns -1). This is fine. Error 40 (handshake_failure) means that one or more security parameters were bad. In this case because we didn't supply the client certificate.
What happens if we send a self-generated client certificate?
$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug -cert CUSTOMCERT.crt -key CUSTOMCERT.key
...
read from 0x5604603d7750 [0x5604603dcd68] (2 bytes => 2 (0x2))
0000 - 02 30                                             .0
139773856281920:error:14094418:SSL routines:ssl3_read_bytes:
    tlsv1 alert unknown ca:../ssl/record/rec_layer_s3.c:1543:
SSL alert number 48
Error 48 (unknown_ca). That makes sense, as the server does not know the CA of our custom generated certificate.
Lastly with the correct certificate, we get an error 42 (0x2A):
$ openssl s_client -connect 10.1.2.3:5900 -servername 10.1.2.3 \
    -showcerts -debug -cert client.crt -key client.key
...
read from 0x556b27ca7cd0 [0x556b27cad2e8] (2 bytes => 2 (0x2))
0000 - 02 2a                                             .*
140701791647040:error:14094412:SSL routines:ssl3_read_bytes:
    sslv3 alert bad certificate:../ssl/record/rec_layer_s3.c:1543:
SSL alert number 42
Error 42 is bad_certificate
, with this description from RFC 5246 (7.2.2):
bad_certificate A certificate was corrupt, contained signatures that did not verify correctly, etc.
We're now quite certain it's our client certificate that is being
rejected. But we're no closer to the reason why. If we openssl-verify
client.crt
locally, it appears to be just fine.
Upgrading and inspecting the BMC firmware
This particular motherboard — X10SRD-F — already had the
latest Firmware according to the SuperMicro
BIOS IPMI downloads:
REDFISH_X10_388_20200221_unsigned.zip
As a last ditch effort, we checked if we could upgrade to a newer version. After all, in the changelog for version 3.90 (and similar for 3.89) it said:
7. Corrected issues with KVM console.
Ignoring the fact that version 3.89 was not listed for our hardware, we went ahead and upgraded to 3.89. That went smoothly, but the problem persisted.
Upgrade to 3.90 then? Or maybe there is something else we're overlooking. Let's see if we can dissect the firmware:
$ unzip ../REDFISH_X10_390_20200717_unsigned.zip Archive: ../REDFISH_X10_390_20200717_unsigned.zip inflating: REDFISH_X10_390_20200717_unsigned.bin ...
$ binwalk REDFISH_X10_390_20200717_unsigned.bin DECIMAL HEXADECIMAL DESCRIPTION -------------------------------------------------------------------------------- 103360 0x193C0 CRC32 polynomial table, little endian 4194304 0x400000 CramFS filesystem, little endian, size: 15253504, version 2, sorted_dirs, CRC 0x4148A5CC, edition 0, 8631 blocks, 1100 files 20971520 0x1400000 uImage header, header size: 64 bytes, header CRC: 0xC3B2AF42, created: 2020-07-17 09:02:52, image size: 1537698 bytes, Data Address: 0x40008000, Entry Point: 0x40008000, data CRC: 0x4ACB7660, OS: Linux, CPU: ARM, image type: OS Kernel Image, compression type: gzip, image name: "21400000" 20971584 0x1400040 gzip compressed data, maximum compression, has original file name: "linux.bin", from Unix, last modified: 2020-07-17 08:56:49 24117248 0x1700000 CramFS filesystem, little endian, size: 7446528, version 2, sorted_dirs, CRC 0x1D3A953F, edition 0, 3089 blocks, 456 files
4194304 and 24117248 are both multiples of 4096 (0x1000), as is obvious
from the zeroes in the hexadecimal column. That speeds up the dd step a bit:
$ dd if=REDFISH_X10_390_20200717_unsigned.bin \
    bs=$(( 0x1000 )) skip=$(( 0x400000 / 0x1000 )) \
    count=$(( (0x1400000 - 0x400000) / 0x1000 )) \
    of=cramfs1.bin
$ dd if=REDFISH_X10_390_20200717_unsigned.bin \
    bs=$(( 0x1000 )) skip=$(( 0x1700000 / 0x1000 )) \
    of=cramfs2.bin
$ du -b cramfs* 16777216 cramfs1.bin 9437184 cramfs2.bin
We can mount these and inspect their contents:
$ sudo mkdir /mnt/cramfs{1,2}
$ sudo mount -t cramfs ./cramfs1.bin /mnt/cramfs1
$ sudo mount -t cramfs ./cramfs2.bin /mnt/cramfs2
cramfs1.bin
contains a Linux filesystem with an stunnel configuration:
$ ls /mnt/cramfs1/ bin linuxrc proc SMASH var dev lost+found run sys web etc mnt sbin tmp lib nv share usr
$ ls /mnt/cramfs1/etc/stunnel/ client.crt server.key server.crt stunnel.conf
This also looks sane. The server.key
matches
the server.crt
we already had. And the
client.crt
also matches what we had. And any and all
validation on these just succeeds.
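(Checking such a key/certificate match is a matter of comparing public key moduli, assuming RSA keys; identical hashes mean the key belongs to the certificate:)
$ openssl x509 -noout -modulus -in server.crt | openssl md5
$ openssl rsa -noout -modulus -in server.key | openssl md5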
cramfs2.bin
then?
$ ls /mnt/cramfs2/ cgi cgi-bin css extract.in iKVM__V1.69.38.0x0.jar.pack.gz ...
This looks like the webserver contents on https://10.1.2.3
.
(Interestingly the iKVM__V1.69.38.0x0.jar.pack.gz
file differs
between 3.88 and 3.89 and 3.90, but that turned out to be of no significance.)
Peering into the jar yielded no additional clues unfortunately:
$ unpack200 /mnt/cramfs2/iKVM__V1.69.38.0x0.jar.pack.gz iKVM__V1.69.38.0x0.jar
$ unzip iKVM__V1.69.38.0x0.jar Archive: iKVM__V1.69.38.0x0.jar PACK200 inflating: META-INF/MANIFEST.MF ...
$ ls res/*.crt res/*.key res/client.crt res/client.key res/server.crt
Same certificates. Everything matches.
Date and Time
At this point I'm giving up. I had tried the Syslog option in the BMC, which had given me zero output thus far. I had tried replacing the webserver certificate. Upgrading the BMC...
Out of ideas I'm mindlessly clicking around in the web interface. This landed me at Configuration -> Date and Time. Apparently the local date was set to somewhere in the year 2015.
We might as well fix that and try one last time.
Yes! After fixing the date, connecting suddenly worked.
Immediately all pieces fit together:
$ openssl x509 -in client.crt -noout -text | grep Validity -A3 Validity Not Before: May 19 09:46:36 2016 GMT Not After : May 17 09:46:36 2026 GMT Subject: C = US, ST = California, L = San Jose, O = Super Micro Computer, OU = Software, CN = IPMI, emailAddress = Email
Crap. The server had been looking at a “not yet valid” certificate the whole time. The certificate would be valid between 2016 and 2026, but the server was still living in the year 2015.
I wonder why stunnel/openssl did not send error 45 (certificate_expired). After all, the RFC says “a certificate has expired or is not currently valid” (my emphasis). That would've pointed us to the cause immediately.
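(Once you know where to look, the mismatch is trivial to demonstrate. The certificate's notBefore lies almost a year beyond where the BMC clock presumably was:)
$ date -ud 'May 19 09:46:36 2016' +%s   # notBefore of client.crt 1463651196
$ date -ud '2015-06-01' +%s             # roughly where the BMC clock was 1433116800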
This problem was one giant time sink. But we did learn a few things about the structure of the BMC firmware. And, also important: after the 17th of May 2026, the iKVM connections will stop working unless we upgrade the firmware or fiddle with the time.
Maybe set a reminder for that event, in case these servers are still around by then...
2021-05-24 - ancient acer h340 nas / quirks and fixes
A while ago, I got a leftover NAS from someone. It's ancient, and what's worse, it's headless — that is, there is no video adapter on it. So installing an OS on it, or debugging failures is not entirely trivial.
Here are some tips and tricks to get it moving along.
First, according to dmidecode it's a:
Manufacturer: Acer Product Name: Aspire easyStore H340 Product Name: WG945GCM
The previous owner had already placed the JP3 jumper on the motherboard. This likely made it boot from USB. So, I debootstrapped a USB stick with Debian/Buster, a user and an autostarting sshd daemon so I could look around. I use the following on-boot script to get some network:
#!/bin/sh -x # # Called from /etc/systemd/system/networking.service.d/override.conf: # ExecStartPre=/usr/local/bin/auto-ifaces # /bin/rm /etc/network/interfaces for iface in $( /sbin/ip link | /bin/sed -e '/^[0-9]/!d;s/^[0-9]*: //;s/:.*//'); do if test $iface = lo; then printf 'auto lo\niface lo inet loopback\n' \ >>/etc/network/interfaces elif test $iface = enp9s0 -o $iface = enp9s0f0; then # physically broken interface # [ 19.191416] sky2 0000:09:00.0: enp9s0f0: phy I/O error # [ 19.191442] sky2 0000:09:00.0: enp9s0f0: phy I/O error /sbin/ip link set down $iface else # ethtool not needed with dongle anymore. #/sbin/ethtool -s $iface speed 100 duplex full printf 'auto %s\niface %s inet dhcp\n' $iface $iface \ >>/etc/network/interfaces fi done
(At first, enp9s0
wasn't broken yet. Later I let a USB
dongle take over its role.)
Note that a S.M.A.R.T. (disk) error would land it in "Press F1 to continue"-land. Attaching a keyboard and pressing F1 — or, alternatively, ejecting all disks before booting — bypassed that. Watch for the blue blinking i symbol. If there isn't any, it's not booting.
Next, it appeared that it refused to shut down on
poweroff
. After completing the shutdown sequence, it would
reboot.
Various reports on the internet claim that disabling USB 3.0 in the BIOS works. But, I don't have access to the BIOS (no video, remember) and I'm using the USB for a network dongle.
Adding XHCI_SPURIOUS_WAKEUP (262144) | XHCI_SPURIOUS_REBOOT (8192) to the Linux command line as xhci_hcd.quirks=270336 appeared to work, but this was fake news, at least on Linux kernel 4.19.0-16-amd64.
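(For the record: with no video and no BIOS access, such a parameter has to go in through the bootloader config on the boot medium. Assuming GRUB is the bootloader, something like this, followed by a reboot:)
# sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&xhci_hcd.quirks=270336 /' /etc/default/grub
# update-grub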
The acpi=off
command line suggestion seen elsewhere is a
bit much. It will not reboot after shutdown, but it will
require a manual button press. And additionally, only one CPU (thread)
will be detected, and dmesg gets flooded with PCI Express errors.
The solution that did work was the following (at shutdown) in
/etc/rc0.d/K02rmmod-against-reboot
:
#!/bin/sh /usr/sbin/modprobe -r ehci_pci ehci_hcd /bin/true
(Removing ehci_pci
when it's still running breaks the USB
networking, so be careful when you test.)
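(Also make sure the script is executable, or the rc machinery will silently skip it:)
# chmod +x /etc/rc0.d/K02rmmod-against-reboot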
Lastly, you may want to install mediasmartserverd
— the Linux daemon that controls the status LEDs of Acer Aspire
EasyStore H340 [...]. It will get you some nice colored LEDs next to your
hard disks. (Do use --brightness=6
, as brightness level
one is too weak.)
2021-05-20 - partially removed pve node / proxmox cluster
The case of the stale (removed but not removed) PVE node in our Proxmox cluster.
On one of our virtual machine clusters, a node — pve3
— had been removed on
purpose, yet it was still visible in the GUI with a big red cross
(because it was unavailable). This was not only ugly, but also caused
problems for the node enumeration done by proxmove.
The node had been properly removed, according to the removing a cluster node documentation. Yet it was apparently still there.
# pvecm nodes Membership information ---------------------- Nodeid Votes Name 1 1 pve1 (local) 2 1 pve2 3 1 pve4 5 1 pve5
This listing looked fine: pve3
(nodeid 4) was absent.
And all remaining nodes showed the same info.
But, a quick grep through /etc
did turn up some references to pve3
:
# grep pve3 /etc/* -rl /etc/corosync/corosync.conf /etc/pve/.version /etc/pve/.members /etc/pve/corosync.conf
Those two corosync.conf config files are in sync, both with each other and with the copies on the other three nodes. But they did contain a reference to the removed node:
nodelist { ... node { name: pve3 nodeid: 4 quorum_votes: 1 ring0_addr: 10.x.x.x }
The .version
and .members
json files were
different, albeit similar on all nodes. They all included the 5 nodes (one too many):
# cat /etc/pve/.members { "nodename": "pve1", "version": 77, "cluster": { "name": "my-clustername", "version": 6, "nodes": 5, "quorate": 1 }, "nodelist": { "pve1": { "id": 1, "online": 1, "ip": "10.x.x.x"}, "pve2": { "id": 2, "online": 1, "ip": "10.x.x.x"}, "pve3": { "id": 4, "online": 0, "ip": "10.x.x.x"}, "pve4": { "id": 3, "online": 1, "ip": "10.x.x.x"}, "pve5": { "id": 5, "online": 1, "ip": "10.x.x.x"} } }
The document versions were all a bit different, but the cluster
versions were the same between the nodes. Except for one node, on which
the cluster version was 5
instead of 6
.
Restarting corosync on that node fixed that problem: the
cluster versions were now 6
everywhere.
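(Nothing fancy was needed for that; on the node with the stale cluster version, something like:)
# systemctl restart corosync
# cat /etc/pve/.members   # the cluster version should now read 6 here too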
With that problem tackled, it was a matter of:
# pvecm expected 4
# pvecm delnode pve3 Killing node 4
All right! Even though it did not list nodeid 4 in the pvecm
nodes output, delnode did find the right one. And this
properly removed all traces of pve3
from the remaining
files, making the cluster happy again.
2021-05-11 - enable noisy build / opensips
How do you enable the noisy build when building OpenSIPS? The one where the actual gcc invocations are not hidden.
In various projects the compilation and linking steps called by make are cleaned up, so you only see things like:
Compiling db/db_query.c Compiling db/db_id.c ...
This looks cleaner. But sometimes you want to see (or temporarily change) the compilation/linking call:
gcc -g -O9 -funroll-loops -Wcast-align -Wall [...] -c db/db_query.c -o db/db_query.o gcc -g -O9 -funroll-loops -Wcast-align -Wall [...] -c db/db_id.c -o db/db_id.o ...
(I elided about 800 characters per line in this example. Noisy indeed.)
The setting to disable this “clean” output and favor a “noisy” one generally exists, but there is no consensus on a standardized name.
For projects built with CMake, enabling verbose mode
probably does the trick (VERBOSE=1
). For other projects,
the name varies. (I'm partial to the NOISY_BUILD=yes
that Asterisk PBX uses.)
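For comparison, those two look like this:
$ make VERBOSE=1          # CMake-generated Makefiles
$ make NOISY_BUILD=yes    # Asterisk PBX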
For OpenSIPS you can achieve this
effect by setting the Q
variable to empty:
$ make Q= opensips modules
Other OpenSIPS make variables
Okay. And what about parallel jobs?
Use the FASTER=1
variable, along with the -j
flag:
$ make -j4 FASTER=1 opensips modules
And building only a specific module?
Generally you'll want to preset which modules to include or exclude
in Makefile.conf
. There you have exclude_modules?=
and include_modules?=
which are used unless you set them
earlier (on the command line).
(Beware: touching Makefile.conf
will force a
rebuild of the entire project.)
For a single run, you can specify either one of them or
modules
on the command line, where modules takes
space-separated module names with a modules/
prefix:
$ make modules exclude_modules=presence (builds all modules except presence)
$ make modules include_modules=presence (builds the selected modules from Makefile.conf and the presence module)
$ make modules modules=modules/presence (builds only the presence module)
This might save you some time.
2021-05-05 - missing serial / scsi / disk by-id
When you have a lot of storage devices, it's best practice to assign them to RAID arrays or ZFS pools by something identifiable. And preferably something that's also readable when the disk is outside a computer. Commonly: the disk manufacturer and the serial number.
Usually, both the disk manufacturer and the disk serial number are printed on a small label on the disk. So, if you're in the data center replacing a disk, one glance is sufficient to know you got the correct disk.
For this reason, our ZFS storage pool configurations look like this:
NAME STATE tank ONLINE raidz2-0 ONLINE scsi-SSEAGATE_ST10000NM0226_6351 ONLINE scsi-SSEAGATE_ST10000NM0226_0226 ONLINE scsi-SSEAGATE_ST10000NM0226_8412 ONLINE scsi-SSEAGATE_ST10000NM0226_... ONLINE
Instead of this:
NAME STATE tank ONLINE raidz2-0 ONLINE sda ONLINE sdb ONLINE sdc ONLINE sd... ONLINE
If you're replacing a faulty disk, you can match it to the serial number and confirm that you haven't done anything stupid.
Referencing these disks is as easy as using the symlink in
/dev/disk/by-id
.
No model names and serial numbers in udev?
But I don't have any serial numbers in
/dev/disk/by-id
, I only have these wwn-
numbers.
If your /dev/disk/by-id
looks like this:
# ls -1 /dev/disk/by-id/ scsi-35000c5001111138e scsi-35000c50011111401 ... wwn-0x5000c5001111138e wwn-0x5000c5001111140f ...
And it has no manufacturer/serial symlinks, then udev is letting you down.
Looking at udevadm info /dev/sda
may reveal that you're missing some udev rules.
On this particular machine I did have ID_SCSI_SERIAL
,
but not SCSI_VENDOR
, SCSI_MODEL
or
SCSI_IDENT_SERIAL
.
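(A quick way to check is counting those properties; zero means the symlink rules have nothing to work with:)
# udevadm info --query=property /dev/sda | grep -cE '^SCSI_(VENDOR|MODEL|IDENT_SERIAL)=' 0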
On Ubuntu/Focal, the fix was to install
sg3-utils-udev
which provides udev rules in
55-scsi-sg3_id.rules
and
58-scsi-sg3_symlink.rules
:
# apt-get install sg3-utils-udev
# udevadm trigger --action=change
# ls -1 /dev/disk/by-id/ scsi-35000c5001111138e scsi-35000c50011111401 ... scsi-SSEAGATE_ST10000NM0226_8327 scsi-SSEAGATE_ST10000NM0226_916D ... wwn-0x5000c5001111138e wwn-0x5000c5001111140f ...
Awesome. Devices with serial numbers. I'm off to create a nice zpool.
2021-04-29 - smtp_domain / gitlab configuration
What is the smtp_domain
in the GitLab configuration?
There is also a smtp_address
and
smtp_user_name
; so what would you put in the
“domain” field?
Contrary to what the examples on GitLab
Omnibus SMTP lead you to believe: smtp_domain
is
the HELO/EHLO
domain; i.e. your hostname.
RFC 5321 has this to say about the HELO/EHLO parameter:
o The domain name given in the EHLO command MUST be either a primary host name (a domain name that resolves to an address RR) or, if the host has no name, an address literal, as described in Section 4.1.3 and discussed further in the EHLO discussion of Section 4.1.4.
So, the term [smtp_]helo_hostname
would've been a lot more appropriate.
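In Omnibus terms that would look something like this (values are illustrative):
# /etc/gitlab/gitlab.rb
gitlab_rails['smtp_address'] = "smtp.example.com"    # the relay you connect to
gitlab_rails['smtp_domain'] = "gitlab.example.com"   # what you announce in HELO/EHLO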
2021-04-26 - yubico otp / pam / openvpn
Quick notes on setting up pam_yubico.so
with OpenVPN.
Add to OpenVPN server config:
plugin /usr/lib/x86_64-linux-gnu/openvpn/plugins/openvpn-plugin-auth-pam.so openvpn # Use a generated token instead of user/password for up # to 16 hours, so you'll need to re-enter your otp daily. auth-gen-token 57600
Sign up at https://upgrade.yubico.com/getapikey/.
It's really quick.
Store client_id
and secret
(or id
and key respectively). You'll need them in the config below.
Get PAM module:
# apt-get install --no-install-recommends libpam-yubico
Create /etc/pam.d/openvpn:
# This file is called /etc/pam.d/openvpn; and it is used by openvpn through: # plugin /usr/lib/x86_64-linux-gnu/openvpn/plugins/openvpn-plugin-auth-pam.so openvpn # Settings for pam_yubico.so # -------------------------- # debug # yes, we want debugging (DISABLE when done) # debug_file=stderr # stdout/stderr/somefile all go to journald; # but stdout will get truncated because it's not flush()ed. # mode=client # client for OTP validation # authfile=/etc/openvpn/server/authorized_yubikeys # the file with "USERNAME:YUBI1[:YUBI2:...]" lines # #alwaysok # this is the dry-run (allow all) # #use_first_pass/try_first_pass # do NOT use these for openvpn/openssh; the password is fetched # through PAM_CONV: # > pam_yubico.c:935 (pam_sm_authenticate): get password returned: (null) # #verbose_otp # do NOT use this for openvpn/openssh; it will break password input # without any meaningful debug info: # > pam_yubico.c:1096 (pam_sm_authenticate): conv returned 1 bytes # > pam_yubico.c:1111 (pam_sm_authenticate): Skipping first 0 bytes. [...] # > pam_yubico.c:1118 (pam_sm_authenticate): OTP: username ID: username # First, the username+password is checked: auth required pam_yubico.so debug debug_file=stderr mode=client authfile=/etc/openvpn/server/authorized_yubikeys id=<client_id> key=<secret> # Second, an account is needed: pam_sm_acct_mgmt returning PAM_SUCCESS # (It checks the value of 'yubico_setcred_return' which was set by # pam_sm_authenticate.) This one needs no additional config: account required pam_yubico.so debug debug_file=stderr
As you can see in the comments above, some of that config had me puzzled for a while.
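For completeness: the authfile maps usernames to Yubikey public IDs (the fixed first 12 modhex characters of an OTP). Usernames and IDs below are made up:
# /etc/openvpn/server/authorized_yubikeys
alice:cccccchvjdjr
bob:ccccccvknlhl:ccccccbcgfhk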
The above should be sufficient to get a second factor (2FA) for OpenVPN logins, next to your valid certificate. But, as someone immediately cleverly pointed out: if you use it like this, you have 2 x 1FA. Not 2FA.
That means that the usefulness of this is limited...
2021-04-08 - proxmox / virtio-blk / disk by-id
Why does the virtio-blk /dev/vda
block
device not show up in /dev/disk/by-id
?
Yesterday, I wrote about how Proxmox VE attaches
scsi0
and virtio0
block devices differently.
That is the starting point for today's question: how come I get
/dev/sda
in /dev/disk/by-id
while
/dev/vda
is nowhere to be found?
This question is relevant if you're used to referencing disks through
/dev/disk/by-id
(for example when setting
up ZFS, using the device identifiers). The named devices can be a
lot more convenient to keep track of.
If you're on a QEMU VM using virtio-scsi
,
the block devices do show up:
# ls -log /dev/disk/by-id/ total 0 lrwxrwxrwx 1 9 apr 8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0 lrwxrwxrwx 1 9 apr 8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0 -> ../../sda lrwxrwxrwx 1 10 apr 8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part1 -> ../../sda1
But if you're using virtio-blk
, they do not:
# ls -log /dev/disk/by-id/ total 0 lrwxrwxrwx 1 9 apr 8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
There: no symlinks to /dev/vda
, while it does exist
and it does show up in /dev/disk/by-path
:
# ls -l /dev/vda{,1} brw-rw---- 1 root disk 254, 0 apr 8 14:50 /dev/vda brw-rw---- 1 root disk 254, 1 apr 8 14:50 /dev/vda1
# ls -log /dev/disk/by-path/ total 0 lrwxrwxrwx 1 9 apr 8 14:50 pci-0000:00:01.1-ata-2 -> ../../sr0 lrwxrwxrwx 1 9 apr 8 14:50 pci-0000:00:0a.0 -> ../../vda lrwxrwxrwx 1 10 apr 8 14:50 pci-0000:00:0a.0-part1 -> ../../vda1 lrwxrwxrwx 1 9 apr 8 14:50 virtio-pci-0000:00:0a.0 -> ../../vda lrwxrwxrwx 1 10 apr 8 14:50 virtio-pci-0000:00:0a.0-part1 -> ../../vda1
udev rules
Who creates these? It's udev.
If you look at the udev rules in 60-persistent-storage.rules
, you'll see a bunch of these:
# grep -E '"(sd|vd)' /lib/udev/rules.d/60-persistent-storage.rules KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}" KERNEL=="vd*[0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}-part%n" ... KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{vendor}=="ATA", IMPORT{program}="ata_id --export $devnode" KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{type}=="5", ATTRS{scsi_level}=="[6-9]*", IMPORT{program}="ata_id --export $devnode" ... KERNEL=="sd*|sr*|cciss*", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}" KERNEL=="sd*|cciss*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n" ...
So, udev is in the loop, and would create symlinks, if it matched the appropriate rules.
Comparing output from udevadm:
# udevadm info /dev/sda P: /devices/pci0000:00/0000:00:05.0/virtio1/host2/target2:0:0/2:0:0:0/block/sda N: sda ... E: DEVNAME=/dev/sda E: DEVTYPE=disk ... E: ID_SERIAL=0QEMU_QEMU_HARDDISK_drive-scsi0 E: ID_SERIAL_SHORT=drive-scsi0 E: ID_BUS=scsi E: ID_PATH=pci-0000:00:05.0-scsi-0:0:0:0 ...
and:
# udevadm info /dev/vda P: /devices/pci0000:00/0000:00:0a.0/virtio1/block/vda N: vda ... E: DEVNAME=/dev/vda E: DEVTYPE=disk ... E: ID_PATH=pci-0000:00:0a.0 ...
The output from /dev/vda
is a lot shorter. And there is no
ID_BUS
nor ID_SERIAL
. And the lack of a
serial is what causes this rule to be skipped:
KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
We could hack the udev rules, adding a default serial when it's unavailable:
KERNEL=="vd*[!0-9]", ATTRS{serial}!="?*", ENV{ID_SERIAL}="MY_SERIAL", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
# udevadm control --reload
# udevadm trigger --action=change # ls -log /dev/disk/by-id/ lrwxrwxrwx 1 9 apr 8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0 lrwxrwxrwx 1 9 apr 8 14:50 virtio-MY_SERIAL -> ../../vda
But that's awkward. And it breaks things if we ever add a second disk.
Adding a serial through Proxmox
Instead, we can hand-hack the Proxmox VE QEMU configuration
file and add a custom ,serial=MY_SERIAL parameter (at most 20 bytes) to the disk configuration. We'll use disk0 as the serial for now:
serial for now:
--- /etc/pve/qemu-server/NNN.conf +++ /etc/pve/qemu-server/NNN.conf @@ -10,5 +10,5 @@ ostype: l26 scsihw: virtio-scsi-pci smbios1: uuid=d41e78ad-4ff6-4000-8882-c343e3233945 sockets: 1 -virtio0: somedisk:vm-NNN-disk-0,size=32G +virtio0: somedisk:vm-NNN-disk-0,serial=disk0,size=32G vmgenid: 2ffdfa16-769a-421f-91f3-71397562c6b9
Stop the VM, start it again, and voilà, the disk is matched:
# ls -log /dev/disk/by-id/ total 0 lrwxrwxrwx 1 9 apr 8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0 lrwxrwxrwx 1 9 apr 8 14:50 virtio-disk0 -> ../../vda lrwxrwxrwx 1 10 apr 8 14:50 virtio-disk0-part1 -> ../../vda1
As long as you don't create duplicate serials in the same VM, this should be fine.
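(Instead of editing the file by hand, the same result should be obtainable with qm set; not verified here:)
# qm set NNN --virtio0 somedisk:vm-NNN-disk-0,serial=disk0,size=32G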
2021-04-07 - proxmox / alter default create vm parameters
The Proxmox Virtual Environment has defaults when creating a new VM, but it has no option to change those defaults. Here's a quick example of hacking in some defaults.
Why? (Changing SCSI controller does not change existing disks)
In the next post I wanted to talk about /dev/disk/by-id
and why virtio-blk disks do not show up there. A confusing matter in this situation is that creating a VM disk while a different SCSI controller is selected, and then switching controllers, does not completely change the storage driver for the existing disks!
If you're on Proxmox VE 6.x (observed with 6.1 and 6.3) and
you create a VM with the VirtIO SCSI controller, your
virtual machine parameters may look like this, and you get a /dev/vda
device inside your QEMU VM:
/usr/bin/kvm \ ... -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa \ -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-virtio0,format=raw
But if you create it with the (default) LSI 53C895A SCSI
controller first, and then switch to VirtIO SCSI,
you still keep the (ATA) /dev/sda
block device name.
The VM is started with these command line arguments:
/usr/bin/kvm \ ... -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \ -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-scsi0,format=raw \ -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0
If you look at the configuration in
/etc/pve/qemu-server/NNN.conf
, both would have:
scsihw: virtio-scsi-pci
But the disk configuration type/name is different:
virtio0: somedisk:vm-NNN-disk-0,size=32G
vs.
scsi0: somedisk:vm-NNN-disk-0,size=32G
This virtio-scsi-pci
+ scsi0
is turned into
-device virtio-scsi-pci
flags.
While virtio-scsi-pci
+ virtio0
translates to
-device virtio-blk-pci
.
It's not a bad thing though, that it does not change from
scsi0
to virtio0
. After all,
if the device did change from /dev/sda
to
/dev/vda
, your boot procedure and mounts might be impacted.
But it does mean that you want the VirtIO SCSI
option selected before you create any disks.
How? (Hacking defaults into pvemanagerlib.js)
In the pve-manager package, there's a
/usr/share/pve-manager/js/pvemanagerlib.js
that controls
much of the user interface. Altering the default appears to be a matter
of:
--- /usr/share/pve-manager/js/pvemanagerlib.js +++ /usr/share/pve-manager/js/pvemanagerlib.js @@ -21771,7 +21771,7 @@ Ext.define('PVE.qemu.OSDefaults', { scsi: 2, virtio: 1 }, - scsihw: '' + scsihw: 'virtio-scsi-pci' }; // virtio-net is in kernel since 2.6.25
For bonus points, we can disable the firewall default, which we manage elsewhere anyway:
--- /usr/share/pve-manager/js/pvemanagerlib.js +++ /usr/share/pve-manager/js/pvemanagerlib.js @@ -22434,7 +22434,7 @@ Ext.define('PVE.qemu.NetworkInputPanel', xtype: 'proxmoxcheckbox', fieldLabel: gettext('Firewall'), name: 'firewall', - checked: (me.insideWizard || me.isCreate) + checked: false } ]; @@ -27909,7 +27909,7 @@ Ext.define('PVE.lxc.NetworkInputPanel', cdata.name = 'eth0'; me.dataCache = {}; } - cdata.firewall = (me.insideWizard || me.isCreate); + cdata.firewall = false; if (!me.dataCache) { throw "no dataCache specified";
Of course these changes will get wiped whenever you update Proxmox VE. Keeping your hacks active will be an exercise for the reader.
2021-01-21 - openvpn / hardened fox-it openvpn-nl
Today, we will be evaluating OpenVPN-NL — “[a] hardened version of OpenVPN that includes as many of the security measures required to operate in a classified environment as possible” — and whether we can use it as a drop-in replacement for regular OpenVPN.
While OpenVPN allows many insecure configurations, such as turning off encryption, or the use of outdated cryptographic functions in security-critical places, the goal of OpenVPN-NL — a fork created and maintained by Fox-IT — is to strip insecure configuration and verify that the distributed version is uncompromised.
We'll be answering the question of whether it's compatible and whether we want to use it.
For Ubuntu Bionic and Xenial, repositories exist. But the Bionic version works just fine on Ubuntu Focal.
 | OpenVPN | OpenVPN-NL |
---|---|---|
repo | ubuntu default | fox-it repository |
package | openvpn | openvpn-nl |
version | 2.4.7-1ubuntu2 | 2.4.7-bionicnl1 |
dependencies | lzo2-2, lz4-1, pkcs11, ssl, systemd0, iproute2 | lzo2-2, net-tools, (embedded) Mbed TLS 2.16.2 |
size | 1160 KiB | 1627 KiB |
binary | /usr/sbin/openvpn | /usr/sbin/openvpn-nl |
systemd notify | YES | - |
As you can already see in the above list:
- the versions are similar (2.4.7);
- OpenVPN is linked to OpenSSL while
OpenVPN-NL embeds Mbed TLS. This means that:
- it is not affected by OpenSSL specific security issues,
- but it will be affected by Mbed TLS issues and we'll have to rely on updates from Fox-IT, should such issues arise.
- OpenVPN-NL can be installed alongside OpenVPN, which makes switching between the two convenient;
- it depends on older networking tools (net-tools);
- it does not support
sd_notify
— you'll have to disableType=notify
in your SystemD service files.
On to the hardening bits
The hardening done by Fox-IT appears to consist of the following changes:
- Mbed TLS is used instead of OpenSSL:
- if you assume that OpenSSL is riddled with flaws, then this is a good thing;
- if you assume that any security product, including Mbed TLS will have its flaws, then a drawback is that you get fewer features (no TLS 1.3) and that you have to rely on timely patches from Fox-IT.
- OpenVPN-NL drastically limits the allowed cryptography algorithms — both on the weak and on the strong side of the spectrum — leaving you with really no option but SHA256, RSA and AES-256;
- it enforces a few options that you should have enabled, like
certificate validation, and specifically
remote-cert-tls
to prevent client-to-client man-in-the-middle attacks; - it removes a few options that you should not have
enabled, like
no-iv
,client-cert-not-required
or optionalverify-client-cert
; - certificates must be signed with a SHA256 hash, or the certificates will be rejected;
- it delays startup until there is sufficient entropy on the system
(it does so by reading and discarding
min-platform-entropy
bytes from/dev/random
, which strikes me as an odd way to accomplish that) — during testing you can setmin-platform-entropy 0
.
Note that we're only using Linux, so we did not check any Windows build scripts/fixes that may also be done. The included PKCS#11 code — for certificates on hardware tokens — was not checked either at this point.
The available algorithms:
 | OpenVPN | OpenVPN-NL |
---|---|---|
--show-digests | .. lots and lots .. | SHA256 |
--show-tls | .. anything that OpenSSL supports, for TLS 1.3 and below .. | TLS 1.2 (only) with ciphers: TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384 TLS-DHE-RSA-WITH-AES-256-GCM-SHA384 TLS-DHE-RSA-WITH-AES-256-CBC-SHA256 |
--show-ciphers | .. lots and lots .. | AES-256-CBC AES-256-GCM |
Notable in the above list is that SHA512 is not allowed, nor are ECDSA ciphers: so no new fancy ed25519 or secp521r1 elliptic curve (EC) ciphers, but only plain old RSA large primes. (The diff between openvpn-2.4.7-1ubuntu2 and Fox-IT bionicnl1 even explicitly states that EC is disabled, except for during the Diffie-Hellman key exchange. No motivation is given.)
So, compatibility with vanilla OpenVPN is somewhat available, if you stick to the configuration below.
Server settings:
mode server tls-server
Client settings:
client # equals: pull + tls-client
Server and client settings:
local SERVER_IP # [server: remote SERVER_IP] proto udp port 1194 nobind # [server: server VPN_NET 255.255.255.0] dev vpn-DOMAIN # named network devices are nice dev-type tun # HMAC auth, first line of defence against brute force auth SHA256 tls-auth DOMAIN/ta.key 1 # [server: tls-auth DOMAIN/ta.key 0] key-direction 1 # int as above, allows inline <tls-auth> # TLS openvpn-nl compatibility config tls-version-min 1.2 #[not necessary]#tls-version-max 1.2 # MbedTLS has no 1.3 # DH/TLS setup # - no ECDSA for openvpn-nl # - no TLS 1.3 for openvpn-nl tls-cipher TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384 tls-ciphersuites TLS_AES_256_GCM_SHA384 # only for TLS 1.3 ecdh-curve secp384r1 #[only server]#dh none # (EC)DHE, thus no permanent parameters # TLS certificates # Note that the certificates must be: # - SHA-256 signed # - using RSA 2048 or higher (choose at least 4096), and not Elliptic Curve # - including "X509v3 Extended Key Usage" (EKU) for Server vs. Client remote-cert-tls server # [server: remote-cert-tls client] (EKU) ca DOMAIN/ca.crt # CA to validate the peer certificate against cert DOMAIN/client-or-server.crt key DOMAIN/client-or-server.key #[only server]#crl-verify DOMAIN/crl.pem # check for revoked certs # Data channel cipher AES-256-GCM # or AES-256-CBC ncp-disable # and no cipher negotiation # Drop privileges; keep tunnel across restarts; keepalives # useradd -md /var/spool/openvpn -k /dev/null -r -s /usr/sbin/nologin openvpn user openvpn group nogroup persist-key persist-tun keepalive 15 55 # ping every 15, disconnect after 55 #[only server]#opt-verify # force compatible options
The lack of SystemD notify
support is a minor annoyance.
When editing the SystemD service file, set Type
to simple
and remove --daemon
from the
options. Otherwise you may end up with unmounted
PrivateTmp
mounts and multiple openvpn-nl
daemons (which of course hold on to the listening socket your
new daemon needs, causing strange client-connect
errors):
# /etc/systemd/system/openvpn@server.service.d/override.conf [Service] ExecStart= # Take the original ExecStart, replace "openvpn" with "openvpn-nl" # and remove "--daemon ...": ExecStart=/usr/sbin/openvpn-nl --status /run/openvpn/%i.status 10 --cd /etc/openvpn --script-security 2 --config /etc/openvpn/%i.conf --writepid /run/openvpn/%i.pid Type=simple
If you're okay with sticking to SHA256 and RSA for now, then OpenVPN-NL is compatible with vanilla OpenVPN. Do note that hardware acceleration in Mbed TLS is explicitly marked as disabled on the OpenVPN-NL lifecycle page. I'm not sure if this is a security decision, but it may prove to be less performant.
In conclusion: there is no immediate need to use OpenVPN-NL, but it is wise to take their changes to heart. Make sure:
- you validate and trust packages from your software repository;
- all your certificates are SHA256-signed;
remote-cert-tls
is enabled (and your certificates are marked with the correct key usage, e.g. by using a recent easy-rsa to sign your keys);- ciphers are fixed or non-negotiable using
ncp-disable
; auth
,cipher
andtls-cipher
are set to something modern.
But if you stick to the above configuration, then using OpenVPN-NL is fine too..
.. although still I cannot put my finger on how discarding bytes from
/dev/random
would make things more secure.
Notes about RNG, min-platform-entropy and hardware support
About “how discarding bytes from /dev/random
makes
things more secure.”
I think the theory is that throwing away some bytes makes
things more secure, because the initially seeded bytes after reboot
might be guessable. And instead of working against the added code
— by lowering min-platform-entropy
— we can
attempt to get more/better entropy.
If the rdrand
processor flag is available then this
might be a piece of cake:
$ grep -q '^flags.*\brdrand\b' /proc/cpuinfo && echo has rdrand has rdrand
If it isn't, and this is a virtual machine, you'll need to
(a) confirm that it's available on the VM host and (b) enable
the host processor in the VM guest (-cpu
host
). (If you wanted AES CPU acceleration, you would have
enabled host CPU support already.)
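(The check for that flag is analogous to the rdrand one:)
$ grep -q '^flags.*\baes\b' /proc/cpuinfo && echo has aes-ni has aes-ni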
When the processor flag is available, you can start benefitting from host-provided entropy.
$ cat /proc/sys/kernel/random/entropy_avail 701
This old entropy depletes faster than a Coca-Cola bottle with
a Mentos in it, once you start reading from /dev/random
directly.
But, if you install rng-tools, you get a nice
/usr/sbin/rngd
that checks entropy levels and reads from
/dev/hwrng
, replenishing the entropy as needed.
17:32:00.784031 poll([{fd=4, events=POLLOUT}], 1, -1) = 1 ([{fd=4, revents=POLLOUT}]) 17:32:00.784162 ioctl(4, RNDADDENTROPY, {entropy_count=512, buf_size=64, buf="\262\310"...}) = 0
$ cat /proc/sys/kernel/random/entropy_avail 3138
Instant replenish! Now you can consider enabling
use-prediction-resistance
if you're using MbedTLS
(through OpenVPN-NL).
Footnotes
See also blog.g3rt.nl openvpn security tips and how to harden OpenVPN in 2020.
2021-01-15 - postgresql inside kubernetes / no space left on device
Running PostgreSQL inside Kubernetes? Getting occasional "No space left on device" errors? Know that 64MB is not enough for everyone.
With the advent of more services running inside Kubernetes, we're now running into new issues and complexities specific to the containerization. For instance, to solve the problem of regular file backups of distributed filesystems, we've resorted to using rsync wrapped inside a pod (or sidecar). And now for containerized PostgreSQL, we're running into an artificial memory limit that needs fixing.
Manifestation
The issue manifests itself like this:
ERROR: could not resize shared memory segment "/PostgreSQL.491173048" to 4194304 bytes: No space left on device
This shared memory that PostgreSQL speaks of, is the
shared memory made available to it through
/dev/shm
.
On your development machine, it may look like this:
$ mount | grep shm tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
$ df -h | sed -ne '1p;/shm/p' Filesystem Size Used Avail Use% Mounted on tmpfs 16G 948M 15G 6% /dev/shm
That's fine. 16GiB is plenty of space. But in Kubernetes we get a default of a measly 64MiB and no means to change the shm size. So, inside the pod with the PostgreSQL daemon, things look like this:
$ mount | grep shm shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
$ df -h | sed -ne '1p;/shm/p' Filesystem Size Used Avail Use% Mounted on shm 64M 0 64M 0% /dev/shm
For a bunch of database operations, that is definitely too little. Any PostgreSQL database doing any serious work will quickly use up that much temporary space. (And run into this error.)
According to Thomas Munro on the postgrespro.com mailing list:
PostgreSQL creates segments in
/dev/shm
for parallel queries (viashm_open()
), not for shared buffers. The amount used is controlled bywork_mem
. Queries can use up towork_mem
for each node you see in theEXPLAIN
plan, and for each process, so it can be quite a lot if you have lots of parallel worker processes and/or lots of tables/partitions being sorted or hashed in your query.
Basically what they're saying is: you need sufficient space in /dev/shm
, period!
On the docker-library
postgres page it is documented that you may want to increase
the --shm-size
(ShmSize).
That is quite doable for direct Docker or docker-compose instantiations.
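For plain Docker or docker-compose that would be something like:
$ docker run --shm-size=1g -e POSTGRES_PASSWORD=secret -d postgres
# or, in docker-compose.yml:
services:
  db:
    image: postgres
    shm_size: '1gb'
    environment:
      POSTGRES_PASSWORD: secret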
But for PostgreSQL daemon pods in Kubernetes resizing
shm does not seem to be possible.
Any other fixes then?
Well, I'm glad you asked! /dev/shm
is just one of the ways through which the PostgreSQL daemon can be configured to allocate shared memory:
- dynamic_shared_memory_type (enum)
- Specifies the dynamic shared memory implementation that the server should use. Possible values are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared memory allocated via shmget), windows (for Windows shared memory), and mmap (to simulate shared memory using memory-mapped files stored in the data directory). [...]
(from PostgresSQL runtime config)
When using the posix shm_open()
, we're directly
opening files in /dev/shm
. If we however opt to use the
(old fashioned) sysv shmget()
, the memory
allocation is not pinned to this filesystem and it is not limited
(unless someone has been touching /proc/sys/kernel/shm*
).
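(On a recent kernel the System V limits are effectively unlimited by default:)
$ cat /proc/sys/kernel/shmmax 18446744073692774399
$ cat /proc/sys/kernel/shmall 18446744073692774399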
Technical details of using System V shared memory
Using System V shared memory is a bit more convoluted than
using POSIX shm. For POSIX shared memory
calling shm_open()
is basically the same as opening a
(mmap
-able) file in /dev/shm
. For System
V however, you're looking at an incantation like this
shmdemo.c
example:
#include <stdio.h> #include <string.h> #include <sys/ipc.h> #include <sys/shm.h> #define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB */ int main(int argc, char *argv[]) { key_t key; int shmid; char *data; if (argc > 2) { fprintf(stderr, "usage: shmdemo [data_to_write]\n"); return 1; } /* The file here is used as a "pointer to memory". The key is * calculated based on the inode number and non-zero 8 bits: */ if ((key = ftok("./pointer-to-memory.txt", 1 /* project_id */)) == -1) { fprintf(stderr, "please create './pointer-to-memory.txt'\n"); return 2; } if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1) return 3; if ((data = shmat(shmid, NULL, 0)) == (char *)(-1)) /* attach */ return 4; /* read or modify the segment, based on the command line: */ if (argc == 2) { printf("writing to segment %#x: \"%s\"\n", key, argv[1]); strncpy(data, argv[1], SHM_SIZE); } else { printf("segment %#x contained: \"%s\"\n", key, data); shmctl(shmid, IPC_RMID, NULL); /* free the memory */ } if (shmdt(data) == -1) /* detach */ return 5; return 0; }
(Luckily the PostgreSQL programmers concerned themselves with these awkward semantics, so we won't have to.)
If you want to confirm that you have access to sufficient System V shared memory inside your pod, you could try the above code sample to test. Invoking it looks like:
$ ./shmdemo please create './pointer-to-memory.txt'
$ touch ./pointer-to-memory.txt
$ ./shmdemo segment 0x1010dd5 contained: ""
$ ./shmdemo 'please store this in shm' writing to segment 0x1010dd5: "please store this in shm"
$ ./shmdemo segment 0x1010dd5 contained: "please store this in shm"
$ ./shmdemo segment 0x1010dd5 contained: ""
And if you skipped/forgot the IPC_RMID
, you can see the
leftovers using ipcs
:
$ ipcs | awk '{if(int($6)==0)print}' ------ Message Queues -------- key msqid owner perms used-bytes messages ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x52010e16 688235 walter 644 536870912 0 0x52010e19 688238 walter 644 536870912 0 ------ Semaphore Arrays -------- key semid owner perms nsems
And remove them with ipcrm
:
$ ipcrm -M 0x52010e16
$ ipcrm -M 0x52010e19
But, you probably did not come here for lessons in ancient IPC. Quickly moving on to the next paragraph...
Configuring sysv dynamic_shared_memory_type in stolon
For stolon — the Kubernetes PostgreSQL
manager that we're using — you can configure different
parameters through the pgParameters
setting. It keeps
the configuration in a configMap
:
$ kubectl -n NS get cm stolon-cluster-mycluster -o yaml apiVersion: v1 kind: ConfigMap metadata: annotations: control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":...}' stolon-clusterdata: '{"formatVersion":1,...}' ...
Where the stolon-clusterdata
holds both the
configuration and current state:
{ "formatVersion": 1, "changeTime": "2021-01-15T10:17:54.297700008Z", "cluster": { ... "spec": { ... "pgParameters": { "datestyle": "iso, mdy", "default_text_search_config": "pg_catalog.english", "dynamic_shared_memory_type": "posix", ...
You should not be editing this directly, but it can be educational to look at.
To edit the pgParameters
you'll be using
stolonctl
from inside a stolon-proxy as specified
in the cluster
specification patching docs:
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \ --kube-resource-kind=configmap update --patch \ '{"pgParameters": {"dynamic_shared_memory_type": "sysv"}}'
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \ --kube-resource-kind=configmap update --patch \ '{"pgParameters": {"shared_buffers": "6144MB"}}'
And a restart:
$ kubectl -n NS rollout restart sts stolon-keeper
And that, my friends, should get rid of that pesky 64MiB limit.
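To confirm that the new setting is active, a quick query against the database will do (connection details are site-specific):
$ psql -h mycluster-proxy -U postgres -c 'SHOW dynamic_shared_memory_type;' dynamic_shared_memory_type ----------------------------- sysv (1 row)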
2021-01-05 - chromium snap / wrong fonts
So, a couple of weeks ago my snap-installed Chromium browser on
Ubuntu Focal started acting up: suddenly it chooses the wrong fonts on
some web pages. The chosen fonts are from the
~/.local/share/fonts/
directory.
Look! That's not the correct font. And it's even more apparent that the font is off when seeing the source view.
Bah. That's not even a monospaced font.
A fix that appeared to work — but unfortunately only temporarily — involves moving the custom local fonts out of the way and then flushing the font cache:
$ mkdir ~/.local/share/DISABLED-fonts $ mv ~/.local/share/fonts/* ~/.local/share/DISABLED-fonts/ $ fc-cache -rv && sudo fc-cache -rv
Restarting chromium-browser by using the about:restart
took quite a while. Some patience had to be exercised.
When it finally did start, all font issues were solved.
Can we now restore our custom local fonts again?
$ mv ~/.local/share/DISABLED-fonts/* ~/.local/share/fonts/ $ fc-cache -rv && sudo fc-cache -rv
And another about:restart
— which was fast as
normal again — and everything was still fine. So yes, apparently, we can.
However, after half a day of work, the bug reappeared.
A semi-permanent fix is refraining from using the local fonts directory. But that's not really good enough.
Apparently there's a bug report showing that not only Chromium is affected. And while I'm not sure how to fix things yet, at least the following seems suspect:
$ grep include.*/snap/ \ ~/snap/chromium/current/.config/fontconfig/fonts.conf <include ignore_missing="yes">/snap/chromium/1424/gnome-platform/etc/fonts/fonts.conf</include>
This would make sense, if current/
pointed to
1424
, but current/
now points to
1444
.
Here's a not
yet merged pull request that looks promising.
And here, there's someone who grew tired of hotfixing the
fonts.conf
and symlinked
all global font conf files into ~/.local/share/fonts/. That might
also be worth a try...
A more permanent solution?
$ mkdir -p ~/snap/chromium/common/.config/fontconfig $ cat >>~/snap/chromium/common/.config/fontconfig/fonts.conf <<EOF <fontconfig> <include>/etc/fonts/conf.d</include> </fontconfig> EOF
I settled for a combination of the linked suggestions. The above snippet looks like it works. Crosses fingers...
Three weeks later...
Or at least, for a while. It looks like a new snap-installed version of Chromium broke things again. When logging in after the weekend, I was presented with the wrong fonts again.
This time, I:
- fixed the symlinks,
- removed the older/unused 1444 snap revision,
- reran the
fc-cache
flush, and - restarted Chromium.
Permanent? No!
TL;DR
(Months later by now.. still a problem.)
It feels as if I'm the only one suffering from this. At least now the following sequence appears to work reliably:
- new Chromium snap has been silently installed;
- fonts are suddenly broken in currently running version;
sudo rm /var/snap/chromium/common/fontconfig/*
;- shut down / kill Chromium (make sure you get them all);
- start Chromium and reopen work with ctrl-shift-T.
(It's perhaps also worth looking into whether the default Chromium fonts are missing after snapd has been updated ticket has been resolved.)
2021-01-02 - stale apparmor config / mysql refuses to start
So, recently we had an issue with a MariaDB server that refused to start. Or, actually, it would start, but before long, SystemD would kill it. But why?
# systemctl start mariadb.service Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
After 90 seconds, it would be killed. systemctl status
mariadb.service
shows the immediate cause:
# systemctl status mariadb.service ... systemd[1]: mariadb.service: Start operation timed out. Terminating. systemd[1]: mariadb.service: Main process exited, code=killed, status=15/TERM systemd[1]: mariadb.service: Failed with result 'timeout'.
Ok, a start operation timeout. That is caused by the
notify
type: apparently the mysqld
doesn't
get a chance to tell SystemD that it has successfully
completed startup.
First, a quickfix, so we can start at all:
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf [Service] Type=simple EOF
That fixes it so we can start — because now SystemD won't wait for any "started" notification anymore — but it doesn't explain what is wrong.
Second, an attempt at debugging the cause:
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf [Service] NotifyAccess=all ExecStart= ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \ /usr/sbin/mysqld $MYSQLD_OPTS EOF
Okay, that one showed EACCESS
errors on the
sendmsg()
call on the /run/systemd/notify
unix socket:
strace[55081]: [pid 55084] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 46 strace[55081]: [pid 55084] sendmsg(46, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, msg_namelen=22, msg_iov=[{iov_base="READY=1\nSTATUS=Taking your SQL requests now...\n", iov_len=47}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EACCES (Permission denied)
Permission denied? But why?
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf [Service] NotifyAccess=all ExecStart= ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \ /bin/sh -c 'printf "READY=1\nSTATUS=Taking your SQL requests now...\n" | \ socat - UNIX-SENDTO:/run/systemd/notify; sleep 3600' EOF
This worked:
strace[54926]: [pid 54931] socket(AF_UNIX, SOCK_DGRAM, 0) = 5 strace[54926]: [pid 54931] sendto(5, "READY=1\nSTATUS=Taking your SQL requests now...\n", 47, 0, {sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, 21) = 47
(Unless someone is really trying to mess with you,
you can regard sendto()
and sendmsg()
as
equivalent here. socat
simply uses the other one.)
That means that there is nothing wrong with SystemD or
/run/systemd/notify
. So the problem
must be related to /usr/sbin/mysqld
.
After looking at journalctl -u mariadb.service
for the nth time,
I decided to peek at all of journalctl without any filters. And there it was
after all: audit logs.
# journalctl -t audit audit[1428513]: AVC apparmor="DENIED" operation="sendmsg" info="Failed name lookup - disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=1428513 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=104 ouid=0
(Observe the -t
in the journalctl invocation above which
looks for the SYSLOG_IDENTIFIER=audit
key-value pair.)
Okay. And fixing it?
# aa-remove-unknown Skipping profile in /etc/apparmor.d/disable: usr.sbin.mysqld Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd Removing '/usr/sbin/mysqld'
A-ha! Stale cruft in /var/cache/apparmor
.
# /etc/init.d/apparmor restart Restarting apparmor (via systemctl): apparmor.service.
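(And to double-check that no stale mysqld profile is loaded anymore:)
# aa-status | grep mysqld || echo no mysqld profile loaded no mysqld profile loaded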
Finally we could undo the override.conf
and everything
started working as expected.
2021-01-01 - zfs / zvol / partition does not show up
For one of our Proxmox virtual machines, I had to go into a volume to quickly fix an IP address. The volume exists on the VM host, so surely mounting it is easy. Right?
I checked in /dev/zvol/pve2-pool/
where I found the disk:
# ls /dev/zvol/pve2-pool/vm-125-virtio0* total 0 lrwxrwxrwx 1 root root 10 Dec 29 15:55 vm-125-virtio0 -> ../../zd48
Good, there's a disk:
# fdisk -l /dev/zd48 Disk /dev/zd48: 50 GiB, 53687091200 bytes, 104857600 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 8192 bytes I/O size (minimum/optimal): 8192 bytes / 8192 bytes Disklabel type: dos Disk identifier: 0x000aec27 Device Boot Start End Sectors Size Id Type /dev/zd48p1 * 2048 97656831 97654784 46.6G 83 Linux /dev/zd48p2 97656832 104855551 7198720 3.4G 82 Linux swap / Solaris
And it has partitions. Now if I could only find them, so I can mount them...
Apparently, there's a volmode
on the ZFS volume that specifies how volumes should be exposed to the OS.
Setting it to
full
exposes volumes as fully fledged block devices, providing maximal functionality. [...] Setting it todev
hides its partitions. Volumes with property set tonone
are not exposed outside ZFS, but can be snapshoted, cloned, replicated, etc, that can be suitable for backup purposes.
So:
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0 NAME PROPERTY VALUE SOURCE zl-pve2-ssd1/vm-125-virtio0 volmode default default
# zfs set volmode=full zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0 NAME PROPERTY VALUE SOURCE zl-pve2-ssd1/vm-125-virtio0 volmode full local
# ls -1 /dev/zl-pve2-ssd1/ vm-122-virtio0 vm-123-virtio0 vm-124-virtio0 vm-125-virtio0 vm-125-virtio0-part1 vm-125-virtio0-part2
Yes! Partitions for vm-125-virtio0
.
If that partition does not show up as expected, a
call to partx -a /dev/zl-pve2-ssd1/vm-125-virtio0
might do
the trick.
Quick, do some mount /dev/zl-pve2-ssd1/vm-125-virtio0-part1
/mnt/root
; edit some files.
But, try to refrain from editing the volume while the VM is running. That may cause filesystem corruption.
Lastly umount
and unset the volmode again:
# zfs inherit volmode zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0 NAME PROPERTY VALUE SOURCE zl-pve2-ssd1/vm-125-virtio0 volmode default default
And optionally update the kernel bookkeeping, with: partx -d
-n 1:2 /dev/zl-pve2-ssd1/vm-125-disk-0