Notes to self, 2021

2021-04-08 - proxmox / virtio-blk / disk by-id

Why does the virtio-blk /dev/vda block device not show up in /dev/disk/by-id?

Yesterday, I wrote about how Proxmox VE attaches scsi0 and virtio0 block devices differently. That is the starting point for today's question: why do I get /dev/sda in /dev/disk/by-id while /dev/vda is nowhere to be found?

This question is relevant if you're used to referencing disks through /dev/disk/by-id (for example when setting up ZFS, using the device identifiers). The named devices can be a lot more convenient to keep track of.
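
For example, a ZFS pool built on by-id names keeps its device references stable across reboots, no matter which /dev/sdX or /dev/vdX names the kernel hands out. (A sketch; the by-id names below are made up for illustration:)

# zpool create tank mirror \
    /dev/disk/by-id/virtio-disk1 \
    /dev/disk/by-id/virtio-disk2
# zpool status tank | grep virtio
    virtio-disk1  ONLINE       0     0     0
    virtio-disk2  ONLINE       0     0     0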

If you're on a QEMU VM using virtio-scsi, the block devices do show up:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0 -> ../../sda
lrwxrwxrwx 1 10 apr  8 14:50 scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part1 -> ../../sda1

But if you're using virtio-blk, they do not:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1 9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0

There: no symlinks to /dev/vda, even though the device exists and does show up in /dev/disk/by-path:

# ls -l /dev/vda{,1}
brw-rw---- 1 root disk 254, 0 apr  8 14:50 /dev/vda
brw-rw---- 1 root disk 254, 1 apr  8 14:50 /dev/vda1
# ls -log /dev/disk/by-path/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 pci-0000:00:01.1-ata-2 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 pci-0000:00:0a.0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 pci-0000:00:0a.0-part1 -> ../../vda1
lrwxrwxrwx 1  9 apr  8 14:50 virtio-pci-0000:00:0a.0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 virtio-pci-0000:00:0a.0-part1 -> ../../vda1

udev rules

Who creates these? It's udev.

If you look at the udev rules in 60-persistent-storage.rules, you'll see a bunch of these:

# grep -E '"(sd|vd)' /lib/udev/rules.d/60-persistent-storage.rules
KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
KERNEL=="vd*[0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}-part%n"
...
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{vendor}=="ATA", IMPORT{program}="ata_id --export $devnode"
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", SUBSYSTEMS=="scsi", ATTRS{type}=="5", ATTRS{scsi_level}=="[6-9]*", IMPORT{program}="ata_id --export $devnode"
...
KERNEL=="sd*|sr*|cciss*", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}"
KERNEL=="sd*|cciss*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="?*", SYMLINK+="disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n"
...

So udev is in the loop and would create the symlinks, if the appropriate rules matched.
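
You can watch udev process these rules for a specific device with udevadm test, which does a dry run and prints the properties and symlinks it would create (a sketch; exact output differs per distribution):

# udevadm test /sys/block/vda 2>&1 | grep -iE 'by-id|ID_SERIAL'

On this virtio-blk VM that prints no by-id symlinks at all, which matches the listing above.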

Comparing output from udevadm:

# udevadm info /dev/sda
P: /devices/pci0000:00/0000:00:05.0/virtio1/host2/target2:0:0/2:0:0:0/block/sda
N: sda
...
E: DEVNAME=/dev/sda
E: DEVTYPE=disk
...
E: ID_SERIAL=0QEMU_QEMU_HARDDISK_drive-scsi0
E: ID_SERIAL_SHORT=drive-scsi0
E: ID_BUS=scsi
E: ID_PATH=pci-0000:00:05.0-scsi-0:0:0:0
...

and:

# udevadm info /dev/vda
P: /devices/pci0000:00/0000:00:0a.0/virtio1/block/vda
N: vda
...
E: DEVNAME=/dev/vda
E: DEVTYPE=disk
...
E: ID_PATH=pci-0000:00:0a.0
...

The output for /dev/vda is a lot shorter, and there is neither ID_BUS nor ID_SERIAL. The lack of a serial is what causes this rule to be skipped:

KERNEL=="vd*[!0-9]", ATTRS{serial}=="?*", ENV{ID_SERIAL}="$attr{serial}", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"

We could hack the udev rules, adding a default serial when it's unavailable:

KERNEL=="vd*[!0-9]", ATTRS{serial}!="?*", ENV{ID_SERIAL}="MY_SERIAL", SYMLINK+="disk/by-id/virtio-$env{ID_SERIAL}"
# udevadm control --reload
# udevadm trigger --action=change
# ls -log /dev/disk/by-id/
lrwxrwxrwx 1 9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1 9 apr  8 14:50 virtio-MY_SERIAL -> ../../vda

But that's awkward. And it breaks things if we ever add a second disk.

Adding a serial through Proxmox

Instead, we can hand-hack the Proxmox VE QEMU configuration file and add a ,serial=MY_SERIAL parameter (a custom string of at most 20 bytes) to the disk configuration. We'll use disk0 as the serial for now:

--- /etc/pve/qemu-server/NNN.conf
+++ /etc/pve/qemu-server/NNN.conf
@@ -10,5 +10,5 @@ ostype: l26
 scsihw: virtio-scsi-pci
 smbios1: uuid=d41e78ad-4ff6-4000-8882-c343e3233945
 sockets: 1
-virtio0: somedisk:vm-NNN-disk-0,size=32G
+virtio0: somedisk:vm-NNN-disk-0,serial=disk0,size=32G
 vmgenid: 2ffdfa16-769a-421f-91f3-71397562c6b9

Stop the VM, start it again, and voilà, the disk is matched:

# ls -log /dev/disk/by-id/
total 0
lrwxrwxrwx 1  9 apr  8 14:50 ata-QEMU_DVD-ROM_QM00003 -> ../../sr0
lrwxrwxrwx 1  9 apr  8 14:50 virtio-disk0 -> ../../vda
lrwxrwxrwx 1 10 apr  8 14:50 virtio-disk0-part1 -> ../../vda1

As long as you don't create duplicate serials in the same VM, this should be fine.
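
(Instead of hand-editing the file, the same change can presumably be made with the qm CLI; I have not verified this exact invocation:)

# qm set NNN --virtio0 somedisk:vm-NNN-disk-0,serial=disk0,size=32G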

2021-04-07 - proxmox / alter default create vm parameters

The Proxmox Virtual Environment has defaults when creating a new VM, but it has no option to change those defaults. Here's a quick example of hacking in some defaults.

Why? (Changing SCSI controller does not change existing disks)

In the next post I wanted to talk about /dev/disk/by-id and why disks attached as VirtIO Block devices do not show up there. A confusing matter here was that creating a VM disk with a different SCSI controller selected and then switching controllers does not completely change how the existing disks are attached!

If you're on Proxmox VE 6.x (observed with 6.1 and 6.3) and you create a VM with the VirtIO SCSI controller, your virtual machine parameters may look like this, and you get a /dev/vda device inside your QEMU VM:

/usr/bin/kvm \
  ...
  -device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa \
  -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-virtio0,format=raw

But if you create it with the (default) LSI 53C895A SCSI controller first, and then switch to VirtIO SCSI, you still keep the (ATA) /dev/sda block device name. The VM is started with these command line arguments:

/usr/bin/kvm \
  ...
  -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \
  -drive file=/dev/zvol/somedisk/vm-NNN-disk-0,if=none,id=drive-scsi0,format=raw \
  -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0
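
(These command lines are excerpts; on the Proxmox host, qm showcmd prints the full set of flags a VM would be started with, e.g.:)

# qm showcmd NNN | tr ' ' '\n' | grep -E 'virtio|scsi'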

If you look at the configuration in /etc/pve/qemu-server/NNN.conf, both would have:

scsihw: virtio-scsi-pci

But the disk configuration type/name is different:

virtio0: somedisk:vm-NNN-disk-0,size=32G

vs.

scsi0: somedisk:vm-NNN-disk-0,size=32G

The combination virtio-scsi-pci + scsi0 is turned into the -device virtio-scsi-pci flags, while virtio-scsi-pci + virtio0 translates to -device virtio-blk-pci.

It's not a bad thing, though, that the disk does not change from scsi0 to virtio0. After all, if the device did change from /dev/sda to /dev/vda, your boot procedure and mounts might be impacted. But it does mean that you want the VirtIO SCSI option selected before you create any disks.

How? (Hacking defaults into pvemanagerlib.js)

In the pve-manager package, there's a /usr/share/pve-manager/js/pvemanagerlib.js that controls much of the user interface. Altering the default appears to be a matter of:

--- /usr/share/pve-manager/js/pvemanagerlib.js
+++ /usr/share/pve-manager/js/pvemanagerlib.js
@@ -21771,7 +21771,7 @@ Ext.define('PVE.qemu.OSDefaults', {
         scsi: 2,
         virtio: 1
       },
-      scsihw: ''
+      scsihw: 'virtio-scsi-pci'
     };
 
     // virtio-net is in kernel since 2.6.25

For bonus points, we can disable the firewall default, which we manage elsewhere anyway:

--- /usr/share/pve-manager/js/pvemanagerlib.js
+++ /usr/share/pve-manager/js/pvemanagerlib.js
@@ -22434,7 +22434,7 @@ Ext.define('PVE.qemu.NetworkInputPanel',
         xtype: 'proxmoxcheckbox',
         fieldLabel: gettext('Firewall'),
         name: 'firewall',
-        checked: (me.insideWizard || me.isCreate)
+        checked: false
       }
     ];
 
@@ -27909,7 +27909,7 @@ Ext.define('PVE.lxc.NetworkInputPanel',
       cdata.name = 'eth0';
       me.dataCache = {};
     }
-    cdata.firewall = (me.insideWizard || me.isCreate);
+    cdata.firewall = false;
 
     if (!me.dataCache) {
       throw "no dataCache specified";

Of course these changes will get wiped whenever you update Proxmox VE. Keeping your hacks active across upgrades is left as an exercise for the reader.
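
One low-tech way to re-apply them (a sketch; the patch file name is hypothetical) is to keep the diffs above in a patch file and re-run patch after each pve-manager upgrade; --forward makes it skip hunks that are already applied:

# patch --forward -p0 < /root/pvemanagerlib-defaults.patch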

2021-01-21 - openvpn / hardened fox-it openvpn-nl

Today, we will be evaluating OpenVPN-NL — “[a] hardened version of OpenVPN that includes as many of the security measures required to operate in a classified environment as possible” — and whether we can use it as a drop-in replacement for regular OpenVPN.

While OpenVPN allows many insecure configurations, such as turning off encryption, or the use of outdated cryptographic functions in security critical places, the goal of OpenVPN-NL — a fork created and maintained by Fox-IT — is to strip insecure configuration and verify that the distributed version is uncompromised.

We'll be answering the question of whether it's compatible and whether we want to use it.

For Ubuntu Bionic and Xenial, repositories exist. But the Bionic version works just fine on Ubuntu Focal.

                 OpenVPN                       OpenVPN-NL
repo             ubuntu default                fox-it repository
package          openvpn                       openvpn-nl
version          2.4.7-1ubuntu2                2.4.7-bionicnl1
dependencies     lzo2-2, lz4-1, pkcs11, ssl,   lzo2-2, net-tools,
                 systemd0, iproute2            (embedded) Mbed TLS 2.16.2
size             1160 KiB                      1627 KiB
binary           /usr/sbin/openvpn             /usr/sbin/openvpn-nl
systemd notify   YES                           -

As you can already see in the above list:

  • the versions are similar (2.4.7);
  • OpenVPN is linked to OpenSSL while OpenVPN-NL embeds Mbed TLS. This means that:
    • it is not affected by OpenSSL specific security issues,
    • but it will be affected by Mbed TLS issues and we'll have to rely on updates from Fox-IT, should such issues arise.
  • OpenVPN-NL can be installed alongside OpenVPN, which makes switching between the two convenient;
  • it depends on older networking tools (net-tools);
  • it does not support sd_notify — you'll have to disable Type=notify in your SystemD service files.

On to the hardening bits

The hardening done by Fox-IT appears to consist of the following changes:

  • Mbed TLS is used instead of OpenSSL:
    • if you assume that OpenSSL is riddled with flaws, then this is a good thing;
    • if you assume that any security product, including Mbed TLS will have its flaws, then a drawback is that you get fewer features (no TLS 1.3) and that you have to rely on timely patches from Fox-IT.
  • OpenVPN-NL drastically limits the allowed cryptography algorithms — both on the weak and on the strong side of the spectrum — leaving you with really no option but SHA256, RSA and AES-256;
  • it enforces a few options that you should have enabled, like certificate validation, and specifically remote-cert-tls to prevent client-to-client man in the middle attacks;
  • it removes a few options that you should not have enabled, like no-iv, client-cert-not-required or optional verify-client-cert;
  • certificates must be signed with a SHA256 hash, or the certificates will be rejected;
  • it delays startup until there is sufficient entropy on the system (it does so by reading and discarding min-platform-entropy bytes from /dev/random, which strikes me as an odd way to accomplish that) — during testing you can set min-platform-entropy 0.

Note that we're only using Linux, so we did not check any Windows build scripts/fixes that may also be done. The included PKCS#11 code — for certificates on hardware tokens — was not checked either at this point.

The available algorithms:

                 OpenVPN                                  OpenVPN-NL
--show-digests   .. lots and lots ..                      SHA256
--show-tls       .. anything that OpenSSL supports,       TLS 1.2 (only) with ciphers:
                 for TLS 1.3 and below ..                 TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384
                                                          TLS-DHE-RSA-WITH-AES-256-GCM-SHA384
                                                          TLS-DHE-RSA-WITH-AES-256-CBC-SHA256
--show-ciphers   .. lots and lots ..                      AES-256-CBC
                                                          AES-256-GCM

Notable in the above list is that SHA512 is not allowed, nor are ECDSA ciphers: so no new fancy ed25519 or secp521r1 elliptic curve (EC) ciphers, but only plain old RSA large primes. (The diff between openvpn-2.4.7-1ubuntu2 and Fox-IT bionicnl1 even explicitly states that EC is disabled, except for during the Diffie-Hellman key exchange. No motivation is given.)

So, some compatibility with vanilla OpenVPN is attainable, if you stick to the configuration below.

Server settings:

mode server
tls-server

Client settings:

client  # equals: pull + tls-client

Server and client settings:

local SERVER_IP # [server: remote SERVER_IP]
proto udp
port 1194
nobind          # [server: server VPN_NET 255.255.255.0]
dev vpn-DOMAIN  # named network devices are nice
dev-type tun

# HMAC auth, first line of defence against brute force
auth SHA256
tls-auth DOMAIN/ta.key 1  # [server: tls-auth DOMAIN/ta.key 0]
key-direction 1           # int as above, allows inline <tls-auth>

# TLS openvpn-nl compatibility config
tls-version-min 1.2
#[not necessary]#tls-version-max 1.2    # MbedTLS has no 1.3

# DH/TLS setup
# - no ECDSA for openvpn-nl
# - no TLS 1.3 for openvpn-nl
tls-cipher TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384
tls-ciphersuites TLS_AES_256_GCM_SHA384 # only for TLS 1.3
ecdh-curve secp384r1
#[only server]#dh none  # (EC)DHE, thus no permanent parameters

# TLS certificates
# Note that the certificates must be:
# - SHA-256 signed
# - using RSA 2048 or higher (choose at least 4096), and not Elliptic Curve
# - including "X509v3 Extended Key Usage" (EKU) for Server vs. Client
remote-cert-tls server  # [server: remote-cert-tls client] (EKU)
ca DOMAIN/ca.crt        # CA to validate the peer certificate against
cert DOMAIN/client-or-server.crt
key DOMAIN/client-or-server.key
#[only server]#crl-verify DOMAIN/crl.pem  # check for revoked certs

# Data channel
cipher AES-256-GCM      # or AES-256-CBC
ncp-disable             # and no cipher negotiation

# Drop privileges; keep tunnel across restarts; keepalives
user openvpn
group nogroup
persist-key
persist-tun
keepalive 15 55             # ping every 15, disconnect after 55
#[only server]#opt-verify   # force compatible options

The lack of SystemD notify support is a minor annoyance. When editing the SystemD service file, set Type to simple and remove --daemon from the options. Otherwise you may end up with unmounted PrivateTmp mounts and multiple openvpn-nl daemons (which of course hold on to the listening socket your new daemon needs, causing strange client-connect errors):

# /etc/systemd/system/openvpn@server.service.d/override.conf
[Service]
ExecStart=
# Take the original ExecStart, replace "openvpn" with "openvpn-nl"
# and remove "--daemon ...":
ExecStart=/usr/sbin/openvpn-nl --status /run/openvpn/%i.status 10 \
    --cd /etc/openvpn --script-security 2 --config /etc/openvpn/%i.conf \
    --writepid /run/openvpn/%i.pid
Type=simple

If you're okay with sticking to SHA256 and RSA for now, then OpenVPN-NL is compatible with vanilla OpenVPN. Do note that hardware acceleration in Mbed TLS is explicitly marked as disabled on the OpenVPN-NL lifecycle page. I'm not sure if this is a security decision, but it may prove to be less performant.

In conclusion: there is no immediate need to use OpenVPN-NL, but it is wise to take their changes to heart. Make sure:

  • you validate and trust packages from your software repository;
  • all your certificates are SHA256-signed (a quick check is shown below);
  • remote-cert-tls is enabled (and your certificates are marked with the correct key usage, e.g. by using a recent easy-rsa to sign your keys);
  • ciphers are fixed or non-negotiable using ncp-disable;
  • auth, cipher and tls-cipher are set to something modern.
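
To check an existing certificate for both the signature hash and the key usage bits (the file name is just an example):

$ openssl x509 -in client.crt -noout -text |
    grep -E 'Signature Algorithm|TLS Web (Server|Client) Authentication' |
    sort -u

You want to see sha256WithRSAEncryption and the appropriate TLS Web Client/Server Authentication extended key usage.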

But if you stick to the above configuration, then using OpenVPN-NL is fine too..

.. although still I cannot put my finger on how discarding bytes from /dev/random would make things more secure.

Notes about RNG, min-platform-entropy and hardware support

About “how discarding bytes from /dev/random makes things more secure.”

I think the theory is that throwing away some bytes makes things more secure, because the initially seeded bytes after reboot might be guessable. And instead of working against the added code — by lowering min-platform-entropy — we can attempt to get more/better entropy.

If the rdrand processor flag is available then this might be a piece of cake:

$ grep -q '^flags.*\brdrand\b' /proc/cpuinfo && echo has rdrand
has rdrand

If it isn't, and this is a virtual machine, you'll need to (a) confirm that it's available on the VM host and (b) enable the host processor in the VM guest (-cpu host). (If you wanted AES CPU acceleration, you would have enabled host CPU support already.)
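
On Proxmox VE, step (b) boils down to selecting the host CPU type for the guest, for example:

# qm set NNN --cpu host

(which takes effect after a full VM stop/start).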

When the processor flag is available, you can start benefitting from host-provided entropy.

$ cat /proc/sys/kernel/random/entropy_avail
701

This old entropy depletes faster than a Coca-Cola bottle with a Mentos in it, once you start reading from /dev/random directly.

But, if you install rng-tools, you get a nice /usr/sbin/rngd that checks entropy levels and reads from /dev/hwrng, replenishing the entropy as needed.

17:32:00.784031 poll([{fd=4, events=POLLOUT}], 1, -1) = 1 ([{fd=4, revents=POLLOUT}])
17:32:00.784162 ioctl(4, RNDADDENTROPY, {entropy_count=512, buf_size=64, buf="\262\310"...}) = 0
$ cat /proc/sys/kernel/random/entropy_avail
3138

Instant replenish! Now you can consider enabling use-prediction-resistance if you're using MbedTLS (through OpenVPN-NL).

Footnotes

See also blog.g3rt.nl openvpn security tips and how to harden OpenVPN in 2020.

2021-01-15 - postgresql inside kubernetes / no space left on device

Running PostgreSQL inside Kubernetes? Getting occasional "No space left on device" errors? Know that 64MB is not enough for everyone.

With the advent of more services running inside Kubernetes, we're now running into new issues and complexities specific to the containerization. For instance, to solve the problem of regular file backups of distributed filesystems, we've resorted to using rsync wrapped inside a pod (or sidecar). And now for containerized PostgreSQL, we're running into an artificial memory limit that needs fixing.

Manifestation

The issue manifests itself like this:

ERROR: could not resize shared memory segment "/PostgreSQL.491173048"
  to 4194304 bytes: No space left on device

This shared memory that PostgreSQL speaks of is the shared memory made available to it through /dev/shm.

On your development machine, it may look like this:

$ mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
tmpfs        16G  948M   15G   6% /dev/shm

That's fine: 16GiB is plenty of space. But in Kubernetes we get a default of a measly 64MiB and no means to change the shm size. So, inside the pod with the PostgreSQL daemon, things look like this:

$ mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
shm          64M     0   64M   0% /dev/shm

For a bunch of database operations, that is definitely too little. Any PostgreSQL database doing any serious work will quickly use up that much temporary space. (And run into this error.)

According to Thomas Munro on the postgrespro.com mailing list:

PostgreSQL creates segments in /dev/shm for parallel queries (via shm_open()), not for shared buffers. The amount used is controlled by work_mem. Queries can use up to work_mem for each node you see in the EXPLAIN plan, and for each process, so it can be quite a lot if you have lots of parallel worker processes and/or lots of tables/partitions being sorted or hashed in your query.

Basically what they're saying is: you need sufficient space in /dev/shm, period!

On the docker-library postgres page it is documented that you may want to increase the --shm-size (ShmSize). That is quite doable for direct Docker or docker-compose instantiations. But for PostgreSQL daemon pods in Kubernetes resizing shm does not seem to be possible.
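
(For direct Docker that is indeed a single flag — a sketch, not the setup discussed here:)

$ docker run --shm-size=1g -e POSTGRES_PASSWORD=secret -d postgres:13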

Any other fixes then?

Well, I'm glad you asked! /dev/shm is just one of the mechanisms through which the PostgreSQL daemon can be configured to allocate shared memory:

dynamic_shared_memory_type (enum)
Specifies the dynamic shared memory implementation that the server should use. Possible values are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared memory allocated via shmget), windows (for Windows shared memory), and mmap (to simulate shared memory using memory-mapped files stored in the data directory). [...]

(from PostgreSQL runtime config)

When using the posix shm_open(), we're directly opening files in /dev/shm. If, however, we opt to use the (old-fashioned) sysv shmget(), the memory allocation is not pinned to this filesystem and it is not limited (unless someone has been touching /proc/sys/kernel/shm*).
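
On a recent kernel those System V limits should be huge by default; you can verify them with:

$ cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall /proc/sys/kernel/shmmni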

Technical details of using System V shared memory

Using System V shared memory is a bit more convoluted than using POSIX shm. For POSIX shared memory calling shm_open() is basically the same as opening a (mmap-able) file in /dev/shm. For System V however, you're looking at an incantation like this shmdemo.c example:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB */

int main(int argc, char *argv[])
{
    key_t key;
    int shmid;
    char *data;

    if (argc > 2) {
        fprintf(stderr, "usage: shmdemo [data_to_write]\n");
        return 1;
    }
    /* The file here is used as a "pointer to memory". The key is
     * calculated based on the inode number and non-zero 8 bits: */
    if ((key = ftok("./pointer-to-memory.txt", 1 /* project_id */)) == -1) {
        fprintf(stderr, "please create './pointer-to-memory.txt'\n");
        return 2;
    }
    if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1)
        return 3;
    if ((data = shmat(shmid, NULL, 0)) == (char *)(-1)) /* attach */
        return 4;

    /* read or modify the segment, based on the command line: */
    if (argc == 2) {
        printf("writing to segment %#x: \"%s\"\n", key, argv[1]);
        strncpy(data, argv[1], SHM_SIZE);
    } else {
        printf("segment %#x contained: \"%s\"\n", key, data);
        shmctl(shmid, IPC_RMID, NULL); /* free the memory */
    }

    if (shmdt(data) == -1) /* detach */
        return 5;
    return 0;
}

(Luckily the PostgreSQL programmers concerned themselves with these awkward semantics, so we won't have to.)
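
Compiling the example requires nothing special:

$ cc -Wall -o shmdemo shmdemo.c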

If you want to confirm that you have access to sufficient System V shared memory inside your pod, you can use the above code sample to test it. Invoking it looks like:

$ ./shmdemo
please create './pointer-to-memory.txt'
$ touch ./pointer-to-memory.txt
$ ./shmdemo
segment 0x1010dd5 contained: ""
$ ./shmdemo 'please store this in shm'
writing to segment 0x1010dd5: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: ""

And if you skipped/forgot the IPC_RMID, you can see the leftovers using ipcs:

$ ipcs | awk '{if(int($6)==0)print}'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x52010e16 688235     walter     644        536870912  0                       
0x52010e19 688238     walter     644        536870912  0                       

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     

And remove them with ipcrm:

$ ipcrm -M 0x52010e16
$ ipcrm -M 0x52010e19

But, you probably did not come here for lessons in ancient IPC. Quickly moving on to the next paragraph...

Configuring sysv dynamic_shared_memory_type in stolon

For stolon — the Kubernetes PostgreSQL manager that we're using — you can configure different parameters through the pgParameters setting. It keeps the configuration in a ConfigMap:

$ kubectl -n NS get cm stolon-cluster-mycluster -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":...}'
    stolon-clusterdata: '{"formatVersion":1,...}'
...

Where the stolon-clusterdata holds both the configuration and current state:

{
  "formatVersion": 1,
  "changeTime": "2021-01-15T10:17:54.297700008Z",
  "cluster": {
...
    "spec": {
...
      "pgParameters": {
        "datestyle": "iso, mdy",
        "default_text_search_config": "pg_catalog.english",
        "dynamic_shared_memory_type": "posix",
...

You should not be editing this directly, but it can be educational to look at.
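
If you just want to peek at the pgParameters, something along these lines works (assuming jq is installed; adjust namespace and cluster name):

$ kubectl -n NS get cm stolon-cluster-mycluster -o json |
    jq -r '.metadata.annotations["stolon-clusterdata"]' |
    jq '.cluster.spec.pgParameters'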

To edit the pgParameters you'll be using stolonctl from inside a stolon-proxy as specified in the cluster specification patching docs:

$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"dynamic_shared_memory_type": "sysv"}}'
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"shared_buffers": "6144MB"}}'

And a restart:

$ kubectl -n NS rollout restart sts stolon-keeper

And that, my friends, should get rid of that pesky 64MiB limit.
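
To double-check that the keepers picked up the new setting (pod and user names here are placeholders):

$ kubectl -n NS exec -ti stolon-keeper-0 -- \
    psql -U postgres -c 'SHOW dynamic_shared_memory_type;'
 dynamic_shared_memory_type
----------------------------
 sysv
(1 row)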

2021-01-05 - chromium snap / wrong fonts

So, since a couple of weeks ago, my snap-installed Chromium browser on Ubuntu Focal has been acting up: suddenly it chooses the wrong fonts on some web pages. The wrongly chosen fonts are from the ~/.local/share/fonts/ directory.

[wjd.nu pages with incorrect looking font]

Look! That's not the correct font. And it's even more apparent that the font is off when viewing the page source.

[browser html source view with incorrect looking font]

Bah. That's not even a monospaced font.

A fix that appeared to work — but unfortunately only temporarily — involves moving the custom local fonts out of the way and then flushing the font cache:

$ mkdir ~/.local/share/DISABLED-fonts
$ mv ~/.local/share/fonts/* ~/.local/share/DISABLED-fonts/
$ fc-cache -rv && sudo fc-cache -rv

Restarting chromium-browser using about:restart took quite a while. Some patience had to be exercised.

When it finally did start, all font issues were solved.

Can we now restore our custom local fonts again?

$ mv ~/.local/share/DISABLED-fonts/* ~/.local/share/fonts/
$ fc-cache -rv && sudo fc-cache -rv

And another about:restart — which was fast as normal again — and everything was still fine. So yes, apparently, we can.

However, after half a day of work, the bug reappeared.

A semi-permanent fix is refraining from using the local fonts directory. But that's not really good enough.

Apparently there's a bug report showing that not only Chromium is affected. And while I'm not sure how to fix things yet, at least the following seems suspect:

$ grep include.*/snap/ \
    ~/snap/chromium/current/.config/fontconfig/fonts.conf 
  <include ignore_missing="yes">/snap/chromium/1424/gnome-platform/etc/fonts/fonts.conf</include>

This would make sense, if current/ pointed to 1424, but current/ now points to 1444.

Here's a not yet merged pull request that looks promising. And here, there's someone who grew tired of hotfixing the fonts.conf and symlinked all global font conf files into ~/.local/share/fonts/. That might also be worth a try...

A more permanent solution?

$ mkdir -p ~/snap/chromium/common/.config/fontconfig
$ cat >>~/snap/chromium/common/.config/fontconfig/fonts.conf <<EOF
<fontconfig>
  <include>/etc/fonts/conf.d</include>
</fontconfig>
EOF

I settled for a combination of the linked suggestions. The above snippet looks like it works. Crosses fingers...

Three weeks later...

Or at least, for a while. It looks like a new snap-installed version of Chromium broke things again. When logging in after the weekend, I was presented with the wrong fonts again.

This time, I:

  • fixed the symlinks,
  • removed the older/unused 1444 snap revision,
  • reran the fc-cache flush, and
  • restarted Chromium.

Permanent? No!

TL;DR

(Months later by now.. still a problem.)

It feels as if I'm the only one suffering from this. At least now the following sequence appears to work reliably:

  • new Chromium snap has been silently installed;
  • fonts are suddenly broken in currently running version;
  • sudo rm /var/snap/chromium/common/fontconfig/* ;
  • shut down / kill Chromium (make sure you get them all; see the sketch below);
  • start Chromium and reopen work with ctrl-shift-T.
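
Roughly, as a sketch (the pkill pattern is a guess that happens to match the snap paths on my system):

$ sudo rm /var/snap/chromium/common/fontconfig/*
$ pkill -f snap/chromium    # repeat until nothing is left
$ pgrep -af snap/chromium   # should print nothing
$ chromium &                # then ctrl-shift-T to reopen tabs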

(It's perhaps also worth checking whether the “default Chromium fonts are missing after snapd has been updated” ticket has been resolved.)

2021-01-02 - stale apparmor config / mysql refuses to start

So, recently we had an issue with a MariaDB server that refused to start. Or, actually, it would start, but before long, SystemD would kill it. But why?

# systemctl start mariadb.service
Job for mariadb.service failed because a timeout was exceeded.
See "systemctl status mariadb.service" and "journalctl -xe" for details.

After 90 seconds, it would be killed. systemctl status mariadb.service shows the immediate cause:

# systemctl status mariadb.service
...
systemd[1]: mariadb.service: Start operation timed out. Terminating.
systemd[1]: mariadb.service: Main process exited, code=killed, status=15/TERM
systemd[1]: mariadb.service: Failed with result 'timeout'.

Ok, a start operation timeout. That is caused by the notify type: apparently mysqld doesn't get a chance to tell SystemD that it has successfully completed startup.

First, a quickfix, so we can start at all:

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
Type=simple
EOF

That fixes startup — because SystemD no longer waits for a "started" notification — but it doesn't explain what is wrong.

Second, an attempt at debugging the cause:

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /usr/sbin/mysqld $MYSQLD_OPTS
EOF

Okay, that one showed EACCES errors from the sendmsg() call on the /run/systemd/notify unix socket:

strace[55081]: [pid 55084] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 46
strace[55081]: [pid 55084] sendmsg(46, {msg_name={sa_family=AF_UNIX,
  sun_path="/run/systemd/notify"}, msg_namelen=22,
  msg_iov=[{iov_base="READY=1\nSTATUS=Taking your SQL requests now...\n", iov_len=47}],
  msg_iovlen=1, msg_controllen=0, msg_flags=0},
  MSG_NOSIGNAL) = -1 EACCES (Permission denied)

Permission denied? But why?

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /bin/sh -c 'printf "READY=1\nSTATUS=Taking your SQL requests now...\n" | \
    socat - UNIX-SENDTO:/run/systemd/notify; sleep 3600'
EOF

This worked:

strace[54926]: [pid 54931] socket(AF_UNIX, SOCK_DGRAM, 0) = 5
strace[54926]: [pid 54931] sendto(5,
  "READY=1\nSTATUS=Taking your SQL requests now...\n", 47, 0,
  {sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, 21) = 47

(Unless someone is really trying to mess with you, you can regard sendto() and sendmsg() as equivalent here. socat simply uses the other one.)

That means that there is nothing wrong with SystemD or /run/systemd/notify. So the problem must be related to /usr/sbin/mysqld.

After looking at journalctl -u mariadb.service for the nth time, I decided to peek at all of journalctl without any filters. And there it was after all: audit logs.

# journalctl -t audit
audit[1428513]: AVC apparmor="DENIED" operation="sendmsg"
  info="Failed name lookup - disconnected path" error=-13
  profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=1428513
  comm="mysqld" requested_mask="w" denied_mask="w" fsuid=104 ouid=0

(Observe the -t in the journalctl invocation above which looks for the SYSLOG_IDENTIFIER=audit key-value pair.)
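
So an AppArmor profile for /usr/sbin/mysqld was still loaded in the kernel, even though the profile on disk was disabled. aa-status can confirm what is actually loaded:

# aa-status | grep mysqld
   /usr/sbin/mysqld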

Okay. And fixing it?

# aa-remove-unknown
Skipping profile in /etc/apparmor.d/disable: usr.sbin.mysqld
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Removing '/usr/sbin/mysqld'

A-ha! Stale cruft in /var/cache/apparmor.

# /etc/init.d/apparmor restart
Restarting apparmor (via systemctl): apparmor.service.

Finally we could undo the override.conf and everything started working as expected.

2021-01-01 - zfs / zvol / partition does not show up

On our Proxmox VM host I had to go into a guest volume to quickly fix an IP address. The volume exists on the VM host, so surely mounting it is easy. Right?

I checked in /dev/zvol/pve2-pool/ where I found the disk:

# ls -l /dev/zvol/pve2-pool/vm-125-virtio0*
lrwxrwxrwx 1 root root 10 Dec 29 15:55 vm-125-virtio0 -> ../../zd48

Good, there's a disk:

# fdisk -l /dev/zd48
Disk /dev/zd48: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0x000aec27

Device      Boot    Start       End  Sectors  Size Id Type
/dev/zd48p1 *        2048  97656831 97654784 46.6G 83 Linux
/dev/zd48p2      97656832 104855551  7198720  3.4G 82 Linux swap / Solaris

And it has partitions. Now if I could only find them, so I can mount them...

Apparently, there's a volmode on the ZFS volume that specifies how volumes should be exposed to the OS.

Setting it to full exposes volumes as fully fledged block devices, providing maximal functionality. [...] Setting it to dev hides its partitions. Volumes with property set to none are not exposed outside ZFS, but can be snapshoted, cloned, replicated, etc, that can be suitable for backup purposes.

So:

# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default
# zfs set volmode=full zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   full     local
# ls -1 /dev/zl-pve2-ssd1/
vm-122-virtio0
vm-123-virtio0
vm-124-virtio0
vm-125-virtio0
vm-125-virtio0-part1
vm-125-virtio0-part2

Yes! Partitions for vm-125-virtio0.

If that partition does not show up as expected, a call to partx -a /dev/zl-pve2-ssd1/vm-125-virtio0 might do the trick.

Quick, do some mount /dev/zl-pve2-ssd1/vm-125-virtio0-part1 /mnt/root; edit some files.

But, try to refrain from editing the volume while the VM is running. That may cause filesystem corruption.

Lastly umount and unset the volmode again:

# zfs inherit volmode zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default

And optionally update the kernel bookkeeping with: partx -d -n 1:2 /dev/zl-pve2-ssd1/vm-125-disk-0