Notes to self, 2021

2021-01-15 - postgresql inside kubernetes / no space left on device

Running PostgreSQL inside Kubernetes? Getting occasional "No space left on device" errors? Know that 64MB is not enough for everyone.

With more and more services running inside Kubernetes, we keep running into new issues and complexities specific to containerization. For instance, to get regular file backups of distributed filesystems, we've resorted to running rsync wrapped inside a pod (or sidecar). And now, for containerized PostgreSQL, we're running into an artificial shared memory limit that needs fixing.

Manifestation

The issue manifests itself like this:

ERROR: could not resize shared memory segment "/PostgreSQL.491173048"
  to 4194304 bytes: No space left on device

The shared memory that PostgreSQL speaks of is the shared memory made available to it through /dev/shm.

On your development machine, it may look like this:

$ mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
tmpfs        16G  948M   15G   6% /dev/shm

That's fine. 16GiB is plenty of space. But in Kubernetes we get a default of a measly 64MiB, with no means to change the shm-size. So, inside the pod with the PostgreSQL daemon, things look like this:

$ mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
$ df -h | sed -ne '1p;/shm/p'
Filesystem  Size  Used Avail Use% Mounted on
shm          64M     0   64M   0% /dev/shm

For a bunch of database operations, that is definitely too little. Any PostgreSQL database doing any serious work will quickly use up that much temporary space. (And run into this error.)

According to Thomas Munro on the postgrespro.com mailing list:

PostgreSQL creates segments in /dev/shm for parallel queries (via shm_open()), not for shared buffers. The amount used is controlled by work_mem. Queries can use up to work_mem for each node you see in the EXPLAIN plan, and for each process, so it can be quite a lot if you have lots of parallel worker processes and/or lots of tables/partitions being sorted or hashed in your query.

Basically what they're saying is: you need sufficient space in /dev/shm, period!

On the docker-library postgres page it is documented that you may want to increase the --shm-size (ShmSize). That is quite doable for direct Docker or docker-compose instantiations, but for PostgreSQL daemon pods in Kubernetes, resizing shm does not seem to be possible.
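For plain Docker, such an invocation could look like this (a sketch only; the image tag and password are arbitrary):

$ docker run -d --shm-size=1g -e POSTGRES_PASSWORD=secret postgres:13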

Any other fixes then?

Well, I'm glad you asked! /dev/shm is just one of the ways through which the PostgreSQL daemon can be configured to allocate dynamic shared memory:

dynamic_shared_memory_type (enum)
Specifies the dynamic shared memory implementation that the server should use. Possible values are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared memory allocated via shmget), windows (for Windows shared memory), and mmap (to simulate shared memory using memory-mapped files stored in the data directory). [...]

(from PostgreSQL runtime config)

When using the posix shm_open(), we're directly opening files in /dev/shm. If we instead opt for the (old-fashioned) sysv shmget(), the memory allocation is not tied to that filesystem and it is not limited (unless someone has been touching /proc/sys/kernel/shm*).
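You can check that nobody lowered those System V limits; on a recent stock kernel they are effectively unlimited (example values):

$ cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall /proc/sys/kernel/shmmni
18446744073692774399
18446744073692774399
4096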

Technical details of using System V shared memory

Using System V shared memory is a bit more convoluted than using POSIX shm. For POSIX shared memory, calling shm_open() is basically the same as opening a (mmap-able) file in /dev/shm. For System V however, you're looking at an incantation like this shmdemo.c example:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB */

int main(int argc, char *argv[])
{
    key_t key;
    int shmid;
    char *data;

    if (argc > 2) {
        fprintf(stderr, "usage: shmdemo [data_to_write]\n");
        return 1;
    }
    /* The file here is used as a "pointer to memory". The key is
     * calculated based on the inode number and non-zero 8 bits: */
    if ((key = ftok("./pointer-to-memory.txt", 1 /* project_id */)) == -1) {
        fprintf(stderr, "please create './pointer-to-memory.txt'\n");
        return 2;
    }
    if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1)
        return 3;
    if ((data = shmat(shmid, NULL, 0)) == (char *)(-1)) /* attach */
        return 4;

    /* read or modify the segment, based on the command line: */
    if (argc == 2) {
        printf("writing to segment %#x: \"%s\"\n", key, argv[1]);
        strncpy(data, argv[1], SHM_SIZE);
    } else {
        printf("segment %#x contained: \"%s\"\n", key, data);
        shmctl(shmid, IPC_RMID, NULL); /* free the memory */
    }

    if (shmdt(data) == -1) /* detach */
        return 5;
    return 0;
}

(Luckily the PostgreSQL programmers concerned themselves with these awkward semantics, so we won't have to.)
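Compiling the example requires nothing special (assuming a C compiler and the usual headers are present in your image):

$ cc -Wall -o shmdemo shmdemo.c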

If you want to confirm that you have access to sufficient System V shared memory inside your pod, you can use the compiled sample as a test. Invoking it looks like this:

$ ./shmdemo
please create './pointer-to-memory.txt'
$ touch ./pointer-to-memory.txt
$ ./shmdemo
segment 0x1010dd5 contained: ""
$ ./shmdemo 'please store this in shm'
writing to segment 0x1010dd5: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: ""

And if you skipped/forgot the IPC_RMID, you can see the leftovers using ipcs:

$ ipcs | awk '{if(int($6)==0)print}'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x52010e16 688235     walter     644        536870912  0                       
0x52010e19 688238     walter     644        536870912  0                       

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     

And remove them with ipcrm:

$ ipcrm -M 0x52010e16
$ ipcrm -M 0x52010e19

But, you probably did not come here for lessons in ancient IPC. Quickly moving on to the next paragraph...

Configuring sysv dynamic_shared_memory_type in stolon

For stolon — the Kubernetes PostgreSQL manager that we're using — you can configure different parameters through the pgParameters setting. Stolon keeps the configuration in a ConfigMap:

$ kubectl -n NS get cm stolon-cluster-mycluster -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":...}'
    stolon-clusterdata: '{"formatVersion":1,...}'
...

Where the stolon-clusterdata holds both the configuration and current state:

{
  "formatVersion": 1,
  "changeTime": "2021-01-15T10:17:54.297700008Z",
  "cluster": {
...
    "spec": {
...
      "pgParameters": {
        "datestyle": "iso, mdy",
        "default_text_search_config": "pg_catalog.english",
        "dynamic_shared_memory_type": "posix",
...

You should not be editing this directly, but it can be educational to look at.

To edit the pgParameters, you'll be using stolonctl from inside a stolon-proxy pod, as described in the cluster specification patching docs:

$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"dynamic_shared_memory_type": "sysv"}}'
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"shared_buffers": "6144MB"}}'

And a restart:

$ kubectl -n NS rollout restart sts stolon-keeper
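To verify that the new setting took effect, something like this should do (the keeper pod name and psql invocation are examples; adjust to your setup):

$ kubectl -n NS exec -it stolon-keeper-0 -- \
    psql -U postgres -c 'SHOW dynamic_shared_memory_type;'
 dynamic_shared_memory_type
----------------------------
 sysv
(1 row)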

And that, my friends, should get rid of that pesky 64MiB limit.

2021-01-05 - chromium snap / wrong fonts

So, a couple of weeks ago my snap-installed Chromium browser on Ubuntu focal started acting up: suddenly it chooses the wrong fonts on some web pages. The fonts it picks are from the ~/.local/share/fonts/ directory.

[wjd.nu pages with incorrect looking font]

Look! That's not the correct font. And it's even more apparent that the font is off when looking at the source view.

[browser html source view with incorrect looking font]

Bah. That's not even a monospaced font.

A fix that appeared to work — but unfortunately only temporarily — involves moving the custom local fonts out of the way and then flushing the font cache:

$ mkdir ~/.local/share/DISABLED-fonts
$ mv ~/.local/share/fonts/* ~/.local/share/DISABLED-fonts/
$ fc-cache -rv && sudo fc-cache -rv

Restarting chromium-browser using about:restart took quite a while. Some patience had to be exercised.

When it finally did start, all font issues were solved.

Can we now restore our custom local fonts again?

$ mv ~/.local/share/DISABLED-fonts/* ~/.local/share/fonts/
$ fc-cache -rv && sudo fc-cache -rv

After another about:restart — which was as fast as normal again — everything was still fine. So yes, apparently, we can.

However, after half a day of work, the bug reappeared.

A semi-permanent fix is refraining from using the local fonts directory altogether. But that's not really good enough.

Apparently there's a bug report showing that Chromium is not the only application affected. And while I'm not sure how to fix things yet, at least the following looks suspect:

$ grep include.*/snap/ \
    ~/snap/chromium/current/.config/fontconfig/fonts.conf 
  <include ignore_missing="yes">/snap/chromium/1424/gnome-platform/etc/fonts/fonts.conf</include>

This would make sense if current/ pointed to 1424, but current/ now points to 1444.
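That's easy to check (the revision shown here is from my machine):

$ readlink ~/snap/chromium/current
1444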

Here's a not-yet-merged pull request that looks promising. And here, there's someone who grew tired of hotfixing the fonts.conf and symlinked all global font conf files into ~/.local/share/fonts/. That might also be worth a try...

A more permanent solution?

$ mkdir -p ~/snap/chromium/common/.config/fontconfig
$ cat >>~/snap/chromium/common/.config/fontconfig/fonts.conf <<EOF
<fontconfig>
  <include>/etc/fonts/conf.d</include>
</fontconfig>
EOF

I settled on a combination of the linked suggestions. The above snippet looks like it works. Fingers crossed...
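As a spot check, fc-match shows which file a generic family resolves to (example output; the exact font will differ per system, and results inside the snap may differ from the host):

$ fc-match monospace
DejaVuSansMono.ttf: "DejaVu Sans Mono" "Book"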

2021-01-02 - stale apparmor config / mysql refuses to start

So, recently we had an issue with a MariaDB server that refused to start. Or, actually, it would start, but before long, SystemD would kill it. But why?

# systemctl start mariadb.service
Job for mariadb.service failed because a timeout was exceeded.
See "systemctl status mariadb.service" and "journalctl -xe" for details.

After 90 seconds, it would be killed. systemctl status mariadb.service shows the immediate cause:

# systemctl status mariadb.service
...
systemd[1]: mariadb.service: Start operation timed out. Terminating.
systemd[1]: mariadb.service: Main process exited, code=killed, status=15/TERM
systemd[1]: mariadb.service: Failed with result 'timeout'.

Okay, a start operation timeout. That is caused by the notify service type: apparently mysqld doesn't get a chance to tell SystemD that it has successfully completed startup.
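A quick look at the unit should confirm that notify type:

# systemctl cat mariadb.service | grep -i '^type='
Type=notify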

First, a quickfix, so we can start at all:

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
Type=simple
EOF

That fixes things so we can start — because now SystemD won't wait for any "started" notification anymore — but it doesn't explain what is wrong.
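After adding or changing such an override, remember to reload SystemD before restarting the service:

# systemctl daemon-reload
# systemctl restart mariadb.service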

Second, an attempt at debugging the cause:

# cat <<'EOF' >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /usr/sbin/mysqld $MYSQLD_OPTS
EOF

Okay, that one showed EACCES errors from the sendmsg() call on the /run/systemd/notify unix socket:

strace[55081]: [pid 55084] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 46
strace[55081]: [pid 55084] sendmsg(46, {msg_name={sa_family=AF_UNIX,
  sun_path="/run/systemd/notify"}, msg_namelen=22,
  msg_iov=[{iov_base="READY=1\nSTATUS=Taking your SQL requests now...\n", iov_len=47}],
  msg_iovlen=1, msg_controllen=0, msg_flags=0},
  MSG_NOSIGNAL) = -1 EACCES (Permission denied)

Permission denied? But why?

# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
  /bin/sh -c 'printf "READY=1\nSTATUS=Taking your SQL requests now...\n" | \
    socat - UNIX-SENDTO:/run/systemd/notify; sleep 3600'
EOF

This worked:

strace[54926]: [pid 54931] socket(AF_UNIX, SOCK_DGRAM, 0) = 5
strace[54926]: [pid 54931] sendto(5,
  "READY=1\nSTATUS=Taking your SQL requests now...\n", 47, 0,
  {sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, 21) = 47

(Unless someone is really trying to mess with you, you can regard sendto() and sendmsg() as equivalent here. socat simply uses the other one.)

That means that there is nothing wrong with SystemD or /run/systemd/notify. So the problem must be related to /usr/sbin/mysqld.

After looking at journalctl -u mariadb.service for the nth time, I decided to peek at all of journalctl without any filters. And there it was after all: audit logs.

# journalctl -t audit
audit[1428513]: AVC apparmor="DENIED" operation="sendmsg"
  info="Failed name lookup - disconnected path" error=-13
  profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=1428513
  comm="mysqld" requested_mask="w" denied_mask="w" fsuid=104 ouid=0

(Observe the -t in the journalctl invocation above, which matches the SYSLOG_IDENTIFIER=audit key-value pair.)
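For completeness, aa-status can confirm that an apparmor profile for mysqld is indeed loaded (example output):

# aa-status | grep mysqld
   /usr/sbin/mysqld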

Okay. And fixing it?

# aa-remove-unknown
Skipping profile in /etc/apparmor.d/disable: usr.sbin.mysqld
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Removing '/usr/sbin/mysqld'

A-ha! Stale cruft in /var/cache/apparmor.

# /etc/init.d/apparmor restart
Restarting apparmor (via systemctl): apparmor.service.

Finally we could undo the override.conf and everything started working as expected.

2021-01-01 - zfs / zvol / partition does not show up

For one of our Proxmox virtual machines, I had to go into a volume to quickly fix an IP address. The volume exists on the VM host, so surely mounting it there is easy. Right?

I checked in /dev/zvol/pve2-pool/ where I found the disk:

# ls -l /dev/zvol/pve2-pool/vm-125-virtio0*
lrwxrwxrwx 1 root root 10 Dec 29 15:55 /dev/zvol/pve2-pool/vm-125-virtio0 -> ../../zd48

Good, there's a disk:

# fdisk -l /dev/zd48
Disk /dev/zd48: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0x000aec27

Device      Boot    Start       End  Sectors  Size Id Type
/dev/zd48p1 *        2048  97656831 97654784 46.6G 83 Linux
/dev/zd48p2      97656832 104855551  7198720  3.4G 82 Linux swap / Solaris

And it has partitions. Now if I could only find them, so I can mount them...

Apparently, there's a volmode property on the ZFS volume that specifies how volumes should be exposed to the OS.

Setting it to full exposes volumes as fully fledged block devices, providing maximal functionality. [...] Setting it to dev hides its partitions. Volumes with property set to none are not exposed outside ZFS, but can be snapshoted, cloned, replicated, etc, that can be suitable for backup purposes.

So:

# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default
# zfs set volmode=full zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   full     local
# ls -1 /dev/zl-pve2-ssd1/
vm-122-virtio0
vm-123-virtio0
vm-124-virtio0
vm-125-virtio0
vm-125-virtio0-part1
vm-125-virtio0-part2

Yes! Partitions for vm-125-virtio0.

If the partitions do not show up as expected, a call to partx -a /dev/zl-pve2-ssd1/vm-125-virtio0 might do the trick.

Quick, do some mount /dev/zl-pve2-ssd1/vm-125-virtio0-part1 /mnt/root; edit some files.

But, try to refrain from editing the volume while the VM is running. That may cause filesystem corruption.

Lastly umount and unset the volmode again:

# zfs inherit volmode zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default

And optionally update the kernel's partition bookkeeping with: partx -d -n 1:2 /dev/zl-pve2-ssd1/vm-125-virtio0