Notes to self, 2021
2021-01-15 - postgresql inside kubernetes / no space left on device
Running PostgreSQL inside Kubernetes? Getting occasional "No space left on device" errors? Know that 64MB is not enough for everyone.
With more and more services running inside Kubernetes, we keep running into new issues and complexities specific to containerization. For instance, to solve the problem of regular file backups of distributed filesystems, we've resorted to using rsync wrapped inside a pod (or sidecar). And now, for containerized PostgreSQL, we're running into an artificial memory limit that needs fixing.
Manifestation
The issue manifests itself like this:
ERROR: could not resize shared memory segment "/PostgreSQL.491173048" to 4194304 bytes: No space left on device
This shared memory that PostgreSQL speaks of is the shared memory made available to it through /dev/shm.
On your development machine, it may look like this:
$ mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)

$ df -h | sed -ne '1p;/shm/p'
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            16G  948M   15G   6% /dev/shm
That's fine. 16GiB is plenty of space. But in Kubernetes we get a default of a measly 64MiB and no means to change the shm-size. So, inside the pod with the PostgreSQL daemon, things look like this:
$ mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)

$ df -h | sed -ne '1p;/shm/p'
Filesystem      Size  Used Avail Use% Mounted on
shm              64M     0   64M   0% /dev/shm
For a bunch of database operations, that is definitely too little. Any PostgreSQL database doing any serious work will quickly use up that much temporary space. (And run into this error.)
According to Thomas Munro on the postgrespro.com mailing list:
PostgreSQL creates segments in /dev/shm for parallel queries (via shm_open()), not for shared buffers. The amount used is controlled by work_mem. Queries can use up to work_mem for each node you see in the EXPLAIN plan, and for each process, so it can be quite a lot if you have lots of parallel worker processes and/or lots of tables/partitions being sorted or hashed in your query.
Basically what they're saying is: you need sufficient space in /dev/shm, period!
On the docker-library postgres page it is documented that you may want to increase the --shm-size (ShmSize). That is quite doable for direct Docker or docker-compose instantiations. But for PostgreSQL daemon pods in Kubernetes, resizing shm does not seem to be possible.
Any other fixes then?
Well, I'm glad you asked! /dev/shm is just one of the ways in which the PostgreSQL daemon can be configured to allocate shared memory:
dynamic_shared_memory_type (enum)
    Specifies the dynamic shared memory implementation that the server should use. Possible values are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared memory allocated via shmget), windows (for Windows shared memory), and mmap (to simulate shared memory using memory-mapped files stored in the data directory). [...]
(from PostgreSQL runtime config)
When using the posix shm_open(), we're directly opening files in /dev/shm. If we however opt to use the (old-fashioned) sysv shmget(), the memory allocation is not pinned to this filesystem and it is not limited (unless someone has been touching /proc/sys/kernel/shm*).
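To make the difference concrete, here is a minimal POSIX counterpart to the System V example below. This is not PostgreSQL code, just an illustration (the posixdemo.c file name and the /posixdemo segment name are made up): shm_open() plus mmap() essentially boils down to creating an ordinary file on the /dev/shm tmpfs.

#include <fcntl.h>      /* O_* constants */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>   /* shm_open(), mmap() */
#include <unistd.h>     /* ftruncate(), close() */

#define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB, like the sysv demo below */

int main(void)
{
    /* On Linux, shm_open("/posixdemo", ...) creates /dev/shm/posixdemo: */
    int fd = shm_open("/posixdemo", O_CREAT | O_RDWR, 0600);
    if (fd == -1)
        return 1;
    /* Set the size; the pages are backed by the /dev/shm tmpfs: */
    if (ftruncate(fd, SHM_SIZE) == -1)
        return 2;
    char *data = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 3;
    strcpy(data, "hello from posix shm");
    printf("wrote to /dev/shm/posixdemo: \"%s\"\n", data);
    printf("press enter to clean up...\n");
    getchar(); /* meanwhile, try `ls -lh /dev/shm` from another shell */
    munmap(data, SHM_SIZE);
    close(fd);
    shm_unlink("/posixdemo"); /* removes /dev/shm/posixdemo again */
    return 0;
}

Compile it with cc posixdemo.c -o posixdemo (older glibc versions want an extra -lrt) and look at /dev/shm from another shell while it waits: the segment shows up as a plain file. PostgreSQL's posix implementation does essentially this (plus an ftruncate()/posix_fallocate() to size the segment), which is why a 64MiB /dev/shm translates directly into the "could not resize shared memory segment" error above.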
Technical details of using System V shared memory
Using System V shared memory is a bit more convoluted than using POSIX shm. For POSIX shared memory, calling shm_open() is basically the same as opening a (mmap-able) file in /dev/shm. For System V however, you're looking at an incantation like this shmdemo.c example:
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE (size_t)(512 * 1024 * 1024UL) /* 512MiB */

int main(int argc, char *argv[])
{
    key_t key;
    int shmid;
    char *data;

    if (argc > 2) {
        fprintf(stderr, "usage: shmdemo [data_to_write]\n");
        return 1;
    }
    /* The file here is used as a "pointer to memory". The key is
     * calculated based on the inode number and non-zero 8 bits: */
    if ((key = ftok("./pointer-to-memory.txt", 1 /* project_id */)) == -1) {
        fprintf(stderr, "please create './pointer-to-memory.txt'\n");
        return 2;
    }
    if ((shmid = shmget(key, SHM_SIZE, 0644 | IPC_CREAT)) == -1)
        return 3;
    if ((data = shmat(shmid, NULL, 0)) == (char *)(-1)) /* attach */
        return 4;

    /* read or modify the segment, based on the command line: */
    if (argc == 2) {
        printf("writing to segment %#x: \"%s\"\n", key, argv[1]);
        strncpy(data, argv[1], SHM_SIZE);
    } else {
        printf("segment %#x contained: \"%s\"\n", key, data);
        shmctl(shmid, IPC_RMID, NULL); /* free the memory */
    }

    if (shmdt(data) == -1) /* detach */
        return 5;
    return 0;
}
(Luckily the PostgreSQL programmers concerned themselves with these awkward semantics, so we won't have to.)
If you want to confirm that you have access to sufficient System V shared memory inside your pod, you can use the above code sample to test it. Invoking it looks like this:
$ ./shmdemo
please create './pointer-to-memory.txt'
$ touch ./pointer-to-memory.txt
$ ./shmdemo
segment 0x1010dd5 contained: ""
$ ./shmdemo 'please store this in shm'
writing to segment 0x1010dd5: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: "please store this in shm"
$ ./shmdemo
segment 0x1010dd5 contained: ""
And if you skipped/forgot the IPC_RMID, you can see the leftovers using ipcs:
$ ipcs | awk '{if(int($6)==0)print}'

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x52010e16 688235     walter     644        536870912  0
0x52010e19 688238     walter     644        536870912  0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
And remove them with ipcrm:
$ ipcrm -M 0x52010e16
$ ipcrm -M 0x52010e19
But, you probably did not come here for lessons in ancient IPC. Quickly moving on to the next paragraph...
Configuring sysv dynamic_shared_memory_type in stolon
For stolon — the Kubernetes PostgreSQL manager that we're using — you can configure different parameters through the pgParameters setting. It keeps the configuration in a configMap:
$ kubectl -n NS get cm stolon-cluster-mycluster -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":...}'
    stolon-clusterdata: '{"formatVersion":1,...}'
  ...
Where the stolon-clusterdata holds both the configuration and current state:
{ "formatVersion": 1, "changeTime": "2021-01-15T10:17:54.297700008Z", "cluster": { ... "spec": { ... "pgParameters": { "datestyle": "iso, mdy", "default_text_search_config": "pg_catalog.english", "dynamic_shared_memory_type": "posix", ...
You should not be editing this directly, but it can be educational to look at.
To edit the pgParameters you'll be using stolonctl from inside a stolon-proxy, as specified in the cluster specification patching docs:
$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"dynamic_shared_memory_type": "sysv"}}'

$ stolonctl --cluster-name=mycluster --store-backend=kubernetes \
    --kube-resource-kind=configmap update --patch \
    '{"pgParameters": {"shared_buffers": "6144MB"}}'
And a restart:
$ kubectl -n NS rollout restart sts stolon-keeper
And that, my friends, should get rid of that pesky 64MiB limit.
2021-01-05 - chromium snap / wrong fonts
So, for a couple of weeks now my snap-installed Chromium browser on Ubuntu Focal has been acting up: suddenly it chooses the wrong fonts on some web pages. The chosen fonts are from the ~/.local/share/fonts/ directory.
![[wjd.nu pages with incorrect looking font]](/files/2021/01/badfont-wjd.png)
Look! That's not the correct font. And it's even more apparent that the font is off when seeing the source view.
![[browser html source view with incorrect looking font]](/files/2021/01/badfont-wjdsource.png)
Bah. That's not even a monospaced font.
A fix that appeared to work — but unfortunately only temporarily — involves moving the custom local fonts out of the way and then flushing the font cache:
$ mkdir ~/.local/share/DISABLED-fonts
$ mv ~/.local/share/fonts/* ~/.local/share/DISABLED-fonts/
$ fc-cache -rv && sudo fc-cache -rv
Restarting chromium-browser using about:restart took quite a while. Some patience had to be exercised. When it finally did start, all font issues were solved.
Can we now restore our custom local fonts again?
$ mv ~/.local/share/DISABLED-fonts/* ~/.local/share/fonts/
$ fc-cache -rv && sudo fc-cache -rv
And another about:restart — which was as fast as normal again — and everything was still fine. So yes, apparently, we can.
However, after half a day of work, the bug reappeared.
A semi-permanent fix is refraining from using the local fonts directory. But that's not really good enough.
Apparently there's a bug report showing that not only Chromium is affected. And while I'm not sure how to fix things yet, at least the following seems suspect:
$ grep include.*/snap/ \
    ~/snap/chromium/current/.config/fontconfig/fonts.conf
<include ignore_missing="yes">/snap/chromium/1424/gnome-platform/etc/fonts/fonts.conf</include>
This would make sense if current/ pointed to 1424, but current/ now points to 1444.
Here's a not yet merged pull request that looks promising.
And here, there's someone who grew tired of hotfixing the fonts.conf and symlinked all global font conf files into ~/.local/share/fonts/. That might also be worth a try...
A more permanent solution?
$ mkdir -p ~/snap/chromium/common/.config/fontconfig
$ cat >>~/snap/chromium/common/.config/fontconfig/fonts.conf <<EOF
<fontconfig>
  <include>/etc/fonts/conf.d</include>
</fontconfig>
EOF
I settled for a combination of the linked suggestions. The above snippet looks like it works. Crosses fingers...
2021-01-02 - stale apparmor config / mysql refuses to start
So, recently we had an issue with a MariaDB server that refused to start. Or, actually, it would start, but before long, SystemD would kill it. But why?
# systemctl start mariadb.service
Job for mariadb.service failed because a timeout was exceeded.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
After 90 seconds, it would be killed. systemctl status mariadb.service shows the immediate cause:
# systemctl status mariadb.service
...
systemd[1]: mariadb.service: Start operation timed out. Terminating.
systemd[1]: mariadb.service: Main process exited, code=killed, status=15/TERM
systemd[1]: mariadb.service: Failed with result 'timeout'.
Ok, a start operation timeout. That is caused by the notify service type: apparently the mysqld doesn't get a chance to tell SystemD that it has successfully completed startup.
First, a quickfix, so we can start at all:
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
Type=simple
EOF
That fixes things so we can start — because now SystemD won't wait for any "started" notification anymore — but it doesn't explain what is wrong.
Second, an attempt at debugging the cause:
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
    /usr/sbin/mysqld $MYSQLD_OPTS
EOF
Okay, that one showed EACCES errors on the sendmsg() call on the /run/systemd/notify unix socket:
strace[55081]: [pid 55084] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 46
strace[55081]: [pid 55084] sendmsg(46, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, msg_namelen=22, msg_iov=[{iov_base="READY=1\nSTATUS=Taking your SQL requests now...\n", iov_len=47}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EACCES (Permission denied)
Permission denied? But why?
# cat <<EOF >/etc/systemd/system/mariadb.service.d/override.conf
[Service]
NotifyAccess=all
ExecStart=
ExecStart=/usr/bin/strace -fesendmsg,sendto,connect,socket -s8192 \
    /bin/sh -c 'printf "READY=1\nSTATUS=Taking your SQL requests now...\n" | \
    socat - UNIX-SENDTO:/run/systemd/notify; sleep 3600'
EOF
This worked:
strace[54926]: [pid 54931] socket(AF_UNIX, SOCK_DGRAM, 0) = 5
strace[54926]: [pid 54931] sendto(5, "READY=1\nSTATUS=Taking your SQL requests now...\n", 47, 0, {sa_family=AF_UNIX, sun_path="/run/systemd/notify"}, 21) = 47
(Unless someone is really trying to mess with you, you can regard sendto() and sendmsg() as equivalent here. socat simply uses the other one.)
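For context, the notification that the notify type waits for is nothing magical: it is a single datagram containing "READY=1", sent to the unix socket that systemd advertises to the service in the NOTIFY_SOCKET environment variable (the strace above shows it as /run/systemd/notify). A minimal hand-rolled notifier, as an illustration rather than the libsystemd sd_notify() implementation, could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    /* systemd tells a Type=notify service where to send its status: */
    const char *path = getenv("NOTIFY_SOCKET");
    if (path == NULL)
        path = "/run/systemd/notify"; /* the path seen in the strace above */

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd == -1)
        return 1;

    /* A single datagram with "READY=1" is all that systemd waits for: */
    const char msg[] = "READY=1\nSTATUS=Taking your SQL requests now...\n";
    if (sendto(fd, msg, strlen(msg), 0,
               (const struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("sendto");
        close(fd);
        return 2;
    }
    close(fd);
    return 0;
}

(A real implementation, like sd_notify() from libsystemd, also handles abstract sockets, where NOTIFY_SOCKET starts with an @ character.)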
That means that there is nothing wrong with SystemD or /run/systemd/notify. So the problem must be related to /usr/sbin/mysqld.
After looking at journalctl -u mariadb.service for the nth time, I decided to peek at all of journalctl without any filters. And there it was after all: audit logs.
# journalctl -t audit
audit[1428513]: AVC apparmor="DENIED" operation="sendmsg" info="Failed name lookup - disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=1428513 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=104 ouid=0
(Observe the -t in the journalctl invocation above, which looks for the SYSLOG_IDENTIFIER=audit key-value pair.)
Okay. And fixing it?
# aa-remove-unknown
Skipping profile in /etc/apparmor.d/disable: usr.sbin.mysqld
Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Removing '/usr/sbin/mysqld'
A-ha! Stale cruft in /var/cache/apparmor.
# /etc/init.d/apparmor restart
Restarting apparmor (via systemctl): apparmor.service.
Finally we could undo the override.conf and everything started working as expected.
2021-01-01 - zfs / zvol / partition does not show up
For one of our Proxmox virtual machines I had to go into a volume to quickly fix an IP address. The volume exists on the VM host, so surely mounting it is easy. Right?
I checked in /dev/zvol/pve2-pool/ where I found the disk:
# ls /dev/zvol/pve2-pool/vm-125-virtio0*
total 0
lrwxrwxrwx 1 root root 10 Dec 29 15:55 vm-125-virtio0 -> ../../zd48
Good, there's a disk:
# fdisk -l /dev/zd48
Disk /dev/zd48: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: dos
Disk identifier: 0x000aec27

Device      Boot    Start       End  Sectors  Size Id Type
/dev/zd48p1 *        2048  97656831 97654784 46.6G 83 Linux
/dev/zd48p2      97656832 104855551  7198720  3.4G 82 Linux swap / Solaris
And it has partitions. Now if I could only find them, so I can mount them...
Apparently, there's a volmode on the ZFS volume that specifies how volumes should be exposed to the OS.

Setting it to full exposes volumes as fully fledged block devices, providing maximal functionality. [...] Setting it to dev hides its partitions. Volumes with property set to none are not exposed outside ZFS, but can be snapshoted, cloned, replicated, etc, that can be suitable for backup purposes.
So:
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default

# zfs set volmode=full zl-pve2-ssd1/vm-125-virtio0

# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   full     local

# ls -1 /dev/zl-pve2-ssd1/
vm-122-virtio0
vm-123-virtio0
vm-124-virtio0
vm-125-virtio0
vm-125-virtio0-part1
vm-125-virtio0-part2
Yes! Partitions for vm-125-virtio0.
If that partition does not show up as expected, a call to partx -a /dev/zl-pve2-ssd1/vm-125-virtio0 might do the trick.
Quick, do some mount /dev/zl-pve2-ssd1/vm-125-virtio0-part1 /mnt/root; edit some files.
But, try to refrain from editing the volume while the VM is running. That may cause filesystem corruption.
Lastly umount and unset the volmode again:
# zfs inherit volmode zl-pve2-ssd1/vm-125-virtio0
# zfs get volmode zl-pve2-ssd1/vm-125-virtio0
NAME                         PROPERTY  VALUE    SOURCE
zl-pve2-ssd1/vm-125-virtio0  volmode   default  default
And optionally update the kernel bookkeeping with: partx -d -n 1:2 /dev/zl-pve2-ssd1/vm-125-disk-0