Notes to self, 2023

2023-10-15 - mplayer / screen saver / wayland

I don't usually watch movies from my laptop, but when I do, I don't want the screen saver to kick in.

Occasionally, I notice that my old time movie player mplayer does not inhibit the screen saver or screen blanking. That means that 5 minutes into a movie, the screen turns black.

I don't think this used to be the case. Maybe it's because I'm running the Wayland display server in GNOME now.

In any case, inhibiting the screen blanking can be done using the gnome-session-inhibit program:

gnome-session-inhibit --inhibit idle PROGRAM

For example:

gnome-session-inhibit --inhibit idle mplayer /path/to/wholesome-documentary.mkv

2023-09-20 - bash / postfix health check / dev tcp

Using /dev/tcp in bash for a health check? Here's an example.

I had a script that used netcat to connect to a Postfix email daemon to check its health status. To avoid pipelining errors I had it sleep between each write. The core looked somewhat like this:

messages=$(for x in \
        'EHLO localhost' \
        'MAIL FROM:<healthz@localhost>' \
        'RCPT TO:<postmaster@example.com>' \
        RSET \
        QUIT
    do sleep 0.1; echo "$x"; done |
        nc -v $HOST 25 2>&1 |
        tr -d '\r')

This works, but the sleeps make it slower than necessary, and more brittle. If the daemon is temporarily slow, we can trigger a Postfix SMTP command pipelining error anyway.

Ideally, we want to read the responses, and act on them immediately instead.

Here's a script that uses bash instead of POSIX sh because bash has /dev/tcp support, which makes doing networking I/O easier.

Starting bash might be slightly costlier than starting a smaller POSIX sh like dash. But we avoid calling netcat and some other tools, so we win out not only in speed but also in resource usage.

#!/bin/bash
# Using bash (not POSIX sh) for /dev/tcp I/O.
: ${BASH_VERSION:?Use bash, not POSIX-sh}
set -u

DEBUG=0
READ_TIMEOUT=10
CHAR_CR=$'\r'

#HOST=mail.example.com
#HOST=1.2.3.4
HOST=127.0.0.1
PROXY_PROTOCOL=

if [ "$HOST" = 127.0.0.1 ]; then
    # This should succeed in mere milliseconds. But sometimes postfix
    # decides to take just over a second for the RCPT TO check.
    READ_TIMEOUT=3
    PROXY_PROTOCOL='PROXY TCP4 127.0.0.1 127.0.0.1 12345 25'  # (or empty)
fi

getresp() {
    local line status
    while :; do
        read -t $READ_TIMEOUT -r line
        [ $DEBUG -ne 0 ] && printf '%s\n' "<<< $line" >&2
        printf '%s\n' "${line%$CHAR_CR}"
        test -z "$line" && exit 65
        status=${line:0:3}
        if [ $status -lt 200 -o $status -ge 300 ]; then
            exit 66
        elif [ "${line:3:1}" = ' ' ]; then
            break   # "250 blah"
        elif [ "${line:3:1}" = '-' ]; then
            true    # "250-blah" (continue)
        else
            exit 67
        fi
    done
}

if ! exec 3<>/dev/tcp/$HOST/25; then  # open fd
    # Takes a looooot of time. 2m10 in the test case. You will want to wrap
    # this script in a timeout(1) call.
    # $0: connect: Connection timed out
    # $0: line 40: /dev/tcp/1.2.3.4/25: Connection timed out
    exit 1
fi

messages=$(for x in \
        "$PROXY_PROTOCOL" \
        'EHLO localhost' \
        'MAIL FROM:<healthz@localhost>' \
        'RCPT TO:<postmaster@example.com>' \
        RSET \
        QUIT
    do \
        [ -n "$x" -a $DEBUG -ne 0 ] && printf '>>> %s\n' "$x" >&2
        [ -n "$x" ] && printf '%s\r\n' "$x" >&3
        getresp <&3 || exit $?
    done)
ret=$?

exec 3>&- # close fd

ok=$(echo "$messages" | grep -xF '250 2.0.0 Ok')
fail=$(echo "$messages" | sed -e1d | grep -v ^2 | grep '')
if [ $ret -ne 0 -o -z "$ok" ]; then
    echo "Missing OK line (ret $ret). Got:" $messages
    false
elif [ -n "$fail" ]; then
    echo "$fail"
    false
else
    true
fi

One can invoke this from something like a Haproxy check script — this is also the part where you add a timeout call.

#!/bin/sh
CHECK_SCRIPT=/usr/local/bin/postfix-is-healthy

if test -f /srv/in_maintenance; then
    echo 'drain stopped 0%'
elif error=$(timeout -k3 -s1 2 $CHECK_SCRIPT 2>&1); then
    echo 'ready up 100%'
else
    echo 'fail #' $error
fi

And that script could be invoked from something like xinetd:

service postfix-haproxy
{
        flags           = REUSE
        socket_type     = stream
        type            = unlisted
        port            = 1025
        wait            = no
        user            = nobody
        server          = /usr/local/bin/postfix-haproxy-agent-check
        log_on_failure  += USERID
        disable         = no
        only_from       = 0.0.0.0/0 ::1
        per_source      = UNLIMITED
}

2023-09-18 - gpgv / can't allocate lock for

gpgv prints out a warning that it cannot allocate a lock. This looks like something should be fixable, but it isn't.

Observed with gpg version 2.2.27-3ubuntu2.1:

$ gpgv </dev/null
gpgv: can't allocate lock for '/home/walter/.gnupg/trustedkeys.gpg'
gpgv: verify signatures failed: Unknown system error

The issue at hand here is the “gpgv: can't allocate lock for '/home/walter/.gnupg/trustedkeys.gpg'”. Can we fix something to suppress that?

TL;DR: no

Investigation

Fire up the debugger gdb and break at keybox_lock.constprop.0.isra.0.

(gdb) break keybox_lock.constprop.0.isra.0
Breakpoint 1 at 0x2f850: file ../kbx/../../kbx/keybox-init.c, line 271.
(gdb) run
Starting program: /usr/bin/gpgv
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, keybox_lock.constprop.0.isra.0 (yes=yes@entry=1, timeout=0, hd=<optimized out>, hd=<optimized out>) at ../kbx/../../kbx/keybox-init.c:271
271 keybox_lock (KEYBOX_HANDLE hd, int yes, long timeout)
(gdb) list
266  * Lock the keybox at handle HD, or unlock if YES is false.  TIMEOUT
267  * is the value used for dotlock_take.  In general -1 should be used
268  * when taking a lock; use 0 when releasing a lock.
269  */
270 gpg_error_t
271 keybox_lock (KEYBOX_HANDLE hd, int yes, long timeout)
272 {
273   gpg_error_t err = 0;
274   KB_NAME kb = hd->kb;
275
(gdb) list
276   if (!keybox_is_writable (kb))
277     return 0;
278
279   /* Make sure the lock handle has been created.  */
280   if (!kb->lockhd)
281     {
282       kb->lockhd = dotlock_create (kb->fname, 0);
283       if (!kb->lockhd)
284         {
285           err = gpg_error_from_syserror ();
(gdb) next
276   if (!keybox_is_writable (kb))
(gdb)
280   if (!kb->lockhd)
(gdb)
285           err = gpg_error_from_syserror ();

Okay, there was a dotlock_create that supposedly failed. But it isn't even called!

We can confirm this in the machine code:

$ objdump -d `which gpgv` --disassemble=keybox_lock.constprop.0.isra.0
...
000000000002f850 <keybox_lock.constprop.0.isra.0>:
   2f850: f3 0f 1e fa           endbr64
   2f854: 41 55                 push   %r13
   2f856: 41 54                 push   %r12
   2f858: 55                    push   %rbp
   2f859: 53                    push   %rbx
   2f85a: 48 83 ec 08           sub    $0x8,%rsp
   2f85e: 48 85 ff              test   %rdi,%rdi
   2f861: 74 3d                 je     2f8a0 <keybox_lock.constprop.0.isra.0+0x50>
   2f863: 4c 8d 6f 30           lea    0x30(%rdi),%r13
   2f867: 48 89 fb              mov    %rdi,%rbx
   2f86a: 89 f5                 mov    %esi,%ebp
   2f86c: be 02 00 00 00        mov    $0x2,%esi
   2f871: 4c 89 ef              mov    %r13,%rdi
   2f874: e8 97 6f fd ff        call   6810 <gpgrt_access@plt>

This call was the keybox_is_writable check.

   2f879: 41 89 c4              mov    %eax,%r12d
   2f87c: 85 c0                 test   %eax,%eax
   2f87e: 75 20                 jne    2f8a0 <keybox_lock.constprop.0.isra.0+0x50>
   2f880: 48 83 7b 20 00        cmpq   $0x0,0x20(%rbx)

Here is the first !kb->lockhd check.

   2f885: 74 41                 je     2f8c8 <keybox_lock.constprop.0.isra.0+0x78>

It was zero, so we jump to 0x2f8c8.

   2f887: 8b 43 28              mov    0x28(%rbx),%eax
   2f88a: 85 ed                 test   %ebp,%ebp
   2f88c: 74 2a                 je     2f8b8 <keybox_lock.constprop.0.isra.0+0x68>
   2f88e: 85 c0                 test   %eax,%eax
...

0x2f8c8 is here: immediate error handling. No attempt to call dotlock_create at all.

   2f8c8: e8 d3 6a fd ff        call   63a0 <gpg_err_code_from_syserror@plt>
   2f8cd: 4c 89 ee              mov    %r13,%rsi
   2f8d0: 48 8d 3d db c6 00 00  lea    0xc6db(%rip),%rdi        # 3bfb2 <__FUNCTION__.0.lto_priv.14+0x192>

Here the "can't allocate lock for '%s'\n" string is loaded.

   2f8d7: 41 89 c4              mov    %eax,%r12d
   2f8da: 0f b7 c0              movzwl %ax,%eax
   2f8dd: 0d 00 00 00 08        or     $0x8000000,%eax
   2f8e2: 45 85 e4              test   %r12d,%r12d
   2f8e5: 44 0f 45 e0           cmovne %eax,%r12d
   2f8e9: 31 c0                 xor    %eax,%eax
   2f8eb: e8 70 5b fe ff        call   15460 <log_info>
   2f8f0: eb b1                 jmp    2f8a3 <keybox_lock.constprop.0.isra.0+0x53>

So, we get a “can't allocate” warning always.

If we build gpgv ourselves, we can take a look at the gpgv.o object file, which has a no-op dotlock_create:

0000000000000400 <dotlock_create>:
 400:   f3 0f 1e fa             endbr64
 404:   31 c0                   xor    %eax,%eax
 406:   c3                      ret
 407:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 40e:   00 00

This function does nothing and returns NULL. When linking, it is elided, so the call disappears, along with the double !kb->lockhd check.

(That nopw is an extended nop (nop-op). It is there for alignment purposes only.)

Conclusion

This leaves two questions:

  • Why is dotlock_create a no-op? Apparently gpgv.c contains this:
    // ...
    
    /* We do not do any locking, so use these stubs here */
    void
    dotlock_disable (void)
    {
    }
    
    dotlock_t
    dotlock_create (const char *file_to_lock, unsigned int flags)
    {
      (void)file_to_lock;
      (void)flags;
      return NULL;
    }
    
    // ...
    
  • Why are we forced to look at a warning message for something that will never succeed anyway? I guess no one saw this as an issue.

2023-09-15 - etckeeper / git / pack-objects died of signal 9

Is your etckeeper dying on the git gc?

On several machines I have now seen git run out of memory when trying to do repo optimization and garbage collection. Usually this happened in /etc where we like to have etckeeper. It might look like this:

...
warning: The last gc run reported the following. Please correct the root cause
and remove .git/gc.log
Automatic cleanup will not be performed until the file is removed.

error: pack-objects died of signal 9
fatal: failed to run repack

A manual git gc did not help:

# git gc --prune=now
Enumerating objects: 11391, done.
Counting objects: 100% (11391/11391), done.
Delta compression using up to 2 threads
error: pack-objects died of signal 9)
fatal: failed to run repack

But combining it with a memory limit did:

# git -c pack.windowMemory=1024m gc --prune=now
Enumerating objects: 11391, done.
Counting objects: 100% (11391/11391), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5664/5664), done.
Writing objects: 100% (11391/11391), done.
Total 11391 (delta 7052), reused 8042 (delta 5041), pack-reused 0

That was quite slow. But now everything is okay again.

2023-09-14 - segfault in library / addr2line / objdump

Yesterday, we spotted some SEGFAULTs on an Ubuntu/Focal server. We did not have core dumps, but the kernel message in dmesg was sufficient to find a culprit.

The observed messages were these:

nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49
nginx[951947]: segfault at 10 ip 00007fba4a1645a3 sp 00007ffe57b0f8a0 error 4 in libperl.so.5.30.0 (deleted)[7fba4a0fc000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49

And after upgrading libperl5.30 from 5.30.0-9ubuntu0.3 to 5.30.0-9ubuntu0.4, we got these similar ones:

traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]
nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]
Code: 00 0f b6 40 30 49 c1 ed 03 49 29 c5 0f 84 17 01 00 00 48 8b 76 10 48 8b 52 10 4c 8d 3c fe 4c 8d 0c c2 84 c9 0f 84 c7 02 00 00 <49> 83 39 00 0f 85 ad 03 00 00 49 83 c1 08 49 83 ed 01 49 8d 74 1d

Apparently they were triggered by an nginx reload.

If we had a proper core dump, we could extract lots of useful info from it: where the crash occurred, which registers and variables were set, and the call chain (backtrace). With the info from above, we can at most get where the crash happened, and maybe which register had a bad value. But it is definitely better than nothing.

Feeding calculated offset to addr2line

For the most basic attempt, I found a box which still had libperl version 5.30.0-9ubuntu0.3. I installed the perl-debug apt package — perl-debug_5.30.0-9ubuntu0.3_amd64.deb from https://launchpadlibrarian.net/ — there. From the kernel message “nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]” we take the instruction pointer 00007ff40dc2f5a3 and subtract the library starting position 7ff40dbc7000:

0x7ff40dc2f5a3 - 0x7ff40dbc7000 = 0x685a3

Feed that to addr2line and get the location of the crash... right?

$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 685a3
Perl_vload_module
op.c:7750

At first glance that appears okay. But when we check what happens in the machine instructions there, it is not:

$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --disassemble=Perl_vload_module
...
00000000000685a3 <Perl_vload_module@@Base+0xd3>:
...
   6859b:       83 c0 08                add    $0x8,%eax
   6859e:       49 03 57 10             add    0x10(%r15),%rdx
   685a2:       41 89 07                mov    %eax,(%r15)
   685a5:       48 8b 0a                mov    (%rdx),%rcx
   685a8:       45 31 e4                xor    %r12d,%r12d
   685ab:       48 85 c9                test   %rcx,%rcx
...

There is no instruction start at 0x685a3!

Searching for machine code inside a binary

What if we simply look for the instructions as shown in the Code: message?

To this end, I hacked together a script that does the following:

  • Spawn a copy of objdump to disassemble the binary;
  • look for the instructions as passed on the command line;
  • display where the instructions are found.

The objdump-find-instructions.py script is collapsed here (see “details”):

details of objdump-find-instructions.py
#!/usr/bin/env python3
import re
import subprocess
import sys

# Look for these:
# >   19640c: 48 89 44 24 28        mov    %rax,0x28(%rsp)
# >   196411: 31 c0                 xor    %eax,%eax
# >   196413: 48 85 db              test   %rbx,%rbx
code_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s+'
    br'(?P<decoded>.*)')
code_without_decoded_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s*$')

# Look for these:
# > 000000000004ea40 <Perl_ck_concat@@Base>:
func_re = re.compile(br'^(?P<addr>[0-9a-f]+) <(?P<name>[^<>]*)>:')

# Look for blanks:
blank_re = re.compile(br'^\s*$')

# Lines to ignore:
ignore_re = re.compile(
    br'^/.*:\s+file format |^Disassembly of section ')

def to_bin(binstr_array):
    return bytes([int(i, 16) for i in binstr_array])

def to_hex(binarray):
    return ' '.join('{:02x}'.format(i) for i in binarray)

# Get executable/binary from argv
executable = sys.argv[1]  # /usr/lib/x86_64-linux-gnu/libperl.so.5.30

# Get needle from argv
needle = [i.encode() for i in sys.argv[2:]]  # ['48', '89', '44', '24', '28']
needle_len = len(needle)
assert needle_len >= 2, 'must specify XX XX XX bytes to search for'
needle_bin = to_bin(needle)
MAX_BUF = needle_len + 30

class Matcher:
    def search(self, haystack, regex):
        self.match = regex.search(haystack)
        if self.match:
            self.dict = self.match.groupdict()
        return self.match

    def get(self, key):
        return self.dict[key]

# Execute
proc = subprocess.Popen(
    ['/usr/bin/objdump', '-d', executable], stdout=subprocess.PIPE)

# Parse
code_bin = bytearray()
last_func = None
last_addr = None
matcher = Matcher()
for line in proc.stdout:
    line = line.rstrip()
    if matcher.search(line, blank_re):
        last_func = None
        last_addr = None
    elif matcher.search(line, func_re):
        last_func = matcher.get('name')
        last_addr = matcher.get('addr')
    elif (matcher.search(line, code_re) or
            matcher.search(line, code_without_decoded_re)):
        new_code_bin = to_bin(matcher.get('code').lstrip().split())
        code_bin.extend(new_code_bin)
        code_bin = code_bin[-MAX_BUF:]  # truncate early

        # This contains search on binary is pretty fast, compared to doing
        # sub-array comparisons.
        # real  0m9.873s  --> 0m4.000s
        # user  0m12.637s --> 0m6.624s
        if needle_bin in code_bin:
            print(
                last_addr.decode(), last_func.decode(),
                matcher.get('addr').decode(),
                matcher.get('decoded').decode(),
                to_hex(new_code_bin))
            # print('//', to_hex(code_bin))
            assert needle_len > 1
            code_bin = code_bin[(-needle_len + 1):]  # skip same results
    elif matcher.search(line, ignore_re):
        pass
    else:
        print('discarding', line)
        exit(2)

The script is invoked like this:

$ python3 objdump-find-instructions.py PATH_TO_BINARY INSTRUCTIONS...

We include all instructions up to and including the <4c> and invoke it like this:

$ python3 objdump-find-instructions.py \
    /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 \
    48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 \
    5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 \
    03 48 29 f8 48 89 c3 74 89 48 8b 02 4c

It spews out this one line:

00000000000b0500 Perl__invlist_intersection_maybe_complement_2nd@@Base
  b05a3 mov    0x10(%rax),%r13 4c 8b 68 10

That contains the following info:

  • The function Perl__invlist_intersection_maybe_complement_2nd@@Base starts at 00000000000b0500.
  • At 0xb05a3 there is a mov 0x10(%rax),%r13 instruction.
  • That instruction is 4c 8b 68 10 in machine code.

That instruction corresponds with the position in the Code: log line.

Code: [...  8b 02] <4c> 8b 68 10 [4d 85 ...]

This looks like a much better candidate than the Perl_vload_module we got from addr2line. The reading of 0x10(%rax) matches the second crash perfectly: if the %rax register is 0 — a common value — then this would produce a segfault at 10.

Getting the surrounding code from objdump:

$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --start-address=0xb0500
...
00000000000b0500 <Perl__invlist_intersection_maybe_complement_2nd@@Base>:
...
   b059e:       74 89                   je     b0529 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x29>
   b05a0:       48 8b 02                mov    (%rdx),%rax
   b05a3:       4c 8b 68 10             mov    0x10(%rax),%r13
   b05a7:       4d 85 ed                test   %r13,%r13
   b05aa:       0f 84 28 01 00 00       je     b06d8 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x1d8>
...

Offset 0x48000

I was confident that this is the right crash location. And because Perl did have a problem with the code in this vicinity, it was easy to file a lp2035339 bug report.

But I could not explain yet why the calculated offset of 0x685a3 is off. The difference between 0x685a3 and 0xb05a3 is 0x48000.

A bit of poking around the binary did turn up this:

$ objdump -p  /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0
...
Dynamic Section:
  NEEDED               libdl.so.2
  NEEDED               libm.so.6
  NEEDED               libpthread.so.0
  NEEDED               libc.so.6
  NEEDED               libcrypt.so.1
  SONAME               libperl.so.5.30
  INIT                 0x0000000000048000
  FINI                 0x00000000001ad6b4
...

The machine instructions reside between 0x48000 and 0x1ad6b4. That's where we got the extra 0x48000 we need.

So, next time we do an addr2line lookup of a library, we should check the INIT offset, and add that to calculated instruction pointer position.

Check with newer version

After upgrading both libperl and perl-debug on the test box, we could confirm that the latest crashes were caused by the same problem.

From “traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]” and the INIT offset of 0x48000 we get 0xb05a3 and from “nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]” we get 0xb05d9.

addr2line gives us:

$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 b05a3 b05d9
Perl__invlist_intersection_maybe_complement_2nd
invlist_inline.h:51
Perl__invlist_intersection_maybe_complement_2nd
regcomp.c:9841

Both in Perl__invlist_intersection_maybe_complement_2nd. Same problem.

general protection fault vs. segfault

Lastly, why did we get a “traps: ... general protection fault ... error:0” for one crash and “segfault at ... ip ... error 4” for the others?

I'm not entirely sure. As far as I can gather, this could be the difference between the segmentation violation happening while running in kernel mode versus running in user mode. The error code of 0 vs. 4 does indicate as much. (See “details” for a snippet from arch/x86/include/asm/trap_pf.h.)

details of error_code
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 ==                           1: protection keys block access
 *   bit 15 ==                          1: SGX MMU page-fault
 */
enum x86_pf_error_code {
        X86_PF_PROT     =               1 << 0,
        X86_PF_WRITE    =               1 << 1,
        X86_PF_USER     =               1 << 2,
        X86_PF_RSVD     =               1 << 3,
        X86_PF_INSTR    =               1 << 4,
        X86_PF_PK       =               1 << 5,
        X86_PF_SGX      =               1 << 15,
};

But maybe it has a different reason, like the specific memory location that was tried (we don't see it in this message). Let me know if you know!

2023-09-12 - qpress / qz1 extension

This is a quick note to my future self. As we're using qpress less in favor of lz4 that has been available on Ubuntu Focal and above, we're inclined to forget what the .qz1 extension means.

Wondering what files with the .qz1 extension do?

This concerns single stream qpress compressed files.

Historically qpress uses the .qp extension, but that concerns multifile archives. The qpress binary can write compressed streams to stdout, but it will not decompress them to stdout. When using streams of data, we often prefer not touching the disk in between. For that reason there are also qlzip1 and qlzcat1.

What does the 1 in .qz1 mean?

The qpress source also has other levels (2 and 3). They need to be set at compile time, so switching at will is not possible. More importantly, qlzip1 does not write a header that specifies which level we're dealing with. Trying to decompress a level 3 stream with a level 1 binary would fail.

Where are qpress, qlzip1 and qlzcat1?

Originally, the source code and a summary could be found at https://www.quicklz.com/. Today however, this page is suspended.

The applications one might use can be found at GitHub/ossobv/qpress-deb. See the Releases page for sources and precompiled binaries.

2023-09-01 - dirmngr / keyserver / disable-ipv6 / bionic

This morning a build pipeline failed. dirmngr called by apt-key tried to use IPv6, even though it was disabled.

The build logs had this to say:

21.40 + apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
21.50 Warning: apt-key output should not be parsed (stdout is not a terminal)
21.57 Executing: /tmp/apt-key-gpghome.KzTTOZjgZP/gpg.1.sh --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
21.65 gpg: keyserver receive failed: Cannot assign requested address

This was strange for a number of reasons:

  1. The same Docker build scripts succeeded for other Ubuntu releases. The problem only seemed to affect the bionic based build: on focal it was not broken.
  2. I could reproduce in a similar bionic Docker environment. But behaviour appeared to be random — maybe based on the order of the DNS responses:
    # apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
    Executing: /tmp/apt-key-gpghome.2XtcT1xWZb/gpg.1.sh --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
    gpg: keyserver receive failed: Cannot assign requested address
    
    # apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
    Executing: /tmp/apt-key-gpghome.TyIbtADNXD/gpg.1.sh --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xF1656F24C74CD1D8
    gpg: key F1656F24C74CD1D8: "MariaDB Signing Key <signing-key@mariadb.org>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
    
  3. Calling it with strace showed that it tried to connect to an IPv6 address:
    # strace -e 'signal=!all' -e trace=clone,connect,execve -s 32 -f \
        apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 \
          0xF1656F24C74CD1D8 2>&1 |
        grep -vE 'attached$|exited with|resumed>'
    ...
    [pid  9749] execve("/usr/bin/dirmngr", ["dirmngr", "--daemon", "--homedir", "/tmp/apt-key-gpghome.0zrNQAlNLG"], 0x7fff73d35cb8 /* 9 vars */ <unfinished ...>
    ...
    [pid  9749] clone(child_stack=NULL, flags=..., child_tidptr=...) = 9750
    ...
    [pid  9750] clone(child_stack=..., flags=..., parent_tidptr=..., tls=..., child_tidptr=...) = 9751
    ...
    [pid  9751] connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("1.1.1.1")}, 16) = 0
    [pid  9751] connect(6, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("1.1.1.1")}, 16) = 0
    [pid  9751] connect(7, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "2620:2d:4000:1007::70c", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
    gpg: keyserver receive failed: Cannot assign requested address
    ...
    
    That log reads as: apt-key spawns (something that spawns) dirmngr, which forks twice and then does two DNS lookups (via 1.1.1.1) and finally tries to connect to 2620:2d:4000:1007::70c.

    That is unexpected because...
  4. In this Docker environment there wasn't even any IPv6 at all:
    # ip -br a
    lo               UNKNOWN        127.0.0.1/8
    eth0@if153       UP             172.17.0.2/16
    

    No assigned IP address.

    # sysctl -a 2>/dev/null | grep disable_ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.eth0.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
    

    And in fact IPv6 is disabled on all interfaces through sysctl.

  5. If getaddrinfo() with AF_UNSPEC is used for the DNS lookup, it shouldn't even try to get an IPv6 address, because dns4only is enabled by default in this environment. With dns4only enabled requesting an unspecified address family (IPv4 or IPv6), only results in IPv4 responses.

    Without dns4only, you'd see this Python transcript:
    >>> from socket import *
    >>> set([i[4][0] for i in getaddrinfo('keyserver.ubuntu.com', 80)])
    {'2620:2d:4000:1007::d43', '185.125.188.27', '185.125.188.26', '2620:2d:4000:1007::70c'}
    
    But here, Python correctly yields only these:
    >>> from socket import *
    >>> set([i[4][0] for i in getaddrinfo('keyserver.ubuntu.com', 80)])
    {'185.125.188.27', '185.125.188.26'}
    
    This meant that gnupg dirmngr was doing something custom with DNS resolving, like manually looking up both IPv6 and IPv4.
  6. The offending dirmngr has version 2.2.4-1ubuntu1.6 and according to the changelog (1.3..1.5 diff) there was a fix for IPv6 in 2.2.4-1ubuntu1.4; but we were already using that.
  7. Interestingly, that bugfix from LP#1910432 does point to the means of checking connectivity. It checks whether socket(AF_INET6) succeeds instead of whether connect(ipv6_address) succeeds:
    $ cat debian/patches/dirmngr-handle-EAFNOSUPPORT-at-connect_server.patch
    ...
    @@ -2940,6 +2942,15 @@ connect_server (const char *server, unsi
               sock = my_sock_new_for_addr (ai->addr, ai->socktype, ai->protocol);
               if (sock == ASSUAN_INVALID_FD)
                 {
    +              if (errno == EAFNOSUPPORT)
    +                {
    +                  if (ai->family == AF_INET)
    +                    ignore_v4 = 1;
    +                  if (ai->family == AF_INET6)
    +                    ignore_v6 = 1;
    +                  continue;
    ...
    
    But the exposed Linux kernel interface (version 5.x) has no problem with creating an AF_INET6 socket:
    >>> from socket import *
    >>> s = socket(AF_INET6)  # no failure on this line
    >>> s.connect(('keyserver.ubuntu.com', 80))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OSError: [Errno 99] Cannot assign requested address
    
    It's first the connect() that fails — with a EADDRNOTAVAIL: Cannot assign requested address.
  8. However, the dirmngr code in the Ubuntu focal version doesn't seem that different. That one also does two DNS lookups too, but that one correctly selects AF_INET when connecting.

    The lookups as seen with strace -e trace=sendto:
    >>> print(':'.join('%02x' % (i,) for i in (
    ...     b"\317/\1\0\0\1\0\0\0\0\0\0\tkeyserver\6ubuntu\3com\0\0\1\0\1")))
    cf:2f:01:00:00:01:00:00:00:00:00:00:09:6b:65:79:73:65:72:76:65:72:06:75:62:75:6e:74:75:03:63:6f:6d:00:00:01:00:01
    >>> print(':'.join('%02x' % (i,) for i in (
    ...     b"\307\275\1\0\0\1\0\0\0\0\0\0\tkeyserver\6ubuntu\3com\0\0\34\0\1")))
    c7:bd:01:00:00:01:00:00:00:00:00:00:09:6b:65:79:73:65:72:76:65:72:06:75:62:75:6e:74:75:03:63:6f:6d:00:00:1c:00:01
    
    (Here the lookup with 00:1c is the the AAAA query.)

Really unexpected. And another example of old software sinking development time. I hope this writeup saves someone else a little time.

Workarounds?

For a while, I thought the dirmngr code might respect /etc/gai.conf and tried to add/enable precedence ::ffff:0:0/96 100 there. But that just proved to be intermittent (random) success.

A shittier, but working, workaround is hacking the /etc/hosts file with this oneliner in the Dockerfile:

{ test "$UBUNTU_RELEASE" = "bionic" && \
  getent ahosts keyserver.ubuntu.com | \
  sed -e '/STREAM/!d;s/[[:blank:]].*/ keyserver.ubuntu.com/' | \
   shuf >> /etc/hosts || true; }

Not pretty, but it works. And it only affects the (soon to be obsolete) bionic build.

An alternative solution is provided at usbarmory-debian-base_image#9:

{ mkdir -p ~/.gnupg && \
  echo "disable-ipv6" >> ~/.gnupg/dirmngr.conf && \
  apt-key adv --homedir ~/.gnupg --keyserver ...; }

2023-08-31 - mariadb / gdb / debugging shutdown deadlock / part 2

We were looking at a core dump of MariaDB instance that was in a deadlocked state. It was killed with SIGABRT. We now like to get MySQL thread info from it. Python support in gdb is here to help.

To recap: we were dissecting the core dump with gdb (gdb /usr/sbin/mariadbd core.208161), found out that it was waiting on one of the threads to stop before stopping completely. We now want to know what exactly it was waiting for.

It's too late to get a SHOW PROCESSLIST now that the process is dead. But we might be able to fetch the data from the server_threads structure:

(gdb) print ((THD*)server_threads->threads->first)->proc_info
$7 = 0x559aca9152dc "Freeing items"
(gdb) print ((THD*)server_threads->threads->first->next)->proc_info
$8 = 0x559aca922c02 "committing"
(gdb) print ((THD*)server_threads->threads->first->next->next)->proc_info
$9 = 0x559acaac848a "wsrep applier committed"

Threads

There aren't terribly many threads:

(gdb) info threads
  Id   Target Id                           Frame
* 1    Thread 0x7f70061c3840 (LWP 208161)  0x00007f700639c3db in kill
  2    Thread 0x7f635435b700 (LWP 208827)  futex_wait_cancelable
  3    Thread 0x7f5109852700 (LWP 208868)  0x00007f700643623f in __GI___clock_nanosleep
  4    Thread 0x7f4fe83de700 (LWP 279027)  futex_abstimed_wait_cancelable
  5    Thread 0x7f6ffe06c700 (LWP 208425)  futex_abstimed_wait_cancelable
  6    Thread 0x7f63543f1700 (LWP 208799)  0x00007f700639d1a2 in __GI___sigtimedwait
  7    Thread 0x7f63553fd700 (LWP 208750)  futex_wait_cancelable
  8    Thread 0x7f6357fff700 (LWP 208749)  futex_abstimed_wait_cancelable
  9    Thread 0x7f6ffdc17700 (LWP 208426)  futex_wait_cancelable
  10   Thread 0x7f6ffc492700 (LWP 208430)  futex_wait_cancelable
  11   Thread 0x7f6ffc09f700 (LWP 208790)  futex_wait_cancelable
  12   Thread 0x7f63543a6700 (LWP 208826)  futex_wait_cancelable
  13   Thread 0x7f63540b8700 (LWP 208836)  futex_wait_cancelable
  14   Thread 0x7f6ffc054700 (LWP 208791)  futex_wait_cancelable
  15   Thread 0x7f510922b700 (LWP 208889)  futex_wait_cancelable
  16   Thread 0x7f6340436700 (LWP 208841)  futex_wait_cancelable
  17   Thread 0x7f4f7d31f700 (LWP 1633036) futex_abstimed_wait_cancelable
  18   Thread 0x7f6ffcc93700 (LWP 1642065) futex_abstimed_wait_cancelable

But these threads do not show what it's doing. A thread apply all bt will have info, but is very verbose and not necessarily helpful.

Let's first pursue the server_threads. But not by manually adding ->next for each thread. There must be a better way.

Enter Python

gdb comes with Python integration, and it's straight forward enough to call: python, then the code, then end with end or EOF (press ^D).

(gdb) python
>print(gdb.parse_and_eval('server_threads->threads->first'))
>end
0x7f5080046088

Use the following Python script:

thd_type = gdb.lookup_type('THD').pointer()
it = gdb.parse_and_eval('server_threads->threads->first')
last = gdb.parse_and_eval('server_threads->threads->last')
while it != last.address:
    thdp = it.cast(thd_type)
    print(it, thdp['proc_info'], thdp['query_string']['string']['str'])
    it = it['next']

Invoking that looks like this:

(gdb) python
>thd_type = gdb.lookup_type('THD').pointer()
>it = gdb.parse_and_eval('server_threads->threads->first')
>last = gdb.parse_and_eval('server_threads->threads->last')
>while it != last.address:
>    thdp = it.cast(thd_type)
>    print(it, thdp['m_thread'], thdp['proc_info'], thdp['query_string']['string']['str'])
>    it = it['next']
>end
0x7f5080046088 0x559aca9152dc "Freeing items" 0x7f5080055c10 "COMMIT"
0x7f4fc4000c58 0x559aca922c02 "committing" 0x0
0x7f5030000c58 0x559acaac848a "wsrep applier committed" 0x0
0x7f635c002228 0x559acaac848a "wsrep applier committed" 0x0
0x7f5098000c58 0x559aca922c02 "committing" 0x0
0x7f50b8000c58 0x559aca922c02 "committing" 0x0
0x7f50bc000c58 0x559aca922c02 "committing" 0x0
0x7f6360000c58 0x559acaacd584 "wsrep aborter idle" 0x0

No mapping to the info threads yet. But after some digging in various (sub)structures, I found thread in the lock_info struct:

(gdb) x/((THD*)server_threads->threads->first)->lock_info.thread
0x7f4fe83de700: -398596352

The improved code:

thd_type = gdb.lookup_type('THD').pointer()
it = gdb.parse_and_eval('server_threads->threads->first')
last = gdb.parse_and_eval('server_threads->threads->last')
while it != last.address:
    thdp = it.cast(thd_type)
    print(
        it, 'thread', hex(thdp['lock_info']['thread']), thdp['killed'],
        thdp['proc_info'], thdp['query_string']['string']['str'])
    it = it['next']
0x7f5080046088 thread 0x7f4fe83de700 0x559aca9152dc "Freeing items" 0x7f5080055c10 "COMMIT"
0x7f4fc4000c58 thread 0x7f510922b700 0x559aca922c02 "committing" 0x0
0x7f5030000c58 thread 0x7f5109852700 0x559acaac848a "wsrep applier committed" 0x0
0x7f635c002228 thread 0x7f6340436700 0x559acaac848a "wsrep applier committed" 0x0
0x7f5098000c58 thread 0x7f63540b8700 0x559aca922c02 "committing" 0x0
0x7f50b8000c58 thread 0x7f635435b700 0x559aca922c02 "committing" 0x0
0x7f50bc000c58 thread 0x7f63543a6700 0x559aca922c02 "committing" 0x0
0x7f6360000c58 thread 0x7f6ffc492700 0x559acaacd584 "wsrep aborter idle" 0x0

Yes. These thread_id can be mapped to the list above (manully annotated and sorted):

0x7f50b8000c58 thread [ 2] 0x559aca922c02 "committing" 0x0
0x7f5030000c58 thread [ 3] 0x559acaac848a "wsrep applier committed" 0x0
0x7f5080046088 thread [ 4] 0x559aca9152dc "Freeing items" 0x7f5080055c10 "COMMIT"
0x7f6360000c58 thread [10] 0x559acaacd584 "wsrep aborter idle" 0x0
0x7f50bc000c58 thread [12] 0x559aca922c02 "committing" 0x0
0x7f5098000c58 thread [13] 0x559aca922c02 "committing" 0x0
0x7f4fc4000c58 thread [15] 0x559aca922c02 "committing" 0x0
0x7f635c002228 thread [16] 0x559acaac848a "wsrep applier committed" 0x0

Deadlock?

Back to the task at hand. Now that we found something akin to SHOW PROCESSLIST, it looks like Thread 4 could use some closer inspection.

(gdb) thread 4
[Switching to thread 4 (Thread 0x7f4fe83de700 (LWP 279027))]
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f4fe83dcd80, clockid=<optimized out>, expected=0, futex_word=0x7f50080b05c8) at ../sysdeps/nptl/futex-internal.h:320
320 ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f4fe83dcd80, clockid=<optimized out>, expected=0, futex_word=0x7f50080b05c8) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f4fe83dcd80, clockid=<optimized out>, mutex=0x559acbbc4dd8, cond=0x7f50080b05a0) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x7f50080b05a0, mutex=mutex@entry=0x559acbbc4dd8, abstime=0x7f4fe83dcd80) at pthread_cond_wait.c:665
#3  0x00007f6ffde2e673 in gu_cond_timedwait_SYS (ts=0x7f4fe83dcd80, mutex=0x559acbbc4dd8, cond=<optimized out>) at ./galerautils/src/gu_threads.h:264
#4  gu::Lock::wait (this=<synthetic pointer>, date=<synthetic pointer>..., cond=...) at ./galerautils/src/gu_lock.hpp:64
#5  galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::wait (wait_until=<synthetic pointer>..., gtid=..., this=0x559acbbc4dd8) at ./galera/src/monitor.hpp:399
#6  galera::ReplicatorSMM::sync_wait (this=0x559acbbc3b30, upto=<optimized out>, tout=<optimized out>, gtid=0x0) at ./galera/src/replicator_smm.cpp:1605
#7  0x00007f6ffde0b514 in galera_sync_wait (wsrep=<optimized out>, upto=<optimized out>, tout=<optimized out>, gtid=<optimized out>) at ./galera/src/wsrep_provider.cpp:503
#8  0x0000559aca90406e in wsrep::wsrep_provider_v26::wait_for_gtid (this=<optimized out>, gtid=..., timeout=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:1056
#9  0x0000559aca6049e6 in Wsrep_server_service::set_position (this=<optimized out>, c=..., gtid=...) at ./wsrep-lib/include/wsrep/server_state.hpp:320
#10 0x0000559aca8f9753 in wsrep::transaction::release_commit_order (this=0x7f508004c5c8, lock=...) at ./wsrep-lib/include/wsrep/provider.hpp:142
#11 0x0000559aca901c6f in wsrep::transaction::after_statement (this=this@entry=0x7f508004c5c8) at ./wsrep-lib/src/transaction.cpp:895
#12 0x0000559aca8e7cb2 in wsrep::client_state::after_statement (this=this@entry=0x7f508004c560) at ./wsrep-lib/src/client_state.cpp:287
#13 0x0000559aca0d9a08 in wsrep_after_statement (thd=0x7f5080046088) at ./sql/sql_class.h:5454
#14 wsrep_after_statement (thd=0x7f5080046088) at ./sql/wsrep_trans_observer.h:443
#15 wsrep_mysql_parse (thd=0x7f5080046088, rawbuf=0x7f5080055c10 "COMMIT", length=6, parser_state=0x7f4fe83dd3e0) at ./sql/sql_parse.cc:7878
#16 0x0000559aca0e7fee in dispatch_command (command=COM_QUERY, thd=0x7f5080046088, packet=<optimized out>, packet_length=<optimized out>, blocking=<optimized out>) at ./sql/sql_class.h:1388
#17 0x0000559aca0e899e in do_command (thd=0x7f5080046088, blocking=blocking@entry=true) at ./sql/sql_parse.cc:1409
#18 0x0000559aca1fd617 in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at ./sql/sql_connect.cc:1416
...

Unfortunately Thread 4 is not waiting for a lock that we can trace somewhere:

(gdb) ptype pthread_mutex_t
type = union {
    struct __pthread_mutex_s __data;
    char __size[40];
    long __align;
}
(gdb) print *(pthread_mutex_t*)0x559acbbc4dd8
$10 = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}
(gdb) print *(__pthread_mutex_s*)0x559acbbc4dd8
$11 = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}

No owner. And the pthread_cont_t argument doesn't help us either. Apparently this thread is waiting for someone to notify cond=0x7f50080b05a0.

A thread apply all bt full does not reveal 0x7f50080b05a0 being in stack scope in the vicinity. It appears that it is waiting in this bit:

wsrep_status_t galera::ReplicatorSMM::sync_wait(wsrep_gtid_t* upto,
                                                int           tout,
                                                wsrep_gtid_t* gtid)
{
    gu::GTID wait_gtid;
// ...
        // Note: Since wsrep API 26 application may request release of
        // commit monitor before the commit actually happens (commit
        // may have been ordered/queued on application side for later
        // processing). Therefore we now rely on apply_monitor on sync
        // wait. This is sufficient since apply_monitor is always released
        // only after the whole transaction is over.
        apply_monitor_.wait(wait_gtid, wait_until); // <-- HERE
        void wait(gu::GTID& gtid, const gu::datetime::Date& wait_until)
        {
            gu::Lock lock(mutex_);
            if (gtid.uuid() != uuid_)
            {
                throw gu::NotFound();
            }
            while (last_left_ < gtid.seqno())
            {
                size_t idx(indexof(gtid.seqno()));
                lock.wait(process_[idx].wait_cond_, wait_until);
            }
        }

Debug info here is scarce:

(gdb) up 5
#5  galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::wait (wait_until=<synthetic pointer>..., gtid=..., this=0x559acbbc4dd8) at ./galera/src/monitor.hpp:399
(gdb) info args
wait_until = <synthetic pointer>: <optimized out>
gtid = @0x7f4fe83dcd90: {static SEQNO_UNDEFINED = -1, uuid_ = {<gu::UUID_base> = {uuid_ = {data = "s|\344\\\352\370\021\354\217G\a\177S0\226r", alignment = 17010650953019587699}}, <No data fields>}, seqno_ = 223595135}
this = 0x559acbbc4dd8
(gdb) info locals
idx = <optimized out>
cond = <optimized out>
lock = {_vptr.Lock = <optimized out>, mtx_ = @0x559acbbc4dd8}
lock = <optimized out>
idx = <optimized out>
cond = <optimized out>

idx(indexof(gtid.seqno())) can be rewritten as idx = (gtid.seqno_ & 65535), so the lock/wait is happening on the process_ with index 51839:

(gdb) print this->process_[51839]
$12 = {obj_ = 0x7f6354359f70, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (use count 2, weak count 0) = {get() = 0x7f50080b05a0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}

Scanning over all 65536 process_ entries — most are empty, like $13 — yields:

(gdb) print this->process_[51836]
$13 = {obj_ = 0x0, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_IDLE}
(gdb) print this->process_[51837]
$14 = {obj_ = 0x7f5109229f70, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}
(gdb) print this->process_[51838]
$15 = {obj_ = 0x7f63540b6f70, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}
(gdb) print this->process_[51839]
$16 = {obj_ = 0x7f6354359f70, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (use count 2, weak count 0) = {get() = 0x7f50080b05a0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}
(gdb) print this->process_[51840]
$17 = {obj_ = 0x7f4fe83dce90, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}
(gdb) print this->process_[51841]
$18 = {obj_ = 0x7f63543a4f70, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_APPLYING}
(gdb) print this->process_[51842]
$19 = {obj_ = 0x0, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_FINISHED}
(gdb) print this->process_[51843]
$20 = {obj_ = 0x0, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_FINISHED}
(gdb) print this->process_[51844]
$21 = {obj_ = 0x0, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (use count 1, weak count 0) = {get() = 0x7f4fb008f520}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_FINISHED}
(gdb) print this->process_[51845]
$22 = {obj_ = 0x0, cond_ = 0x0, wait_cond_ = std::shared_ptr<gu::Cond> (empty) = {get() = 0x0}, state_ = galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::Process::S_IDLE}

That 0x7f50080b05a0 is indeed the pthread_cont_t that we're waiting for.

Conclusion

At this point I'm stuck. That last piece of info only confirms what we already know. No obvious reason stands out why it is waiting for something that is not happening.

Searching through the MariaDB JIRA bug tracker did not yield any candidate issues. Luckily this deadlock happened two months ago, and we haven't seen it since. But if we do see it again, we'll at least have this post as a starting point.

2023-08-30 - mariadb / gdb / debugging shutdown deadlock / part 1

I was asked to look into a MariaDB deadlock during shutdown. We had a core dump (luckily). Now it's time to dissect it.

Ensure you can get core dumps

For starters, you want a nice core dump whenever you hit an issue. We do. Here are some tips to ensure you do too.

You need to have a couple of parameters set correctly: the equivalent of the systemd LimitCORE=infinity and the assurance that the working directory is writable (maybe WorkingDirectory=/var/lib/mysql). And a non-empty kernel.core_pattern (e.g. core.%p).

If MariaDB crashes, you get a core dump immediately. If it deadlocks, you can kill it with SIGABRT to create one.

(Once you get a dump, it's nice to salvage it by moving it to less volatile storage. This is especially important in a containerized environment where the /var/lib/mysql can get flushed by a container restart.)

Getting the symbols

Apart from gdb, the debugger, you'll also need the debug symbols. For a Debian/Ubuntu install, you'll want to add these to your sources.list:

deb http://ddebs.ubuntu.com jammy main restricted universe multiverse
deb http://ddebs.ubuntu.com jammy-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com jammy-proposed main restricted universe multiverse

Or, if you're running packages from MariaDB directly:

deb [arch=amd64,arm64,ppc64el] http://archive.mariadb.org/mariadb-10.6.13/repo/ubuntu/ focal main
deb [arch=amd64,arm64,ppc64el] http://archive.mariadb.org/mariadb-10.6.13/repo/ubuntu/ focal main/debug

The main/debug component holds the dbgsym packages. This is also the part where you ensure that you have the same MariaDB version (and distro) as where the crash occurred. If you have different versions, the core dump will make no sense at all.

Fetch at least these:

mariadb-server-10.6-dbgsym
mariadb-server-core-10.6-dbgsym

And confirm that they belong to the same binary version. It's also nice to have libc6-dbg.

Firing up gdb

We have gdb. We have symbols. We have a core dump. Fire it up:

$ gdb /usr/sbin/mariadbd core.208161
...
Core was generated by `/usr/sbin/mysqld --defaults-file=/etc/mysql/my.cnf --wsrep_start_position=73'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f700639c3db in kill () at ../sysdeps/unix/syscall-template.S:78
78  ../sysdeps/unix/syscall-template.S: No such file or directory.
[Current thread is 1 (Thread 0x7f70061c3840 (LWP 208161))]
(gdb)

At this point, we can do the usual stuff: bt, bt full, info threads, thread apply all bt full, info regs and so on.

In this case, we were stuck in shutdown. This is confirmed by the first thread we look at:

(gdb) bt
#0  0x00007f700639c3db in kill () at ../sysdeps/unix/syscall-template.S:78
#1  0x0000559aca322bae in handle_fatal_signal (sig=11) at ./sql/signal_handler.cc:372
#2  <signal handler called>
#3  0x00007f700643623f in __GI___clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#4  0x00007f700643bec7 in __GI___nanosleep () at nanosleep.c:27
#5  0x00007f700643bdfe in __sleep (seconds=0, seconds@entry=1) at ../sysdeps/posix/sleep.c:55
#6  0x0000559aca5ea13a in wsrep_close_client_connections (wait_to_end=<optimized out>, except_caller_thd=0x0) at ./sql/wsrep_mysqld.cc:3253
#7  0x0000559aca5ea83a in wsrep_shutdown_replication () at ./sql/wsrep_mysqld.cc:1150
#8  0x0000559aca00f7dd in mysqld_main (argc=<optimized out>, argv=<optimized out>) at ./sql/mysqld.cc:5899
#9  0x00007f700637d083 in __libc_start_main (main=0x559ac9fbecb0 <main(int, char**)>, argc=3, argv=0x7ffdd3be23c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdd3be23b8) at ../csu/libc-start.c:308
#10 0x0000559aca00300e in _start () at ./include/mysql/plugin.h:215

The code appeared to be busy in a sleep(0).

It's best to have the MariaDB 10.6.13 source code checked out somewhere. Git clone the entire repository, check out the correct tag, and get those git submodules. There you can now find ./sql/wsrep_mysqld.cc:3253, which lists:

void wsrep_close_client_connections(my_bool wait_to_end, THD* except_caller_thd)
{
// ...
  while (wait_to_end && server_threads.iterate(have_client_connections))
  {
    sleep(1);
    DBUG_PRINT("quit",("One thread died (count=%u)", THD_count::value()));
  }

Yes, it is in fact sleep(1) not 0 — and now that you mention it, it did say seconds@entry=1:

(gdb) up 6
#6  0x0000559aca5ea13a in wsrep_close_client_connections () at ./sql/wsrep_mysqld.cc:3253
3253  ./sql/wsrep_mysqld.cc: No such file or directory.
(gdb) disass
Dump of assembler code for function wsrep_close_client_connections(char, THD*):
Address range 0x559aca5e9dc0 to 0x559aca5ea3fb:
...
   0x0000559aca5ea12b <+875>: callq  0x559ac9f54150 <pthread_rwlock_unlock@plt>
   0x0000559aca5ea130 <+880>: mov    $0x1,%edi
   0x0000559aca5ea135 <+885>: callq  0x559ac9f54080 <sleep@plt>
=> 0x0000559aca5ea13a <+890>: jmpq   0x559aca5ea080 <wsrep_close_client_connections(char, THD*)+704>
   0x0000559aca5ea13f <+895>: nop
   0x0000559aca5ea140 <+896>: cmpq   $0x0,(%r14)
   0x0000559aca5ea144 <+900>: jne    0x559aca5ea240 <wsrep_close_client_connections(char, THD*)+1152>

(That mov $0x1,%edi puts 1 into the first argument — RDI, RSI, RDX, RCX, R8, R9 — and then sleep is called. See the System V amd64 calling convention.)

Inspecting server_threads.iterate

So, wsrep_close_client_connections is looping until server_threads.iterate(have_client_connections) returns zero.

static my_bool have_client_connections(THD *thd, void*)
{
  DBUG_PRINT("quit",("Informing thread %lld that it's time to die",
                     (longlong) thd->thread_id));
  if (is_client_connection(thd) &&
      (thd->killed == KILL_CONNECTION ||
       thd->killed == KILL_CONNECTION_HARD))
  {
    (void)abort_replicated(thd);
    return 1;
  }
  return 0;
}

Assuming that server_threads.iterate iterates over the server threads, we can conclude that one or more threads are (a) a client connection and (b) in some kind of "KILL" state and (c) receiving an abort_replicated(thd) call.

(gdb) ptype server_threads
type = class THD_list : public THD_list_iterator {
  public:
    void init(void);
    void destroy(void);
    void insert(THD *);
    void erase(THD *);
}

server_threads is a global. It inherits from the THD_list_iterator. If we look there, we see this:

/** THD registry */
class THD_list_iterator
{
// ...
  template <typename T> int iterate(my_bool (*action)(THD *thd, T *arg), T *arg= 0)
  {
    int res= 0;
    mysql_rwlock_rdlock(&lock);
    I_List_iterator<THD> it(threads);
    while (auto tmp= it++)
      if ((res= action(tmp, arg)))
        break;
    mysql_rwlock_unlock(&lock);
    return res;
  }

So, server_threads.iterate iterates over all threads and stops as soon as it finds one with the previously mentioned characteristics, returning the (boolean) value of the callback.

Figuring out which threads were doing what

We want to know what those unstopped threads were doing. If we could get a SHOW PROCESSLIST from the dead process, that'd be perfect. Maybe we can iterate over the server_threads and get the info.

(gdb) print server_threads->threads
$1 = {
  <base_ilist> = {
    first = 0x7f5080046088,
    last = {_vptr.ilink = 0x559acaf1b670 <vtable for ilink+16>, prev = 0x7f6360000c68, next = 0x0}},
  <No data fields>
}
(gdb) print server_threads.threads.first
$2 = (ilink *) 0x7f5080046088
(gdb) print *server_threads.threads.first
$3 = {_vptr.ilink = 0x559acaf21550 <vtable for THD+16>, prev = 0x559acb121b00 <server_threads>, next = 0x7f4fc4000c58}

This is a linked list structure, wrapped in a bit of C++ sauce.

We can in fact get the elements out of it fairly easily:

(gdb) print *(THD*)server_threads.threads.first
$5 = {
  <THD_count> = {
    static count = {
      m_counter = {
...
(gdb) print ((THD*)server_threads.threads.first)->proc_info
$5 = 0x559aca9152dc "Freeing items"

And we can continue with the next items:

(gdb) print ((THD*)server_threads.threads.first->next)->proc_info
$6 = 0x559aca922c02 "committing"

That's nice. But tedious. We don't want to do this for every thread — there could be many.

Luckily gdb has builtin Python support. And that's what we'll be looking at in the next post...

2023-08-24 - mch2022 badge / arduino ide / part 3 / leds and more

The colored LCD display on my living room HUD has reached its end-of-life. Time to replace it with an MCH2022 badge?

In the first two parts, we looked at interfacing with the serial output, reading the input button state and using the TFT display on the MCH2022 badge. In this post, we'll look at interfacing with the RP2040 where other components are hooked up to.

Interfacing with the RP2040

As seen with the buttons already, not everything on the badge is controlled by the ESP32 directly. The display is connected to an SPI bus. A Raspberry PI2040 (RP2040) is connected to an I2C bus. On that RP2040 various functions are implemented.

We we look at esp32-component-mch2022-rp2040/include/rp2040.h (or the implementation at mch2022-firmware-rp2040/i2c_peripheral.h) we can see that there are lots of other registers we can read from and write to. For instance, there is RP2040_REG_LCD_BACKLIGHT (the 0x04 element in the enum).

Step one is configuring the I2C bus:

#include <Wire.h>

#define GPIO_I2C_SDA 22
#define GPIO_I2C_SCL 21

Wire.begin(GPIO_I2C_SDA, GPIO_I2C_SCL);

And then, we can...

Set LCD backlight brightness

We can set the LCD backlight brightness:

#define RP2040_REG_LCD_BACKLIGHT 0x04

/**
 * Backlight brightness can be set from 0 (off) to 255.
 */
void set_backlight(uint8_t brightness) {
    Wire.beginTransmission(RP2040_ADDR);
    Wire.write(RP2040_REG_LCD_BACKLIGHT);
    Wire.write(brightness);
    Wire.endTransmission();
}

Read button presses

Check the button state — we did this already:

#define RP2040_REG_INPUT1 0x06

uint32_t read_inputs() {
  Wire.beginTransmission(RP2040_ADDR);
  Wire.write(RP2040_REG_INPUT1);
  Wire.endTransmission();
  Wire.requestFrom(RP2040_ADDR, 4);
  uint8_t values[4];
  values[0] = Wire.read(); // RP2040_REG_INPUT1 / I2C_REGISTER_INPUT1
  values[1] = Wire.read(); // RP2040_REG_INPUT2 / I2C_REGISTER_INPUT2
  values[2] = Wire.read(); // RP2040_REG_INTERRUPT1 / I2C_REGISTER_INTERRUPT1
  values[3] = Wire.read(); // RP2040_REG_INTERRUPT2 / I2C_REGISTER_INTERRUPT2
  // ...
}

Send infrared (transmitter on the back)

On the back of the badge there is an IR-transmitter. We can have the RP2040 send remote control signals and use the badge as a remote for the TV.

#define RP2040_REG_IR_ADDRESS_LO 0x60

/**
 * Supports extended NEC (NECext/NECX) remote control codes. See:
 * https://en.wikipedia.org/wiki/Consumer_IR
 * https://github.com/flipperdevices/flipperzero-firmware/blob/
 *   52b59662627254ae5ccde6201b5c4921573fc769/documentation/file_formats/InfraredFileFormats.md
 */
void send_ir_nec(uint16_t address, uint16_t command) {
  Wire.beginTransmission(RP2040_ADDR);
  Wire.write(RP2040_REG_IR_ADDRESS_LO);
  Wire.write(address & 0xFF);
  Wire.write(address >> 8);
  Wire.write(command);
  Wire.write(0x01); /* trigger */
  Wire.endTransmission();
}

We can map the joystick keys to useful functions:

void loop() {
  if (!digitalRead(GPIO_INT_RP2040)) {
    uint32_t input = read_inputs();
    if (BUTTON_RELEASED(input, BTN_HOME)) {
      return_to_launcher();
    /**
     * See these LG codes:
     * https://github.com/Lucaslhm/Flipper-IRDB/blob/48f26b13715547d579a4f17a3fc0983a11d36776/TVs/LG/LG.ir
     * > name: Power
     * > type: parsed
     * > protocol: NEC
     * > address: 04 00 00 00
     * > command: 08 00 00 00
     */
    } else if (BUTTON_PRESSED(input, BTN_START)) {
      send_ir_nec(0x04, 0x08); // LG and others: power (on/off)
    } else if (BUTTON_PRESSED(input, BTN_LEFT)) {
      send_ir_nec(0x04, 0x01); // LG and others: chan-down
    } else if (BUTTON_PRESSED(input, BTN_RIGHT)) {
      send_ir_nec(0x04, 0x00); // LG and others: chan-up
    } else if (BUTTON_PRESSED(input, BTN_UP)) {
      send_ir_nec(0x04, 0x02); // LG and others: vol-up
    } else if (BUTTON_PRESSED(input, BTN_DOWN)) {
      send_ir_nec(0x04, 0x03); // LG and others: vol-down
    } else if (BUTTON_PRESSED(input, BTN_JOY)) {
      send_ir_nec(0x04, 0x09); // LG and others: mute (on/off)
    } else {
      Serial.printf("input: 0x%08x // 0x<button><state>\r\n", input);
    }
  }
  // ...
}

Unfortunately it is currently limited to sending (extended) NEC codes. If we want to send different protocols, like the Philips RC-5 protocol, we'll need to add such functionality to the RP2040 firmware.

Get battery state

#define RP2040_REG_ADC_VALUE_VBAT_LO 0x0d
#define RP2040_REG_CHARGING_STATE 0x15

/**
 * Get charging state.
 */
void get_battery_state(bool& charging, float& voltage) {
  const float conversion_factor = 3.3f / (1 << 12); /* 12-bit ADC with 3.3v vref */
  uint16_t raw_voltage;
  Wire.beginTransmission(RP2040_ADDR);
  Wire.write(RP2040_REG_ADC_VALUE_VBAT_LO);
  Wire.endTransmission();
  Wire.requestFrom(RP2040_ADDR, 2);
  raw_voltage = (Wire.read() | (Wire.read() << 8));
  voltage = raw_voltage * conversion_factor * 2; /* Connected through 100k/100k divider */

  Wire.beginTransmission(RP2040_ADDR);
  Wire.write(RP2040_REG_CHARGING_STATE);
  Wire.endTransmission();
  Wire.requestFrom(RP2040_ADDR, 2);
  charging = (bool)Wire.read();
  raw_voltage = (Wire.read() | (Wire.read() << 8));
}

Showing this on the screen:

void update_display() {
  // ...
  canvas->setCursor(0, 0);
  canvas->setTextColor(ILI9341_WHITE);
  canvas->setTextSize(1);
  canvas->print(voltage);
  canvas->print(" (V)");
  if (charging) {
    canvas->print(" charging");
  }
  // ...
}

Control the kite LEDs

There are 5 LEDs in a kite layout on the top left of the badge. The Badge.team documentation luckily already mentioned that it has to be activated after initializing the display — hence init_leds_after_display(). But we'll need to actually get them to update their state. This is done by calling FastLED.show().

Here are the parts you need:

#include <FastLED.h> /* library to interface with SK9822 (and other) leds */

#define GPIO_LED_DATA 5
#define GPIO_SD_PWR 19
#define GPIO_LED_ENABLE GPIO_SD_PWR /* enables power to both SD and LEDs */
#define NUM_LEDS 5

/* LED configuration of the kite on the top left:
 *   _2_                       _0_
 *  / | \                     / | \
 * 0--1--3  == reordered =>  1--2--3
 *  \ | /                     \ | /
 *   \|/                       \|/
 *    4                         4 */
CHSV led_color[NUM_LEDS];
CRGB leds[NUM_LEDS];
const int led_reorder[NUM_LEDS] = {2, 0, 1, 3, 4}; /* left-to-right */
void setup() {
  // ...
  init_leds_after_display();

  /* Initialize the 5 LED colors at a slight offset, so we get a wave. */
  for (int i = 0; i < NUM_LEDS; ++i) {
    led_color[led_reorder[NUM_LEDS - i - 1]].hue = ((i * 8) % 256);
    led_color[i].saturation = 255;
    led_color[i].value = 255;
  }
}

void loop() {
  // ...
  change_leds();
}
void init_leds_after_display() {
  FastLED.addLeds<SK6812, GPIO_LED_DATA, GRB>(leds, NUM_LEDS).setCorrection(TypicalLEDStrip);
  FastLED.setBrightness(96);

  /* This has to be placed after SPI (LCD) has been initialized. The
   * Arduino library wants to use this pin as SPI MISO... */
  pinMode(GPIO_LED_ENABLE, OUTPUT);
  digitalWrite(GPIO_LED_ENABLE, HIGH);
}
/**
 * Update the kite LEDs. Loop over the colors of the rainbow.
 */
void change_leds() {
  /* This would set the first LED to white. */
  //leds[0] = CRGB::White;

  /* This merely sets all of the LEDs local storage (leds[]) to the new value. */
  for (int i = 0; i < NUM_LEDS; ++i) {
    led_color[led_reorder[i]].hue = ((led_color[led_reorder[i]].hue + 1) % 256);
    hsv2rgb_rainbow(led_color[i], leds[i]);
  }

  /* This uses the leds[] values and notifies the SK6812 of the new colors. */
  FastLED.show();
}

Compiling with FastLED.h will show a warning:

No hardware SPI pins defined.  All SPI access will default to bitbanged output

That is a fact, and it is no problem. It just means that changing LED colors will take more than a few microseconds (but not milliseconds). Luckily there are only a few of them.

To the right of the display, there is actually also a sixth LED. It's a 5050 RGB. It's not controlled by the ESP32 or the RP2040, but it's connected to the ICE40 FPGA through separate R, G and B pins.

I don't have the know-how to explain what an FPGA even is, nor do I know how to interface with it yet. Getting that sixth LED to work will be for a future post, if ever.

Putting it all together

Here's a downloadable version of the hello-display-plus.ino sketch that contains all of the above code. If you've read parts 1 and 2 of this blog series, you should now know how to get it to work.

2023-08-23 - mch2022 badge / arduino ide / part 2 / display

The colored LCD display on my living room HUD has reached its end-of-life. Time to replace it with an MCH2022 badge?

In the previous blog post, we explored how to upload and start a simple "hello world" sketch programmed in the Arduino IDE to the MCH2022 badge. In this one, we look at writing to the 320x240 display.

As mentioned in the previous post, the documentation for building MCH2022 badge apps with the Arduino IDE is incomplete. Notably the mentioned LCD_FREQ is undefined, and more importantly the steps to select video output through GPIO_LCD_MODE is missing.

Here are the changes in relation to the hello-serial sketch from the previous post. I also added a screen buffer (canvas), which you'll likely want to employ.

--- hello-serial/hello-serial.ino
+++ hello-display/hello-display.ino
@@ -6,6 +6,9 @@
 /* Needed for *_inputs() */
 #include <Wire.h>
 
+/* Needed for *_display() */
+#include <Adafruit_ILI9341.h> /* Adafruit ILI9341 library */
+
 /* Workaround to skip esp_ota_get_running_partition() */
 extern "C" bool verifyRollbackLater() {
   return true;
@@ -18,9 +21,14 @@ extern "C" bool verifyRollbackLater() {
 #define GPIO_I2C_SCL 21
 #define GPIO_INT_RP2040 34 /* active LOW */
 
+#define GPIO_SPI_CS_LCD 32
+#define GPIO_SPI_DC_LCD 33
+#define GPIO_LCD_RESET 25
+#define GPIO_LCD_MODE 26
+
 /* I2C addresses */
 #define RP2040_ADDR 0x17 /* RP2040 co-processor */
 
 /* See RP register addresses here:
  *   https://github.com/badgeteam/esp32-component-mch2022-rp2040/
  *     blob/45fce00c790a2918bf506dd8af0c0b68db05d031/include/rp2040.h
@@ -53,13 +61,19 @@ extern "C" bool verifyRollbackLater() {
 
 void init_inputs();
 uint32_t read_inputs();
+void init_display();
+void update_display();
 void return_to_launcher();
 
+Adafruit_ILI9341 tft = Adafruit_ILI9341(GPIO_SPI_CS_LCD, GPIO_SPI_DC_LCD, GPIO_LCD_RESET);
+GFXcanvas16* canvas; /* uses malloc(); cannot be initialized statically */
+
 void setup() {
   Serial.begin(115200);
   Serial.println("hello MCH2022");
 
   init_inputs();
+  init_display();
 }
 
 void loop() {
@@ -71,6 +85,10 @@ void loop() {
       Serial.printf("input: 0x%08x // 0x<button><state>\r\n", input);
     }
   }
+
+  //unsigned long t0 = millis();
+  update_display(); // takes about 60ms (= 17Hz)
+  //Serial.printf("updating display took %lu ms\r\n", millis() - t0);
 }
 
 void init_inputs() {

Aside from that diff, the definitions of init_display and update_display:

void init_display() {
  pinMode(GPIO_LCD_MODE, OUTPUT);
  digitalWrite(GPIO_LCD_MODE, LOW); /* set "ESP32" output instead of "FPGA" */

  tft.begin();
  tft.fillScreen(ILI9341_BLACK);
  tft.setRotation(1);

  /* Using a buffer is slightly slower than writing directly. But it's
   * significantly easier to avoid flicker. */
  canvas = new GFXcanvas16(ILI9341_TFTHEIGHT, ILI9341_TFTWIDTH); /* 16-bit, 320x240 */
  canvas->setRotation(0);
}

void update_display() {
  static int x = 0;
  static int y = 0;

  canvas->fillScreen(ILI9341_PURPLE);
  canvas->setCursor(x, y);
  canvas->setTextColor(ILI9341_YELLOW);
  canvas->setTextSize(3);
  canvas->print("MCH2022");

  /* Move the text for the next round */
  x += 16;
  if (x >= ILI9341_TFTHEIGHT) {
    y += 24;
    x = 0;
    if (y >= ILI9341_TFTWIDTH) {
      y = 0;
    }
  }

  tft.drawRGBBitmap(0, 0, canvas->getBuffer(), canvas->width(), canvas->height());
}

For a long while, I only got a white-ish blank screen. Getting that pinMode(GPIO_LCD_MODE, OUTPUT); and digitalWrite(GPIO_LCD_MODE, LOW); in there was critical to get the display to behave.

[Compile success. Upload success. Purple background and yellow MCH2022 on the badge]

A nice Makefile helps too, for quick building/uploading:

PROJECT := $(notdir $(CURDIR))
FQBN := esp32.esp32.esp32wrover
BINARY := build/$(FQBN)/$(PROJECT).ino.bin

.PHONY: upload
upload: build
        ~/Arduino/projmch2022/mch2022-tools/app_push.py \
          --run $(BINARY) 'my_cool_app' main.bin 1

.PHONY: build
build: $(BINARY)

$(BINARY): $(PROJECT).ino
        arduino-cli compile -eb esp32:esp32:esp32wrover $(CURDIR)

(Don't forget to convert 8 spaces to tabs in any Makefile.)

2023-08-22 - mch2022 badge / arduino ide / part 1 / serial

The colored LCD display on my living room HUD has reached its end-of-life. Time to replace it with an MCH2022 badge?

The HUD I have in my living room is a breadboard with an ESP8266 development board, some sensors and a Grove 16x2 LCD I2C display. The background color of the display nicely visualizes current power production/usage, but the bottom row of text has now become defunct.

I could attempt to find a new LCD component at the closest electronics shop. But I also happen to have a bunch of MCH2022 badges laying around, after the company paid trip there last year. Maybe I could use one of those instead. It has a 320x240 full color display and various free pins. And an on-board ESP32. The existing ESP8266 code should run more than fine on that.

The Badge development documentation lists Arduino as a possible route. This would allow me to reuse most of my existing HUD code. Unfortunately the page says: “PLEASE BE AWARE THAT THE ARDUINO SDK IS NOT FULLY SUPPORTED!! YOU MAY RUN INTO SOME ISSUES”

They were not wrong...

Steps

The steps listed are:

  • Install the Arduino IDE if you haven’t already — do use version 2.x and not 1.x (because it's old, unless you want to do a lot of serial monitoring, in which case you probably want sermon anyway (see below).
  • Install ESP32 support using these instructions — from "esp32" select the "ESP32 Wrover Module" board.
  • Download mch2022-tools — you'll be needing app_push.py and its prerequisites: the files next to it, and python3-usb.

You can write an Arduino sketch, and then instead of Upload you instead do an Export Compiled Binary (ctrl+alt+S). You upload the .bin to the MCH2022 appfs like this:

$ .../app_push.py --run <binfile> <name> <title> <version>

Hello world

We start off with this hello-serial project:

void setup() {
  Serial.begin(115200);
  Serial.println("hello MCH2022");
}

void loop() {
  Serial.println(millis());
  delay(1000);
}

Building and exporting with ctrl+alt+S gives us this output:

...
Compiling sketch...
~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/bin/xtensa-esp32-elf-g++ ... -o /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/sketch/hello-serial.ino.cpp.o
Compiling libraries...
Compiling core...
bash -c "echo -DARDUINO_CORE_BUILD >; /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/file_opts"
Using precompiled core: /tmp/arduino/cores/cb14c27b7a099bfbee683e2ff88af527/core.a
bash -c ": > /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/file_opts"
Linking everything together...
~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/bin/xtensa-esp32-elf-g++ -Wl,--Map=/tmp/arduino/sketches/...
python3 ~/.arduino15/packages/esp32/tools/esptool_py/4.5.1/esptool.py --chip esp32 elf2image --flash_mode dio --flash_freq 80m --flash_size 4MB --elf-sha256-offset 0xb0 -o /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/hello-serial.ino.bin /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/hello-serial.ino.elf
esptool.py v4.5.1
Creating esp32 image...
Merged 2 ELF sections
Successfully created esp32 image.
python3 ~/.arduino15/packages/esp32/hardware/esp32/2.0.11/tools/gen_esp32part.py -q /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/partitions.csv /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/hello-serial.ino.partitions.bin
bash -c "[ ! -d \"/tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE\"/libraries/Insights ] || python3 \"~/.arduino15/packages/esp32/hardware/esp32/2.0.11\"/tools/gen_insights_package.py /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE hello-serial.ino \"~/Arduino/projmch2022/hello-serial\""
bash -c "[ 0 -eq 0 ] || cp -f \"~/.arduino15/packages/esp32/tools/openocd-esp32/v0.11.0-esp32-20221026/share/openocd/scripts/\"board/esp32-wrover-kit-3.3v.cfg \"~/Arduino/projmch2022/hello-serial\"/debug.cfg"
bash -c "[ 0 -eq 0 ] || cp -f \"~/.arduino15/packages/esp32/hardware/esp32/2.0.11\"/tools/ide-debug/esp32.json \"~/Arduino/projmch2022/hello-serial\"/debug_custom.json"
bash -c "[ 0 -eq 0 ] || cp -f \"~/.arduino15/packages/esp32/hardware/esp32/2.0.11\"/tools/ide-debug/svd/esp32.svd \"~/Arduino/projmch2022/hello-serial\"/debug.svd"

~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/bin/xtensa-esp32-elf-size -A /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/hello-serial.ino.elf
Sketch uses 281621 bytes (21%) of program storage space. Maximum is 1310720 bytes.
Global variables use 21480 bytes (6%) of dynamic memory, leaving 306200 bytes for local variables. Maximum is 327680 bytes.

We now have these files:

$ ls -og hello-serial/build/esp32.esp32.esp32wrover/
total 4589
-rw-rw-r-- 1  281984 aug 21 23:44 hello-serial.ino.bin
-rw-rw-r-- 1   18992 aug 21 23:44 hello-serial.ino.bootloader.bin
-rwxrwxr-x 1 6050184 aug 21 23:44 hello-serial.ino.elf
-rw-rw-r-- 1 9089963 aug 21 23:44 hello-serial.ino.map
-rw-rw-r-- 1    3072 aug 21 23:44 hello-serial.ino.partitions.bin

Uploading:

$ .../app_push.py --run \
    hello-serial/build/esp32.esp32.esp32wrover/hello-serial.ino.bin \
    my_cool_app main.bin 1

The results:

[Failed to start app. Check console for more details. App: my_cool_app]
Failed to start app.
Check console for more details.
App: my_cool_app

So, the app does get uploaded. But it fails to start for some reason or the other.

A crashing app

Luckily the badge is robust: it handles that the app fails, reports the failure and gets back into a working start by pressing A. Our "my_cool_app" app can be found under "Apps" where we can retry starting it, but not before hooking up a serial console (see sermon below):

$ sermon -t /dev/ttyACM0 115200
sermon [-t] DEV BAUD; now listening on /dev/ttyACM0 with 115200 baud
00:08:08.085638: E (127) esp_core_dump_flash: No core dump partition found!
00:08:08.219348: E (128) esp_core_dump_flash: No core dump partition found!
00:08:08.224508:
00:08:08.224691: abort() was called at PC 0x400d6309 on core 0
00:08:08.228778:
00:08:08.228933:
00:08:08.229113: Backtrace: 0x40083495:0x3ffbc160 0x400893d5:0x3ffbc180 0x4008e1b1:0x3ffbc1a0 0x400d6309:0x3ffbc220 0x400d2088:0x3ffbc240 0x400d2d1e:0x3ffbc270 0x400f2fd4:0x3ffbc2a0
00:08:08.243526:
00:08:08.243700:
00:08:08.243879:
00:08:08.244054:
00:08:08.244211: ELF file SHA256: e104f0bd5a9e7c1a
00:08:08.247249:
00:08:08.247409: E (150) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
00:08:08.255852: Rebooting...
00:08:08.731141: I (473) psram: This chip is ESP32-D0WD
...

Starting the app again yielded the above messages in the serial monitor. It looks like abort() is called because of something, after which the device is reboooted. The I line marks the first log of the Launcher (Home) application.

That backtrace will give us sufficient clues to go on, but we'll need some means to figure out where those addresses are. If we had installed the Expressif tools, we would have a bunch of binaries in ~/.espressif/tools/xtensa-esp32-elf/.../bin/, like xtensa-esp32-elf-addr2line, xtensa-esp32-elf-gdb and xtensa-esp32-elf-objdump. Now I could only find objdump as ~/.arduino15/packages/esp32/tools/.../bin/objdump.

objdump is not addr2line, but we can use its output to find function locations with a helper script objdump-syms-addr2line:

#!/usr/bin/perl
# Replacement for addr2line that uses "objdump --syms" output
#
# Why: because you might have objdump but not addr2line
#
# Example:
#   $ objdump --syms bin.elf | objdump-syms-addr2line 0x400d2d1e 0x400f2fd4
#   0x400d2d1e app_main [0x400d2d0c+50]
#   0x400f2fd4 main_task [0x400f2f60+128]
#
# Alternative (if you have it):
#   $ addr2line -Cfe bin.elf 0x400d2d1e 0x400f2fd4 | sed -e '0~2s/^/  /'
#   app_main
#     ~/.arduino15/packages/esp32/hardware/esp32/2.0.11/cores/esp32/main.cpp:70
#   main_task
#     /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos...
#
use warnings;
my %l = ();
my %f = ();
for (<STDIN>) {
    # 400d63f4 l     F .flash.text    000000c2 esp_ota_current_ota_is_workable
    if (/^([0-9a-f]+)\s+(\S+\s+){3}([0-9a-f]+)\s+(\S+)/ and hex($1)) {
        $l{hex($1)} = hex($3);
        $f{hex($1)} = $4;
    }
}
foreach (@ARGV) {
    my ($i, $x) = (hex, undef);
    foreach my $s (keys %l) {
        if ($s <= $i and $i < ($s + $l{$s})) {
            $x = $s;
            last;
        }
    }
    if ($x) {
        printf("0x%08x %s [0x%x+%d]\n", $i, $f{$x}, $x, $l{$x});
    } else {
        print "0x".hex($i)."  <unknown>\n";
    }
}

(Next time I should probably check all of find ~/.arduino15/packages/esp32/tools/ -perm /0700 -type f -path '*/bin/*' for more tools.)

Using this script on the addresses from the backtrace, we get this:

$ ~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/xtensa-esp32-elf/bin/objdump \
    --syms hello-serial/build/esp32.esp32.esp32wrover/hello-serial.ino.elf | \
  objdump-syms-addr2line \
    0x40083495 0x400893d5 0x4008e1b1 0x400d6309 0x400d2088 0x400d2d1e 0x400f2fd4
0x40083495 panic_abort [0x40083480+29]
0x400893d5 esp_system_abort [0x400893d0+8]
0x4008e1b1 abort [0x4008e128+140]
0x400d6309 esp_ota_get_running_partition [0x400d62a0+108]
0x400d2088 initArduino [0x400d2080+348]
0x400d2d1e app_main [0x400d2d0c+50]
0x400f2fd4 main_task [0x400f2f60+128]

Good. Finally some useful clues.

esp_ota_get_running_partition

I cannot find the esp_ota_get_running_partition definition in the .arduino15 dir, but I can find initArduino where that function is invoked — in esp32-hal-misc.c:

void initArduino()
{
#ifdef CONFIG_APP_ROLLBACK_ENABLE
    if(!verifyRollbackLater()){
        const esp_partition_t *running = esp_ota_get_running_partition();
        esp_ota_img_states_t ota_state;
        if (esp_ota_get_state_partition(running, &ota_state) == ESP_OK) {
            if (ota_state == ESP_OTA_IMG_PENDING_VERIFY) {
                if (verifyOta()) {
                    esp_ota_mark_app_valid_cancel_rollback();
                } else {
                    log_e("OTA verification failed! Start rollback to the previous version ...");
                    esp_ota_mark_app_invalid_rollback_and_reboot();
                }
            }
        }
    }
#endif
    //init proper ref tick value for PLL (uncomment if REF_TICK is different than 1MHz)
    //ESP_REG(APB_CTRL_PLL_TICK_CONF_REG) = APB_CLK_FREQ / REF_CLK_FREQ - 1;
#ifdef F_CPU
    setCpuFrequencyMhz(F_CPU/1000000);
...

So, it looks like it bails out right after starting: initArduino() is invoked, verifyRollbackLater() returns false and then esp_ota_get_running_partition() calls abort because of an unexpected situation.

For normal ESP32 applications, this panic is not triggered. But the badge does something special when chainloading the second app — the Launcher (Home) app being the first one.

I figure we don't need any rollback. The badge handles bad apps just fine — as we noticed already — so we'll need to figure out a way to skip this code.

Fixing through sdkconfig.h

Avoiding the call to esp_ota_get_running_partition() can be done by unsetting/removing the CONFIG_APP_ROLLBACK_ENABLE define.

In ~/.arduino15/packages/esp32/.. ../tools/sdk/esp32/sdkconfig the config is seen, but this is not the active config:

CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y
CONFIG_APP_ROLLBACK_ENABLE=y

We cannot undefine it from our compilation unit. We have to alter sdkconfig.h. There are a bunch of those. We're using ~/.arduino15/packages/esp32/.. ../tools/sdk/esp32/qio_qspi/include/sdkconfig.h at the moment (see Tools > Flash Mode menu option):

// ...
#define CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE 1
// ...
#define CONFIG_APP_ROLLBACK_ENABLE CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE
// ...

By removing/commenting both we can build a new binary without the code that calls esp_ota_get_running_partition():

// ...
//#define CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE 1
// ...
//#define CONFIG_APP_ROLLBACK_ENABLE CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE
// ...

Build, Upload, Run. Don't worry about the screen not updating (showing "Connected to PC"). Check with sermon:

$ sermon -t /dev/ttyACM0 115200
sermon [-t] DEV BAUD; now listening on /dev/ttyACM0 with 115200 baud
[   455][I][esp32-hal-psram.c:96] psramInit(): PSRAM enabled
10:03:41.117540: hello MCH2022
10:03:41.118636: 480
10:03:42.117090: 1479
10:03:43.117141: 2479
...

Yes, victory! The app starts and appears to run fine. (Don't worry about any garbage showing up in the serial output during upload.)

Fixing from the project

Editing that sdkconfig.h is cumbersome. There must be a better way. And luckily there is. (Start by undoing your edits to sdkconfig.h if you haven't already.)

We could try to override the esp_ota_get_running_partition function itself, and have it return NULL instead of aborting:

extern "C" const esp_partition_t* esp_ota_get_running_partition() { return NULL; }

Alas, the linking step fails:

Linking everything together...
~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/bin/xtensa-esp32-elf-g++ ... -o /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/hello-serial.ino.elf
~/.arduino15/packages/esp32/tools/xtensa-esp32-elf-gcc/esp-2021r2-patch5-8.4.0/bin/../lib/gcc/xtensa-esp32-elf/8.4.0/../../../../xtensa-esp32-elf/bin/ld: ~/.arduino15/packages/esp32/hardware/esp32/2.0.11/tools/sdk/esp32/lib/libapp_update.a(esp_ota_ops.c.obj): in function `esp_ota_get_running_partition':
/Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/app_update/esp_ota_ops.c:533: multiple definition of `esp_ota_get_running_partition'; /tmp/arduino/sketches/2EA640295D81B35604C751A774B1A1FE/sketch/hello-serial.ino.cpp.o:~/Arduino/projmch2022/hello-serial/hello-serial.ino:3: first defined here
collect2: error: ld returned 1 exit status

That was to be expected. We're not using shared libraries, so we cannot have multiple functions with the same name.

But the verifyRollbackLater() function we saw earlier can be overridden. In fact, it's specifically tailored to be, through __attribute__((weak)). If we create one and have it return true, we'll also skip the call to esp_ota_get_running_partition() and achieve our goal:

extern "C" bool verifyRollbackLater() { return true; }

Add the above line to our sketch. This builds, and uploads. All is well!

Ending the application

Now we need a way to exit the application and return to the Launcher.

The Badge.team guys already helpfully added the relevant code. Reproduced here with some extra functionality for future reference:

/* Needed for return_to_launcher() */
#include <esp_system.h>
#include <soc/rtc.h>
#include <soc/rtc_cntl_reg.h>

/* Needed for *_inputs() */
#include <Wire.h>

/* Workaround to skip esp_ota_get_running_partition() */
extern "C" bool verifyRollbackLater() {
  return true;
}

/* See ESP PIN numbers here:
 * https://github.com/badgeteam/esp32-component-mch2022-bsp/
 *   blob/2da8dbd50679512058820b40bfd6d7a5c529b28c/include/mch2022_badge.h */
#define GPIO_I2C_SDA 22
#define GPIO_I2C_SCL 21
#define GPIO_INT_RP2040 34 /* active LOW */

/* I2C addresses */
#define RP2040_ADDR 0x17 /* RP2040 co-processor */

/* See RP register addresses here:
 *   https://github.com/badgeteam/esp32-component-mch2022-rp2040/
 *     blob/45fce00c790a2918bf506dd8af0c0b68db05d031/include/rp2040.h
 * Feed through:
 *   sed -e '1,/RP2040_REG_FW_VER/d;/}/,$d;s/,//' | awk '{printf "#define %s 0x%02x\n",$1,NR}'
 */
#define RP2040_REG_INPUT1 0x06

/* Buttons, accessible through the RP2040 through wire I2C */
#define BTN_HOME 0x0001   /* home */
#define BTN_MENU 0x0002   /* menu */
#define BTN_START 0x0004  /* start */
#define BTN_A 0x0008      /* accept */
#define BTN_B 0x0010      /* back */
#define BTN_RES0 0x0020   /* (RP2040_INPUT_FPGA_CDONE) */
#define BTN_RES1 0x0040   /* (RP2040_INPUT_BATTERY_CHARGING) */
#define BTN_SELECT 0x0080 /* select */
#define BTN_LEFT 0x0100   /* joy left */
#define BTN_JOY 0x0200    /* joy press */
#define BTN_DOWN 0x0400   /* joy down */
#define BTN_UP 0x0800     /* joy up */
#define BTN_RIGHT 0x1000  /* joy right */

#define BTN2STATE(button) (button & 0xffff)
#define BTN2BTN(button) ((button >> 16) & 0xffff)
#define BUTTON_PRESSED(button, compare) \
  ((BTN2BTN(button) & compare) && (BTN2STATE(button) & compare) == compare)
#define BUTTON_RELEASED(button, compare) \
  ((BTN2BTN(button) & compare) && (BTN2STATE(button) & compare) == 0)

void init_inputs();
uint32_t read_inputs();
void return_to_launcher();

void setup() {
  Serial.begin(115200);
  Serial.println("hello MCH2022");

  init_inputs();
}

void loop() {
  if (!digitalRead(GPIO_INT_RP2040)) {
    uint32_t input = read_inputs();
    if (BUTTON_RELEASED(input, BTN_HOME)) {
      return_to_launcher();
    } else {
      Serial.printf("input: 0x%08x // 0x<button><state>\r\n", input);
    }
  }
}

void init_inputs() {
  Wire.begin(GPIO_I2C_SDA, GPIO_I2C_SCL);
  pinMode(GPIO_INT_RP2040, INPUT);
  read_inputs();
}

uint32_t read_inputs() {
  Wire.beginTransmission(RP2040_ADDR);
  Wire.write(RP2040_REG_INPUT1);
  Wire.endTransmission();
  Wire.requestFrom(RP2040_ADDR, 4);
  /* state  = 0x20 + sum of all pressed buttons */
  uint16_t state = Wire.read() | (Wire.read() << 8);
  /* which_button = 1, 2, 4, 8, ... */
  uint16_t which_button = Wire.read() | (Wire.read() << 8);
  /* NOTE: which_button may hold more than one if the presses/releases
   * are registered "simultaneously" */
  return (which_button << 16 | state);
}

void return_to_launcher() {
  REG_WRITE(RTC_CNTL_STORE0_REG, 0);
  esp_restart();
}

You can now see buttons get pressed/depressed:

10:40:24.257248: input: 0x00080028 // 0x<button><state>
10:40:24.497229: input: 0x00080020 // 0x<button><state>

Note that this sketch still lacks the code to interface with the screen. That will be for another day...

sermon (serial-monitor)

As promised above, this sermon snippet is better than the crippled built-in Serial Monitor in Arduino IDE 2:

#!/bin/sh
# sermon (serial-monitor) -- more convenient for serial logs than screen(1)
set -eu
test "${1:-}" = '-t' && { with_ts=true; shift; } || with_ts=false
dev=${1:-/dev/ttyACM0}
baud=${2:-115200}
stty -F "$dev" "$baud" raw -clocal -echo || {
    echo "Usage: ${0##*/} [-t] DEV BAUD" >&2; exit 1; }
echo "${0##*/} [-t] DEV BAUD; now listening on $dev with $baud baud" >&2
if $with_ts; then
    exec perl -e 'use strict;use warnings;
      use Time::HiRes qw(gettimeofday);use POSIX qw(strftime);
      my ($nl,$t,$ut)=1;while(sysread STDIN,$b,1){
      if($nl){($t,$ut)=gettimeofday;syswrite STDOUT,
      (strftime "%H:%M:%S",localtime $t).(sprintf ".%06d: ",$ut);$nl=0}
      $nl=1 if $b eq "\n";syswrite STDOUT,$b}' <"$dev"
else
    exec cat "$dev"
fi

Usage example: sermon -t /dev/ttyACM0 115200

2023-07-08 - laptop battery discharge / logging

I recently got a new Framework Laptop. It is generally nice. Apart from the reflective screen, and the excessive battery consumption.

In suspend mode, it draws too much battery: more than a Watt. This is a known issue. The worst offenders are the expansion cards. For instance the USB-A cards consume about 350mW each, just by being plugged in.

To do some testing, I whipped up a script allowing easy access to battery usage logs: discharge-log

Place it in /usr/lib/systemd/system-sleep with execute permissions.

It will log usage before going into sleep and after waking up. This allows for easy testing of various configurations of expansion cards. The values look like this:

# journalctl -o cat -t discharge-log -b0 --grep dT | tail -n1
(sleep) dT = 362 (s), dQ = -9 (mAh) = 138 (mWh), rate = 0 (pmlpt), rate/h = 0 (pmlpt/h), dE = -542 (J), P = -1498 (mW)

One can run it manually too:

$ /usr/lib/systemd/system-sleep/discharge-log begin
$ /usr/lib/systemd/system-sleep/discharge-log show '(testing charge)'
(testing charge) dT = 185 (s), dQ = 32 (mAh) = 561 (mWh), rate = 10 (pmlpt), rate/h = 194 (pmlpt/h), dE = 2019 (J), P = 10916 (mW)
$ /usr/lib/systemd/system-sleep/discharge-log begin
$ /usr/lib/systemd/system-sleep/discharge-log show '(testing discharge)'
(testing discharge) dT = 314 (s), dQ = -50 (mAh) = -854 (mWh), rate = -15 (pmlpt), rate/h = -171 (pmlpt/h), dE = -3074 (J), P = -9791 (mW)

(The pmlpt values are promille-points.)

Ejecting all (non USB-C) cards leaves us with a battery usage of about 400mW. Still steep, but a lot better than a Watt. In my tests — with 2023-07 firmware and a Linux 6.1 kernel (linux-oem-22.04c) — deep versus s2idle sleep did matter a lot: I could leave the USB-A cards connected (with one mouse dongle) and get about 520mW discharge rate with s2idle; a clear winner.

2023-07-03 - viewing unencrypted traffic / ltrace / bpftrace

Can we view TLS-encrypted traffic on the originating or terminating host, without having to decode the data from the wire?

This is a question that comes up every now and then when trying to debug a service by looking at how it communicates.

For the most insight, we should capture the encrypted traffic and use the (logged!) pre-master secret keys. See example details in make-master-secret-log discussing how to have HAProxy log them. Oftentimes this involves a lot of work: capturing the traffic, logging the keys, collecting both and running them through Wireshark on a local machine.

If we have access to the encryption libraries, there can be easier ways.

Using ltrace?

We should be able to use ltrace which traces the invocation of library calls. ltrace is the library counterpart of strace which traces system calls. Looking at the network system calls (recvfrom or sendto) would be useless: we'd see the encrypted traffic. Encryption is done in userland, so we need to be one step closer to the application.

By using ltrace, we can trace function calls before data gets encrypted and after data gets decrypted. In this case we're interested in SSL_read and SSL_write from the OpenSSL library — assuming that that is the TLS library our application is using.

Let's try it on cURL. This cURL build is compiled against OpenSSL, as is common:

$ curl --version | grep -i ssl

curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2 ...

Invoke it from ltrace:

$ ltrace curl --http1.1 -I https://www.google.com -sSfLo/dev/null

+++ exited (status 0) +++

Nothing! No library calls whatsoever?

It turns out that this version of ltrace (0.7.3-6.1ubuntu6 on Ubuntu/Jammy) is too old and suffers from ltrace produces no debug output at all.

The problem is fixed by ec56370 (Add Intel CET support). After fetching and building that, we can try the newer ltrace:

$ ltrace-0.8.x -e SSL_read -e SSL_write \
    curl --http1.1 -I https://www.google.com -sSfLo/dev/null

libcurl.so.4->SSL_write(0x55ad69380010, 0x55ad693a0f30, 79, 1) = 79
libcurl.so.4->SSL_read(0x55ad69380010, 0x55ad691eaa80, 0x19000, 1) = 1215
libcurl.so.4->SSL_read(0x55ad69380010, 0x7ffd2590e4e0, 32, 0x55ad691e5828) = 0xffffffff
+++ exited (status 0) +++

Good! We can see the calls to SSL_write and SSL_read.

(Ignore the mention of libcurl.so.4 here. We are in fact seeing the libssl calls.)

Unfortunately ltrace doesn't know that the second argument to both functions is a character buffer that often contains readable strings. We see the memory locations of the buffers but we do not get to see their contents. So, using plain ltrace does not help us.

Using latrace?

There exists a similar application that we can use: latrace. We can convince latrace to print arguments (see -A and -a). But there we'll get the arguments before the call only, not after: we can see the write data going into SSL_write but we will only see uninitialized data going into SSL_read, not the read data coming out of it.

Using bpftrace?

Maybe eBPF can come to the rescue?

First we need to know which library our application is using exactly:

$ ldd `which curl` | grep -E 'ssl|tls'
  libssl.so.3 => /lib/x86_64-linux-gnu/libssl.so.3 (0x00007feb84d20000)
  libgnutls.so.30 => /lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007feb843a6000)

We can use bpftrace and attach BPF userland probes on SSL_read and SSL_write in libssl.so.3:

$ sudo bpftrace -e '
    uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read,
    uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_write {
        @ctx[pid] = arg0; @buf[pid] = arg1; @len[pid] = arg2;
    }
    uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read,
    uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_write {
        printf("[%d/%s] %s(%p, %p, %d) = %d", pid, comm, probe, @ctx[pid], @buf[pid], @len[pid], retval);
        if ((int32)retval > 0) {
            @slen = retval;
            if (@slen >= 64) {
                printf(" [[\n%s\n]] (truncated)", str(@buf[pid], @slen));
            } else {
                printf(" [[\n%s\n]]", str(@buf[pid], @slen));
            }
        }
        printf("\n");
        delete(@ctx[pid]); delete(@buf[pid]); delete(@len[pid]);
    }
'

(Our cURL uses OpenSSL. Adapting this for GnuTLS encrypted reads/writes is left as an exercise for the reader.)

eBPF does not support spawning the process itself. We'll have to ensure the sniffed program is started elsewhere:

$ curl --http1.1 -I https://www.google.com -sSfLo/dev/null

The bpftrace probe displays this:

Attaching 4 probes...

[2556865/curl] uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_write(0x55ad2054d010, 0x55ad2056df30, 79) = 79 [[
HEAD / HTTP/1.1
Host: www.google.com
User-Agent: curl/7.81.0
]] (truncated)

[2556865/curl] uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read(0x55ad2054d010, 0x55ad203b7a80, 102400) = 1214 [[
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
C
]] (truncated)

[2556865/curl] uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read(0x55ad2054d010, 0x7fff2f584060, 32) = -1

^C

Looking better! Some contents. Not very conveniently formatted, but it's a start.

The biggest problem here is that we're limited by BPFTRACE_STRLEN this time. We can get at most 64 characters, or maybe 200 if you push it.

Alternatives?

Are there alternatives?

We could attach a full fledged debugger (gdb), but that might feel a bit heavy. We could LD_PRELOAD a library that wraps the SSL_read and SSL_write calls, but this only works if we start the application with this wrapper. We could use ptrace(2) and create a minimal debugger that does what ltrace does, but then ensure that the buffers are printed in full.

Options, options.

And then we still haven't tackled the issue of statically compiled programs: many services in today's ecosystem are written in golang. They use no (or very few) dynamic libraries. Trying to trap OpenSSL or GnuTLS function calls there is meaningless. I'm afraid we'll have to dig into golang debugging at some point...

2023-06-30 - removing auditd / disabling logging

After installing auditd for testing purposes and removing it again, my kernel logs got flooded with messages. How do I disable them?

If you happened to have installed auditd, it is likely that the kernel audit subsystem was enabled. Even when there are no rules left (auditctl -l) you can still get more messages in your kernel logs than before.

For instance, after uninstalling auditd, I still get the following ones:

$ sudo journalctl -o cat --dmesg --grep audit -n2

audit: type=1106 audit(1688134323.226:190): pid=2476506 uid=1000 auid=1000 ses=3 subj=unconfined
  msg='op=PAM:session_close grantors=pam_limits,.. acct="root"
  exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/15 res=success'

audit: type=1104 audit(1688134323.226:191): pid=2476506 uid=1000 auid=1000 ses=3 subj=unconfined
  msg='op=PAM:setcred grantors=pam_permit acct="root"
  exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/15 res=success'

The audit rules might be gone, but userland can still log messages, and they are now forwarded through the kernel ring buffer. The pam subsystem does this, for instance:

int _pam_auditlog(pam_handle_t *pamh, int action, int retval, int flags, struct handler *h)
/* ... */
  case PAM_CLOSE_SESSION:
    message = "session_close";
/* ... */
  if (_pam_audit_writelog(pamh, audit_fd, type, message,
      grantors ? grantors : "?", retval) < 0)
/* ... */

Not expecting those logs after removing auditd?

Here's a quick way to disable audit logging again: compile the following C snippet, and run it to disable.

$ gcc -Wall -o audit_enable_disable audit_enable_disable.c
$ sudo ./audit_enable_disable 0

audit_enable_disable.c

/* GPL2.1+ */
#include <asm/types.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/netlink.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_AUDIT_MESSAGE_LENGTH 8970 // PATH_MAX*2+CTX_SIZ*2+11+256+1

struct audit_message {
    struct nlmsghdr nlh;
    char data[MAX_AUDIT_MESSAGE_LENGTH];
};
struct audit_reply {
    int type;
    int len;
    struct nlmsghdr *nlh;
    struct audit_message msg;
    struct nlmsgerr *error;
};

int audit_open(void) {
    int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
    if (fd < 0) {
        perror("socket");
        return -1;
    }
    if (fcntl(fd, F_SETFD, FD_CLOEXEC) == -1) {
        perror("fcntl");
        close(fd);
        return -1;
    }
    return fd;
}

int audit_recv_error(int fd) {
    struct audit_message resp = {0};
    struct sockaddr_nl nladdr;
    int ret;

    socklen_t nladdrlen = sizeof(nladdr);
    do {
        ret = recvfrom(
             fd, &resp, sizeof(resp), 0,
             (struct sockaddr*)&nladdr, &nladdrlen);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0 || nladdrlen != sizeof(nladdr) || nladdr.nl_pid) {
        perror("recvfrom");
        return -1;
    }

    if (resp.nlh.nlmsg_type != NLMSG_ERROR) {
        fprintf(
            stderr, "audit_recv_error: got unexpected %d\n",
            resp.nlh.nlmsg_type);
        return -1;
    }

    return ((struct nlmsgerr*)NLMSG_DATA(&resp.nlh))->error == 0 ? 0 : -1;
}

int audit_send(int fd, int type, const void *data, unsigned int size) {
    struct audit_message req = {0};
    struct sockaddr_nl addr = {.nl_family = AF_NETLINK};
    int ret;

    req.nlh.nlmsg_len = NLMSG_SPACE(size);
    req.nlh.nlmsg_type = type;
    req.nlh.nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK;
    req.nlh.nlmsg_seq = 1;
    if (size && data) {
        memcpy(NLMSG_DATA(&req.nlh), data, size);
    }

    do {
        ret = sendto(
            fd, &req, req.nlh.nlmsg_len, 0,
            (struct sockaddr*)&addr, sizeof(addr));
    } while (ret < 0 && errno == EINTR);

    if (ret < 0 || ret != (int)req.nlh.nlmsg_len) {
        perror("sendto");
        return -1;
    }
    return audit_recv_error(fd);  /* because NLM_F_ACK */
}

int audit_set_enabled(int fd, uint32_t enabled) {
    struct audit_status s = {
        .mask = AUDIT_STATUS_ENABLED, .enabled = enabled};
    int ret = audit_send(fd, AUDIT_SET, &s, sizeof(s));
    if (ret < 0) {
        fprintf(stderr, "audit_set_enabled: failed, are you root?\n");
        return -1;
    }
    return ret;
}

int main(int argc, const char *argv[]) {
    int ret = 0;
    int fd = audit_open();

    if (fd < 0) {
        ret = -1;
    } else if (argc == 2) {
        uint32_t enable = atoi(argv[1]);
        /* allow 2 here to permanently enable until reboot */
        if (enable != 1) {
            enable = 0;
        }
        if (audit_set_enabled(fd, enable) < 0) {
            ret = 2;
        }
    } else {
        printf("Usage: %s 0|1\n", argv[0]);
        ret = 1;
    }
    if (fd >= 0) {
        close(fd);
    }
    return ret;
}

Or, if you still have auditctl, just do: auditctl -e 0

2023-06-15 - netplan / docker0 / bind on 172.17.0.1

If you want to bind your host-service to a the docker IP, exposing it to docker instances, means that that IP needs to exist first. If it doesn't, your log might look like this:

LOG:  listening on IPv4 address "127.0.0.1", port 5432
LOG:  could not bind IPv4 address "172.17.0.1": Cannot assign requested address
WARNING:  could not create listen socket for "172.17.0.1"
LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"

As you probaby know, you cannot bind to an IP that is not configured on an interface anywhere — barring the net.ipv4.ip_nonlocal_bind sysctl setting.

So, if you want to expose, say, your postgresql to a docker instance, you could try to set:
listen_addresses = '127.0.0.1,172.17.0.1'
postgresql will refuse to start unless that 172.17.0.1 IP exists on a local interface.

You could reorder your dependencies to have postgresql depend on docker, but that makes no sense from a logical perspective. A better fix is to make sure that the docker IP exists.

Add this in /etc/netplan/60-docker.yaml:

network:
  version: 2
  renderer: networkd
  bridges:
    docker0:
      addresses:
        - 172.17.0.1/16
      parameters:
        forward-delay: 0ms
        stp: false

postgresql will depend on the network being up, so now it should start on boot again without failing.

Trying to get the IPv6 addresses as ansible fact, but getting unwanted link_local scope addresses?

Maybe you're landing here because you did not see a solution at “all_ipv6_addresses includes link local addresses”. (I find it most odd that they locked the GitHub ticket, eliminating the possibility for anyone to reply with a fix.)

The ansible_all_ipv6_addresses | reject("ansible.utils.in_network", "fe80::/10") construct works for me. For example:

local-address=127.0.0.1, ::1, {{ (
  ansible_all_ipv4_addresses + (
    ansible_all_ipv6_addresses|reject("ansible.utils.in_network","fe80::/10"))
  )|join(", ") }}

2023-03-31 - zabbix server / jammy upgrade / missing font

The other day, we upgraded the host OS for our Zabbix Server from Ubuntu/Focal to Ubuntu/Jammy. This caused all text to go missing from the rendered graphs.

The php (uwsgi) logs had the following to say:

PHP Warning:  imagettfbbox():
  Could not find/open font in
  /usr/share/zabbix/include/graphs.inc.php
  on line 600

At least that's a pretty clear message.

Through a quick php live hack we learned that it tried to open /usr/share/zabbix/assets/fonts/graphfont.ttf. This file was a symlink to /etc/alternatives/zabbix-frontend-font and that was a symlink to /usr/share/fonts/truetype/ttf-dejavu/DejaVuSans.ttf. But that symlink target did not exist.

# update-alternatives --config zabbix-frontend-font
update-alternatives: warning:
  /etc/alternatives/zabbix-frontend-font is dangling;
  it will be updated with best choice
There is only one alternative in link group
  zabbix-frontend-font (providing
  /usr/share/zabbix/assets/fonts/graphfont.ttf):
  /usr/share/fonts/truetype/ttf-dejavu/DejaVuSans.ttf
Nothing to configure.

That didn't fix anything.

Apparently the fonts-dejavu-core package has moved the fonts from /usr/share/fonts/truetype/ttf-dejavu to /usr/share/fonts/truetype/dejavu. There we could find our trusted old DejaVuSans.ttf.

The fix:

# ln -sf /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf \
    /etc/alternatives/zabbix-frontend-font

2023-03-02 - postfix / no system resources / proxy protocol

Connecting to Postfix and getting a "421 4.3.2 No system resources"? Maybe you forgot you're using the (HAProxy) Proxy Protocol...

If you're trying to connect to your Postfix mail daemon, and it looks like this:

$ nc localhost 25
... wait for 5 seconds ...
421 4.3.2 No system resources

Then I bet you're using HAProxy as reverse proxy to your mailserver and you have the following configured:

$ postconf | grep ^postscreen_upstream
postscreen_upstream_proxy_protocol = haproxy
postscreen_upstream_proxy_timeout = 5s

To test a direct connection, you'll need to prefix your traffic with the proxy protocol v1 handshake. That can be as simple as pasting PROXY TCP4 127.1.1.1 127.0.0.1 12345 25 as first line:

$ nc localhost 25
PROXY TCP4 127.0.123.45 127.0.0.1 12345 25
220 the-mail-server ESMTP Postfix

After that, you can resume typing SMTP commands as you're used to.

An alternative solution is to use LD_PRELOADed code that does this for you. For example libproxyproto by Michael Santos:

$ git clone https://github.com/msantos/libproxyproto.git
...

$ cd libproxyproto

$ make
...

That creates a libproxyproto.so and a libproxyproto_connect.so shared library. And now you can do this:

$ LD_PRELOAD=./libproxyproto_connect.so LIBPROXYPROTO_VERSION=1 nc localhost 25
220 the-mail-server ESMTP Postfix

Or — if you need this a lot — you can move them to /usr/local/lib/ and add a small /usr/local/bin/proxyproto_v1 shell wrapper:

#!/bin/sh
# Invoke as: proxyproto_v1 nc [args...]
LD_PRELOAD=/usr/local/lib/libproxyproto_connect.so \
LIBPROXYPROTO_VERSION=1 \
exec "$@"

2023-02-17 - oneliner / finding fixed kernel bugs

Recently we were bitten by an old kernel bug on Ubuntu that only rarely triggers. Finding out where the problem was is easier if you know where to look.

We had no kernel logs to go on. Only a hanging machine with no output. And a hunch that the running kernel version linux-image-5.4.0-122 was the likely culprit. Was there another way than meticulously reading all changelogs and changes to find out which bug we're dealing with?

For issues that are already fixed in a newer kernel, the following recipe seems to work. (Note that we would like to use commit ids instead of string matching, but this won't work because Ubuntu cherry-picks the commits, altering the commit ids.)

Step 1: get the Ubuntu kernel repo:

$ git clone git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal \
    linux-ubuntu-focal

Step 2: check which versions have been released:

$ git -c tag.sort=-v:refname tag | grep Ubuntu-5.4.0
Ubuntu-5.4.0-144.161
Ubuntu-5.4.0-143.160
...
Ubuntu-5.4.0-123.139
Ubuntu-5.4.0-122.138
Ubuntu-5.4.0-121.137
Ubuntu-5.4.0-120.136
...

Step 3: pick which versions we want to check. In this case, we're assuming that the bug was introduced in Ubuntu-5.4.0-122.138, so we'll plug in Ubuntu-5.4.0-121.137..Ubuntu-5.4.0-122.138. And we'll plug in Ubuntu-5.4.0-122.138..Ubuntu-5.4.0-144.161 so we get all versions to the end.

$ git log Ubuntu-5.4.0-121.137..Ubuntu-5.4.0-122.138 --format=%s |
    grep -v ^UBUNTU: |
    while read -r l; do res=$(
      git log Ubuntu-5.4.0-122.138..Ubuntu-5.4.0-144.161 \
        --grep "Fixes:.*$(printf '%s' "$l" | sed -e 's/[^A-Za-z0-9 ]/./g')" \
        --oneline)
    test -n "$res" && echo "$res (fixes: $l)"
  done

97842ea930e0 tcp: make sure treq->af_specific is initialized (fixes: tcp: md5: incorrect tcp_header_len for incoming connections)
$ git show 97842ea930e0
commit 97842ea930e0eb94bfdb87beaf87d56224c1e8ad
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Apr 24 13:35:09 2022 -0700

    tcp: make sure treq->af_specific is initialized
...
    Fixes: 5b0b9e4c2c89 ("tcp: md5: incorrect tcp_header_len for incoming connections")

Nice. One result. Seems to be the one we're looking for.

$ git -c tag.sort=-v:refname tag --contains 97842ea930e0 | tail -n1
Ubuntu-5.4.0-123.139

And it was included in the next release already.

2023-01-27 - windows openvpn / unexpected default route

The other day, I was looking into a VPN client issue. The user could connect, they would get their routes pushed, but they would then proceed to use the VPN for all traffic instead of just the routes we provided them.

We did not push a default route, because this VPN server exposed a small internal network only. Any regular internet surfing should be done directly. So, when I looked at a tcpdump I was baffled when I saw that DNS lookups were attempted through the OpenVPN tunnel:

12:50:45.992684 IP 10.8.8.11.51953 > 8.8.8.8.53:
  51928+ A? kv601.prod.do.dsp.mp.microsoft.com. (52)

The server in question runs OpenVPN 2.4.

The client that exhibited this behaviour was OpenVPN Connect v3 for Windows, with the following peer info, according to the server logs:

peer info: IV_VER=3.git::d3f8b18b
peer info: IV_PLAT=win
peer info: IV_NCP=2
peer info: IV_TCPNL=1
peer info: IV_PROTO=30
peer info: IV_CIPHERS=AES-256-GCM:AES-128-GCM:CHACHA20-POLY1305
peer info: IV_AUTO_SESS=1
peer info: IV_GUI_VER=OCWindows_3.3.6-2752
peer info: IV_SSO=webauth,openurl,crtext

The users with Linux OpenVPN clients had no issues with this VPN. Was there an extra setting in the Windows OpenVPN Connect that we could change? A “Send all traffic over this tunnel” option to uncheck, perhaps? There seemed to be very few settings.

One thing we had changed recently was the DNS. We had begun pushing 8.8.8.8 as DNS to the users (to solve a different issue) using the following rules:

push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 8.8.4.4"
push "dhcp-option DOMAIN-ROUTE one-specific-domain.tld"

This rule was supposed to force lookups for *.one-specific-domain.tld to go through the aforementioned Google DNS servers. Maybe the VPN client secretly added a route for this under the assumption that if you want a specific DNS server for VPN, it should be routed through the VPN as well.

This was easy enough to test. I allowed traffic to 8.8.8.8 and 8.8.4.4 to go through the VPN.

Did that fix the problem? Well, no. DNS resolving worked for the user, and now actual (non-DNS) traffic would be attempted through the VPN as well:

13:02:14.618777 IP 10.8.8.11.52040 > 23.2.214.66.443:
  Flags [S], seq 932856193, win 64240,
  options [mss 1289,nop,wscale 8,nop,nop,sackOK], length 0

What is up with this? A route print on the Windows side showed nothing out of the ordinary:

============================================================
Active Routes:
Network Destination      Netmask      Gateway      Interface
        0.0.0.0          0.0.0.0  192.168.1.1  192.168.1.100  <- default
    10.55.55.55  255.255.255.255     10.8.8.1      10.8.8.11  <- vpn
       10.8.8.0    255.255.255.0     On-link       10.8.8.11  <- vpn
      10.8.8.11  255.255.255.255     On-link       10.8.8.11  <- vpn
     10.8.8.255  255.255.255.255     On-link       10.8.8.11  <- vpn
      127.0.0.0        255.0.0.0     On-link       127.0.0.1
      127.0.0.1  255.255.255.255     On-link       127.0.0.1
127.255.255.255  255.255.255.255     On-link       127.0.0.1
    192.168.1.0    255.255.255.0     On-link   192.168.1.100
  192.168.1.100  255.255.255.255     On-link   192.168.1.100
  192.168.1.255  255.255.255.255     On-link   192.168.1.100
123.123.123.123  255.255.255.255  192.168.1.1  192.168.1.100  <- vpn-server
      224.0.0.0        240.0.0.0     On-link       127.0.0.1
      224.0.0.0        240.0.0.0     On-link   192.168.1.100
      224.0.0.0        240.0.0.0     On-link       10.8.8.11  <- vpn
255.255.255.255  255.255.255.255     On-link       127.0.0.1
255.255.255.255  255.255.255.255     On-link   192.168.1.100
255.255.255.255  255.255.255.255     On-link       10.8.8.11  <- vpn
============================================================

Ignoring broadcast and multicast addresses, only 10.55.55.55 and 10.8.8.* should go through the VPN interface. The default route 0.0.0.0 is clearly marked to go through the regular internet via the 192.168.1.1 gateway. This does not explain at all why traffic to 8.8.8.8 or 23.2.214.66 goes to the VPN.

In a last ditch attempt to fix things, I tried what happens if we did push 8.8.8.8 and 8.8.4.4 as routes that should go through the VPN:

push "route 8.8.8.8 255.255.255.255 vpn_gateway"
push "route 8.8.4.4 255.255.255.255 vpn_gateway"

Lo and behold! Things started working properly. Traffic to 10.55.55.55 (and to the nameservers) now goes through the tunnel, but traffic to the rest of the internet properly takes the default route.

I cannot explain why OpenVPN Connect on Windows would not use the routes it prints. Maybe there is a “Use default gateway on remote network” setting somewhere that got enabled when it received a DNS server IP that was not pushed over the same tunnel. One would think that this would be visible on the routing table though. If anyone reading this can explain this phenomenon, please drop us a line.

2023-01-17 - django 1.8 / python 3.10

After upgrading a machine to Ubuntu/Jammy there was an old Django 1.8 project that refused to run with the newer Python 3.10.

...
  File "django/db/models/sql/query.py", line 11, in <module>
    from collections import Iterator, Mapping, OrderedDict
ImportError: cannot import name 'Iterator' from 'collections' (/usr/lib/python3.10/collections/__init__.py)

This was relatively straight forward to fix, by using the following patch. Some parts were stolen from a stackoverflow response by Elias Prado.

--- a/django/core/paginator.py  2023-01-11 14:09:04.915505171 +0100
+++ b/django/core/paginator.py  2023-01-11 14:09:29.407130151 +0100
@@ -1,4 +1,4 @@
-import collections
+from collections.abc import Sequence
 from math import ceil
 
 from django.utils import six
@@ -103,7 +103,7 @@ class Paginator(object):
 QuerySetPaginator = Paginator   # For backwards-compatibility.
 
 
-class Page(collections.Sequence):
+class Page(Sequence):
 
     def __init__(self, object_list, number, paginator):
         self.object_list = object_list
--- a/django/db/migrations/writer.py  2023-01-11 14:13:07.507799080 +0100
+++ b/django/db/migrations/writer.py  2023-01-11 14:14:36.978436145 +0100
@@ -1,6 +1,6 @@
 from __future__ import unicode_literals
 
-import collections
+from collections.abc import Iterable
 import datetime
 import decimal
 import math
@@ -434,7 +434,7 @@ class MigrationWriter(object):
                     % (value.__name__, module_name, get_docs_version()))
             return "%s.%s" % (module_name, value.__name__), {"import %s" % module_name}
         # Other iterables
-        elif isinstance(value, collections.Iterable):
+        elif isinstance(value, Iterable):
             imports = set()
             strings = []
             for item in value:
--- a/django/db/models/base.py  2023-01-11 14:17:13.471982572 +0100
+++ b/django/db/models/base.py  2023-01-11 14:19:38.337720520 +0100
@@ -80,7 +80,12 @@ class ModelBase(type):
 
         # Create the class.
         module = attrs.pop('__module__')
-        new_class = super_new(cls, name, bases, {'__module__': module})
+        new_attrs = {'__module__': module}
+        classcell = attrs.pop('__classcell__', None)
+        if classcell is not None:
+            new_attrs['__classcell__'] = classcell
+        new_class = super_new(cls, name, bases, new_attrs)
+
         attr_meta = attrs.pop('Meta', None)
         abstract = getattr(attr_meta, 'abstract', False)
         if not attr_meta:
--- a/django/db/models/fields/__init__.py 2023-01-11 14:12:50.780054102 +0100
+++ b/django/db/models/fields/__init__.py 2023-01-11 14:14:02.290964344 +0100
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 from __future__ import unicode_literals
 
-import collections
+from collections.abc import Iterable, Iterator
 import copy
 import datetime
 import decimal
@@ -417,7 +417,7 @@ class Field(RegisterLookupMixin):
         for name, default in possibles.items():
             value = getattr(self, attr_overrides.get(name, name))
             # Unroll anything iterable for choices into a concrete list
-            if name == "choices" and isinstance(value, collections.Iterable):
+            if name == "choices" and isinstance(value, Iterable):
                 value = list(value)
             # Do correct kind of comparison
             if name in equals_comparison:
@@ -852,7 +852,7 @@ class Field(RegisterLookupMixin):
         return smart_text(self._get_val_from_obj(obj))
 
     def _get_choices(self):
-        if isinstance(self._choices, collections.Iterator):
+        if isinstance(self._choices, Iterator):
             choices, self._choices = tee(self._choices)
             return choices
         else:
--- a/django/db/models/sql/query.py 2023-01-11 14:07:45.900716653 +0100
+++ b/django/db/models/sql/query.py 2023-01-11 14:08:08.724366450 +0100
@@ -8,7 +8,8 @@ all about the internals of models in ord
 """
 import copy
 import warnings
-from collections import Iterator, Mapping, OrderedDict
+from collections import OrderedDict
+from collections.abc import Iterator, Mapping
 from itertools import chain, count, product
 from string import ascii_uppercase
 
--- a/django/db/models/sql/where.py 2023-01-11 14:13:01.595889201 +0100
+++ b/django/db/models/sql/where.py 2023-01-11 14:14:25.322613605 +0100
@@ -2,7 +2,7 @@
 Code to manage the creation and SQL rendering of 'where' constraints.
 """
 
-import collections
+from collections.abc import Iterator
 import datetime
 import warnings
 from itertools import repeat
@@ -59,7 +59,7 @@ class WhereNode(tree.Node):
         if not isinstance(data, (list, tuple)):
             return data
         obj, lookup_type, value = data
-        if isinstance(value, collections.Iterator):
+        if isinstance(value, Iterator):
             # Consume any generators immediately, so that we can determine
             # emptiness and transform any non-empty values correctly.
             value = list(value)

And to avoid the following warnings, the Django included six can be patched.

<frozen importlib._bootstrap>:914:
  ImportWarning: _SixMetaPathImporter.find_spec() not found;
  falling back to find_module()
<frozen importlib._bootstrap>:671:
  ImportWarning: _SixMetaPathImporter.exec_module() not found;
  falling back to load_module()

These changes were taken from six 1.16:

--- a/django/utils/six.py  2023-01-17 11:08:00.267645405 +0100
+++ b/django/utils/six.py  2023-01-17 11:12:13.993813451 +0100
@@ -71,6 +71,7 @@ else:
             MAXSIZE = int((1 << 63) - 1)
         del X
 
+from importlib.util import spec_from_loader
 
 def _add_doc(func, doc):
     """Add documentation to a function."""
@@ -186,6 +187,11 @@ class _SixMetaPathImporter(object):
             return self
         return None
 
+    def find_spec(self, fullname, path, target=None):
+        if fullname in self.known_modules:
+            return spec_from_loader(fullname, self)
+        return None
+
     def __get_module(self, fullname):
         try:
             return self.known_modules[fullname]
@@ -223,6 +229,12 @@ class _SixMetaPathImporter(object):
         return None
     get_source = get_code  # same as get_code
 
+    def create_module(self, spec):
+        return self.load_module(spec.name)
+
+    def exec_module(self, module):
+        pass
+
 _importer = _SixMetaPathImporter(__name__)
 
 
@@ -679,11 +691,15 @@ if PY3:
     exec_ = getattr(moves.builtins, "exec")
 
     def reraise(tp, value, tb=None):
-        if value is None:
-            value = tp()
-        if value.__traceback__ is not tb:
-            raise value.with_traceback(tb)
-        raise value
+        try:
+            if value is None:
+                value = tp()
+            if value.__traceback__ is not tb:
+                raise value.with_traceback(tb)
+            raise value
+        finally:
+            value = None
+            tb = None
 
 else:
     def exec_(_code_, _globs_=None, _locs_=None):
@@ -699,19 +715,19 @@ else:
         exec("""exec _code_ in _globs_, _locs_""")
 
     exec_("""def reraise(tp, value, tb=None):
-    raise tp, value, tb
+    try:
+        raise tp, value, tb
+    finally:
+        tb = None
 """)
 
 
-if sys.version_info[:2] == (3, 2):
-    exec_("""def raise_from(value, from_value):
-    if from_value is None:
-        raise value
-    raise value from from_value
-""")
-elif sys.version_info[:2] > (3, 2):
+if sys.version_info[:2] > (3,):
     exec_("""def raise_from(value, from_value):
-    raise value from from_value
+    try:
+        raise value from from_value
+    finally:
+        value = None
 """)
 else:
     def raise_from(value, from_value):
@@ -788,11 +804,10 @@ _add_doc(reraise, """Reraise an exceptio
 if sys.version_info[0:2] < (3, 4):
     def wraps(wrapped, assigned=functools.WRAPPER_ASSIGNMENTS,
               updated=functools.WRAPPER_UPDATES):
-        def wrapper(f):
-            f = functools.wraps(wrapped, assigned, updated)(f)
-            f.__wrapped__ = wrapped
-            return f
-        return wrapper
+        return functools.partial(_update_wrapper, wrapped=wrapped,
+                                 assigned=assigned, updated=updated)
+    wraps.__doc__ = functools.wraps.__doc__
+
 else:
     wraps = functools.wraps
 
@@ -802,10 +817,22 @@ def with_metaclass(meta, *bases):
     # This requires a bit of explanation: the basic idea is to make a dummy
     # metaclass for one level of class instantiation that replaces itself with
     # the actual metaclass.
-    class metaclass(meta):
+    class metaclass(type):
 
         def __new__(cls, name, this_bases, d):
-            return meta(name, bases, d)
+            if sys.version_info[:2] >= (3, 7):
+                # This version introduced PEP 560 that requires a bit
+                # of extra care (we mimic what is done by __build_class__).
+                resolved_bases = types.resolve_bases(bases)
+                if resolved_bases is not bases:
+                    d['__orig_bases__'] = bases
+            else:
+                resolved_bases = bases
+            return meta(name, resolved_bases, d)
+
+        @classmethod
+        def __prepare__(cls, name, this_bases):
+            return meta.__prepare__(name, bases)
     return type.__new__(metaclass, 'temporary_class', (), {})
 
 
@@ -821,6 +848,8 @@ def add_metaclass(metaclass):
                 orig_vars.pop(slots_var)
         orig_vars.pop('__dict__', None)
         orig_vars.pop('__weakref__', None)
+        if hasattr(cls, '__qualname__'):
+            orig_vars['__qualname__'] = cls.__qualname__
         return metaclass(cls.__name__, cls.__bases__, orig_vars)
     return wrapper
 

Patching is done using patch -p1 — you should know how.

2023-01-10 - sysctl / modules / load order / nf_conntrack

Recently we ran into an issue where connections were unexpectedly aborted. Connections from a NAT-ed client (a K8S pod) to a server would suddently get an old packet (according to the sequence number) in the middle of the data. This triggered the Linux NAT-box to issue a reset packet (RST). Setting the kernel flag to mitigate this behaviour required some knowledge of module load order during boot.

Spurious retransmits causing connection teardown

To start off: we observed that traffic from a pod to a server got disconnected. We enabled debug logging on the Kubernetes host where the pod resides. After enabling modprobe nf_log_ipv4 and net.netfilter.nf_conntrack_log_invalid=255, we saw this:

kernel: nf_ct_proto_6: SEQ is under the lower bound (already ACKed data retransmitted)
  IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=1480 TOS=0x00 PREC=0x00 TTL=61 ID=53534 DF
  PROTO=TCP SPT=6379 DPT=26110 SEQ=4213094653 ACK=3402842193 WINDOW=509 RES=0x00
  ACK PSH URGP=0 OPT (0101080A084C76F30D12DCAA)

In the middle of a sequence of several packets of data from the server, an apparently unrelated packet — it had data, but not intended for this stream — but with the same source/destination tuples and yet a sequence number that was more than 80K too low. (Wireshark flags this packet as invalid with a TCP Spurious Retransmission message.)

This triggered a reset (RST) by the Linux connection tracking module. And that in turned caused (unexpected) RSTs from the server.

POD <-> NAT <-> SRV
--------------------
            <-- TCP seq 2000000 ack 5555 len 1400
    <-- TCP seq 2000000 ack 5555 len 1400

            <-- TCP seq 1200000 ack 5555 len 1234 (seq is _way_ off)
            --> TCP RST seq 5555 len 0

            <-- TCP seq 2001400 ack 5555 len 1000
    <-- TCP seq 2001400 ack 5555 len 1000

(Made up numbers in the above table, but they illustrate the problem.)

At this point, the non-rejected traffic still got forwarded back to the pod. Its ACKs back to the server were now however rejected by the server with an RST of its own — that end of the connection thinks it was tore down already after all.

kernel: nf_ct_proto_6: invalid rst
  IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=40 TOS=0x00 PREC=0x00
  TTL=61 ID=0 DF PROTO=TCP SPT=6379 DPT=26110 SEQ=4213164625
  ACK=0 WINDOW=0 RES=0x00 RST URGP=0

The next packet (sequence 2001400 in the above example), was fine though. So if we could convince the Linux kernel to ignore the packet with the unexpected sequence number, our connections might survive.

Luckily there is such a flag: net.netfilter.nf_conntrack_tcp_be_liberal=1

While this does not explain the root cause, setting said flag mitigates the problem. It makes the the kernel ignore all spurious retransmits.

So, we placed net.netfilter.nf_conntrack_tcp_be_liberal = 1 in /etc/sysctl.conf and assumed the symptoms would be gone.

... or so we thought. Because after a reboot, the flag was unset again.

sysctl.conf not picked up?

That's odd. Setting kernel parameters during boot should be done in sysctl.conf (or sysctl.d). Why did it not get picked up?

The cause turned out to be this: this particular setting is not built-in. It belongs to a module; the nf_conntrack module. And that module is not necessarily loaded before sysctl settings are applied.

nf_conntrack was loaded on demand, and not in a particular well-defined order. Luckily, loading modules through /etc/modules-load.d is well defined, as you can see:

# systemd-analyze critical-chain systemd-sysctl.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

systemd-sysctl.service +44ms
└─systemd-modules-load.service @314ms +90ms
  └─systemd-journald.socket @242ms
    └─system.slice @220ms
      └─-.slice @220ms

Indeed, it sysctl settings are applied after systemd-modules-load.service:

# systemctl show systemd-sysctl.service | grep ^After
After=systemd-journald.socket system.slice systemd-modules-load.service

So, we can use the systemd-modules-load.service to ensure that the conntrack module is loaded before we attempt to set its parameters:

# cat /etc/modules-load.d/modules.conf
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
nf_conntrack

And that works. Now all settings are properly set during boot.

As for the spurious retransmissions: the last word has not yet been said on that...