Notes to self, 2025
2025-03-17 - sed / regular expressions / optimizing
Last week, we were looking at using sed(1) to remove line feeds from CSV files. The regular expressions used could use some further inspection.
We start off by creating two sample files:
$ python3 -c 'for x in range(100000): print("A"*1500)' >many-alpha1500.dat
$ python3 -c 'for x in range(100000): print("\""*1500)' >many-dquotes1500.dat
We want to compare the speed of a match versus a non-match, and take printing a line or not out of the equation. For sed we add !d or d to the case that matches:
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.693s  (all user time)
The regular expression from the previous post, which was quickly cobbled together, looked like this:
/^[^"]*("[^"]*"[^"]*)*$/
That is, zero-or-more non-double-quotes, then a combination of two double quotes with non-double-quotes around them, repeated as many times as possible. Or, in plain English: an even number of double quotes.
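To see what that means in practice, here is the expression applied to two made-up lines (keeping only the lines that match, i.e. the ones with an even number of double quotes):
$ printf '%s\n' 'a,"even",b' 'a,"odd,b' | sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/!d'
a,"even",b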
Reducing backtracking
Regular expression backtracking wisdom says you need to avoid optional matches (?, *, +). Rewriting this slightly to /^([^"]|"[^"]*")*$/ makes sense, both for clarity and fewer optionals.
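As a quick sanity check that the old and the new expression accept the same lines (the test strings and tests.txt are made up for this check, and the process substitution needs bash):
$ printf '%s\n' 'plain' 'a "b" c' 'odd " quote' 'x ""y"" z' >tests.txt
$ diff <(sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/!d' tests.txt) \
       <(sed -Ee '/^([^"]|"[^"]*")*$/!d' tests.txt) && echo identical
identical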
For speed it doesn't seem to matter though.
$ time sed -Ee '/^[^"]*("[^"]*"[^"]*)*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.682s  (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.665s  (all user time)
$ time sed -Ee '/^([^"]|"[^"]*")*$/d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m6.647s  (all user time)
The regular expression did not seem to matter, but the dataset did.
Here it's hard to tell whether it's slow because of backtracking, or because of something else in the regex. It's not the address code itself, because replacing the regex with /./ or /(.*)/ makes it fast. And when feeding it double quotes (the "hard" case), it was in fact faster (0m6.647s); likely because of the larger match between parentheses.
What we do observe, though, is that if we make it non-matching (looking for an uneven count of double quotes instead), it's suddenly fast for the A case:
$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-alpha1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m0.479s  (all user time)
My instinct says those two matches should take the same amount of time on a long string of A characters. But apparently I'm wrong.
The lots-of-double-quotes case is now (expectedly) slow, even though it doesn't match either.
$ time sed -Ee '/^([^"]|"[^"]*")*"[^"]*$/!d' many-dquotes1500.dat | md5sum
d41d8cd98f00b204e9800998ecf8427e  -
real    0m7.530s  (all user time)
grep and perlre
We take a quick detour using grep.
grep differentiates between our old and our new regular expression:
$ time grep -E '^[^"]*("[^"]*"[^"]*)*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m20.917s (all user time)
$ time grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m31.418s (all user time)
… but in a bad way: the better expression performs worse.
grep can also do perl regular expressions (perlre, with -P). Let's see how that compares:
$ time grep -E '^.*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.545s  (all user time)
$ time grep -P '^.*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.243s  (all user time)
A slight speedup using -P.
$ time grep -E '^[^"]*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m6.816s (all user time)
$ time grep -P '^[^"]*$' many-alpha1500.dat | md5sum fe2ba5adbb46ade7582528b4cd025f96 - real 0m0.244s (all user time)
Using grep tells us that even the simplest regular expressions can be slow with the old POSIX engine. Perl or other modern implementations will be slightly faster in trivial cases, and insanely faster in slightly more complicated cases:
$ time grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m31.385s  (all user time)
$ time grep -P '^([^"]|"[^"]*")*$' many-alpha1500.dat | md5sum
fe2ba5adbb46ade7582528b4cd025f96  -
real    0m0.755s  (all user time)
And no, before you ask, memory usage is only slightly higher with the perl engine.
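If you want to check that yourself, GNU time's -v flag reports peak memory; look for the "Maximum resident set size" line in its output (not shown here, and the numbers will differ per system):
$ /usr/bin/time -v grep -E '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null
$ /usr/bin/time -v grep -P '^([^"]|"[^"]*")*$' many-alpha1500.dat >/dev/null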
Conclusions
The conclusions we draw from the above are:
- you should test your regular expressions on exaggerated (larger) datasets instead of assuming, and
- before you try to optimize your sed regular expressions you may want to consider a different regular expression engine instead.
2025-03-15 - sed / remove line feeds / csv
Can you use sed(1) to remove line feeds?
Yes you can. Even though sed, the stream editor, is line based, you can use it to remove line feeds from output.
In this post we'll explore how that is done and look into how a slight refactoring step can provide a big speed gain. The task at hand is removing line feeds from a CSV file with multiline strings. Why? Because I want to use grep and awk on it, and multiline strings would complicate that a lot. Why sed? Because I wanted to know if it's as easy as I think it should be.
Let's dig in.
sample.csv
We'll start off with a small sample.csv that we can play around with, to get us going:
idx,name,year,summary,url 1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018 2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/ 3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO 4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-
As you can see, this CSV file has one header row and four CSV data rows. But because some rows are split over multiple lines, doing a simple grep, cut or awk will not work as intended.
Here's a feeble attempt to get all the URLs from the fifth column, by asking awk to split by comma:
$ awk -F, '{print $5}' sample.csv
url
https://www.osso.nl/blog/2019/recap2019/
-
As expected, awk does not do what we want, because it operates on lines and has no idea where a record/row begins.
Quick sidenote: using cut -d, -f5 sample.csv would produce different, and even more wrong, output. If a line does not have the requested fields, you can get earlier fields (or even the whole line) instead. See this example:
$ echo how | cut -d, -f2,3,4
how
$ echo how,are,you | cut -d, -f2,3,4
are,you
That's unexpected behaviour that awk avoids.
The goal
Selecting the URLs from the CSV sample would have worked fine, if it looked like this:
idx,name,year,summary,url 1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor Herefordshire (UK)",https://www.emfcamp.org/schedule/2018 2,"Legoland trip",2019,"Went to Billund for team building.",https://www.osso.nl/blog/2019/recap2019/ 3,"MCH2022 ""May Contain Hackers""",2022,"Nonprofit outdoors hacker camp. Location: Zeewolde (NL)",https://wiki.mch2022.org/Village:OSSO 4,"WHY2025 ""What Hackers Yearn""",2025,"Successor to MCH2022.",-
Here we have one header and four rows, and they add up to five lines. Selecting the fifth column is easy now:
$ awk -F, '{print $5}' goal.csv
url
https://www.emfcamp.org/schedule/2018
https://www.osso.nl/blog/2019/recap2019/
https://wiki.mch2022.org/Village:OSSO
-
As you may have noticed, a comma inside a CSV column/string would trip us up. Ignore that for now. You can argue that using regular unix utilities to work with CSV files is the wrong approach, but I think that often the command line is king. If you're working with new/varied data, using the command line to quickly explore and prototype is unbeatable. You can choose a real programming language later when you're done prototyping.
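For example, a quoted field that itself contains a comma (this row is made up, it is not in the sample) gets split in the wrong place:
$ echo '1,"hello, world",2025' | awk -F, '{print $3}'
 world"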
So, we have the input (sample.csv), and we have the output (goal.csv, md5sum d7a5eba9e0a80a1d8c1158a1430927a3). Can we coax sed into producing goal.csv? We can. (I used the md5sum to confirm that all the versions of goal.csv are equal.)
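That checksum is simply md5sum's output for the target file:
$ md5sum goal.csv
d7a5eba9e0a80a1d8c1158a1430927a3  goal.csv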
Attempt #1: making it work
Here's attempt one, which I have to admit took me longer to produce than I wanted. And I have to say ChatGPT was of no help at all. I can tell you its sed skills are abysmal.
#!/bin/sed -Ef
# Add line to hold space, clear pattern space, put hold space in pattern space.
H
z
x
# If pattern space contains an uneven number of double quotes, place
# everything back into the hold space and don't display.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  d
}
# If pattern space contains an even number of double quotes, we're done
# collecting. The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
/^[^"]*("[^"]*"[^"]*)*$/ {
  s/^\n//
  s/\n/ /g
}
That's my first working attempt:
$ ./attempt-1.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -
It works! But it has a couple of odd quirks.
Taking you through it one by one:
- sed has two buffers: the pattern buffer and the hold buffer. You're always working on the pattern buffer. The first H appends the input (the current pattern) to the hold buffer.
- The z zeroes the pattern buffer, the x swaps them, so now the pattern buffer has all the collected input so far.
- The /^[^"]*("[^"]*"[^"]*)*"[^"]*$/ regex checks whether there is an uneven number of double quotes: if it is uneven, the pattern buffer is copied back into the hold buffer (h), the line is deleted (not printed) and the cycle restarts.
- If we get to the next regex (/^[^"]*("[^"]*"[^"]*)*$/) we've confirmed that the quotes are even. We remove the first linefeed (which got added by the very first H; a tiny demo of that stray linefeed follows below this list) and replace the rest with a single space.
- If we get here, we're not replacing the pattern buffer into the hold buffer (using h), so the next cycle starts clean.
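Here's the promised demonstration of where that stray leading linefeed comes from; the echoed string is arbitrary, and GNU sed's l command prints the pattern space unambiguously:
$ echo hello | sed -nE 'H;z;x;l'
\nhello$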
So. That works. And after looking more into the sed info pages, it turns out that the d restarts the cycle. So that second regex match isn't even needed. It could instead just be:
# If we get here, the pattern space contains an even number of double quotes.
# The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
s/^\n//
s/\n/ /g
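Put together (just a sketch: the script from above with the second address and its braces dropped), the simplified first attempt would look like this:
#!/bin/sed -Ef
# Add line to hold space, clear pattern space, put hold space in pattern space.
H
z
x
# If pattern space contains an uneven number of double quotes, place
# everything back into the hold space and don't display. The d also
# restarts the cycle, so nothing below runs in that case.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  d
}
# If we get here, the pattern space contains an even number of double quotes.
# The first hold added a linefeed to void, remove it.
# Replace the others with a single space.
s/^\n//
s/\n/ /g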
Attempt #2: use only the pattern buffer
When working with real data, it turned out that this was rather slow. For a simple 1000 row, 1 MiB file, having the script take more than a second seemed inappropriate.
The thing that immediately stands out is the swapping between pattern and hold space. Step one is avoiding that.
#!/bin/sed -Ef
:x
# If pattern space contains an uneven number of double quotes, add more
# to the line and continue at x.
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  N
  bx
}
# Apparently we have an even number of double quotes. Remove LFs and
# print (by ending the cycle).
s/\n/ /g
This looks better already. We don't need the hold buffer at all.
- The N reads a new line and appends it to the pattern buffer.
- bx jumps back to the :x label.
- This repeats for as long as there is an uneven amount of double quotes in the pattern buffer.
- Once the double quotes are even, we end up on the last line.
- There we replace the line feeds with spaces and let the cycle restart with the automatic printing of the pattern buffer.
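On a contrived two-record input (these rows are made up; the first record spans two lines), it behaves like this:
$ printf '%s\n' '1,"line one' 'and two",x' '2,b,c' | ./attempt-2.sed
1,"line one and two",x
2,b,c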
Is it faster?
Well… according to a couple of sample runs with a 1.3 MiB dataset, maybe we go from 690ms to 670ms. Definitely not earth-shattering speed gains. What's next?
Attempt #3: stop doing double work
On to fix the real problem, the repetitive regular expression. The data set I'm working with at the moment looks like this:
$ wc dataset.csv
 23679 156039 1389021 dataset.csv
$ ./attempt-2.sed dataset.csv | wc
  1001 156039 1389021
A 1.3 MiB file, 1001 rows, 23679 lines. On average, that's roughly 24 lines per row and 1388 bytes per row.
That means that on average, it has to rerun the regular expression roughly 24 times before it doesn't match anymore (because the double quotes are even). And it has to do so on a line that grows to up to 1388 characters:
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at:
Uneven amount of quotes? Yes, 3. Restart.
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park
Still uneven amount of quotes? Yes, 3. Restart.
1,"Electromagnetic Field 2018",2018,"EMF2018 geek village in the summer at: Eastnor Castle Deer Park Eastnor
Still uneven amount of quotes? Yes, 3. Restart.
You can see how that's inefficient. It would be better if it just counted the double quotes on the latest input line.
Attempt #3 of the script fixes that:
#!/bin/sed -nEf
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  h
  n
  b inquote
}
b done

:inquote
/^[^"]*("[^"]*"[^"]*)*"[^"]*$/ {
  H
  x
  b done
}
H
n
b inquote

:done
s/\n/ /g
# sed -n and manual printing is needed,
# otherwise the 'n' will print the pattern
# space.
p
Admittedly, this looks a lot less like a sed oneliner, and more like an obscure programming language. But, oh boy, it is so much faster. With the same dataset, we go down to mere milliseconds.
It works like this:
- Find an uneven amount of double quotes. If not found, jump to the done label.
- At the done label, we replace the line feeds, and print the output. Restart the cycle.
- If we did find an uneven amount of double quotes, store the pattern in the hold space, get the next pattern, and go to the inquote label.
- At inquote we also look for an uneven amount of quotes. If not found, it means that we're not done (uneven + even = uneven). We append the pattern to the hold buffer using H and load up the next pattern with n. Go back to inquote.
- If we did find an uneven amount of double quotes, we're done (uneven + uneven = even). Append this line as well with H, and then exchange the hold and the pattern buffer with x so we can work on the entire line. (Using H;x is the same as x;G in this case.) Go to done (see above). A quick sanity check of the output follows after this list.
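As that sanity check (the post notes all versions were compared via md5sum), running this version on the small sample from earlier should reproduce the goal.csv checksum:
$ ./attempt-3.sed sample.csv | md5sum
d7a5eba9e0a80a1d8c1158a1430927a3  -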
Timings
For completeness' sake, here are some representative timing outputs:
$ time ./attempt-1.sed dataset.csv >/dev/null
real    0m0,696s  (all user time)

$ time ./attempt-2.sed dataset.csv >/dev/null
real    0m0,664s  (all user time)

$ time ./attempt-3.sed dataset.csv >/dev/null
real    0m0,033s  (all user time)
As you can see, that's much better.
In the next post, we'll go and see if we need to improve the regular expressions.