March 10, 2025
MarkBernstein.org
 

Refactoring

Last weekend, Detlef Beyer hosted a terrific Tinderbox meetup to demonstrate his integration of Tinderbox with a variety of large language models.

The previous day, Detlef had run into a particular JSON reply that crashed Tinderbox when Tinderbox tried to parse it. The problem arose from quoted strings that contained new emoji from Unicode’s Supplementary Multilingual Plane. This is a group of 65,536 potential code points (not all are currently assigned) that represent characters you seldom encounter: cuneiform, Linear A, Mayan numerals, and many recent emoji.

Tinderbox 1 didn’t support Unicode, because Unicode was not then in widespread use. We started to get serious about Unicode in Tinderbox 4, and Tinderbox 6 was already pretty good at Unicode. Unfortunately, that Supplementary Multilingual Plane can cause headaches.

One of the core classes in Tinderbox is Parser, which provides services for parsing actions, parsing export templates, parsing JSON (from Web services) and RIS (from reference managers), and lots of other little chores. My recent(ish) fix moved the internals of Parser to be Unicode-aware. Each parser needs explicitly to adopt the new way of doing things: the parts of the system that predate Unicode also predate test-driven development!

I think it’s time to make sure that everything has in fact adopted the new way. It’s a big refactoring, because it tends to ramify in surprising ways, and also because the New Way changes the architecture significantly. The core problem is that Parser, in the nature of things, reads character by character, and it thinks that a character is a unichar: a single 16-bit UTF-16 code unit.

unichar Parser::Get()
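To see why a 16-bit return type is a problem, here is a minimal sketch. It is not Tinderbox code: `unichar` is modeled as `char16_t`, and `IsSurrogate` and `CountSurrogateHalves` are hypothetical helpers invented for illustration. It counts how many of the units handed back by a unit-at-a-time reader are really halves of a surrogate pair rather than whole characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for this sketch: unichar is a 16-bit UTF-16 code unit.
using unichar = char16_t;

// True if this unit is one half of a surrogate pair, not a whole character.
constexpr bool IsSurrogate(unichar u) {
    return u >= 0xD800 && u <= 0xDFFF;
}

// Walks the text one unichar at a time, as a 16-bit reader would,
// and counts the units that are only half of a character.
std::size_t CountSurrogateHalves(const std::u16string& text) {
    std::size_t halves = 0;
    for (unichar u : text) {
        if (IsSurrogate(u)) {
            ++halves;
        }
    }
    return halves;
}
```

For a quoted string holding U+1F600 (a grinning-face emoji), `u"\"\U0001F600\""` is four units long: the two quotation marks plus the pair 0xD83D, 0xDE00. A reader that treats each unit as a character sees two things that are not characters at all.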

Our supplementary plane contains unusual characters that don’t fit in a unichar! We need two unichars, or four UTF-8 bytes, to hold each of them. So, where we used to return a plain old unichar, now we return a bundle that might contain a unichar or might contain the longer character code:

TbxCharacter Parser::Get()

The upside of this is that, once we are working with explicit TbxCharacters, we can make them do more work. For example, we could have methods that ask the character whether it's a backslash, or a quotation mark.
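As a rough illustration of the idea, and not the actual TbxCharacter (whose interface isn't shown here), the sketch below uses a hypothetical `SketchCharacter` that stores a full code point, plus a hypothetical free function `GetCharacter` that combines a surrogate pair into one character when it finds one. The method names `IsBackslash`, `IsQuote`, and `IsSupplementary` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a character bundle; holds one whole code point.
class SketchCharacter {
public:
    explicit SketchCharacter(char32_t codePoint) : cp_(codePoint) {}

    char32_t CodePoint() const { return cp_; }
    bool IsBackslash() const { return cp_ == U'\\'; }
    bool IsQuote() const { return cp_ == U'"'; }
    // True for characters beyond the Basic Multilingual Plane.
    bool IsSupplementary() const { return cp_ > 0xFFFF; }

private:
    char32_t cp_;
};

// Reads one character starting at pos and advances pos past it.
// A high surrogate followed by a low surrogate becomes one code point.
SketchCharacter GetCharacter(const std::u16string& text, std::size_t& pos) {
    assert(pos < text.size());
    const char16_t first = text[pos++];
    const bool isHigh = first >= 0xD800 && first <= 0xDBFF;
    if (isHigh && pos < text.size()) {
        const char16_t second = text[pos];
        if (second >= 0xDC00 && second <= 0xDFFF) {
            ++pos;
            const char32_t hi = static_cast<char32_t>(first) - 0xD800;
            const char32_t lo = static_cast<char32_t>(second) - 0xDC00;
            return SketchCharacter(0x10000 + (hi << 10) + lo);
        }
    }
    // An ordinary 16-bit character, or an unpaired surrogate passed through as-is.
    return SketchCharacter(first);
}
```

With this shape, a caller scanning a JSON string can ask each character `IsQuote()` or `IsBackslash()` without caring how many code units it occupied. Passing unpaired surrogates through unchanged is one possible policy; a real parser might prefer to report an error.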

I usually write about refactorings, if I write about them at all, only after they’ve succeeded. I thought it might be interesting to write about this one beforehand, and then see how things turn out.


Update: Two very long days later, the refactoring is mostly done. It was a bear; Monday evening, I left the office with a broken build. I almost never do that (I’d estimate perhaps once every two or three years), but dinner was urgent and something was breaking 132 tests. I managed to restore sanity on Tuesday morning by backing off some recent changes that I’d made with too much confidence; as usual, straying from conservative test-driven work had led me astray.

A nice email arrived from a user, reporting that they, too, had experienced this crash and that they appreciated the imminent fix.