Ad hoc structured data

For some time, I’ve been looking for a way to build tiny data structures to pass around. As a minimum, it must support ints and strings, as well as nested lists. This is trivial in scripting languages, but less so in C++ code. And it must be efficient, i.e. low code and CPU overhead, since I’ll use this in µCs.

Example in C++#

With C++ 17 and its templates + initializer_list features, the notation can be made surprisingly convenient. Here’s a call to Jade’s emit which takes two arguments, an int and a “structured” value:

emit(0, {1, 2, "abc", {3, 4}, "def", 5, 6});

On the receiving end, i.e. inside emit, you get a Val object with a number of methods and casts:

void emit (int id, Val arg) {
    if (arg.isSeq())
        for (auto e : (Seq) arg)
            switch (e.tag()) {
                case e.NIL: ...
                case e.INT: ... (int) e ...
                case e.STR: ... (const char*) e ...
                case e.SEQ: ... (Seq) e ...
            }
    else
        ...
}

This example assumes that a sequence is passed in, but emit could also be written to accept scalars, i.e. integers and string pointers. For many simple scenarios this might be enough, and would avoid the (slight) overhead of constructing a sequence in the caller.

Playing tricks inside#

Note that Val is a hybrid type: it can hold different types which can then be extracted via static typecasts. I tend to use old-fashioned (type) notation for this, but that’s just personal preference.

Under the hood, Val uses some very nasty tricks to squeeze any type into a 32-bit word (or more accurately: whichever size is needed to represent a pointer). This leads to some tricky constraints:

  • integers are signed, but “boxed”: they have only 31-bit precision on 32-bit systems
  • string pointers are manipulated slightly, which requires that they have a few bits which are always the same (these get re-used)
  • sequences are represented by pointers into a special memory “arena”, and these are always word-aligned (which allows playing tricks with the lower 2 bits)

Fortunately, this all works fine on 32-bit ARM Cortex and on 64-bit MacOS & Linux. And the benefit of it all is that passing Val instances around is just as efficient as passing ints or string pointers.

Sequences#

One additional trick is needed to manage sequences: these obviously require more than one word to represent all its elements (each is again a Val). The approach used is to place sequences in a special fixed area of memory. This “arena” is treated as a stack: new sequences are stacked on top of existing ones, and when the Sequence goes out of scope, its memory can be re-used. With the limited testing I’ve done so far, it all works well and needs only a limited amount of space. But there are still two major concerns with this:

  1. Sequences are short-lived. In the above code, any sequence built up in the caller of emit will remain intact during the call to emit, but it will disappear the moment emit returns. As a consequence, you can’t hold onto a Val inside emit, i.e. store it into a variable which outlives the call.

  2. I’m not yet convinced that cleanup always happens at the right time. There is a very subtle interaction between Val and Seq types: a Val can be cast to a Seq (when ifSeq() is true), but doing so may lead to cleanup sooner than expected. There is logic to avoid this for properly nested parent-child instances, but again … maybe the current code is not robust enough to deal with all use cases. If this becomes a problem, some explicit locking/unlocking may have to be added on top of all this.

This has been implemented in the Jade project I’m working on. It has been very simple to use in C++. Performance is fine: 125 instructions to pass an int to emit on a Cortex M4 (including call overhead).

Message buffers#

Since sequences are so short-lived, their use is somewhat limited. They are fine for passing information from a caller to a callee, but that’s about it. You can’t save a Val (or a Seq) for later, nor send it across a byte stream, such as a serial port. But no worries … there’s also a solution for that.

Two functions have been implemented which convert any Val into a set of bytes and back:

size_t val2msg (Val val, uint8_t* buf, size_t len);
Val msg2val (const uint8_t* ptr);

With val2msg, you pass it a Val and a buffer to store the byte representation. The return value is the number of bytes needed for the result. If the buffer is too small, the call will have to be repeated with a larger buffer. To determine how many bytes it would take, simply pass in a null pointer + zero length.

The reverse process is simpler: pass a pointer to msg2val, and you get a Val back with everything decoded, including nested sequences. Again, the lifetime of the returned Val is limited, but it will be around long enough to pass into a call to emit, for example. Note that msg2val does not need a byte count, it knows from the data where the message ends.

Message buffers use a compact encoding for integers and sequences. Zero-terminated strings are inserted as is. Internally, this encoding is based on “VarInts”, using 1 to 5 bytes to encode any signed 32-bit integer. Here is some structured data with the matching 16-byte message:

{1,"x y z",2,{3,4},abc,5}   <=>   B2849B782079207A0088928C9001B694

Note that these messages can contain arbitrary 8-bit data, including null bytes.

Text representation#

For debugging, there are also two functions to see the contents of a Val as text, and to convert a text representation into a Seq:

void dump (Val val);
Seq parseSeq (std::string_view const& sv);

The dump function sends the text result to printf. In Jade’s test code there is also a dumps() function which returns a std::string.

The parseSeq function takes a std::string_view and parses it. The input string is parsed as a sequence of values, even if there is only one, which is why the result is a Seq instead of a Val. By using a string_view, it can parse text which is not null-byte terminated, but parseSeq will happily take a plain const char* as input and let std::string_view automatically determine its length.

Status#

This design + implementation is still very new and largely untested. It’s all part of the Jade project and can be used as single-header file. For now, the code lives in the git repository on SourceHut, but that will probably change later, as things evolve further. The “Val + arena + messages” described above do not depend on the rest of Jade (it’s in fact the opposite: the Jade dataflow engine relies on this code).