Threads vs Async I/O

March 3, 2024
Musings

It was a mistake in JeeH to call something a Task when it really is a Thread, so I’ve decided to rename them everywhere in the code and in the documentation. Threads are separate execution contexts, each with their own stack. And that’s what’s in JeeH right now.

Threads #

Threads are a bit of a double-edged sword: yes, you can nicely modularise different parts of an application this way, especially in the context of a µC where lots of external events and interrupts are going on. But on a single-CPU system (i.e. all but the most high-end ARM Cortex µCs) they still run one at a time: there is no real parallelism going on. The inconvenience of threads is that each one needs its own stack area, sized for its peak memory use, even though most of that space sits idle most of the time. That can be quite wasteful, especially on a RAM-constrained µC.

Another drawback of threads is that switching contexts between them means storing and re-loading the entire CPU state each time. With floating point, that ends up needing dozens of 32-bit words of that scarce stack space: on a Cortex-M4F, roughly 17 integer registers plus 33 floating-point registers, i.e. some 50 words or 200 bytes per switch.

Alternatives #

One alternative is to use Finite State Machines (FSMs) for each independent activity: when an event or IRQ (or in JeeH: a message) comes in, transition the FSM from one state to the next, adjusting variables (i.e. saved state) to track the work. This usually requires a complete rewrite (and re-think) of the code. The goal here is to get rid of all stack state, so that each transition to a different activity is in fact a return to the caller. The consequence: when there is no state on the stack, there is no reason to save and switch stacks.

An FSM can be implemented as an object with one method, called for each new event / IRQ. That method can then do anything except waiting or blocking. You can start things, but you can’t wait for their outcome: you have to return and act on the next incoming state transition.
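
As a minimal sketch of that shape (made-up names, not JeeH code): an FSM which collects serial input into lines, one call per incoming byte:

struct LineCollector {
    enum { Idle, Collecting } state = Idle;
    char buf [80];
    int fill = 0;

    // one call per event: always returns quickly, never blocks
    void onEvent (char ch) {
        switch (state) {
            case Idle:
                fill = 0;
                state = Collecting;
                [[fallthrough]];
            case Collecting:
                if (ch == '\n') {
                    buf[fill] = 0;
                    // a full line is ready: start processing it here,
                    // but don't wait for the outcome
                    state = Idle;
                } else if (fill < (int) sizeof buf - 1)
                    buf[fill++] = ch;
                break;
        }
    }
};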

Event-driven Programming is more or less the same thing: the software state is not kept on the stack between events, so again it’s just lots of calls with subsequent returns. You can sometimes nest the event loop to try and trick things, but the result is a tangled mix of stack frames which can’t easily be unwound.

A few years ago, I experimented with another approach in the Monty project and called it Fibers: instead of just returning, move all stack state to a temporary area and then return. On re-entry: move the temporary copy back onto the stack, and resume. Based on C’s setjmp and longjmp functions, it all worked nicely up to a point. But moving data (especially C++ objects) around without telling them was a hairy business, with lots of potential surprises and pitfalls.
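
For reference, here is the bare setjmp / longjmp mechanism which Fibers built on, i.e. just the non-local jump, without the stack copying (a generic illustration, not Monty code):

#include <csetjmp>
#include <cstdio>

// setjmp saves the CPU context, longjmp later returns to that exact point
static std::jmp_buf env;

static void activity () {
    std::puts("inside activity");
    std::longjmp(env, 1);               // jump back to the setjmp call site
}

int main () {
    if (setjmp(env) == 0)               // returns 0 when saving the context
        activity();                     // never returns normally
    else
        std::puts("resumed after longjmp"); // longjmp lands here, with 1
}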

Yet another approach is with coroutines: these get called more or less like normal functions, but instead of returning they yield and will resume from there later on (when called again). Coroutines still need their own stack space to carry their state from a yield to the next call, but the interaction can be much more efficient than a full context switch between threads.

In Python there are iterators and generators which are coroutine-like, but do not keep any state on the stack. You get the performance of coroutines without needing separate stacks for them. These mechanisms are essentially the same as FSMs with very limited state which is then kept off-stack.
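
The same idea translates directly to C++ (a made-up sketch): a “generator” whose only state lives in its fields, with each call yielding the next value:

#include <cstdio>

struct Countdown {
    int remaining;
    Countdown (int n) : remaining (n) {}
    bool done () const { return remaining <= 0; }
    int next () { return remaining--; }     // each call "yields" one value
};

int main () {
    Countdown c (3);
    while (!c.done())
        printf("%d\n", c.next());           // prints 3, 2, 1
}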

And then there is the async/await model, which seems to get all the attention these days. In Python this is implemented more or less the same as coroutines, i.e. with an allocated stack kept around while suspended. But in languages such as Go, Clojure, and Rust, my understanding is that it is all implemented quite differently: under the hood, these systems convert the code into an equivalent FSM, and then keep the state in a dedicated area, off the stack. Again: no stacks to allocate, but under the hood you still need to keep some state around somewhere.

Introducing: Tasks #

I’m not willing to switch to a different programming language. C++ serves me well and allows me to stay in full control. There is already a new coroutine mechanism in C++20, but it’s too complex and immature for my tastes, and I’m not sure it’ll be that convenient and concise to use in the end.
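
To illustrate: even a minimal hand-resumable coroutine in C++20 needs roughly this much scaffolding (generic C++20, not JeeH code, and all names here are made up):

#include <coroutine>

// even this minimal resumable type needs a full promise_type to exist
struct Resumable {
    struct promise_type {
        Resumable get_return_object () {
            return Resumable (std::coroutine_handle<promise_type>::from_promise(*this));
        }
        std::suspend_always initial_suspend () noexcept { return {}; }
        std::suspend_always final_suspend () noexcept { return {}; }
        void return_void () {}
        void unhandled_exception () {}
    };

    explicit Resumable (std::coroutine_handle<promise_type> h) : handle (h) {}
    ~Resumable () { if (handle) handle.destroy(); }

    bool resume () {                        // run up to the next suspension
        if (!handle.done())
            handle.resume();
        return !handle.done();
    }

    std::coroutine_handle<promise_type> handle;
};

Resumable blinker () {
    for (;;) {
        // ... toggle a pin here ...
        co_await std::suspend_always {};    // suspend; resume() continues here
    }
}

All that scaffolding buys one blinker which can be suspended and resumed by hand - it works, but it’s easy to see why this doesn’t feel concise yet.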

I do want to have a more lightweight alternative to threads, something between (what is now called) a Thread and a Driver in JeeH. Both interoperate using messages, but drivers are designed to wrap interrupt handlers and will switch to a special atomic “handler” mode while running.

There are essentially two main styles of writing threaded code in JeeH: either you go “all-syncy” by using Sys::call as a blocking interface to issue requests and wait for their replies, or you go “event-loopy”, with one top-level loop which starts with a blocking Sys::recv call and then processes each incoming message before looping around to repeat the cycle.
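
In sketch form, the event-loopy style looks like this (handleMessage and the exact Sys::recv details are assumptions, for illustration only):

void mainLoop () {
    while (true) {
        Message msg;
        Sys::recv(msg);         // block until the next message arrives
        handleMessage(msg);     // deal with it, then loop around
    }
}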

That second approach is only one step away from becoming an FSM and requiring no stack: keep all state in some C++ object, with one method to “process” each message. The top-level loop can then be taken out altogether.

I’ve started to implement this, with a new type which I’m calling a Task. Here is a minimal example to blink an LED - there can be many of these “running” at the same time:

struct Toggler : Task {
    Pin pin;
    int ms;

    Toggler (char const* p, int t) : pin (p), ms (t) {
        pin.mode("P");      // set the pin mode (push-pull output)
    }

    int process (Message&) override {
        pin.toggle();       // runs on each incoming message, must not block
        return ms;          // > 0: (re)start a timer for this many ms
    }
};

The basic idea is working. There are some new conventions involved - here’s how to use it:

auto p = new Toggler ("A5", 250);
p->init();

The init call sends a special first message (mTag: 'I') to the task. A return value > 0 can be used to start a timer. After that, this code keeps going with a 4 Hz toggle rate, i.e. 2 blinks/sec.
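
Building on that convention, the process method can also distinguish the initial message from later timer ticks - a sketch, with the mTag access details being an assumption:

int process (Message& m) override {
    if (m.mTag == 'I') {
        // first message, sent by init(): do one-time setup here
    }
    pin.toggle();
    return ms;              // keep the periodic timer going
}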

Net is now a task #

The next step is to see if the net thread can be turned into a task - spoiler: yep, it works!

Though not obvious at first, turning net’s code into an FSM-like loop turned out to be simple. But there is some fairly intricate logic under the hood, in particular due to these issues:

  • sending a message to a task can be handled by giving each task its own destination id
    • these ids are distinct from thread and device ids
  • the send system call runs in handler mode, the wrong context for “calling” a task method
    • solution: split the send logic into part-thread / part-handler code
  • tasks may send messages to anything: threads, devices, and other tasks, but a device can’t reply to a task, as the task is not “running” in the same way as threads: it’s borrowing its caller’s stack (and only while it runs)
    • solution: send the reply to the task owner’s thread, i.e. the one which called init()
  • replies have no place to store the destination task or thread id: the mDst field is reset to the reply sender by the time the message comes back
    • solution: give each task its own queue, and send the task as reply instead
  • picking up replies from a task queue needs atomic access (tasks don’t use handler mode)
    • solution: add a new internal system call to provide an “atomic pull” (see the sketch after this list)
  • when a thread receives such replies, it can’t be in handler mode (same as with sends)
    • solution: also split the recv system call into part-thread / part-handler code
  • lastly, there is some nasty complexity due to call’s blocking implementation
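
As an illustration of that atomic pull (a generic Cortex-M sketch using CMSIS intrinsics - the queue type and all names are made up, this is not JeeH’s actual implementation):

Message* atomicPull (MessageQueue& q) {
    uint32_t saved = __get_PRIMASK();   // remember the interrupt state
    __disable_irq();                    // short critical section: no handler
    Message* m = q.pull();              //   can interleave with the dequeue
    __set_PRIMASK(saved);               // restore, re-enabling if needed
    return m;
}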

But the result is a working system. All network protocol processing now runs as a task and will use its caller’s (i.e. msg-sender’s) stack. In short: net now implements an async I/O design!

The price to pay is that net is never allowed to wait or block on anything: no Sys::recv, Sys::call, or Sys::wait, nor things like Sys::fork or Sys::quit. Unfortunately, that also rules out printf when it sits on top of the DMA-based UART driver, since that performs a blocking Sys::call internally. Debugging output can still be done with JeeH’s logf, which uses ARM’s polled ITM channel.

All in all, tasks add quite some (internal) complexity to JeeH. The benefits I hope to see are:

  • no dedicated stacks, i.e. better use and sharing of memory
  • far less overhead: no context switches, no system call stack saves
  • no thread <=> handler mode switches (except for driver requests and replies)
  • threads are still available when needed, tasks just offer an extra option

There are still (some) bugs and (many) loose ends, but hey … it’s a start!