Threads vs Async I/O
March 3, 2024
It was a mistake in JeeH to call something a Task when it really is a Thread, so I’ve decided to rename them everywhere in the code and in the documentation. Threads are separate execution contexts, each with their own stack. And that’s what’s in JeeH right now.
Threads #
Threads are a bit of a double-edged sword: yes, you can nicely modularise different parts of an application this way, especially in the context of a µC where lots of external events and interrupts are going on. But on a single-CPU system (i.e. all but the most extensive ARM Cortex µCs) they still run one at a time: there is no real parallelism going on. The inconvenience of threads is that each one needs its own stack area, sized for its peak memory needs, even though most of that space sits unused most of the time. That can be quite wasteful, especially on a RAM-constrained µC.
Another drawback of threads is that switching contexts between them leads to storing and re-loading the entire CPU state each time. With floating point, that ends up needing dozens of 32-bit words of that scarce stack space.
Alternatives #
One alternative is to use Finite State Machines (FSMs) for each independent activity: when an event or IRQ (or in JeeH: a message) comes in, transition the FSM from one state to the next, adjusting variables (i.e. saved state) to track the work. This usually requires a complete rewrite (and re-think) of the code. The goal here is to get rid of all stack state, so that each transition to a different activity is in fact a return to the caller. The consequence: when there is no state on the stack, there is no reason to save and switch stacks.
An FSM can be implemented as an object with one method, called for each new event / IRQ. That method can then do anything except waiting or blocking. You can start things, but you can’t wait for their outcome: you have to return and act on the next incoming state transition.
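As a stand-alone sketch of this one-object, one-method pattern (the names here are mine, not JeeH’s - `Event` merely stands in for a message type):

```cpp
// Hypothetical event type, standing in for an incoming message or IRQ.
struct Event { int tag; };

// A blinker FSM: all state lives in member variables, none on the stack.
struct BlinkFsm {
    enum State { Off, On };
    State state = Off;
    int toggles = 0;

    // One method, called for each incoming event - it must never block.
    void process (Event const& e) {
        if (e.tag == 'T') {        // timer tick: flip the LED state
            state = state == Off ? On : Off;
            ++toggles;
        }
        // simply return to the caller: no stack state survives between events
    }
};
```

Each call runs to completion and returns, so the caller’s stack can be reused for whatever comes next.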
Event-driven Programming is more or less the same thing: the software state is not kept on the stack between events, so again it’s just lots of calls with subsequent returns. You can sometimes nest the event loop to try and trick things, but the result is a huge mix of stack frames which can’t easily be unrolled.
A few years ago, I experimented with another approach in the Monty project and called it Fibers: instead of returning, move all stack state to a temporary area and return. On re-entry: move the temporary copy back onto the stack, and resume. Based on C’s setjmp and longjmp functions, it all worked nicely up to a point. But moving data (especially C++ objects) around without telling them was a hairy business, with lots of potential surprises and pitfalls.
Yet another approach is with coroutines: these get called more or less like normal functions, but instead of returning they yield and will resume from there later on (when called again). Coroutines still need their own stack space to carry their state from a yield to the next call, but the interaction can be much more efficient than a full context switch between threads.
In Python there are iterators and generators which are coroutine-like, but do not keep any state on the stack. You get the performance of coroutines without needing separate stacks for them. These mechanisms are essentially the same as FSMs with very limited state which is then kept off-stack.
And then there is the async/await model, which seems to get all the attention these days. In Python this is implemented more or less the same as coroutines, i.e. with an allocated stack kept around while suspended. But in languages such as Go, Clojure, and Rust 1, my understanding is that it is all implemented quite differently: under the hood, these systems convert the code into an equivalent FSM, and then keep the state in a dedicated area, off the stack. Again: no stacks to allocate, but under the hood you still need to keep some state around somewhere.
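That “convert the code into an FSM” step can be illustrated by hand. Below is a tiny hypothetical async function with a single await point, written out as the state machine a compiler might generate: the suspension point becomes a state, and any locals that live across it become fields (all names are mine, for illustration only):

```cpp
// Hand-rolled FSM equivalent of: "await an input value, then double it".
// The suspension point becomes an enum state; the "local" result that
// must survive the suspension becomes a member field, kept off-stack.
struct DoubleLater {
    enum State { WaitingForInput, Done };
    State state = WaitingForInput;
    int result = 0;

    // resume() is called when the awaited value becomes available;
    // it returns true once the "async function" has run to completion.
    bool resume (int input) {
        switch (state) {
        case WaitingForInput:
            result = input * 2;    // the code "after the await"
            state = Done;
            return true;
        case Done:
            return true;           // already finished, nothing to do
        }
        return false;
    }
};
```

Between calls to `resume`, all that exists is this small struct - no stack needs to be kept alive for the suspended activity.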
Introducing: Tasks #
I’m not willing to switch to a different programming language. C++ serves me well and allows me to stay in full control. There is already a new coroutine mechanism in C++20 2, but it’s too complex and immature for my tastes, and I’m not sure it’ll be that convenient and concise to use in the end.
I do want to have a more lightweight alternative for threads, between (what is now called) a Thread and a Driver in JeeH. Both interoperate using messages, but drivers are designed to wrap interrupt handlers and will switch to a special atomic “handler” mode while running.
There are essentially two main styles of writing threaded code in JeeH: either you go “all-syncy” by using Sys::call as a blocking interface to issue requests and wait for their replies, or you go “event-loopy”, with one top-level loop which starts with a blocking Sys::recv call and then processes each incoming message before looping around to repeat the cycle.
That second approach is one step away from becoming an FSM and requiring no stack: keep all state in some C++ object, with one method to “process” each message. By doing this, the top-level loop can then be taken out.
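To make that transformation concrete, here is a small stand-alone simulation of the idea - a `std::queue` stands in for the message mailbox, and all names apart from the pattern itself are mine, not JeeH’s:

```cpp
#include <queue>

// Minimal stand-in for a JeeH-style message.
struct Message { int mTag; };

// The "state in a C++ object, one method per message" shape:
// all durable state is a member, nothing lives on the stack between calls.
struct Counter {
    int count = 0;
    void process (Message const& m) { if (m.mTag == '+') ++count; }
};

// The former top-level receive loop, now driven from the outside:
// whoever owns the queue calls process() once per message, so no
// dedicated stack has to survive between messages.
inline void drain (Counter& c, std::queue<Message>& q) {
    while (!q.empty()) {
        c.process(q.front());
        q.pop();
    }
}
```

The thread’s `while (true) { Sys::recv(...); ... }` loop has moved out of the object entirely; only the per-message handler remains.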
I’ve started to implement this, with a new type which I’m calling a Task. Here is a minimal example to blink an LED, and of which there can be many “running” at the same time:
```cpp
struct Toggler : Task {
    Pin pin;
    int ms;

    Toggler (char const* p, int t) : pin (p), ms (t) {
        pin.mode("P");
    }

    int process (Message&) override {
        pin.toggle();
        return ms;
    }
};
```
The basic idea is working. There are some new conventions involved - here’s how to use it:
```cpp
auto p = new Toggler ("A5", 250);
p->init();
```
The init call sends a special first message (mTag: 'I') to the task. A return value > 0 can be used to start a timer. After that, this code keeps going with a 4 Hz toggle rate, i.e. 2 blinks/sec.
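These two conventions - the special 'I' message and the return-value-as-timer rule - can be mimicked in a few lines. This is a simplified stand-in, not JeeH’s actual Task implementation; only those two rules come from the text above, everything else is made up for illustration:

```cpp
struct Message { int mTag; };

// Simplified Task shape: init() delivers a first 'I' message, and a
// process() result > 0 is interpreted as "start a timer for that many ms".
struct MiniTask {
    int timerMs = 0;                  // pending timer request, 0 = none

    virtual int process (Message& m) = 0;
    virtual ~MiniTask () = default;

    void init () {
        Message m {'I'};
        int r = process(m);           // special first message
        if (r > 0)
            timerMs = r;              // a real scheduler would arm a timer
    }
};

struct MiniToggler : MiniTask {
    int toggles = 0, ms;
    MiniToggler (int t) : ms (t) {}
    int process (Message&) override { ++toggles; return ms; }
};
```

After `init()`, each (simulated) timer expiry would simply call `process` again, repeating the cycle.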
Net is now a task #
The next step is to see if the net thread can be turned into a task - spoiler: yep, it works!
Though not obvious at first, turning net’s code into an FSM-like loop turned out to be simple. But there is some fairly intricate logic under the hood, in particular due to these issues:
- sending a message to a task can be handled by giving each task its own destination id - these id’s are distinct from thread and device id’s
- the send system call runs in handler mode, the wrong context for “calling” a task method
  - solution: split the send logic into part-thread / part-handler code
- tasks may send messages to anything: threads, devices, and other tasks, but a device can’t reply to a task, as the task is not “running” in the same way as threads: it’s borrowing its caller’s stack (and only while it runs)
  - solution: send the reply to the task owner’s thread, i.e. the one which called init()
- replies do not have a place to store the destination task or thread id: the mDst field is reset to the reply sender by the time the message comes back
  - solution: give each task its own queue, and send the reply to the task instead
- picking up replies from a task queue needs atomic access (tasks don’t use handler mode)
  - solution: add a new internal system call to provide an “atomic pull”
- when a thread receives such replies, it can’t be in handler mode (same as with sends)
  - solution: also split the recv system call into part-thread / part-handler code
- lastly, there is some nasty complexity due to call’s blocking implementation
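The “atomic pull” can be pictured as a critical section around a queue pop. On a Cortex µC the critical section would likely disable interrupts; in this host-side sketch a mutex plays that role, and all names are mine rather than JeeH’s:

```cpp
#include <mutex>
#include <deque>
#include <optional>

struct Message { int mTag; };

// Sketch of an "atomic pull" from a per-task reply queue. On a Cortex-M
// target the lock would be an interrupt-disable section instead of a mutex.
struct ReplyQueue {
    std::deque<Message> items;
    std::mutex guard;

    void push (Message m) {
        std::lock_guard<std::mutex> lock (guard);
        items.push_back(m);
    }

    // pull one message atomically, or nothing if the queue is empty
    std::optional<Message> pull () {
        std::lock_guard<std::mutex> lock (guard);
        if (items.empty())
            return {};
        Message m = items.front();
        items.pop_front();
        return m;
    }
};
```

The point is that a task, which never enters handler mode, can still safely consume replies that a driver pushed from interrupt context.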
But the result is a working system. All network protocol processing now runs as a task and will use its caller’s (i.e. msg-sender’s) stack. In short: net now implements an async I/O design!
The price to pay is that net is never allowed to wait or block on anything: no Sys::recv, Sys::call, or Sys::wait, nor things like Sys::fork or Sys::quit. Unfortunately, that also rules out calling printf when it uses the underlying DMA-based UART, which does a call. Debugging output can still be done using JeeH’s logf, using ARM’s polled ITM channel.
All in all, tasks add quite some (internal) complexity to JeeH. The benefits I hope to see are:
- no dedicated stacks, i.e. better use and sharing of memory
- far less overhead: no context switches, no system call stack saves
- no thread <=> handler mode switches (except for driver requests and replies)
- threads are still available when needed; tasks just offer an extra option
There are still (some) bugs and (many) loose ends, but hey … it’s a start!
1. For a great overview of async/await in Rust, see: https://os.phil-opp.com/async-await/ ↩︎
2. See https://en.cppreference.com/w/cpp/language/coroutines. ↩︎