Linux CGroups: Subsystems as Modules: atomic all-nighters

exposition

In an environment like the Linux kernel, the many ways to deal with concurrency cause a bunch of restrictions on what sorts of things you can do at certain points in your code. The concurrency primitives are, briefly:

mutexes/semaphores: high-level blocking locks, will deschedule you if you need to wait to get it. Very commonly used in many subsystems, to protect data structures, and so on (popular example: tsk->mmap_sem).
spinlocks/RCU: low-level "atomic context" locks, will disable interrupts and spin-wait (in the SMP case). Shouldn't be held around "very expensive" operations (at the very least because another CPU may be spin-waiting on you!). These are needed for, among other reasons, accessing things from an interrupt handler, wherein rescheduling is already not permitted. The global tasklist_lock is a spin-lock, as are the locks on each task_struct (task_lock(tsk)).
There are calls such as yield() and schedule() (and sleep(), ewwww) which explicitly invoke the scheduler, and are sometimes used in open-coded synchronisation dances (to varying stylistic quality) (examples: A and B).
Also of note is the dynamic memory allocator, kmalloc(size_t size, gfp_t flags). The flags argument controls the behaviour of the allocator - the two most common choices:
- GFP_KERNEL: indicates that it's okay to wait and reschedule if necessary
- GFP_ATOMIC: "please satisfy this request immediately", and therefore more likely to fail - incidentally, this may also allocate from the kernel's "emergency memory pool", under the assumption that it should end up freed quickly.
There is also vmalloc(...), which allocates through the VM, and is therefore even more reliable especially for very large contiguous chunks of memory - but also always may sleep in order to complete the request.

You might by this point have figured out the rule, which is that you're not allowed to invoke the scheduler from within an atomic context (i.e. holding a spin-lock / interrupts are disabled / etc), or the world will explode. As a consequence, if you want to lock something while holding another spin-lock, it had better also be protected by a spin-lock (or by RCU); if you want to reliably dynamically allocate memory while juggling a bunch of locks, you'd better not do it while holding anything lower-level than a mutex or semaphore.

There were a bunch of places along the way of implementing the cgroups threadgroup interface patches where I had to tread carefully around my concurrency primitives to not break this rule. There's a macro in linux called "might_sleep()" which you can drop in your scheduler-invoking code as a debugging measure, and it will do a runtime check that everything is OK when it gets hit (there's even a "might_sleep_if(...)" version). But (a) the developer has to think to put it there, and developers are unreliable, and (b) it's a runtime check for something that seems like a static property of the code.

an idea

So it occurred to me that maybe there should be machinery as part of a language's tool-chain that could automatically verify whether code respects the contexts that functions are supposed to run in. It could even be entirely separate from the compiler itself (I was thinking that it might not be too hard to extend my 411 compiler to do it, though all the changes would be in my partner's code, not mine, so). The framework might look something like this:

Independent of the code, a definition of the various contexts that are important to track (and the rules governing their interactions). Here, an individual function could be "MIGHT_SLEEP" or "DOESNT_SLEEP", and call sites (or, regions of code in general) will have a property like "MAY_SLEEP" or "MUSTNT_SLEEP". If a MIGHT_SLEEP function is called from a MUSTNT_SLEEP line of code, print an error message; all other cases are okay. Functions may also change the context at their call sites, so transitions may be specified.
Explicitly annotate those functions which affect the code context:
- schedule(), as the central function that everything sleeping goes through, should be marked MIGHT_SLEEP.
- preempt_disable(), which spinlocks and RCU et al. invoke, should transition the call site from MAY_SLEEP to MUSTNT_SLEEP, and preempt_enable(), its dual, turns MUSTNT_SLEEP into MAY_SLEEP.
To a certain extent, the tool can then infer context rules for all other functions in the program. ("spin_lock calls preempt_disable, and task_lock calls spin_lock, and do_stuff calls task_lock on line 7, so after line 7 the code is MUSTNT_SLEEP; on line 8 do_stuff calls kmalloc, which ... MAY_SLEEP, so the code is wrong.") But also you don't want to infer over the whole codebase, so it would also be wise to annotate (where you can) function prototypes in your headers as MIGHT_SLEEP or DOESNT_SLEEP where appropriate.
Sometimes code likes to have callback interfaces, so structs full of function pointers may be built and passed around. A function pointer interface should be specified the same way as a call site (MAY_SLEEP or MUSTNT_SLEEP) and then checked at every point where a function's address is taken to serve as such a pointer.
Actually, some common programming idioms make it hard to say if a function just by name will block or not - like with kmalloc (as described above), which MIGHT_SLEEP if you say GFP_KERNEL, but DOESNT_SLEEP if you say GFP_ATOMIC. So some functions need to be described like MIGHT_SLEEP_IF(ARG(2) == GFP_KERNEL).
- Functions which take, say, a flags argument to pass through to a kmalloc call will also be MIGHT_SLEEP_IF; the condition would have to propagate.
- Usually the flag's value is known at compile time; if it's not known (and you're in a MUSTNT_SLEEP section) then the code is of dubious quality anyway and should be errored on.

This is remarkably like typestate, only the stateful object is the code itself. If you imagine every function as additionally passing around a single global object (per-CPU, actually) that represents the execution context, it is basically identical, except the matching conditions are more complicated than typestate's canonical examples.

but actually

There is one big problem, which is that preempt_disable() and preempt_enable() are actually inc_preempt_count() and dec_preempt_count(), and spinlocks can nest. So saying that every time you spin_unlock() you transition into a MAY_SLEEP section misses some of the point! Instead, the context should be a more complicated state-space (in this limited application, a counter will suffice, but it's easy to see how it could have as many fields of as many types as you want). Now the annotations attached to each function are turning into veritable functions on their own... Some accurate rules might look as follows:

The context of any range of code is an Int, with 0 indicating that sleeping is allowed.
The annotation for any function is, generally, an Int -> Bool. For this application, they would be simply either:
- "const True" (alias "DOESNT_SLEEP"): calling this is always safe
- "== 0" (alias "MIGHT_SLEEP"): only safe if sleeping is permitted As per point #5 above, the annotation may also refer to the function's arguments, so for kmalloc, it would be something like "(\x -> ARG(2) == GFP_ATOMIC || x == 0)".
An extra annotation indicating changing the context (Int -> Int) may be provided (or inferred). For inc_preempt_count(), it would be "+ 1", and for dec_preempt_count(), it would be "- 1".
Propagation and inference would be a bunch of boolean intersection and function composition. There would certainly be some room to coalesce conditions to avoid redundant condition blow-up, but at some point it would probably be acceptable for the analysis tool to say "this is too much for me to follow, please annotate explicitly".

For such a tool to be possibly useful, it would need to be easy to phase in to an existing codebase. You can't expect every developer who ever wrote code to come back and annotate every function (especially if they're uninteresting!), and you can't do it yourself, but you don't want to have to churn context inference for an hour over a 10000-node call graph every time you modify one function. So it should start with some reasonable defaults: assume that things that are unannotated are safe (all functions behave as DOESNT_SLEEP, all unknown code ranges MAY_SLEEP), so "some problems may go unnoticed, but if we say there's an error, then there is one for sure". Inference can probably be done little bits at a time on each module, and the tool could even spit out auto-generated annotations (subject to hand-review) to go in the corresponding header files to make easy bringing the codebase up to speed.

Well, then.

Linux CGroups: Subsystems as Modules

2011-02-18

atomic all-nighters

1 comment:

Followers

Blog Archive

Contributors