Linux CGroups: Subsystems as Modules
the musings of a curious kernel hacker
2012-07-02
Linux's leap-second deadlocks
In case anybody is still following this blog, but not my new research blog - I've made a post there about the leap-second-related deadlocks that showed up yesterday in Linux servers worldwide.
Linux's leap-second deadlocks"
Also, follow the new blog:
Landslide: systematic dynamic race detection in kernel-space
2011-04-12
double-double-toil-and-trouble-check locking
After nearly two years of on-and-off work, my cgroups threadgroup interface patches have attained a sufficiently polished state and been accepted into the mmotm tree. This is super exciting! It was hard, interesting, enlightening work from beginning to end, and was the most enjoyable experience I've ever had learning about real-world code:
- what kinds of tricks, rules, and infrastructure must go into maintaining a million-line codebase with as many contributors
- how to write code that is both complicated and perfect: no symbol may be out of place in either aesthetic style or correct functionality
- how to write code that interacts with more parts of a codebase than you could possibly understand in detail; how to intuit which parts you will bump into meaningfully, and know exactly where to look for the details that tell you how to negotiate the right interaction
- justifying the usefulness of your code before anybody will even consider pulling from you
What is the big deal? These patches solved an interesting and complicated problem in an interesting and complicated way. The challenge was to atomically migrate all tasks in a threadgroup (in non-linux speak, "threads in a multithreaded process") from one cgroup to another - no activity may cause a thread in that group to remain outside of the target cgroup at the end of the operation (think interleaving fork/exec/exit dances). Here is the landscape of thread management in linux:
- Each task_struct ("thread") is on a list for its group, called tsk->thread_group, and also has a pointer to the leader thread of the group, called tsk->group_leader (the leader's leader pointer points to himself). Further, the leaders of all groups are on another system-wide list. The standard way to iterate over these lists is with macros called do_each_thread and while_each_thread (there's a small sketch of the idiom just after this list). All these lists and pointers are protected by RCU and/or the global tasklist_lock, which are atomic-context synchronisation primitives.
- In fork(), the new thread is added to the list with tasklist_lock held, though cgroup modifications are done outside of that lock.
- In exit(), the tsk->PF_EXITING flag is set, and later, the thread falls off of the list (under RCU). It is unclear what happens to the other threads in a group when its leader exits, but it seems possible for them to stick around.
- In exec(), if the process is multithreaded, the calling thread will kill all other threads in the group (by way of zap_other_threads(), which delivers a SIGKILL). Further, if the calling thread is not the group leader, it will steal group leadership from the current leader. The old leader's leader pointer is updated, but other threads' are not, because they are going away - though they may not necessarily fall off of the group list until their signal is processed!
- For any task_struct, you can grab a reference on it to prevent it from being freed with get_task_struct(). This does not prevent the task from exiting, nor does it prevent it from falling off of the thread list.
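For concreteness, the single-group flavour of that iteration idiom looks roughly like this - a made-up helper (visit_threadgroup is not code from the patches), just to show the shape:

#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Walk every thread in tsk's threadgroup. The RCU read lock keeps the list
 * links valid while we walk, but it's an atomic section: no sleeping, and
 * threads may still be exiting or exec()ing around us. */
static void visit_threadgroup(struct task_struct *tsk)
{
	struct task_struct *t = tsk;

	rcu_read_lock();
	do {
		/* look at t here, briefly and without blocking */
	} while_each_thread(tsk, t);
	rcu_read_unlock();
}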
The fork() problem is obvious: a race between a forking thread (using CLONE_THREAD) and cgroup_attach_proc() may cause the new thread to be left in the old cgroup if the latter happens between when fork() copies the parent thread's old cgroups pointer and when it adds the thread to the tasklist. I solved this by using a reader-writer semaphore (an "rwsem", in linux-land) which forking threads take in read mode. There is an issue here, though: fork() is such a hot path that introducing any shared memory access is dubious, and taking an extra lock (even in read mode!) entails writing to shared memory. The performance regression appears when multiple processors contend for the memory: they have to synchronise their caches, and possibly shoot down the other's entry. There is not much to be done about this if you need to take a lock (almost true: in linux-2.4, there used to exist big-reader locks for solving this problem), so we simply seek to place our lock in memory that is already contended for. Fortunately, the signal_struct is shared among the group, has an appropriate lifespan, and has a reference counter that already bounces.
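In sketch form, the idea looks something like this (the field and helper names approximate the real patches, but treat them all as illustrative - the field in particular is one the patches add, not something in a vanilla tree):

#include <linux/rwsem.h>
#include <linux/sched.h>

/* A per-threadgroup rwsem lives in the shared, already-bouncing signal_struct.
 * fork() takes it for reading around the window between copying the parent's
 * cgroups pointer and linking the new thread into the group; cgroup_attach_proc()
 * takes it for writing, so no half-forked thread can dodge a whole-group
 * migration. */
static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
{
	down_read(&tsk->signal->threadgroup_fork_lock);
}

static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
{
	up_read(&tsk->signal->threadgroup_fork_lock);
}

static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
{
	down_write(&tsk->signal->threadgroup_fork_lock);
}

static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
{
	up_write(&tsk->signal->threadgroup_fork_lock);
}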
The exec() problem is somewhat more subtle. Recall that we need to drop all locks before allocating an array to snapshot the threadgroup into (the allocation may sleep, and the locks protecting the group list are atomic-context locks) - when we originally found the task_struct to do our operation on, we knew that it was the leader, but what if, while the locks are dropped for the allocation, some other thread does exec() and steals leadership from us? Other threads in the group may exit, causing our links in the threadgroup list to become invalid, making iterating over the group unsafe. Furthermore, since the only thread we hold a reference on is the original thread, we can't even guarantee that following our updated tsk->group_leader pointer is safe.
Hence, the only safe way to snapshot the whole threadgroup after allocating an array big enough to hold it is to not only take the RCU lock, but also to re-check whether we are still the group leader after doing so. If we are, then the group list is safe to iterate over; if we are not, we must drop everything and start over - look up again, by TGID, whoever now leads the group (if it even still exists). So we have a check for the threadgroup leader when finding his task_struct to begin with, another check for the consistency of the list upon entering the RCU critical section for the second time, and a loop around the whole thing that retries for as long as it takes for no exec() race to occur. I call this algorithm "double-double-toil-and-trouble-check locking".
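Compressed into a sketch (the migration itself and most error handling are elided, attach_proc_sketch is made up, and the details differ from the real cgroup_attach_proc()), the structure is:

#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>

static int attach_proc_sketch(pid_t tgid)
{
	struct task_struct *tsk, **group;
	int group_size;

retry_find_task:
	rcu_read_lock();
	tsk = find_task_by_vpid(tgid);
	if (!tsk) {
		rcu_read_unlock();
		return -ESRCH;
	}
	tsk = tsk->group_leader;	/* check #1: operate on the leader */
	get_task_struct(tsk);
	rcu_read_unlock();

	/* May sleep, so no atomic locks can be held here. */
	group_size = get_nr_threads(tsk);
	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
	if (!group) {
		put_task_struct(tsk);
		return -ENOMEM;
	}

	rcu_read_lock();
	if (!thread_group_leader(tsk)) {
		/* check #2: exec() stole leadership while we were allocating */
		rcu_read_unlock();
		kfree(group);
		put_task_struct(tsk);
		goto retry_find_task;	/* the loop: retry as long as it takes */
	}
	/* Now tsk->thread_group is safe to snapshot into 'group'. */
	rcu_read_unlock();

	/* ... migrate each snapshotted thread, then clean up ... */
	kfree(group);
	put_task_struct(tsk);
	return 0;
}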
2011-02-18
atomic all-nighters
exposition
In an environment like the Linux kernel, the many ways to deal with concurrency cause a bunch of restrictions on what sorts of things you can do at certain points in your code. The concurrency primitives are, briefly:
- mutexes/semaphores: high-level blocking locks, which will deschedule you if you need to wait to acquire one. Very commonly used in many subsystems, to protect data structures, and so on (popular example: mm->mmap_sem).
- spinlocks/RCU: low-level "atomic context" locks, which disable preemption (and, in some variants, interrupts) and spin-wait (in the SMP case). Shouldn't be held around "very expensive" operations (at the very least because another CPU may be spin-waiting on you!). These are needed for, among other reasons, accessing things from an interrupt handler, wherein rescheduling is already not permitted. The global tasklist_lock is a spin-lock, as are the locks on each task_struct (task_lock(tsk)).
- There are calls such as yield() and schedule() (and sleep(), ewwww) which explicitly invoke the scheduler, and are sometimes used in open-coded synchronisation dances (to varying stylistic quality) (examples: A and B).
- Also of note is the dynamic memory allocator, kmalloc(size_t size, gfp_t flags). The flags argument controls the behaviour of the allocator - the two most common choices:
- GFP_KERNEL: indicates that it's okay to wait and reschedule if necessary
- GFP_ATOMIC: "please satisfy this request immediately", and therefore more likely to fail - incidentally, this may also allocate from the kernel's "emergency memory pool", under the assumption that it should end up freed quickly.
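A made-up fragment (the lock and the function are hypothetical), just to illustrate the distinction:

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);	/* hypothetical lock, for illustration */

static int example_alloc(size_t size)
{
	void *outside, *inside;

	/* No locks held: the allocator is free to reschedule us. */
	outside = kmalloc(size, GFP_KERNEL);
	if (!outside)
		return -ENOMEM;

	spin_lock(&my_lock);
	/* Atomic section: GFP_KERNEL here would be a might-sleep bug, so ask
	 * for GFP_ATOMIC and be prepared for it to fail. */
	inside = kmalloc(size, GFP_ATOMIC);
	spin_unlock(&my_lock);

	kfree(inside);
	kfree(outside);
	return inside ? 0 : -ENOMEM;
}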
There were a bunch of places along the way of implementing the cgroups threadgroup interface patches where I had to tread carefully around my concurrency primitives to not break the rule that falls out of all this: never call anything that might sleep while in an atomic section. There's a macro in linux called "might_sleep()" which you can drop in your scheduler-invoking code as a debugging measure, and it will do a runtime check that everything is OK when it gets hit (there's even a "might_sleep_if(...)" version). But (a) the developer has to think to put it there, and developers are unreliable, and (b) it's a runtime check for something that seems like a static property of the code.
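(Dropping the annotation in is the easy part - a trivial made-up helper:)

#include <linux/kernel.h>

/* Anything that can end up blocking announces the fact up front, so that
 * reaching this line from atomic context complains at runtime. */
static void frobnicate_slowly(void)
{
	might_sleep();
	/* ... work that may take a mutex, allocate with GFP_KERNEL, etc. ... */
}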
an idea
So it occurred to me that maybe there should be machinery as part of a language's tool-chain that could automatically verify whether code respects the contexts that functions are supposed to run in. It could even be entirely separate from the compiler itself (I was thinking that it might not be too hard to extend my 411 compiler to do it, though all the changes would be in my partner's code, not mine, so). The framework might look something like this:
- Independent of the code, a definition of the various contexts that are important to track (and the rules governing their interactions). Here, an individual function could be "MIGHT_SLEEP" or "DOESNT_SLEEP", and call sites (or, regions of code in general) will have a property like "MAY_SLEEP" or "MUSTNT_SLEEP". If a MIGHT_SLEEP function is called from a MUSTNT_SLEEP line of code, print an error message; all other cases are okay. Functions may also change the context at their call sites, so transitions may be specified.
- Explicitly annotate those functions which affect the code context:
- schedule(), as the central function that everything sleeping goes through, should be marked MIGHT_SLEEP.
- preempt_disable(), which spinlocks and RCU et al. invoke, should transition the call site from MAY_SLEEP to MUSTNT_SLEEP, and preempt_enable(), its dual, turns MUSTNT_SLEEP into MAY_SLEEP.
- To a certain extent, the tool can then infer context rules for all other functions in the program. ("spin_lock calls preempt_disable, and task_lock calls spin_lock, and do_stuff calls task_lock on line 7, so after line 7 the code is MUSTNT_SLEEP; on line 8 do_stuff calls kmalloc, which ... MAY_SLEEP, so the code is wrong.") But also you don't want to infer over the whole codebase, so it would also be wise to annotate (where you can) function prototypes in your headers as MIGHT_SLEEP or DOESNT_SLEEP where appropriate.
- Sometimes code likes to have callback interfaces, so structs full of function pointers may be built and passed around. A function pointer interface should be specified the same way as a call site (MAY_SLEEP or MUSTNT_SLEEP) and then checked at every point where a function's address is taken to serve as such a pointer.
- Actually, some common programming idioms make it hard to say if a function just by name will block or not - like with kmalloc (as described above), which MIGHT_SLEEP if you say GFP_KERNEL, but DOESNT_SLEEP if you say GFP_ATOMIC. So some functions need to be described like MIGHT_SLEEP_IF(ARG(2) == GFP_KERNEL).
- Functions which take, say, a flags argument to pass through to a kmalloc call will also be MIGHT_SLEEP_IF; the condition would have to propagate.
- Usually the flag's value is known at compile time; if it's not known (and you're in a MUSTNT_SLEEP section) then the code is of dubious quality anyway and should be errored on.
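To make that concrete, the annotations might be spelled as (entirely hypothetical) attribute macros on ordinary prototypes - the checker would read them, the compiler would ignore them:

#include <linux/types.h>

/* Hypothetical annotation macros; none of this exists in the real kernel. */
#define MIGHT_SLEEP		/* calling this may invoke the scheduler */
#define DOESNT_SLEEP		/* calling this never invokes the scheduler */
#define TRANSITION(from, to)	/* calling this moves the call site's context */
#define MIGHT_SLEEP_IF(cond)	/* sleeps only when cond holds at the call */

/* The few root annotations everything else would be inferred from (in the
 * real kernel some of these are macros, not functions - it's only a sketch): */
MIGHT_SLEEP void schedule(void);

TRANSITION(MAY_SLEEP, MUSTNT_SLEEP) void preempt_disable(void);
TRANSITION(MUSTNT_SLEEP, MAY_SLEEP) void preempt_enable(void);

MIGHT_SLEEP_IF(ARG(2) == GFP_KERNEL)
void *kmalloc(size_t size, gfp_t flags);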
but actually
There is one big problem, which is that preempt_disable() and preempt_enable() are actually inc_preempt_count() and dec_preempt_count(), and spinlocks can nest. So saying that every time you spin_unlock() you transition into a MAY_SLEEP section misses some of the point! Instead, the context should be a more complicated state-space (in this limited application, a counter will suffice, but it's easy to see how it could have as many fields of as many types as you want). Now the annotations attached to each function are turning into veritable functions on their own... Some accurate rules might look as follows:
- The context of any range of code is an Int, with 0 indicating that sleeping is allowed.
- The annotation for any function is, generally, an Int -> Bool. For this application, they would be simply either:
- "const True" (alias "DOESNT_SLEEP"): calling this is always safe
- "== 0" (alias "MIGHT_SLEEP"): only safe if sleeping is permitted As per point #5 above, the annotation may also refer to the function's arguments, so for kmalloc, it would be something like "(\x -> ARG(2) == GFP_ATOMIC || x == 0)".
- An extra annotation indicating that the function changes the context (Int -> Int) may be provided (or inferred). For inc_preempt_count(), it would be "+ 1", and for dec_preempt_count(), it would be "- 1".
- Propagation and inference would be a bunch of boolean intersection and function composition. There would certainly be some room to coalesce conditions to avoid redundant condition blow-up, but at some point it would probably be acceptable for the analysis tool to say "this is too much for me to follow, please annotate explicitly".
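As a toy of what those per-function rules could look like as data (this is the checker itself, so plain userspace C; everything here is hypothetical):

#include <stdbool.h>

/* The context is just an int (the preempt count); 0 means sleeping is OK. */
typedef bool (*ctx_pred)(int ctx);	/* Int -> Bool: safe to call here? */

static bool doesnt_sleep(int ctx)     { return true; }		/* "const True" */
static bool might_sleep_pred(int ctx) { return ctx == 0; }	/* "== 0" */

struct fn_rule {
	const char *name;
	ctx_pred safe;		/* when is calling this function legal? */
	int delta;		/* Int -> Int (here just +/- a constant) */
};

static const struct fn_rule rules[] = {
	{ "schedule",          might_sleep_pred,  0 },
	{ "inc_preempt_count", doesnt_sleep,     +1 },
	{ "dec_preempt_count", doesnt_sleep,     -1 },
};

/* The core of the checker's transfer function: is this call legal in the
 * current context, and what context does the code have afterwards? */
static bool check_call(const struct fn_rule *r, int *ctx)
{
	bool ok = r->safe(*ctx);
	*ctx += r->delta;
	return ok;
}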
Well, then.
2010-12-26
lesson: never hack when it won't solve the entire problem.
I'd like to share some code I wrote for a particularly misguided solution to a yak-shaving quest. Fortunately, soon afterwards, I realized a much better way to dodge the bullet that I was here trying to take in the chest and stagger back to camp while screaming for help... so this code will never see the light of day.
That's good, because I also think it has a fatal flaw. I won't say it, but I'll give an exposition so you can figure it out. Here's the idea:
- You are not allowed to block, sleep, or otherwise yield while in an "atomic" section of code (usually because it means interrupts are disabled). In order to safely iterate over a thread-group, you need to hold either rcu_read_lock or tasklist_lock, both of which are low-level locks and constitute "atomic" code while being held.
- In kernel/cpuset.c, cpuset_change_task_nodemask has two steps, and a check between those steps which does a yield to synchronize its internal state (as part of an open-coded replacement for a proper concurrency primitive... ugh. from commit c0ff7453bb5c7c98e0885fb94279f2571946f280 on mmotm). Here I have changed it to return -EAGAIN instead of yielding, so the caller can synchronize "more appropriately".
- get_task_struct and put_task_struct manage a thread's reference count. After you've done get_task_struct, it will always be safe to access that memory, until you release it. If you call put_task_struct and you're the last one with a reference count, it does a bunch of cleanup. (interesting read here, perhaps a hint.)
1456 /*
1457 * This particular per-task operation requires being able to sleep, so
1458 * it can't be done in the attach_task callback, where sleeping is
1459 * forbidden. See cgroup_attach_proc.
1460 */
1461 if (threadgroup) {
1462 struct task_struct *c = NULL;
1463 again:
1464 rcu_read_lock();
1465 /* ensure safe thread-group traversal */
1466 if (thread_group_leader(tsk)) {
1467 list_for_each_entry_rcu(c, &tsk->thread_group,
1468 thread_group) {
1469 get_task_struct(c);
1470 ret = cpuset_change_task_nodemask(c,
1471 &cpuset_attach_nodemask_to);
1472 /*
1473 * on a failed nodemask change, we need to exit
1474 * the atomic section and start over.
1475 */
1476 if (ret == -EAGAIN) {
1477 rcu_read_unlock();
1478 yield();
1479 goto again;
1480 }
1481 put_task_struct(c);
1482 }
1483 rcu_read_unlock();
1484 } else if (c != NULL) {
1485 /*
1486 * if racing with exec caused an abort, we need to
1487 * finish the rebind operation in progress on it.
1488 */
1489 rcu_read_unlock();
1490 do {
1491 ret = cpuset_change_task_nodemask(c,
1492 &cpuset_attach_nodemask_to);
1493 if (ret == -EAGAIN)
1494 yield();
1495 } while (ret != -EAGAIN);
1496 put_task_struct(c);
1497 }
1498 } else {
1499 do {
1500 ret = cpuset_change_task_nodemask(tsk,
1501 &cpuset_attach_nodemask_to);
1502 if (ret == -EAGAIN)
1503 yield();
1504 } while (ret != -EAGAIN);
1505 }
2010-12-25
as though holding a red pen and a lock stamp
Found two bugs in peripheral code around cgroup_attach_task while trying to finish up my older project, which is to add cgroup_attach_proc. More on the actual project later; for now:
- Earlier this year, cpuset was changed to use NODEMASK_ALLOC, to avoid stack-allocating a structure which can be huge on NUMA systems. However, the present usage is rather cavalier in functions that aren't allowed to fail!
- Last month, the locking order in memcontrol was changed to avoid a potential deadlock... introducing a bit of foot-stepping between it and cpuset.
2010-03-23
Mainline
2.6.34-rc2 came out three days ago, and supports modular cgroup subsystems.
(The net_cls patch wasn't merged with the rest and is presently still running the approval gauntlet, though even blkio-cg is already in.)
2010-01-07
gdb is not so great at debugging within modules.
At the suggestion of LKML, I have spent today working on making blk-cgroup, also known as blkiocg, into a module. It has not gone entirely smoothly, as there is an already-existing module (cfq-iosched) that depends on it, so the process has involved a good amount of figuring out how module dependencies (in terms of symbol-loading and the language of the kbuild system) work.
Now that I finally got it to build, boot, and load the modules (modprobe even figured out the dependency for me, it was great), I discovered that the kernel panics when trying to unload the subsystem (with what looks like a SIGILL in some memory management code). So, GDBing it up, I have discovered that it is difficult to deal with code that's loaded at runtime but isn't a standard dynamic library.
(gdb)
0x000000006005146a 3504 ss->destroy(ss, dummytop);
(gdb)
0x00000000628202ad in ?? ()
(gdb) break block/blk-cgroup.c:207
No source file named block/blk-cgroup.c.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) disassemble 0x628202ad
No function contains specified address.
(gdb) disassemble 0x628202ad 0x62820363
Dump of assembler code from 0x628202ad to 0x62820363:
0x00000000628202ad: push %rbp
...
And you thought 213's bomblab would never be useful!