double-double-toil-and-trouble-check locking

After nearly two years of on-and-off work, my cgroups threadgroup interface patches have attained a sufficiently polished state and been accepted into the mmotm tree. This is super exciting! It was hard, interesting, enlightening work from beginning to end, and was the most enjoyable experience I've ever had learning about real-world code:
  1. what kinds of tricks, rules, and infrastructure must go into maintaining a million-line codebase with as many contributors
  2. how to write code that is both complicated and perfect: no symbol may be out of place in either aesthetic style or correct functionality
  3. how to write code that interacts with more parts of a codebase than whose details you can possibly understand; how to intuit what parts you will bump into meaningfully and know exactly where to look for details that will tell you how to negotiate the right interaction
  4. justifying the usefulness of your code before anybody will even consider pulling from you
So, props to the core linux guys: they run a very tight ship, but living conditions are pleasant, and it sails to great places. Having my patches pulled is the ultimate validation that my work was not only interesting and challenging but also new and useful - a full-house of project qualities rarer even than talented programmers. It has been a blast.

What is the big deal? These patches solved an interesting and complicated problem in an interesting and complicated way. The challenge was to atomically migrate all tasks in a threadgroup (in non-linux speak, "threads in a multithreaded process") from one cgroup to another - no activity may cause a thread in that group to remain outside of the target cgroup at the end of the operation (think interleaving fork/exec/exit dances). Here is the landscape of thread management in linux:
  1. Each task_struct ("thread") is on a list for its group, called tsk->thread_group, and also has a pointer to the leader thread of the group, called tsk->group_leader (the leader's leader pointer points to himself). further, the leaders of all groups are on another system-wide list. The standard way to iterate over these lists is with macros called do_each_thread and while_each_thread. All these lists and pointers are protected by RCU and/or the global tasklist_lock, which are atomic-context synchronisation primitives.
  2. In fork(), the new thread is added to the list with tasklist_lock held, though cgroup modifications are done outside of that lock.
  3. In exit(), the tsk->PF_EXITING flag is set, and later, the thread falls off of the list (under RCU). It is unclear what happens to the other threads in a group when its leader exits, but it seems possible for them to stick around.
  4. In exec(), if the process is multithreaded, the calling thread will kill all other threads in the group (by way of zap_other_threads(), which delivers a SIGKILL). Further, if the calling thread is not the group leader, it will steal group leadership from the current leader. The old leader's leader pointer is updated, but other threads' are not, because they are going away - though they may not necessarily fall off of the group list until their signal is processed!
  5. For any task_struct, you can grab a reference on it to prevent it from being freed with get_task_struct(). This does not prevent the task from exiting, nor does it prevent it from falling off of the thread list.
I spent a long time scrambling around trying to do every per-thread operation in cgroup_attach_proc() (my operation) while in RCU read-side. Since you're not allowed to do any blocking operation while in an atomic context, this led me to much yak-shaving, tracking down things like open-coded calls to schedule() and NUMA memory migrations (erk!). I realised the best way was simply to kmalloc() an array as big as the threadgroup, snapshot refcounted pointers for each task_struct into it, and iterate over that instead. Still, since this requires dynamic memory allocation, we need to drop all locks which might protect the thread list between when we obtain a reference on the leader and when we go to iterate over his group list. (This will be important later.)

The fork() problem is obvious: a race between a forking thread (using CLONE_THREAD) and cgroup_attach_proc() may cause the new thread to be left in the old cgroup if the latter happens between when fork() copies the parent thread's old cgroups pointer and when it adds the thread to the tasklist. I solved this by using an rwlock (called "rwsem" in linux-land) which forking threads take in read-mode. There is an issue here, though: fork() is such a hot-path that introducing any shared memory access is dubious, and taking an extra lock (even in read mode!) entails writing to shared memory. The performance regression appears when multiple processors contend for the memory: they have to synchronise their caches, and possibly shoot down the other's entry. There is not much to be done about this if you need to take a lock (almost true: in linux-2.4, there used to exist big-reader locks for solving this problem), so we simply seek to place our lock in memory that is already contended for. Fortunately, the signal_struct is shared among the group, has an appropriate lifespan, and has a reference counter that already bounces.

The exec() problem is somewhat more subtle. Recall from above that we need to drop all locks before allocating an array to snapshot the threadgroup into - when we originally found the task_struct to do our operation on, we knew that it was the leader, but what if when we drop locks before allocating, some other thread does exec() and steals leadership from us? Other threads in the group may exit, causing our links in the threadgroup list to become invalid, making iterating over the group unsafe. Furthermore, since the only thread we hold a reference on is the original thread, we can't even guarantee that following our updated tsk->group_leader pointer is safe.

Hence, the only safe way to snapshot the whole threadgroup after allocating an array big enough to do so is to not only take the RCU lock, but also to re-check whether we are still the group leader after doing so. If we are, then the group list is safe to iterate over, but if we are not, we must drop everything and start over - to look up by TGID whoever is the leader of the group (if it even still exists) again. So we have a check for the threadgroup leader when finding his task_struct to begin with, another check for the consistency of the list upon entering the RCU critical section for the second time, and a loop around the whole thing that retries for as long as it takes for no exec() race to occur. I call this algorithm "double-double-toil-and-trouble-check locking".

No comments:

Post a Comment