Linux CGroups: Subsystems as Modules: lkml

Showing posts with label lkml. Show all posts

2011-04-12

double-double-toil-and-trouble-check locking

After nearly two years of on-and-off work, my cgroups threadgroup interface patches have attained a sufficiently polished state and been accepted into the mmotm tree. This is super exciting! It was hard, interesting, enlightening work from beginning to end, and was the most enjoyable experience I've ever had learning about real-world code:

what kinds of tricks, rules, and infrastructure must go into maintaining a million-line codebase with as many contributors
how to write code that is both complicated and perfect: no symbol may be out of place in either aesthetic style or correct functionality
how to write code that interacts with more parts of a codebase than whose details you can possibly understand; how to intuit what parts you will bump into meaningfully and know exactly where to look for details that will tell you how to negotiate the right interaction
justifying the usefulness of your code before anybody will even consider pulling from you

So, props to the core linux guys: they run a very tight ship, but living conditions are pleasant, and it sails to great places. Having my patches pulled is the ultimate validation that my work was not only interesting and challenging but also new and useful - a full-house of project qualities rarer even than talented programmers. It has been a blast.

What is the big deal? These patches solved an interesting and complicated problem in an interesting and complicated way. The challenge was to atomically migrate all tasks in a threadgroup (in non-linux speak, "threads in a multithreaded process") from one cgroup to another - no activity may cause a thread in that group to remain outside of the target cgroup at the end of the operation (think interleaving fork/exec/exit dances). Here is the landscape of thread management in linux:

Each task_struct ("thread") is on a list for its group, called tsk->thread_group, and also has a pointer to the leader thread of the group, called tsk->group_leader (the leader's leader pointer points to himself). further, the leaders of all groups are on another system-wide list. The standard way to iterate over these lists is with macros called do_each_thread and while_each_thread. All these lists and pointers are protected by RCU and/or the global tasklist_lock, which are atomic-context synchronisation primitives.
In fork(), the new thread is added to the list with tasklist_lock held, though cgroup modifications are done outside of that lock.
In exit(), the tsk->PF_EXITING flag is set, and later, the thread falls off of the list (under RCU). It is unclear what happens to the other threads in a group when its leader exits, but it seems possible for them to stick around.
In exec(), if the process is multithreaded, the calling thread will kill all other threads in the group (by way of zap_other_threads(), which delivers a SIGKILL). Further, if the calling thread is not the group leader, it will steal group leadership from the current leader. The old leader's leader pointer is updated, but other threads' are not, because they are going away - though they may not necessarily fall off of the group list until their signal is processed!
For any task_struct, you can grab a reference on it to prevent it from being freed with get_task_struct(). This does not prevent the task from exiting, nor does it prevent it from falling off of the thread list.

I spent a long time scrambling around trying to do every per-thread operation in cgroup_attach_proc() (my operation) while in RCU read-side. Since you're not allowed to do any blocking operation while in an atomic context, this led me to much yak-shaving, tracking down things like open-coded calls to schedule() and NUMA memory migrations (erk!). I realised the best way was simply to kmalloc() an array as big as the threadgroup, snapshot refcounted pointers for each task_struct into it, and iterate over that instead. Still, since this requires dynamic memory allocation, we need to drop all locks which might protect the thread list between when we obtain a reference on the leader and when we go to iterate over his group list. (This will be important later.)

The fork() problem is obvious: a race between a forking thread (using CLONE_THREAD) and cgroup_attach_proc() may cause the new thread to be left in the old cgroup if the latter happens between when fork() copies the parent thread's old cgroups pointer and when it adds the thread to the tasklist. I solved this by using an rwlock (called "rwsem" in linux-land) which forking threads take in read-mode. There is an issue here, though: fork() is such a hot-path that introducing any shared memory access is dubious, and taking an extra lock (even in read mode!) entails writing to shared memory. The performance regression appears when multiple processors contend for the memory: they have to synchronise their caches, and possibly shoot down the other's entry. There is not much to be done about this if you need to take a lock (almost true: in linux-2.4, there used to exist big-reader locks for solving this problem), so we simply seek to place our lock in memory that is already contended for. Fortunately, the signal_struct is shared among the group, has an appropriate lifespan, and has a reference counter that already bounces.

The exec() problem is somewhat more subtle. Recall from above that we need to drop all locks before allocating an array to snapshot the threadgroup into - when we originally found the task_struct to do our operation on, we knew that it was the leader, but what if when we drop locks before allocating, some other thread does exec() and steals leadership from us? Other threads in the group may exit, causing our links in the threadgroup list to become invalid, making iterating over the group unsafe. Furthermore, since the only thread we hold a reference on is the original thread, we can't even guarantee that following our updated tsk->group_leader pointer is safe.

Hence, the only safe way to snapshot the whole threadgroup after allocating an array big enough to do so is to not only take the RCU lock, but also to re-check whether we are still the group leader after doing so. If we are, then the group list is safe to iterate over, but if we are not, we must drop everything and start over - to look up by TGID whoever is the leader of the group (if it even still exists) again. So we have a check for the threadgroup leader when finding his task_struct to begin with, another check for the consistency of the list upon entering the RCU critical section for the second time, and a loop around the whole thing that retries for as long as it takes for no exec() race to occur. I call this algorithm "double-double-toil-and-trouble-check locking".

2010-12-25

as though holding a red pen and a lock stamp

Found two bugs in peripheral code around cgroup_attach_task while trying to finish up my older project, which is to add cgroup_attach_proc. More on the actual project later; for now:

Earlier this year, cpuset was changed to use NODEMASK_ALLOC, to avoid stack-allocating a structure which can be huge on NUMA systems. However, the present usage is rather cavalier in functions that aren't allowed to fail!
Last month, the locking order in memcontrol was changed to avoid a potential deadlock... introducing a bit of foot-stepping between it and cpuset.

Fortunately, none of these are design-level blockers to the patch series I'm trying to finalize. But the solutions could prove "interesting"...

2010-01-07

gdb is not so great at debugging within modules.

At the suggestion of LKML, I have spent today working on making blk-cgroup, also known as blkiocg, into a module. it has not gone entirely smoothly, as there is an already existing module (cfq-iosched) that depends on it, so the process has involved a good amount of figuring out how module dependencies (in terms of symbol-loading and the language of the kbuild system) work.

Now that I got it finally to build, boot, and load the modules (modprobe even figured out the dependency for me, it was great), I discovered that the kernel panics when trying to unload the subsystem (with what looks like a SIGILL in some memory management code). So, GDBing it up, I have discovered that it is difficult to deal with code that's loaded at runtime that isn't a standard dynamic library.

(gdb)
0x000000006005146a 3504 ss->destroy(ss, dummytop);
(gdb)
0x00000000628202ad in ?? ()
(gdb) break block/blk-cgroup.c:207
No source file named block/blk-cgroup.c.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) disassemble 0x628202ad
No function contains specified address.
(gdb) disassemble 0x628202ad 0x62820363
Dump of assembler code from 0x628202ad to 0x62820363:
0x00000000628202ad: push %rbp
...

And you thought 213's bomblab would never be useful!

2009-12-31

lkml submission #4

http://lkml.org/lkml/2009/12/31/2

2009-12-21

lkml submission #3

http://lkml.org/lkml/2009/12/21/211

2009-12-09

I learned something new today.

http://lkml.org/lkml/2009/12/9/30

One possible solution - modify cgroup_get_sb to do as follows:

1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.

Now, note, for clarification:

1508 static struct file_system_type cgroup_fs_type = {
1509         .name = "cgroup",
1510         .get_sb = cgroup_get_sb,
1511         .kill_sb = cgroup_kill_sb,
1512 };

The issue is that s->s_umount is taken in get_sb while it already has the subsys_mutex, whereas kill_sb is called with s->s_umount already held, and deadlock comes from there. There's a good question lurking here, and it is "Who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has the lock taken before it's called?"

So yes, there's locking order violation, but if functions were locks (which is a reasonable comparison because isn't it intuitive to have a function that takes a lock at the start and drops it at the end?) then the blame would be on whoever wrote this interface to begin with.

2009-12-04

lkml submission #2

http://lkml.org/lkml/2009/12/4/53

2009-12-03

so what's this net_cls thing actually useful for?

Anand came to #cslounge today with an interesting question: is there a logical equivalent of nice for network bandwidth management that i can easily apply to a process? I don't like being unable to browse the web or irc when I scp large amounts of stuff.

I was under the impression that there didn't actually exist a subsystem for controlling that sort of thing, only for "classifying" it - the net_cls subsystem has one control file, which is "classid", and it lets you specify a network class for each cgroup, and it doesn't seem to do much because it doesn't have anything hooking into it because it's a module. Then I found this, which explains the actual secret: each classid can be associated with sockets held by tasks in those cgroups, and then from -userspace- have bandwidth throttling administered! Very slick, and keeps the kernel-side labour to a minimum, so much so that it can even be modularized.

Of course, Anand's particular situation had an easier solution, which is the -l option to scp which lets you specify a bandwidth limit explicitly.

2009-11-02

lkml submission #1

http://lkml.org/lkml/2009/11/2/442

2009-10-21

Midsemester plan (as seen in the 412 project volume!)

milestones! \o_

this week:
clean up my work for my other classes and get my brain back in shape! possibly even refactor some of the code that i have identified as needing refactoring. also stop being so lazy and mirror my work/source tree in the project volume.

next week:
polish up my current features - namely, the so-far-implemented subsys[] modifications, and module interface within cgroups, and the conversion of net_cls to be able to be modulificarized. possibly even submit first draft to LKML and folks, get reviews!

week after (nov 1-7):
think up how to do module unloading support; logistics of pinning the subsystems when loaded and letting them go when a hierarchy is unmounted. possibly begin implementing this thing. possibly consider any reviews gotten on LKML for first submission.

nov 8-14:
work should be moving along solidly on module unloading and/or fixing lkml reviews.

nov 15-21:
one or both of above should be finished. shoot for another submission to lkml around this time?

nov 22-28:
if not lkmled last week, module unloading should be first-draft done and thinged this week.

nov 29-dec 5:
rest of semester should be dedicated to finalizing everything and making the critics from lkml happy

grading criteria! _o/

C: idea rejected or otherwise falls apart somehow, implementation turns out to be very shaky, didn't get any shininess done on top of the rudimentary stuff.

B: implementation possibly a little shaky, the lkml dudes don't like it yet, a sizeable amount more work to be done before it can be called a real feature, not a lot of shininess. alternatively, a bare rudimentary implementation taken by lkml but with nothing shiny at all (i.e., pretty much what functionality i have now and nothing more)

A: implementation solid, most likely accepted to lkml by the end of semester, or if not, should be clearly on its way to that soon. at least a moderate amount of shininess, whether from module unloading or otherwise, should be present.

A++++ with a hug and a star-shaped sticker: shines more brilliantly than the sun, great features, accepted into kernel for sure by end of semester, works flawlessly and highly lauded by big-name developers as great development in computing. nobel peace prize possibly awarded.

Linux CGroups: Subsystems as Modules