Linux CGroups: Subsystems as Modules: milestones

Showing posts with label milestones. Show all posts

2011-04-12

double-double-toil-and-trouble-check locking

After nearly two years of on-and-off work, my cgroups threadgroup interface patches have attained a sufficiently polished state and been accepted into the mmotm tree. This is super exciting! It was hard, interesting, enlightening work from beginning to end, and was the most enjoyable experience I've ever had learning about real-world code:

what kinds of tricks, rules, and infrastructure must go into maintaining a million-line codebase with as many contributors
how to write code that is both complicated and perfect: no symbol may be out of place in either aesthetic style or correct functionality
how to write code that interacts with more parts of a codebase than whose details you can possibly understand; how to intuit what parts you will bump into meaningfully and know exactly where to look for details that will tell you how to negotiate the right interaction
justifying the usefulness of your code before anybody will even consider pulling from you

So, props to the core linux guys: they run a very tight ship, but living conditions are pleasant, and it sails to great places. Having my patches pulled is the ultimate validation that my work was not only interesting and challenging but also new and useful - a full-house of project qualities rarer even than talented programmers. It has been a blast.

What is the big deal? These patches solved an interesting and complicated problem in an interesting and complicated way. The challenge was to atomically migrate all tasks in a threadgroup (in non-linux speak, "threads in a multithreaded process") from one cgroup to another - no activity may cause a thread in that group to remain outside of the target cgroup at the end of the operation (think interleaving fork/exec/exit dances). Here is the landscape of thread management in linux:

Each task_struct ("thread") is on a list for its group, called tsk->thread_group, and also has a pointer to the leader thread of the group, called tsk->group_leader (the leader's leader pointer points to himself). further, the leaders of all groups are on another system-wide list. The standard way to iterate over these lists is with macros called do_each_thread and while_each_thread. All these lists and pointers are protected by RCU and/or the global tasklist_lock, which are atomic-context synchronisation primitives.
In fork(), the new thread is added to the list with tasklist_lock held, though cgroup modifications are done outside of that lock.
In exit(), the tsk->PF_EXITING flag is set, and later, the thread falls off of the list (under RCU). It is unclear what happens to the other threads in a group when its leader exits, but it seems possible for them to stick around.
In exec(), if the process is multithreaded, the calling thread will kill all other threads in the group (by way of zap_other_threads(), which delivers a SIGKILL). Further, if the calling thread is not the group leader, it will steal group leadership from the current leader. The old leader's leader pointer is updated, but other threads' are not, because they are going away - though they may not necessarily fall off of the group list until their signal is processed!
For any task_struct, you can grab a reference on it to prevent it from being freed with get_task_struct(). This does not prevent the task from exiting, nor does it prevent it from falling off of the thread list.

I spent a long time scrambling around trying to do every per-thread operation in cgroup_attach_proc() (my operation) while in RCU read-side. Since you're not allowed to do any blocking operation while in an atomic context, this led me to much yak-shaving, tracking down things like open-coded calls to schedule() and NUMA memory migrations (erk!). I realised the best way was simply to kmalloc() an array as big as the threadgroup, snapshot refcounted pointers for each task_struct into it, and iterate over that instead. Still, since this requires dynamic memory allocation, we need to drop all locks which might protect the thread list between when we obtain a reference on the leader and when we go to iterate over his group list. (This will be important later.)

The fork() problem is obvious: a race between a forking thread (using CLONE_THREAD) and cgroup_attach_proc() may cause the new thread to be left in the old cgroup if the latter happens between when fork() copies the parent thread's old cgroups pointer and when it adds the thread to the tasklist. I solved this by using an rwlock (called "rwsem" in linux-land) which forking threads take in read-mode. There is an issue here, though: fork() is such a hot-path that introducing any shared memory access is dubious, and taking an extra lock (even in read mode!) entails writing to shared memory. The performance regression appears when multiple processors contend for the memory: they have to synchronise their caches, and possibly shoot down the other's entry. There is not much to be done about this if you need to take a lock (almost true: in linux-2.4, there used to exist big-reader locks for solving this problem), so we simply seek to place our lock in memory that is already contended for. Fortunately, the signal_struct is shared among the group, has an appropriate lifespan, and has a reference counter that already bounces.

The exec() problem is somewhat more subtle. Recall from above that we need to drop all locks before allocating an array to snapshot the threadgroup into - when we originally found the task_struct to do our operation on, we knew that it was the leader, but what if when we drop locks before allocating, some other thread does exec() and steals leadership from us? Other threads in the group may exit, causing our links in the threadgroup list to become invalid, making iterating over the group unsafe. Furthermore, since the only thread we hold a reference on is the original thread, we can't even guarantee that following our updated tsk->group_leader pointer is safe.

Hence, the only safe way to snapshot the whole threadgroup after allocating an array big enough to do so is to not only take the RCU lock, but also to re-check whether we are still the group leader after doing so. If we are, then the group list is safe to iterate over, but if we are not, we must drop everything and start over - to look up by TGID whoever is the leader of the group (if it even still exists) again. So we have a check for the threadgroup leader when finding his task_struct to begin with, another check for the consistency of the list upon entering the RCU critical section for the second time, and a loop around the whole thing that retries for as long as it takes for no exec() race to occur. I call this algorithm "double-double-toil-and-trouble-check locking".

2010-03-23

Mainline

2.6.34-rc2 came out three days ago, and supports modular cgroup subsystems.

(The net_cls patch wasn't merged with the rest and is presently still running the approval gauntlet, though even blkio-cg is already in.)

2009-12-31

lkml submission #4

http://lkml.org/lkml/2009/12/31/2

2009-12-21

lkml submission #3

http://lkml.org/lkml/2009/12/21/211

2009-12-04

lkml submission #2

http://lkml.org/lkml/2009/12/4/53

2009-11-03

a module UNloading adventure...

I promised myself this morning that I would spend a "little" bit of time thinking about module unloading. An afternoon and half an evening later, I found myself with a newly written 175-line patch...

livecd dev # insmod /mnt/host/cgroup_test1.ko
livecd dev # insmod /mnt/host/cgroup_test2.ko
livecd dev # modprobe cls_cgroup
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 0
cgroup_test2 2800 0
cgroup_test1 2800 0
livecd dev # mount -t cgroup none -o net_cls,test2 cgroup/
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 1
cgroup_test2 2800 1
cgroup_test1 2800 0
livecd dev # rmmod cgroup_test1
livecd dev # rmmod cgroup_test2
ERROR: Module cgroup_test2 is in use
livecd dev # umount cgroup
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 0
cgroup_test2 2800 0
livecd dev # rmmod cgroup_test2
livecd dev #

It still has some FIXMEs, meaning I need to make sure there are no races where there might be races, but I am surprised at how easy that was.

2009-10-21

Midsemester plan (as seen in the 412 project volume!)

milestones! \o_

this week:
clean up my work for my other classes and get my brain back in shape! possibly even refactor some of the code that i have identified as needing refactoring. also stop being so lazy and mirror my work/source tree in the project volume.

next week:
polish up my current features - namely, the so-far-implemented subsys[] modifications, and module interface within cgroups, and the conversion of net_cls to be able to be modulificarized. possibly even submit first draft to LKML and folks, get reviews!

week after (nov 1-7):
think up how to do module unloading support; logistics of pinning the subsystems when loaded and letting them go when a hierarchy is unmounted. possibly begin implementing this thing. possibly consider any reviews gotten on LKML for first submission.

nov 8-14:
work should be moving along solidly on module unloading and/or fixing lkml reviews.

nov 15-21:
one or both of above should be finished. shoot for another submission to lkml around this time?

nov 22-28:
if not lkmled last week, module unloading should be first-draft done and thinged this week.

nov 29-dec 5:
rest of semester should be dedicated to finalizing everything and making the critics from lkml happy

grading criteria! _o/

C: idea rejected or otherwise falls apart somehow, implementation turns out to be very shaky, didn't get any shininess done on top of the rudimentary stuff.

B: implementation possibly a little shaky, the lkml dudes don't like it yet, a sizeable amount more work to be done before it can be called a real feature, not a lot of shininess. alternatively, a bare rudimentary implementation taken by lkml but with nothing shiny at all (i.e., pretty much what functionality i have now and nothing more)

A: implementation solid, most likely accepted to lkml by the end of semester, or if not, should be clearly on its way to that soon. at least a moderate amount of shininess, whether from module unloading or otherwise, should be present.

A++++ with a hug and a star-shaped sticker: shines more brilliantly than the sun, great features, accepted into kernel for sure by end of semester, works flawlessly and highly lauded by big-name developers as great development in computing. nobel peace prize possibly awarded.

2009-10-16

wrestling dragons

so, next goal after doing a dummy subsystem is to start on the existing builtin subsystems. at least one should end up being modularized, and for others that I examine, analysis should be given on why they can't or would be difficult to be made into modules. so a quick glance over the existing subsystems turned into surprise progress when I discovered that net_cls (which additionally goes by cls_cgroup, net_cls_cgroup, and in all likelihood additional permutations thereof) not only is restricted to one file (devices and ns for example have miscellaneous function calls to them with ifdefs, in contrast), but also already has module_init() et.al. declarations at the bottom!

After a bit of hacking around with cgroup_load_subsys() (my new function) and the subsys_id, and also changing the Kconfig option to tristate, the build system happily modulificatarized net_cls for me. Booting up UML, I realized it was probably time to figure out how to make modprobe do the right thing - as it turns out, UML's documentation is very helpful in this regard. A quick hostfs mount and depmod later, and the module loads and runs just fine. Victoly!

In other news, binding GDB to UML yields some frightening results:

0x00007f614c38b420 in nanosleep () from /lib64/libc.so.6
(gdb) break cgroup_load_subsys
Breakpoint 1 at 0x60051819: file kernel/cgroup.c, line 3619.
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
memcpy () at arch/um/sys-x86_64/../../x86/lib/memcpy_64.S:68
68 movq %r11, 0*8(%rdi)
Current language: auto; currently asm
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f614c36911b in memset () from /lib64/libc.so.6
(gdb)
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f614c36873e in memset () from /lib64/libc.so.6
(gdb)
Continuing.

Breakpoint 1, cgroup_load_subsys (ss=0x62828fe0) at kernel/cgroup.c:3619
3619 if (ss->fork || ss->exit)
Current language: auto; currently c
(gdb)

2009-10-11

a module loading adventure of the "great success" variety

okay! so, yesterday, and the day before that, and a little bit today, but mostly yesterday (in fact, for just about all of yesterday), I put together two patches in my stgit tree which I called cgroups-revamp-subsys-array.patch and cgroups-subsys-module-interface.patch, satisfying (in a very rudimentary way) #2 and #3 from the roadmap. (#1 turned out to be something I needed to keep but work around anyway... details not important now.) I made sure they compiled and went to bed.

Today, after looking at ns_cgroup.c and devices_cgroup.c and realizing that they couldn't really be modularized easily, I threw together a skeleton subsystem modeled after the other ones that adds a file "hax" that wraps a global variable whose value determines whether you can attach tasks to the cgroup or not. All right, now let's follow this guide that elly pointed me at to get it to build as a module...

WARNING: "cgroup_load_subsys" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!
WARNING: "cgroup_add_file" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!

Well, they're just warnings, so try loading the module anyway, right? (Note: I use insmod instead of modprobe because the latter wants infrastructure and dependencies, and the former can just take any random file from the filesystem.)

livecd / # mount -t hostfs none -o /home/bblum/412 /mnt/host/
livecd / # cd /mnt/host/hax/
livecd hax # insmod cgroup_test1.ko
cgroup_test1: Unknown symbol cgroup_load_subsys
cgroup_test1: Unknown symbol cgroup_add_file
insmod: error inserting 'cgroup_test1.ko': -1 Unknown symbol in module

It was worth a shot, though. Turns out I need to EXPORT_SYMBOL(...) everything I'll need for the module in kernel/cgroup.c. For now, I just do the functions my subsystem uses; later, I'll need to worry about functions that -any- subsystem might use. Next:

livecd hax # insmod cgroup_test1.ko
cgroup_test1: version magic '2.6.31-rc9-mm1-gf013913 mod_unload ' should be '2.6.31-rc9-mm1-ge40e265 mod_unload '
insmod: error inserting 'cgroup_test1.ko': -1 Invalid module format

It took too long to realize that the kernel I'd most recently booted somehow had something different enough to change the vermagic string from the most recent time I'd built it, which is what I'd built the module against. Okay. Rebooting UML, and going to get it right this time.

livecd hax # insmod cgroup_test1.ko
Kernel panic - not syncing: Kernel mode signal 4
Modules linked in: cgroup_test1(+)
Segmentation fault

I had deliberately left out the ".module = THIS_MODULE" line in test1_subsys when first building it, to see what would happen when cgroup_load_subsys tried to pin the module... and promptly forgotten about it. Putting the line in, finally, and:

livecd dev # lsmod
Module Size Used by
livecd dev # mount -t cgroup none -o test1 cgroup/
mount: special device none does not exist
livecd dev # insmod /mnt/host/hax/cgroup_test1.ko
livecd dev # lsmod
Module Size Used by
cgroup_test1 2512 1 [permanent]
livecd dev # mount -t cgroup none -o test1 cgroup/
livecd dev # ls cgroup/
cgroup.procs notify_on_release release_agent tasks test1.hax
livecd dev # mkdir cgroup/foo
livecd dev # echo $$ > cgroup/foo/tasks
bash: echo: write error: Operation not permitted
livecd dev # echo 42 > cgroup/foo/test1.hax
livecd dev # echo $$ > cgroup/foo/tasks
livecd dev #

:)

2009-10-08

roadmap

all right, so it seems like a good idea to map out the ideas and targets I've got in my head, for several reasons. Here's what I've determined I should be doing.

1) Fork/exit callbacks need to go. It's this functionality that cgroups has offered since (presumably) it first hit mainline in which a subsystem can set itself up to get a function called whenever a task forks or exits. Apparently, no subsystem has ever used it, and the presence of it here is going to interact funnily with module-loadable subsystems, so - at the suggestion and approval of Paul - I'm going to strip all callback code out of cgroups. This will be done as a pre-patch to the main patch series I plan on generating.

2) Changing how subsys[] is used.
a) At the bottom of the array will be the entries for builtin subsystems, which will be there at link-time, up until CGROUP_BUILTIN_SUBSYS_COUNT. CGROUP_SUBSYS_COUNT, which used to be that, is now defined as the size of the subsys_bits field in cgroupfs_root (i.e., 32 or 64), and is still the max size of the array. At link time, all entries between the builtin count and the total count will be NULL, and that's where module subsystems will put themselves. (This is done.) Also, the array will need to be surrounded in a rwlock, since when a subsystem registers itself it will need to take a subsys_id. (This is not done.)
b) All code throughout cgroups needs to be able to handle when a subsystem is gone. Each loop that iterates down the array will need to have a check for null pointers (this is done) and take the read-lock (this is not done). There may also be other things that certain loops need to do, situationally - this is as yet unclear.

3) cgroup_init_subsys() needs to be revised to be suitable as a module initcall. It needs to be able to handle failures correctly (the current version will kpanic on initialization fail, since it's assumed to call at boot time only). Of course, because some subsystems will be left as builtins, we'll still need a version suitable for calling at boottime - probably just a wrapper around the adapted module initcall. Also, we'll need to be concurrency-safe now - obviously around the subsys array, and possibly in the other various things that the function does. Among other things, when the module is loaded OR when the module is mounted on a cgroup hierarchy (see note at end of post) we'll need to pin it with try_module_get() to make sure it doesn't go away.

Once we hit this point, it can be said that cgroups has support for modular subsystems. Next, we do the whole "confirming" thing:

4) adapt one or more subsystems to become modules, or perhaps write a new skeleton one for testing, or both. in order to be a module (suppose your module is "foo" as CONFIG_FOO), you need to do the following things:
a) instead of having code interspersed in other code with stuff like #ifdef CONFIG_FOO, it has to be all in the same file (since each .o file is either going to be a builtin or a module). in the kconfig, you need to specify that it's buildable as a module, and in the makefile, you need to make sure that the config file corresponds to the right source file.
b) you need to register a bunch of stuff with the module_suchandsuch() macros - like name, version, author, and most importantly module_init() and module_exit(), which define what functions are called at module load and unload time. (the infrastructure behind this and these macros is a lot of hax.)

I am uncertain whether I'll end up supporting module unloading for cgroups - it seems like it would be useful, given that we have a limit on the number of subsystems loaded at a time. I think this would involve making sure that subsystems can't be unloaded while attached to any mounted hierarchy, but can when not. This likely will necessitate use of cgroup_lock. If we do this, we'll end up pinning the module when we mount a hierarchy - there will be a race here if somebody's trying to unload the module at the same time, so when mounting, pinning all subsystems will have to be done before committing to the mount.

The alternative approach - and the one that I'll go with to begin with, for sure - is to just say nope, never unload, and the module is pinned forever as soon as it's loaded.

Linux CGroups: Subsystems as Modules