Linux CGroups: Subsystems as Modules: mmotm


2011-04-12

double-double-toil-and-trouble-check locking

After nearly two years of on-and-off work, my cgroups threadgroup interface patches have attained a sufficiently polished state and been accepted into the mmotm tree. This is super exciting! It was hard, interesting, enlightening work from beginning to end, and was the most enjoyable experience I've ever had learning about real-world code:

what kinds of tricks, rules, and infrastructure must go into maintaining a million-line codebase with as many contributors
how to write code that is both complicated and perfect: no symbol may be out of place in either aesthetic style or correct functionality
how to write code that interacts with more parts of a codebase than whose details you can possibly understand; how to intuit what parts you will bump into meaningfully and know exactly where to look for details that will tell you how to negotiate the right interaction
justifying the usefulness of your code before anybody will even consider pulling from you

So, props to the core linux guys: they run a very tight ship, but living conditions are pleasant, and it sails to great places. Having my patches pulled is the ultimate validation that my work was not only interesting and challenging but also new and useful - a full-house of project qualities rarer even than talented programmers. It has been a blast.

What is the big deal? These patches solved an interesting and complicated problem in an interesting and complicated way. The challenge was to atomically migrate all tasks in a threadgroup (in non-linux speak, "threads in a multithreaded process") from one cgroup to another - no activity may cause a thread in that group to remain outside of the target cgroup at the end of the operation (think interleaving fork/exec/exit dances). Here is the landscape of thread management in linux:

Each task_struct ("thread") is on a list for its group, called tsk->thread_group, and also has a pointer to the leader thread of the group, called tsk->group_leader (the leader's leader pointer points to himself). Further, the leaders of all groups are on another system-wide list. The standard way to iterate over these lists is with macros called do_each_thread and while_each_thread. All these lists and pointers are protected by RCU and/or the global tasklist_lock, which are atomic-context synchronisation primitives.
In fork(), the new thread is added to the list with tasklist_lock held, though cgroup modifications are done outside of that lock.
In exit(), the PF_EXITING flag is set in tsk->flags, and later, the thread falls off of the list (under RCU). It is unclear what happens to the other threads in a group when its leader exits, but it seems possible for them to stick around.
In exec(), if the process is multithreaded, the calling thread will kill all other threads in the group (by way of zap_other_threads(), which delivers a SIGKILL). Further, if the calling thread is not the group leader, it will steal group leadership from the current leader. The old leader's leader pointer is updated, but other threads' are not, because they are going away - though they may not necessarily fall off of the group list until their signal is processed!
For any task_struct, you can grab a reference on it to prevent it from being freed with get_task_struct(). This does not prevent the task from exiting, nor does it prevent it from falling off of the thread list.

I spent a long time scrambling around trying to do every per-thread operation in cgroup_attach_proc() (my operation) while in RCU read-side. Since you're not allowed to do any blocking operation while in an atomic context, this led me to much yak-shaving, tracking down things like open-coded calls to schedule() and NUMA memory migrations (erk!). I realised the best way was simply to kmalloc() an array as big as the threadgroup, snapshot refcounted pointers for each task_struct into it, and iterate over that instead. Still, since this requires dynamic memory allocation, we need to drop all locks which might protect the thread list between when we obtain a reference on the leader and when we go to iterate over his group list. (This will be important later.)

The fork() problem is obvious: a race between a forking thread (using CLONE_THREAD) and cgroup_attach_proc() may cause the new thread to be left in the old cgroup if the latter happens between when fork() copies the parent thread's old cgroups pointer and when it adds the thread to the tasklist. I solved this by using an rwlock (called "rwsem" in linux-land) which forking threads take in read-mode. There is an issue here, though: fork() is such a hot-path that introducing any shared memory access is dubious, and taking an extra lock (even in read mode!) entails writing to shared memory. The performance regression appears when multiple processors contend for the memory: they have to synchronise their caches, and possibly shoot down the other's entry. There is not much to be done about this if you need to take a lock (almost true: in linux-2.4, there used to exist big-reader locks for solving this problem), so we simply seek to place our lock in memory that is already contended for. Fortunately, the signal_struct is shared among the group, has an appropriate lifespan, and has a reference counter that already bounces.

The exec() problem is somewhat more subtle. Recall from above that we need to drop all locks before allocating an array to snapshot the threadgroup into - when we originally found the task_struct to do our operation on, we knew that it was the leader, but what if when we drop locks before allocating, some other thread does exec() and steals leadership from us? Other threads in the group may exit, causing our links in the threadgroup list to become invalid, making iterating over the group unsafe. Furthermore, since the only thread we hold a reference on is the original thread, we can't even guarantee that following our updated tsk->group_leader pointer is safe.

Hence, the only safe way to snapshot the whole threadgroup after allocating an array big enough to do so is to not only take the RCU lock, but also to re-check whether we are still the group leader after doing so. If we are, then the group list is safe to iterate over, but if we are not, we must drop everything and start over - to look up by TGID whoever is the leader of the group (if it even still exists) again. So we have a check for the threadgroup leader when finding his task_struct to begin with, another check for the consistency of the list upon entering the RCU critical section for the second time, and a loop around the whole thing that retries for as long as it takes for no exec() race to occur. I call this algorithm "double-double-toil-and-trouble-check locking".
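
The retry loop above can be sketched in user-space C. This is only a single-threaded model under stated assumptions, not the kernel code: struct group, simulate_exec() and snapshot_group() are hypothetical names, integer ids stand in for refcounted task_struct pointers, and the pending_execs knob fakes an exec() that lands in the window where the locks are dropped for allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TASKS 8

/* Hypothetical, simplified thread group (not the kernel's data structures). */
struct group {
    int task_ids[MAX_TASKS];  /* stand-ins for task_struct pointers */
    size_t ntasks;
    int leader_id;            /* identity of the current group leader */
    int pending_execs;        /* test knob: exec() races still to inject */
};

/* A non-leader exec()s: it steals leadership and every other thread dies. */
static void simulate_exec(struct group *g)
{
    int survivor = g->task_ids[g->ntasks - 1];

    g->task_ids[0] = survivor;
    g->ntasks = 1;
    g->leader_id = survivor;
}

/* Snapshot the group into a freshly allocated array, retrying on exec races. */
static int snapshot_group(struct group *g, int **out, size_t *n_out,
                          int *retries)
{
    for (;;) {
        /* check 1 (imagine the lock held): who leads, and how big is it? */
        int leader = g->leader_id;
        size_t want = g->ntasks;

        if (want == 0)
            return -1;                    /* the group no longer exists */

        /* locks dropped here: allocation may sleep, exec() may race */
        int *snap = malloc(want * sizeof(*snap));
        if (snap == NULL)
            return -1;
        if (g->pending_execs > 0) {
            g->pending_execs--;
            simulate_exec(g);
        }

        /* check 2 (lock retaken): is our leader still the leader? */
        if (g->leader_id == leader && g->ntasks <= want) {
            memcpy(snap, g->task_ids, g->ntasks * sizeof(*snap));
            *out = snap;
            *n_out = g->ntasks;
            return 0;
        }

        /* leadership was stolen: drop everything and start over */
        free(snap);
        (*retries)++;
    }
}
```

With one injected exec() race the first pass is thrown away and the second pass snapshots the one surviving thread; with none, the first pass succeeds.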

2010-12-26

lesson: never hack when it won't solve the entire problem.

I'd like to share some code I wrote for a particularly misguided solution to a yak-shaving quest. Fortunately, soon afterwards, I realized a much better way to dodge the bullet that I was here trying to take in the chest and stagger back to camp while screaming for help... so this code will never see the light of day.

That's good, because I also think it has a fatal flaw. I won't say it, but I'll give an exposition so you can figure it out. Here's the idea:

You are not allowed to block, sleep, or otherwise yield while in an "atomic" section of code (usually because it means preemption or interrupts are disabled). In order to safely iterate over a thread-group, you need to hold either rcu_read_lock or tasklist_lock, both of which are low-level locks and constitute "atomic" code while being held.
In kernel/cpuset.c, cpuset_change_task_nodemask has two steps, and a check between those steps which does a yield to synchronize its internal state (as part of an open-coded replacement for a proper concurrency primitive... ugh. from commit c0ff7453bb5c7c98e0885fb94279f2571946f280 on mmotm). Here I have changed it to return -EAGAIN instead of yielding, so the caller can synchronize "more appropriately".
get_task_struct and put_task_struct manage a thread's reference count. After you've done get_task_struct, it will always be safe to access that memory, until you release it. If you call put_task_struct and you're the last one with a reference count, it does a bunch of cleanup. (interesting read here, perhaps a hint.)

        /*
         * This particular per-task operation requires being able to sleep, so
         * it can't be done in the attach_task callback, where sleeping is
         * forbidden. See cgroup_attach_proc.
         */
        if (threadgroup) {
                struct task_struct *c = NULL;
again:
                rcu_read_lock();
                /* ensure safe thread-group traversal */
                if (thread_group_leader(tsk)) {
                        list_for_each_entry_rcu(c, &tsk->thread_group,
                                                thread_group) {
                                get_task_struct(c);
                                ret = cpuset_change_task_nodemask(c,
                                                &cpuset_attach_nodemask_to);
                                /*
                                 * on a failed nodemask change, we need to exit
                                 * the atomic section and start over.
                                 */
                                if (ret == -EAGAIN) {
                                        rcu_read_unlock();
                                        yield();
                                        goto again;
                                }
                                put_task_struct(c);
                        }
                        rcu_read_unlock();
                } else if (c != NULL) {
                        /*
                         * if racing with exec caused an abort, we need to
                         * finish the rebind operation in progress on it.
                         */
                        rcu_read_unlock();
                        do {
                                ret = cpuset_change_task_nodemask(c,
                                                &cpuset_attach_nodemask_to);
                                if (ret == -EAGAIN)
                                        yield();
                        } while (ret == -EAGAIN);
                        put_task_struct(c);
                }
        } else {
                do {
                        ret = cpuset_change_task_nodemask(tsk,
                                        &cpuset_attach_nodemask_to);
                        if (ret == -EAGAIN)
                                yield();
                } while (ret == -EAGAIN);
        }

2009-12-21

I take two refcounts before I take two refcounts, and then I take two more

The best solution to the previously mentioned deadlock problem was determined to be having parse_cgroupfs_options take module reference counts (before sget takes the lock), and have rebind_subsystems drop them later. This division of work makes sure that the subsystems will stick around during sget so we can safely drop our own lock in the meantime.

So I was polishing up the deadlock-free version of the patch series today, and found a good bug. After a few routine changes and polishings, I decided to go through a bit more thorough testing than I'd done since implementing this change:

livecd dev # mount -t cgroup none -o test2,net_cls,devices,cpuacct cgroup/
livecd dev # ls cgroup/
cgroup.procs   cpuacct.usage_percpu  devices.list       release_agent
cpuacct.stat   devices.allow         net_cls.classid    tasks
cpuacct.usage  devices.deny          notify_on_release  test2.hax
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Simple enough. Now, cgroups has this functionality where you can't destroy a hierarchy that has child cgroups. (Child cgroups are represented as subdirectories in the filesystem tree, and are made just as you would expect.) It does, however, let you unmount it:

livecd dev # mkdir cgroup/foo
livecd dev # umount cgroup/
livecd dev #

In which case the hierarchy sticks around, invisible. You can make it reappear by mounting with the same list of subsystems; any attempt to mount one of its subsystems as part of a different set will fail.

livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Great - the modules' reference counts indicate that the hierarchy is still there. Let's bring it back again:

livecd dev # mount -t cgroup none -o cpuacct,test2,net_cls,devices cgroup/
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  2
cls_cgroup              5080  2
livecd dev #

Oops! The mounting code didn't realize that we weren't changing anything with respect to the subsystems, and we leaked a reference! Fortunately, with the new changes that I made, there was a simple fix - in cgroup_get_sb, added lines 1518-1519:

1426         if (root == opts.new_root) {
1427 +--- 85 lines: We used the new root structure, so this is a new hierarchy ---
1512         } else {
1513                 /*
1514                  * We re-used an existing hierarchy - the new root (if
1515                  * any) is not needed
1516                  */
1517                 cgroup_drop_root(opts.new_root);
1518                 /* no subsys rebinding, so refcounts don't change */
1519                 drop_parsed_module_refcounts(opts.subsys_bits);
1520         }
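
The pairing rule behind that fix can be modelled in a few lines of user-space C. This is a sketch under loud assumptions, not the kernel code: module_refs, existing_bits, mount_hierarchy() and the apply_fix switch are all hypothetical, and the decision to re-use a hierarchy is reduced to a bitmask comparison (the real code finds the existing superblock through sget).

```c
#include <assert.h>

#define NSUBSYS 4

/* Hypothetical model: one reference count per modular subsystem. */
static int module_refs[NSUBSYS];

/* Subsystem bits of the one hierarchy that currently exists (0 if none). */
static unsigned long existing_bits;

/* Models parse_cgroupfs_options: pin every requested subsystem up front. */
static void parse_take_refs(unsigned long bits)
{
    for (int i = 0; i < NSUBSYS; i++)
        if (bits & (1UL << i))
            module_refs[i]++;
}

/* Models drop_parsed_module_refcounts: give those pins back. */
static void drop_parsed_refs(unsigned long bits)
{
    for (int i = 0; i < NSUBSYS; i++)
        if (bits & (1UL << i))
            module_refs[i]--;
}

/* Models the mount path; apply_fix selects the buggy or the fixed version. */
static int mount_hierarchy(unsigned long bits, int apply_fix)
{
    parse_take_refs(bits);

    if (existing_bits == 0) {
        /* new hierarchy: the rebind keeps the parsed references */
        existing_bits = bits;
        return 0;
    }
    if (existing_bits != bits) {
        /* mismatched subsystem set: the mount fails, nothing is kept */
        drop_parsed_refs(bits);
        return -1;
    }
    /* re-used hierarchy: no rebinding, so the parsed refs must go back */
    if (apply_fix)
        drop_parsed_refs(bits);
    return 0;
}
```

Remounting the same set with apply_fix set leaves each count at 1; without it, every remount adds one more reference that nothing will ever drop.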

For extra credit, tell how this bug is also a security vulnerability!

2009-11-16

the cgroup infrastructure: a quick tour

throughout my work on cgroups I have had many moments in which I look at a struct definition or variable declaration or even a function call and have nothing to think but "uhhhhh??" there is a nontrivial amount of infrastructure in the cgroup world, and here I am going to attempt to do a quick run-through of how everything is set up.

struct cgroupfs_root: this represents the root of a cgroup hierarchy. it knows things like what subsystems are attached to it, the root cgroup in the hierarchy, and other trivialities like its own name. variables of this type are almost always called "root".

struct cgroup: represents a single cgroup. remember that each directory, looking through the VFS layer, is a cgroup, so when you 'mkdir cgroup/foo', a new cgroup is created. variables of this type can be seen most commonly as "cgrp", but also sometimes "cg" or simply "c".

struct cgroup_subsys: the heart of the subsystem API - function pointers for the subsystem's operations, and other various options. referred to usually simply as "ss".

struct cgroup_subsys_state: per-cgroup, per-subsystem state storage - when you write to a subsystem control file for a given cgroup, this is what hears about it. always called "css".

struct css_set: a collection of pointers to cgroup_subsys_state objects. each css_set is referenced by all tasks who use the given combination of subsystem states. any particular subsystem state for a given task is only likely to change if the task gets moved into a different cgroup in the hierarchy to which the subsystem is attached. the presence of these guys is purely an optimization, since multiple tasks can have the same combination of subsystem states, and will therefore use the same css_set. additionally, they are stored in a hashtable for quick access. in comments, this data structure is referred to as a "cgroup group", and variables tend to go by the name "cg", but sometimes "css" as well.

struct cg_cgroup_link: as per the comment, this is a "link structure for associating css_set objects with cgroups". each of these has a pointer to one struct cgroup and to one struct css_set, and lives at the intersection of two lists: the first list, per cgroup, links all css_sets associated with that cgroup; the second, per css_set, links all cgroups associated with that css_set - forming, as is described in Documentation/cgroups/cgroups.txt, a "lattice". they are what you want to use if you want to iterate either all css_sets in a cgroup or all cgroups in a css_set, and respond mostly to "link" but sometimes "cgl".
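
to make the "lattice" a bit more concrete, here is a stripped-down user-space sketch of the link structure. it is an illustration under assumptions, not the kernel's definitions: the _m-suffixed types and the helper names are made up, and plain singly-linked next pointers stand in for the kernel's list_head members.

```c
#include <assert.h>
#include <stddef.h>

struct cg_link_m;

/* Hypothetical, cut-down cgroup: just the head of its list of links. */
struct cgroup_m {
    struct cg_link_m *css_sets;
};

/* Hypothetical, cut-down css_set: just the head of its list of links. */
struct css_set_m {
    struct cg_link_m *cgroups;
};

/* One link sits on two lists at once: its cgroup's and its css_set's. */
struct cg_link_m {
    struct cgroup_m *cgrp;
    struct css_set_m *cg;
    struct cg_link_m *next_in_cgroup;   /* next css_set of the same cgroup */
    struct cg_link_m *next_in_css_set;  /* next cgroup of the same css_set */
};

/* Record that css_set cg is associated with cgroup cgrp. */
static void link_css_set(struct cg_link_m *link, struct css_set_m *cg,
                         struct cgroup_m *cgrp)
{
    link->cg = cg;
    link->cgrp = cgrp;
    link->next_in_cgroup = cgrp->css_sets;
    cgrp->css_sets = link;
    link->next_in_css_set = cg->cgroups;
    cg->cgroups = link;
}

/* Walk one axis of the lattice: all css_sets in a cgroup. */
static size_t count_css_sets(const struct cgroup_m *cgrp)
{
    size_t n = 0;

    for (const struct cg_link_m *l = cgrp->css_sets; l; l = l->next_in_cgroup)
        n++;
    return n;
}

/* Walk the other axis: all cgroups a css_set belongs to. */
static size_t count_cgroups(const struct css_set_m *cg)
{
    size_t n = 0;

    for (const struct cg_link_m *l = cg->cgroups; l; l = l->next_in_css_set)
        n++;
    return n;
}
```

each association costs exactly one link object, and either walk only ever touches the links on its own axis.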

2009-11-03

try_module_get vs delete_module

let me introduce you to my friend, whose name is try_module_get.

478 static inline int try_module_get(struct module *module)
479 {
480         int ret = 1;
481
482         if (module) {
483                 unsigned int cpu = get_cpu();
484                 if (likely(module_is_live(module))) {
485                         local_inc(__module_ref_addr(module, cpu));
486                         trace_module_get(module, _THIS_IP_,
487                                 local_read(__module_ref_addr(module, cpu)));
488                 }
489                 else
490                         ret = 0;
491                 put_cpu();
492         }
493         return ret;
494 }

he lives in include/linux/module.h, and he will get a reference count on the module for you, unless its state flag is set to MODULE_STATE_GOING. the get_cpu and put_cpu are SMP macros that disable/enable preemption so you can have a valid smp_processor_id.

great! now let's take a look over at a potential competitor, a system call in kernel/module.c by the name of delete_module. part of his code looks like this:

853         /* Stop the machine so refcounts can't move and disable module. */
854         ret = try_stop_module(mod, flags, &forced);
855         if (ret != 0)
856                 goto out;
857
858         /* Never wait if forced. */
859         if (!forced && module_refcount(mod) != 0)
860                 wait_for_zero_refcount(mod);

he can assist you in two different styles. the most common one is the "remove module immediately", which is what happens with rmmod usually. in this case, the O_NONBLOCK flag is specified. try_stop_module wants to set MODULE_STATE_GOING, and will behave differently depending on this flag.

if O_NONBLOCK is specified, try_stop_module will apply a very big hammer whose name is stop_machine. in this case, it will safely ensure that the reference count is zero (failing otherwise), and then set the MODULE_STATE_GOING flag. this is wonderful: because of the stop_machine hammer, there will be no problems racing with our first friend, try_module_get.

there is another way to invoke rmmod, which is with the --wait flag. if this is specified, try_stop_module will set MODULE_STATE_GOING without worrying about the refcount, and then delete_module will wait for the reference count to drop to zero. the keen-eyed systems hacker will at this point worry, what if we get the following scheduling pattern?

0) module state = MODULE_STATE_LIVE; refcount = 0
1) try_module_get checks module is alive, and succeeds (line 484)
2) delete_module sets MODULE_STATE_GOING flag (line 854)
3) delete_module waits until the refcount is zero, and finishes (line 860)
4) try_module_get increments the refcount (line 485).

not to worry, keen-eyed systems hacker! you will note that our clever friend try_module_get disables preemption on its CPU as it runs. this guarantees that he will not be descheduled during that bit of his code, and therefore, through the wonderful phenomenon of "very small critical section", the problematic execution order won't happen.
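
a tiny user-space model of the rule both friends agree on, for the skeptical. it is only a sketch under assumptions: struct mod and the mod_* helpers are made-up names, the return codes are simplified, and each function runs as one indivisible step, which is exactly the property the real code gets from its small critical section. it shows the state machine, not the concurrency.

```c
#include <assert.h>

/* Hypothetical stand-ins for the module state flag and refcount. */
enum mod_state { MOD_LIVE, MOD_GOING };

struct mod {
    enum mod_state state;
    int refcount;
};

/* Like try_module_get: only a live module hands out a reference. */
static int mod_try_get(struct mod *m)
{
    if (m->state != MOD_LIVE)
        return 0;
    m->refcount++;      /* check and increment happen as one step here */
    return 1;
}

/* Give a reference back. */
static void mod_put(struct mod *m)
{
    m->refcount--;
}

/*
 * Like try_stop_module: in non-blocking mode, refuse while references
 * exist; otherwise mark the module as going so that no new references
 * can be taken.
 */
static int mod_try_stop(struct mod *m, int nonblock)
{
    if (nonblock && m->refcount != 0)
        return -1;
    m->state = MOD_GOING;
    return 0;
}
```

the bad schedule from the list needs the check and the increment in mod_try_get to be split apart by the state change; as long as they stay one step, a get either lands before the module is marked as going or fails.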

2009-10-11

a module loading adventure of the "great success" variety

okay! so, yesterday, and the day before that, and a little bit today, but mostly yesterday (in fact, for just about all of yesterday), I put together two patches in my stgit tree which I called cgroups-revamp-subsys-array.patch and cgroups-subsys-module-interface.patch, satisfying (in a very rudimentary way) #2 and #3 from the roadmap. (#1 turned out to be something I needed to keep but work around anyway... details not important now.) I made sure they compiled and went to bed.

Today, after looking at ns_cgroup.c and devices_cgroup.c and realizing that they couldn't really be modularized easily, I threw together a skeleton subsystem modeled after the other ones that adds a file "hax" that wraps a global variable whose value determines whether you can attach tasks to the cgroup or not. All right, now let's follow this guide that elly pointed me at to get it to build as a module...

WARNING: "cgroup_load_subsys" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!
WARNING: "cgroup_add_file" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!

Well, they're just warnings, so try loading the module anyway, right? (Note: I use insmod instead of modprobe because the latter wants infrastructure and dependencies, and the former can just take any random file from the filesystem.)

livecd / # mount -t hostfs none -o /home/bblum/412 /mnt/host/
livecd / # cd /mnt/host/hax/
livecd hax # insmod cgroup_test1.ko
cgroup_test1: Unknown symbol cgroup_load_subsys
cgroup_test1: Unknown symbol cgroup_add_file
insmod: error inserting 'cgroup_test1.ko': -1 Unknown symbol in module

It was worth a shot, though. Turns out I need to EXPORT_SYMBOL(...) everything I'll need for the module in kernel/cgroup.c. For now, I just do the functions my subsystem uses; later, I'll need to worry about functions that -any- subsystem might use. Next:

livecd hax # insmod cgroup_test1.ko
cgroup_test1: version magic '2.6.31-rc9-mm1-gf013913 mod_unload ' should be '2.6.31-rc9-mm1-ge40e265 mod_unload '
insmod: error inserting 'cgroup_test1.ko': -1 Invalid module format

It took too long to realize that the kernel I'd most recently booted somehow had something different enough to change the vermagic string from the most recent time I'd built it, which is what I'd built the module against. Okay. Rebooting UML, and going to get it right this time.

livecd hax # insmod cgroup_test1.ko
Kernel panic - not syncing: Kernel mode signal 4
Modules linked in: cgroup_test1(+)
Segmentation fault

I had deliberately left out the ".module = THIS_MODULE" line in test1_subsys when first building it, to see what would happen when cgroup_load_subsys tried to pin the module... and promptly forgotten about it. Putting the line in, finally, and:

livecd dev # lsmod
Module Size Used by
livecd dev # mount -t cgroup none -o test1 cgroup/
mount: special device none does not exist
livecd dev # insmod /mnt/host/hax/cgroup_test1.ko
livecd dev # lsmod
Module Size Used by
cgroup_test1 2512 1 [permanent]
livecd dev # mount -t cgroup none -o test1 cgroup/
livecd dev # ls cgroup/
cgroup.procs notify_on_release release_agent tasks test1.hax
livecd dev # mkdir cgroup/foo
livecd dev # echo $$ > cgroup/foo/tasks
bash: echo: write error: Operation not permitted
livecd dev # echo 42 > cgroup/foo/test1.hax
livecd dev # echo $$ > cgroup/foo/tasks
livecd dev #

:)

2009-10-06

understanding is powered by magic

I spent a good hour or two this afternoon poring over the kernel's module build infrastructure (not daring to look at any .c files, of course), looking at various other module code trying to take them as examples. Something clicked in my head while looking at include/linux/init.h, which it hadn't done before, presumably because I hadn't looked at module examples, and suddenly I understood how things wanted to be compiled as modules and have initcalls/exitcalls registered.

It seems (read: I've been advised) that each subsystem will want to have a pointer to its struct module, so it can keep it "pinned" while the subsystem is loaded. This raises an interesting question: where the hell does the module struct come from? include/linux/module.h has a macro called THIS_MODULE which references extern struct module __this_module. Looking at a few examples, some of them have foo.mod.c files, which all look to have the same struct declaration (with perhaps a few differences, namely in the "depends=" string at the end). However, modules that I found living in kernel/ don't tend to have that, though everything else (use of the initcall macros, etc) was the same. Where does this mystery struct come from? A grep through the standard directories revealed nothing; grepping the whole source tree discovered the file scripts/mod/modpost.c which... generates a header file for module code with the relevant struct information. As in, "buf_printf(b, "struct module __this_module\n");" with surrounding context. Ugh.

I need to learn to start trusting the macros (THIS_MODULE, in this case) that look like complete hax instead of trying to figure out what the hax are.

As a note to myself, the CONFIG_CGROUP option settings live in init/Kconfig, and to enable modularization on an option you need to change 'bool' to 'tristate'.

2009-09-13

UML

I spent a good portion of today (and small amounts of time previously) trying to get a good environment for this set up. This mostly involved getting the mm-of-the-moment (mmotm) tree and making UML (user-mode linux, the way i will test most of the stuff) actually go properly.

choosing an environment:
magrathea (my laptop): runs 64-bit. last time i tried gitting mmotm and building the kernel, it failed. :iiam:
maximegalon (my server): runs 32-bit. comparatively slow processor, so build times suboptimal.
unix.andrew servers: AFS volume only 1GB, mmotm tree with built objects in it is larger. I was given the suggestion to ask for a project volume, but :effort:

I eventually discovered that the build failure on magrathea was bash's fault. Version 4 strips environment variables with a "." in the name, and the kernel's makefiles depend on not-that-behaviour. During my googling that determined this, I found a mailing list discussing design implications and whether or not bash-4 should actually do that anyway or not. Magrathea now has bash-3 installed, and the kernel builds nicely.

The next thing I needed to do was obtain a filesystem to run UML on. I at first tried just laying it on top of my laptop's root filesystem (with ./linux root=/dev/root rootfstype=hostfs rootflags=/ rw), but my current gentoo installation's boot sequence is fancy enough that UML complained on a lot of the steps. Also, running as a regular user forbids it from writing to anything that it needs to be able to write to, and it furthermore ran into permissions conflicts with /bin/mount, which I sort of need a lot for cgroups devel.

Next I tried pulling a debian root filesystem from maximegalon, and running with that. That gave me the mystery "request_module: runaway loop modprobe binfmt-464c" (several times in a row, then hangs) immediately after mounting the root fs. I didn't examine it and instead looked for an alternative approach, finding rootstrap, a script designed to automatically build a UML root fs from scratch. It failed instantly complaining about mount's permissions (same as booting with my laptop's root). I next pulled down a prebuilt slackware root fs from UML's website - same binfmt-464c error. This time I googled it and discovered that that's what happens when you feed a 64-bit kernel (UML doesn't seem to have CONFIG_IA32_EMULATION for some reason?) a 32-bit filesystem.

With newfound resolve, I pulled down the amd64 minimal install CD ISO from Gentoo's website, loopback-mounted it, found the squashfs image therein, discovered my host kernel didn't have squashfs, built it as a module, then mounted the root filesystem and booted it. Same problem with /bin/mount as before. Okay, well, I'll change the permissions on this one since it's not on a filesystem I care about. At least, I would have done that, if the loopback-mounted squashfs image within the loopback-mounted ISO could at all be made not readonly. Instead I made myself a new filesystem (dd if=/dev/zero of=/tmp/uml_root bs=1024 count=1 seek=$((1024*1024-1)); /sbin/mke2fs -j /tmp/uml_root) and copied it over. That seemed to boot fine, until I got to the end, when after "Starting local ... [ ok ]" (the last message in the boot sequence before agetty runs), it printed a lot of "IRQF_DISABLED is not guaranteed on shared IRQs" messages and hung (and spawned a lot of xterms on the host kernel, too. wtf?). Google revealed no useful information, so I figured that as this was the homestretch a bit of hackery was allowed: I opened up /etc/conf.d/local.start and appended exec /bin/bash. Now when it booted up, it ended with a root shell.

Just to make sure everything was in working order - i.e., appropriate for doing devel with - I went to go mount cgroups (mkdir /dev/cgroup; mount -t cgroup none /dev/cgroup) and have a look. There I see the standard cgroups and subsystem control files, and also something I didn't expect to see: /dev/cgroup/cgroup.procs. That's the file implemented by the patch series I put together over the summer, which I submitted to LKML a couple of times and then somewhat lost track of once Paul (my Google mentor) took over its custody when I left. Looks like it got into the tree after all - I'm taking that as a sign of hope for this project :)
