Linux CGroups: Subsystems as Modules: 2009

2009-12-31

2009-12-21

I take two refcounts before I take two refcounts, and then I take two more

The best solution to the previously mentioned deadlock problem was determined to be having parse_cgroupfs_options take module reference counts (before sget takes the lock), and have rebind_subsystems drop them later. This division of work makes sure that the subsystems will stick around during sget so we can safely drop our own lock in the meantime.

So I was polishing up the deadlock-free version of the patch series today, and found a good bug. After a few routine changes and polishings, I decided to go through a bit more thorough testing than I'd done since implementing this change:

livecd dev # mount -t cgroup none -o test2,net_cls,devices,cpuacct cgroup/
livecd dev # ls cgroup/
cgroup.procs   cpuacct.usage_percpu  devices.list       release_agent
cpuacct.stat   devices.allow         net_cls.classid    tasks
cpuacct.usage  devices.deny          notify_on_release  test2.hax
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Simple enough. Now, cgroups has this functionality where you can't destroy a hierarchy that has children cgroups. (Children cgroups are represented as subdirectories in the filesystem tree, and are made just as you would expect.) It does, however, let you unmount it:

livecd dev # mkdir cgroup/foo
livecd dev # umount cgroup/
livecd dev #

In which case the hierarchy sticks around, invisible. You can make it reappear by mounting with the same list of subsystems; any attempt to mount one of its subsystems with a mismatched set of subsystems will fail.

livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Great - the modules' reference counts indicate that the hierarchy is still there. Let's bring it back again:

livecd dev # mount -t cgroup none -o cpuacct,test2,net_cls,devices cgroup/
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  2
cls_cgroup              5080  2
livecd dev #

Oops! The mounting code didn't realize that we weren't changing anything with respect to the subsystems, and we leaked a reference! Fortunately, with the new changes that I made, there was a simple fix - in cgroup_get_sb, added lines 1518-1519:

1426         if (root == opts.new_root) {
1427 +--- 85 lines: We used the new root structure, so this is a new hierarchy ---
1512         } else {
1513                 /*
1514                  * We re-used an existing hierarchy - the new root (if
1515                  * any) is not needed
1516                  */
1517                 cgroup_drop_root(opts.new_root);
1518                 /* no subsys rebinding, so refcounts don't change */
1519                 drop_parsed_module_refcounts(opts.subsys_bits);
1520         }
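
To make the leak concrete, here is a small userspace model of the bookkeeping. It is only a sketch: module_refs, parse_options, drop_parsed_refs and mount_hierarchy are stand-in names I made up for illustration, not the kernel functions themselves, and the real code does this under locks that are left out here.

```c
#include <assert.h>

#define NSUBSYS 4

/* Stand-in for the per-module reference counts (hypothetical model). */
static int module_refs[NSUBSYS];

/* Models parse_cgroupfs_options: pin every requested subsystem up front. */
static void parse_options(unsigned long subsys_bits)
{
	for (int i = 0; i < NSUBSYS; i++)
		if (subsys_bits & (1UL << i))
			module_refs[i]++;
}

/* Models drop_parsed_module_refcounts: give back what parse_options took. */
static void drop_parsed_refs(unsigned long subsys_bits)
{
	for (int i = 0; i < NSUBSYS; i++)
		if (subsys_bits & (1UL << i)) {
			assert(module_refs[i] > 0);
			module_refs[i]--;
		}
}

/*
 * Models the tail of cgroup_get_sb. A new hierarchy keeps the references
 * (they now belong to the bound subsystems); a reused one must drop them.
 */
static void mount_hierarchy(unsigned long subsys_bits, int reused_existing)
{
	parse_options(subsys_bits);
	if (reused_existing)
		drop_parsed_refs(subsys_bits);
}
```

Mounting subsystems 0 and 1 once and then remounting the same (reused) hierarchy leaves both counts at 1; without the drop in the reuse branch they would climb to 2, which is what lsmod showed above.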

For extra credit, tell how this bug is also a security vulnerability!

2009-12-09

I learned something new today.

http://lkml.org/lkml/2009/12/9/30

One possible solution - modify cgroup_get_sb to do as follows:

1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.

Now, note, for clarification:

1508 static struct file_system_type cgroup_fs_type = {
1509         .name = "cgroup",
1510         .get_sb = cgroup_get_sb,
1511         .kill_sb = cgroup_kill_sb,
1512 };

The issue is that s->s_umount is taken in get_sb while it already has the subsys_mutex, whereas kill_sb is called with s->s_umount already held, and deadlock comes from there. There's a good question lurking here, and it is "Who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has the lock taken before it's called?"

So yes, there's locking order violation, but if functions were locks (which is a reasonable comparison because isn't it intuitive to have a function that takes a lock at the start and drops it at the end?) then the blame would be on whoever wrote this interface to begin with.
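
For the record, here is a toy model of the ordering problem and of the dance from the numbered steps above. It is a sketch under loud assumptions: the "locks" are just bits in a mask, the ranks are my own invention (s_umount before subsys_mutex, because that is the order kill_sb is forced into), and nothing here is real kernel code.

```c
#include <assert.h>

/* Hypothetical lock ranks: a lower rank must be taken first. */
enum { RANK_S_UMOUNT = 1, RANK_SUBSYS_MUTEX = 2 };

static int held_mask;        /* bit n set => the lock of rank n is held */
static int order_violations; /* times a lock was taken out of order */

static void take(int rank)
{
	/* Taking a lock while a higher-ranked one is held inverts the order. */
	if (held_mask >> (rank + 1))
		order_violations++;
	held_mask |= 1 << rank;
}

static void drop(int rank)
{
	assert(held_mask & (1 << rank));
	held_mask &= ~(1 << rank);
}

/* kill_sb is entered with s_umount already held, then takes subsys_mutex. */
static void kill_sb_model(void)
{
	take(RANK_S_UMOUNT);
	take(RANK_SUBSYS_MUTEX);
	drop(RANK_SUBSYS_MUTEX);
	drop(RANK_S_UMOUNT);
}

/* The problematic shape: sget() takes s_umount while subsys_mutex is held. */
static void get_sb_naive(void)
{
	take(RANK_SUBSYS_MUTEX);  /* parse options */
	take(RANK_S_UMOUNT);      /* sget() */
	drop(RANK_S_UMOUNT);
	drop(RANK_SUBSYS_MUTEX);
}

/* The dance: drop subsys_mutex around sget(), then retake it and verify. */
static void get_sb_dance(void)
{
	take(RANK_SUBSYS_MUTEX);  /* steps 1-2: parse options */
	drop(RANK_SUBSYS_MUTEX);  /* step 3 */
	take(RANK_S_UMOUNT);      /* step 4: sget() */
	take(RANK_SUBSYS_MUTEX);  /* steps 5-6: retake, verify options */
	drop(RANK_SUBSYS_MUTEX);
	drop(RANK_S_UMOUNT);
}
```

In this model kill_sb_model and get_sb_dance never trip the counter, while get_sb_naive does. The price of the dance is the verify step: whatever was parsed in step 2 may be stale by step 6 and has to be checked again.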

2009-12-04

lkml submission #2

http://lkml.org/lkml/2009/12/4/53

2009-12-03

so what's this net_cls thing actually useful for?

Anand came to #cslounge today with an interesting question: "is there a logical equivalent of nice for network bandwidth management that i can easily apply to a process? I don't like being unable to browse the web or irc when I scp large amounts of stuff."

I was under the impression that there didn't actually exist a subsystem for controlling that sort of thing, only for "classifying" it. The net_cls subsystem has one control file, "classid", which lets you specify a network class for each cgroup, and it doesn't seem to do much: being a module, it has nothing else hooking into it. Then I found this, which explains the actual secret: the classid gets associated with sockets held by tasks in that cgroup, and bandwidth throttling can then be administered from -userspace-! Very slick, and keeps the kernel-side labour to a minimum, so much so that it can even be modularized.
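
As a side note on what goes into that file: as far as I understand the kernel's net_cls documentation, the classid is a 32-bit value of the form 0xAAAABBBB, where AAAA is the major handle and BBBB the minor handle of a tc class (so 0x100001 is the handle 10:1). A minimal sketch of packing and unpacking it, with helper names that are entirely my own:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Pack a tc-style major:minor handle into a net_cls classid value. */
static uint32_t classid_make(uint16_t major, uint16_t minor)
{
	return ((uint32_t)major << 16) | minor;
}

static uint16_t classid_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

static uint16_t classid_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffff);
}

/* Render the classid the way tc writes handles, in hex as "major:minor". */
static int classid_format(char *buf, size_t len, uint32_t classid)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%x:%x",
			(unsigned)classid_major(classid),
			(unsigned)classid_minor(classid));
}
```

The value produced by classid_make(0x10, 0x1) is what you would write into net_cls.classid to tag a cgroup's traffic for the tc class 10:1.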

Of course, Anand's particular situation had an easier solution, which is the -l option to scp which lets you specify a bandwidth limit explicitly.

2009-11-16

the cgroup infrastructure: a quick tour

throughout my work on cgroups I have had many moments in which I look at a struct definition or variable declaration or even a function call and have nothing to think but "uhhhhh??" there is a nontrivial amount of infrastructure in the cgroup world, and here I am going to attempt to do a quick run-through of how everything is set up.

struct cgroupfs_root: this represents the root of a cgroup hierarchy. it knows things like what subsystems are attached to it, the root cgroup in the hierarchy, and other trivialities like its own name. variables of this type are almost always called "root".

struct cgroup: represents a single cgroup. remember that each directory, looking through the VFS layer, is a cgroup, so when you 'mkdir cgroup/foo', a new cgroup is created. variables of this type can be seen most commonly as "cgrp", but also sometimes "cg" or simply "c".

struct cgroup_subsys: the heart of the subsystem API - function pointers for the subsystem's operations, and other various options. referred to usually simply as "ss".

struct cgroup_subsys_state: per-cgroup, per-subsystem state storage - when you write to a subsystem control file for a given cgroup, this is what hears about it. always called "css".

struct css_set: a collection of pointers to cgroup_subsys_state objects. each css_set is referenced by all tasks who use the given combination of subsystem states. any particular subsystem state for a given task is only likely to change if the task gets moved into a different cgroup in the hierarchy to which the subsystem is attached. the presence of these guys is purely an optimization, since multiple tasks can have the same combination of subsystem states, and will therefore use the same css_set. additionally, they are stored in a hashtable for quick access. in comments, this data structure is referred to as a "cgroup group", and variables tend to go by the name "cg", but sometimes "css" as well.

struct cg_cgroup_link: as per the comment, this is a "link structure for associating css_set objects with cgroups". each of these has a pointer to one struct cgroup and to one struct css_set, and lives at the intersection of two lists: the first list, per cgroup, links all css_sets associated with that cgroup; the second, per css_set, links all cgroups associated with that css_set - forming, as is described in Documentation/cgroups/cgroups.txt, a "lattice". they are what you want to use if you want to iterate either all css_sets in a cgroup or all cgroups in a css_set, and respond mostly to "link" but sometimes "cgl".
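
to make the "intersection of two lists" idea a bit more tangible, here is a stripped-down userspace model. it is a sketch only: the struct and field names are simplified stand-ins of my own (the kernel uses list_head and has many more fields), and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

struct cgroup;
struct css_set;

/* One link per (cgroup, css_set) pair; it sits on two lists at once. */
struct cg_link {
	struct cgroup  *cgrp;
	struct css_set *cg;
	struct cg_link *next_in_cgroup;  /* next css_set of the same cgroup */
	struct cg_link *next_in_css_set; /* next cgroup of the same css_set */
};

struct cgroup  { const char *name; struct cg_link *css_sets; };
struct css_set { int id;           struct cg_link *cgroups;  };

/* Thread a caller-provided link onto both the cgroup's and css_set's list. */
static void link_css_set(struct cg_link *link, struct cgroup *cgrp,
			 struct css_set *cg)
{
	assert(link && cgrp && cg);
	link->cgrp = cgrp;
	link->cg = cg;
	link->next_in_cgroup = cgrp->css_sets;
	cgrp->css_sets = link;
	link->next_in_css_set = cg->cgroups;
	cg->cgroups = link;
}

/* Walk "all css_sets in a cgroup". */
static int count_css_sets(const struct cgroup *cgrp)
{
	int n = 0;
	for (const struct cg_link *l = cgrp->css_sets; l; l = l->next_in_cgroup)
		n++;
	return n;
}

/* Walk "all cgroups in a css_set". */
static int count_cgroups(const struct css_set *cg)
{
	int n = 0;
	for (const struct cg_link *l = cg->cgroups; l; l = l->next_in_css_set)
		n++;
	return n;
}
```

with two cgroups (think: one in each of two hierarchies) and two css_sets, three links are enough to express "css_set 1 is in both cgroups, css_set 2 only in the first", and either direction can be walked without searching.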

2009-11-03

a module UNloading adventure...

I promised myself this morning that I would spend a "little" bit of time thinking about module unloading. An afternoon and half an evening later, I found myself with a newly written 175-line patch...

livecd dev # insmod /mnt/host/cgroup_test1.ko
livecd dev # insmod /mnt/host/cgroup_test2.ko
livecd dev # modprobe cls_cgroup
livecd dev # lsmod
Module                  Size  Used by
cls_cgroup              5064  0
cgroup_test2            2800  0
cgroup_test1            2800  0
livecd dev # mount -t cgroup none -o net_cls,test2 cgroup/
livecd dev # lsmod
Module                  Size  Used by
cls_cgroup              5064  1
cgroup_test2            2800  1
cgroup_test1            2800  0
livecd dev # rmmod cgroup_test1
livecd dev # rmmod cgroup_test2
ERROR: Module cgroup_test2 is in use
livecd dev # umount cgroup
livecd dev # lsmod
Module                  Size  Used by
cls_cgroup              5064  0
cgroup_test2            2800  0
livecd dev # rmmod cgroup_test2
livecd dev #

It still has some FIXMEs, meaning I need to make sure there are no races where there might be races, but I am surprised at how easy that was.

try_module_get vs delete_module

let me introduce you to my friend, whose name is try_module_get.

478 static inline int try_module_get(struct module *module)
479 {
480         int ret = 1;
481
482         if (module) {
483                 unsigned int cpu = get_cpu();
484                 if (likely(module_is_live(module))) {
485                         local_inc(__module_ref_addr(module, cpu));
486                         trace_module_get(module, _THIS_IP_,
487                                 local_read(__module_ref_addr(module, cpu)));
488                 }
489                 else
490                         ret = 0;
491                 put_cpu();
492         }
493         return ret;
494 }

he lives in include/linux/module.h, and he will get a reference count on the module for you, unless its state flag is set to MODULE_STATE_GOING. the get_cpu and put_cpu are SMP macros that disable/enable preemption so you can have a valid smp_processor_id.

great! now let's take a look over at a potential competitor, a system call in kernel/module.c by the name of delete_module. part of his code looks like this:

853         /* Stop the machine so refcounts can't move and disable module. */
854         ret = try_stop_module(mod, flags, &forced);
855         if (ret != 0)
856                 goto out;
857
858         /* Never wait if forced. */
859         if (!forced && module_refcount(mod) != 0)
860                 wait_for_zero_refcount(mod);

he can assist you in two different styles. the most common one is the "remove module immediately", which is what happens with rmmod usually. in this case, the O_NONBLOCK flag is specified. try_stop_module wants to set MODULE_STATE_GOING, and will behave differently depending on this flag.

if O_NONBLOCK is specified, try_stop_module will apply a very big hammer whose name is stop_machine. in this case, it will safely ensure that the reference count is zero (failing otherwise), and then set the MODULE_STATE_GOING flag. this is wonderful: because of the stop_machine hammer, there will be no problems racing with our first friend, try_module_get.

there is another way to invoke rmmod, which is with the --wait flag. if this is specified, try_stop_module will set MODULE_STATE_GOING without worrying about the refcount, and then delete_module will wait for the reference count to drop to zero. the keen-eyed systems hacker will at this point worry, what if we get the following scheduling pattern?

0) module state = MODULE_STATE_LIVE; refcount = 0
1) try_module_get checks module is alive, and succeeds (line 484)
2) delete_module sets MODULE_STATE_GOING flag (line 854)
3) delete_module waits until the refcount is zero, and finishes (line 860)
4) try_module_get increments the refcount (line 485).

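not to worry, keen-eyed systems hacker! you will note that our clever friend try_module_get disables preemption on its CPU as it runs. this guarantees that he will not be descheduled during that bit of his code. (on SMP that alone would not stop delete_module from running on another CPU in the middle; as far as I can tell, the blocking path of try_stop_module covers this by calling synchronize_sched() after setting the flag, which waits for every CPU to leave any preempt-disabled section.) therefore, through the wonderful phenomenon of "very small critical section", the problematic execution order won't happen.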
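
here is a single-threaded model of the rules the two friends play by. it is only a sketch: the fake_* names are hypothetical, there are no per-CPU counters, and the atomicity that stop_machine provides is simply assumed rather than implemented.

```c
#include <assert.h>

enum mod_state { MOD_LIVE, MOD_GOING };

struct fake_module {
	enum mod_state state;
	int refcount;
};

/* Like try_module_get: only a live module can be pinned. 1 on success. */
static int fake_try_module_get(struct fake_module *m)
{
	if (m->state != MOD_LIVE)
		return 0;
	m->refcount++;
	return 1;
}

static void fake_module_put(struct fake_module *m)
{
	assert(m->refcount > 0);
	m->refcount--;
}

/*
 * Like the O_NONBLOCK path: the refcount check and the state change are
 * treated as one indivisible step, and removal fails if the module is in use.
 */
static int fake_delete_nonblock(struct fake_module *m)
{
	if (m->refcount != 0)
		return -1;
	m->state = MOD_GOING;
	return 0;
}

/* Like the --wait path: mark first; the caller then waits for refcount 0. */
static void fake_delete_wait_begin(struct fake_module *m)
{
	m->state = MOD_GOING;
}
```

the point of the model: once the state is GOING, no new reference can be taken, so the count can only fall, and waiting for zero terminates as long as the existing holders eventually let go.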

2009-11-02

lkml submission #1

http://lkml.org/lkml/2009/11/2/442

"This is mainly for kernel developers and desperate users."

I ran into the CONFIG_MODULES_FORCE_UNLOAD option today, and out of curiosity tried setting it and seeing what would happen.

livecd dev # rmmod -f cls_cgroup
Disabling lock debugging due to kernel taint
livecd dev # ls cgroup/
cgroup.procs net_cls.classid notify_on_release release_agent tasks
livecd dev # cat cgroup/net_cls.classid
cat: cgroup/net_cls.classid: Invalid argument
livecd dev # echo 1 > cgroup/net_cls.classid
bash: echo: write error: Invalid argument
livecd dev # umount cgroup

Modules linked in: [last unloaded: cls_cgroup]
Kernel panic - not syncing: Kernel mode fault at addr 0x0, ip 0x6017b8cc

I was impressed that it didn't die immediately when I tried looking at the file.

livecd / # modprobe cls_cgroup
livecd / # rmmod cls_cgroup
ERROR: Module cls_cgroup is in use
livecd / # rmmod -f cls_cgroup
Disabling lock debugging due to kernel taint
livecd / # cd /dev
livecd dev # mkdir cgroup
livecd dev # mount -t cgroup none -o cls_cgroup cgroup/

Modules linked in: [last unloaded: cls_cgroup]
Kernel panic - not syncing: Kernel mode fault at addr 0x0, ip 0x600fe9aa

Note that this is a case that I'd eventually like to get working with module unloading (without the -f option, of course).

2009-10-26

what did the segmentation violation handler say to the general protection handler when the kernel panicked?

livecd / # modprobe cls_cgroup
[New LWP 21722]
linux-nat.c:1152: internal-error: linux_nat_resume: Assertion `lp != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

Currently merging gdb-7.0 and hoping that won't have the same bug. [edit: new version seems to work fine.]

2009-10-21

Midsemester plan (as seen in the 412 project volume!)

milestones! \o_

this week:
clean up my work for my other classes and get my brain back in shape! possibly even refactor some of the code that i have identified as needing refactoring. also stop being so lazy and mirror my work/source tree in the project volume.

next week:
polish up my current features - namely, the so-far-implemented subsys[] modifications, and module interface within cgroups, and the conversion of net_cls to be able to be modulificarized. possibly even submit first draft to LKML and folks, get reviews!

week after (nov 1-7):
think up how to do module unloading support; logistics of pinning the subsystems when loaded and letting them go when a hierarchy is unmounted. possibly begin implementing this thing. possibly consider any reviews gotten on LKML for first submission.

nov 8-14:
work should be moving along solidly on module unloading and/or fixing lkml reviews.

nov 15-21:
one or both of above should be finished. shoot for another submission to lkml around this time?

nov 22-28:
if not lkmled last week, module unloading should be first-draft done and thinged this week.

nov 29-dec 5:
rest of semester should be dedicated to finalizing everything and making the critics from lkml happy

grading criteria! _o/

C: idea rejected or otherwise falls apart somehow, implementation turns out to be very shaky, didn't get any shininess done on top of the rudimentary stuff.

B: implementation possibly a little shaky, the lkml dudes don't like it yet, a sizeable amount more work to be done before it can be called a real feature, not a lot of shininess. alternatively, a bare rudimentary implementation taken by lkml but with nothing shiny at all (i.e., pretty much what functionality i have now and nothing more)

A: implementation solid, most likely accepted to lkml by the end of semester, or if not, should be clearly on its way to that soon. at least a moderate amount of shininess, whether from module unloading or otherwise, should be present.

A++++ with a hug and a star-shaped sticker: shines more brilliantly than the sun, great features, accepted into kernel for sure by end of semester, works flawlessly and highly lauded by big-name developers as great development in computing. nobel peace prize possibly awarded.

2009-10-16

wrestling dragons

so, next goal after doing a dummy subsystem is to start on the existing builtin subsystems. at least one should end up being modularized, and for others that I examine, analysis should be given on why they can't be, or would be difficult to be, made into modules. so a quick glance over the existing subsystems turned into surprise progress when I discovered that net_cls (which additionally goes by cls_cgroup, net_cls_cgroup, and in all likelihood additional permutations thereof) not only is restricted to one file (devices and ns for example have miscellaneous function calls to them with ifdefs, in contrast), but also already has module_init() et al. declarations at the bottom!

After a bit of hacking around with cgroup_load_subsys() (my new function) and the subsys_id, and also changing the Kconfig option to tristate, the build system happily modulificatarized net_cls for me. Booting up UML, I realized it was probably time to figure out how to make modprobe do the right thing - as it turns out, UML's documentation is very helpful in this regard. A quick hostfs mount and depmod later, and the module loads and runs just fine. Victoly!

In other news, binding GDB to UML yields some frightening results:

0x00007f614c38b420 in nanosleep () from /lib64/libc.so.6
(gdb) break cgroup_load_subsys
Breakpoint 1 at 0x60051819: file kernel/cgroup.c, line 3619.
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
memcpy () at arch/um/sys-x86_64/../../x86/lib/memcpy_64.S:68
68 movq %r11, 0*8(%rdi)
Current language: auto; currently asm
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f614c36911b in memset () from /lib64/libc.so.6
(gdb)
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f614c36873e in memset () from /lib64/libc.so.6
(gdb)
Continuing.

Breakpoint 1, cgroup_load_subsys (ss=0x62828fe0) at kernel/cgroup.c:3619
3619 if (ss->fork || ss->exit)
Current language: auto; currently c
(gdb)

2009-10-11

a module loading adventure of the "great success" variety

okay! so, yesterday, and the day before that, and a little bit today, but mostly yesterday (in fact, for just about all of yesterday), I put together two patches in my stgit tree which I called cgroups-revamp-subsys-array.patch and cgroups-subsys-module-interface.patch, satisfying (in a very rudimentary way) #2 and #3 from the roadmap. (#1 turned out to be something I needed to keep but work around anyway... details not important now.) I made sure they compiled and went to bed.

Today, after looking at ns_cgroup.c and devices_cgroup.c and realizing that they couldn't really be modularized easily, I threw together a skeleton subsystem modeled after the other ones that adds a file "hax" that wraps a global variable whose value determines whether you can attach tasks to the cgroup or not. All right, now let's follow this guide that elly pointed me at to get it to build as a module...

WARNING: "cgroup_load_subsys" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!
WARNING: "cgroup_add_file" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!

Well, they're just warnings, so try loading the module anyway, right? (Note: I use insmod instead of modprobe because the latter wants infrastructure and dependencies, and the former can just take any random file from the filesystem.)

livecd / # mount -t hostfs none -o /home/bblum/412 /mnt/host/
livecd / # cd /mnt/host/hax/
livecd hax # insmod cgroup_test1.ko
cgroup_test1: Unknown symbol cgroup_load_subsys
cgroup_test1: Unknown symbol cgroup_add_file
insmod: error inserting 'cgroup_test1.ko': -1 Unknown symbol in module

It was worth a shot, though. Turns out I need to EXPORT_SYMBOL(...) everything I'll need for the module in kernel/cgroup.c. For now, I just do the functions my subsystem uses; later, I'll need to worry about functions that -any- subsystem might use. Next:

livecd hax # insmod cgroup_test1.ko
cgroup_test1: version magic '2.6.31-rc9-mm1-gf013913 mod_unload ' should be '2.6.31-rc9-mm1-ge40e265 mod_unload '
insmod: error inserting 'cgroup_test1.ko': -1 Invalid module format

It took too long to realize that the kernel I'd most recently booted somehow had something different enough to change the vermagic string from the most recent time I'd built it, which is what I'd built the module against. Okay. Rebooting UML, and going to get it right this time.

livecd hax # insmod cgroup_test1.ko
Kernel panic - not syncing: Kernel mode signal 4
Modules linked in: cgroup_test1(+)
Segmentation fault

I had deliberately left out the ".module = THIS_MODULE" line in test1_subsys when first building it, to see what would happen when cgroup_load_subsys tried to pin the module... and promptly forgotten about it. Putting the line in, finally, and:

livecd dev # lsmod
Module                  Size  Used by
livecd dev # mount -t cgroup none -o test1 cgroup/
mount: special device none does not exist
livecd dev # insmod /mnt/host/hax/cgroup_test1.ko
livecd dev # lsmod
Module                  Size  Used by
cgroup_test1            2512  1 [permanent]
livecd dev # mount -t cgroup none -o test1 cgroup/
livecd dev # ls cgroup/
cgroup.procs notify_on_release release_agent tasks test1.hax
livecd dev # mkdir cgroup/foo
livecd dev # echo $$ > cgroup/foo/tasks
bash: echo: write error: Operation not permitted
livecd dev # echo 42 > cgroup/foo/test1.hax
livecd dev # echo $$ > cgroup/foo/tasks
livecd dev #

:)

2009-10-08

what some people don't realize is that you can't stereotype my group as a "c" group.

because we do it all: cpuset, devices, freezer, memory. you name it, we've done it. it don't matter to me: if the cache is hot, we gonna kill it.

we're here to control processes, bottom line - we have a whole lot of files from different subsystems, so we gotta make sure we keep a directory of them, ya know? directed like cgroups.

roadmap

all right, so it seems like a good idea to map out the ideas and targets I've got in my head, for several reasons. Here's what I've determined I should be doing.

1) Fork/exit callbacks need to go. It's this functionality that cgroups has offered since (presumably) it first hit mainline in which a subsystem can set itself up to get a function called whenever a task forks or exits. Apparently, no subsystem has ever used it, and the presence of it here is going to interact funnily with module-loadable subsystems, so - at the suggestion and approval of Paul - I'm going to strip all callback code out of cgroups. This will be done as a pre-patch to the main patch series I plan on generating.

2) Changing how subsys[] is used.
a) At the bottom of the array will be the entries for builtin subsystems, which will be there at link-time, up until CGROUP_BUILTIN_SUBSYS_COUNT. CGROUP_SUBSYS_COUNT, which used to be that, is now defined as the size of the subsys_bits field in cgroupfs_root (i.e., 32 or 64), and is still the max size of the array. At link time, all entries between the builtin count and the total count will be NULL, and that's where module subsystems will put themselves. (This is done.) Also, the array will need to be surrounded in a rwlock, since when a subsystem registers itself it will need to take a subsys_id. (This is not done.)
b) All code throughout cgroups needs to be able to handle when a subsystem is gone. Each loop that iterates down the array will need to have a check for null pointers (this is done) and take the read-lock (this is not done). There may also be other things that certain loops need to do, situationally - this is as yet unclear.

3) cgroup_init_subsys() needs to be revised to be suitable as a module initcall. It needs to be able to handle failures correctly (the current version will panic on initialization failure, since it's assumed to be called at boot time only). Of course, because some subsystems will be left as builtins, we'll still need a version suitable for calling at boot time - probably just a wrapper around the adapted module initcall. Also, we'll need to be concurrency-safe now - obviously around the subsys array, and possibly in the other various things that the function does. Among other things, when the module is loaded OR when the module is mounted on a cgroup hierarchy (see note at end of post) we'll need to pin it with try_module_get() to make sure it doesn't go away.
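
The array layout from item 2 can be sketched in userspace like this. Treat it as a rough model under assumptions: load_subsys, unload_subsys and count_loaded are names I made up for the sketch, the two builtins are arbitrary, and the rwlock that the real thing needs is left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum {
	CGROUP_BUILTIN_SUBSYS_COUNT = 2,
	/* one slot per bit of an unsigned long subsys_bits field */
	CGROUP_SUBSYS_COUNT = sizeof(unsigned long) * CHAR_BIT
};

struct subsys { const char *name; int subsys_id; };

static struct subsys cpuacct = { "cpuacct", 0 }, devices = { "devices", 1 };

/* Builtins sit at the bottom; every other slot starts out NULL. */
static struct subsys *subsys[CGROUP_SUBSYS_COUNT] = { &cpuacct, &devices };

/* Models module load: claim the first free slot above the builtins. */
static int load_subsys(struct subsys *ss)
{
	for (int i = CGROUP_BUILTIN_SUBSYS_COUNT; i < CGROUP_SUBSYS_COUNT; i++) {
		if (subsys[i] == NULL) {
			subsys[i] = ss;
			ss->subsys_id = i;
			return 0;
		}
	}
	return -1; /* every slot is taken */
}

/* Models module unload: only modular slots may be emptied again. */
static void unload_subsys(struct subsys *ss)
{
	assert(ss->subsys_id >= CGROUP_BUILTIN_SUBSYS_COUNT);
	assert(subsys[ss->subsys_id] == ss);
	subsys[ss->subsys_id] = NULL;
}

/* Every walk over the array now has to skip empty slots. */
static int count_loaded(void)
{
	int n = 0;
	for (int i = 0; i < CGROUP_SUBSYS_COUNT; i++)
		if (subsys[i] != NULL)
			n++;
	return n;
}
```

The NULL check in count_loaded is the pattern from 2b; in the kernel each such loop would additionally sit inside the read side of whatever lock ends up guarding the array.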

Once we hit this point, it can be said that cgroups has support for modular subsystems. Next, we do the whole "confirming" thing:

4) adapt one or more subsystems to become modules, or perhaps write a new skeleton one for testing, or both. in order to be a module (suppose your module is "foo" as CONFIG_FOO), you need to do the following things:
a) instead of having code interspersed in other code with stuff like #ifdef CONFIG_FOO, it has to be all in the same file (since each .o file is either going to be a builtin or a module). in the kconfig, you need to specify that it's buildable as a module, and in the makefile, you need to make sure that the config file corresponds to the right source file.
b) you need to register a bunch of stuff with the module_suchandsuch() macros - like name, version, author, and most importantly module_init() and module_exit(), which define what functions are called at module load and unload time. (the infrastructure behind this and these macros is a lot of hax.)

I am uncertain whether I'll end up supporting module unloading for cgroups - it seems like it would be useful, given that we have a limit on the number of subsystems loaded at a time. I think this would involve making sure that subsystems can't be unloaded while attached to any mounted hierarchy, but can when not. This likely will necessitate use of cgroup_lock. If we do this, we'll end up pinning the module when we mount a hierarchy - there will be a race here if somebody's trying to unload the module at the same time, so when mounting, pinning all subsystems will have to be done before committing to the mount.

The alternative approach - and the one that I'll go with to begin with, for sure - is to just say nope, never unload, and the module is pinned forever as soon as it's loaded.

2009-10-06

understanding is powered by magic

I spent a good hour or two this afternoon poring over the kernel's module build infrastructure (not daring to look at any .c files, of course), looking at various other module code trying to take them as examples. Something clicked in my head then while looking at include/linux/init.h, which it hadn't done before, presumably because I hadn't looked at module examples, and suddenly I understood how things wanted to be compiled as modules and have initcalls/exitcalls registered.

It seems (read: I've been advised) that each subsystem will want to have a pointer to its struct module, so it can keep it "pinned" while the subsystem is loaded. This raises an interesting question: where the hell does the module struct come from? include/linux/module.h has a macro called THIS_MODULE which references extern struct module __this_module. Looking at a few examples, some of them have foo.mod.c files, which all look to have the same struct declaration (with perhaps a few differences, namely in the "depends=" string at the end). However, modules that I found living in kernel/ don't tend to have that, though everything else (use of the initcall macros, etc) was the same. Where does this mystery struct come from? A grep through the standard directories revealed nothing; grepping the whole source tree discovered the file scripts/mod/modpost.c which... generates a header file for module code with the relevant struct information. As in, "buf_printf(b, "struct module __this_module\n");" with surrounding context. Ugh.

I need to learn to start trusting the macros (THIS_MODULE, in this case) that look like complete hax instead of trying to figure out what the hax are.

As a note to myself, the CONFIG_CGROUP option settings live in init/Kconfig, and to enable modularization on an option you need to change 'bool' to 'tristate'.

2009-10-03

gaining momentum

I spent some time today looking through kernel/cgroup.c (and a bit of cgroup.h), focusing on the uses of the subsys[] array - the one that's currently initialized at link-time and will need to be redone for dynamic loading - and thinking about how to change it.

For one thing, cgroup_init_subsys() will need to no longer be marked as __init, and will need to be concurrency-safe (against itself, too). I think that no matter what data structure I end up using for the subsystem list, it will be guarded by a rwlock (or rwsem. one is a low-level lock, the other isn't). Aside from that, the only other challenge there appears to be the fork/exit callbacks logic, which I think can be for the most part disregarded - or thought about very briefly at least - since currently no subsystem even uses the callbacks.

As for the subsys[] array, the question of how to make it support new guys appearing is a difficult one. At first I thought I could make it a list, with list_head structs in each subsystem's struct - which might still be possible... but looking through all the uses of it, there will prove to be some fiddly bits. Namely, there are other data structures that rely on the list of subsystems being a fixed-length array - cgroup->subsys[] and css_set->subsys[], which is a list of cgroup_subsys_state objects (good naming there, guys), and also the use of template[] in find_existing_css_set (which I used over the summer!) all rely on matching up with the global subsys[] array. Additionally, struct cgroupfs_root has a pair of fields (unsigned long) called subsys_bits and actual_subsys_bits, which keep track of which subsystems are or want to be attached. So, thoughts for this are either:

1. figure out some way to do a dynamic list for subsys[] and its corresponding things, which will involve possibly fiddly uses of kmalloc() (with accompanying fail cases) and/or relying on cgroup_mutex and throwing another list_head in the subsys structs somewhere. also, replacing subsys_bits with something more suitable.
2. simply set CGROUP_SUBSYS_COUNT to sizeof(subsys_bits)*8 (i.e., maximum number of subsystems at a time is the number of bits in that field) and let the array have null slots in it. this seems a lot easier, but is avoiding the design problem. on the other hand, they designed clone_flags to have 32 possible settings, and they ran out of those recently, so...
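
To illustrate what the subsys_bits field buys us under option 2, here is a small sketch that turns a mount option string into such a bitmask. The name table, its ordering and parse_subsys_bits are assumptions made up for this example; the kernel's own option parsing handles more than just subsystem names.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* One bit per possible subsystem, as in option 2. */
enum { CGROUP_SUBSYS_COUNT = sizeof(unsigned long) * CHAR_BIT };

/* Hypothetical slot assignment; unused slots stay NULL. */
static const char *subsys_names[CGROUP_SUBSYS_COUNT] = {
	"cpuacct", "devices", "net_cls", "test2"
};

/* Turn "a,b,c" into a bitmask of slots; 0 means an unknown name was seen. */
static unsigned long parse_subsys_bits(const char *opts)
{
	unsigned long bits = 0;
	char buf[128];

	assert(strlen(opts) < sizeof(buf));
	strcpy(buf, opts);
	for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		int i;

		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
			if (subsys_names[i] && strcmp(tok, subsys_names[i]) == 0)
				break;
		if (i == CGROUP_SUBSYS_COUNT)
			return 0;
		bits |= 1UL << i;
	}
	return bits;
}
```

Because the result is a set rather than a list, the order of names in the option string doesn't matter, and comparing two hierarchies' subsystem sets is a single integer comparison.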

I hope to spend a good chunk of tomorrow coding.

2009-09-21

last-minute presentation slides!

My project presentation in 412 is in eleven hours, and I've just finished and test-presented my slides. Here they are, for reference:

http://maximegalon.res.cmu.edu/home/bblum/project.pdf

2009-09-13

UML

I spent a good portion of today (and small amounts of time previously) trying to get a good environment for this set up. This mostly involved getting the mm-of-the-moment (mmotm) tree and making UML (user-mode linux, the way i will test most of the stuff) actually go properly.

choosing an environment:
magrathea (my laptop): runs 64-bit. last time i tried gitting mmotm and building the kernel, it failed. :iiam:
maximegalon (my server): runs 32-bit. comparatively slow processor, so build times suboptimal.
unix.andrew servers: AFS volume only 1GB, mmotm tree with built objects in it is larger. I was given the suggestion to ask for a project volume, but :effort:

I eventually discovered that the build failure on magrathea was bash's fault. Version 4 strips environment variables with a "." in the name, and the kernel's makefiles depend on not-that-behaviour. During my googling that determined this, I found a mailing list discussing design implications and whether or not bash-4 should actually do that anyway or not. Magrathea now has bash-3 installed, and the kernel builds nicely.

The next thing I needed to do was obtain a filesystem to run UML on. I at first tried just laying it on top of my laptop's root filesystem (with ./linux root=/dev/root rootfstype=hostfs rootflags=/ rw), but my current gentoo installation's boot sequence is fancy enough that UML complained on a lot of the steps. Also, running as a regular user keeps UML from writing to things it needs to be able to write to, and it furthermore hit permissions conflicts with /bin/mount, which I sort of need a lot for cgroups devel.

Next I tried pulling a debian root filesystem from maximegalon, and running with that. That gave me the mystery "request_module: runaway loop modprobe binfmt-464c" (several times in a row, then hangs) immediately after mounting the root fs. I didn't examine it and instead looked for an alternative approach, finding rootstrap, a script designed to automatically build a UML root fs from scratch. It failed instantly complaining about mount's permissions (same as booting with my laptop's root). I next pulled down a prebuilt slackware root fs from UML's website - same binfmt-464c error. This time I googled it and discovered that that's what happens when you feed a 64-bit kernel (UML doesn't seem to have CONFIG_IA32_EMULATION for some reason?) a 32-bit filesystem.
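
The "464c" in that message is less mysterious than it looks. Here is a small C sketch of where it plausibly comes from, assuming the suffix is bytes 2 and 3 of the file header read as a little-endian 16-bit value (for an ELF file those are 'L' and 'F'). The EI_CLASS offset and class values are from the ELF specification; elf_bits and binfmt_suffix are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Offset and values from the ELF specification. */
#define EI_CLASS   4
#define ELFCLASS32 1
#define ELFCLASS64 2

/* Returns 32 or 64 for an ELF header, or 0 if hdr is not ELF. */
static int elf_bits(const unsigned char *hdr, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (len <= EI_CLASS || memcmp(hdr, magic, sizeof(magic)) != 0)
        return 0;
    if (hdr[EI_CLASS] == ELFCLASS32)
        return 32;
    if (hdr[EI_CLASS] == ELFCLASS64)
        return 64;
    return 0;
}

/* Formats header bytes 2 and 3 as a little-endian 16-bit value in hex. */
static void binfmt_suffix(const unsigned char *hdr, size_t len, char out[5])
{
    assert(len >= 4);
    unsigned v = hdr[2] | ((unsigned)hdr[3] << 8);
    snprintf(out, 5, "%04x", v);
}
```

Feeding the first few bytes of /sbin/init from a root filesystem to elf_bits would have shown the 32-bit/64-bit mismatch without any googling.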

With newfound resolve, I pulled down the amd64 minimal install CD ISO from Gentoo's website, loopback-mounted it, found the squashfs image therein, discovered my host kernel didn't have squashfs, built it as a module, then mounted the root filesystem and booted it. Same problem with /bin/mount as before. Okay, well, I'll change the permissions on this one since it's not on a filesystem I care about. At least, I would have done that, if the loopback-mounted squashfs image within the loopback-mounted ISO could at all be made not readonly. Instead I made myself a new filesystem (dd if=/dev/zero of=/tmp/uml_root bs=1024 count=1 seek=$((1024*1024-1)); /sbin/mke2fs -j /tmp/uml_root) and copied it over. That seemed to boot fine, until I got to the end, when after "Starting local ... [ ok ]" (the last message in the boot sequence before agetty runs), it printed a lot of "IRQF_DISABLED is not guaranteed on shared IRQs" messages and hung (and spawned a lot of xterms on the host, too. wtf?). Google revealed no useful information, so I figured that as this was the homestretch a bit of hackery was allowed: I opened up /etc/conf.d/local.start and appended exec /bin/bash. Now when it booted up, it ended with a root shell.
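
As an aside on that dd invocation: it makes a 1 GiB sparse file by skipping 1024*1024-1 blocks of 1024 bytes and writing only the last one. A rough C equivalent, as a sketch only (make_sparse_image is a made-up name, it needs a POSIX system, and it truncates the file to zero first, which is slightly different from what dd does):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Roughly what "dd if=/dev/zero of=path bs=BS count=1 seek=SEEK" does:
 * skip SEEK blocks without writing them, then write one block of zeros.
 * Returns the resulting file size in bytes, or -1 on error.
 */
static off_t make_sparse_image(const char *path, size_t bs, off_t seek_blocks)
{
    assert(bs > 0 && seek_blocks >= 0);

    off_t size = -1;
    char *zeros = calloc(1, bs);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (zeros != NULL && fd >= 0 &&
        lseek(fd, seek_blocks * (off_t)bs, SEEK_SET) != (off_t)-1 &&
        write(fd, zeros, bs) == (ssize_t)bs)
        size = lseek(fd, 0, SEEK_CUR); /* offset after the write = file size */

    if (fd >= 0)
        close(fd);
    free(zeros);
    return size;
}
```

With bs=1024 and seek_blocks=1024*1024-1 the size works out to 1024 * ((1024*1024-1) + 1) bytes, i.e. exactly 1 GiB, of which only the final block is actually written.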

Just to make sure everything was in working order - i.e., appropriate for doing devel with - I went to mount cgroups (mkdir /dev/cgroup; mount -t cgroup none /dev/cgroup) and have a look. There I saw the standard cgroups and subsystem control files, and also something I didn't expect to see: /dev/cgroup/cgroup.procs. That's the file implemented by the patch series I put together over the summer, which I submitted to LKML a couple of times and then somewhat lost track of, as Paul (google mentor) took over its custody when I left. Looks like it got into the tree after all - I'm taking that as a sign of hope for this project :)

welcome to my 412 project's blog: project proposal [repost]

following is the preliminary (read: rough) project proposal I submitted a week or two ago, as per this guide.

PROJECT PROPOSAL - Linux CGroups: Subsystems as Modules
http://lkml.org
http://www.mjmwired.net/kernel/Documentation/cgroups.txt

This project aims to alter the cgroups infrastructure to support subsystems as kernel modules. cgroups will need to be enhanced to deal with subsystems that are "not there", and for each subsystem some consideration will need to be given to the implications of it being dynamically loaded. This will entail wide and thorough, but not deep and intricate, development on the core cgroups code, and changes to subsystems possibly ranging from superficial to major depending on their particular nature.

The control groups mechanism in Linux is currently implemented as a core part of the kernel, mostly in kernel/cgroup.c, with various peripheral code (mostly subsystem-wise) scattered throughout a few other files (sched.c, ns_cgroup.c, memcontrol.c, etcetera). As it is a rather high-level part of the kernel, all of it is in C, and the development process is standard linux kernel devel, complete with LKML submissions at the end (read: surprise, more work to do).

As this project is enabling module functionality for subsystems, the task will begin with energy being dedicated to understanding the modules system within linux. At about the same time, an interface with which to see the modules will have to be developed, and only after knowing what's going on will modulifying begin (naturally).

As far as linux development goes, cgroups seems to be relatively slow-moving. A patch series from my work over the summer is currently being landed, and may cause excitement with this project if it takes too long. Other than that, clashes should be minor enough.

Development will be done in the mmotm ("mm of the moment") tree, using stgit on top of git. Testing can be done in both UML, for convenience, and on real hardware, for completeness. I'll likely enough collaborate with Paul Menage (mentor at google, responsible for most cgroups stuff) for guidance. Hopefully submissions to LKML will start a good way through the semester, and revisions and refinement will be done by the end.