the cgroup infrastructure: a quick tour

throughout my work on cgroups I have had many moments in which I look at a struct definition or variable declaration or even a function call and have nothing to think but "uhhhhh??" there is a nontrivial amount of infrastructure in the cgroup world, and here I am going to attempt to do a quick run-through of how everything is set up.

struct cgroupfs_root: this represents the root of a cgroup hierarchy. it knows things like what subsystems are attached to it, the root cgroup in the hierarchy, and other trivialities like its own name. variables of this type are almost always called "root".

struct cgroup: represents a single cgroup. remember that each directory, as seen through the VFS layer, is a cgroup, so when you 'mkdir cgroup/foo', a new cgroup is created. variables of this type are most commonly called "cgrp", but also sometimes "cg" or simply "c".

struct cgroup_subsys: the heart of the subsystem API - function pointers for the subsystem's operations, and various other options. usually referred to simply as "ss".

struct cgroup_subsys_state: per-cgroup, per-subsystem state storage - when you write to a subsystem control file for a given cgroup, this is what hears about it. always called "css".

struct css_set: a collection of pointers to cgroup_subsys_state objects. each css_set is referenced by all tasks that use the given combination of subsystem states. any particular subsystem state for a given task is only likely to change if the task gets moved into a different cgroup in the hierarchy to which the subsystem is attached. the presence of these guys is purely an optimization: multiple tasks can have the same combination of subsystem states, and will therefore share the same css_set. additionally, they are stored in a hashtable for quick access. in comments, this data structure is referred to as a "cgroup group", and variables tend to go by the name "cg", but sometimes "css" as well.

struct cg_cgroup_link: as per the comment, this is a "link structure for associating css_set objects with cgroups". each of these has a pointer to one struct cgroup and to one struct css_set, and lives at the intersection of two lists: the first list, per cgroup, links all css_sets associated with that cgroup; the second, per css_set, links all cgroups associated with that css_set - forming, as is described in Documentation/cgroups/cgroups.txt, a "lattice". they are what you want to use if you want to iterate over either all css_sets in a cgroup or all cgroups in a css_set, and they respond mostly to "link" but sometimes "cgl".
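
to make the "lattice" a bit more concrete, here is a minimal userspace sketch of the idea. it is not the kernel's code: every toy_* name is made up for illustration, and I use plain singly linked lists where the kernel uses struct list_head. the only point is that one link object sits on two lists at once.

```c
#include <assert.h>
#include <stddef.h>

/* toy userspace model; all toy_* names are hypothetical, and the real
 * kernel structs carry many more fields than these */
struct toy_link;

struct toy_cgroup {
    const char *name;
    struct toy_link *css_sets;      /* head of this cgroup's list of links */
};

struct toy_css_set {
    int id;
    struct toy_link *cgroups;       /* head of this css_set's list of links */
};

/* one link lives at the intersection of two lists */
struct toy_link {
    struct toy_cgroup *cgrp;
    struct toy_css_set *cg;
    struct toy_link *next_in_cgrp;  /* next link belonging to the same cgroup */
    struct toy_link *next_in_cg;    /* next link belonging to the same css_set */
};

/* associate one css_set with one cgroup, using caller-provided storage */
void toy_link_css_set(struct toy_link *link,
                      struct toy_cgroup *cgrp,
                      struct toy_css_set *cg)
{
    link->cgrp = cgrp;
    link->cg = cg;
    /* push onto the per-cgroup list */
    link->next_in_cgrp = cgrp->css_sets;
    cgrp->css_sets = link;
    /* push onto the per-css_set list */
    link->next_in_cg = cg->cgroups;
    cg->cgroups = link;
}

/* walk the per-cgroup list: how many css_sets touch this cgroup? */
int toy_count_css_sets(const struct toy_cgroup *cgrp)
{
    int n = 0;
    for (const struct toy_link *l = cgrp->css_sets; l; l = l->next_in_cgrp)
        n++;
    return n;
}

/* walk the per-css_set list: how many cgroups does this css_set touch? */
int toy_count_cgroups(const struct toy_css_set *cg)
{
    int n = 0;
    for (const struct toy_link *l = cg->cgroups; l; l = l->next_in_cg)
        n++;
    return n;
}
```

linking css_set 1 to cgroups "a" and "b", and css_set 2 to "a" only, gives two links on a's list, one on b's, two on css_set 1's and one on css_set 2's, which is the same many-to-many shape the link structs give you in the kernel.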


a module UNloading adventure...

I promised myself this morning that I would spend a "little" bit of time thinking about module unloading. An afternoon and half an evening later, I found myself with a newly written 175-line patch...

livecd dev # insmod /mnt/host/cgroup_test1.ko
livecd dev # insmod /mnt/host/cgroup_test2.ko
livecd dev # modprobe cls_cgroup
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 0
cgroup_test2 2800 0
cgroup_test1 2800 0
livecd dev # mount -t cgroup none -o net_cls,test2 cgroup/
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 1
cgroup_test2 2800 1
cgroup_test1 2800 0
livecd dev # rmmod cgroup_test1
livecd dev # rmmod cgroup_test2
ERROR: Module cgroup_test2 is in use
livecd dev # umount cgroup
livecd dev # lsmod
Module Size Used by
cls_cgroup 5064 0
cgroup_test2 2800 0
livecd dev # rmmod cgroup_test2
livecd dev #

It still has some FIXMEs, meaning I still need to check the spots where races might be hiding, but I am surprised at how easy that was.
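
The behaviour in the transcript can be modelled in a few lines. This is only a sketch of what the lsmod output suggests (mounting a hierarchy pins each attached subsystem's module, unmounting releases it), not the patch itself; the toy_* names and the all-or-nothing rollback are my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical stand-in for a modular subsystem */
struct toy_subsys {
    const char *name;
    int module_refs;   /* plays the role of the "Used by" column in lsmod */
    int unloading;     /* nonzero once removal has started */
};

/* pin every requested subsystem, or none: undo on the first failure */
int toy_mount(struct toy_subsys **ss, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ss[i]->unloading) {
            /* drop the references taken so far */
            while (i-- > 0)
                ss[i]->module_refs--;
            return -1;
        }
        ss[i]->module_refs++;
    }
    return 0;
}

/* release the references taken by a successful toy_mount */
void toy_umount(struct toy_subsys **ss, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ss[i]->module_refs--;
}

/* a plain rmmod (no -f) only succeeds when nothing holds a reference */
int toy_rmmod(struct toy_subsys *ss)
{
    if (ss->module_refs != 0)
        return -1;
    ss->unloading = 1;
    return 0;
}
```

With net_cls and test2 mounted, toy_rmmod on test1 succeeds and on test2 fails, and after toy_umount the second removal goes through, matching the order of events above.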

try_module_get vs delete_module

let me introduce you to my friend, whose name is try_module_get.

478 static inline int try_module_get(struct module *module)
479 {
480         int ret = 1;
482         if (module) {
483                 unsigned int cpu = get_cpu();
484                 if (likely(module_is_live(module))) {
485                         local_inc(__module_ref_addr(module, cpu));
486                         trace_module_get(module, _THIS_IP_,
487                                 local_read(__module_ref_addr(module, cpu)));
488                 }
489                 else
490                         ret = 0;
491                 put_cpu();
492         }
493         return ret;
494 }

he lives in include/linux/module.h, and he will take a reference on the module for you, unless the module's state flag is set to MODULE_STATE_GOING. get_cpu and put_cpu are SMP macros that disable/enable preemption so you can have a valid smp_processor_id.
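
stripped of the per-CPU counters and tracing, the contract is small enough to model in userspace. this is a sketch under simplifying assumptions: the toy_* names are hypothetical, a single int stands in for the per-CPU refcounts, and there is no preemption to disable.

```c
#include <assert.h>

/* hypothetical userspace stand-ins for the module state and refcount */
enum toy_state { TOY_LIVE, TOY_COMING, TOY_GOING };

struct toy_module {
    enum toy_state state;
    int refcount;      /* the kernel spreads this across per-CPU counters */
};

/* same shape as try_module_get: a NULL module trivially succeeds, and a
 * module that is going away refuses new references */
int toy_try_module_get(struct toy_module *mod)
{
    int ret = 1;

    if (mod) {
        if (mod->state != TOY_GOING)
            mod->refcount++;
        else
            ret = 0;
    }
    return ret;
}

/* drop a reference taken by a successful toy_try_module_get */
void toy_module_put(struct toy_module *mod)
{
    if (mod) {
        assert(mod->refcount > 0);
        mod->refcount--;
    }
}
```

the part worth noticing is that a failed get leaves the count untouched, so the caller must check the return value and must not call put after a failure.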

great! now let's take a look over at a potential competitor, a system call in kernel/module.c by the name of delete_module. part of his code looks like this:

853         /* Stop the machine so refcounts can't move and disable module. */
854         ret = try_stop_module(mod, flags, &forced);
855         if (ret != 0)
856                 goto out;
858         /* Never wait if forced. */
859         if (!forced && module_refcount(mod) != 0)
860                 wait_for_zero_refcount(mod);

he can assist you in two different styles. the most common one is the "remove module immediately", which is what happens with rmmod usually. in this case, the O_NONBLOCK flag is specified. try_stop_module wants to set MODULE_STATE_GOING, and will behave differently depending on this flag.

if O_NONBLOCK is specified, try_stop_module will apply a very big hammer whose name is stop_machine. in this case, it will safely ensure that the reference count is zero (failing otherwise), and then set the MODULE_STATE_GOING flag. this is wonderful: because of the stop_machine hammer, there will be no problems racing with our first friend, try_module_get.

there is another way to invoke rmmod, which is with the --wait flag. if this is specified, try_stop_module will set MODULE_STATE_GOING without worrying about the refcount, and then delete_module will wait for the reference count to drop to zero. the keen-eyed systems hacker will at this point worry, what if we get the following scheduling pattern?

0) module state = MODULE_STATE_LIVE; refcount = 0
1) try_module_get checks module is alive, and succeeds (line 484)
2) delete_module sets MODULE_STATE_GOING flag (line 854)
3) delete_module waits until the refcount is zero, and finishes (line 860)
4) try_module_get increments the refcount (line 485).

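not to worry, keen-eyed systems hacker! you will note that our clever friend try_module_get disables preemption on its CPU as it runs, so he cannot be descheduled between the liveness check (line 484) and the increment (line 485). on its own that only narrows the window, since delete_module could still be running on another CPU; as far as I can tell, what closes it is that the waiting path of try_stop_module calls synchronize_sched right after setting MODULE_STATE_GOING, which does not return until every preemption-disabled section already in progress has finished. so by the time delete_module looks at the refcount in step 3, any try_module_get that saw the module alive has already done its increment, and through the wonderful phenomenon of "very small critical section", the problematic execution order won't happen.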
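
the two styles of try_stop_module described above can be put side by side in a small single-threaded model. this is a sketch, not kernel code: the toy_* names are hypothetical, -1 is a placeholder error value, and the stop_machine hammer is represented only by the fact that the check and the state change happen together with nothing else running.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical userspace stand-ins */
enum toy_state { TOY_LIVE, TOY_GOING };

struct toy_module {
    enum toy_state state;
    int refcount;
};

/* returns 0 on success, -1 (placeholder) if the module is busy */
int toy_try_stop_module(struct toy_module *mod, bool nonblock)
{
    if (nonblock) {
        /* "immediate" style: refuse if anyone holds a reference,
         * otherwise mark the module as going; in the kernel this
         * pair runs under stop_machine so nothing can interleave */
        if (mod->refcount != 0)
            return -1;
        mod->state = TOY_GOING;
        return 0;
    }
    /* "--wait" style: mark the module as going regardless of the
     * refcount; new gets now fail, and the caller waits for the
     * existing references to drain */
    mod->state = TOY_GOING;
    return 0;
}

/* mirrors the check before wait_for_zero_refcount */
bool toy_must_wait(const struct toy_module *mod)
{
    return mod->refcount != 0;
}
```

with one outstanding reference, the nonblocking call fails and leaves the module live, while the waiting call flips the state and leaves the caller to wait for the count to reach zero.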


lkml submission #1


"This is mainly for kernel developers and desperate users."

I ran into the CONFIG_MODULES_FORCE_UNLOAD option today, and out of curiosity tried setting it and seeing what would happen.

livecd dev # rmmod -f cls_cgroup
Disabling lock debugging due to kernel taint
livecd dev # ls cgroup/
cgroup.procs net_cls.classid notify_on_release release_agent tasks
livecd dev # cat cgroup/net_cls.classid
cat: cgroup/net_cls.classid: Invalid argument
livecd dev # echo 1 > cgroup/net_cls.classid
bash: echo: write error: Invalid argument
livecd dev # umount cgroup

Modules linked in: [last unloaded: cls_cgroup]
Kernel panic - not syncing: Kernel mode fault at addr 0x0, ip 0x6017b8cc

I was impressed that it didn't die immediately when I tried looking at the file.

livecd / # modprobe cls_cgroup
livecd / # rmmod cls_cgroup
ERROR: Module cls_cgroup is in use
livecd / # rmmod -f cls_cgroup
Disabling lock debugging due to kernel taint
livecd / # cd /dev
livecd dev # mkdir cgroup
livecd dev # mount -t cgroup none -o cls_cgroup cgroup/

Modules linked in: [last unloaded: cls_cgroup]
Kernel panic - not syncing: Kernel mode fault at addr 0x0, ip 0x600fe9aa

Note that this is a case that I'd eventually like to get working with module unloading (without the -f option, of course).