lkml submission #4



lkml submission #3


I take two refcounts before I take two refcounts, and then I take two more

The best solution to the previously mentioned deadlock problem turned out to be having parse_cgroupfs_options take module reference counts (before sget takes s_umount), and having rebind_subsystems drop them later. This division of work ensures that the subsystems will stick around during sget, so we can safely drop our own lock in the meantime.

So I was polishing up the deadlock-free version of the patch series today, and found a good bug. After a few routine changes, I decided to do a bit more thorough testing than I had done since implementing this change:

livecd dev # mount -t cgroup none -o test2,net_cls,devices,cpuacct cgroup/
livecd dev # ls cgroup/
cgroup.procs   cpuacct.usage_percpu  devices.list       release_agent
cpuacct.stat   devices.allow         net_cls.classid    tasks
cpuacct.usage  devices.deny          notify_on_release  test2.hax
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Simple enough. Now, cgroups has this functionality where you can't destroy a hierarchy that has child cgroups. (Child cgroups are represented as subdirectories in the filesystem tree, and are made just as you would expect, with mkdir.) It does, however, let you unmount it:

livecd dev # mkdir cgroup/foo
livecd dev # umount cgroup/
livecd dev #

In which case the hierarchy sticks around, invisible. You can make it reappear by mounting with the same list of subsystems; any attempt to mount that hierarchy with a mismatched set of subsystems will fail.

livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  1
cls_cgroup              5080  1
livecd dev #

Great - the modules' reference counts indicate that the hierarchy is still there. Let's bring it back again:

livecd dev # mount -t cgroup none -o cpuacct,test2,net_cls,devices cgroup/
livecd dev # lsmod
Module                  Size  Used by
cgroup_test2            2880  2
cls_cgroup              5080  2
livecd dev #

Oops! The mounting code didn't realize that we weren't changing anything with respect to the subsystems, and we leaked a reference! Fortunately, with the new changes I'd made, there was a simple fix: in cgroup_get_sb, I added lines 1518-1519:

1426         if (root == opts.new_root) {
1427 +--- 85 lines: We used the new root structure, so this is a new hierarchy ---
1512         } else {
1513                 /*
1514                  * We re-used an existing hierarchy - the new root (if
1515                  * any) is not needed
1516                  */
1517                 cgroup_drop_root(opts.new_root);
1518                 /* no subsys rebinding, so refcounts don't change */
1519                 drop_parsed_module_refcounts(opts.subsys_bits);
1520         }

For extra credit, explain how this bug is also a security vulnerability!


I learned something new today.


One possible solution - modify cgroup_get_sb to do as follows:

1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.

Now, note, for clarification:

1508 static struct file_system_type cgroup_fs_type = {
1509         .name = "cgroup",
1510         .get_sb = cgroup_get_sb,
1511         .kill_sb = cgroup_kill_sb,
1512 };

The issue is that s->s_umount is taken in get_sb while subsys_mutex is already held, whereas kill_sb is called with s->s_umount already held, and the deadlock comes from there. There's a good question lurking here: who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has that lock taken before it's even called?

So yes, there's a locking-order violation, but if functions were locks (a reasonable comparison: isn't it intuitive to have a function that takes a lock at the start and drops it at the end?), then the blame would be on whoever wrote this interface to begin with.


so what's this net_cls thing actually useful for?

Anand came to #cslounge today with an interesting question: is there a logical equivalent of nice for network bandwidth management that I can easily apply to a process? I don't like being unable to browse the web or IRC when I scp large amounts of stuff.

I was under the impression that there didn't actually exist a subsystem for controlling that sort of thing, only for "classifying" it: the net_cls subsystem has a single control file, classid, which lets you specify a network class for each cgroup, and it doesn't seem to do much on its own, since nothing hooks into it (it's a module). Then I found this, which explains the actual secret: each classid gets associated with the sockets held by tasks in that cgroup, and bandwidth throttling can then be administered from userspace! Very slick, and it keeps the kernel-side labour to a minimum, so much so that it can even be modularized.
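A sketch of how the userspace side fits together, assuming a v1 net_cls hierarchy mounted at /sys/fs/cgroup/net_cls and an eth0 interface (paths, device, rates, and the "slow" cgroup name are all placeholders; this needs root):

```shell
# Tag tasks in a cgroup with a classid, then let tc do the throttling.
mkdir /sys/fs/cgroup/net_cls/slow
echo 0x100001 > /sys/fs/cgroup/net_cls/slow/net_cls.classid  # class 10:1

tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup

echo $$ > /sys/fs/cgroup/net_cls/slow/tasks   # move this shell in
```

The cgroup filter (cls_cgroup, the module in the lsmod output above) classifies packets by the sending task's classid, and the htb class does the actual rate limiting, all configured from userspace.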

Of course, Anand's particular situation had an easier solution: the -l option to scp, which lets you specify a bandwidth limit (in Kbit/s) explicitly.
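For the record, a minimal example (file and host are placeholders):

```shell
# Cap this transfer at roughly 2 Mbit/s; -l takes its limit in Kbit/s
scp -l 2048 bigfile.iso user@example.com:/tmp/
```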