2009-12-31
2009-12-21
I take two refcounts before I take two refcounts, and then I take two more
The best solution to the previously mentioned deadlock problem was determined to be having parse_cgroupfs_options take module reference counts (before sget takes the lock), and have rebind_subsystems drop them later. This division of work makes sure that the subsystems will stick around during sget so we can safely drop our own lock in the meantime.
So I was polishing up the deadlock-free version of the patch series today, and found a good bug. After a few routine changes and polishings, I decided to go through a bit more thorough testing than I'd done since implementing this change:
livecd dev # mount -t cgroup none -o test2,net_cls,devices,cpuacct cgroup/
livecd dev # ls cgroup/
cgroup.procs cpuacct.usage_percpu devices.list release_agent
cpuacct.stat devices.allow net_cls.classid tasks
cpuacct.usage devices.deny notify_on_release test2.hax
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 1
cls_cgroup 5080 1
livecd dev #
Simple enough. Now, cgroups has this functionality where you can't destroy a hierarchy that has children cgroups. (Children cgroups are represented as subdirectories in the filesystem tree, and are made just as you would expect.) It does, however, let you unmount it:
livecd dev # mkdir cgroup/foo
livecd dev # umount cgroup/
livecd dev #
In which case the hierarchy sticks around, invisible. You can make it reappear by mounting with the same list of subsystems; any attempt to mount a subsystem on that hierarchy with a mismatched set of subsystems will fail.
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 1
cls_cgroup 5080 1
livecd dev #
Great - the modules' reference counts indicate that the hierarchy is still there. Let's bring it back again:
livecd dev # mount -t cgroup none -o cpuacct,test2,net_cls,devices cgroup/
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 2
cls_cgroup 5080 2
livecd dev #
Oops! The mounting code didn't realize that we weren't changing anything with respect to the subsystems, and we leaked a reference! Fortunately, with the new changes that I made, there was a simple fix - in cgroup_get_sb, added lines 1518-1519:
1426 if (root == opts.new_root) {
1427 +--- 85 lines: We used the new root structure, so this is a new hierarchy ---
1512 } else {
1513 /*
1514 * We re-used an existing hierarchy - the new root (if
1515 * any) is not needed
1516 */
1517 cgroup_drop_root(opts.new_root);
1518 /* no subsys rebinding, so refcounts don't change */
1519 drop_parsed_module_refcounts(opts.subsys_bits);
1520 }
For extra credit, tell how this bug is also a security vulnerability!
So I was polishing up the deadlock-free version of the patch series today, and found a good bug. After a few routine changes and polishings, I decided to go through a bit more thorough testing than I'd done since implementing this change:
livecd dev # mount -t cgroup none -o test2,net_cls,devices,cpuacct cgroup/
livecd dev # ls cgroup/
cgroup.procs cpuacct.usage_percpu devices.list release_agent
cpuacct.stat devices.allow net_cls.classid tasks
cpuacct.usage devices.deny notify_on_release test2.hax
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 1
cls_cgroup 5080 1
livecd dev #
Simple enough. Now, cgroups has this functionality where you can't destroy a hierarchy that has children cgroups. (Children cgroups are represented as subdirectories in the filesystem tree, and are made just as you would expect.) It does, however, let you unmount it:
livecd dev # mkdir cgroup/foo
livecd dev # umount cgroup/
livecd dev #
In which case the hierarchy sticks around, invisible. You can make it reappear by mounting with the same list of subsystems; any attempt to mount a subsystem on that hierarchy with a mismatched set of subsystems will fail.
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 1
cls_cgroup 5080 1
livecd dev #
Great - the modules' reference counts indicate that the hierarchy is still there. Let's bring it back again:
livecd dev # mount -t cgroup none -o cpuacct,test2,net_cls,devices cgroup/
livecd dev # lsmod
Module Size Used by
cgroup_test2 2880 2
cls_cgroup 5080 2
livecd dev #
Oops! The mounting code didn't realize that we weren't changing anything with respect to the subsystems, and we leaked a reference! Fortunately, with the new changes that I made, there was a simple fix - in cgroup_get_sb, added lines 1518-1519:
1426 if (root == opts.new_root) {
1427 +--- 85 lines: We used the new root structure, so this is a new hierarchy ---
1512 } else {
1513 /*
1514 * We re-used an existing hierarchy - the new root (if
1515 * any) is not needed
1516 */
1517 cgroup_drop_root(opts.new_root);
1518 /* no subsys rebinding, so refcounts don't change */
1519 drop_parsed_module_refcounts(opts.subsys_bits);
1520 }
For extra credit, tell how this bug is also a security vulnerability!
2009-12-09
I learned something new today.
http://lkml.org/lkml/2009/12/9/30
One possible solution - modify cgroup_get_sb to do as follows:
1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.
Now, note, for clarification:
1508 static struct file_system_type cgroup_fs_type = {
1509 .name = "cgroup",
1510 .get_sb = cgroup_get_sb,
1511 .kill_sb = cgroup_kill_sb,
1512 };
The issue is that s->s_umount is taken in get_sb while it already has the subsys_mutex, whereas kill_sb is called with s->s_umount already held, and deadlock comes from there. There's a good question lurking here, and it is "Who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has the lock taken before it's called?"
So yes, there's locking order violation, but if functions were locks (which is a reasonable comparison because isn't it intuitive to have a function that takes a lock at the start and drops it at the end?) then the blame would be on whoever wrote this interface to begin with.
One possible solution - modify cgroup_get_sb to do as follows:
1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.
Now, note, for clarification:
1508 static struct file_system_type cgroup_fs_type = {
1509 .name = "cgroup",
1510 .get_sb = cgroup_get_sb,
1511 .kill_sb = cgroup_kill_sb,
1512 };
The issue is that s->s_umount is taken in get_sb while it already has the subsys_mutex, whereas kill_sb is called with s->s_umount already held, and deadlock comes from there. There's a good question lurking here, and it is "Who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has the lock taken before it's called?"
So yes, there's locking order violation, but if functions were locks (which is a reasonable comparison because isn't it intuitive to have a function that takes a lock at the start and drops it at the end?) then the blame would be on whoever wrote this interface to begin with.
2009-12-04
2009-12-03
so what's this net_cls thing actually useful for?
Anand came to #cslounge today with an interesting question: is there a logical equivalent of nice for network bandwidth management that i can easily apply to a process? I don't like being unable to browse the web or irc when I scp large amounts of stuff.
I was under the impression that there didn't actually exist a subsystem for controlling that sort of thing, only for "classifying" it - the net_cls subsystem has one control file, which is "classid", and it lets you specify a network class for each cgroup, and it doesn't seem to do much because it doesn't have anything hooking into it because it's a module. Then I found this, which explains the actual secret: each classid can be associated with sockets held by tasks in those cgroups, and then from -userspace- have bandwidth throttling administered! Very slick, and keeps the kernel-side labour to a minimum, so much so that it can even be modularized.
Of course, Anand's particular situation had an easier solution, which is the -l option to scp which lets you specify a bandwidth limit explicitly.
I was under the impression that there didn't actually exist a subsystem for controlling that sort of thing, only for "classifying" it - the net_cls subsystem has one control file, which is "classid", and it lets you specify a network class for each cgroup, and it doesn't seem to do much because it doesn't have anything hooking into it because it's a module. Then I found this, which explains the actual secret: each classid can be associated with sockets held by tasks in those cgroups, and then from -userspace- have bandwidth throttling administered! Very slick, and keeps the kernel-side labour to a minimum, so much so that it can even be modularized.
Of course, Anand's particular situation had an easier solution, which is the -l option to scp which lets you specify a bandwidth limit explicitly.
Subscribe to:
Posts (Atom)