http://lkml.org/lkml/2009/12/9/30
One possible solution - modify cgroup_get_sb to do as follows:
1) get subsys_mutex
2) parse_cgroupfs_options()
3) release subsys_mutex
4) call sget(), which gets s->s_umount
5) get subsys_mutex again
6) verify_cgroupfs_options - also known as the "deadlock avoidance dance"
7) proceed.
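The ordering in the steps above can be sketched in userspace, with pthread mutexes standing in for the kernel locks. Everything here is a stand-in: the two mutexes, mount_path_sketch and kill_path_sketch are hypothetical names, not kernel code, and the unmount side is simplified (it assumes kill_sb takes subsys_mutex somewhere inside). The only point is that after the change both paths take s_umount before subsys_mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-ins for the kernel locks; not the real kernel objects. */
static pthread_mutex_t subsys_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t s_umount = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical mount path following steps 1-7. Returns 0 on success. */
int mount_path_sketch(void)
{
    /* Steps 1-3: parse the options under subsys_mutex, then drop it. */
    pthread_mutex_lock(&subsys_mutex);
    /* parse_cgroupfs_options() would run here */
    pthread_mutex_unlock(&subsys_mutex);

    /* Step 4: sget() takes s->s_umount while subsys_mutex is NOT held. */
    pthread_mutex_lock(&s_umount);

    /* Steps 5-6: retake subsys_mutex and re-check the options, since
     * things may have changed while the mutex was dropped. */
    pthread_mutex_lock(&subsys_mutex);
    /* verify_cgroupfs_options() would run here */

    /* Step 7: proceed, then release in reverse order. */
    pthread_mutex_unlock(&subsys_mutex);
    pthread_mutex_unlock(&s_umount);
    return 0;
}

/* Hypothetical unmount path: s_umount is already held on entry to kill_sb. */
int kill_path_sketch(void)
{
    pthread_mutex_lock(&s_umount);       /* taken by the caller of kill_sb */
    pthread_mutex_lock(&subsys_mutex);   /* assumed to be taken inside kill_sb */
    pthread_mutex_unlock(&subsys_mutex);
    pthread_mutex_unlock(&s_umount);
    return 0;
}
```

With both paths agreeing on s_umount first and subsys_mutex second, neither can end up holding one lock while waiting on a thread that holds the other.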
Now, note, for clarification:
static struct file_system_type cgroup_fs_type = {
	.name = "cgroup",
	.get_sb = cgroup_get_sb,
	.kill_sb = cgroup_kill_sb,
};
The issue is that s->s_umount is taken in get_sb while it already has the subsys_mutex, whereas kill_sb is called with s->s_umount already held, and deadlock comes from there. There's a good question lurking here, and it is "Who ever designed an interface of seemingly symmetrical functions, one of which has to take a lock inside it while the other has the lock taken before it's called?"
So yes, there's a locking order violation, but if functions were locks (which is a reasonable comparison, because isn't it intuitive to have a function that takes a lock at the start and drops it at the end?) then the blame would be on whoever wrote this interface to begin with.
2009-12-09
2009-11-16
the cgroup infrastructure: a quick tour
throughout my work on cgroups I have had many moments in which I look at a struct definition or variable declaration or even a function call and have nothing to think but "uhhhhh??" there is a nontrivial amount of infrastructure in the cgroup world, and here I am going to attempt to do a quick run-through of how everything is set up.
struct cgroupfs_root: this represents the root of a cgroup hierarchy. it knows things like what subsystems are attached to it, the root cgroup in the hierarchy, and other trivialities like its own name. variables of this type are almost always called "root".
struct cgroup: represents a single cgroup. remember that each directory, looking through the VFS layer, is a cgroup, so when you 'mkdir cgroup/foo', a new cgroup is created. variables of this type can be seen most commonly as "cgrp", but also sometimes "cg" or simply "c".
struct cgroup_subsys: the heart of the subsystem API - function pointers for the subsystem's operations, and other various options. referred to usually simply as "ss".
struct cgroup_subsys_state: per-cgroup, per-subsystem state storage - when you write to a subsystem control file for a given cgroup, this is what hears about it. always called "css".
struct css_set: a collection of pointers to cgroup_subsys_state objects. each css_set is referenced by all tasks who use the given combination of subsystem states. any particular subsystem state for a given task is only likely to change if the task gets moved into a different cgroup in the hierarchy to which the subsystem is attached. the presence of these guys is purely an optimization, since multiple tasks can have the same combination of subsystem states, and will therefore use the same css_set. additionally, they are stored in a hashtable for quick access. in comments, this data structure is referred to as a "cgroup group", and variables tend to go by the name "cg", but sometimes "css" as well.
struct cg_cgroup_link: as per the comment, this is a "link structure for associating css_set objects with cgroups". each of these has a pointer to one struct cgroup and to one struct css_set, and lives at the intersection of two lists: the first list, per cgroup, links all css_sets associated with that cgroup; the second, per css_set, links all cgroups associated with that css_set - forming, as is described in Documentation/cgroups/cgroups.txt, a "lattice". they are what you want to use if you want to iterate either all css_sets in a cgroup or all cgroups in a css_set, and respond mostly to "link" but sometimes "cgl".
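To make the "intersection of two lists" idea concrete, here is a minimal userspace sketch. The struct and function names (link_sketch, cgroup_sketch, css_set_sketch, link_init and the two counters) are made up for illustration, and plain next pointers stand in for the kernel's list_head; it is not the real layout of struct cg_cgroup_link.

```c
#include <assert.h>
#include <stddef.h>

struct cgroup_sketch;
struct css_set_sketch;

/* One link sits on two singly linked lists at once. */
struct link_sketch {
    struct cgroup_sketch *cgrp;
    struct css_set_sketch *cg;
    struct link_sketch *next_in_cgroup;   /* next link on the same cgroup's list */
    struct link_sketch *next_in_css_set;  /* next link on the same css_set's list */
};

struct cgroup_sketch {
    const char *name;
    struct link_sketch *css_sets;  /* head of the per-cgroup list */
};

struct css_set_sketch {
    int id;
    struct link_sketch *cgroups;   /* head of the per-css_set list */
};

/* Associate css_set s with cgroup c by pushing l onto both lists. */
void link_init(struct link_sketch *l, struct cgroup_sketch *c,
               struct css_set_sketch *s)
{
    l->cgrp = c;
    l->cg = s;
    l->next_in_cgroup = c->css_sets;
    c->css_sets = l;
    l->next_in_css_set = s->cgroups;
    s->cgroups = l;
}

/* Walk the per-cgroup list: every css_set associated with this cgroup. */
int count_css_sets(const struct cgroup_sketch *c)
{
    int n = 0;
    for (const struct link_sketch *l = c->css_sets; l != NULL; l = l->next_in_cgroup)
        n++;
    return n;
}

/* Walk the per-css_set list: every cgroup associated with this css_set. */
int count_cgroups(const struct css_set_sketch *s)
{
    int n = 0;
    for (const struct link_sketch *l = s->cgroups; l != NULL; l = l->next_in_css_set)
        n++;
    return n;
}
```

Each link_init call adds one node to the lattice, and either direction can be walked without touching the other list.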
2009-10-11
a module loading adventure of the "great success" variety
okay! so, yesterday, and the day before that, and a little bit today, but mostly yesterday (in fact, for just about all of yesterday), I put together two patches in my stgit tree which I called cgroups-revamp-subsys-array.patch and cgroups-subsys-module-interface.patch, satisfying (in a very rudimentary way) #2 and #3 from the roadmap. (#1 turned out to be something I needed to keep but work around anyway... details not important now.) I made sure they compiled and went to bed.
Today, after looking at ns_cgroup.c and devices_cgroup.c and realizing that they couldn't really be modularized easily, I threw together a skeleton subsystem modeled after the other ones that adds a file "hax" that wraps a global variable whose value determines whether you can attach tasks to the cgroup or not. All right, now let's follow this guide that elly pointed me at to get it to build as a module...
WARNING: "cgroup_load_subsys" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!
WARNING: "cgroup_add_file" [/home/bblum/Documents/School/F09/412/hax/cgroup_test1.ko] undefined!
Well, they're just warnings, so try loading the module anyway, right? (Note: I use insmod instead of modprobe because the latter wants infrastructure and dependencies, and the former can just take any random file from the filesystem.)
livecd / # mount -t hostfs none -o /home/bblum/412 /mnt/host/
livecd / # cd /mnt/host/hax/
livecd hax # insmod cgroup_test1.ko
cgroup_test1: Unknown symbol cgroup_load_subsys
cgroup_test1: Unknown symbol cgroup_add_file
insmod: error inserting 'cgroup_test1.ko': -1 Unknown symbol in module
It was worth a shot, though. Turns out I need to EXPORT_SYMBOL(...) everything I'll need for the module in kernel/cgroup.c. For now, I just do the functions my subsystem uses; later, I'll need to worry about functions that -any- subsystem might use. Next:
livecd hax # insmod cgroup_test1.ko
cgroup_test1: version magic '2.6.31-rc9-mm1-gf013913 mod_unload ' should be '2.6.31-rc9-mm1-ge40e265 mod_unload '
insmod: error inserting 'cgroup_test1.ko': -1 Invalid module format
It took too long to realize that the kernel I'd most recently booted somehow had something different enough to change the vermagic string from the most recent time I'd built it, which is what I'd built the module against. Okay. Rebooting UML, and going to get it right this time.
livecd hax # insmod cgroup_test1.ko
Kernel panic - not syncing: Kernel mode signal 4
Modules linked in: cgroup_test1(+)
Segmentation fault
I had deliberately left out the ".module = THIS_MODULE" line in test1_subsys when first building it, to see what would happen when cgroup_load_subsys tried to pin the module... and promptly forgotten about it. Putting the line in, finally, and:
livecd dev # lsmod
Module Size Used by
livecd dev # mount -t cgroup none -o test1 cgroup/
mount: special device none does not exist
livecd dev # insmod /mnt/host/hax/cgroup_test1.ko
livecd dev # lsmod
Module Size Used by
cgroup_test1 2512 1 [permanent]
livecd dev # mount -t cgroup none -o test1 cgroup/
livecd dev # ls cgroup/
cgroup.procs notify_on_release release_agent tasks test1.hax
livecd dev # mkdir cgroup/foo
livecd dev # echo $$ > cgroup/foo/tasks
bash: echo: write error: Operation not permitted
livecd dev # echo 42 > cgroup/foo/test1.hax
livecd dev # echo $$ > cgroup/foo/tasks
livecd dev #
:)
2009-09-13
UML
I spent a good portion of today (and small amounts of time previously) trying to get a good environment for this set up. This mostly involved getting the mm-of-the-moment (mmotm) tree and making UML (user-mode linux, the way i will test most of the stuff) actually go properly.
choosing an environment:
magrathea (my laptop): runs 64-bit. last time i tried gitting mmotm and building the kernel, it failed. :iiam:
maximegalon (my server): runs 32-bit. comparatively slow processor, so build times suboptimal.
unix.andrew servers: AFS volume only 1GB, mmotm tree with built objects in it is larger. I was given the suggestion to ask for a project volume, but :effort:
I eventually discovered that the build failure on magrathea was bash's fault. Version 4 strips environment variables with a "." in the name, and the kernel's makefiles depend on not-that-behaviour. During my googling that determined this, I found a mailing list discussing design implications and whether or not bash-4 should actually do that anyway or not. Magrathea now has bash-3 installed, and the kernel builds nicely.
The next thing I needed to do was obtain a filesystem to run UML on. I at first tried just laying it on top of my laptop's root filesystem (with ./linux root=/dev/root rootfstype=hostfs rootflags=/ rw), but my current gentoo installation's boot sequence is fancy enough that UML complained on a lot of the steps. Also, running as a regular user keeps it from writing to anything that it needs to be able to write to, and on top of that it ran into permissions conflicts with /bin/mount, which I sort of need a lot for cgroups devel.
Next I tried pulling a debian root filesystem from maximegalon, and running with that. That gave me the mystery "request_module: runaway loop modprobe binfmt-464c" (several times in a row, then hangs) immediately after mounting the root fs. I didn't examine it and instead looked for an alternative approach, finding rootstrap, a script designed to automatically build a UML root fs from scratch. It failed instantly complaining about mount's permissions (same as booting with my laptop's root). I next pulled down a prebuilt slackware root fs from UML's website - same binfmt-464c error. This time I googled it and discovered that that's what happens when you feed a 64-bit kernel (UML doesn't seem to have CONFIG_IA32_EMULATION for some reason?) a 32-bit filesystem.
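A guess at where the odd "464c" comes from, assuming the kernel's fallback for an executable that no loaded binary format accepts is to request a module named "binfmt-%04x", built from bytes 2 and 3 of the file header read as a little-endian 16-bit value. The helper name binfmt_module_name is hypothetical; this is a userspace sketch of that assumption, not kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "binfmt-%04x" from bytes 2 and 3 of a file header, combined as a
 * little-endian 16-bit value. ASSUMPTION: this mirrors how the kernel names
 * the module it asks for when no binary format handler takes the file. */
void binfmt_module_name(const unsigned char *header, char *out, size_t out_len)
{
    unsigned int magic = (unsigned int)header[2] |
                         ((unsigned int)header[3] << 8);
    snprintf(out, out_len, "binfmt-%04x", magic);
}
```

An ELF file starts with the bytes 0x7f 'E' 'L' 'F'. Bytes 2 and 3 are 'L' (0x4c) and 'F' (0x46), which combine to 0x464c, so under this assumption "binfmt-464c" just means "an ELF binary that nothing here knows how to run", which fits a 32-bit userland on a 64-bit kernel.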
With newfound resolve, I pulled down the amd64 minimal install CD ISO from Gentoo's website, loopback-mounted it, found the squashfs image therein, discovered my host kernel didn't have squashfs, built it as a module, then mounted the root filesystem and booted it. Same problem with /bin/mount as before. Okay, well, I'll change the permissions on this one since it's not on a filesystem I care about. At least, I would have done that, if the loopback-mounted squashfs image within the loopback-mounted ISO could at all be made not readonly. Instead I made myself a new filesystem (dd if=/dev/zero of=/tmp/uml_root bs=1024 count=1 seek=$((1024*1024-1)); /sbin/mke2fs -j /tmp/uml_root) and copied the root filesystem over. That seemed to boot fine, until I got to the end, when after "Starting local ... [ ok ]" (the last message in the boot sequence before agetty runs), it printed a lot of "IRQF_DISABLED is not guaranteed on shared IRQs" messages and hung (and spawned a lot of xterms on the host kernel, too. wtf?). Google revealed no useful information, so I figured that as this was the homestretch a bit of hackery was allowed: I opened up /etc/conf.d/local.start and appended exec /bin/bash. Now when it booted up, it ended with a root shell.
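For the record, the arithmetic behind that dd invocation: it seeks past 1024*1024-1 blocks of 1024 bytes and then writes a single block, so the file comes out at exactly 1 GiB while only one block is actually written. A tiny sketch of the calculation (the function name dd_output_size is made up, and it assumes the output file did not exist beforehand):

```c
#include <assert.h>

/* Size in bytes of a fresh file produced by dd with the given bs, count and
 * seek: dd skips `seek` blocks of output, then writes `count` blocks. */
long long dd_output_size(long long bs, long long count, long long seek)
{
    return (seek + count) * bs;
}
```

Plugging in bs=1024, count=1 and seek=1024*1024-1 gives 1024*1024*1024 bytes, which is the size mke2fs then sees for the new root filesystem.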
Just to make sure everything was in working order - i.e., appropriate for doing devel with - I went to go mount cgroups (mkdir /dev/cgroup; mount -t cgroup none /dev/cgroup) and have a look. There I see the standard cgroups and subsystem control files, and also something I didn't expect to see: /dev/cgroup/cgroup.procs. That's the file that the patchseries I put together over the summer implemented, submitted to LKML a couple of times, and had somewhat lost track of as Paul (google mentor) had taken over its custody when I left. Looks like it got into the tree after all - I'm taking that as a sign of hope for this project :)