lesson: never hack when it won't solve the entire problem.

I'd like to share some code I wrote for a particularly misguided solution to a yak-shaving quest. Fortunately, soon afterwards, I realized a much better way to dodge the bullet that I was here trying to take in the chest and stagger back to camp while screaming for help... so this code will never see the light of day.

That's good, because I also think it has a fatal flaw. I won't say it, but I'll give an exposition so you can figure it out. Here's the idea:
  1. You are not allowed to block, sleep, or otherwise yield while in an "atomic" section of code (usually because it means interrupts are disabled). In order to safely iterate over a thread-group, you need to hold either rcu_read_lock or tasklist_lock, both of which are low-level locks and constitute "atomic" code while being held.
  2. In kernel/cpuset.c, cpuset_change_task_nodemask has two steps, and a check between those steps which does a yield to synchronize its internal state (as part of an open-coded replacement for a proper concurrency primitive... ugh. from commit c0ff7453bb5c7c98e0885fb94279f2571946f280 on mmotm). Here I have changed it to return -EAGAIN instead of yielding, so the caller can synchronize "more appropriately".
  3. get_task_struct and put_task_struct manage a thread's reference count. After you've done get_task_struct, it will always be safe to access that memory, until you release it. If you call put_task_struct and you're the last one with a reference count, it does a bunch of cleanup. (interesting read here, perhaps a hint.)

1456         /*
1457          * This particular per-task operation requires being able to sleep, so
1458          * it can't be done in the attach_task callback, where sleeping is
1459          * forbidden. See cgroup_attach_proc.
1460          */
1461         if (threadgroup) {
1462                 struct task_struct *c = NULL;
1463 again:
1464                 rcu_read_lock();
1465                 /* ensure safe thread-group traversal */
1466                 if (thread_group_leader(tsk)) {
1467                         list_for_each_entry_rcu(c, &tsk->thread_group,
1468                                                 thread_group) {
1469                                 get_task_struct(c);
1470                                 ret = cpuset_change_task_nodemask(c,
1471                                                 &cpuset_attach_nodemask_to);
1472                                 /*
1473                                  * on a failed nodemask change, we need to exit
1474                                  * the atomic section and start over.
1475                                  */
1476                                 if (ret == -EAGAIN) {
1477                                         rcu_read_unlock();
1478                                         yield();
1479                                         goto again;
1480                                 }
1481                                 put_task_struct(c);
1482                         }      
1483                         rcu_read_unlock();
1484                 } else if (c != NULL) {
1485                         /*
1486                          * if racing with exec caused an abort, we need to
1487                          * finish the rebind operation in progress on it.
1488                          */
1489                         rcu_read_unlock();
1490                         do {
1491                                 ret = cpuset_change_task_nodemask(c,
1492                                                 &cpuset_attach_nodemask_to);
1493                                 if (ret == -EAGAIN)
1494                                         yield();
1495                         } while (ret != -EAGAIN)
1496                         put_task_struct(c);
1497                 }              
1498         } else {
1499                 do {
1500                         ret = cpuset_change_task_nodemask(tsk,
1501                                         &cpuset_attach_nodemask_to);
1502                         if (ret == -EAGAIN)
1503                                 yield();
1504                 } while (ret != -EAGAIN);
1505         }


as though holding a red pen and a lock stamp

Found two bugs in peripheral code around cgroup_attach_task while trying to finish up my older project, which is to add cgroup_attach_proc. More on the actual project later; for now:
  1. Earlier this year, cpuset was changed to use NODEMASK_ALLOC, to avoid stack-allocating a structure which can be huge on NUMA systems. However, the present usage is rather cavalier in functions that aren't allowed to fail!
  2. Last month, the locking order in memcontrol was changed to avoid a potential deadlock... introducing a bit of foot-stepping between it and cpuset.
Fortunately, none of these are design-level blockers to the patch series I'm trying to finalize. But the solutions could prove "interesting"...



2.6.34-rc2 came out three days ago, and supports modular cgroup subsystems.

(The net_cls patch wasn't merged with the rest and is presently still running the approval gauntlet, though even blkio-cg is already in.)


gdb is not so great at debugging within modules.

At the suggestion of LKML, I have spent today working on making blk-cgroup, also known as blkiocg, into a module. it has not gone entirely smoothly, as there is an already existing module (cfq-iosched) that depends on it, so the process has involved a good amount of figuring out how module dependencies (in terms of symbol-loading and the language of the kbuild system) work.

Now that I got it finally to build, boot, and load the modules (modprobe even figured out the dependency for me, it was great), I discovered that the kernel panics when trying to unload the subsystem (with what looks like a SIGILL in some memory management code). So, GDBing it up, I have discovered that it is difficult to deal with code that's loaded at runtime that isn't a standard dynamic library.


0x000000006005146a 3504 ss->destroy(ss, dummytop);
0x00000000628202ad in ?? ()
(gdb) break block/blk-cgroup.c:207
No source file named block/blk-cgroup.c.
Make breakpoint pending on future shared library load? (y or [n]) n
(gdb) disassemble 0x628202ad
No function contains specified address.
(gdb) disassemble 0x628202ad 0x62820363
Dump of assembler code from 0x628202ad to 0x62820363:
0x00000000628202ad: push %rbp

And you thought 213's bomblab would never be useful!