lesson: never hack when it won't solve the entire problem.

I'd like to share some code I wrote for a particularly misguided solution to a yak-shaving quest. Fortunately, soon afterwards, I realized a much better way to dodge the bullet that I was here trying to take in the chest and stagger back to camp while screaming for help... so this code will never see the light of day.

That's good, because I also think it has a fatal flaw. I won't say it, but I'll give an exposition so you can figure it out. Here's the idea:
  1. You are not allowed to block, sleep, or otherwise yield while in an "atomic" section of code (usually because it means interrupts are disabled). In order to safely iterate over a thread-group, you need to hold either rcu_read_lock or tasklist_lock, both of which are low-level locks and constitute "atomic" code while being held.
  2. In kernel/cpuset.c, cpuset_change_task_nodemask has two steps, and a check between those steps which does a yield to synchronize its internal state (as part of an open-coded replacement for a proper concurrency primitive... ugh. from commit c0ff7453bb5c7c98e0885fb94279f2571946f280 on mmotm). Here I have changed it to return -EAGAIN instead of yielding, so the caller can synchronize "more appropriately".
  3. get_task_struct and put_task_struct manage a thread's reference count. After you've done get_task_struct, it will always be safe to access that memory, until you release it. If you call put_task_struct and you're the last one with a reference count, it does a bunch of cleanup. (interesting read here, perhaps a hint.)

1456         /*
1457          * This particular per-task operation requires being able to sleep, so
1458          * it can't be done in the attach_task callback, where sleeping is
1459          * forbidden. See cgroup_attach_proc.
1460          */
1461         if (threadgroup) {
1462                 struct task_struct *c = NULL;
1463 again:
1464                 rcu_read_lock();
1465                 /* ensure safe thread-group traversal */
1466                 if (thread_group_leader(tsk)) {
1467                         list_for_each_entry_rcu(c, &tsk->thread_group,
1468                                                 thread_group) {
1469                                 get_task_struct(c);
1470                                 ret = cpuset_change_task_nodemask(c,
1471                                                 &cpuset_attach_nodemask_to);
1472                                 /*
1473                                  * on a failed nodemask change, we need to exit
1474                                  * the atomic section and start over.
1475                                  */
1476                                 if (ret == -EAGAIN) {
1477                                         rcu_read_unlock();
1478                                         yield();
1479                                         goto again;
1480                                 }
1481                                 put_task_struct(c);
1482                         }      
1483                         rcu_read_unlock();
1484                 } else if (c != NULL) {
1485                         /*
1486                          * if racing with exec caused an abort, we need to
1487                          * finish the rebind operation in progress on it.
1488                          */
1489                         rcu_read_unlock();
1490                         do {
1491                                 ret = cpuset_change_task_nodemask(c,
1492                                                 &cpuset_attach_nodemask_to);
1493                                 if (ret == -EAGAIN)
1494                                         yield();
1495                         } while (ret != -EAGAIN)
1496                         put_task_struct(c);
1497                 }              
1498         } else {
1499                 do {
1500                         ret = cpuset_change_task_nodemask(tsk,
1501                                         &cpuset_attach_nodemask_to);
1502                         if (ret == -EAGAIN)
1503                                 yield();
1504                 } while (ret != -EAGAIN);
1505         }


as though holding a red pen and a lock stamp

Found two bugs in peripheral code around cgroup_attach_task while trying to finish up my older project, which is to add cgroup_attach_proc. More on the actual project later; for now:
  1. Earlier this year, cpuset was changed to use NODEMASK_ALLOC, to avoid stack-allocating a structure which can be huge on NUMA systems. However, the present usage is rather cavalier in functions that aren't allowed to fail!
  2. Last month, the locking order in memcontrol was changed to avoid a potential deadlock... introducing a bit of foot-stepping between it and cpuset.
Fortunately, none of these are design-level blockers to the patch series I'm trying to finalize. But the solutions could prove "interesting"...