Asymmetric Multiprocessing for Linux
Processor Groups Linux Kernel Patch
THIS IS WAAAY OUTDATED! It’s basically just task and irq processor affinity, which can now be handled by schedutils and careful /proc/irq/#/irq_affinity handling. I’m just keeping this here for historical reasons. — john.c (10/16/2005)
This was one of my first forays into seeing what could be done with an OS scheduler to take advantage of SMP and NUMA multiprocessor environments to help speed up numerical computations. It is an ugly, ugly hack that was more a proof of concept than anything else. It did work, but the benefits gained are very small, especially on an SMP machine. I made this patch around the 2.4.0-test days of the Linux kernel. It allowed an administrator to specify, at compile time, that a certain number of processors in a multiprocessor system be “just” application processors (i.e., never tied down with OS tasks). As a side effect, it also allowed a user to tie a process to a particular CPU. Most of the work for this had already been done by SGI (and others). In essence, it would allow users of an SMP machine to use asymmetric multiprocessing.
Benefits of assigning a process to a specific processor
If a process is assigned to only one processor, it never has to ‘re-prime’ the cache after being switched to a new processor (and a new cache). For CPU/memory intensive tasks, this can lead to a small increase in performance. On a multiprocessor system, we also have the luxury of modifying the scheduler to let the task run uninterrupted on that processor if we want, because normal machine activity (interrupts, other processes, login shells, etc.) will continue to be handled by the other processors. This also maximises the application’s use of the cache, since we guarantee that no other process, not even the expiry of a scheduling quantum, will interrupt the program.
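For reference, this is roughly what pinning a process to one CPU looks like on a stock kernel today. It’s a minimal sketch using sched_setaffinity(2), which didn’t exist back in the 2.4.0-test days; the patch used its own mechanism, and CPU 2 below is just an arbitrary choice.

/* Sketch: pin the calling process to a single CPU so its working set
 * stays warm in that CPU's cache.  Modern equivalent of what the patch
 * (and tools like taskset) provide; not the patch's own interface. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(2, &mask);              /* CPU 2 is an arbitrary example */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    /* From here on the scheduler will only run us on CPU 2. */
    /* ... CPU/memory intensive work goes here ... */
    return 0;
}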
The Problems
- The performance increases only appear for CPU/memory bound tasks, which is a very small number of applications. Any I/O waiting means the application processor sits idle when it could be doing other, useful work.
- On an SMP machine, you are still forced to share CPU bus and memory bus bandwidth with the other processors in the system, so while you get the benefit of your own processor and cache, you still only get ‘your share’ of the available bandwidth. On a NUMA system this is a different story, and this patch could be much more useful there, where you aren’t necessarily limited by such sharing. Unfortunately, I don’t have a NUMA machine to test this theory on :). Anyone wishing to donate one is more than welcome to contact me.
- There’s a school of thought, subscribed to by Linus and most other Linux developers, that this behaviour would come about naturally if Linux’s scheduler were perfect, and thus we should work on making the scheduler better rather than coming up with ugly hacks like this. In general, I agree. But I also think there’s nothing wrong with making performance your top goal and, in the interim, using hacks such as this to help get your project done.
The Patch
I submitted this patch on the SGI Linux Scalability list. It generated a little discussion, but no one seemed that interested in general, so I lost interest for now. Then IBM came out with their Linux Scalability Project, and after that no one ever really posted to the SGI list but me and one guy from SGI. I guess everything happens on IBM’s list now; I don’t have much time to keep track.
This patch consists of a few parts. First, the kernel patch, which is against kernel 2.4.0-test6 (I believe). Don’t expect it to work with later kernels; I’ve never tried, and I don’t even have my old PII dual-processor machine to test on anymore. But there should be enough there to figure out what I was trying to do. That is available here: procgroup-2.4.0-test6.diff
Second, once you’ve booted your kernel, you can optionally stop all interrupts from being routed to your application processors. To do so, go into /proc/irq/#/irq_affinity and write into each one the bitmask of CPUs that interrupt is allowed to be processed on: a 1 bit for each OS CPU and a 0 bit for each application CPU.
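If it helps, here is a small sketch of doing that write programmatically. It assumes IRQ 14 and a single OS CPU (CPU 0), and it uses the irq_affinity name from this page; stock kernels expose the equivalent knob as /proc/irq/#/smp_affinity.

/* Sketch: steer one IRQ to the OS processor(s) only, by writing a hex
 * CPU bitmask to the per-IRQ affinity file.  IRQ 14 and mask 0x1
 * (CPU 0 only) are arbitrary examples. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/proc/irq/14/irq_affinity";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }

    /* Mask 0x1 = CPU 0 only (the "OS" processor).  Application
     * processors get a 0 bit and thus never service this interrupt. */
    fprintf(f, "1\n");
    fclose(f);
    return 0;
}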
Finally, you need the “assign2proc” program, which is available in source form here. It’s ugly, it’s a gaping security hole, but it does work. It’s been so long, I don’t remember the syntax. Use the source, Luke. I’m not going to touch it again unless there’s some interest generated (by me or someone else). Here’s the source for assign2proc: assign2proc.c
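Since I can’t give you the syntax, here is a hedged, modern stand-in for the kind of thing assign2proc does: bind an existing PID to one CPU. The command-line arguments and the use of sched_setaffinity(2) below are illustrative only; they are not assign2proc’s interface or mechanism.

/* Sketch: "pin <pid> to <cpu>" using the modern affinity syscall.
 * Binding someone else's process generally needs root (much like the
 * original tool).  Not assign2proc; just a stand-in for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
        return EXIT_FAILURE;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    int   cpu = atoi(argv[2]);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    if (sched_setaffinity(pid, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    printf("pid %d bound to cpu %d\n", (int)pid, cpu);
    return 0;
}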
Conclusions:
After getting this hack up and running, I ran a few tests and was wholly unimpressed with the minuscule speedups gained. And since no one else seemed interested in the hack, I’ve given up on it. Future work should concentrate on improving the smarts of the current Linux scheduler rather than on bastardized hacks such as this. But I’m posting it anyway, even after all this time, because it was an interesting project, I had fun doing it, and maybe someone will stumble across it someday and get inspired.
Feel free to email me at john@deater.net or clemej@alum.rpi.edu.
You can visit my poorly maintained homepage as well… Maybe even look at my resume?