Judging from the AMD Hammer presentation at the Microprocessor Forum this
past week, the Hammer is a very exciting system. If done right, it should
be incredibly fast and scalable. I'm just a little concerned that it
won't be done correctly. The reason for my skepticism is that this
architecture appears to be a ccNUMA architecture masquerading as an SMP system.
*If* the hardware is designed in such a way that it can provide hints
to the OS about where certain memory lives, then the whole thing will work,
and work beautifully. If the hardware hides these hints, however, then
the OS will have no means of making intelligent memory placement decisions,
and the end result will be degraded performance in an MP system.
Let me start my argument with an example:
I own a Fujitsu Pentium 133 MMX laptop with 256KB of external L2 cache
and the Intel TX chipset. The machine also has 80MB of SDRAM (16MB onboard,
64MB SODIMM). As any hardware junkie from 5 years ago will tell you, the
TX chipset can't cache any memory above 64MB. I thought it was just
the VX chipset that had this limitation, so I went happily along using this
laptop for a few years. It took approximately 50 minutes to compile
a recent 2.4 Linux kernel. Then I stumbled onto a website that informed
me of the TX's 64MB restriction, and after a collective 'd'oh!', I used a little
feature of Linux that lets you turn the uncached memory into a fast RAM
drive and use it as swap. With only 64MB of main memory, and a really fast,
high-priority swap device, kernel compile times dropped to approximately 30 minutes.
Kernel compiles are one of the many things that truly benefit from
cached data, yet 1/5 of my RAM could never be cached. Since virtually
contiguous memory doesn't have to map to contiguous physical pages, every
memory reference had a 1 in 5 chance of landing in pages that -always- go out
to slow, high-latency main memory. The result is that the average memory
access time with the cache enabled was much higher than it should have been.
(Say a cache hit has a latency of 40ns, and main memory has a latency of 200ns.
If 1 out of 5 references goes to main memory, then the average access
time is ((40+40+40+40+200)/5) = 72ns, nearly twice what it should be.) The
real world performance increase speaks for itself.
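To put some numbers behind that, here's a trivial C snippet. The 40ns and
200ns figures are the same illustrative numbers as above, not measurements,
and it ignores ordinary cache misses entirely:

#include <stdio.h>

/* Average access time when some fraction of physical memory can never be
   cached: those references always pay full main-memory latency, while the
   rest are assumed to hit in cache (ordinary cache misses are ignored). */
static double avg_access_ns(double uncacheable_frac, double cache_hit_ns,
                            double main_mem_ns)
{
    return (1.0 - uncacheable_frac) * cache_hit_ns
         + uncacheable_frac * main_mem_ns;
}

int main(void)
{
    double frac = 16.0 / 80.0;   /* 16MB of the laptop's 80MB is uncacheable */
    printf("average access time: %.0f ns\n", avg_access_ns(frac, 40.0, 200.0));
    return 0;
}

With those inputs it prints 72 ns, the same number as the hand calculation.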
Linux treated all memory on my machine as being equal, and was getting suboptimal
performance. Once I taught Linux that all memory was not equal, it was
able to get optimal performance. There is no way for me to teach Windows
about the same problem on the same system. This situation is -very-
similar to the one faced on an MP Hammer system.
All memory is created equal, but some is more equal than others.
Each Hammer processor has an integrated high-performance memory controller
(PC2700 DDR333, 64 or 128 bits wide). This means that accesses to local
memory will have incredibly low latency (since some of the complexity and
wire-length problems have been removed), and should deliver high bandwidth.
A single-processor Hammer system will be a real screamer, delivering uniform
performance for any physical memory reference. Now add a second processor,
with its own local memory. AMD has stated in their presentation that
the "Software view of memory is SMP" (slide 44). If this means that
from the software point of view, all memory is equal, then we could have a
problem (that is, IF the latency of going to remote memory is much higher
(say, twice as high) as going to local memory). Since the software (and
OS) will see all memory as equal, it will happily allocate physical memory
from -either- physical address pool. Thus, as in the case of my little
old uncached ram notebook, the average memory latency will increase. If
the latency penalty is small for going to non-local memory, this isn't a
big deal. But if the penalty is even 1.5x that of the latency of local
memory, overall memory access latency will increase by 1.25x on a 2 processor
system. This will result in overall lower memory performance from a
MP machine than a uni-processor. And, given that memory performance
is the -real- bottleneck in todays systems, it does no good to have a second
processor, no matter how fast, if it sits idle waiting for memory accesses
most of the time. The situation becomes worse as you add more processors
(i.e., 4 processors, 3/4 chance of having to go to remote memory, 8 processors,
7/8 chance, etc..)
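Here's a quick back-of-the-envelope model in C. It assumes pages get spread
evenly across all nodes and uses the hypothetical 1.5x remote penalty from
above; the real numbers will depend entirely on how AMD builds the thing:

#include <stdio.h>

/* With N nodes and pages handed out blindly across all of them, a given
   processor's references land on its own node only 1/N of the time.
   remote_ratio is the assumed remote-to-local latency ratio. */
static double avg_latency_factor(int nodes, double remote_ratio)
{
    double local_frac = 1.0 / nodes;
    return local_frac * 1.0 + (1.0 - local_frac) * remote_ratio;
}

int main(void)
{
    double remote_ratio = 1.5;   /* hypothetical 1.5x remote penalty */
    int n;

    for (n = 1; n <= 8; n *= 2)
        printf("%d processor(s): %.1f%% remote, %.2fx local latency\n",
               n, (1.0 - 1.0 / n) * 100.0,
               avg_latency_factor(n, remote_ratio));
    return 0;
}

That works out to 1.25x average latency with 2 processors, 1.38x with 4, and
1.44x with 8, creeping toward the full remote penalty as you add nodes.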
This isn't a new problem by any stretch of the imagination. It's the
same problem that supercomputer manufacturers have faced for decades.
The difference is that companies like SGI, Cray, and Sun designed the
hardware so that it could provide hints to the OS on how to handle memory
management intelligently, and then they designed the OS to take advantage
of those hints. For example, if the OS knows that a process is single-threaded,
it will try to keep all the RAM that process allocates in the physical
memory attached to the processor it's running on. This lets
that process always have the fastest memory accesses possible, and only
go "off chip" when absolutely necessary. If the OS were not aware
of the memory layout of the machine, that process's allocated memory could
be spread throughout the whole system, causing unnecessary contention on
the HT network and higher memory latencies. In this sense, each processor's
local memory can almost be thought of as a "level 3" cache. It's a much
more complex OS design, and very difficult to get right, but it does allow
for maximum performance in a lot of cases.
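To make that concrete, here's a toy sketch in C of the kind of "local node
first" page allocation policy I'm talking about. The node count, pool sizes,
and function names are all made up for illustration; this isn't AMD's
mechanism or any real kernel's allocator:

#include <stdio.h>

#define NUM_NODES 4          /* hypothetical 4-processor Hammer box */

/* Toy model: each node has a pool of free pages attached to its own
   memory controller.  A real OS would track actual page frames; here a
   simple counter per node stands in for the free list. */
static long free_pages[NUM_NODES] = { 2, 0, 5, 5 };

/* Node-local ("first touch") policy: satisfy an allocation from the
   requesting processor's own memory first, and spill to remote nodes
   only when the local pool is exhausted. */
static int alloc_page(int local_node)
{
    int i, node;

    if (free_pages[local_node] > 0) {
        free_pages[local_node]--;
        return local_node;                 /* fast path: local memory */
    }
    /* Fall back to remote nodes (a real OS would go nearest-first). */
    for (i = 1; i < NUM_NODES; i++) {
        node = (local_node + i) % NUM_NODES;
        if (free_pages[node] > 0) {
            free_pages[node]--;
            return node;
        }
    }
    return -1;                             /* out of memory everywhere */
}

int main(void)
{
    int i;
    /* A single-threaded process running on node 0 asks for 4 pages. */
    for (i = 0; i < 4; i++)
        printf("page %d came from node %d\n", i, alloc_page(0));
    return 0;
}

The point is simply that an allocator can only prefer local memory if the
hardware tells the OS which pages are local in the first place.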
For this reason I implore AMD to provide some sort of mechanism to allow
OS developers to take full advantage of the NUMA aspects of their new system
design. Those OSes that choose to take advantage of them can, and those
that don't will still work fine.
Even if AMD decides not to provide hints to the OS, there are still a few
things they can implement (and quite likely already have, for this very reason)
that will help alleviate the problem.