More Comments on AMD's Hammer.
by Pianoman <john@deater_net>
The Opteron is here, and it is fast... and then some.
Faster than an Itanium 2 in some cases, and faster than the Xeon at almost
everything, even with hyperthreading. AMD has a real winner on its
hands, and yet not many people seem to realise that so far we've just
seen the tip of the iceberg performance-wise.
So what makes the Opteron so good,
anyway?
Let's go over the areas where the Opteron has improved performance over
a traditional IA-32 chip. The first category consists of hardware and
architectural improvements that apply in every mode the processor runs
in. These improvements do not require any special compiler or OS
support. They are:
- Large 1MB L2 cache
- Larger, more efficient TLB
- Integrated dual channel PC2700 memory controller
- SSE2
- Better fetch/decode logic and branch prediction (I'm not sure how much
this differs from the Athlon)
- HyperTransport for I/O transactions
If you bought an Opteron today and ran any current version of Windows
and any current program on it, you would see the benefits of all of
these improvements, and they combine for some very impressive
performance increases across the board. In effect, it's a "souped up
Athlon". Unfortunately, in some reviews, these are all the improvements
you'll see... but I'm getting ahead of myself. More on that later.
The second category of improvements revolves around the x86-64
extensions that the Opteron brings to the table. In order to experience
these improvements, you'll need a new OS and a binary recompiled for
x86-64. These include such things as:
- 64-bit addressing
- 64-bit datapath
- 8 more general purpose registers (R8-R15)
- 8 more 128-bit SSE2 floating point registers (XMM8-XMM15)
- RIP-relative addressing mode
- SWAPGS instruction
This is the real performance goldmine, but you'll need to be running
Linux/x86-64 and using GCC (or another x86-64 capable compiler) in order
to take full advantage of it.
The final category of improvements is in the system
architecture. That would be:
- NUMA (HyperTransport for SMP, integrated memory controller)
While any current OS will be able to use the memory on an Opteron
system (up to 4GB; beyond that, I'm assuming, it will have to fall back
to PAE), a careful OS would be able to glean much more performance by
controlling memory placement (see my first comments on the Hammer here).
Ok, ok, so do the second and third
categories really matter?
It's fairly obvious that the first set of improvements is a major
factor in why the Opteron is performing so well. So how much of a
difference will the second and third categories really make? Well, let
me highlight some of the important ones with regard to performance.
64-bit addressing/64-bit datapath.
This is actually a double-edged sword. On an application level, you
get a tremendous advantage when dealing with large data sets, or
when dealing with larger-than-32-bit numbers. Most programs don't
fall into this category, however. The x86-64
architecture is actually rather nice in that if you don't -need- 64-bit
data you can use the old 32-bit instructions just fine and only use
64-bit ones as necessary. However, a pointer is always going to be
64 bits in x86-64 mode, so some applications might take a slight
performance hit because of the extra strain on the memory bus caused by
loading 64 bits for every pointer instead of 32. This is one reason why
AMD gave the Opteron such a fast, wide memory bus.
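To put a rough number on the pointer overhead, here is a minimal sketch of how a linked-list node grows when pointers go from 4 to 8 bytes. The node layout (one 32-bit integer plus one "next" pointer), the list length, and the padding rule (round the struct up to a multiple of the pointer size, as typical C ABIs do) are my own assumptions for illustration, not measurements from any Opteron.

```python
def node_size(payload_bytes, pointer_bytes):
    """Size of a struct holding one payload field and one 'next' pointer.

    Assumption: the struct is padded up to a multiple of the pointer
    size, which is what typical C ABIs end up doing for this layout.
    """
    raw = payload_bytes + pointer_bytes
    padding = (-raw) % pointer_bytes
    return raw + padding


NODES = 1_000_000   # hypothetical list length
PAYLOAD = 4         # one 32-bit integer per node

size32 = node_size(PAYLOAD, 4)   # 32-bit mode: 4-byte pointers
size64 = node_size(PAYLOAD, 8)   # x86-64 mode: 8-byte pointers

print(f"32-bit node: {size32} bytes, list: {NODES * size32 / 2**20:.1f} MiB")
print(f"64-bit node: {size64} bytes, list: {NODES * size64 / 2**20:.1f} MiB")
```

Under these assumptions the node doubles from 8 to 16 bytes, so a pointer-heavy structure moves twice as much data over the memory bus. Code that mostly handles plain 32-bit data pays little or nothing.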
From an OS point of view, however, this is incredibly useful. While a
full explanation of VM, page tables, and Intel's PAE hack to support
more than 4GB at a time is beyond the scope of this article, suffice it
to say that the Opteron can address up to a terabyte of physical memory
directly (40 physical address bits on the first parts), without the
remapping a 32-bit OS needs to reach memory above 4GB. That makes
access to large amounts of memory much quicker than on Intel's 32-bit
chips.
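The address-space arithmetic behind this is easy to check. The sketch below just converts address widths to sizes. The widths themselves are figures I am supplying rather than anything stated above: 32 bits for plain ia32, 36 bits for PAE physical addresses, and 40 physical / 48 virtual bits for the first Opterons.

```python
GIB = 2**30


def addressable_gib(address_bits):
    """Bytes reachable with the given number of address bits, in GiB."""
    return 2**address_bits // GIB


# Address widths are assumptions supplied for illustration.
widths = {
    "ia32 flat (32-bit)": 32,
    "ia32 with PAE (36-bit physical)": 36,
    "Opteron physical (40-bit)": 40,
    "Opteron virtual (48-bit)": 48,
}

for name, bits in widths.items():
    print(f"{name}: {addressable_gib(bits)} GiB")
```

Note that PAE raises the physical limit but not the size of a pointer. A 32-bit process still sees at most 4GB at a time, and anything beyond that has to be mapped in and out of view.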
More registers. The main
complaint about the x86 is that it only has 8 general purpose
registers. The Opteron has 16. This means that whenever the compiler
decides that code needs to work with more than 8 values at a time, it
can keep up to 8 more of them in registers as opposed to
spilling them to the stack in main memory. I remember reading somewhere
that the extra registers cause code compiled for the x86-64 to make up
to 1/3 fewer memory references than traditional x86 code. Also look at this email
which shows that, for at least these compiled binaries, the number of
stack manipulations required for x86-64 is far, far lower than that
required for traditional x86. (For example, the compiler for x86
generated 117723 'push'es to the stack, while the x86-64 code generated
just 20264.) So, in short, the extra registers help keep the processor
busy instead of waiting for more data.
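For a sense of scale, the two push counts quoted above work out as follows. This is only arithmetic on those two numbers; the helper name is mine, and the counts say nothing by themselves about how often each push actually executes at run time.

```python
X86_PUSHES = 117723      # push count quoted above for the x86 build
X86_64_PUSHES = 20264    # push count quoted above for the x86-64 build


def reduction(before, after):
    """Fraction of the original count that was eliminated."""
    return (before - after) / before


ratio = X86_PUSHES / X86_64_PUSHES
saved = reduction(X86_PUSHES, X86_64_PUSHES)

print(f"x86 has {ratio:.1f}x as many pushes; x86-64 removes {saved:.0%} of them")
```

That is a little under a sixfold drop in pushes, noticeably larger than the "up to 1/3 fewer memory references" figure, which is plausible because pushes are only one kind of memory reference.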
NUMA. If the OS is capable of
being intelligent about where it places an application's memory, then
overall memory bandwidth can be maximised and latency minimised for a
specific program. This requires an intelligent, modern OS. Linux supports
this at least to a certain degree, but no current version of Windows
does (I have heard rumors Windows Server 2003 will, but I can't
verify that). Therefore, when running in an SMP environment, the
average latency and bandwidth to memory under most current
operating systems will be less than optimal.
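A toy model shows why placement matters. The sketch below computes the average memory latency as a weighted mix of local and remote accesses. Both latency figures and both placement fractions are hypothetical numbers I picked for illustration, and the model assumes every remote access costs the same, which ignores multi-hop topologies and bandwidth contention.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Average latency when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


LOCAL_NS = 100.0    # hypothetical latency to the CPU's own memory
REMOTE_NS = 150.0   # hypothetical latency via another CPU's controller

# NUMA-unaware OS on a 4-way box: pages land evenly across all 4 nodes,
# so only a quarter of accesses are local.
numa_unaware = average_latency_ns(0.25, LOCAL_NS, REMOTE_NS)

# NUMA-aware OS: assume it manages to keep 90% of accesses local.
numa_aware = average_latency_ns(0.90, LOCAL_NS, REMOTE_NS)

print(f"unaware: {numa_unaware:.1f} ns, aware: {numa_aware:.1f} ns")
```

With these made-up numbers the unaware case averages 137.5 ns against roughly 105 ns for the aware case. The real gap depends entirely on the actual local and remote latencies and on how well the OS can keep a process next to its memory.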
Only when all these performance enhancements are combined will we be
able to fully see the true power of the Opteron, and I believe we're in
for a huge surprise when we do.
Lies, Damn Lies, and Benchmarks (for
once, they're -underestimating-)
So let's take a look at some of the benchmarks people have been using to
showcase the Opteron. Let's start with AMD's own. You can
find them at AMD's page here.
First, the TPC-C benchmarks. That's a -highly- impressive number for the
4-way Opteron! It gives a 4-way 1.0GHz Itanium 2 with 16GB more
memory a run for its money, while handily beating the 4-way Xeons.
Very impressive. Now look at the configuration information, and try not
to laugh... yep, that's right. It's running MS Windows 2003 and
MS SQL Server 2000, neither
of which is compiled for x86-64! Considering MS won't even have
a beta of Win2003 Server for x86-64 until the end of June, it has to be
the 32-bit version.
Also consider the fact that the Itanium configuration explicitly spells
out "Windows 64-bit edition", while the Opteron configuration does not.
This also means, since the Opteron system has 32GB of memory, that the
OS is using the old, slow Intel-style PAE hack to address all of its
memory instead of the new, fast, direct addressing available on the
Opteron. Just imagine the speedup possible when the OS is able to
directly access all the memory and the application has to make 1/3 fewer
memory references. Right now, you're essentially running 4 "souped up
Athlons".
So what we essentially have is an entire benchmark suite that the
Opteron is kicking butt in while only flexing some of its muscle.
Salivating yet? Just wait, it gets better...
Now look at the MS Exchange benchmarks. It's the same story, only we
know for a fact it's not a 64-bit OS because the OS is Windows 2000,
and the app is Exchange 2000, which will never be compiled for
x86-64. Again, these Opterons in "souped up Athlon" mode are
doing extremely well.
The SPEC scores? Yep. MS Windows 2003 Server. 32-bit OS, 32-bit apps.
More scary? Read the SPEC submission PDF. They had to use Intel's C compiler, so again,
you're really looking at the SPEC scores of a processor that is using only
about 2/3 of its capabilities. Given that the SPEC scores are so high,
that's just downright frightening. I imagine a 20% boost in SPECfp from
the extra 8 SSE2 registers alone. This also underscores a major problem
with AMD's launch of the Opteron: the lack of a highly optimised
compiler for 64-bit mode. But I'll get to that later.
At the bottom of AMD's page, things get very, very interesting. Recall
that for every other benchmark listed on this page, the Opteron has
been running in souped up Athlon mode. Also note that, while they are
doing very well against Xeons, AMD carefully shows only Itanium scores
from the 1GHz or 900MHz Itanium 2. Where are the 1.5GHz Itanium 2 scores?
Well, maybe they weren't available, I don't know. But in general, the
trend is "Opteron is better than the Xeon, a little worse than the
Itanium 2". But the SPECweb99 scores change that.
For the first time, we see a 64-bit binary running on a 64-bit OS. And
suddenly, the Opteron is starting to really flex its muscle, beating
out even a 1.5GHz Itanium 2. Unfortunately, the benchmark chosen
is really a very poor showcase for the Opteron as a processor, as
SPECweb relies much more on the size of your memory, disk speed, and OS
performance than on processing power. In the "32-bit vs. 64-bit"
slides we can see the speed advantage of 64-bit direct addressing as
opposed to PAE addressing. The machine contains 16GB of memory,
so a 32-bit OS would have to use PAE mode. As Zeus does little
more than tell the OS "send this page", I doubt it got much of an
advantage from being recompiled, so I'm conjecturing that almost all of
the 14% boost in speed comes from the OS being able to address all 16GB
of its in-memory buffer cache directly, instead of going through PAE
hijinks.
So, AMD doesn't seem to be showing off many benchmarks that show the
Opteron operating at its full potential. What about anyone else?
Remember, the only x86-64 64-bit OS out there right now is Linux (and
NetBSD, and FreeBSD...), so you'll have to be running one of those; no
Windows platform. This also means you're pretty much stuck with
GCC, which isn't always that good of an optimising compiler.
Well, Tom's
Hardware has at least attempted some 64-bit tests using Linux, but
didn't indicate whether the apps they used were compiled for 32-bit or
64-bit. While I personally can't stand the site (I'm making no attempts
to be unbiased here), they did at least try.
Ace's Hardware
also did several nice Linux tests, breaking down the x86-64 compiled
ones and more.
Every other review I've read that's given performance data has been
using the Opterons in 32-bit mode.
Bottom line is, we have very few pure x86-64 environments and test
results... and the platform is still
impressive. If only AMD could get their act together...
Where AMD Dropped the Ball
AMD has only itself to blame for the fact that the Opteron is, for the
most part, only using part of its impressive ability. As with the P4
and the Itanium, it all comes down to compilers. Intel realised it
needed good compilers to show off not only the Itanium, but also the P4
(which rewrote every x86 optimisation in the book, so years of x86
optimisation theory had to be thrown out and redone from
scratch, not to mention SSE). So Intel went out and bought Kuck
and Associates, makers of a very good optimising C and Fortran
compiler, and also managed to finagle pretty much the entire Compaq
compiler group when HP took them over (no, there were no back-room
dealings there... naah... that never happens...). The result? Intel has
a very fast compiler now. AMD, until now, has only made x86 clones, so
they've just used the Intel compiler.
However, now AMD has its own architecture, which needs its own
compilers to take full advantage of it. Realizing this, AMD actively
helped the GCC people write an x86-64 backend. However, while GCC is a
very good general purpose compiler, it is not nearly as good at high
performance optimisation as Intel's compiler. Microsoft's compiler is a
very good optimising compiler, but Microsoft apparently has yet to come
out with full support for x86-64. The only other optimising compiler out
there for the x86-64 that I know of is the Portland Group's PGICC. However,
this apparently still isn't as good in 64-bit mode as Intel's is in
32-bit mode, or else one would think AMD would have submitted SPEC
scores using this compiler instead of Intel's.
AMD needs to either partner with or buy up a compiler company to get a
real, generally available compiler out there optimised for their
architecture. As it stands right now, the only compiler generally
available that can make full use of all the Opteron's features is GCC,
and GCC is not capable (yet) of supporting the types of optimisations
that will be needed to be competitive. AMD can no longer rely on Intel.
The only other possible explanation for the current situation is that
the 64-bit code is too bulky and thus overshadows any benefit gained by
the extra registers available in 64-bit mode, but I just can't fathom
that as a likely scenario.
Conclusions
Intel should be scared. The Opteron is beating its flagship products
pretty handily, and yet has barely started to use most of its
capability. In every test that I've seen a Xeon or P4 "beat" an
Opteron, the Opteron has been running in 32-bit mode, using only half
its registers and 32-bit addressing. Not to mention the fact that the
Opteron is only running at 1.8GHz, and should be able to scale easily
to 2.2GHz, and quite probably close to 3GHz before a major redesign is
needed. AMD should be able to release faster processors almost at will
for the next 6 months at least.
However, it's not the hardware that AMD needs to improve. It's the
software. I can't wait to see a SPEC score from an Opteron running code
compiled with a decently optimising compiler for the x86-64
architecture. With that in place, AMD could, quite frankly, have the
fastest CPU in the world.
A word of caution, however, to anyone who is reading reviews or sees
anything about the Opteron. Always, always, always find out as much info
as you can about what software it was running! And if it's not
running in 64-bit mode, then just think of how much faster it could be.
The hardware's there, just waiting to be used.
Comments, corrections, discussion welcome.
-pm
About the Author
I could be way off base on all this, but I don't think I am.
Also, writing at 4am because of a bout of insomnia doesn't help clear
my head, so apologies for any mistakes, misconceptions, or misnomers.
I'm a student of Computer Architecture, and I always will be no matter how
old I am or who I work for, and I write about stuff that I enjoy. I am
currently happily employed, but you can view my resume
and my poorly maintained homepage. You can view
my other ramblings about HPC and Linux utilities as well.
peace.