Sunday, August 22. 2010
Sometimes charts look too perfect to be measured. I had this feeling when I saw the rperf numbers of the 795 and put them into a chart. I'm a very visual person, so I put everything into a chart just to get a feeling for the numbers.
At first I thought I was paranoid, but then my colleague Jan Brosowski mailed me that he had thought the same, although he approached the problem from the mathematical point of view. Okay ... that left me with a lot of questions, and so I did some quick bullshit-testing math on the data points.
Some math
After reading his mail I wanted to do some tests of my own, so I did a short test with the numbers. First I put the data of the 3.7 GHz P7 into my favorite statistical program, R.
> procs <- c(24,48,72,96,120,144,168,192)
> rperf <- c(273.51,547.02,820.53,1094.04,1367.55,1641.06,1914.57,2188.08)
> fm <- lm (rperf ~ procs)
> fitted.values(fm)
      1       2       3       4       5       6       7       8
 273.51  547.02  820.53 1094.04 1367.55 1641.06 1914.57 2188.08
> residuals(fm)
            1             2             3             4             5
-7.886829e-14 -1.045378e-14  6.540129e-14  6.414338e-14  3.446377e-14
            6             7             8
-2.363755e-14 -2.489546e-14 -2.615336e-14
> coefficients(fm)
 (Intercept)        procs
1.607775e-13 1.139625e+01
> summary(fm)
Call:
lm(formula = rperf ~ procs)
Residuals:
       Min         1Q     Median         3Q        Max
-7.887e-14 -2.521e-14 -1.705e-14  4.188e-14  6.540e-14
Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.608e-13  4.241e-14 3.791e+00  0.00906 **
procs       1.140e+01  3.499e-16 3.257e+16  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.442e-14 on 6 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.061e+33 on 1 and 6 DF, p-value: < 2.2e-16
Strange ... the linear model yields coefficients that predict the rperf value per core with minimal residuals. And I learned not to trust data with an R-squared of 1. Okay ... let's check the numbers for the 4.0 GHz P7:
> procs <- c(32,64,96,128,160,192,224,256)
> rperf <- c(372.27,744.54,1116.81,1489.08,1861.35,2233.62,2605.89,2978.16)
> fm <- lm (rperf ~ procs)
> residuals(fm)
            1             2             3             4             5
 1.627514e-13 -1.454981e-13  4.364426e-16 -4.677910e-15 -3.821397e-14
            6             7             8
-1.490662e-14 -1.052861e-13  1.453949e-13
> fitted.values(fm)
      1       2       3       4       5       6       7       8
 372.27  744.54 1116.81 1489.08 1861.35 2233.62 2605.89 2978.16
> coefficients(fm)
  (Intercept)         procs
-3.215549e-13  1.163344e+01
Again ... minimal residuals. Okay ... last check ... for the 4.25 GHz procs.
> procs <- c(24,32,48,64,80,96,112,128)
> rperf <- c(347.36,463.14,694.71,926.28,1157.85,1389.42,1620.99,1852.56)
> fm <- lm (rperf ~ procs)
> residuals(fm)
            1             2             3             4             5
 3.163842e-03 -1.638418e-03 -1.242938e-03 -8.474576e-04 -4.519774e-04
            6             7             8
-5.649718e-05  3.389831e-04  7.344633e-04
> fitted.values(fm)
        1         2         3         4         5         6         7         8
 347.3568  463.1416  694.7112  926.2808 1157.8505 1389.4201 1620.9897 1852.5593
> coefficients(fm)
 (Intercept)        procs
 0.002429379 14.473100282
Sorry ... that looks too perfect to me.
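A quicker way to see the same thing, sketched here in R as a rough cross-check and not as anything IBM has published: divide each rperf value by its core count and look at how much the result varies within a series. The vectors are the data points from above; the helper name `per_core` and the data frame names are made up for this illustration.

```r
# rperf data points as quoted above (cores, rperf) for the three clock speeds
p37  <- data.frame(procs = c(24, 48, 72, 96, 120, 144, 168, 192),
                   rperf = c(273.51, 547.02, 820.53, 1094.04,
                             1367.55, 1641.06, 1914.57, 2188.08))
p40  <- data.frame(procs = c(32, 64, 96, 128, 160, 192, 224, 256),
                   rperf = c(372.27, 744.54, 1116.81, 1489.08,
                             1861.35, 2233.62, 2605.89, 2978.16))
p425 <- data.frame(procs = c(24, 32, 48, 64, 80, 96, 112, 128),
                   rperf = c(347.36, 463.14, 694.71, 926.28,
                             1157.85, 1389.42, 1620.99, 1852.56))

# hypothetical helper: rperf per core for every configuration of a series
per_core <- function(d) d$rperf / d$procs

# spread between the best and the worst per-core value of each series
spread <- sapply(list(p37 = p37, p40 = p40, p425 = p425),
                 function(d) diff(range(per_core(d))))
print(signif(spread, 3))
```

For measured data I would expect the per-core value to drift with growing core counts. Here it stays constant within floating-point noise for the 3.7 and 4.0 GHz series, and the only visible deviation (the 24-core point of the 4.25 GHz series) is consistent with a rounding to two decimals.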
When you put the data of the 64-core LPARs into the same model, you will see for 4.0 GHz:
> procs <- c(128,256)
> rperf <- c(1406.36,2812.72)
> fm <- lm (rperf ~ procs)
> coefficients(fm)
(Intercept) procs
-3.215549e-13 1.098719e+01
And now for the 4.25 GHz P7:
> procs <- c(64,128)
> rperf <- c(777.09,1554.18)
> fm <- lm (rperf ~ procs)
> coefficients(fm)
(Intercept) procs
-1.607775e-13 1.214203e+01
Both times the intercept is 0 (I assume the small intercept is owed to rounding of the data points by IBM or to the challenges of floating-point arithmetic on computers).
That's totally unreasonable for measured data. Even when you just assume 99% of the performance for the 256-core data point (thus a practically impossible scaling factor), you would have an intercept of about 28.13.
> procs <- c(128,256)
> rperf <- c(1406.36,2812.72*0.99)
> fm <- lm (rperf ~ procs)
> coefficients(fm)
(Intercept) procs
28.12720 10.76744
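Why 28.13? With only two data points the regression line simply passes through both of them, so the intercept can be computed by hand. A small sketch in R; the function name `intercept_two_points` and the efficiency values other than 0.99 are my own assumptions for illustration:

```r
# Two data points: n cores deliver y, 2n cores deliver 2 * y * eff,
# where eff is the assumed scaling efficiency of the doubling.
intercept_two_points <- function(n, y, eff) {
  slope <- (2 * y * eff - y) / n   # rise over run between the two points
  y - slope * n                    # algebraically 2 * y * (1 - eff)
}

# hypothetical efficiencies; 0.99 is the case from the text above
eff <- c(1.00, 0.99, 0.95, 0.90)
print(data.frame(eff = eff,
                 intercept = intercept_two_points(128, 1406.36, eff)))
```

Every percent of lost scaling shows up as a clearly nonzero intercept, which is why an intercept of practically 0 looks computed and not measured.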
Conclusion
At the moment I don't believe that IBM has really measured all the data it provides in the rperf list. The data fits too perfectly into a linear model. The interesting question is: "Which data points were really measured?" All the data provided for the configurations looks computed or guessed or something like that, and not measured. Even when you want to assume that IBM found its way to the holy grail of linear scalability, an R-squared of 1 and residuals at 0 are just ridiculous. I would really like to know which data points were really measured.
Monday, July 19. 2010
While trying to bring my unread articles counter down to zero, I read a press release of SGI about their UV1000 SPECjbb2005 benchmark. And indeed: some of the numbers are impressive.
However, there is a second story in these numbers. It's the story of the long arm of Mr. Amdahl - again. And it underlines my opinion that we need mandatory SPECjbb2005 results with 1 JVM at 64 bits. I have to thank SGI for giving me some data points to check my hypothesis.
Continue reading "Thank you, SGI"
Monday, July 12. 2010
I assume I wasn't able to make my point about the "multitude of JVMs" SPECjbb2005 results clear. Of course there are many applications out there where the scaling behaviour mandates many small JVM instances. And there is nothing to gain, but performance to lose, by using a 64-bit JVM for an application that uses just one gigabyte of memory.
However, that isn't the class of problem you try to solve with the large machines that are used in those SPECjbb2005 benchmarks.
What's the sense of purchasing a p750 with 16 cores just to divide it into 16 JVMs? What's the point of buying a p595, an Altix UV or an M9000 just to slice and dice it? When you look at many "multitude of JVMs" results, you even see commands that bind each JVM to a CPU, preventing the process from migrating around the system and making a group of 1-proc servers out of these large and mighty machines. Such a configuration is only justifiable by the much better RAS of such systems.
In such a case it can be better to purchase several M3000s instead of an M9000. Many people forget that the capability to scale to a large number of CPUs doesn't come cheap ... hardware scalability costs performance, for example by introducing slightly longer latency through a larger crossbar or multiple stages of crossbars. When your system doesn't need it, don't use it. But when your system needs large CPU counts or large memory heaps, SPECjbb2005 results may not be indicative of the performance you get, because the vendors factored out something important from the result.
Wouldn't it be more obvious just to purchase several smaller systems when my application doesn't use (and, because of the 32-bitness, can't use) the main advantage of such systems, namely providing a really large single memory area? Just to reiterate: When you want to buy a cluster, you should buy one. This multitude-of-JVMs configuration transforms SPECjbb into an easily clusterable application. It's misleading about the scaling behaviour of the system. But I'm repeating myself ... I would really like to see the dismissal of "multitude of JVM" results or at least a mandatory "1 JVM/64 bit" result as a SPECjbb2005_base result.
Monday, July 12. 2010
TPM is bragging about the SPECjbb2005 performance of the p7 system. Interestingly, he is talking about the nonsensical benchmark again: multiple JVMs per system at 32 bit. As I wrote before: When you want a cluster, buy a cluster and don't build one for an embarrassingly parallel problem in a box. I think that the only relevant test today is 1 JVM/64-bit. By the way: Those multitude-of-JVMs/32-bit configurations don't use the real advantage of the big machines (large heaps for your application); however, it's the only way those SGI systems or other extremely NUMA systems can compete with larger SMP systems, because of the vastly reduced interconnect usage (which is quite useful for avoiding the long arm of Amdahl's law).
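For readers who don't have Amdahl's law at hand: with a parallel fraction p of the work and n processors, the speedup is limited to 1 / ((1 - p) + p / n). A short sketch in R; the parallel fractions are made-up values for illustration, not measurements of any of the systems mentioned here:

```r
# Amdahl's law: speedup on n processors for a parallel fraction p
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

n <- c(1, 16, 64, 256)
# hypothetical parallel fractions; p = 1 is the embarrassingly parallel case
for (p in c(1.00, 0.99, 0.95)) {
  cat(sprintf("p = %.2f:", p), sprintf("%7.1f", amdahl(p, n)), "\n")
}
```

With p = 1, which is what a multitude of independent JVMs comes close to, the speedup equals the core count no matter how good or bad the interconnect is. As soon as a few percent of the work are serialized, the curve flattens quickly at large core counts.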
Friday, June 18. 2010
I'm really curious ... I hope there will be an I/O-heavy benchmark of the p750 that's halfway comparable with a p550 configuration. Because the more often I look at the p750, the more I get the impression that it's a totally unbalanced system.
Continue reading "Rocket with a duck engine"
Monday, June 14. 2010
Let's say you want to purchase a larger system ... let's say with 64 threads. How would you use it? I would assume you want to use a large memory set for example ... or you want to scale over several processors while having a shared memory.
Today I was reading through some benchmark reports and found a nice example of how totally misleading benchmarks can be.
What would you think about a benchmark for larger systems that factors out almost all challenges of larger systems, like interconnect latencies and memory locality? Doesn't sound useful ... at least that's what I think.
So I was somewhat surprised when I read through all the benchmarks for the p7 systems. There isn't a single benchmark testing this class of system with a single JVM. Every benchmark uses as many JVMs as cores. When you dig down into the benchmark description it gets even better. For example, just look into this report: "64 JVMs were run in processor sets each containing 1 core". I hope I'm not the only one who finds such a configuration rather funny. Essentially they took their big system and just made 64 1-core systems out of it. You can do something like that vastly cheaper ... by taking smaller boxes.
There are other strange things: I know that 32-bit Java is faster than 64-bit Java. However, what's the sense of using a 64-bit machine with half a terabyte of memory and just being able to use 160 GByte due to the 32-bit limit? At least the benchmark used this amount of memory, as described by the same document.
I would really like to see a "1 JVM at 64 bit" result for SPECjbb2005 for one of the larger IBM 7xx systems, although one for the smaller systems would be a start. Starting several JVMs makes an embarrassingly parallel problem out of SPECjbb2005. This will scale on every possible system, and it's an uninteresting number for assessing the speed of a system. Embarrassingly parallel means that the interconnect is almost no factor, the algorithms for ensuring coherency are almost no factor, the capability of the OS to scale is almost no factor. Everything that's important to get an impression of a large system in its natural environment is factored out. By the way ... that's the same reason why an HPC system like the SGI Altix is capable of yielding such good SPECjbb2005 results. No wonder, when it's easy to transform the benchmark by starting hundreds of JVMs to circumvent the weaknesses of the SGI Altix ... but I wrote about that earlier.
When you search a little bit through the SPECjbb2005 results for IBM systems, you will find some interesting things. The best IBM system in regard to single-JVM performance isn't a Power7 system, it's an IBM BladeCenter LS42 (Opteron-based) with 244376 Bops per JVM, followed by a p5 570 with 224200 Bops per JVM. The first p7 system is a p780 at 91446 Bops per JVM.
Sun did something like that: there is a benchmarking configuration result (that contains this one-core-dedicated-to-each-JVM non... err ... stuff, too) for the T5440 as well as a "1 JVM/64-bit" result. The second one yielded 688692 Bops per JVM (and in total) instead of the 841380 a T5440 is able to do in the benchmarking config. And the leader in "1 JVM/64 bit" performance is the M9000 with 1757035 Bops per JVM.
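To put the two T5440 numbers in relation, here is a quick calculation in R with the figures quoted above (the variable names are mine, nothing else is assumed):

```r
single_jvm <- 688692   # T5440, "1 JVM/64-bit" result, Bops
multi_jvm  <- 841380   # T5440, benchmarking config with many JVMs, Bops

# share of the multi-JVM score that survives with one large JVM
share <- single_jvm / multi_jvm
cat(sprintf("1 JVM/64-bit reaches %.0f%% of the multi-JVM result\n",
            100 * share))
```

That gap between the two numbers is exactly the kind of information about the scaling of a system that disappears when only the multi-JVM result is published.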
This situation may just be the result of missing results from other vendors in this category, but it's strange that many other vendors aren't really testing their high-end iron in a situation that uses the advantages of this category of systems. I would really like mandatory SPECjbb2005_baseline (1 JVM/64-bit) and SPECjbb2005_peak (all tricks the vendor knows to make the most out of the system) results. At the moment you can just scratch your head about some of the configurations. However, I don't believe that we will see such results, because the difference between a SPECjbb2005_baseline and a SPECjbb2005_peak would give some interesting insights into the scalability of a system. Results obtained by separating a large machine into several small systems, however, are totally useless. When you want a cluster of JVMs, buy a cluster ...
Monday, June 7. 2010
A colleague found out something interesting about IBM's rPerf. rPerf stands for "relative Performance" and is a construct of IBM to compare the performance of their systems. My colleague came to this point while studying the recent TPC-H result for the p6 595.
When you look at the rPerf value of a p5 595 with 1.9 GHz in the document current as of May 2009, it yields 306.21 times the performance of the rPerf baseline (a pSeries 640 system). When you look at the rPerf value of the p6 p595, you will find that this system delivers 553.01 times the performance of the baseline. That's a difference of 81 percent.
Why did I pick those values in particular? These systems were used in the TPC-H benchmarks by IBM. In those benchmarks the p5 p595 yielded 100,512 QphH@3000GB, the p6 p595 yielded 156,537 QphH@3000GB. When you calculate the difference you get just 56 percent. 25 percentage points are missing.
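The arithmetic behind these percentages, as a small R sketch using only the four numbers quoted above (the helper name `gain` is mine):

```r
rperf <- c(p5 = 306.21, p6 = 553.01)   # rPerf values quoted above
qphh  <- c(p5 = 100512, p6 = 156537)   # QphH@3000GB results quoted above

# hypothetical helper: gain of the p6 over the p5 in percent
gain <- function(x) 100 * (x[["p6"]] / x[["p5"]] - 1)

rperf_gain <- gain(rperf)   # what rPerf suggests
tpch_gain  <- gain(qphh)    # what the benchmark delivered
cat(sprintf("rPerf: +%.0f%%  TPC-H: +%.0f%%  gap: %.0f points\n",
            rperf_gain, tpch_gain, rperf_gain - tpch_gain))
```

Put differently: the measured p6 throughput is roughly 14% below what the rPerf ratio would predict from the p5 result.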
I know that the p5 p595 used vastly more disks, but I want to point you to certain facts:
- The new system is using AIX 6.1 instead of AIX 5.3. 6.1 is regarded as an important step in scaling on large systems.
- The p6 configuration is using twice the memory of the p5 configuration. TPC-H is known to be a very memory-intensive benchmark. (And as usual ... only a cached I/O is a good I/O.)
- The p6 provides PCIe instead of PCI-X, thus providing much faster I/O.
- The p5 config used a general-purpose database, while the p6 config used a specialized DWH database.
- The p6 config used a 4 Gbit/s SAN instead of 2 Gbit/s FC.
- The p6 config used 667 MHz DRAM instead of 533 MHz.
Let's just assume for a moment that those changes would just outweigh the number of disks.
You could draw a conclusion out of this: rPerf is vastly overscaling the performance for I/O-intensive tasks. The gain measured in the benchmark is 25 percentage points lower than the gain estimated by rPerf.
I think IBM is pretty aware of the fact that rPerf isn't really usable for system sizings; at least when you look at the rPerf webpage you can come to the conclusion that they want to say "Nice numbers. But don't use them ...". Furthermore, it doesn't test important parts of commercial computing: "The rPerf model is not intended to represent any specific public benchmark results and should not be reasonably used in that way. The model simulates some of the system operations such as CPU, cache and memory. However, the model does not simulate disk or network I/O operations."
However, I'm aware of several situations where a competing offer was based on an rPerf calculation, regardless of whether that was really appropriate to the situation. So when you get an offer from IBM that uses rPerf in the sizing, you should be cautious (in the sense of very cautious) about whether it really matches your needs.
Monday, September 4. 2006
This blog contains only two entries so far, but both offer a very interesting view of different parts of the x86 world: Scientias Blog.
Continue reading "Lies, fucking lies and Benchmarks"
Tuesday, July 25. 2006
Interesting article about new TPC benchmark suites. It's good to see that the media support Sun's opinion that TPC-C is absolutely useless these days. It should be substituted by something more realistic sooner rather than later:
"One problem with TPC-C was the ease with which certain servers could generate unrealistically high scores using hardware and software configurations that are highly improbable in the real world. For example, high TPC-C scores come from servers with colossal numbers of hard drives--6,548 in the case of IBM's top score."
and
"Another problem hinged on the fact that TPC-C's test was easily distributed among relatively independent servers linked in a cluster. That gave the impression that a number of inexpensive machines were as good as a single multiprocessor behemoth, leading the consortium to list results separately for clustered and non-clustered systems."
Tuesday, July 18. 2006
I looked at the SPEC website for a project today and saw that the T2000 still leads the SPECweb2005 benchmark by roughly 4000 points. And this benchmark was run on the T2000 in November last year. Even the new Woodcrests stopped short of 10000 points, and those were already full-blown 4-core systems (the maximum for Woodcrest right now). Now keep in mind that Niagara II already runs in our labs. Interesting times. Well ... Rolf, we should do the power meter gag again at the next CeBIT.
Tuesday, June 13. 2006
Okay, a very impressive number: 1,000,000 operations per second. But look at the bill of materials: 24 filer heads, and not the small ones (the NetApp 6040 is one of the big badass machines), 768 gigabytes of cache, 48 GB of NVRAM, 1152 hard disks. Does anybody really use an environment like this? It seems to me like using an SF25K for file serving.
This is the nature of benchmarketing. It's not a matter of realistic workloads, it's an expression of the fine art of understanding and outsmarting a benchmark to make a marketing statement.