Analysis

7Dec99: Marcin Slawicz and Jos Grupping have updated their graphical and statistical analysis of the FSBench benchmark data. By way of explanation: the first graph shows the trend for different types of CPU. Although there is a considerable spread in the individual results, the trend is clear: twice the speed gives twice the performance. Marcin's numbers (following both charts) reflect how far one line sits above another. The spread of data for each type of processor probably reflects how the different PCs are maintained: the cleaner the system, the better the performance.

As Marcin also concluded, the AMD K6 looks to be the only "slow" CPU. The Intel CPUs all perform about the same. The AMD K7 (Athlon) performs a bit better, probably reflecting its better FPU.

The second graph shows the benchmark results ordered by video card type (or 3D card type). A word of caution: from a strict statistical standpoint one can question whether the groups are homogeneous enough for this type of analysis. Nevertheless, note how much the graphs overlap. The general conclusion is that the type of video card isn't nearly as important as the CPU speed. Most modern cards will give about the same performance. It appears that the G400 is the outlier, while the new GeForce does a little better, but not by very much (10%?).

7 Dec 99: Here is a series of posts recently seen on the simflight.com FS2000 forum concerning the statistical analysis of data from the report system. These are the best of the lot, most recent posts at the top:

Subject: Conclusions to FPS statistical analysis
From: "Jim Ho" <jho@dres.dnd.ca>
Date:  3 Dec 1999 21:21:48 GMT

It was pointed out that, to make the
statistical tests fairer, the FPS values
should be normalized for CPU speed. To do
this, the formula [100*FPS/MHz] (frames per
second per 100 MHz) was suggested. This was a
good approach as it avoids the danger of
giving the impression of comparing "apples to
oranges", a statistical sin. With apologies
for having polluted this space with previous
postings, here are the results.

Again the data sets were taken from
Avsim.com/fsbench with converted data
(GeForce & TNT2) kindly supplied by a
respondent.

Normality Test: Passed (P = 0.024)

Equal Variance Test: Passed (P = 0.395)

Group      N   Mean   Std Dev  SEM*
GeForce    5   4.048  0.630    0.282
TNT2       11  3.697  0.474    0.143
TNT2Ultra  24  3.419  0.470    0.0959
Matrox     10  3.403  0.349    0.110

*N = number of entries; mean = average; std
dev = standard deviation; SEM = standard
error of the mean

% speed difference compared to GeForce: TNT2
= 8.7; TNT2Ultra = 15.5; Matrox = 15.9

The differences in the mean values among the
treatment groups are greater than would be
expected by chance; there is a statistically
significant difference (P = 0.031). This
suggests that at least one of the groups
should show some difference.

We proceeded to assign GeForce as the
"control" group and use Dunnett's test, which
is designed for comparing a control against
multiple others. For this, the normalized
data for TNT2, Matrox and TNT2Ultra were
included as items of contrast.

Multiple Comparisons versus Control Group
(Dunnett's Method); as mentioned before, the
column to read is under "P<0.050". "Yes"
represents a significant difference at 95%
probability:

Comparison Diff of Means P<0.050
GeForce vs. Matrox 0.645 Yes
GeForce vs. TNT2Ultra 0.629 Yes
GeForce vs. TNT2 0.351 No

Conclusion: surprisingly, the test revealed a
significant difference for TNT2Ultra and
Matrox while there was none for TNT2, as some
of you have hinted. Not included in this test
are the Rage, TNT and V2-3 data sets. The
normalized data for these failed the
normality test and could not be tested with
this methodology. So, after various attempts
at seeking a clearer picture of video
performance, it would appear that the GeForce
still demonstrates a slight edge (almost
16%). But if there is any lesson to be
learned, it is that we have demonstrated an
objective approach to answering the popular
question of which card is faster. Thanks to
all the respondents who have shown interest
in this work and given various suggestions
for improving the approach. To the casual
observer, all this work may appear to be hair
splitting. But it is important to remember
that we are attempting to arrive at a
methodology that is fair and that can be used
as an objective way to treat future data.

Thanks to B. Wilson for maintaining the
database. As per request, here is the
reference: "All data used herein provided by
flight sim enthusiasts throughout the world,
as reported at FSBench, the flight sim
benchmarking site, http://avsim.com/fsbench."
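
The normalization and the percentage figures in the post above are simple arithmetic and can be re-checked by hand. Here is a minimal Python sketch, assuming the normalization is frames per second per 100 MHz (100*FPS/MHz), which is what reproduces group means in the 3.4 to 4.0 range. The function names are illustrative and not part of Jim's analysis.

```python
def fps_per_100mhz(fps, mhz):
    """Normalize a frame rate by CPU clock: frames per second per 100 MHz."""
    return 100.0 * fps / mhz


def percent_slower(control_mean, other_mean):
    """Percent by which other_mean falls below control_mean."""
    return 100.0 * (control_mean - other_mean) / control_mean


# Group means of the normalized data, copied from the table in the post.
means = {"GeForce": 4.048, "TNT2": 3.697, "TNT2Ultra": 3.419, "Matrox": 3.403}

for card in ("TNT2", "TNT2Ultra", "Matrox"):
    # Rounded to one decimal these match the 8.7, 15.5 and 15.9 quoted above.
    print(card, round(percent_slower(means["GeForce"], means[card]), 1))
```

Note that the percentages are taken relative to the GeForce mean, so "almost 16%" is the gap between the GeForce and the slowest of the three groups, not an average advantage.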

Subject: Round 3 & Final statistical analysis of Avsim Fsbench data
From: "Jim Ho" <jho@dres.dnd.ca>
Date:  2 Dec 1999 21:44:43 GMT

Statistical Analysis of Fsbench FS2000
1024x768 Nov 27 Data

It has been shown that higher CPU Mhz can
give better frame rates. By the same token,
FPS data from slower machines may bias video
board comparisons. To avoid CPU speed
distortion, data selection included only FPS
entries from those registering CPU 400 Mhz
and higher. They were sorted according to
video types. Note that each group had unequal
items (N) making choice of test methodology
critical.

One Way Analysis of Variance done on
Thursday, December 02, 1999, 09:11:01

Normality Test: Passed (P = 0.127). This
means that conventional parametric methods
can be used. Equal Variance Test: Passed (P =
0.427). Even though some groups have small N,
they nevertheless did not present
unacceptable variance or "noise" problems.
Again, conventional tests can be used.

Group Name  N   Missing  Mean    Std Dev  SEM
GeForce     7   0        20.900  4.408    1.666
TnT2Ultra   24  0        16.813  3.243    0.662
TnT2        12  0        15.722  1.841    0.531
TnT         16  0        16.862  3.440    0.860
Matrox      10  0        17.826  1.909    0.604
Rage        6   0        13.695  5.193    2.120
V2-3        15  0        15.855  2.607    0.673

Source of Variation  DF  SS        MS      F      P
Between Groups       6   213.133   35.522  3.526  0.004
Residual             83  836.085   10.073
Total                89  1049.217

The meaning of the unfamiliar terms can be
found here:
www.statsoft.com/textbook/stathome.html

The differences in the mean values among the
treatment groups are greater than would be
expected by chance; there is a statistically
significant difference (P = 0.004). It thus
gives the impression that some of the groups
do not come from the same population or that
they are not random (to use a Gatesian term).
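
The sums of squares and the F value in the ANOVA table can be recovered from the per-group N, mean and standard deviation alone. The sketch below is one way to do that in plain Python. It is not how SigmaStat works internally, and the function name is made up for illustration. It does not compute the P value, which needs the F distribution from a statistics library.

```python
def anova_from_summary(groups):
    """One-way ANOVA from (n, mean, std_dev) triples.

    Returns (df_between, df_within, ss_between, ss_within, f_stat).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups: spread of the group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups (residual): rebuilt from each group's sample std dev.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, ss_between, ss_within, f_stat


# (N, mean FPS, std dev) per video card group, copied from the table above.
summary = [
    (7, 20.900, 4.408),   # GeForce
    (24, 16.813, 3.243),  # TnT2Ultra
    (12, 15.722, 1.841),  # TnT2
    (16, 16.862, 3.440),  # TnT
    (10, 17.826, 1.909),  # Matrox
    (6, 13.695, 5.193),   # Rage
    (15, 15.855, 2.607),  # V2-3
]

df_b, df_w, ss_b, ss_w, f_stat = anova_from_summary(summary)
print(df_b, df_w, round(ss_b, 1), round(ss_w, 1), round(f_stat, 2))
```

Because the inputs are rounded to three decimals, the sums of squares agree with the table only to within rounding, but the degrees of freedom (6 and 83) and an F of about 3.5 come out the same.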

Power of performed test with alpha = 0.050:
0.818; this test can flag situations where
there are insufficient data items. The
results are acceptable.

We want to find out if any one of the video
boards stands out as a different or better
performer. The multiple comparison test is
preferred over the conventional t-test as
this method takes all group variances into
account. A conservative test, one that is
less likely to flag a false positive is the
Student-Newman-Keuls Method and the test
result is shown below. The column to read is
under "P<0.050" where significant difference
at better than 95% probability is indicated.

All Pairwise Multiple Comparison using
Student-Newman-Keuls Method :

Comparison              Diff of Means  p  q       P      P<0.050
GeForce vs. Rage        7.205          7  5.771   0.002  Yes
GeForce vs. TnT2        5.178          6  4.852   0.012  Yes
GeForce vs. V2-3        5.045          5  4.911   0.007  Yes
GeForce vs. TnT2Ultra   4.087          4  4.240   0.019  Yes
GeForce vs. TnT         4.038          3  3.970   0.017  Yes
GeForce vs. Matrox      3.074          2  2.779   0.053  No
Matrox vs. Rage         4.131          6  3.565   0.130  No
Matrox vs. TnT2         2.104          5  2.190   0.534  No
Matrox vs. V2-3         1.971          4  2.152   0.430  No
Matrox vs. TnT2Ultra    1.013          3  1.200   0.674  No
Matrox vs. TnT          0.964          2  1.065   0.454  No
TnT vs. Rage            3.167          5  2.948   0.237  No
TnT vs. TnT2            1.141          4  1.331   0.783  No
TnT vs. V2-3            1.008          3  1.250   0.652  No
TnT vs. TnT2Ultra       0.0500         2  0.0690  0.961  No
TnT2Ultra vs. Rage      3.118          4  3.043   0.146  No
TnT2Ultra vs. TnT2      1.091          3  1.375   0.597  No
TnT2Ultra vs. V2-3      0.958          2  1.297   0.362  No
V2-3 vs. Rage           2.160          3  1.992   0.341  No
V2-3 vs. TnT2           0.133          2  0.153   0.914  No
TnT2 vs. Rage           2.027          2  1.806   0.205  No

By SNK analysis, it would appear that the
GeForce performs better than most of the
other boards. However, we can still use an
even more conservative test, the Tukey (no
giggles please). Only a few comparative pairs
are shown to illustrate the results.

All Pairwise Multiple Comparison using Tukey
Test:

Comparison Diff of Means p q P P<0.050
GeForce vs. Rage 7.205 7 5.771 0.002 Yes
GeForce vs. TnT2 5.178 7 4.852 0.016 Yes
GeForce vs. V2-3 5.045 7 4.911 0.014 Yes
GeForce vs. TnT2Ultra 4.087 7 4.240 0.053 No
GeForce vs. TnT 4.038 7 3.970 0.086 No
GeForce vs. Matrox 3.074 7 2.779 0.444 No
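
The q column is identical in the SNK and Tukey tables, and it can be reproduced from the group means, the group sizes and the residual mean square of the ANOVA. The sketch below assumes the unequal-N (Tukey-Kramer style) standard error. The posts do not say that SigmaStat uses exactly this form; it is inferred here only because the numbers match. The function name is illustrative.

```python
import math


def studentized_range_q(mean_a, n_a, mean_b, n_b, ms_residual):
    """q statistic for one pairwise comparison with unequal group sizes."""
    standard_error = math.sqrt(ms_residual / 2.0 * (1.0 / n_a + 1.0 / n_b))
    return abs(mean_a - mean_b) / standard_error


MS_RESIDUAL = 10.073  # residual mean square from the ANOVA table

# GeForce (N=7, mean 20.900) against Rage (N=6) and Matrox (N=10).
q_rage = studentized_range_q(20.900, 7, 13.695, 6, MS_RESIDUAL)
q_matrox = studentized_range_q(20.900, 7, 17.826, 10, MS_RESIDUAL)
print(round(q_rage, 3), round(q_matrox, 3))
```

These come out close to the 5.771 and 2.779 listed for those pairs. The two methods differ only in the p column, that is, how many ordered means a comparison is taken to span: SNK shrinks it for means that rank closer together, while Tukey uses the full 7 for every pair, which is why the same q yields a larger P under Tukey.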

With this tougher test, we now see that the
GeForce is only better than 3 cards: Rage,
TnT2 and V2-3. The Ultra has jumped into "no
difference" and why the TnT held up so well
is a mystery. By a whisker, the G400 boards
also skipped the threshold and Matrox users
can feel smug.

Again, statistical tests are only as good as
the soundness of the data collected. None of
the above would have any meaning if it can be
shown that there is a flaw in the data sets
or the way they are manipulated. It is still
possible that with more data for each group,
comparison results may be different, so this
exercise should be considered work in
progress. Analysis was done with SigmaStat
2.03.
(http://www.spss.com/software/science/sigmastat/)

Thanks for your previous comment, Jim.

Subject: Re: Round 3 & Final statistical analysis of Avsim Fsbench data
From: Walt Bertram <wbertram@worldnet.att.net>
Date:  Fri, 03 Dec 1999 00:36:49 -0500

Jim,

I question whether the broad brush analysis you applied is
significant. It seems you are still comparing apples and oranges on
one hand, and peaches and apricots on the other.

I took a quick look at the copy of FSBench(*) data that I received a
few days ago. First, a quick look shows that there is a lot of bad
data in this dataset. In it there are 5 systems reported which used
the GeForce chipset (there was a 6th report, but the clock frequency
was not reported, so I discarded it). There were 13 systems reported
that used the TNT2 chipset and that had a clock frequency of 400 MHz
or greater. Two of those were identical copies of another entry, so I
removed those two, leaving 11. The average clock frequency of the
GeForce set was 567 MHz, and that of the TNT2 set was 456 MHz. A
significant difference, which could account for the difference in
average FPS of the GeForce vs. the TNT2.

These data are summarized below. The FPS used is that in the 4th
column of FPS data.

FPS Clock 100*FPS/Clock Chipset
18.7 450 4.16 nVidia GeForce 256
23.2 500 4.64 nVidia GeForce 256
16.5 550 3.00 nVidia GeForce 256
24.3 600 4.05 nVidia GeForce 256
32.3 733 4.41 nVidia GeForce 256
------------------------------------------
23.0 567 4.05 Average of GeForce
0.63 Std Dev, = 16%

FPS Clock 100*FPS/Clock Chipset
16.6 400 4.15 nVidia TNT2
14 400 3.50 nVidia TNT2
16.1 400 4.03 nVidia TNT2
16.6 400 4.15 nVidia TNT2
14 433 3.23 nVidia TNT2
19.1 450 4.24 nVidia TNT2
14.2 450 3.16 nVidia TNT2
16.3 466 3.50 nVidia TNT2
17.8 466 3.82 nVidia TNT2
22.2 550 4.04 nVidia TNT2
17.1 600 2.85 nVidia TNT2
-----------------------------------------
16.7 456 3.70 Average of TNT2
0.47 Std Dev, = 13%

The difference in 100*FPS/Clock is 0.35, or
9.5%.
This is a difference of about 0.6 std dev.
Is such a difference statistically
significant?

Walt

*All data used herein provided by flight sim
enthusiasts throughout the world, as reported
at FSBench, the flight sim benchmarking site,
http://avsim.com/fsbench.
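
Walt's closing question can be explored with a two-sample test on the normalized values. The sketch below recomputes the 100*FPS/Clock column from the two tables in his post and forms Welch's t statistic, which does not assume equal variances. Walt did not propose Welch's test; it is swapped in here as a simple check, and the helper names are illustrative.

```python
import math
from statistics import mean, stdev

# (FPS, clock in MHz) pairs copied from the two tables in Walt's post.
geforce = [(18.7, 450), (23.2, 500), (16.5, 550), (24.3, 600), (32.3, 733)]
tnt2 = [(16.6, 400), (14, 400), (16.1, 400), (16.6, 400), (14, 433),
        (19.1, 450), (14.2, 450), (16.3, 466), (17.8, 466), (22.2, 550),
        (17.1, 600)]


def normalized(rows):
    """Frames per second per 100 MHz for each (fps, clock) pair."""
    return [100.0 * fps / clock for fps, clock in rows]


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
    return t, df


g, t2 = normalized(geforce), normalized(tnt2)
t_stat, df = welch_t(g, t2)
print(f"GeForce {mean(g):.2f} (sd {stdev(g):.2f}), "
      f"TNT2 {mean(t2):.2f} (sd {stdev(t2):.2f})")
print(f"Welch t = {t_stat:.2f} on about {df:.0f} degrees of freedom")
```

The means and standard deviations agree with Walt's 4.05, 0.63, 3.70 and 0.47. The t statistic lands a little above 1 on roughly 6 degrees of freedom, well under the two-sided 5% critical value of about 2.4, which suggests the difference is not significant at that level. That is consistent with the "No" that Dunnett's test gave for GeForce vs. TNT2 in the later post.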

Subject: Round 2 Statistical Analysis of FPS Data from Avsim.com
From: "Jim Ho" <jho@dres.dnd.ca>
Date:  1 Dec 1999 17:44:32 GMT

On Monday, analysis was done with FPS data
from simflight.com and as someone has
noticed, a few ambiguities in the results
were probably a function of the noisy data.

Similar tests were done with what appear to
be better behaved data from Avsim.com and
shown below are the results.

1. Does MHz affect FPS performance?

Answer: Yes. Spearman Rank Order Correlation
Coefficient = 0.831; there is a high
relationship between faster CPU and better
FPS.

2. Which video board performs better?

Answer: The GeForce 256 is significantly
better than all others. Other than that it's
difficult to say if the others perform
differently from one another.

All Pairwise Multiple Comparison Procedures
(Student-Newman-Keuls Method) :

Comparison Diff of FPS Probability 95% Sig.
Diff. GeF 256 vs. Banchee 6.921 0.013 Yes GeF
256 vs. Rage128 6.886 0.014 Yes GeF 256 vs.
V3 6.331 0.003 Yes GeF 256 vs. TnT2 6.227
0.004 Yes GeF 256 vs. TnT 6.163 <0.001 Yes
GeF 256 vs. TnT2 Ult 4.352 0.009 Yes TnT2
Ultra vs. Banch 2.569 0.57 No TnT2 Ultra vs.
Rage1 2.534 0.538 No TnT2 Ultra vs. V3 1.979
0.32 No TnT2 Ultra vs. TnT2 1.875 0.313 No
TnT2 Ultra vs. TnT 1.811 0.099 No TnT vs.
Banchee 0.758 0.989 No TnT vs. Rage128 0.723
0.971 No TnT vs. V3 0.168 0.988 No TnT vs.
TnT2 0.0643 0.96 No TnT2 vs. Banchee 0.694
0.977 No TnT2 vs. Rage128 0.659 0.927 No TnT2
vs. V3 0.104 0.938 No V3 vs. Banchee 0.59
0.928 No V3 vs. Rage128 0.555 0.743 No
Rage128 vs. Banchee 0.0355 0.986 No


3. Of the 3 CPU types (P2-3, Celeron and AMD)
which one performs better?

Answer: A. P2-3 vs Celeron no difference

Compare FPS means P2-3 = 15.457 Cel = 16.192

B. Celeron vs AMD, Celeron is better.

Failed variance test so Mann-Whitney Rank Sum
Test was used.

Compare FPS median Cel = 15.935 AMD = 10.450

The difference in the median values between
the two groups is greater than would be
expected by chance; there is a statistically
significant difference (P = 0.025)

C. P2-3 vs AMD, P2-3 is better.

Equal Variance Test Failed so used
Mann-Whitney Rank Sum Test

Compare FPS median P2-3 = 16.18 AMD = 10.45

In conclusion, it is important that we work
with good clean data and approach the task
with some objectivity. Having said that, the
current analysis supports the general
expectation that faster MHz will yield better
performance. The other commonly held "wisdom"
that the type of graphics card does not
matter is NOT supported. The GeForce shows a
significant advantage when properly compared
with the others. The surprise outcome was
that the P2-3 CPUs do not appear better than
the Celerons. Overclockers can rejoice. It
should be noted that the AMD result is
distorted by having too few K3 entries and,
overall, the K2s dragged down the group.

Thanks to Vince for the encouragement.

Jim.
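
For readers unfamiliar with the Spearman coefficient quoted under question 1: it is the ordinary correlation computed on ranks instead of raw values, so it measures how consistently FPS rises with MHz without assuming a straight line. The sketch below is a plain-Python version with tied values sharing their average rank. It is run on the 16 systems listed in Walt Bertram's post further up this page, not on the data set Jim used, so it is not expected to reproduce his 0.831.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Clock (MHz) and FPS for the GeForce and TNT2 systems in Walt's tables.
mhz = [450, 500, 550, 600, 733, 400, 400, 400, 400, 433,
       450, 450, 466, 466, 550, 600]
fps = [18.7, 23.2, 16.5, 24.3, 32.3, 16.6, 14, 16.1, 16.6, 14,
       19.1, 14.2, 16.3, 17.8, 22.2, 17.1]

rho = spearman(mhz, fps)
print(round(rho, 3))
```

A value near +1 means that faster clocks almost always go with higher frame rates; a value near 0 would mean no consistent ordering.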

Subject: Re: Round 2 Statistical Analysis of FPS Data from Avsim.com
From: "Jim Ho" <jho@dres.dnd.ca>
Date:  1 Dec 1999 17:50:48 GMT

Sorry about the messy table; here is a
cleaner version.

All Pairwise Multiple Comparison Procedures
(Student-Newman-Keuls Method) :

Comparison              Diff of FPS  Probability  95% Sig. Diff.
GeF 256 vs. Banshee     6.921        0.013        Yes
GeF 256 vs. Rage128     6.886        0.014        Yes
GeF 256 vs. V3          6.331        0.003        Yes
GeF 256 vs. TnT2        6.227        0.004        Yes
GeF 256 vs. TnT         6.163        <0.001       Yes
GeF 256 vs. TnT2 Ultra  4.352        0.009        Yes
TnT2 Ultra vs. Banshee  2.569        0.57         No
TnT2 Ultra vs. Rage128  2.534        0.538        No
TnT2 Ultra vs. V3       1.979        0.32         No
TnT2 Ultra vs. TnT2     1.875        0.313        No
TnT2 Ultra vs. TnT      1.811        0.099        No
TnT vs. Banshee         0.758        0.989        No
TnT vs. Rage128         0.723        0.971        No
TnT vs. V3              0.168        0.988        No
TnT vs. TnT2            0.0643       0.96         No
TnT2 vs. Banshee        0.694        0.977        No
TnT2 vs. Rage128        0.659        0.927        No
TnT2 vs. V3             0.104        0.938        No
V3 vs. Banshee          0.59         0.928        No
V3 vs. Rage128          0.555        0.743        No
Rage128 vs. Banshee     0.0355       0.986        No

This server does not handle tables well.

Jim.

23Nov99 For those who might be interested, here is the breakdown of all Reports submitted, graphed by date and by the sim reported. Looks like FS98 is the winner so far, with FS2000 coming up fast.

18Nov99 Some interesting new analyses of the FS2000 benchmarks have been posted on forums and sent to me lately. Here they are. Marcin's analysis used the 1024x768 data:

What does your frame rate really depend on?

Hello flightsimmers,

I agree the frame rate is not the most important thing during your flights; however, nobody wants to watch an FS2000 slide show. Recently I took some FS2000 benchmarks from http://www.avsim.com/fsbench and summarized the data. To be able to compare different systems (with different CPU clocks) I used the factor FPS / CPU clock * 100 (frames per second for every 100 MHz). I excluded the AMD K6-2 results (see below for why).

FPS for every 100 MHz with different graphic boards (average results):

Banshee     3.73
Voodoo3     3.68
GeForce256  3.58
TnT2        3.57
TnT         3.55
Rage128     3.53
G400        3.27

All current boards do well (or badly, if you prefer). As you can see, the frame rate hardly depends on the type of graphics board. Moreover, the best boards (those that post the best benchmarks in Quake, Unreal etc.) don't necessarily have to be the winners with FS2000. This situation can change once your CPU can pump out 50 or 100 frames every second (then the graphics board will be the bottleneck). You will need a 1000 MHz Pentium or a new CPU architecture for this, I'm afraid. In my opinion image quality, not speed, should decide the choice of graphics board for FS2000.

FPS for every 100 MHz with different CPU clocks:

266 MHz 3.73
400 MHz 3.42
450 MHz 3.41
500 MHz 3.57
600 MHz 3.26

Every 100 MHz of CPU clock buys you about 3.5 more frames per second. Remember: these numbers apply to the particular FS2000 setup described at http://www.avsim.com/fsbench.

What FPS should you expect with the 450 MHz CPU?

TnT or TnT2  15.1
Voodoo3      15.7
Banshee      16.1

It would be about 35 FPS with a future 1000 MHz CPU if nothing else changes.
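
Marcin's rule of thumb is a straight-line scaling, which can be written as a one-line function. This is only a sketch: the 3.5 figure is his rounded average, the function name is made up for illustration, and the estimate holds only for the benchmark setup and only while the CPU remains the bottleneck.

```python
FPS_PER_100_MHZ = 3.5  # Marcin's rounded average across boards and clocks


def estimated_fps(cpu_mhz, fps_per_100mhz=FPS_PER_100_MHZ):
    """Rule-of-thumb frame rate, assuming FPS scales linearly with clock."""
    return fps_per_100mhz * cpu_mhz / 100.0


print(estimated_fps(450))   # 15.75
print(estimated_fps(1000))  # 35.0
```

The 450 MHz estimate sits inside the 15.1 to 16.1 range listed above for actual boards, and the 1000 MHz figure is the "about 35 FPS" extrapolation.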

But what about the CPU type? FPS for every 100 MHz for different CPU types:

K7       3.70
Celeron  3.53
P II     3.46
P III    3.43
K6-2     2.51

Only one loser: the K6-2 (probably due to its weak FP unit). That's why it is excluded above.

This summary doesn't consider some other important factors such as the motherboard chipset, memory type, graphics port type, sound board and others. However, it seems that at the moment the best way to make your flights smoother is to use a fast CPU (both in clock speed and FP unit).

Marcin Slawicz
mslawicz@polbox.com

In all I think Marcin's analysis and Jos's graphs are convincing proof that the ONLY thing that matters to FS2000 performance is the raw clock rate of your CPU. The AMD K6-2 is a poor performer, and the Athlon might be a slightly better performer than the rest.

A personal big thanks to both Marcin and Jos for analyzing and plotting the data!