Things to watch out for when benchmarking, and SW for it

So, what influences how fast some SW executes on a computer, or the running time of a specific piece of code?

1. CPU type

2. CPU clock

3. What percentage of the time the CPU spends waiting, usually on RAM access or HW register access

4. How much of the task is handed to special chips that can do it faster than the CPU. On the ST(E) that would be DMA and the blitter.


The first two are the easy cases. The third is actually complicated, especially once cache is involved. The fourth is not hard to measure, so I'll start with it.


I said in the 'Benchmarking' thread started by vol, who works on arithmetic benchmarking SW for the Atari ST range, that having the blitter on or off will influence the test result, simply because the program prints a lot of numbers on screen.

So, here is how it looks on a Mega STE at 8 MHz:

Tests are with 1000 digits.

Low res: blitter on 16.06, blitter off 16.15. A little scrolling near the end.

Med res: blitter on 15.87, blitter off 15.90. Note that there was no scrolling at all here, which is why the difference is much smaller (plus fewer bit planes, of course).

There is no point in repeating this at 16 MHz, because the blitter still works at 8 MHz then. (Why? Because it works with the slow RAM, not with the cache.)

Some may say the difference is very small, maybe within the range of test inaccuracy. Nope. Repeat the tests on your STE, Mega ST, ST with blitter ...

The differences are small simply because the time spent on screen output is very small compared to the calculation time.

And the effect shrinks further with a larger digit count, because prints to the screen then happen less often (in time).

And why is med res faster? Because of less RAM access: 2 bit planes instead of 4.
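To put the numbers above in proportion, here is a small Python sketch that turns the four timings from this post into relative differences. The function name and the dictionary layout are just mine for illustration; the post does not state the unit of the timings, but the percentages do not depend on it.

```python
# Timings copied from the post (Mega STE at 8 MHz, 1000 digits).
# The unit is not stated in the post; percentages are unit-independent.
timings = {
    "low res": {"blitter_on": 16.06, "blitter_off": 16.15},
    "med res": {"blitter_on": 15.87, "blitter_off": 15.90},
}


def blitter_gain_percent(on, off):
    """Time saved by the blitter, as a percentage of the blitter-off time."""
    return (off - on) / off * 100.0


for mode, t in timings.items():
    gain = blitter_gain_percent(t["blitter_on"], t["blitter_off"])
    print(f"{mode}: blitter saves {gain:.2f}% of the run time")

# Same blitter setting, different resolution: the screen mode alone
# changes the result as well.
res_gap = (16.06 - 15.87) / 16.06 * 100.0
print(f"low res vs med res (blitter on): {res_gap:.2f}% difference")
```

The blitter gain comes out at roughly 0.56% in low res and 0.19% in med res, while switching resolution alone moves the result by a bit over 1%. Small numbers, but they are systematic and not random noise.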

So I still say that printing to the screen during the test is not a good idea, especially because, as you can see, the results differ depending on the screen resolution used. And yes, the difference will be much bigger on the TT and Falcon, just try it in their higher screen modes: partly because screen operations involve more RAM, partly because the CPU is slowed down more by limited RAM bandwidth.


The case of the Stacy with a 40 MHz 68030: as I see it, DarkLord thinks the program gives false results, since his CPU runs faster than the TT's CPU (the same one, just at 32 MHz).

Well, I'm pretty sure the results are OK, and the TT actually is still much faster. Why? Faster RAM and a 32-bit bus instead of a 16-bit one. Because that PAK68/3 uses only ST RAM and not fast RAM (if there is any with it at all?), it spends plenty of time waiting. It is something like a very simple ST acceleration that runs the 68000 at 16 MHz whenever possible, which is not possible during RAM access, so it must switch to 8 MHz then. The effective speedup is at most some 30%. The 68030 has some cache, so it can be more, depending on the running SW.
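Why doubling the clock gives only some 30% can be sketched with a simple Amdahl-style model. This is my simplification, not an exact description of 68000 bus timing: assume a fraction of the original run time is spent in RAM accesses that stay at the slow clock, and only the rest gets faster. The function and parameter names are made up for the example.

```python
def effective_speedup(bus_fraction, clock_ratio=2.0):
    """Overall speedup when only the non-bus part of the run time speeds up.

    bus_fraction: share of the ORIGINAL run time spent in RAM accesses that
                  stay at the slow clock (0.0 to 1.0). Assumed, not measured.
    clock_ratio:  fast clock divided by slow clock (16 MHz / 8 MHz = 2).
    """
    if not 0.0 <= bus_fraction <= 1.0:
        raise ValueError("bus_fraction must be between 0 and 1")
    # The bus-bound part takes as long as before; the rest shrinks by clock_ratio.
    new_time = bus_fraction + (1.0 - bus_fraction) / clock_ratio
    return 1.0 / new_time


for fraction in (0.0, 0.25, 0.54, 0.75, 1.0):
    print(f"bus fraction {fraction:.2f}: speedup {effective_speedup(fraction):.2f}x")
```

In this model a bus fraction of about 0.54 gives a speedup of roughly 1.3x, so a 30% gain from a doubled clock corresponds to a bit more than half of the time going to RAM access. A cache lowers that fraction, which is why the gain can be higher with cache.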

And now we are at cache: why is it practically (and unusually) about 2x faster on the Mega STE with a 16 MHz CPU clock? Because vol's program is short and all of it fits in the 16 KB cache of the Mega STE. To add here, the 68030's internal cache is only 512 bytes (256 bytes for instructions plus 256 bytes for data). That's why a complex game like Microprose F1 is no faster on the Falcon than on a Mega STE at 16 MHz. Although the Falcon's CPU executes many operations in fewer cycles (shifts especially), that is offset by the larger cache of the MSTE.


I can't blame vol for some things that come from not being very familiar with the Atari ST family and TOS. But there are people, active with Atari STs for over a decade, who did not get how to benchmark properly. Like exxos, who claimed that the Mega STE is only some 10-20% faster at 16 MHz. He should apologize for that shallowness (I apologize if that already happened). What was the mistake? He ran the test with GEMBENCH and with the blitter on, and as said, the blitter stays at 8 MHz. GEMBENCH tests are aimed at graphics operations rather than bare CPU speed.


So, I think that vol's program pi-st is already good and fairly accurate. 

What to improve: off topic, but please use 8.3 filenames. Randy will have a heart attack if he copies it to his disks with these lower-case names?

It may be the harder thing to do, but programming it so that it uses much more RAM and cycles through it would give better information about the influence of cache.

If the test uses something like 100 KB of RAM, and does not just touch almost all of it once but goes through it over and over in a loop, then the cache will have a lot of misses, so more of the slower RAM accesses must happen. Of course, the ideal case is when even the instructions are located in different RAM areas, because opcode fetch time affects speed too.

Actually, it should not be so damn hard. Copy the core part, which is surely no more than 512 bytes, to let's say 512 copies spread over 256 KB of RAM, then call them all in a loop, one after another. Well, while copying, the code should also correct the addresses for the workspace, so that the workspace changes too. PC-relative addressing is no good here because it can reach at most 32 KB away.
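The cache argument above can be illustrated with a toy simulation in Python. It assumes a direct-mapped 16 KB cache with one 16-bit word per line and a purely sequential walk through the working set; both are assumptions picked to keep the sketch simple, not a description of the real Mega STE cache. `hit_rate` and its parameters are hypothetical names.

```python
def hit_rate(working_set, cache_size=16 * 1024, line_size=2, passes=10):
    """Fraction of accesses served from a toy direct-mapped cache.

    working_set: bytes walked sequentially, word by word, on every pass.
    cache_size, line_size: assumed cache geometry in bytes (illustrative only).
    passes: how many times the whole working set is walked in a loop.
    """
    lines = cache_size // line_size
    tags = [None] * lines          # which memory block each cache line holds
    hits = 0
    total = 0
    for _ in range(passes):
        for addr in range(0, working_set, line_size):
            block = addr // line_size
            index = block % lines  # direct-mapped: one possible slot per block
            if tags[index] == block:
                hits += 1
            else:
                tags[index] = block  # miss: fetch from slow RAM, replace line
            total += 1
    return hits / total


for size in (512, 16 * 1024, 100 * 1024, 512 * 512):
    print(f"{size // 1024:4d} KB working set: hit rate {hit_rate(size):.2f}")
```

Under these assumptions a 512-byte loop misses only on its first pass (90% hits over 10 passes), while a 100 KB working set or 512 copies of 512 bytes (256 KB) never hit at all, because every line gets overwritten before it is reused. A real run would land somewhere in between, but the direction is the same: a test that fits in the cache says little about RAM speed.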



