
Byte Magazine Sieve Benchmark


TheBF


Hmm.. I didn't look into which mode was faster or slower, but Classic99 runs the CPU clock based on Windows high resolution timers, which are usually derived from the CPU cycle counter. Under a virtual machine, I would expect these timers to be less accurate.

 

The CLOCK device just reads from the host, so no reason to expect weird slips on that device. ;)

 


Returning to the sieve program for a minute 

 

I was wondering how much extra time it was taking Forth to fill up the 8K array with ones, at the top of every sieve operation.

 

The ANS Standard word FILL ( addr len byte --)  is byte oriented and is normally written in Assembler or C in some cases.

I have a version in the can called FILLW which fills 16 bits at a time which is more natural for the 9900. 

 

Replacing FILL with FILLW  shaves  8.5 seconds off the runtime of the benchmark.

It's easy to forget that even running native code the 9900 is not a speed demon. :)

 

Edit: It helps if you get the right answer.  I was not filling all the bytes correctly.  :)

After fixing that the difference was only 2.3 seconds. 

 

 

 

 

 

 

Edited by TheBF
Wrong answer

1 hour ago, TheBF said:

I have a version in the can called FILLW which fills 16 bits at a time which is more natural for the 9900. 

 

Does your routine deal with an odd starting address, other than letting the MOV instruction do its thing and expecting the user to know what s/he is doing? Were FILLW given an odd starting address, without erroring out or managing the first and last byte separately from the intervening word writes, the first byte will not be touched and the last word will extend 1 byte beyond the intended range.

 

...lee


This version (from the Rosetta web site) finds 564 primes to 4095 in ~5 seconds in TF. I've modified the code very slightly - instead of having it produce the list of primes at the end I modified it to just report the number of primes found. Makes it easier for stopwatch purposes.

: prime? ( n -- ? ) here + c@ 0= ;
: composite! ( n -- ) here + 1 swap c! ;
0 value pcount

: sieve ( n -- )
  here over 0 fill
  2
  begin
    2dup dup * >
  while
    dup prime? if
      2dup dup * do
        i composite!
      dup +loop
    then
    1+
  repeat
  drop
  0 to pcount
  2 do i prime? if 1 +to pcount then loop 
  cr pcount . ." primes found" cr ;

Could be optimised by 'de-factoring' and not having prime? and composite! as separate words. When doing so in TF it seems to shave ~1 second off. That time saved is probably in the tight loop at the end of the code. 

Edited by Willsy
typo

On 5/4/2022 at 4:45 AM, Lee Stewart said:

 

Does your routine deal with an odd starting address, other than letting the MOV instruction do its thing and expecting the user to know what s/he is doing? Were FILLW given an odd starting address, without erroring out or managing the first and last byte separately from the intervening word writes, the first byte will not be touched and the last word will extend 1 byte beyond the intended range.

That's an important point. The UCSD Pascal implementation supports intrinsics like fillchar, moveleft and moveright. They are used to speed up data movement operations, compared to writing everything in Pascal. They are also type independent, which means you can fill an array of integers with characters, if you like.

All three examine the starting and ending addresses, as well as the byte count. If all are even, they use MOV. If not, they use MOVB. Thus the programmer gets the best speed possible, without having to know or take the numbers into account himself.

 

After reading the update to the Byte article, I noticed that the missing information is that UCSD Pascal on the TI 99/4A runs the sieve in 512 seconds. Virtually identical to the Apple ][. Forcing the program to run in RAM instead of VDP RAM would probably speed it up a little. Most interesting was that I got virtually the same result with an array as with a packed array. It seems isolating a single bit in a packed array is done at the same speed as looking at the bit in a standard array.
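The packed-array observation is worth seeing concretely. Here is roughly what the two kinds of flag test cost, sketched in C (my own illustration, not the p-code the UCSD compiler actually emits): the byte array is a single indexed load, while the packed array adds an index split, a shift and a mask.

```c
#include <stdint.h>

#define SIZE 8190

uint8_t  flags_bytes[SIZE];             /* one byte per candidate */
uint16_t flags_bits[(SIZE + 15) / 16];  /* one bit per candidate  */

/* Byte array: testing a flag is a single indexed load. */
int flag_byte(int i) { return flags_bytes[i] != 0; }

/* Packed array: split the index into word and bit, then shift and
 * mask -- the extra work that, per the measurement above, the
 * p-machine does about as fast as the plain load. */
int flag_bit(int i)  { return (flags_bits[i >> 4] >> (i & 15)) & 1; }

void set_bit(int i)  { flags_bits[i >> 4] |= (uint16_t)(1u << (i & 15)); }
```

The trade is the classic one: the packed form uses an eighth of the memory in exchange for the extra index arithmetic on every access.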

Edited by apersson850

On 4/22/2022 at 9:27 AM, Reciprocating Bill said:

The follow-up January 1983 BYTE article, with a sh-t ton of results:

 

https://archive.org/details/byte-magazine-1983-01/page/n291/mode/2up

 

My physical copy (from my original subscription) is tattered, taped and scribbled to an embarrassing degree. Not unlike myself. 

 

[Photo: well-worn January 1983 BYTE issue]

From a computing standpoint, this was the golden period. Nowadays, you would be hard pressed to find an equivalent magazine anywhere. The loss of diversity in the computing ecosystem has really made the whole field poorer, at least from a hobbyist's standpoint... Back in the day, claiming that one brand of computer was faster by 10-20 seconds at a certain task than another brand was grounds enough to start fanatical wars he he...

 


10 hours ago, Lee Stewart said:

 

Does your routine deal with an odd starting address, other than letting the MOV instruction do its thing and expecting the user to know what s/he is doing? Were FILLW given an odd starting address, without erroring out or managing the first and last byte separately from the intervening word writes, the first byte will not be touched and the last word will extend 1 byte beyond the intended range.

 

...lee

I took the "Forth is a Macro Assembler, so caveat emptor" approach. 

Every data structure declared in an ANS-compliant way is aligned to an even address, so the risk is minimal.

It's easy to wrap it with an error-out on an odd address argument, if the application required it.

The problem I missed when I first tried it was that I had to provide HEX 0101 as the fill argument, of course. It also worked correctly if I just used -1 (all bits set).

 

I do adjust the count with  code for  1-  2/  and still use  the DEC instruction on the loop counter.

Seems to work but I have not really stress tested it.
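Lee's boundary concern can be sketched in C (fillw_safe is a hypothetical illustration of my own, not the FILLW under discussion): write a leading byte if the start address is odd, fill the aligned middle word-wide, then write a trailing byte if one remains. It also shows why the fill byte has to be replicated into both halves of the word pattern, which is what HEX 0101 (or -1) was doing above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical word-oriented fill that tolerates an odd start address
 * and an odd length: leading byte write if needed, aligned middle done
 * with 16-bit stores, then a trailing byte write. */
void fillw_safe(uint8_t *addr, size_t len, uint8_t byte)
{
    /* Leading byte if the start address is odd. */
    if (((uintptr_t)addr & 1u) && len > 0) {
        *addr++ = byte;
        len--;
    }

    /* Replicate the byte into both halves of the word pattern:
     * 0x01 -> 0x0101 (hence HEX 0101), 0xFF -> 0xFFFF (hence -1). */
    uint16_t pattern = (uint16_t)(byte * 0x0101u);

    /* Aligned middle: one word store per byte pair. */
    uint16_t *w = (uint16_t *)(void *)addr;
    while (len >= 2) {
        *w++ = pattern;
        len -= 2;
    }

    /* Trailing byte if one remains. */
    if (len)
        *(uint8_t *)w = byte;
}
```

With the boundary bytes handled this way, neither the byte before an odd start nor the byte after an odd end is disturbed.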

 


Thanks for the link!

 

To save everyone else the eye-strain, here's the Forth code from that article (I changed the arrangement/formatting, but the code is identical):

8190 constant size
variable flags  size allot
: do-prime
  flags size 1 fill
  0 size 0 do
    flags i + c@ if 
      i dup + 3 + dup i +
      begin dup size < while 
        0 over flags + c! over +
      repeat
      drop drop 1+
    then
  loop
  . ." primes" ;

Takes ~10 secs in TF (on a Windows 10 machine).


5 hours ago, apersson850 said:

That's an important point. The UCSD Pascal implementation supports intrinsics like fillchar, moveleft and moveright. They are used to speed up data movement operations, compared to writing everything in Pascal. They are also type independent, which means you can fill an array of integers with characters, if you like.

All three examine the starting and ending addresses, as well as the byte count. If all are even, they use MOV. If not, they use MOVB. Thus the programmer gets the best speed possible, without having to know or take the numbers into account himself.

 

After reading the update to the Byte article, I noticed that the missing information is that UCSD Pascal on the TI 99/4A runs the sieve in 512 seconds. Virtually identical to the Apple ][. Forcing the program to run in RAM instead of VDP RAM would probably speed it up a little. Most interesting was that I got virtually the same result with an array as with a packed array. It seems isolating a single bit in a packed array is done at the same speed as looking at the bit in a standard array.

This is a case where Wirth and Moore did not agree. (what a shock) :) 

Chuck Moore was in favour of using the Forth dictionary as the case statement to determine which code to run rather than using data as the selector.

So in effect the programmer chooses the best word for the job. The system's job was to provide a relevant set of simple intrinsics so that it was possible to do that.

This of course has a longer runway for programmer fluency: "different horses for different courses," as they say.

 

Can I put the 512 seconds measurement into the table under the un-optimized column then?

 

I have a bit array library and was thinking about trying the sieve using an array of bits as well. I think it will be slower in my implementation.

 


If we only judge by the popularity of object oriented languages, or should I say concepts, Wirth was right.

There's not really any optimization to do, if you should stay with the original source code, as it appeared in the Byte article.

I actually did one, as I replaced x := x+1 by x := succ(x). Knowing what code the compiler generates for these cases, I know that the second is a little faster.

Oh yes, now I realize that I forgot to reshuffle the variable declarations. In the p-system, short form addressing of procedure calls, constants, immediates and variables works with the first few items. When it comes to variables, it depends on the order they are declared. Thus you should never start by declaring arrays before single integers.

In the program, as it's listed, the entire array, with 8000 items, is in front of the integers. That means that the compiler will generate a long load of an offset value on the stack, then use that value as an offset when fetching the variable's value. Instead of fetching the value indexed by the instruction's own opcode.

 

You can hold off entering the value into the table until I've fixed that. This is run on a real machine, so I have to get home first.

Edited by apersson850

26 minutes ago, apersson850 said:

If we only judge by the popularity of object oriented languages, or should I say concepts, Wirth was right.

And if we say that right means most popular. :)  

 

I actually tend to think, for practical purposes, the Wirth approach makes most sense.  Moore is an extremist in the goal of simplicity in computing.

I think sometimes he forgot the axiom: "make everything as simple as possible, but not simpler" (Einstein?)

His CPU design ideas IMHO deserve more $ and R&D. His efforts to reduce subroutine calls to 1 cycle are pretty neat. (Check out RISC-V with 32 registers for the antithesis)

( I am a little biased toward kicking sacred cows when it's appropriate) ;) 

 

26 minutes ago, apersson850 said:

There's not really any optimization to do, if you should stay with the original source code, as it appeared in the Byte article.

I actually did one, as I replaced x := x+1 by x := succ(x). Knowing what code the compiler generates for these cases, I know that the second is a little faster.

Oh yes, now I realize that I forgot to reshuffle the variable declarations. In the p-system, short form addressing of procedure calls, constants, immediates and variables works with the first few items. When it comes to variables, it depends on the order they are declared. Thus you should never start by declaring arrays before single integers.

In the program, as it's listed, the entire array, with 8000 items, is in front of the integers. That means that the compiler will generate a long load of an offset value on the stack, then use that value as an offset when fetching the variable's value. Instead of fetching the value indexed by the instruction's own opcode.

 

Now this is interesting to me. When you need that kind of "carnal knowledge" to make things work well, that seems to be anti-Wirth. Could be the UCSD designers had no choice?

 

26 minutes ago, apersson850 said:

You can hold off entering the value into the table until I've fixed that. This is run on a real machine, so I have to get home first.

I think you have just defined the difference between the BYTE version and your optimized version, no?


I think the designers of the p-machine did their best to accomplish a few goals:

  • Create an instruction set which would be optimized to support a compiler that could handle a meaningful implementation of Pascal. It has almost everything you'd expect from standard Pascal, if there is such a creature. And then some.
  • Create an instruction set which would be compact, to allow for small memories.
  • Create an instruction set which would be reasonably efficient, under the circumstances.

Now the compiler at hand is designed to handle a full Pascal implementation. It doesn't do any optimization at all, as far as I know. It just fits in memory and can do what it's supposed to do. But with some more space, it wouldn't be too difficult to make a compiler automatically allocate space for "small" variables first. To actually know which ones benefit most from that would require more advanced analysis of the whole program.

Since the program works in both cases, it's perhaps not totally anti-Wirth. For sure the UCSD implementers did what they could with what they had in 1975-1980. As it is, the programmer has to take it from there. Declaring the most frequently used things first is at least a pretty simple rule to remember. You don't have to remember exactly how many variables are available for short-form addressing. If you follow that rule you've done what you can. And it's more or less universal, as it's true not only for variables, but also for constants and procedures.

 

I'd say the philosophical difference between the Forth and Pascal implementations here is that for Pascal, it's good if you know the details, but it will work without that knowledge. With Forth, it will not work as intended if you don't know the rules, and follow them.

 

The trick of using succ(x) instead of x+1 takes advantage of the fact that, for that kind of arithmetic, the first thing the compiler will do is generate code which fetches the value of x to the top of the stack. Then, in the first case, it will short-load the constant 1 to the stack and then do an ADD. In the second case, it will INCrement the top of stack by one. INC and ADD take about the same time, so you save the work of pushing the constant 1 to the stack.
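Side by side, the difference looks something like this (illustrative stack-machine pseudocode; the mnemonics are hypothetical, not actual UCSD p-codes):

```
x := x + 1                 x := succ(x)
-----------                ------------
LOAD  x   ; push x         LOAD  x   ; push x
PUSHC 1   ; push const 1   INC       ; bump top of stack
ADD       ; add
STORE x   ; pop into x     STORE x   ; pop into x
```

The succ form skips the push of the constant 1, which is where the small saving comes from.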

 

Yes, you're right, I've defined the time to run the program as it's written to 512 seconds and by flipping the variable declaration around it went down to 471 seconds. Not too bad for just rearranging the source code a bit.

Edited by apersson850

4 hours ago, Willsy said:

Thanks for the link!

 

To save everyone else the eye-strain, here's the Forth code from that article (I changed the arrangement/formatting, but the code is identical):


NEEDS ELAPSE FROM DSK1.ELAPSE
NEEDS CARRAY FROM DSK1.ARRAYS

DECIMAL
 8190 CONSTANT SIZE
 SIZE CARRAY []FLAGS

0 []FLAGS OFF

: DO-PRIME
   0 []FLAGS SIZE 1 FILL  ( set array )
   0          ( counter )
   SIZE 0
   DO I []FLAGS C@
     IF
        I 2* 3 +  DUP I +
        BEGIN DUP SIZE <
        WHILE
           0 OVER []FLAGS C!
           OVER +
        REPEAT
        2DROP
        1+
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

Takes ~10 secs in TF (on a Windows 10 machine).

I adjusted my Classic99 (Options/CPU tuning) so that TurboForth ran in 10 seconds on my Win10 machine as well.

Then I ran the code on Camel99 ITC to get a reference and it came in at 12.21 

Told you TF holds its own. :) 

 

But... there are a couple of "I just learned Forth yesterday to do a magazine article" things in this code.

 

DUP +   of course should be 2*

DROP DROP  should be 2DROP 

 

The other thing that most Forths of the time would have provided was CARRAY.

So I put those three things together.  (Using DSK1.ARRAYS  lib for Camel99 Forth) 

 

And just doing that gave me 10.71 seconds, or -14%, just by using the language better (correctly?).

I need to stop this now.

 


Ok just one more. I had to know.

 

I loaded the DSK1.BOOLEAN lib file and it takes about 4.5 times longer. :) 

But the flags array is only 1035 bytes in size!

The BOOLEAN code is all Forth so it could be optimized but it works. 

\ BYTE MAGAZINE SIEVE using a bit-array    May 4, 2022 Brian Fox

\ 10 ITERATION TIMINGS
\                Normal
\ Camel99 ITC    440 seconds

NEEDS ELAPSE FROM DSK1.ELAPSE
NEEDS BITS: FROM DSK1.BOOLEAN

DECIMAL
 8190 CONSTANT SIZE
 SIZE BITS: []FLAGS

 0 0 []FLAGS BRST

: BITS>BYTES ( n -- n') BITS/BYTE /MOD SWAP CELLS + ;

: DO-PRIME
   0 []FLAGS SIZE BITS>BYTES  -1 FILL
   0          ( counter )
   SIZE 0
   DO I []FLAGS BIT@
     IF
        I 2* 3 +  DUP I +
        BEGIN DUP SIZE <
        WHILE
            DUP []FLAGS BRST
           OVER +
        REPEAT
        2DROP
        1+
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

: PRIMES ( -- )
   CR ." 10 Sieves with BIT-ARRAY"
   10 0 DO  DO-PRIME  LOOP
   CR ." Done!"
;

\ BOOLEAN array creator and access words               Brian Fox  2018
\ Tested with  CAMEL99 Forth 16 bits, Gforth 32 bits, iForth 64 bits
\ July 2019, removed Rstack juggling BSET, BRST, BTOG.
\ Relaced with OVER. 20% faster

HEX
\ calculations based on CELL size
                   8 CONSTANT BITS/BYTE
 1 CELLS BITS/BYTE * CONSTANT BITS/CELL

\ create bit array & erase memory area for 'n' bits
: BITS:      ( n -- )
      CREATE  BITS/BYTE /MOD SWAP >R   \ calc memory size
              HERE OVER 0 FILL         \ erase the memory
              R> CELLS +  ALLOT  ;     \ allocate the memory

\ compute bit# in a cell & cell address in memory
\ usage:  42 MYARRAY BITFLD
: BITFLD  ( bit# bits[] -- bit#' addr) SWAP BITS/CELL /MOD CELLS ROT + ;

\ convert bit# to a bit mask
: BITMASK ( bit# -- n )  0001 SWAP LSHIFT ;

\ API
: BIT@ ( bit# bits[] -- ?) BITFLD @ SWAP RSHIFT 0001 AND ;  \ return 1 bit
: BSET ( bit# bits[] -- )  BITFLD SWAP BITMASK        OVER @  OR SWAP ! ;
: BRST ( bit# bits[] -- )  BITFLD SWAP BITMASK INVERT OVER @ AND SWAP ! ;
: BTOG ( bit# bits[] -- )  BITFLD SWAP BITMASK        OVER @ XOR SWAP ! ;

 

 

[Screenshot: Classic99 running the bit-array sieve]


17 hours ago, TheBF said:

Now this is interesting to me.  When you need that kind of "carnal knowledge" to make things work well...

Even if it works without it, just knowing this and using that knowledge took away 8% of the runtime. Not too bad for such a simple effort.


1 hour ago, apersson850 said:

Even if it works without it, just knowing this and using that knowledge took away 8% of the runtime. Not too bad for such a simple effort.

Indeed. And, importantly, even without the modification, the code still runs just fine. Forth, being a much lower-level language (more like a basic programming toolkit than anything else) often relies on in-depth knowledge; being ignorant of certain aspects of the language can trip you up very fast. 


15 hours ago, TheBF said:

But... there are a couple of "I just learned Forth yesterday to do a magazine article" things in this code.

Good catch. I thought I would try the inliner that I posted the other night in the other thread. I inlined the following:

inline: i+c@   i + c@   ;inline
inline: dupi+  dup i +  ;inline
inline: over+  over +   ;inline

8190 constant size
variable flags  size allot
: do-prime
  flags size 1 fill
  0 size 0 do
    flags i+c@ if 
      i 2* 3 + dupi+
      begin dup size < while 
        0 over flags + c! over+
      repeat
      2drop 1+
    then
  loop
  . ." primes" ;

Knocks about 1.5 seconds off. ~12.5% faster! 

Edited by Willsy

1 hour ago, Willsy said:

Indeed. And, importantly, even without the modification, the code still runs just fine. Forth, being a much lower-level language (more like a basic programming toolkit than anything else) often relies on in-depth knowledge; being ignorant of certain aspects of the language can trip you up very fast. 

I saw some tests reported in that Byte article where a native code converter was available for the machine. On the processors it was available for, they reached around 20 seconds.

Knowing how the PME is written for the TI 99/4A, I can estimate a runtime of about 60 seconds if such a converter had been available for our little computer. I've never seen any reference stating that any ever existed for the TMS 9900, for any machine.

Edited by apersson850

6 hours ago, apersson850 said:

Knowing how the PME is written for the TI 99/4A, I can estimate a runtime of about 60 seconds if such a converter had been available for our little computer. I've never seen any reference stating that any ever existed for the TMS 9900, for any machine.

I think your estimate is pretty good.

With my very simple JIT I can compile all the code inline except for the looping primitives.  

Using direct threaded Forth, which has the fastest loops, the sieve runs in 67 seconds. 

We are still way off of Bill Sullivan's 10 second ASM program. 

I am hoping to see what GCC does on this compared to Bill's program if someone could compile it. 


On 5/5/2022 at 7:38 PM, TheBF said:

I am hoping to see what GCC does on this compared to Bill's program if someone could compile it.

Voilà: eratosthenes_sieve.c.bin

 

~13 seconds using Classic99 and a stopwatch.

 


#include "lib.h"

void main() {
    load_standard_characters();
    screen(CYAN);
    
    int iter = 10;
    char s[20];
    format_string(s, "% ITERATIONS", iter);
    display_at(1, 1, s);
    
    int size = 8192;
    char* flags = (char*) 0xA000;
    int line = 2;
    for(; iter > 0; iter--) {
        int count = 0;
        int i;
        for (i = 0; i < size; i++)
            flags[i] = 1;
            
        for (i = 0; i < size; i++) {
            if (flags[i]) {
                int prime = i + i + 3;
                int k = i + prime;
                while (k <= size) {
                    flags[k] = 0;
                    k += prime;
                }
                count++;
            }
        }
        
        format_string(s, "% PRIMES", count);
        display_at(line, 1, s);
        line++;
    }
    
    display_at(line, 1, "DONE!");
    while(1) interrupts();
}

 

 


11 hours ago, lucien2 said:

Voilà: eratosthenes_sieve.c.bin

 

~13 seconds using Classic99 and a stopwatch.

 


That's super.  Merci Lucien.

A bit slower than Bill's hand coded version but extremely good.

GCC is amazing.


11 hours ago, lucien2 said:

Voilà: eratosthenes_sieve.c.bin

 

~13 seconds using Classic99 and a stopwatch.

 


What optimization levels did you use for this result?

 


After some bug fixing, I got the Sieve running on the StrangeCart Basic. I needed to fix a problem with FOR loop handling when the initial value already exceeded the limit. I also increased the size of symbol space (as I call it) to 40 kbytes, so that the DIM for the flags succeeds. Below are screenshots of a run on the actual hardware as well as the listing of the TI Basic program. This cannot be run on normal TI Basic, since the array size exceeds RAM space. Also, for convenience, line 110 makes the DIM based on a variable value, which does not work with TI Basic.

The runtime is 0.86 seconds with current firmware version.

[Screenshot: sieve run on the actual StrangeCart hardware]

And here is the listing.

[Screenshot: listing of the TI Basic sieve program]

I also edited this so that the code reads as below, which can be run on TI Basic. It takes about 47 seconds to run. On the StrangeCart Basic this takes just under 0.14 seconds. That is only about 340 times faster, so I still have optimisation work to do; I haven't focused on optimisation yet. Without I/O the Basic runs at around 60,000 to 80,000 lines per second.

[Screenshot: modified sieve run]

