Jump to content
IGNORED

Byte Magazine Sieve Benchmark


TheBF

Recommended Posts

Over in Benchmarking Languages @VORTICON provided us with the Classic Byte magazine article on the subject .

In this topic I would like to do this for our TI-99 universe. TI-99 has an excellent set of languages for a machine of its vintage.

 

Pick your favourite language and use the Byte version of the sieve as your first contribution.

Stay as close as possible to the Byte listing as you can with the TI-99 dialect of your chosen language.

If your language supports optimization or alternative coding techniques use them and post those results as an "optimized" result.

 

 

 

Here are some results from other machines circa 1989.

 

Sieve-results-from-past (2).png

Here is what we have so far.

bytesieve-benchmark-table-OCT2020.png

  • Like 1
Link to comment
Share on other sites

The followup January 1983 followup BYTE article, with a sh-t ton of results:    

 

https://archive.org/details/byte-magazine-1983-01/page/n291/mode/2up

 

My physical copy (from my original subscription) is tattered, taped and scribbled to an embarrassing degree. Not unlike myself. 

 

Here are some additional TI results, running on a 16-bit console. IIRC the Forth code had to be modified a bit to run. Cortex BASIC gets a small push from multiple statements per line in a couple instances. 

 

Assembly:    6.4"

Wycove Forth 3.0:   92"

Wycove Forth 2.0    112"

Turbo Forth  105"

FB Forth   158"

Cortex BASIC 1  1318" (Cortex BASIC loaded into RAM, integer FLAGS array in SuperCart)

Cortex BASIC 2  1685" (Cortex BASIC from FinalGrom, integer FLAGS array in expansion memory)

 

IMG_1890.thumb.jpeg.4c9ad39c8680630e85b725a627d4f042.jpeg

  • Like 8
Link to comment
Share on other sites

3 hours ago, Reciprocating Bill said:

Here are some additional TI results, running on a 16-bit console. IIRC the Forth code had to be modified a bit to run. Cortex BASIC gets a small push from multiple statements per line in a couple instances. 

 

Assembly:    6.4"

Wycove Forth 3.0:   92"

Wycove Forth 2.0    112"

Turbo Forth  105"

FB Forth   158"

Cortex BASIC 1  1318" (Cortex BASIC loaded into RAM, integer FLAGS array in SuperCart)

Cortex BASIC 2  1685" (Cortex BASIC from FinalGrom, integer FLAGS array in expansion memory)

Bill, when you say a 16-bit console, do you mean it was modded so all the RAM is on the 16 bit buss?

 

Link to comment
Share on other sites

1 hour ago, Vorticon said:

Are these results using the full 8K array and 10 iterations?

On the ZX-81 I could only dimension a 2K array before running out of memory, so I had to multiply the processing time by 4 to match other computers (1800 seconds). 

They are all against the 8192K array, and 10 iterations.

 

CBASIC could not dimension an array that size either, because numeric arrays are dimensioned as floating point with no option for an integer DIM. But what it can do is set aside an arbitrary amount of RAM (originally for the purpose of running assembly routines) and then poke and peek bytes or words (with the MEM() operation) into specific addresses, with statements like "MEM(FLG+K)=0" where "FLG" is the base address of the array.

 

When running CBASIC entirely in RAM (the fastest option on my console) there is only about 6K free after loading the interpreter, so I loaded CBASIC from a SuperCart and situated the FLAGS array in the 8K slow RAM it contains. When running out of FinalGrom CBASIC has about 16K free, so I allocated 8K of high memory for the integer FLAGS array (but then parts of the interpreter are running out of slow module ROM). In either case there is actually a bit of extra work for the interpreter to do relative to the original algorithm, as the MEM() address has to be calculated (one extra addition each access). 

 

Because CBASIC otherwise treats variables as integers where possible (but as FP if necessary), I'd say it's meaningful to put it up against other integer BASICs running the Sieve. So, for example, Apple II Integer BASIC completes the Sieve in 1850 seconds, an IBM PC using integers in MBASIC in 1950 seconds, and so on - both of which CBASIC bests handily at either 1318 or 1685 seconds. Remembering that this is a 16-bit console, it puts me in mind of "what could have been." 

 

 

  • Like 2
Link to comment
Share on other sites

So these are Classic99 timings but they usually hold very close to my real hardware.

The Byte mag. code needed one minor change for ANS Forth and the 10X iterator seems missing so I added that.

 

Spoiler

\ BYTE MAGAZINE SIEVE

NEEDS ELAPSE FROM DSK1.ELAPSE

DECIMAL
 8192 CONSTANT SIZE
 VARIABLE FLAGS   SIZE ALLOT  0 FLAGS !
: DO-PRIME
   FLAGS SIZE 1 FILL  ( set array )
   0        ( counter )
   SIZE 0
   DO FLAGS I + C@
     IF I DUP + 3 + DUP I +
        BEGIN DUP SIZE <
        WHILE 0   OVER FLAGS +  C!  OVER +
        REPEAT
        DROP DROP 1+
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

: PRIMES ( n -- )
   ."  10 Iterations"
   10 0 DO  DO-PRIME  LOOP
   CR ." Done!"
;

 

 

Camel99 DTC    101.7 secs

TurboForth        107 secs

Camel99 ITC     119.5 secs

FbForth 2.0:13  174   secs

 

Camel99 ITC V2.69 Primes.png

Camel99 DTC 2.68 Primes.png

Link to comment
Share on other sites

There was also the Creative Computing magazine article from 1984. 

 

What amazes me is just how many completely separate (and competing) hardware and software platforms there were back in the day.

It was truly the wild-west of home computing.

 

I have the magazine but here's a direct link to the specific section....

 

https://archive.org/details/creativecomputing-1984-03/page/n7/mode/2up

  • Like 2
Link to comment
Share on other sites

6 hours ago, TheBF said:

fbForth 2.0:13  174   secs

 

I changed two words in the loops to see whether it made a difference:

IF I DUP + 3 + DUP I +

to

IF I I + 3 + DUP I +

because reusing I is marginally faster than DUPing it, and

        DROP DROP 1+

to

        2DROP 1+

2DROP was defined as

HEX
CODE: 2DROP  ( n1 n2 -- )
   05C9      \ SP INCT,  
   05C9      \ SP INCT,  
;CODE

It reduced the time by only one second!  :-o

 

...lee

  • Like 2
Link to comment
Share on other sites

I enabled my inline optimizer where it could make a difference on the same code on the Indirect threaded Camel99 Forth.

(I have a bug in the Direct threading version) 

NEEDS INLINE[  FROM DSK1.INLINE
\ optimized Version
DECIMAL
: DO-PRIME2
   FLAGS SIZE 1 FILL  ( set array )
   0        ( counter )
   SIZE 0
   DO INLINE[ FLAGS I + C@ ]
     IF INLINE[ I DUP + 3 + DUP I + ]
        BEGIN INLINE[ DUP SIZE ] <
        WHILE INLINE[ 0  OVER FLAGS +  C!  OVER + ]
        REPEAT
       INLINE[ DROP DROP 1+ ]
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

\ Camel99
: PRIMES-OPT ( -- )
   PAGE ." 10 Iterations"
   10 0 DO  DO-PRIME2 LOOP
   CR
   CR ." Done!"
;

This ran the 10 iterations in 80.8 seconds.

 

Here is what the Forth code  "FLAGS I + C@"   became.  

2002  0646  dect R6                     (14)
2004  C584  mov  R4,*R6                 (30)
2006  0204  li   R4,>c1c4               (20)
200A  0646  dect R6                     (14)
200C  C584  mov  R4,*R6                 (30)
200E  C117  mov  *R7,R4                 (26)
2010  6127  s    @>0002(R7),R4          (34)
2014  A136  a    *R6+,R4                (30)
2016  D114  movb *R4,R4                 (26)
2018  0984  srl  R4,8                   (32)
201A  045A  b    *R10                  

 

  • Like 1
Link to comment
Share on other sites

GPL is disappointing with this one: 342 seconds. Maybe it's all these accesses to expansion RAM?

I used an old mess version (0.136), I can't find how to run it on mame 0.164: eratosthenes sieve.g.bin

With Classic99, it takes 387 seconds.

 

Spoiler

        grom    >6000
        data    >aa00,>0100,>0000
        data    menu
        data    >0000,>0000,>0000,>0000
menu    data    >0000
        data    start
        stri    'ERATOSTHENES SIEVE'

upcase  equ     >0018
iter    equ     >8300
count   equ     >8301
size    equ     >8303
i       equ     >8305
flags   equ     >a000
prime   equ     >8307
k       equ     >8309
tmp     equ     >830b
s1      stri    'ITERATIONS'
s2      stri    'PRIMES'
s3      stri    'DONE!'
line    equ     >830d

start
        dst     8191,@size
        st      10,@iter
        
        dst     >0900,@fac
        call    upcase
        
        dclr    @arg
        clr     @arg+2
        st      @iter,@arg+3
        call    print
        
        clr     @tmp
        move    1,g@s1,@tmp+1
        dinc    @arg+4
        move    @tmp,g@s1+1,v*arg+4
        st      1,@line
        
L6      dclr    @count
        dst     @size,@i
        
        st      1,@tmp
L1      st      @tmp,@flags(@i)
        ddec    @i
        dcz     @i
        br      L1
        
        dclr    @i
L3      cz      @flags(@i)
        bs      L2
        dst     @i,@prime
        dadd    @i,@prime
        dadd    3,@prime
        dst     @prime,@k
        dadd    @i,@k
L5      dch     @size,@k
        bs      L4
        clr     @flags(@k)
        dadd    @prime,@k
        b       L5
L4      dinc    @count
L2      dinc    @i
        dch     @size,@i
        br      L3

        st      @line,@arg
        clr     @arg+1  
        dst     @count,@arg+2
        call    print
        clr     @tmp
        move    1,g@s2,@tmp+1
        clr     @arg
        st      @line,@arg+1
        inc     @line
        dsll    5,@arg
        dinc    @arg+4
        dadd    @arg+4,@arg
        move    @tmp,g@s2+1,v*arg
        
        dec     @iter
        cz      @iter
        br      L6
        
        clr     @tmp
        move    1,g@s3,@tmp+1
        clr     @arg
        st      @line,@arg+1
        dsll    5,@arg
        move    @tmp,g@s3+1,v*arg
        
L7      b       L7

* print unsigned word decimal
* arg: row, arg+1: col, arg+2/3: number
print   clr     @fac
        st      @arg,@fac+1
        dsll    5,@fac
        clr     @arg
        dadd    @arg,@fac
        call    >2f7c
        st      @>8367,@fac+3
        clr     @fac+2
        clr     @arg+4
        move    @arg+4,@>8300(@fac+2),v*fac
        rtn

 

 

  • Like 2
  • Thanks 2
Link to comment
Share on other sites

3 hours ago, lucien2 said:

GPL is disappointing with this one: 342 seconds. Maybe it's all these accesses to expansion RAM?

I used an old mess version (0.136), I can't find how to run it on mame 0.164: eratosthenes sieve.g.bin

With Classic99, it takes 387 seconds.

 

  Reveal hidden contents


        grom    >6000
        data    >aa00,>0100,>0000
        data    menu
        data    >0000,>0000,>0000,>0000
menu    data    >0000
        data    start
        stri    'ERATOSTHENES SIEVE'

upcase  equ     >0018
iter    equ     >8300
count   equ     >8301
size    equ     >8303
i       equ     >8305
flags   equ     >a000
prime   equ     >8307
k       equ     >8309
tmp     equ     >830b
s1      stri    'ITERATIONS'
s2      stri    'PRIMES'
s3      stri    'DONE!'
line    equ     >830d

start
        dst     8191,@size
        st      10,@iter
        
        dst     >0900,@fac
        call    upcase
        
        dclr    @arg
        clr     @arg+2
        st      @iter,@arg+3
        call    print
        
        clr     @tmp
        move    1,g@s1,@tmp+1
        dinc    @arg+4
        move    @tmp,g@s1+1,v*arg+4
        st      1,@line
        
L6      dclr    @count
        dst     @size,@i
        
        st      1,@tmp
L1      st      @tmp,@flags(@i)
        ddec    @i
        dcz     @i
        br      L1
        
        dclr    @i
L3      cz      @flags(@i)
        bs      L2
        dst     @i,@prime
        dadd    @i,@prime
        dadd    3,@prime
        dst     @prime,@k
        dadd    @i,@k
L5      dch     @size,@k
        bs      L4
        clr     @flags(@k)
        dadd    @prime,@k
        b       L5
L4      dinc    @count
L2      dinc    @i
        dch     @size,@i
        br      L3

        st      @line,@arg
        clr     @arg+1  
        dst     @count,@arg+2
        call    print
        clr     @tmp
        move    1,g@s2,@tmp+1
        clr     @arg
        st      @line,@arg+1
        inc     @line
        dsll    5,@arg
        dinc    @arg+4
        dadd    @arg+4,@arg
        move    @tmp,g@s2+1,v*arg
        
        dec     @iter
        cz      @iter
        br      L6
        
        clr     @tmp
        move    1,g@s3,@tmp+1
        clr     @arg
        st      @line,@arg+1
        dsll    5,@arg
        move    @tmp,g@s3+1,v*arg
        
L7      b       L7

* print unsigned word decimal
* arg: row, arg+1: col, arg+2/3: number
print   clr     @fac
        st      @arg,@fac+1
        dsll    5,@fac
        clr     @arg
        dadd    @arg,@fac
        call    >2f7c
        st      @>8367,@fac+3
        clr     @fac+2
        clr     @arg+4
        move    @arg+4,@>8300(@fac+2),v*fac
        rtn

 

 

I ran it on real iron from a FinalGrom: 399 seconds. And that's with all RAM on the 16-bit bus. 

 

Further perspective. For my assembly version:

6.4 seconds, all RAM 16-bit.

7.1 seconds, unmodified console, code and workspace squeezed into PAD ram, array in memory expansion. (So array in memory expansion is not that big of a factor.) 

~10 seconds, unmodified console, WS in PAD, code and array in memory expansion.

~15 seconds, unmodified console, everything in expansion RAM.

 

Edited by Reciprocating Bill
OCD
  • Like 2
  • Thanks 1
Link to comment
Share on other sites

5 hours ago, lucien2 said:

I used an old mess version (0.136), I can't find how to run it on mame 0.164: eratosthenes sieve.g.bin

 

No naked bins with MAME please.

 

Just put it in a RPK file with this layout.xml. You can re-use it for similar cartridges.

 


<?xml version="1.0" encoding="utf-8"?>
<romset version="1.0" listname="editass">
  <resources>
     <rom id="gromimage" file="eratosthenes_sieve.bin"/>
  </resources>
  <configuration>
      <pcb type="standard">
         <socket id="grom_socket" uses="gromimage"/>
      </pcb>
  </configuration>
</romset>

sieve.rpk

 

Edit: 402 seconds on MAME.

  • Like 1
  • Thanks 1
Link to comment
Share on other sites

2 hours ago, Reciprocating Bill said:

I ran it on real iron from a FinalGrom: 399 seconds. And that's with all RAM on the 16-bit bus. 

 

Further perspective. For my assembly version:

6.4 seconds, all RAM 16-bit.

7.1 seconds, unmodified console, code and workspace squeezed into PAD ram, array in memory expansion. (So array in memory expansion is not that big of a factor.) 

~10 seconds, unmodified console, WS in PAD, code and array in memory expansion.

~15 seconds, unmodified console, everything in expansion RAM.

 

Bill, can you post the Assembly Language code that you wrote?

Link to comment
Share on other sites

I got the optimizer working on the Direct Threaded Forth system. 

In order to make the optimizer simpler for Direct threaded Forth I had to roll back an optimization that I had put this new system.

( inlining the 2 instruction Forth interpreter )

 

That brought the un-optimized kernel time up to 107.5  seconds (versus 101 posted earlier)

 

With optimizer the timing on the DTC system is 71.1 seconds. 

 

Link to comment
Share on other sites

Here's the assembly code. Later I'll try posting the binary.  This version sets up the workspace at >8300 but makes no attempt to roll any code into the PAD. It should run in about 10 seconds on a stock console. 

 

* SIEVE OF ERATOSTHENES ------------------------------------------
* WSM 4/2022
* TMS9900 assembly adapted from BYTE magazine 9/81 and 1/83 issues
* 10 iterations 6.4 seconds on 16-bit console
* ~10 seconds on stock console
 
        DEF   START
        REF   VMBW
 
M      DATA  >0000             for PRINT routine
M2     BSS   6
M3     DATA  10000
       DATA  1000
       DATA  100
       DATA  10
ROW    DATA  >0004
 
MSG1   TEXT '  Sieve of Eratosthenes '
MSG2   TEXT '      10 iterations '
MSG3   TEXT '      done! '
 
SIZE   EQU  8190
FLAGS  BSS  8191
PRWS   BSS  >20
 
* main *
       EVEN
START  LWPI >8300              workspace in PAD for stock console
       LI   R5,>0101           to initialize array
       LI   R8,SIZE            index
       LI   R12,10             # iterations
       LI   R15,11             constant (for calculating row of display) 
 
       LI   R0,36              display title
       LI   R1,MSG1
       LI   R2,24
       BLWP @VMBW
       AI   R0,32
       LI   R1,MSG2
       LI   R2,20
       BLWP @VMBW
 
AGAIN  CLR  R7                 0 constant
       CLR  R10                counts primes
       LI   R6,FLAGS           base of FLAGS array
       LI   R0,SIZE            index FLAGS array
 
LOOP   MOV  R5,*R6+            write ones to FLAGS
       DECT R0
       JNE  LOOP
 
       LI   R6,FLAGS           reset index
FOR    CB   *R6+,R7            FLAGS(I)=0?
       JEQ  SIEVE3             yes, skip strikeout loop
 
       MOV  R0,R1              PRIME=I
       SLA  R1,1               *2
       AI   R1,3               +3
       MOV  R0,R2              K=I
       A    R1,R2              add PRIME to K

SIEVE1 C    R2,R8              K>SIZE?
       JGT  SIEVE2             yes, goto SIEVE2
       MOVB R7,@FLAGS(R2)      no,reset @FLAGS(R2)
       A    R1,R2              K=K+PRIME
       JMP  SIEVE1             
SIEVE2 INC  R10                increment count of primes


SIEVE3 INC  R0                 next I
       C    R0,R8              I>SIZE?
       JLT  FOR                no, next
 
       MOV  R10,@M             print #primes found
       BL   @PRINT
       DEC  R12
       JNE  AGAIN              rinse and repeat
 
       MOV  @ROW,R15           print "Done!"    
       INCT R15
       SLA  R15,5
       MOV  R15,R0
       LI   R1,MSG3
       LI   R2,12
       BLWP @VMBW


OUT    LIMI 2                  wait for quit
       JMP  OUT
 
* Print binary value as decimal (from Morley, p.149)
* value to print @M

 

PRINT  LWPI PRWS
       INC  @ROW               increment row counter
       LI   R0,4         
       LI   R1,M3
       LI   R2,M2              R2 points to address containing result
       MOV  @M,R4              mov integer to be converted to R4
J1     CLR  R3                 see Morley for comments
       DIV  *R1+,R3
       SLA  R3,8
       AI   R3,>3000
       MOVB R3,*R2+
       DEC  R0
       JNE  J1
       SLA  R4,8
       AI   R4,>3000
       MOVB R4,*R2+
       LI   R1,>2000
       MOVB R1,*R2
       LI   R0,>3000
       LI   R3,M2
J2     CB   *R3,R0
       JNE  J3
       MOVB R1,*R3+
       JMP  J2
J3     CB   *R3,R1
       JEQ  J4
       JMP  J5
J4     DEC  R3
       MOVB R0,*R3
J5     MOV  @ROW,R15           use row counter
       SLA  R15,5              row *32
       MOV  R15,R0
       INCT R0                 write to this addres in VDP
       LI   R1,M2
       LI   R2,6
       BLWP @VMBW
       LWPI >8300
       RT
 
       END

Edited by Reciprocating Bill
OCD
  • Like 3
  • Thanks 2
Link to comment
Share on other sites

8 hours ago, lucien2 said:

GPL is disappointing with this one: 342 seconds. Maybe it's all these accesses to expansion RAM?

I used an old mess version (0.136), I can't find how to run it on mame 0.164: eratosthenes sieve.g.bin

With Classic99, it takes 387 seconds.

 

  Reveal hidden contents


        grom    >6000
        data    >aa00,>0100,>0000
        data    menu
        data    >0000,>0000,>0000,>0000
menu    data    >0000
        data    start
        stri    'ERATOSTHENES SIEVE'

upcase  equ     >0018
iter    equ     >8300
count   equ     >8301
size    equ     >8303
i       equ     >8305
flags   equ     >a000
prime   equ     >8307
k       equ     >8309
tmp     equ     >830b
s1      stri    'ITERATIONS'
s2      stri    'PRIMES'
s3      stri    'DONE!'
line    equ     >830d

start
        dst     8191,@size
        st      10,@iter
        
        dst     >0900,@fac
        call    upcase
        
        dclr    @arg
        clr     @arg+2
        st      @iter,@arg+3
        call    print
        
        clr     @tmp
        move    1,g@s1,@tmp+1
        dinc    @arg+4
        move    @tmp,g@s1+1,v*arg+4
        st      1,@line
        
L6      dclr    @count
        dst     @size,@i
        
        st      1,@tmp
L1      st      @tmp,@flags(@i)
        ddec    @i
        dcz     @i
        br      L1
        
        dclr    @i
L3      cz      @flags(@i)
        bs      L2
        dst     @i,@prime
        dadd    @i,@prime
        dadd    3,@prime
        dst     @prime,@k
        dadd    @i,@k
L5      dch     @size,@k
        bs      L4
        clr     @flags(@k)
        dadd    @prime,@k
        b       L5
L4      dinc    @count
L2      dinc    @i
        dch     @size,@i
        br      L3

        st      @line,@arg
        clr     @arg+1  
        dst     @count,@arg+2
        call    print
        clr     @tmp
        move    1,g@s2,@tmp+1
        clr     @arg
        st      @line,@arg+1
        inc     @line
        dsll    5,@arg
        dinc    @arg+4
        dadd    @arg+4,@arg
        move    @tmp,g@s2+1,v*arg
        
        dec     @iter
        cz      @iter
        br      L6
        
        clr     @tmp
        move    1,g@s3,@tmp+1
        clr     @arg
        st      @line,@arg+1
        dsll    5,@arg
        move    @tmp,g@s3+1,v*arg
        
L7      b       L7

* print unsigned word decimal
* arg: row, arg+1: col, arg+2/3: number
print   clr     @fac
        st      @arg,@fac+1
        dsll    5,@fac
        clr     @arg
        dadd    @arg,@fac
        call    >2f7c
        st      @>8367,@fac+3
        clr     @fac+2
        clr     @arg+4
        move    @arg+4,@>8300(@fac+2),v*fac
        rtn

 

 

Are you counting all the time it takes to load character sets and set up in that timing?

I mean I have that same routine in RXB 2022 but is now mostly Assembly!

Link to comment
Share on other sites

3 hours ago, lucien2 said:

For this benchmark program, we could also keep the default character set.

That could improve it by a few seconds, but I was expecting something 1.5x slower than Forth (~150-200 seconds).

I would use GPL and Assembly in ROMs to do things just like XB does.

Link to comment
Share on other sites

4 hours ago, RXB said:

I would use GPL and Assembly in ROMs to do things just like XB does.

Rich - Are your Call Load (when used to poke) and Call Peek operations GPL routines? Seems to me that having fast, assembly equivalents would be very useful. It would then be practical to set up the equivalent of integer arrays larger than are possible with ordinary DIM statements - just as I did with Cortex BASIC. The original Xbasic calls are too slow for that purpose. (Although perhaps that wouldn't be much help vis speed - as anything to be poked or handled after being peeked would have to be floating point...but it would allow more compact data structures when integer values are sufficient.)

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...