Benchmarking Languages

JamesD · June 15, 2017

...

I would pose that using a fully integrated system like Turbo Forth or FBForth would be just as productive (once you grok the language).

Living inside the Forth REPL with editor your finger tips and the ability to test interactively every sub-routine, ASM fragment, variable, constant, memory location or data structure is extremely productive.

It has more front end loading than Pascal of course, but the typical process is to write up a set of routines that elevates the programming level, very much like you would in Pascal, but it is a low level language out of the box

When I was learning Forth I was surprised how some of Wirth's thinking and Charles Moore's thinking overlapped. Albeit one is a European academic and the other is a self professed "computer cowboy". :-)

(like building up the code from the bottom with no forward references, although both languages provided ways around that ideal)

While such an integrated environment would be more productive, Forth seems to take a little more thought than Pascal.

You have to train yourself to think a little different in Forth where Pascal works similar to other programming languages.

So I think Pascal would still be more productive.

The Wiki doesn't seen to cover the history of the P-Machine and Forth engine.

I seem to remember there being a similar compiler/interpreter for another language in the late 60s and that's probably where Forth and Pascal inherited the idea and why they both seemed to arise at the same time. BASIC maybe? I think that arouse out of the same group that wrote that educational environment using graphics terminals, speech, etc... the name escapes me at the moment. Out of Chicago???

+TheBF · June 15, 2017

There is a pretty big article on the History of Forth here: https://www.forth.com/resources/forth-programming-language/

Chuck was probably aware of the things going on around him, but from what I gather he just hated Fortran. :-)

And assembler was a slow development process. So he made Forth originally as just an interpreter to simplify batch jobs.

LOL. From there he figured out the threaded compiler concepts.

It seems the stack virtual machine has been re-invented independently by many people. Post-script, Java, Forth, Joy etc.

Here is the quote that shows where Chuck lived:

"To Moore, Forth was a personal response to his frustration with existing software tools,

which he viewed as a sort of “tower of Babel” [Moore, 1970a]:"

Edited June 15, 2017 by TheBF

apersson850 · June 15, 2017

Hmpf, I got inspired and intended to try the Pascal thing on my 99/4A. But it's obviously been too long time since last time, because I needed to use a different Pascal system disk (the old one broke down, but I had a copy). So I thought I better make a new copy of the copy, so that there's still two. And managed to overwrite and destroy the, as it seems, only disk I had which held everything related to DSR routines and such stuff for my own cards.

Now I'm not in the mood for any retro computing any more... :_(

matthew180 · June 15, 2017

You are far more experienced than I. Can you explain where the roadblocks are without having to re-write it for me?

I explain it here: http://atariage.com/forums/topic/162941-assembly-on-the-994a/page-15?do=findComment&comment=3782641

Read a few posts before and after that one. The main savings in your routine will be not having all the SWPB instructions. Also, if you are writing multiple consecutive bytes to the VDP, setting up the address every time is a waste of time. The VDP's address register increments every time your read or write to the VDP, so for block read/writes you only need to set the address once.

Having a "single byte write" function that takes a VRAM address and byte to write, is a waste. A better use of functions would be to have VRAM "read address setup" and "write address setup" functions. Then it is just a single instruction to read or write your byte. If you are reading/writing blocks (or the same byte to multiple locations, like clearing the screen, etc.), then you don't have any address overhead in the loop.

If you need to read/write a lot of bytes to VRAM in different locations, then you just call your address setup function and then read/write your byte. But at least in this case you realize you are setting the address for every byte, instead of that detail being hidden away in a function.

apersson850 · June 15, 2017

In this particular case it's the question about updating the position of the same sprite over and over again, so it makes sense to rewrite the address each time.

matthew180 · June 16, 2017

I see, then yes, in that case you would have to set-up the address every time. However, all the SWPB instructions are not needed, as discussed in the Assembly thread. So there are still some optimizations to be had.

+TheBF · June 16, 2017

I see, then yes, in that case you would have to set-up the address every time. However, all the SWPB instructions are not needed, as discussed in the Assembly thread. So there are still some optimizations to be had.

Yes that was what you had the benchmark do.

These are clever tricks. I always knew it could be done, but thought it would be slower than using the instructions as designed.

Wow. What a sad story for the 9900 designer. :-)

So if I read the 9900 manual correctly:

Using SWPB followed by MOVB R0,@>xxxx is 10+14+8 = 32 cycles, 6 bytes

but using MOVB @>XXXX,@>yyyy adds is 14+8+8= 30 cycles, 6 bytes

Correct?

Edited June 16, 2017 by TheBF

matthew180 · June 16, 2017

It is not the speed of SWPB vs MOVB, it is that you don't need SWPB at all if you set thing up differently.

; Assumes CPU Workspace is at >8300, otherwise adjust >8301 value to
; reference address of R0 LSB.
;
; R0 - VRAM address
; R1 - MSB contains byte to write to VRAM

MOVB @>8301,@>8C02  ; Send the address-LSB (LSB of R0) to the 9918A.
ORI  R0,>4000       ; Set MS-bits of R0 to "01"
MOVB R0,@>8C02      ; Send the MSB of the address to set-up.
MOVB R1,@>8C00      ; Write byte to VRAM

I don't know how you would write that in Forth, but that is the routine in straight assembly. You can also forgo the ORI instruction if you require the programmer to set the two MS-bits to "01" as part of the address.

+TheBF · June 16, 2017

Ahh.... but you must pay for your sins for using register sorcery! :spidey:

I just realized that by referencing Registers as memory it makes all my code routines break as soon as I jump to a new task which has a new work space.

Now I could use a slower muli-tasker that stays in 1 workspace, but how cool is it to have 20uS task switch on a TI-99?

Edited June 16, 2017 by TheBF

+TheBF · June 16, 2017

It is not the speed of SWPB vs MOVB, it is that you don't need SWPB at all if you set thing up differently.
; Assumes CPU Workspace is at >8300, otherwise adjust >8301 value to
; reference address of R0 LSB.
;
; R0 - VRAM address
; R1 - MSB contains byte to write to VRAM

MOVB @>8301,@>8C02  ; Send the address-LSB (LSB of R0) to the 9918A.
ORI  R0,>4000       ; Set MS-bits of R0 to "01"
MOVB R0,@>8C02      ; Send the MSB of the address to set-up.
MOVB R1,@>8C00      ; Write byte to VRAM
I don't know how you would write that in Forth, but that is the routine in straight assembly. You can also forgo the ORI instruction if you require the programmer to set the two MS-bits to "01" as part of the address.

Actually the code I provided was Assembler, just a reverse Polish notation assembler. :-)

A very cool concept in Forth is the concept of compiling being nothing more than putting numbers into memory. (hopefully the correct numbers)

There is an operator called "comma" and it looks like this ,

It is a number "compiler" in Forth. It takes a number from the stack and puts in in the next available memory location and incs the memory pointer by 2 bytes (for 16 bit machines)

Typical usage would be just like a data statement

HEX

CREATE MYDATA DEAD , BEEF , ADDA , FFFF ,

MYDATA is equivalent to a label. Notice how you need to last comma to "compile" the last number into memory.

With that little comma and some logic you can make an assembler in an afternoon. The weird thing is it's not a progam! Each mnemonic is it's own little assembler for 1 instruction.

It's a totally different way to think about translating text to machine code.

But since arguments go on the stack the registers come first then the instruction.

</oldmanlecture>

We now return to our regularly scheduled program.

+TheBF · June 16, 2017

It is not the speed of SWPB vs MOVB, it is that you don't need SWPB at all if you set thing up differently.

But SWPB+MOVB R0, <symbolic> is the same size as MOVB <symbolic>,<symbolic>

and only uses 2 more cycles. Or am I missing something else?

+jedimatt42 · June 16, 2017

Benchmarking aside, I agree your task switcher is cool, but may need to consider the platform architecture a bit...

That sort of memory access pattern is used in a lot of legacy code. For instance, code that is called with a BLWP knows that R13 is the WP of the caller, so something like MOVB x,@1(R13) will copy a byte to LSB of R0 in caller's workspace. So, I'd say it isn't sorcery anymore, remembering that the workspace is not a set of registers. Treating them like registers is just a convenience for people used to other CPUs. The 9900 only has 3 registers to my understanding, and then a bunch of instructions that are optimized for parameter addreses that are offsets of the workspace register so another full word doesn't have to be read to execute the instruction.

Then my real concern:

Slowing down your multi-tasker won't help you. You need to have safe points... If a routine is relying on the incrementors in the VDP chip or a GROM chip to optimize addressing of these devices as they were designed to be used, then you need to allow a routine to prevent interruption so it can trust the state of the incrementor.

Even if you set the write address before every VDP byte transfer, what stops the task switcher from letting another routine move that write address before restoring back to your first routine that assumes the address is already set correctly. You could say only thread-0 can access VDP. But then your other threads can't take advantage of that even as a RAM resource.

Before you drive yourself nuts debugging code running in the task switcher, you should make sure you can wrap routines in safe points...

It would be interesting to see a video of two threads in your task switcher, one repeatedly filling the top half of the screen with '-' and the other repeatedly filling the bottom half with '*' - I suspect you'll see out of place characters that wouldn't be out of place without the task switcher. But it is a cool sounding system, so it can be fixed!

Unsolicited advice. And a little off topic.

-M@

Tursi · June 16, 2017

...by exposing everything to the programmer so much complexity melts away.

I wish everyone believed that. I fight that battle all the time, the belief that we should not allow "lessor" programmers access to "more important" functions just costs everyone time and results in poorer code. Why not educate the "lessor" programmers instead?

Tursi · June 16, 2017

But SWPB+MOVB R0, <symbolic> is the same size as MOVB <symbolic>,<symbolic>

and only uses 2 more cycles. Or am I missing something else?

Normally memory access to read the instruction costs an additional 4 cycles unless you're in 16-bit memory. As a result of that alone, it's almost always faster on the TI to use the sequence with the fewest number of instructions, even if the instructions themselves are more expensive. It's even stronger if you go for the shortest sequence too - ie: if you could permanently store the VDP address in a register. In this case I think you're right (I didn't double-check), and 2 cycles probably isn't worth quibbling over - on this test you'd end up with, what, 166,400 cycles more, which is only about 50ms.

matthew180 · June 16, 2017

Ahh.... but you must pay for your sins for using register sorcery!

I just realized that by referencing Registers as memory it makes all my code routines break as soon as I jump to a new task which has a new work space.

Now I could use a slower muli-tasker that stays in 1 workspace, but how cool is it to have 20uS task switch on a TI-99?

Yes, I do pay for those sins.

Well, BLWP puts the old workspace pointer in the new workspace at R13, so if you pass in the two parameters via the *old workspace* R0 and R1, then you can reference them through R13. I did not pay attention how your were passing the parameters though, since you are using BLWP. An alternative would be to not change workspace when the VDP routines are involved.

But SWPB+MOVB R0, <symbolic> is the same size as MOVB <symbolic>,<symbolic>

and only uses 2 more cycles. Or am I missing something else?

You have the overhead of having to execute *another* SWPB on R0 to get the address-MSB back into R0-MSB.

If you pass in the address already in the format LSB/MSB, then having a single SWPB would be fine, and you are only out two cycles.

+TheBF · June 16, 2017

Benchmarking aside, I agree your task switcher is cool, but may need to consider the platform architecture a bit...

That sort of memory access pattern is used in a lot of legacy code. For instance, code that is called with a BLWP knows that R13 is the WP of the caller, so something like MOVB x,@1(R13) will copy a byte to LSB of R0 in caller's workspace. So, I'd say it isn't sorcery anymore, remembering that the workspace is not a set of registers. Treating them like registers is just a convenience for people used to other CPUs. The 9900 only has 3 registers to my understanding, and then a bunch of instructions that are optimized for parameter addreses that are offsets of the workspace register so another full word doesn't have to be read to execute the instruction.

Then my real concern:

Slowing down your multi-tasker won't help you. You need to have safe points... If a routine is relying on the incrementors in the VDP chip or a GROM chip to optimize addressing of these devices as they were designed to be used, then you need to allow a routine to prevent interruption so it can trust the state of the incrementor.

Even if you set the write address before every VDP byte transfer, what stops the task switcher from letting another routine move that write address before restoring back to your first routine that assumes the address is already set correctly. You could say only thread-0 can access VDP. But then your other threads can't take advantage of that even as a RAM resource.

Before you drive yourself nuts debugging code running in the task switcher, you should make sure you can wrap routines in safe points...

It would be interesting to see a video of two threads in your task switcher, one repeatedly filling the top half of the screen with '-' and the other repeatedly filling the bottom half with '*' - I suspect you'll see out of place characters that wouldn't be out of place without the task switcher. But it is a cool sounding system, so it can be fixed!

Unsolicited advice. And a little off topic.

-M@

Very sound advice for a preemptive task switcher.

The Classic Forth task switcher is cooperative relying on much faster context switch times than is normal and switch only after I/O routines run to completion.

All I/O routines are keep fast as well to allow smooth operation.

Remarkably simple but also efficient. Forth Inc's Polyforth system could hang 150 dumb terminals on a i386 machine and users didn't complain. :twisted:

I know that sounds ridiculous, but they did it.

So I will create a demo for you, but I know that there will not be an issue , because there are no surprises with these systems.

They switch when you say and only when you say with a function called PAUSE but at very atomic intervals compared to what you are used to with other cooperative tasking systems.

But to your point in your example, referencing via a register (R13) will not be a problem. But dead reckoning of an address in the code as a literal is.

Edited June 16, 2017 by TheBF

JamesD · June 16, 2017

For preemptive multi-tasking,you need locks/semaphores around hardware accesses on all machines.
That's a given and certainly not unique to the TI.
Then you need signals to trigger tasks when a resource becomes available.

The AmigaOS is worth a look if you want a good example of how to do it.

apersson850 · June 16, 2017

You realize that you are talking about two different things here, right?

The requirements for preemptive vs. cooperative task switching are quite different.

UCSD Pascal for the 99/4A was the first environment available for that machine, at least as far as I know, that supported concurrency. Only cooperative concurrency, though. I did an attempt to change it to preemptive, but I ran into issues with code which seemed to be unprotected in the unit HEAPOPS, and without the source for the operating system, I found it virtually impossible to correct that.

Note that if you run code that reads from the VDP, then you need one instruction between writing the most significant byte of the VDP address and reading from the VDP. Otherwise your code will not work on machines with 16-bit RAM without wait states.

+TheBF · June 16, 2017

Yes very different. A good cooperative tasker sounds like it would not be reliable but with fine grained control in the Forth environment you can tune the tasks very precisely.

And when external inputs must be dealt with you only need to put the data acquisition routine(s) on an interrupt and queue the data somewhere.

And problems with locks and semaphores disappear because you design routines to run to completion or exit at convenient places.

There are some pretty big projects running this way on Forth systems.

So here is a demo video that runs this code that just flails away at the screen with '-' and '.' while the console is still running.

There is a bar graph task, showing the CPU loading, installed but I didn't run it in the video.

\ DEMO BEGINS
CREATE TASK1   USIZE ALLOT
CREATE TASK2   USIZE ALLOT
CREATE TASK3   USIZE ALLOT

TASK1 FORK
TASK2 FORK
TASK3 FORK

DECIMAL
CHAR * CONSTANT '*'
CHAR ! CONSTANT '!'
CHAR - CONSTANT '-'
CHAR . CONSTANT '.'
VARIABLE SPEED   1 SPEED !

: SET_ROWS  ( n -- end start )   \ create addresses for a DO/LOOP
            C/L @ *              \ calc. the start address
            DUP  C/L @ 4* +  1-  \ add start to VDP address of 4 more lines less 1
            SWAP ;

\ erase 4 rows starting at row(n)
: CLR_ROWS  ( n -- ) C/L @ *  128 BL VFILL ;

: THING1
       BEGIN
            '*' SET# 3 1 COLOR  \ set# returns the charset#
                                \ cuz I can't remember them
            0 CLR_ROWS
            0 SET_ROWS
            DO
               '.'  I  VC!   \ VC! = "video char store"
               PAUSE         \ service tasks
            LOOP
            PAUSE            \ be "nice"
       AGAIN ;

: THING2
       BEGIN
            '-' SET# 11 1 COLOR
            11 CLR_ROWS
            11 SET_ROWS
            DO
               '-'   I  VC!
               PAUSE
            LOOP
            PAUSE
       AGAIN ;
       
: PGRAPH   \ bar graph of CPU loading
        BEGIN
            1 5 BL  32 HCHAR PAUSE
            0 5 30  PULSE 32 MOD 1+ HCHAR
            6 MS
        AGAIN ;

\ ' finds the execution address of a routine
 ' THING1 TASK1 ASSIGN
 ' THING2 TASK2 ASSIGN
 ' PGRAPH TASK3 ASSIGN

2taskdemo.mov

Edited June 16, 2017 by TheBF

Willsy · June 16, 2017

I re-wrote my Cross-compiler to generate direct threaded code and to test it out I re-visited the now famous Tursi Benchmark.

Here are the results of the optimized version, which only improved the speed by about 14% versus the Indirect threaded version.

VARIABLE CNT
HEX
 300 CONSTANT $300
 301 CONSTANT $301
DECIMAL 
: BENCH
      GRAPHICS
      1 MAGNIFY
      2 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO  I $301 VC!      LOOP
           175 0 DO  I $300 VC!      LOOP
           0 239 DO  I $301 VC!  -1 +LOOP
           0 175 DO  I $300 VC!  -1 +LOOP
          -1 CNT +!
      REPEAT
;  ( DTC 26.3 seconds versus 30 seconds with ITC compiler)
   ( DTC w/fast VC!  25.2)

Then I realized that this is not how you would do this in Forth. You would add something to the compiler to do the job.

So I wrote VDP_IC! , is Forth naming for VDP loop-index character store.

Here is the new word added to Forth. It takes 2 parameters, loops and VDP address.

*EDIT* fixed the obvious mistakes, putting the TOS ->R0 and 0 LIMI in the loop

CODE: VDP_IC! ( loops Vaddr  -- )       \ Video CHAR store
              R3 POP,                   \ POP loop counter into R3
              TOS R0 MOV,             \ need address in R0 to call WMODE
              0 LIMI,
              BEGIN,
                R0 4000  ORI,           \ set control bits to write mode (01)
                R0 SWPB,                \ R0= VDP-adr, set the VDP WRITE address
                R0 VDPWA @@ MOVB,       \ send low byte of vdp ram write address
                R0 SWPB,
                R0 VDPWA @@ MOVB,       \ send high byte of vdp ram write address
                R3 R5 MOV,
                R5 SWPB,
                R5 VDPWD @@ MOVB,       \ write char to vdp data port

                R3 DEC,
              EQ UNTIL,
              TOS POP,                  \ refill TOS
              2 LIMI,
              NEXT,
              END-CODE

And so with this optimization which is more in keeping with how you do Forth here are the results.

: BENCH2
      GRAPHICS
      1 MAGNIFY
      2 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ WHILE
           240 $301 VDP_IC!
           175 $300 VDP_IC!
           239 $301 VDP_IC!
           175 $300 VDP_IC!
           CNT 1-!
      REPEAT
;  ( DTC 6.5 seconds, faster after the editing VDP_IC!)

I'll have a go at this tonight. I think the TF and fbForth times will be very similar though, as the majority of the work is being done in "assembly land".

+TheBF · June 16, 2017

I'll have a go at this tonight. I think the TF and fbForth times will be very similar though, as the majority of the work is being done in "assembly land".

Cool!

Ya it's not too tricky to get started.

The gotchas come on things like DEFER and TO where you have to remember to skip past the code fragment at the front of the word to get to the parameters.

And if you plug in the addresses of code words anywhere you need remember to NOT skip over anything because there is nothing to skip.

And it does make things a little bigger. On my system it added 220 or so bytes... not good news for a packed cartridge.

apersson850 · June 30, 2017

Now when I got my own RAMdisk working in the system again, after accidentally wiping out all drivers before, I tried the benchmark.

The optimized version (which is the only one that makes sense in the Pascal implementation) runs in 683 seconds on my machine, which has memory without wait states. I presume the 780 seconds were measured on a conventional memory expansion?

The faster memory will of course benefit other languages too. How much depends on how much of their access that is in standard 32 K RAM expansion, of course.

It's obvious that it's the sprite access through the SPRITE library in the p-system that's slow. On the other hand, it's more comprehensive than what you get in Extended BASIC, for example.

I've never used sprites for anything serious with the p-system. But if I feel I have the time I'll check how the p-system behaves if you short-circuit the library's sprite handling and write to the VDP memory directly.

By the way, the first edition runs in 8254 seconds on my machine. Are you sure the time of 7300 seconds is correct? I modified the program to measure the time with my real time clock, so I know my timing is correct.

Edited June 30, 2017 by apersson850

+TheBF · July 1, 2017

Wow. That is very peculiar to me. If you replaced the P-code system, a stack VM with a Forth stack base VM I am certain that Pascal code compiled to Forth would still run much faster than your results even with the run-time tests that Pascal incorporates.

Does this indicate that there is a significant amount of GPL in the P-code interpreter?

apersson850 · July 1, 2017

There is not a single GPL instruction running in the PME. The P-Machine Emulator runs in native code, of which some is in the RAM at 8300H to optimize speed.

The problem with the chosen benchmark is that it mainly tests the system's unit sprite, which I suspect is written to implement a more advanced automatic motion/sprite redefinition system than most, and doesn't care much about efficiency when doing manual changes.

Edited July 1, 2017 by apersson850

+Vorticon · July 1, 2017

Now when I got my own RAMdisk working in the system again, after accidentally wiping out all drivers before, I tried the benchmark.

The optimized version (which is the only one that makes sense in the Pascal implementation) runs in 683 seconds on my machine, which has memory without wait states. I presume the 780 seconds were measured on a conventional memory expansion?

The faster memory will of course benefit other languages too. How much depends on how much of their access that is in standard 32 K RAM expansion, of course.

It's obvious that it's the sprite access through the SPRITE library in the p-system that's slow. On the other hand, it's more comprehensive than what you get in Extended BASIC, for example.

I've never used sprites for anything serious with the p-system. But if I feel I have the time I'll check how the p-system behaves if you short-circuit the library's sprite handling and write to the VDP memory directly.

By the way, the first edition runs in 8254 seconds on my machine. Are you sure the time of 7300 seconds is correct? I modified the program to measure the time with my real time clock, so I know my timing is correct.

Yes, I used the 32K conventional RAM for testing. I measured the time with a good old fashioned chronometer manually. Not sure why there is such a significant time discrepancy with the unoptimized version on your system, even accounting for my reaction time in starting and stopping the clock...

Benchmarking Languages

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members