
Camel99 Forth Information goes here


TheBF


Well didn't this open an interesting can of worms. :) 

 

My great improvement was short lived.

On my Windows 10 computer using Classic99 and Camel99 Forth 2.66 I get a different timing than @Speccery did.

I should have timed it first on my machine before I even started working on improvements (doh!) but here is the Apples to Apples comparison:

 

So after seeing some improvement re-working the DO LOOP with an inline NEXT interpreter I wondered if I could use some of that extra space to speed up other primitives?

 

Next I found this reference table, which gives some insight into which Forth words run most often, in "Stack Computers: The New Wave" by Phil Koopman.

6.3.1 Dynamic instruction frequencies

NAMES           FRAC     LIFE     MATH  COMPILE      AVE
CALL           11.16%   12.73%   12.59%   12.36%   12.21%
EXIT           11.07%   12.72%   12.55%   10.60%   11.74%
VARIABLE        7.63%   10.30%    2.26%    1.65%    5.46%
@               7.49%    2.05%    0.96%   11.09%    5.40%
0BRANCH         3.39%    6.38%    3.23%    6.11%    4.78%
LIT             3.94%    5.22%    4.92%    4.09%    4.54%
+               3.41%   10.45%    0.60%    2.26%    4.18%
SWAP            4.43%    2.99%    7.00%    1.17%    3.90%
R>              2.05%    0.00%   11.28%    2.23%    3.89%
>R              2.05%    0.00%   11.28%    2.16%    3.87%
CONSTANT        3.92%    3.50%    2.78%    4.50%    3.68%
DUP             4.08%    0.45%    1.88%    5.78%    3.05%
ROT             4.05%    0.00%    4.61%    0.48%    2.29%
USER            0.07%    0.00%    0.06%    8.59%    2.18%
C@              0.00%    7.52%    0.01%    0.36%    1.97%
I               0.58%    6.66%    0.01%    0.23%    1.87%
=               0.33%    4.48%    0.01%    1.87%    1.67%
AND             0.17%    3.12%    3.14%    0.04%    1.61%
BRANCH          1.61%    1.57%    0.72%    2.26%    1.54%
EXECUTE         0.14%    0.00%    0.02%    2.45%    0.65%

I already have 0BRANCH, BRANCH, EXIT, CALL (DOCOL), LIT, @ and DROP  in 16 bit RAM.

 

Using this table I changed the NEXT macro in each of the following words to use the ILNEXT macro. (inline NEXT, 3 instructions)

DOVAR,  +, SWAP,  R> , >R,  DOCON , ROT,  DOUSER,  C@, I and =.
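
As a rough check on how much of a typical instruction stream these eleven words cover, the AVE column of the table can simply be summed. This is a quick sketch in Python rather than Forth, and it assumes that DOVAR, DOCON and DOUSER correspond to the VARIABLE, CONSTANT and USER rows of Koopman's table.

```python
# AVE column of Koopman's table 6.3.1 (percent of executed instructions).
# Assumption: DOVAR/DOCON/DOUSER map to the VARIABLE/CONSTANT/USER rows.
AVE = {
    "VARIABLE": 5.46, "+": 4.18, "SWAP": 3.90, "R>": 3.89, ">R": 3.87,
    "CONSTANT": 3.68, "ROT": 2.29, "USER": 2.18, "C@": 1.97, "I": 1.87,
    "=": 1.67,
}

def coverage(words):
    """Sum the average dynamic frequency (in percent) of the given words."""
    return round(sum(AVE[w] for w in words), 2)

print(coverage(AVE))   # share of executed words that now end in an inline NEXT
```

On these averages the newly inlined words account for roughly 35% of executed instructions, on top of the words that were already in 16 bit RAM.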

 

After re-compiling the kernel with these changes, the FIB2-BENCH ran a bit faster again.

Fibonacci  FIB2-BENCH Timings

----------------------------------------

V2.66        1:46.80

V2.67        1:46.03         0.7% better                      ( DO LOOP change)

V2.67b       1:44.50        2.2% better than original   ( inline next on hi usage words) 
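
The percentages in this table can be reproduced from the raw times. A small Python sketch (the helper names are made up for illustration):

```python
def to_seconds(t):
    """Convert an 'm:ss.ss' timing string to seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + float(seconds)

def percent_faster(old, new):
    """Reduction in run time relative to the old time, in percent."""
    return (to_seconds(old) - to_seconds(new)) / to_seconds(old) * 100

for label, t in (("V2.67", "1:46.03"), ("V2.67b", "1:44.50")):
    print(label, round(percent_faster("1:46.80", t), 1))
```

Both figures are measured against the V2.66 time, not against each other.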

 

Since a threaded Forth program spends about 50% of its time running the interpreter NEXT, you can get improvements by removing the branch through a register to get to it and placing it inline, as we can see.

 

But... this consumed 46 bytes in my tiny kernel so is it worth it?

I will play with it more and run some more benchmarks before I make up my mind.

 


In the course of optimizing EMIT for faster screen printing I was amazed that the time for the SEVENs problem benchmark was not reduced by very much.

I remembered that FbForth was doing Lee's version of the benchmark in less than 1 minute.

 

I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler.

I use an intermediate word called MOVEUP that scrolls a chunk of lines at a time, and I call it as many times as needed to scroll all 24 lines.

: MOVEUP ( vaddr -- 'vaddr)
         C/L@ 8* >R       \ compute chunk size. 8* means 8 lines
         HERE 100 +  OVER C/L@ +  ( -- 1stline heap 2ndline)
         OVER R@ VREAD
         OVER R@ VWRITE
         R> +   ; \ 36 bytes

 

I re-wrote the scroll as one Forth word that uses a buffer the size of the screen minus one line.

That brought the time down to 51.5 seconds.

 : SCROLL ( -- )
         PAUSE
         TOPLN                   \ top of VDP screen memory
         C/SCR @ C/L@ - >R       \ C/SCR-1line to rstack
         HERE 100 + OVER C/L@ +  ( -- 1stline heap 2ndline)
         OVER R@ VREAD
         SWAP R> VWRITE
         0 17 AT-XY  VPOS C/L@ BL VFILL  \ SEVENS = 51.5 SECS
;

I don't really want to use such a huge buffer even though it's in un-allocated memory because at some point it will crash into the stack in a big program project. 

This is especially true in 80 column mode.

I also don't want to put the scroll buffer in low RAM since that is so useful for SAMS buffers.

 

Since I can rebuild the kernel in 5 seconds and re-run it on Classic99 I did the experiment to find out how the buffer size affected the speed of the benchmark.

Here is the data.  I think I will stay with my original decision to use an 8 line buffer but at least I know now that in the SEVENs benchmark almost 5 extra seconds are being used just to scroll the screen. Amazing.

Buffer Lines   Sevens Time   Reduction   Notes
 1             01:17.26
 2             01:07.83      -12.21%     Uses do loop
 4             01:02.83      -18.68%     Uses do loop
 8             01:00.90      -21.18%     Uses do loop
 8             01:00.06      -22.26%     MOVEUP MOVEUP MOVEUP
12             00:59.36      -23.17%     MOVEUP MOVEUP
24             00:51.50      -33.34%     Scroll is 1 word
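
The trend in the table follows from how many block transfers a scroll needs. Here is a small Python model of the idea, not the Camel99 code itself: the screen is a list of strings, the slice read stands in for VREAD and the slice assignment for VWRITE. Unlike MOVEUP, this sketch clamps the last chunk so it never reads past the bottom of the screen.

```python
def scroll(screen, chunk):
    """Scroll `screen` (a list of text lines) up by one line using a
    buffer of `chunk` lines. Returns the number of read/write transfers."""
    lines = len(screen)
    transfers = 0
    dest = 0
    while dest < lines - 1:
        n = min(chunk, lines - 1 - dest)        # clamp the final chunk
        buffer = screen[dest + 1:dest + 1 + n]  # stands in for VREAD
        screen[dest:dest + n] = buffer          # stands in for VWRITE
        transfers += 1
        dest += n
    screen[-1] = " " * len(screen[-1])          # blank the bottom line
    return transfers

for chunk in (1, 2, 4, 8, 12, 24):
    rows = ["line %02d" % i for i in range(24)]
    print(chunk, scroll(rows, chunk))
```

Fewer, larger transfers mean fewer trips through the Forth loop and fewer VDP address setups per scroll, which is where the time goes on the small buffers.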

5 hours ago, TheBF said:

I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler.

 

This is probably not terribly responsive, but I use only a one-line buffer with ALC:

*
*** SCROLLING ROUTINE
*
SCROLL MOV  @$SSTRT(U),R0   VRAM addr
       LI   R1,LINBUF       Line buffer
       MOV  @$SWDTH(U),R2   Count
       A    R2,R0           Start at line 2
SCROL1 BLWP @VMBR
       S    R2,R0           One line back to write
       BLWP @VMBW
       A    R2,R0           Two lines ahead for next read
       A    R2,R0
       C    R0,@$SEND(U)    End of screen?
       JL   SCROL1
       MOV  R2,R1           Blank bottom row of screen
       LI   R0,>2000        Blank
       S    @$SEND(U),R2
       NEG  R2              Now contains address of start of last line
       MOV  LINK,R6
       BL   @FILL1          Write the blanks
       B    *R6

If you need details about missing definitions, I can supply them, but the comments will likely suffice.

 

...lee


Thanks. This is very concise.

I should go down that road. Early on to save space I decided to limit functions to the Forth interface only so I can not  BLWP or BL to VMBW or VMBR or FILL.

It's not tricky to change but in the beginning of this journey I had no room left in the 8K. :) 

Hell I was still figuring out how to use the cross-compiler that I made. :) 

 

I have about 80 bytes free in the existing system and I can play games with headless definitions and labels to save space.

So I will take a run at this method too since my social life is somewhat limited these days. 

Did get outside for a walk in the park with my brother in law this afternoon so that was good.


Different environment, but same ole outta memory, I was able to, (and this is probably easier for you guys), but I did manage to make a type of loader for some of my routines to use free upper SAMs today. I was at 1014 bytes free, now after pushing some things up, I'm at 2050 free and 2 routines in upper SAMs.

I'm learning.


1 minute ago, GDMike said:

Different environment, but same ole outta memory, I was able to, (and this is probably easier for you guys), but I did manage to make a type of loader for some of my routines to use free upper SAMs today. I was at 1014 bytes free, now after pushing some things up, I'm at 2050 free and 2 routines in upper SAMs.

I'm learning.

Ya a little different. I start with this tiny 8K piece and then I can add to it after it loads. But squashing everything into the first 8K has been a fun challenge.

I have actually built a version where I don't have any loops or branching in the kernel, but then it compiles those when it starts. :) 

 

( I know what you are thinking. How does the kernel work without loops or branching?  The cross compiler knows how to compile them to make the kernel, but the finished program does not have the BEGIN AGAIN , IF THEN etc. words. Crazy stuff.)

 


When you turn over a stone don't be surprised that you find a worm. :)

In the course of testing my VMBR /VMBW code as sub-routines I found it was slower than normal because I used a WHILE loop structure to protect from cnt=0 conditions.

I changed that and will take responsibility for the risk.   :)

It made things much faster.

CODE: SCRL ( buffer Vaddr len -- )  \ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
       TOS R5 MOV,        \ line length -> R5
      *SP+ R0 MOV,        \ VDPRDaddr -> R0
       R5  R0 ADD,        \ start at line2
       BEGIN,
          R5  TOS MOV,      \ COPY length to R4 for VMBR
         *SP  R1  MOV,      \ buffer -> R1
          RMODE @@ BL,
          VMBR @@ BL,       \ read line2 to buffer
          R5 R0  SUB,       \ One line back to write

          R5 TOS MOV,       \ set counter for the write
         *SP R1  MOV,       \ restore buffer address
          WMODE @@ BL,
          VMBW @@ BL,
          R0 3FFF ANDI,     \ strip off write bit
          R5   R0 ADD,      \ Two lines ahead for next read
          R5   R0 ADD,
          R0 C/SCR @@ CMP,  \ End of screen?
       HI UNTIL,
       TOS POP,             \ drop buffer
       TOS POP,             \ refill TOS register 
       NEXT,
       END-CODE

\ Buffer Lines	Sevens Speed
\   1           	00:58.16    26 bytes bigger
: SCROLL  ( -- )
       HERE 100 +  TOPLN C/L@ SCRL
       0 17 AT-XY  VPOS C/L@ BL VFILL ;

 

It turns out that since I am not using  BLWP,  again to save space, I need a few extra instructions in my loop to reset the control registers.

I call a sub-routine to setup the VDP address each time since I didn't want to push/pop R11.

 

I also decided to erase the last screen line in Forth because it's pretty fast being mostly code words and only one line of code.

 

I pass the buffer, screen and length as parameters since my TOPLN can be in different places in VDP RAM if you use the SCREEN: word to create different VDP text screens.

And I don't keep variables for screen-end and screen-start so parameter passing was simplest.

 

The ALC version is 26 bytes bigger than this Forth code which I created to do an "apple to apples" comparison.


\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines	Sevens Speed
\   1           	01:08.71
: SCROLL ( buffer vaddr -- )
       DUP C/L@ L/SCR * +
       SWAP  ( -- buffer SCRend SCRstart)
       DO
         I  C/L@ +  OVER  C/L@ VREAD
         DUP  I           C/L@ VWRITE
       C/L@ +LOOP
       DROP
       0 17 AT-XY  VPOS C/L@ BL VFILL
;

So we can see that the ALC scroll makes the benchmark program ~15% faster at the cost of 26 bytes at least the way I did the ALC code.

 

To show how much my VDP code improved: the older method that used the 8 line buffer improved from 1:00.6 to 0:55.75, an 8.7% improvement, which I was very happy to see.

It comes in 18 bytes bigger than the single line DO/LOOP method.

 

I will explore what happens now with this improved VDP code and a 2 line buffer which seems like a reasonable trade-off.

 

 

 

 

 

 

 

 


Continuing on the scroll research...

 

Adding this code to SCRL to clear the last line:

\ erase last line  adds 8 bytes, buys .4 seconds on benchmark
       C/SCR @@ R0 MOV,     \ end of screen Vaddr -> R0
       R5 R0 SUB,           \ go back 1 line
       R5 R2 MOV,           \ byte count for VFILL -> R2
       TOS 2000 LI,         \ space char -> R4
       WMODE @@ BL,
      _VFILL @@ BL,

Versus this code in Forth:

 0 17 AT-XY  VPOS C/L@ BL VFILL

ALC for clearing the last line was not worth the trouble since it only improved the benchmark .4 seconds on the very long SEVENS benchmark and consumed an extra 8 bytes.

And I needed to reset the cursor after SCRL completed anyway.

 

I re-wrote my previous Forth SCROLL using the idea of keeping the buffer and VDP address arguments on the stack but I un-rolled the DO/LOOP.

It was a bit faster.  The code below ran the benchmark in 1:08.06 and was 36 bytes smaller than using the ALC scroll. 

 

Example 1:


: MOVEUP ( buffer vaddr -- buffer 'vaddr)
         2DUP C/L@ +  SWAP C/L@ VREAD
         2DUP              C/L@ VWRITE
         C/L@  +  ;

: MOVE8  ( buffer Vaddr -- buffer 'Vaddr)
      MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP ;

     [PUBLIC]

: SCROLL ( -- )
         PAUSE
         HERE 100 + TOPLN  MOVE8 MOVE8 MOVE8   2DROP
         0 17 AT-XY  VPOS C/L@ BL VFILL
;

Re-writing the above to use a four line buffer and creating a code word to return C/L@ 4*  was the only way I could get a faster scroll than the 1 line ALC code.

57 seconds vs 58 in ALC.

 

Size or speed. It's hard to get both.

 

 

 

 

 


And just to beat this horse until it is well and truly dead :) 

 

Let's try ALC scrolling with a 2 line buffer. Since I am using un-allocated memory for the buffer, it was only 3 extra instructions.

This dropped the benchmark time from 58.16 to 56.00  seconds.  Just 1 second off of the Forth time using a 24 line buffer!

Gotta love that assembler.

 

Using all this fancy stuff created a 8,148 byte kernel, so I still have 48 bytes left over!

And I now have VMBW and VMBR as BL callable sub-routines. Nice progress overall

 

And if I really need a slightly smaller kernel I just change the TRUE below to FALSE and the kernel is 8,116 bytes with slower scroll.


\ Scroll in Assembler is a re-write of concept from FbForth Lee Stewart
FALSE [IF]
               [PRIVATE]
CODE: SCRL ( buffer Vaddr len -- )  \ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
       TOS R5 MOV,        \ line length -> R5
      *SP+ R0 MOV,        \ VDPRDaddr -> R0
       R5  R0 ADD,        \ start at line2
       BEGIN,
          R5  TOS MOV,      \ COPY length to R4 for VMBR
          TOS TOS ADD,      \ *read 2 lines
         *SP  R1  MOV,      \ buffer -> R1
          RMODE @@ BL,
          VMBR  @@ BL,      \ read line2 to buffer
          R5 R0  SUB,       \ One line back to write

          R5 TOS MOV,       \ set counter for the write
          TOS TOS ADD,      \ *write 2 lines
         *SP R1  MOV,       \ restore buffer address
          WMODE @@ BL,
          VMBW  @@ BL,
          R0 3FFF ANDI,     \ strip off write bit
          R5   R0 ADD,
          R5   R0 ADD,
          R5   R0 ADD,      \ *advance one extra line
          R0 C/SCR @@ CMP,  \ End of screen?
       HI UNTIL,
       TOS POP,             \ DROP buffer address
       TOS POP,             \ refill TOS register
       NEXT,
       END-CODE
              [PUBLIC]

\ Buffer Lines	Sevens Speed
\   1           	00:58.16    26 bytes bigger
: SCROLL  ( -- )
       HERE 100 +  TOPLN C/L@ SCRL
       0 17 AT-XY  VPOS C/L@ BL VFILL ;


[ELSE]

\                 [PRIVATE]
\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines	Sevens Speed
\   1         	01:08.06
\   2           01:02.00   \ 1:01.00  using 2LINES code word.
\   4	          00:57.28
\   8	          00:55.43

\ CODE: 2LINES ( -- n)
\    TOS PUSH,
\    R1  STWP,
\    2E (R1) TOS MOV,  \ read user var C/L
\    TOS 1 SLA,        \ 2*
\    NEXT,
\   END-CODE
               [PUBLIC]

: SCROLL ( buffer vaddr -- )
       HERE 100 +
       TOPLN C/SCR @  ( -- buffer Vstart len)
       BOUNDS  ( -- buffer SCRend SCRstart)
       DO
         I  C/L@ +  OVER  C/L@  VREAD
         DUP  I           C/L@  VWRITE
       C/L@ +LOOP
       DROP
       0 17 AT-XY  VPOS C/L@ BL VFILL
;


[THEN]

 

 


While updating the list in Benchmarking Languages I started re-reading some of the posts by @matthew180 and @jedimatt42 where I was being soundly scolded for writing inefficient VDP routines.  :)

 

 

At the time I understood what they said but I did not know how to implement it in my system because of my use of different workspaces in the multi-tasker. 

I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

 

A long time back I decided to use the workspace pointer register to define not just my register space but also local variable space above the registers for each task.

I recently re-wrote my character output routine to use more inline code versus Forth and to do that I have to access those local variables for COL and ROW using indexed addressing.

Well... it suddenly occurred to me that I can also access my registers the same way.

 

This gave rise to a re-write of CPUT:  (TOS is R4)

Edit:  Took out 1 more instruction.

CODE: CPUT ( char -- ?)  \ put a char at cursor position, return eol flag
            R1         STWP,    \ workspace is USER area base address
            32 (R1) R2  MOV,    \ vrow -> R2
            2E (R1) R2  MPY,    \ vrow*c/l -> R2:R3 (low word in R3)
            34 (R1) R3 ADD,    \ add vcol
            VPG @@  R3 ADD,    \ add video page address
            0 LIMI,
            7 (R1) 8C02 @@ MOVB,   \ write odd byte from R3
            R3 4000 ORI,
            R3 8C02 @@ MOVB,
            9 (R1) VDPWD @@ MOVB,  \ Odd byte R4, write to screen
            2 LIMI,
            TOS CLR,
            34 (R1)  INC,          \ bump VCOL
            34 (R1)  2E (R1) CMP,  \ compare VCOL = C/L
            EQ IF,
                TOS SETO,          \ set true flag
            ENDIF,
            NEXT,
            END-CODE
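
The two tricks in CPUT can be modelled in a few lines. This Python sketch uses made-up helper names: the first function shows why offsets 7 and 9 from the workspace pointer reach the odd bytes of R3 and R4, and the second shows the two bytes sent to the VDP address port (low byte first, then the high byte with the >4000 bit set for a write).

```python
def reg_odd_byte_offset(n):
    """Offset from the workspace pointer to the odd (low) byte of Rn.
    Each register is one 16-bit word, stored high byte first."""
    return 2 * n + 1

def vdp_address_bytes(addr, write):
    """Bytes sent to the VDP address port, in order: low byte, high byte.
    The address is 14 bits; the >4000 bit marks a write."""
    addr &= 0x3FFF
    if write:
        addr |= 0x4000
    return addr & 0xFF, addr >> 8

print(reg_odd_byte_offset(3), reg_odd_byte_offset(4))
print([hex(b) for b in vdp_address_bytes(0x0123, write=True)])
```

Because the offsets are relative to the workspace pointer, the same code works whichever task's workspace is active.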

 


Armed with this new tool for faster VDP access I had to re-write my VDP driver.

You know I had to. :) 

 

It makes everything just a little more "perky".  I like it.

All the benchmarks that do anything with the screen like the TURSI sprite benchmark, go faster and even compiling is a touch quicker because we can get things in and out of the PAB faster.

Only took me 5 years to get here.  ?

 

It also made me look at some very early code and discover that I had used 4 instructions in my VWTR word where I only needed 2.

I guess I am getting a little better at this Assembly Language thing.

 

With these faster Forth words I don't feel a real need for sub-routine access to VMBR and VMBW anymore.

It pays to listen to the experts.

 

I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

 

Example:  (edit: Found an extra instruction from the old version that was not needed)

\ VSBR Forth style, on the stack
CODE: VC@   ( VDP-adr -- char )  \ Video CHAR fetch
            0 LIMI,
            R1 STWP,
            9 (R1) 8C02 @@ MOVB, \ write odd byte from TOS ie R4
            TOS 8C02 @@ MOVB,    \ write even bytes from TOS
            VDPRD @@ TOS MOVB,   \ READ char from VDP RAM into TOS
            TOS 8 SRL,           \ move the damned byte to correct half of the word
            2 LIMI,
            NEXT,
            END-CODE

 

 

 


12 hours ago, TheBF said:

I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods.  I see you worked it out, even if it is backwards (i.e. written in Forth). ;)

 

12 minutes ago, TheBF said:

I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A).  Then again, I think that one case was on a modified system, so it is probably not possible on a stock console.  The threads about it are here on A.A. if you want to dig around.  The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP.

 

Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP.  You need a CPU clock around 25MHz or faster to overrun the F18A.


3 minutes ago, matthew180 said:

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods.  I see you worked it out, even if it is backwards (i.e. written in Forth). ;)

 

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A).  Then again, I think that one case was on a modified system, so it is probably not possible on a stock console.  The threads about it are here on A.A. if you want to dig around.  The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP.

 

Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP.  You need a CPU clock around 25MHz or faster to overrun the F18A.

... even if it is backwards... 

The good news is the code gets laid down in the correct order. :)

I still find it amazing that you can write functional assembler with structured branching and looping in 200 lines. :) 

 

Thanks for the news on the over-run not being an issue. I only have stock hardware.

 

 


Once you start looking...

 

This is faster again by not using SRL, but instead using CLR and moving the data into the correct side of the register.

\ VSBR Forth style, on the stack
CODE: VC@   ( VDP-adr -- char )  \ Video CHAR fetch
            0 LIMI,
            R1 STWP,
            9 (R1) 8C02 @@ MOVB, \ write odd byte from TOS ie R4
            TOS 8C02 @@ MOVB,    \ write even bytes from TOS
            TOS CLR,               
            VDPRD @@ 9 (R1) MOVB, \  READ char from VDP RAM into TOS
            2 LIMI,
            NEXT,
            END-CODE

I can do this in a number of places...


You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28).  The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP.  The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes.  So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.


1 hour ago, matthew180 said:

You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28).  The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP.  The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes.  So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.

When I first started writing the low level code for this system I had that thing in front of me all the time.  :)

 

 I use shift for the routines called 2* 4* 8* for fast multiplication and  2/ which divides by 2. 

 

The challenge with text is when you grab a byte and Forth wants the byte in the odd byte of the register. 

SRL  8  is pretty slow, but it's even worse to mask with AI and then SWPB.

 

I find it pretty hard to make the old 9900 give you anything for free. :)

 

 


8 hours ago, TheBF said:

SRL  8  is pretty slow, but its even worse to mask with AI and then SWPB.

AI is 14 cycles, SWPB is 10 cycles, so 24 cycles total.  This assumes register addressing to make it comparable to shift, since the shift can only operate on registers.

 

The shift instructions have two forms (three actually, but only two that apply here), using R0 for the count, or the count is a fixed value.

 

If the count is in R0, then the timing is 20+2N (N is the value read from R0).  So in this case it would be 20+2*8 = 36 cycles.

 

If the count is fixed (which is encoded as part of the instruction), then the timing is 12+2C.  So in this case it would be 12+2*8 = 28 cycles.

So, AI + SWPB is still faster, by 4 cycles, than shifting by 8.
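
The arithmetic is easy to tabulate. A small Python sketch using only the formulas and cycle counts quoted in this post (register addressing, no memory wait states assumed):

```python
AI_CYCLES = 14     # as quoted above
SWPB_CYCLES = 10   # as quoted above

def shift_cycles(count, count_in_r0=False):
    """Cycles for a 9900 shift: 12+2C with a fixed count,
    20+2N when the count is taken from R0."""
    base = 20 if count_in_r0 else 12
    return base + 2 * count

print("SRL 8 (fixed count):", shift_cycles(8))
print("SRL 8 (count in R0):", shift_cycles(8, count_in_r0=True))
print("AI + SWPB:          ", AI_CYCLES + SWPB_CYCLES)
```

The break-even point for a fixed count is a shift of 6 bits, where 12+2*6 equals the 24 cycles of the two-instruction sequence.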


Bottom line on all these changes to the VDP driver.

 

I reached into the archive and pulled out Camel99 V2.54 which also used a 2 line buffer for scrolling.

I did a hex DUMP of 1024 bytes from address 0000, starting on the bottom line of the screen.

I timed them by hand so that interrupts would not skew the result.

 

Here are the results:

V2.54      17.45 seconds

V2.67      13.55 seconds

 

That's a 28% improvement. I will keep it.

 


Seeing the Mandelbrot in BASIC made me wonder what it would take to do it in Forth. 

Rosetta Code has a version that uses floating point. Not going there. :) 

I found this excellent version that is all integer math here:  Mandelbrot Set Rendered in ASCII Art Using Forth on 6502 Machine - YouTube 

 

The code is below:

\ Setup constants to remove magic numbers to allow
\ for greater zoom with different scale factors.

\ SOURCE: https://github.com/Martin-H1/Forth-CS-101/blob/master/mandelbrot.fs

DECIMAL
20  CONSTANT MAXITER
-39 CONSTANT MINVAL
 40 CONSTANT MAXVAL
20 5 LSHIFT CONSTANT RESCALE
RESCALE 4 * CONSTANT S_ESCAPE

\ These variables hold values during the escape calculation.
VARIABLE CREAL
VARIABLE CIMAG
VARIABLE ZREAL
VARIABLE ZIMAG
VARIABLE CNT

\ Compute squares, but rescale to remove extra scaling factor.
: ZR_SQ ZREAL @ DUP RESCALE */ ;
: ZI_SQ ZIMAG @ DUP RESCALE */ ;

\ Translate escape count to ascii greyscale.
\ : .CHAR ( n --) S" ..,'~!^:;[/<&?oxOX#  "   DROP + 1  TYPE ;
 : .CHAR ( n --) S" ..,'~!^:;[/<&?oxOX#  "   DROP + C@ EMIT ; \ BF. better :)

\ Numbers above 4 will always escape, so compare to a scaled value.
: ESCAPES?     S_ESCAPE > ;

\ Increment count and compare to max iterations.
: COUNT_AND_TEST?
  CNT @ 1+ DUP CNT !
  MAXITER > ;

\ stores the row column values from the stack for the escape calculation.
: INIT_VARS
   5 LSHIFT DUP CREAL ! ZREAL !
   5 LSHIFT DUP CIMAG ! ZIMAG !
  1 CNT ! ;

\ Performs a single iteration of the escape calculation.
: DOESCAPE
    ZR_SQ ZI_SQ 2DUP +
    ESCAPES? IF
      2DROP
      TRUE
    ELSE
      - CREAL @ +   \ leave result on stack
      ZREAL @ ZIMAG @ RESCALE */ 1 LSHIFT
      CIMAG @ + ZIMAG !
      ZREAL !                   \ Store stack item into ZREAL
      COUNT_AND_TEST?
    THEN ;

\ Iterates on a single cell to compute its escape factor.
: DOCELL
  INIT_VARS
  BEGIN
    DOESCAPE
  UNTIL
  CNT @ .CHAR ;

\ For each cell in a row.
: DOROW
  MAXVAL MINVAL DO
    DUP I
    DOCELL
  LOOP
  DROP ;

\ For each row in the set.
: MANDELBROT
  CR
  MAXVAL MINVAL DO
    I DOROW  CR
  LOOP ;

\ Run the computation.
MANDELBROT

 

 

It works perfectly in GForth as you can see in the video. It compiles on Camel99 Forth and TurboForth but does not render correctly.

I suspect it is a big-endian/ little-endian problem but I have not figured it out yet. :( 
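
For anyone following the arithmetic, here is a Python sketch of the fixed-point scheme the program uses. It is not a diagnosis of the rendering problem. It only shows the sizes involved: coordinates are shifted left 5 bits, RESCALE is 20 shifted left 5 bits, and the squares rely on `*/` keeping a double-width intermediate. The helper names are made up, and Python's `//` floors on negative values, which a given Forth's `*/` may or may not do.

```python
RESCALE = 20 << 5          # 640, same as: 20 5 LSHIFT CONSTANT RESCALE
S_ESCAPE = RESCALE * 4     # 2560, the scaled escape radius squared

def to_fixed(n):
    """Scale a loop index the way INIT_VARS does (5 LSHIFT)."""
    return n << 5

def star_slash(a, b, c):
    """Model of */ : multiply first, then divide, with no overflow
    in between (assumption: floored division, as Python's // does)."""
    return (a * b) // c

def fits_16_bits(n):
    return -32768 <= n <= 32767

z = to_fixed(40)           # largest starting value, MAXVAL
print(z, z * z, fits_16_bits(z * z))   # the product needs more than one cell
print(star_slash(z, z, RESCALE))       # the rescaled square fits again
```

The intermediate product for the corner values is far larger than a 16-bit cell, so the result depends on `*/` really carrying a 32-bit intermediate on a 16-bit Forth.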


Over in another topic 

Mr @vol was talking about making the interrupt poll the 9901 timer.  I thought that was just a splendid idea so here is a version in Forth.

Since Camel99 starts the timer when it boots I just needed to write the interrupt handler to read the timer.

 

I think this is correct, but man counting at 21.3 uS per tick is REALLY fast!

 

\ Interrupt polled 9901 timer

NEEDS MOV,     FROM DSK1.LOWTOOLS
NEEDS INSTALL  FROM DSK1.ISRSUPPORT

DECIMAL
\ ISR workspace registers
\ R0,R1  32 bit timer variable
\ R2,    difference register
\ R3     temp
\ R4     previous time reading
CREATE IWKSP  16 CELLS ALLOT  IWKSP 16 CELLS 0 FILL

CODE READ9901
             0 LIMI,
             IWKSP LWPI,
             R2 CLR,
             R12 2 LI,      \ load 9901 Timer CRU address
            -1 SBO,         \ SET bit 0 TO 1, Enter timer mode
             R2 14 STCR,    \ READ TIMER (14 bits)
            -1 SBZ,         \ RESET bit 0 to 0, exit timer mode
             2 LIMI,
             R4 R3 MOV,     \ old reading -> temp
             R2 R4 MOV,     \ save this read for next time
             R3 R2 SUB,     \ compute ticks since last read
             R2 ABS,
             R2 R1 ADD,     \ add ticks to timer registers
             OC IF,
                 R0 INC,    \ deal with overflow to make 32bit value
             ENDIF,
         HEX 83E0 LWPI,     \ return to GPL workspace
             RT,
ENDCODE

REMOVE-TOOLS

: T   ( -- ) IWKSP 2@  ;  \ read the workspace as memory

: COLD   0 INSTALL  COLD ;  \ disable interrupt before leaving Forth

ISR' READ9901 INSTALL

: TEST   PAGE  BEGIN  10 10 AT-XY  T DU.   ?TERMINAL UNTIL ;
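
A note on the tick arithmetic, sketched in Python. The assumptions here are mine, not from the ISR above: the 9901 timer is clocked at 3 MHz divided by 64 (which gives the 21.3 uS figure), it counts down, and it runs over the full 14-bit range so differences can be taken modulo 2^14. Under those assumptions a masked subtraction stays correct when the counter wraps between two reads, where an absolute difference would not.

```python
TICK_US = 64 / 3.0      # microseconds per tick, assuming a 3 MHz clock / 64
MASK = 0x3FFF           # the timer value is 14 bits wide

def elapsed_ticks(prev, now):
    """Ticks between two reads of a free-running 14-bit down-counter.
    Valid as long as the reads are less than one full period apart."""
    return (prev - now) & MASK

print(round(TICK_US, 1))                    # microseconds per tick
print(round((MASK + 1) * TICK_US / 1000))   # milliseconds per full period
print(elapsed_ticks(100, 90))               # no wrap between reads
print(elapsed_ticks(5, 0x3FFB))             # counter wrapped between reads
```

With an interrupt every 1/60 second the reads are far less than one period apart, so the modulo form would be safe in this setting.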

 

 


I took a look at my inline optimizer to see if it was possible to optimize Forth loop structures as code.

While I was at it, things were getting a little complicated so I reduced the business end of the process, copying kernel code snippets, into one nice word called CODE, .

I can now optimize DO LOOP , BEGIN UNTIL and BEGIN AGAIN with this version.

I don't think I will go any further. In theory one could build a recursive descent compiler over the Forth code but I think that's above my pay grade. :) 

 

In order to keep the optimized loop info from getting mixed up with the Forth DATA stack, I make a little secondary LIFO called a control stack.

This made it much simpler and I can do nested loops without losing my mind managing mixed data on the Forth data stack.

 

The video shows the difference in speed for these test 64K iteration loops:

 

: COUNTDN     FFFF         BEGIN 1- DUP 0= UNTIL DROP  ;
: OPTCOUNTDN  FFFF INLINE[ BEGIN 1- DUP 0= UNTIL DROP ] ;

: FORTHLOOP   FFFF 0 DO LOOP ;
: OPTLOOP     INLINE[ FFFF 0  DO LOOP ] ;

 


\ **not portable Forth code**  Uses TMS9900/CAMEL99 CARNAL Knowledge

NEEDS .S     FROM DSK1.TOOLS
NEEDS CASE   FROM DSK1.CASE
NEEDS LIFO:  FROM DSK1.STACKS
NEEDS ELAPSE FROM DSK1.ELAPSE

MARKER /INLINE

8 LIFO: CS     \ small control flow stack for loops and branching
: >CS  ( n -- )  CS PUSH ;
: CS>  ( -- n )  CS POP ;

HEX
\ need NORMAL copies of words that are WEIRD in the Camel99 kernel
CODE @      C114 ,         NEXT, ENDCODE
CODE C@     D114 , 0984 ,  NEXT, ENDCODE
CODE DROP   C136 ,         NEXT, ENDCODE

\ Heap management words
: THERE  ( -- addr) H @ ;  \ returns end of Target memory in HEAP
: HALLOT ( n -- )   H +! ; \ Allocate n bytes of target memory.
: T,     ( n -- )   THERE ! 2 HALLOT ;         \ "target compile" n into memory
: NEW    ( -- ) 2000 2000 0 FILL   2000 H ! ;  \ clean HEAP memory

045A CONSTANT 'NEXT'  \ 9900 CODE for B *R10   Camel99 Forth's NEXT code

: CODE,  ( xt --)  \ Read code word from kernel, compile into target memory
           >BODY
           DUP 80 CELLS +   \ set a max size for any code fragment
           SWAP   ( -- IPend IPstart)
           BEGIN
              DUP @ 'NEXT' <>  \ the instruction is not 'NEXT'
           WHILE
             DUP @ ( -- IP instruction)
             T,  \ compile instruction
             CELL+  \ advance IP
             2DUP < ABORT" End of code not found"
           REPEAT
           2DROP
;
\ now we can steal code word from the kernel and compile it to target memory
: DUP,   ['] DUP  CODE, ;
: DROP,  ['] DROP CODE, ;

\ LIT,   DUP TOS and LI n into R4
: LIT,      ( n -- ) DUP,  0204 T, ( n) T, ;

\ <DO> is the preamble to setup return stack. Runs only once.
\ THERE is the address that loop jumps back to
: DO,  ( -- there)  ['] <DO> CODE, THERE >CS  ;

\ store a byte offset in odd byte of addr.
\ Addr is the location of Jump instruction
: RESOLVE ( addr offset --) 2- 2/ SWAP 1+ C! ;

\ compute offset from addr addr' & complete the jump instruction
: <BACK   ( addr addr' -- ) TUCK -  RESOLVE ;

\ compile misc. jump instructions with no offset.
: JMP,     1000 T, ;
: JNO,     1900 T, ;
: JEQ,     1300 T, ;
: JNE,     1600 T, ;

: LOOP,
          0597 T,             \ *RP INC,
          CS>  THERE JNO, <BACK   \ compute offset between 2 THERE addresses
          ['] UNLOOP CODE,    \ collapse stack frame
          DROP                \ ?? not sure what's going here
;

: +LOOP,
          A5CA T,  \ TOS *RP ADD,
          DROP,    \ don't need TOS value anymore
          LOOP,    \ compile loop code
;

: AGAIN,   CS> THERE JMP, <BACK ;

: UNTIL,  ( 8104 T, )
           1302 T,  \ 2 JEQ,
           DROP,
           AGAIN,
           DROP,
;

\ CFA of a code word contains the address of the next cell
: NOTCODE? ( XT -- ?)  DUP @ 2- - ;

: OPT-FORTH ( cfa)
       ['] DOCOL @ OVER @ =    \ is a colon definition?
       IF  \ colon definition
          CASE ( loop words)
             ['] DO    OF DO, THERE  ENDOF
             ['] LOOP  OF LOOP,      ENDOF
             ['] +LOOP OF +LOOP,     ENDOF
             ['] BEGIN OF THERE >CS  ENDOF
             ['] UNTIL OF UNTIL,     ENDOF
             ['] AGAIN OF AGAIN,     ENDOF
             TRUE ABORT" Can't optimize word"
          ENDCASE
          DROP
       ELSE \ Forth DATA word
          DUP @   \ get the "executor" code routine address
          CASE ( data words )
             ['] DOVAR    OF >BODY LIT,    ENDOF
             ['] DOCON    OF  EXECUTE LIT, ENDOF
             ['] DOUSER @ OF  EXECUTE LIT, ENDOF
             TRUE ABORT" Unknown data type"
         ENDCASE
         DROP
      THEN
;
\ new interpreter loop for inlining
: INLINE[ ( -- addr)  \ Returns address where code has been copied
           THERE ( -- XT)    \ execution token (XT) for the NEW compiled code
           DUP CELL+ T,      \ create the ITC header for CODE word
           BEGIN
             BL WORD CHAR+ C@  [CHAR] ] <>
           WHILE
              HERE FIND
              IF ( *it's a word in the dictionary* )
                 DUP NOTCODE?
                 IF ( -- cfa )
                    DUP OPT-FORTH
                 ELSE  \ it's a CODE primitive
                    CODE,  \ compile code without NEXT
                 THEN
             ELSE ( maybe its a number)
                 COUNT NUMBER? ?ERR
                 ( n ) LIT,   \ compile n as a literal
             THEN
           REPEAT      \ CR .S  ( debug line)
           'NEXT' T,   \ compile NEXT at end of new code word
            ,          \ compile CODE word's XT into Forth definition
; IMMEDIATE
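
The offset calculation done by RESOLVE and <BACK in the listing above can be checked off-line. This Python sketch (the function name is made up) encodes a 9900 jump the same way: the displacement is counted in words from the address after the jump instruction and stored as a signed byte in the low half of the opcode.

```python
def encode_jump(opcode, jump_addr, target):
    """Return the 16-bit jump instruction located at jump_addr that
    transfers to target. opcode is the bare jump with a zero offset,
    for example 0x1000 for JMP or 0x1900 for JNO."""
    disp = (target - (jump_addr + 2)) // 2    # same as: 2- 2/
    if not -128 <= disp <= 127:
        raise ValueError("jump target out of range")
    return opcode | (disp & 0xFF)             # offset goes in the odd byte

print(hex(encode_jump(0x1000, 0x2010, 0x2000)))   # backward JMP
print(hex(encode_jump(0x1900, 0x2004, 0x2000)))   # JNO back to a loop start
```

The range check matters for an optimizer like this one: a loop body longer than 128 words cannot be closed with a single jump instruction.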

 

 

 
