Machine Forth OMG

+TheBF · April 7, 2021

For the curious, this is the same program as before, except it uses a FOR NEXT loop structure, which is just a down-counter with the index held on the return stack.

This is the output code with comments. You can see how the 9900 instructions map onto Forth quite well.

2018  0646  dect R6               ; 3000 #  ( pushes R4 accumulator 1st)  
201A  C584  mov  R4,*R6     
201C  0204  li   R4,>3000         
2020  0647  dect R7        
2022  C5C4  mov  R4,*R7           ; FOR  ( loop index on return stack)
2024  C136  mov  *R6+,R4          ; DROP
2026  0646  dect R6               ; AAAA #   
2028  C584  mov  R4,*R6           
202A  0204  li   R4,>AAAA         
202E  0646  dect R6               ; DUP
2030  C584  mov  R4,*R6           
2032  C204  mov  R4,R8            ; SWAP
2034  C116  mov  *R6,R4               
2036  C588  mov  R8,*R6          
2038  0646  dect R6               ; OVER
203A  C584  mov  R4,*R6           
203C  C126  mov  @>0002(R6),R4  
2040  06A0  bl   @>2004           ; CALL ROT     
2044  C136  mov  *R6+,R4          ; DROP      
2046  0646  dect R6               ; DUP   
2048  C584  mov  R4,*R6                
204A  0556  inv  *R6              ; AND     
204C  4136  szc  *R6+,R4               
204E  0646  dect R6               ; DUP      
2050  C584  mov  R4,*R6                
2052  E136  soc  *R6+,R4          ; OR     
2054  0646  dect R6               ; DUP      
2056  C584  mov  R4,*R6                
2058  2936  xor  *R6+,R4          ; XOR       
205A  0584  inc  R4               ; 1+
205C  0604  dec  R4               ; 1- 
205E  05C4  inct R4               ; 2+    
2060  0644  dect R4               ; 2-      
2062  0A14  sla  R4,1             ; 2*    
2064  0814  sra  R4,1             ; 2/     
2066  0504  neg  R4               ; NEGATE     
2068  0744  abs  R4               ; ABS      
206A  A136  a    *R6+,R4          ; +     
206C  0646  dect R6               ; 2 #     
206E  C584  mov  R4,*R6                
2070  0204  li   R4,>0002              
2074  C0F6  mov  *R6+,R3          ; *      
2076  38C4  mpy  R4,R3                 
2078  C136  mov  *R6+,R4          ; DROP      
207A  0617  dec  *R7              ; NEXT     
207C  18D4  joc  >2026                 
207E  05C7  inct R7                    
2080  045A  b    *R10             ; NEXT,  (return to ITC Forth)

+Lee Stewart · April 7, 2021

Is it usual for the FOR limit to not be consumed?

...lee

+TheBF · April 7, 2021

The DROP following FOR does that. Remember that this system uses R4 as a cache for the top of stack, so DROP always refills R4 from the memory stack.

The return stack works as a normal stack in memory, so the inct R7 is what removes the limit from the return stack:

207A  0617  dec  *R7              ; NEXT     
207C  18D4  joc  >2026                 
207E  05C7  inct R7

Unless you have found something I am completely missing, which has happened before, that is how I think it should work.
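
For anyone who wants to trace this by hand, here is a rough Python model of the FOR ... NEXT code in the listing. It is only a sketch: the `Machine` class and its method names are made up for illustration, and it assumes `joc` is taken as long as the decrement did not wrap below zero, which means a limit of n gives n+1 passes through the loop.

```python
class Machine:
    """Toy model: R4 caches the top of stack, both stacks are Python lists."""

    def __init__(self):
        self.tos = 0          # R4, the top-of-stack cache
        self.dstack = []      # data stack in memory (R6)
        self.rstack = []      # return stack in memory (R7)

    def lit(self, n):
        # dect R6 / mov R4,*R6 / li R4,n
        self.dstack.append(self.tos)
        self.tos = n & 0xFFFF

    def drop(self):
        # mov *R6+,R4 : refill the cache from the memory stack
        self.tos = self.dstack.pop()

    def for_(self):
        # dect R7 / mov R4,*R7 : copy the limit to the return stack,
        # then the DROP consumes it from the data stack
        self.rstack.append(self.tos)
        self.drop()

    def next_(self):
        # dec *R7 / joc loop / inct R7
        before = self.rstack[-1]
        self.rstack[-1] = (before - 1) & 0xFFFF
        if before != 0:       # assumed: carry set, so jump back to the loop
            return True
        self.rstack.pop()     # inct R7 removes the index
        return False


def run_loop(limit):
    """Return (passes, data stack balanced?, return stack depth afterwards)."""
    m = Machine()
    m.lit(limit)
    depth_before = len(m.dstack) - 1   # depth without the limit
    m.for_()
    passes = 0
    while True:
        passes += 1                    # the loop body would go here
        if not m.next_():
            break
    return passes, len(m.dstack) == depth_before, len(m.rstack)
```

In this model `run_loop(3)` makes four passes and leaves both stacks where they started, so the limit is consumed on the data side by the DROP and on the return side by the `inct R7`.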

GDMike · April 7, 2021

Not a bad thing to happen.

+TheBF · April 8, 2021

Indeed not. Lee has found so many bugs in my code I want to start calling him "Raid".

GDMike · April 8, 2021

Mine too, no matter what I was doing. The Eyes have it with him. Lol

+TheBF · April 8, 2021

POP/PUSH Optimization

This is something that I know should be a part of a good Forth native code compiler but I always created bugs when I tried it in the past.

I think I have this working so I am going to explain again to myself and anyone who cares to read about it just to confirm my logic.

When you run a Forth machine with a cache register for the top of stack element there are many Forth instructions that end with an instruction to refill the cache register. This is effectively a DROP function in the Forth machine because you are POPPING the stack into the register.

Other Forth instructions need to use the cache register when they start, so they push the cache register onto the stack in memory first thing.

This is effectively a DUP instruction on the Forth machine.

If a Forth instruction that ends with a DROP is followed immediately by an instruction that starts with a DUP, that is three useless instructions that just thrash the top element of the stack. Three extra instructions on the 9900 can really slow things down, especially inside a loop.

The solution was a "SMARTDUP" and I think I have the logic correct this time.

\ ************* optimizable operations ***************
COMPILER
: D=    ( d d -- ?)  ROT = -ROT = AND ;
: 1LOOKBACK ( n -- ? ) THERE 1 CELLS - @ = ;
: 2LOOKBACK ( d -- ? ) THERE 2 CELLS - 2@ D= ;

: REMOVE ( n -- )  CELLS NEGATE TALLOT ;  \ remove n cells from program
: ADUP     C584 0646 ;  \ DUP is 2 instructions, 4 bytes

: !,    TOS SWAP @@ MOV, ;
: DROP,   TOS POP, ;
: DUP,    TOS PUSH, ;
: C!,     TOS SWPB, TOS SWAP @@ MOVB, ;

\ POP/PUSH optimization:
\ Some words refill the stack with DROP. If the next word does a DUP
\ we should not have compiled the DROP, so SMARTDUP removes it.
COMPILER
: SMARTDUP
        OPTIMIZER @
        IF
          0C136 1LOOKBACK \ did we just emit a drop?
          IF  1 REMOVE    \ YES, so remove it
          ELSE DUP,       \ NO, so we must DUP
          THEN

        ELSE
           DUP,           \ regular DUP is compiled
        THEN ;

 TARGET
 : !  ( n variable --)
 [CC]
       OPTIMIZER @
       IF
          ADUP 2LOOKBACK   \ look back for ADUP
          IF   2 REMOVE  !,
          ELSE !, DROP,    \ un-optimized
          THEN
       ELSE
          !, DROP,         \ un-optimized
       THEN ;
TARGET
 : C!   ( c variable --)
 [CC]
       OPTIMIZER @
       IF
          ADUP 2LOOKBACK  \ look back for DUP
          IF    2 REMOVE C!,  \ optimized
          ELSE  C!, DROP,  \ un-optimized
          THEN
       ELSE
          C!, DROP,   \ un-optimized
       THEN ;

Using these concepts I also optimized ! and C! for expressions like:

1234 DUP X !

Since ! (store) consumes both of its arguments it always ends with a DROP.

I look back 2 cells in the program and if I find a DUP I can remove that dup since 1234 is sitting in R4 ready to go.

And since I removed the DUP I don't need the DROP after I store the number in X.

* ADUP in the code has the instructions in reverse order to match the way 2@ reads memory in 2LOOKBACK.
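
The look-back for `1234 DUP X !` can be modeled in a few lines of Python. This is a sketch under assumptions, not the compiler itself: the target image is a plain list of 16-bit cells, `store` and `compile_example` are made-up names, `>A100` is an invented address for X, and the literal is assumed to compile as a push of R4 followed by `li R4,>1234`.

```python
DUP = [0x0646, 0xC584]    # dect R6 / mov R4,*R6
DROP = [0xC136]           # mov *R6+,R4


def store(image, addr, optimize=True):
    """Append code for ! (store R4 at addr) to the list `image`."""
    store_code = [0xC804, addr]            # mov R4,@addr
    if optimize and image[-2:] == DUP:     # 2LOOKBACK: was a DUP just emitted?
        del image[-2:]                     # 2 REMOVE
        image.extend(store_code)           # value is still in R4, no DROP needed
    else:
        image.extend(store_code + DROP)    # un-optimized
    return image


def compile_example(optimize):
    image = []
    image.extend(DUP + [0x0204, 0x1234])   # 1234 #  (push R4, li R4,>1234)
    image.extend(DUP)                      # DUP
    return store(image, 0xA100, optimize)  # X !  (0xA100 is a made-up address)
```

With the optimizer off the expression takes 9 cells, with it on 6 cells, so the DUP and the DROP (3 cells, 6 bytes) disappear together. One thing the model makes visible: the test only compares cell values, so a literal that happened to equal the DUP opcodes would also match.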

senior_falcon · April 8, 2021

9 hours ago, TheBF said:

Indeed not. Lee has found so many bugs in my code I want to start calling him "Raid".

The nose knows!

+TheBF · April 8, 2021

MACHFORTH is getting closer to being useful

This little program relocates the code to load at >A000 and it also steals the entire scratchpad for Forth stacks and workspace.

And it successfully saves the image to disk.

\ MFORTH DEMO #1b  Use new workspace and stacks, save binary program
\ If running on Classic99 you will see R4 counting down

\ This demo shows:
\ - compile to >A000 origin
\ - create workspace and both stacks in scratchpad memory
\ - saves a finished program that can RUN from E/A Option 5

COMPILER
   NEW.
   HEX A000 ORIGIN.

INCLUDE DSK2.BYE  \ a little code to exit program

TARGET
PROG: DEMO1
       0 LIMI,      \ disable interrupts to take over the machine
       8300 WORKSPACE
       8380 RSTACK
       8400 DSTACK
       FFFF #
       BEGIN
           1-       \ decrement data stack
       -UNTIL       \ -UNTIL DOES NOT consume the stack parameter
       DROP         \ clean up the stack
       BYE          \ Return to TI title screen
END.

SAVE DSK2.DEMO1C
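
The BEGIN ... -UNTIL loop in the demo can be traced with a small Python sketch. It assumes, as the comments in the demo say, that -UNTIL tests the top of stack for zero without consuming it; `count_down` is a made-up name.

```python
def count_down(start=0xFFFF):
    """Return (loop passes, items left on the data stack after DROP)."""
    tos = start & 0xFFFF           # FFFF #
    passes = 0
    while True:                    # BEGIN
        tos = (tos - 1) & 0xFFFF   # 1-
        passes += 1
        if tos == 0:               # -UNTIL tests TOS but leaves it in place
            break
    stack_depth = 1                # the zero is still on the stack here
    stack_depth -= 1               # DROP cleans it up
    return passes, stack_depth
```

Starting from FFFF the loop body runs 65535 times, and the final DROP is what returns the stack to empty.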

+TheBF · September 7, 2021

Sometimes I miss the oh so obvious.

I asked a question over on the GCC topic. Lots of help came back.

@TURSI generously sent me a hello world program in C with all the trimmings. Thanks @Tursi

Here is the program

// tiny hello world
// assume started from Editor/Assembler and so screen is already set up and cleared

// Write Address/Register
#define VDPWA	*((volatile unsigned char*)0x8C02)
// Write Data
#define VDPWD	*((volatile unsigned char*)0x8C00)

int main() {
	// define the string
	unsigned char *pTxt = "Hello World";

	// set the VDP address
	VDPWA = 0x00;		// LSB
	VDPWA = 0x40;		// MSB + "write" bit
	
	// now display the string. No need for delays cause 8-bit TI code
	while (*pTxt) {
		VDPWD = *(pTxt++);
	}
	
	// now just spin until the user resets us. Use inline assembly to enable interrupts
	__asm__("LIMI 2");
	for (;;) { }	// spin forever

	// not reached
	return 0;
}

These two lines made me realize something I had never considered about interfacing to the TMS9918.

	// set the VDP address
	VDPWA = 0x00;		// LSB
	VDPWA = 0x40;		// MSB + "write" bit

These things are everyday for people who work in IT but it was a revelation for me.

Of course the '=' (assignment) operator in C can write to a memory mapped data port. It's just memory after all.

I was stuck in the rut of thinking I needed to make a sub-routine to set VDP addresses.
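
To make the two port writes concrete, here is a small Python sketch of what the VDP does with them. `FakeVDP`, `vdp_write_setup` and `hello` are invented names for illustration; the model only covers what the C program relies on: a 14-bit address sent low byte first, the >40 bit in the second byte marking a write, and the address stepping by one after each data byte.

```python
VDP_WRITE_FLAG = 0x40   # bit set in the second byte to request a write


def vdp_write_setup(addr):
    """Return the two bytes sent to VDPWA to set a 14-bit write address."""
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("VDP address must fit in 14 bits")
    lsb = addr & 0xFF
    msb = (addr >> 8) | VDP_WRITE_FLAG
    return lsb, msb


class FakeVDP:
    """Toy write-only VDP: latches two address bytes, then stores data bytes."""

    def __init__(self):
        self.vram = bytearray(0x4000)
        self.latch = []
        self.addr = 0

    def write_address_port(self, byte):        # models a store to >8C02
        self.latch.append(byte & 0xFF)
        if len(self.latch) == 2:
            lsb, msb = self.latch
            self.addr = ((msb & 0x3F) << 8) | lsb
            self.latch = []

    def write_data_port(self, byte):           # models a store to >8C00
        self.vram[self.addr] = byte & 0xFF
        self.addr = (self.addr + 1) & 0x3FFF   # address auto-increments


def hello(vdp, text=b"Hello World"):
    for b in vdp_write_setup(0):
        vdp.write_address_port(b)
    for ch in text:
        vdp.write_data_port(ch)
```

For address 0 the two bytes are >00 and >40, exactly the pair in the C source, and after that every byte store to the data port lands in the next screen position.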

Forth is a LOAD/STORE machine and the assignment operator for a byte is called 'c-store' (Char store) and the operator is C!.

So I translated this C program to my latest iteration of Machine Forth and it looks like this.

Deviations from standard Forth.

There are compiler directives to control the compiler and TARGET makes things point to the memory image where the program is created.

The # operator handles literal numbers, loading them into R4, the top of stack cache.

#C! handles the address parameter with symbolic addressing to store a byte. (versus taking the address from the Forth data stack)

\ tiny hello world in machine Forth
\ Translated from hello.c by Tursi for comparison
\ assume started from Editor/Assembler and so screen is already set up and cleared

COMPILER
   NEW.
   HEX 2000 ORIGIN.
  OPT-ON

TARGET                 \ code for the target binary program

\ Write Address port
HEX 8C02 EQU VDPWA
\ Write Data port
HEX 8C00 EQU VDPWD

\ define the string
CREATE TXT  S" Hello World!" S,

HEX
PROG: MAIN
   \ this compiler is dumb, so we need to setup the machine manually
    0 LIMI,
    3F00 WORKSPACE
    FE00 RSTACK
    FF00 DSTACK

     0 # VDPWA #C!   \ character store VDP address LSB
    40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

   \ now display the string. No need for delays cause its 8-bit TI code
     TXT # COUNT 1- FOR   COUNT VDPWD #C!    NEXT    DROP

   \ Use Forth Assembler to enable interrupts
	 2 LIMI,

   \ Return to Forth for convenience
	 8300 WORKSPACE
   NEXT,
END.

The C program compiles to 118 bytes.

The Machine Forth program is 128 bytes with the push/pop optimizer OFF and 117 bytes with OPT-ON.

Not too shabby for a very naïve compiler.

The C 'while' loop is very impressive. Machine Forth has an 'address' register with auto-increment that might let me do something similar.

The spoiler has the listing from Classic99, and I manually added the 'Hello World!' string and the branch to explain it better.

There are still some optimizations that could be made but that will get hard real fast since the compiler is really dumb.

Spoiler


\ hello.fth  machine Forth code with push/pop  optimizer on

\ R4 is the 'cache register' for the top-of-stack in the  Virtual machine
\ R6  DATA stack pointer
\ R7  Return stack pointer
\ R8  FOR/NEXT loop index
\ R10  points to code to return to Forth 
2000        B @>2012 
2004        BYTE >0C
2005        TEXT 'Hello World!'
            EVEN 
2012  0300  limi >0000
2016  02E0  lwpi >3f00          \     3F00 WORKSPACE
201A  0207  li   R7,>fe00       \     FE00 RSTACK
201E  0206  li   R6,>ff00       \     FF00 DSTACK
2022  0646  dect R6             \     DUP 
2024  C584  mov  R4,*R6
2026  0204  li   R4,>0000       \      0 #
202A  06C4  swpb R4
202C  D804  movb R4,@>8c02      \      VDPWA #C! 
2030  0204  li   R4,>0040       \      40 #
2034  06C4  swpb R4             
2036  D804  movb R4,@>8c02      \      VDPWA #C!
203A  0204  li   R4,>2004       \      TXT #
203E  0646  dect R6             \      COUNT 
2040  C584  mov  R4,*R6         
2042  0596  inc  *R6            
2044  D114  movb *R4,R4
2046  0984  srl  R4,8
2048  0604  dec  R4             \      1-  
204A  0647  dect R7             \      FOR 
204C  C5C8  mov  R8,*R7
204E  C204  mov  R4,R8
2050  C136  mov  *R6+,R4        
2052  0646  dect R6             \      COUNT 
2054  C584  mov  R4,*R6
2056  0596  inc  *R6
2058  D114  movb *R4,R4
205A  0984  srl  R4,8
205C  06C4  swpb R4            
205E  D804  movb R4,@>8c00     \     VDPWD #C!  
2062  C136  mov  *R6+,R4
2064  0608  dec  R8            \     NEXT 
2066  18F5  joc  >2052
2068  05C7  inct R7
206A  C136  mov  *R6+,R4       \     DROP 
206C  0300  limi >0002
2070  02E0  lwpi >8300
2074  045A  b    *R10

+TheBF · September 8, 2021

Chuck Moore added the 'A' register to his virtual machine in machine Forth.

It's just a temp register that lets you use auto-increment and on his CPU auto-decrement.

It can save a lot of stack juggling. The idea is great for 9900 but making a syntax that I like might take me a while. :)

I re-wrote Hello to be a bit more idiomatic Forth by adding a TYPE sub-routine.

The loop is much tighter but I had to resort to some Assembly language to really make it good.

Here is a simple TYPE in Machine Forth with Assembly language

: TYPE ( Caddr len -- )
\ Mixing machine Forth and Assembler for best use
      *SP+ AREG MOV,   \ pop address into Address register ie: R9
                       \ len remains in TOS register
       0 LIMI,
       BEGIN
          *AREG+ VDPWD @@ MOVB,
          1-           \ dec TOS
       -UNTIL          \ until tos=0
       DROP
;

Here is the code that machine Forth emitted.

\ type sub-routine code
        dect R7                \ enter sub-routine saves R11 on return stack
        mov  R11,*R7

        mov  *R6+,R9           \ pop the string address to AREG  ie: R9
        limi >0000
>201C:  movb *R9+,@>8c00       \ write a byte to VDP port
        dec  R4                \ dec the length
        jne  >201c             \ loop until len=0
        mov  *R6+,R4           \ drop

        mov  *R7+,R11          \ POP r11
        b    *R11              \ return

To use it I did:

   \ display the string with a sub-routine
    TXT # COUNT TYPE

Although using *SP+ is more efficient for 9900, this is still not how Chuck envisioned using the A register.

I think TYPE should be more like:

: TYPE ( Caddr len -- )
       0 LIMI,
       SWAP A! 
       BEGIN
          AC@+ VDPWD C! 
          1-           \ dec TOS
       -UNTIL          \ until tos=0
       DROP
;

But the original Machine Forth did not have SWAP. :_( Not sure how that worked. :)

I will continue refining how best to use this Machine Forth concept but adapt it to the 9900 instruction set.

For example SWAP might just be an alias for *SP or something like that.

Spoiler shows the Forthier version of Hello.

\ tiny hello world in machine Forth Demo 2

\ Creates a TYPE sub-routine with ASM and Address register

COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET                 \ code for the target binary program

\ Write Address port
HEX 8C02 EQU VDPWA
\ Write Data port
HEX 8C00 EQU VDPWD

\ define the string
CREATE TXT  S" Hello World!" S,

: TYPE ( Caddr len -- )
\ Mixing machine Forth and Assembler for best use
      *SP+ AREG MOV,   \ pop address into Address register ie: R9
                       \ len remains in TOS register
       0 LIMI,
       BEGIN
          *AREG+ VDPWD @@ MOVB,
          1-           \ dec TOS
       -UNTIL          \ until tos=0
       DROP
;

HEX
PROG: MAIN
   \ this compiler is dumb, so we need to setup the machine manually
    0 LIMI,
    3F00 WORKSPACE
    FE00 RSTACK
    FF00 DSTACK

     0 # VDPWA #C!   \ character store VDP address LSB
    40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

   \ display the string with a sub-routine
    TXT # COUNT TYPE

   \ Return to Forth
	 8300 WORKSPACE
   NEXT,
END.

+TheBF · September 8, 2021

Smoke is clearing a little...

I think the big deal for me is to get this thing in stable form and not try to optimize everything all at once.

I reverted back to the way things worked early on where all data creating words push their address onto the data stack when invoked.

This is just like normal Forth, but uses the LI instruction. The PUSH/POP optimizer makes this practical otherwise it would always be three instructions to load R4.

* ?DPUSH is the PUSH/POP optimizer. It looks back to see if there was a DROP in the previous cell and if so, it erases the DROP and does not DUP.

This is a big improvement (essential?) in reducing code size in a stack machine where the TOS is cached in a register.

\ Machine Forth data structure creation
COMPILER
: LIT,  ( n -- )  ?DPUSH   TOS SWAP LI,  ;

: CONSTANT ( n -- n)  \ create the compiler's constant
      CREATE   ,              \ remember the value
      DOES> ( pfa ) @  LIT, ; \ compile constant as a literal no.

: CREATE ( -- addr)
        CREATE CHERE ,     \ remember the target address
        DOES> @  LIT,  ;   \ pushes address onto stack

: VARIABLE  ( -- addr) CREATE  0000 T, ;

Next thing was to make a better COUNT word, which I renamed $@ ( string fetch) since it leaves ONLY the length on the data stack but puts the string address into register A.

I can see just from this how using the A register simplifies things. At the moment it is inline code. At four instructions it could be a sub-routine if it was used a great deal.

\ ** WARNING ** puts the string address in register A
: $@   ( Caddr --  len) ( A: Caddr+1)
        TOS   AREG MOV,   \ base address to register A
              AREG INC,   \ bump address past count byte
        C@                \ fetch byte count onto data stack
;
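
Here is a quick Python sketch of what $@ leaves behind and how the print loop then uses it. It assumes a counted string (length byte first) held in a `bytes` object; `string_fetch` and `type_string` are invented names, and the A register is just a Python integer.

```python
def string_fetch(memory, caddr):
    """Model of $@ : ( Caddr -- len ) with A left pointing at the first char."""
    a_reg = caddr + 1          # TOS AREG MOV,  AREG INC,
    length = memory[caddr]     # C@ reads the count byte
    return length, a_reg


def type_string(memory, caddr):
    """Consume the string the way a BEGIN ... 1- -UNTIL loop would."""
    length, a = string_fetch(memory, caddr)
    out = bytearray()
    while True:                # BEGIN
        out.append(memory[a])  # one byte out through the A register
        a += 1                 # auto-increment
        length -= 1            # 1-
        if length == 0:        # -UNTIL
            break
    return bytes(out)
```

Because the test comes after the first byte is sent, a zero-length string is not handled in this sketch, and as far as I can tell the loop it models has the same limitation.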

With these changes and putting TYPE inline, which is more like the C version, the hello program shrank to 85 bytes! (and it still worked)

Getting closer...

New source code

\ tiny hello world in machine Forth Demo with $@     Sept 8 2021  Fox
\ compiles to 85 bytes 

COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET                 \ code for the target binary program

HEX 8C02 EQU VDPWA     \ Write Address port
HEX 8C00 EQU VDPWD     \ Write Data port

CREATE TXT  S" Hello World!" S,

HEX
PROG: MAIN
   \ setup Forth machine
    0 LIMI,
    3F00 WORKSPACE
    3D00 RSTACK
    3E00 DSTACK

     0 # VDPWA #C!   \ character store VDP address LSB
    40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

    TXT $@
    BEGIN
       *AREG+ VDPWD @@ MOVB,
       1-
    -UNTIL
    DROP

   \ Return to Forth
	 8300 WORKSPACE
   NEXT,
END.

Emitted code

2012  0300  limi >0000          
  2016  02E0  lwpi >3f00  
  201A  0207  li   R7,>3d00 
  201E  0206  li   R6,>3e00 
  2022  0646  dect R6 
  2024  C584  mov  R4,*R6   
  2026  0204  li   R4,>0000
  202A  06C4  swpb R4 
  202C  D804  movb R4,@>8c02 
  2030  0204  li   R4,>0040 
  2034  06C4  swpb R4 
  2036  D804  movb R4,@>8c02
  203A  0204  li   R4,>2004              
  203E  C244  mov  R4,R9                 
  2040  0589  inc  R9                    
  2042  D114  movb *R4,R4                
  2044  0984  srl  R4,8                  
  2046  D839  movb *R9+,@>8c00           
  204A  0604  dec  R4                    
  204C  16FC  jne  >2046                 
  204E  C136  mov  *R6+,R4               
  2050  02E0  lwpi >8300  
  2054  045A  b    *R10

+TheBF · September 8, 2021

If you've optimized one thing...

Working at compile time is really interesting. If you understand how to detect a situation that you don't like it's easy to remove it and replace it with different code.

This is new to me.

So I have a smart DUP that detects if there was a DROP in the previous instruction. This can save 6 bytes whenever two Forth primitives are connected together where the 1st one ends with DROP and the second word starts with DUP. (to make room in R4)

Here is how the optimizer looks:

\ pop/push optimizer
HEX
 C136 CONSTANT 'DROP'  \ machine code for DROP

: DUP,     ( n -- n n)    TOS DPUSH, ;  \ normal dup

: LOOKBACK ( -- u)  THERE 2- @ ; \ fetch previous instruction code

: OPT-DUP, ( n -- n ?n)   \ SMART dup
   LOOKBACK 'DROP' =    \ look back for DROP
   IF -2 TALLOT         \ move target dictionary back 1 cell
   ELSE  DUP,
   THEN ;

* TALLOT is like ALLOT but operates on the target memory image
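
The 6-byte saving can be checked with a small Python model of the smart DUP. This is a sketch, with the target image as a list of 16-bit cells and invented names `opt_dup` and `size_in_bytes`; the example cells are the tail of a `VDPWA #C!` followed by the start of `40 #`.

```python
DROP = 0xC136                 # mov *R6+,R4
DUP = [0x0646, 0xC584]        # dect R6 / mov R4,*R6


def opt_dup(image, optimize=True):
    """Append a DUP to `image` unless it would only undo a DROP."""
    if optimize and image and image[-1] == DROP:   # LOOKBACK 'DROP' =
        image.pop()                                # -2 TALLOT
    else:
        image.extend(DUP)
    return image


def size_in_bytes(image):
    return 2 * len(image)     # every cell is 16 bits


# swpb R4 / movb R4,@>8C02 / DROP, i.e. the end of  VDPWA #C!
tail = [0x06C4, 0xD804, 0x8C02, DROP]
literal = [0x0204, 0x0040]    # li R4,>0040

plain = opt_dup(list(tail), optimize=False) + literal
smart = opt_dup(list(tail), optimize=True) + literal
```

Here `plain` is 16 bytes and `smart` is 10 bytes: the 2-byte DROP and the 4-byte DUP are both gone.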

I wanted to see how I could remove the assembly language in the Hello program print loop but continue to use the A register.

I have landed on using 9900 type syntax so the A register looks like a 9900 register in the Forth code but with extra characters that are from Forth.

A@ fetches register A to the top of the data stack.
A! stores the top of the data stack into the A register.
*A@ means fetch A, indirect address to top of data stack
*A@+ means fetch A, indirect with auto-incrementing.

This is different than Chuck Moore's CPU but in order to get the performance out of the CPU we have to use its features.

\ A register Machine Operators for TMS9900
: A@    ( -- n)   ?DPUSH   AREG  TOS MOV, ;  \ Dpush(T) T=A
: *A@   ( -- n)   ?DPUSH  *AREG  TOS MOV, ;  \ Dpush(T) T=*A
: *A@+  ( -- n)   ?DPUSH  *AREG+ TOS MOV, ;  \ Dpush(T) T=*A  A=A+cell
: (A)@  ( u --)   ?DPUSH  (AREG) TOS MOV, ;  \ Dpush(T) T=u@(A)

: #A!   ( addr --)  AREG SWAP LI, ; \ load A with literal number BF addition
: A!    ( addr -- ) TOS  AREG  MOV, DROP ;  \ A!   A=T  Dpop(T)
: *A!   ( n --)     TOS *AREG  MOV, DROP ;  \ !A  [A]=T Dpop(T)
: *A!+  ( n --)     TOS *AREG+ MOV, DROP ;  \ !A+ [A]=T A=A+cell Dpop(T)
: (A)!  ( n --)     TOS  SWAP (AREG) MOV, DROP ;

\ addr A-plus-store for versatility. 
: A+!   ( n -- )  TOS AREG ADD,  DROP ;

Chuck's machine did not have byte access and so he did it in his code as needed.

That's not right for the 9900 so I have these byte-wise operators again with the 9900 addressing modes.

\ added byte operations. BFox
: *AC@   ( -- 0c00) ?DPUSH  *AREG  TOS MOVB, TOS 8 SRL, ;
: *AC@+  ( -- 0c00) ?DPUSH  *AREG+ TOS MOVB, TOS 8 SRL, ;
: *AC!   ( 0c00 --)  1 (TOS) *AREG MOVB,  DROP ;
: *AC!+  ( 0c00 --)  1 (TOS) *AREG+ MOVB, DROP ;

A problem arises when you do this

*AC@+  VDPWD #C!

As seen above, *AC@+ ends with the SRL instruction, which shifts the byte down into the low half of TOS (ie: R4).

But the #C! operator is this:

: #C!   ( c addr --)  TOS SWPB,  TOS SWAP @@ MOVB,   DROP ;

So we shift the byte to one side only to swap it back to the other side.

So I replaced TOS SWPB, with ?SWPB,

\ swap byte optimizer
: ?SWPB,  ( n -- n)
         LOOKBACK 0984 =   \ look back for "SRL R4,8"
         IF -2 TALLOT      \ remove SRL
         ELSE  TOS SWPB,   \ we need SWPB
         THEN
;

Seems to work and program is still pretty efficient. It wastes a move into R4 versus using the Assembly language single instruction.
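
A short Python check shows why dropping both instructions is safe for the byte that gets written. It is a sketch with made-up helper names, and it assumes the only thing that matters to #C! is the high byte of R4, since that is the byte MOVB sends from a register.

```python
def srl8(r):
    """srl R4,8 : logical shift right, the byte moves to the low half."""
    return (r & 0xFFFF) >> 8


def swpb(r):
    """swpb R4 : exchange the two bytes of the register."""
    r &= 0xFFFF
    return ((r << 8) | (r >> 8)) & 0xFFFF


def high_byte(r):
    """The byte a MOVB from a register actually sends."""
    return (r >> 8) & 0xFF


def byte_sent(r4_after_movb, optimized):
    """Byte written by #C! when it directly follows *AC@+."""
    if optimized:
        return high_byte(r4_after_movb)            # SRL and SWPB both removed
    return high_byte(swpb(srl8(r4_after_movb)))    # shift down, swap back up
```

The two paths send the same byte for every possible R4. The register itself ends up different (the low byte is not cleared in the optimized case), but #C! drops TOS right after the write, so nothing sees it.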

So the actual program looks like this with only machine Forth.

PROG: MAIN
   \ setup Forth machine
    0 LIMI,
    3F00 WORKSPACE
    3D00 RSTACK
    3E00 DSTACK

     0 # VDPWA #C!   \ character store VDP address LSB
    40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

    TXT $@
    BEGIN
     *AC@+ VDPWD #C!
      1-
   -UNTIL
    DROP

     8300 WORKSPACE
   NEXT,
END.

It's not normal Forth but I think you can write some pretty fast programs with it.

The next thing to tackle is tail-call optimization.

+TheBF · September 8, 2021

I may be getting the hang of this.

I am going to quote an article from ForthWrite Magazine, from the Forth Interest Group UK. June 2000, Special Issue.

I used it for reference and the author, John Tasgal, explains this better than I could.

-----------

"Tail-Recursion Optimisation
In any definition the return action of the word before a semicolon, and of the semicolon itself, can always be compiled into a single return.

word1 ..... lastword ;

As nothing happens between lastword returning and ';' returning, the lastword return is superfluous.

A more elaborate example is the recursive call at the end of a WHILE loop. If we have a series of nested calls then the last instruction is in each case a return.

At runtime this produces '; ; ; ; ;' viz. a sequence of returns.

The point is that when these calls unwind all that happens is that a sequence of returns are executed, one after the other. Nothing is done between them. The only necessary return is the first one pushed onto the return stack (and so the last to be executed). Removing these superfluous returns is known as tail-recursion optimisation.

Most Machine Forth compilers (and also Color Forth) contain a 'tail-recursion optimiser'."

-----------

Machine Forth has a special semi-colon for this purpose called -;

Like most things Forth it is up to you to use it where you want to. This would be whenever the last word in a definition is a COLON definition ie: a sub-routine.

It won't work if the last item is a constant or a variable or an inline primitive word for example.

Here is how I implemented -; and it seems to work.

(H: ;H are aliases for Camel99's (the Host) colon/semi-colon so I can keep my head on straight)

\ tail call removal semi-colon
H: -; ( --  )
     LOOKBACK ( addr ) >R   \ fetch & save sub-routine address
     -8 TALLOT              \ remove the call sequence (go back 8 bytes)
      R> @@ B,              \ compile a branch to the sub-routine
;H

Here is the test program that showed it working.

It saves 32 bytes using tail-call optimization which is a welcome bonus and on the TI-99 that's 16 instructions of speed improvement too!
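
The 32 bytes and 16 instructions can be reproduced with a bit of arithmetic in Python. The sketch assumes the sub-routine entry and exit sequences shown earlier in the emitted TYPE code (two 2-byte instructions each), a 4-byte `bl @addr` for a call, and that -; leaves only a 4-byte `b @addr` in a word whose whole body is one call. The function names are invented.

```python
# Sizes in bytes of the 9900 sequences involved.
ENTRY = 4      # dect R7 / mov R11,*R7
CALL = 4       # bl @addr
EXIT = 4       # mov *R7+,R11 / b *R11
BRANCH = 4     # b @addr


def wrapper_size(tail_call):
    """Size of a word like  : LEVEL4  HI ;  compiled with ; or with -;"""
    return BRANCH if tail_call else ENTRY + CALL + EXIT


def wrapper_instructions(tail_call):
    """Instruction count of the same word: 2 + 1 + 2, or a single branch."""
    return 1 if tail_call else 5


def savings(levels):
    """Return (bytes saved, instructions saved) for a chain of wrappers."""
    bytes_saved = levels * (wrapper_size(False) - wrapper_size(True))
    instr_saved = levels * (wrapper_instructions(False) - wrapper_instructions(True))
    return bytes_saved, instr_saved
```

For the four LEVEL words, `savings(4)` gives 32 bytes and 16 instructions, which matches the figures above.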

\ tail-call optimization test program    Sept 8 2021  Fox
COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET                 \ code for the target binary program

HEX 8C02 EQU VDPWA     \ Write Address port
HEX 8C00 EQU VDPWD     \ Write Data port

CREATE TXT  S" Hello World!" S,

: HI
     0 # VDPWA #C!   \ character store VDP address LSB
    40 # VDPWA #C!   \ character store VDP address MSB + "write" bit
    TXT $@  BEGIN  *AREG+ VDPWD @@ MOVB, 1-  -UNTIL   DROP ;

: LEVEL4  HI      -;
: LEVEL3  LEVEL4  -;
: LEVEL2  LEVEL3  -;
: LEVEL1  LEVEL2  -;

HEX
PROG: MAIN
   \ setup Forth machine
    0 LIMI,
    3F00 WORKSPACE
    3D00 RSTACK
    3E00 DSTACK

    LEVEL1

    8300 WORKSPACE
    NEXT,

END.

GDMike · September 8, 2021

I always try to keep all my constant and variable DEFS up front and prior to my structured DEFS, ie;, loop DEFS and other words.

+TheBF · September 8, 2021

From the human perspective that makes perfect sense. You can see all the data at a glance.

From the TI-99 perspective our old computer doesn't really care. It's a wild memory model, however, with a lot of different types of memory in the system.

Many modern machines force a separation of code and data memory so there's that.

+TheBF · September 8, 2021

36 minutes ago, GDMike said:

I always try to keep all my constant and variable DEFS up front and prior to my structured DEFS, ie;, loop DEFS and other words.

Wait. Were you just making a joke?

GDMike · September 9, 2021

Nope, just using my 3 cents. Ok, I'm being a smart ass, but I'm trying to follow what you're saying. Some of it I'm gonna have to look harder at, but I'm trying to follow. lol

+TheBF · September 9, 2021

Well I have had trouble following it myself. Some of the advanced stuff started to fall into place in the last week or so.

Here is a summary:

Normal Forth has a bunch of Assembly language words that do stuff. ( DUP SWAP OVER + - * / etc).

These things are always "called" so there is some overhead to make everything go but they only take 2 bytes in your program every time you use a word.

Machine Forth does the opposite.

It uses these same short pieces of Assembly code but instead of calling them, it copies them into RAM one after another. No calling unless you want that.

The magic is that the Forth colon definition lets you record Forth Assembler code as a Forth word.

When you run that word it will run the Assembler code which writes the code into memory.

This would be called a macro in a modern "macro-assembler" language.

So when I want machine Forth to do addition I make this:

: +     ( n n -- n)  *SP+ TOS ADD, ;

It does not RUN the code when you type + in your machine Forth program.

When you use + in a machine Forth program it is like you typed in the assembly language, so the code gets written into RAM.

In this case in a separate memory block, not part of Camel Forth.
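
The same idea in a few lines of Python, for anyone who thinks better in that language. Each "word" is a small function that, when run, appends machine code to a target image instead of doing the operation. This is only a sketch: `make_compiler` and the tiny word list are invented for illustration, and the opcodes are the ones from the listings earlier in the thread.

```python
def make_compiler():
    """Build a tiny macro compiler with its own private target image."""
    target = []                                  # the separate memory block

    def emit(*cells):
        target.extend(cells)

    macros = {
        "DUP":  lambda: emit(0x0646, 0xC584),    # dect R6 / mov R4,*R6
        "DROP": lambda: emit(0xC136),            # mov *R6+,R4
        "+":    lambda: emit(0xA136),            # a *R6+,R4
        "1+":   lambda: emit(0x0584),            # inc R4
        "2*":   lambda: emit(0x0A14),            # sla R4,1
    }

    def compile_source(text):
        """Run each word's macro; an unknown word raises KeyError here."""
        for word in text.split():
            macros[word]()
        return list(target)

    return compile_source
```

Compiling `DUP + 1+` this way adds nothing together at compile time. It just leaves four cells of 9900 code in the image, ready to run later.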

Make a bit more sense?

The rest is the details of getting the @#$!# thing to make an actual EA5 program image.

GDMike · September 9, 2021

Ok. Gotcha. That's making sense.

+TheBF · September 10, 2021

It turns out it is hard to make a compiler that fits in 18.3K beat GCC performance.

I thought I would try Tursi's Sprite benchmark with this new compiler.

GCC did this benchmark in 5 seconds.

This version in generic Forth ran in 27 seconds.

DECIMAL
( more direct translation of Tursi ASM code to Forth)
: TURSI.OPT
      100 0
      DO
           239 0 DO   I $301 VC!     LOOP
           175 0 DO   I $300 VC!     LOOP
           0 239 DO   I $301 VC! -1  +LOOP
           0 175 DO   I $300 VC! -1  +LOOP
      LOOP ;

This version uses the Camel99 inline optimizer and ran in 20 seconds:

( optimize inner loop code)
: TURSI.INLINE
      100 0
      DO
  INLINE[ 239 0 ] DO  INLINE[ I $301 VC! ]     LOOP
  INLINE[ 175 0 ] DO  INLINE[ I $300 VC! ]     LOOP
  INLINE[ 0 239 ] DO  INLINE[ I $301 VC! -1 ] +LOOP
  INLINE[ 0 175 ] DO  INLINE[ I $300 VC! -1 ] +LOOP
      LOOP ;

This version in Machine Forth ran in 15 seconds. It uses the A register on two loops because the FOR NEXT loop as envisioned by Chuck Moore is a down-counter.

      100 #
      BEGIN
        \ using register A for up counting
        0 #A!  239 # FOR  $301 VDPA!   A@ VDPWD #C!  A1+!   NEXT
        0 #A!  175 # FOR  $300 VDPA!   A@ VDPWD #C!  A1+!   NEXT
        \ for/next index is a down-counter
               239 # FOR  $301 VDPA!   I@ VDPWD #C!         NEXT
               175 # FOR  $300 VDPA!   I@ VDPWD #C!         NEXT
        1-
      -UNTIL
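
A small Python sketch of why the A register is needed on the up-counting loops. This is only a model: `for_next`, `up_positions` and `down_positions` are made-up names, and I am assuming the index runs from n down to 1. Whether the real FOR NEXT also visits index 0 is not stated in the post.

```python
def for_next(count):
    """Model of a down-counting FOR ... NEXT (assumed range: count .. 1)."""
    i = count
    while i > 0:
        yield i
        i -= 1

def up_positions(count):
    """Up-counting needs a second counter, like register A with A1+!."""
    a = 0                      # 0 #A!
    out = []
    for _ in for_next(count):  # loop index is ignored here
        out.append(a)          # A@
        a += 1                 # A1+!
    return out

def down_positions(count):
    """Down-counting can use the loop index directly, like I@."""
    return [i for i in for_next(count)]

print(up_positions(3))
print(down_positions(3))
```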

When I made VDPA! (VDP address store) a macro, it ran in 10 seconds.

That was the best I could do so far.

I have also added some incrementors/decrementors for the A register because they are native instructions on the 9900.

A1+!
A1-!
A2+!
A2-!

My push/pop optimizer failed in this test as well so more sleuthing is required.

Edit: Got the optimizer working on A@ and I@. That got it down to 8 seconds with VDPA! as a macro.
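
To put the numbers side by side, here is a quick Python calculation of each run time relative to GCC. The times are the ones quoted in this post; the labels are just my shorthand.

```python
# Run times in seconds, as reported in the post.
times = {
    "GCC": 5,
    "generic Forth": 27,
    "Camel99 INLINE[ ]": 20,
    "Machine Forth": 15,
    "Machine Forth + VDPA! macro": 10,
    "Machine Forth + VDPA! macro + optimizer": 8,
}

gcc = times["GCC"]
ratios = {name: round(t / gcc, 2) for name, t in times.items()}

for name, ratio in ratios.items():
    print(f"{name:42s} {times[name]:3d} s  {ratio:.2f}x GCC")
```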

+TheBF · September 10, 2021

My pop/push optimizer problem seems to have been my logic on when to invoke it. It seems to work reliably now.

In this little program it was used 8 times, which saved 48 bytes! To be clear, "Forth" is not in the program. It's just native code glued together by Forth.

So here is a little video of how it works. There is still lots of work to do to make it something someone else could use, but I have always wanted to know more about Forth generating native code, so this is a bit of a personal victory. It pales in comparison to XB256, but it is a compiler that can generate fast code, so it could be "library enabled".

Here is the entire benchmark program using some tricks so that it runs as fast as I can make it go.

The video shows it built and run. You could run it from within Forth and return to Forth, but I wanted to show the EA5 creation function.

I gotta go make pizza.

Happy weekend

\ Tursi sprite benchmark in Machine Forth         Sept 8 2021  Fox

\ INCLUDE DSK2.MFORTH,FTH

COMPILER
   NEW.
   HEX 2000 ORIGIN.

TARGET
OPT-ON
INCLUDE DSK2.TINYVDP

\ A few screen variables
VARIABLE C/L
VARIABLE C/SCR
VARIABLE VMODE
0380 CONSTANT CTAB      \ colour table VDP address

HEX
: GRAPHICS
         0 # CTAB 0 # VFILL
         0E0 # DUP 83D4 #C! 1 # VWTR
         0 #  2 # VWTR    \ set VDP screen page
         0E # 3 # VWTR
         01 # 4 # VWTR
         06 # 5 # VWTR
         01 # 6 # VWTR
         CTAB 10 # 10 # VFILL  \ charset colors
         27 # 7 # VWTR         \ screen color
         20 # C/L !
         300 # C/SCR !
         1 # VMODE !
         0 # 300 # 20 # VFILL        \ clear screen
;

HEX
: MAGNIFY  ( mag-factor -- )
        83D4 #C@  0FC # AND +  DUP 1 # VWTR  83D4 #C! ;

: SPRITE0  ( char colr x y -- ) \ create a SPRITE, sp# = 0..31
           300 # VC!      \ set Y position
           301 # VC!      \ set X position
           303 # VC!      \ set the sprite color
           302 # VC!      \ set the character pattern to use
;

\ *COMPILE time trick*
\ Use HOST Forth to make VDP addresses with write bit set and pre-swapped
 HOST 300 4000 OR ><  TARGET CONSTANT $300
 HOST 301 4000 OR ><  TARGET CONSTANT $301

\ We can use the Host Forth colon to make a macro
H: VDPA! ( Vaddr -- ) \ set vdp address (read mode)
          TOS VDPWA @@ MOVB,
          TOS SWPB,
          TOS VDPWA @@ MOVB,
          DROP
;H

: TURSI
      DECIMAL
      GRAPHICS
      42 # 4 # 0 # 0 #  SPRITE0
      1 # MAGNIFY
      0 LIMI,
      100 #
      FOR
        \ using register A for up counting
        0 #A!  239 # FOR  $301 VDPA!  A@ VDPWD #C!  A1+!  NEXT
        0 #A!  175 # FOR  $300 VDPA!  A@ VDPWD #C!  A1+!  NEXT
        \ for/next index is a down-counter
               239 # FOR  $301 VDPA!  I@ VDPWD #C!   NEXT
               175 # FOR  $300 VDPA!  I@ VDPWD #C!   NEXT
      NEXT
      BEGIN AGAIN  \ loop forever
;

\ prog: names the entry address for the images
PROG: MAIN
HEX   8300 WORKSPACE
      3FDE DSTACK   ( 20 cells)
      3FB6 RSTACK

      TURSI         \ call the program
END.

COMPILER SAVE DSK2.TURSI

CR ." Optimizations: " OPTS ?

+TheBF · September 24, 2021

Slowly expanding the code for this machine Forth compiler and adding more hi-level code that makes the transition from standard Forth a heck of a lot easier. (for me anyway)

I have decided it is simpler to just assume parameters are in the TOS register per normal Forth. This keeps the syntax less "creative".

I could add a literal stack to the compiler and make smarter decisions on data, but that increases the complexity of the compiler quite a bit, which is not in the spirit of machine Forth.

I have made some sense of how to make POP/PUSH optimizations work and it's not rocket science.

It turns out that when a literal number, a constant or a variable is used, everything is reliable IF you PUSH the TOS register first (which is a DUP operation).

This is how Forth expects things to be done.

It seems you can't get fancy and try to optimize that first PUSH away. It will bite you. Maybe with much more complicated analysis it could be done, but it's probably above my pay grade. After that it just works. Code that ends with a DROP instruction and is followed by a DUP triggers removal of both the DROP and the DUP, saving 6 bytes.
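
Here is a rough Python model of that DROP/DUP rule, as I understand it from the description above. It is a sketch, not the real optimizer. The opcodes come from the disassembly earlier in the thread; the `Target` class and its method names are invented. The sketch only cancels the push that precedes a literal load, because there TOS is overwritten immediately afterwards, so dropping the pop/push pair changes nothing.

```python
PUSH_TOS = (0x0646, 0xC584)  # dect R6 / mov R4,*R6   (DUP, 4 bytes)
POP_TOS = 0xC136             # mov *R6+,R4            (DROP, 2 bytes)
LI_TOS = 0x0204              # li R4,<value>

class Target:
    """Toy code buffer with a one-instruction peephole (hypothetical)."""
    def __init__(self):
        self.image = []
        self.opts = 0        # how many times the optimizer fired

    def drop(self):
        self.image.append(POP_TOS)

    def lit(self, value):
        # Naive test: a real compiler must be sure the last word is an
        # opcode and not the data word of an earlier instruction.
        if self.image and self.image[-1] == POP_TOS:
            self.image.pop()             # remove the DROP ...
            self.opts += 1               # ... and skip the DUP
        else:
            self.image.extend(PUSH_TOS)  # always push TOS first
        self.image.extend((LI_TOS, value & 0xFFFF))

    def bytes_saved(self):
        return self.opts * 6             # 2 (DROP) + 4 (DUP)

t = Target()
t.lit(0x3000)   # first push is always kept
t.drop()
t.lit(0xAAAA)   # DROP + push cancelled here
print([f"{w:04X}" for w in t.image], t.bytes_saved())
```

With the optimizer firing 8 times, `bytes_saved()` gives the 48 bytes mentioned earlier.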

Results of this process are below in a simple test program.

\ Difference with/without optimizer:
\  15.6 vs 12.35 seconds. 20.8% less run time (about 1.26X faster)
\  12 bytes smaller
\ ITC Forth runs equivalent program in 47.7 seconds. 4X slower

TARGET
VARIABLE X
VARIABLE Y
VARIABLE Z

FFFF CONSTANT LOOPS

PROG: DEMO5
        LOOPS
        BEGIN
           1-
        WHILE
          -3 # X +!
           Y 1+!
           X @ Y @  +  Z !
        REPEAT
        DROP
        NEXT,            \ return to Camel99 Forth
END.  \ end directive test program size, tests for stack junk

I have factored out setting the VDP address into the word VDPA!. With this it becomes possible to take advantage of the VDP's auto-incrementing address feature.

So the code for TYPE as a primitive operation is below. (It is up to you to set the write bit on the address if you are setting the address for writing, but that is easy.)

TYPE is a mixture of machine Forth and Forth Assembler which is really handy.

: TYPE  ( addr len )
       *SP+ AREG MOV,
        R3 8C00 LI,          \ 12% faster to use a register
        BEGIN
            *AREG+  R3 ** MOVB,
            1-
        -UNTIL
        DROP ;
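
If the auto-increment idea is new to you, this little Python model may help. It is a simplification I wrote for illustration, not emulator code: the `Vdp` class and the function names are invented, and the write bit (>4000) is simply masked off. The point is that the address is sent once, low byte first, and after that every data write bumps the address by itself.

```python
class Vdp:
    """Minimal model of the VDP address and data ports (hypothetical)."""
    def __init__(self):
        self.ram = bytearray(0x4000)   # 16K of VDP RAM
        self.addr = 0
        self._low = None               # first byte of a 2-byte address

    def write_address_port(self, byte):
        if self._low is None:
            self._low = byte & 0xFF    # first write: low byte
        else:
            word = ((byte & 0xFF) << 8) | self._low
            self._low = None
            self.addr = word & 0x3FFF  # drop the write bit in this model

    def write_data_port(self, byte):
        self.ram[self.addr] = byte & 0xFF
        self.addr = (self.addr + 1) & 0x3FFF   # auto-increment

def vdpa_store(vdp, vaddr):
    """Like VDPA!: send low byte, then high byte."""
    vdp.write_address_port(vaddr & 0xFF)
    vdp.write_address_port(vaddr >> 8)

def type_text(vdp, text):
    """Like TYPE: only data writes, no address set per character."""
    for ch in text:
        vdp.write_data_port(ch)

vdp = Vdp()
vdpa_store(vdp, 0x4000)            # screen address 0 + write bit
type_text(vdp, b"Hello World! ")
type_text(vdp, b"Hello World! ")   # continues where the first one stopped
print(vdp.ram[:26].decode("ascii"), vdp.addr)
```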

Here is how this new TYPE is used in a test program:

\ hello world in machine Forth Demo     Sept 23 2021  Fox
\ compiles to 128 bytes

COMPILER              \ Use compiler wordlist (for interpreted words)
   NEW.
   HEX A000 ORIGIN.
   OPT-ON

TARGET                 \ Use TARGET wordlist (to compile code)
HEX 8C02 EQU VDPWA     \ Write Address port
HEX 8C00 EQU VDPWD     \ Write Data port

CREATE TXT  S" Hello World! " S,

: VDPA! ( Vaddr -- ) \ set vdp address (read mode)
        0 LIMI,
        TOS SWPB,
        TOS VDPWA @@ MOVB,
        TOS SWPB,
        TOS VDPWA @@ MOVB,
        DROP
;

HEX
PROG: MAIN
      0 LIMI,           \ disable interrupts
      8300 WORKSPACE    \ Fast ram for registers
      83BE RSTACK       \ and return stack
      83FE DSTACK       \ and Data stack

      4000 # VDPA!      \ initial screen address + write bit
DECIMAL
      50 #
      FOR
         TXT COUNT TYPE  \ VDP auto increments
      NEXT
      BEGIN AGAIN        \ loop forever
END.

COMPILER SAVE DSK2.HELLO4

That's all for now.

GDMike · September 24, 2021

That's really a considerable speed difference.

+TheBF · September 24, 2021

Ya, it makes a big difference when you remove even a few instructions from a small loop.

Here is another test I just did. It is a benchmark found by @speccery.

I redid the timings on my machine with Lee and Mark's systems so everything was on the same classic99 version and on the same machine.

This is using my latest kernel which I have not released. It seems to be a bit faster than previous versions.


DECIMAL
: FIB2      0 1 ROT 0 DO   OVER + SWAP    LOOP DROP ;
: FIB2-BENCH   1000 0 DO   I FIB2 DROP   LOOP ;

         Normal       INLINE[ OVER + SWAP ]    BOUNDS
--------------------------------------------------------
TForth    1:46
Camel99   1:51             1:19                 0:59
FbForth   1:53
MachForth 0:43
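
For readers who don't read stack code easily, here is FIB2 from above traced step by step in Python. This is my own model, assuming 16-bit cells (so results wrap at 65536) and only covering n >= 1; the names are invented.

```python
MASK = 0xFFFF   # 16-bit cells on the 9900

def fib2(n):
    """Model of: 0 1 ROT 0 DO  OVER + SWAP  LOOP DROP   (for n >= 1)"""
    stack = [0, 1]                  # 0 1 ROT leaves 0 1 n; DO consumes n
    for _ in range(n):              # 0 DO ... LOOP
        stack.append(stack[-2])     # OVER
        b = stack.pop()
        a = stack.pop()
        stack.append((a + b) & MASK)                 # +
        stack[-1], stack[-2] = stack[-2], stack[-1]  # SWAP
    stack.pop()                     # DROP
    return stack.pop()

print([fib2(n) for n in range(1, 11)])
```

Each pass turns the pair (x y) into (x+y x), so the value left after the final DROP is the nth Fibonacci number, modulo 65536.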

The test program in MachForth had to have a DO LOOP added to it. I copied it into the test program because I was debugging it tonight.

I want to see if I can make it faster by using the simpler FOR NEXT which is how MachForth would do it natively.

Edit: Removed bad comment

\ fibonacci benchmark in Camel Forth

COMPILER \ Set up environment
   HEX
   NEW.
   2000 ORIGIN.
  OPT-ON

TARGET
\ Machine Forth does not have DO/LOOP

\ setup parameters on return stack
H: (DO)  
          R0  8000 LI,      \ load "fudge factor" to LIMIT
         *SP+ R0  SUB,      \ Pop limit, compute 8000h-limit "fudge factor"
          R0  TOS ADD,      \ loop ctr = index+fudge
          R0 RPUSH,
          TOS RPUSH,
          TOS DPOP,          \ refill TOS
;H

H: DO ( limit indx -- )  (DO)  BEGIN   ;H

H: UNLOOP    RP 4 AI, ;H

H: LOOP ( addr --)
        *RP INC,            \ increment the index number
( addr) THERE 0 JNO, <BACK  \ compute, compile the jump
        UNLOOP              \ clean the return stack 2 items
;H

H: +LOOP   TOS *RP ADD,  TOS DPOP,  LOOP ;H
H: I       TOS DPUSH,  *RP TOS MOV,  2 (RP) TOS SUB,  ;H

\ Machine Forth doesn't normally have ROTate, we have to create one.
: ROT  ( n1 n2 n3 --  n2 n3 n1)
        2 (SP)   R0 MOV,
       *SP   2 (SP) MOV,
        TOS     *SP MOV,
        R0      TOS MOV,
;

DECIMAL
: FIB
     0 #  1 #  ROT 0 #
     DO
       OVER + SWAP
     LOOP
     DROP
;

PROG: MAIN
     1000 # 0 #
     DO
        I FIB DROP
     LOOP
     NEXT,      \ Return to Forth
END.
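
The "fudge factor" in (DO) and LOOP above can look like magic, so here is how I read it, modelled in Python. This is a sketch with invented names, not a translation of the real code. The limit is turned into an offset so that the counter hits >8000 exactly when the index reaches the limit. Incrementing >7FFF sets the 9900 overflow flag, which is what JNO tests.

```python
MASK = 0xFFFF

def do_loop_indices(limit, index):
    """Return the values I would give for: limit index DO ... LOOP"""
    fudge = (0x8000 - limit) & MASK    # 8000h - limit
    ctr = (index + fudge) & MASK       # loop counter = index + fudge
    seen = []
    while True:
        seen.append((ctr - fudge) & MASK)   # I = counter - fudge
        overflow = (ctr == 0x7FFF)          # INC from 7FFF to 8000 overflows
        ctr = (ctr + 1) & MASK              # *RP INC,
        if overflow:                        # JNO falls through: loop ends
            break
    return seen

print(do_loop_indices(5, 0))
print(do_loop_indices(1000, 997))
```

One side effect of this scheme in the model: when limit and index are equal (for example `0 0 DO`), the counter has to wrap all the way around, so the body runs 65536 times.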
