TheBF Posted April 7, 2021

For the curious, this is the same program as the previous one, except that it uses a FOR NEXT loop structure, which is just a down-counter with the index held on the return stack. This is the output code with comments. You can see how the 9900 instructions map onto Forth quite well.

Spoiler
2018  0646  dect R6            ; 3000 # ( pushes R4 accumulator 1st)
201A  C584  mov  R4,*R6
201C  0204  li   R4,>3000
2020  0647  dect R7
2022  C5C4  mov  R4,*R7        ; FOR ( loop index on return stack)
2024  C136  mov  *R6+,R4       ; DROP
2026  0646  dect R6            ; AAAA #
2028  C584  mov  R4,*R6
202A  0204  li   R4,>AAAA
202E  0646  dect R6            ; DUP
2030  C584  mov  R4,*R6
2032  C204  mov  R4,R8         ; SWAP
2034  C116  mov  *R6,R4
2036  C588  mov  R8,*R6
2038  0646  dect R6            ; OVER
203A  C584  mov  R4,*R6
203C  C126  mov  @>0002(R6),R4
2040  06A0  bl   @>2004        ; CALL ROT
2044  C136  mov  *R6+,R4       ; DROP
2046  0646  dect R6            ; DUP
2048  C584  mov  R4,*R6
204A  0556  inv  *R6           ; AND
204C  4136  szc  *R6+,R4
204E  0646  dect R6            ; DUP
2050  C584  mov  R4,*R6
2052  E136  soc  *R6+,R4       ; OR
2054  0646  dect R6            ; DUP
2056  C584  mov  R4,*R6
2058  2936  xor  *R6+,R4       ; XOR
205A  0584  inc  R4            ; 1+
205C  0604  dec  R4            ; 1-
205E  05C4  inct R4            ; 2+
2060  0644  dect R4            ; 2-
2062  0A14  sla  R4,1          ; 2*
2064  0814  sra  R4,1          ; 2/
2066  0504  neg  R4            ; NEGATE
2068  0744  abs  R4            ; ABS
206A  A136  a    *R6+,R4       ; +
206C  0646  dect R6            ; 2 #
206E  C584  mov  R4,*R6
2070  0204  li   R4,>0002
2074  C0F6  mov  *R6+,R3       ; *
2076  38C4  mpy  R4,R3
2078  C136  mov  *R6+,R4       ; DROP
207A  0617  dec  *R7           ; NEXT
207C  18D4  joc  >2026
207E  05C7  inct R7
2080  045A  b    *R10          ; NEXT, (return to ITC Forth)
Lee Stewart Posted April 7, 2021

Is it usual for the FOR limit to not be consumed?

...lee
TheBF Posted April 7, 2021

Lee Stewart said: Is it usual for the FOR limit to not be consumed?

The DROP following FOR does that. Remember that this system uses R4 as a cache for the top of stack, so DROP always refills R4 from the memory stack. The return stack works as a normal stack in memory, so the inct R7 after the loop removes the limit from the return stack:

207A  0617  dec  *R7           ; NEXT
207C  18D4  joc  >2026
207E  05C7  inct R7

Unless you have found something I am completely missing, which has happened before, that is how I think it should work.
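To make the down-counter behaviour concrete, here is a small Python model of the dec/joc pair shown in the listing. It is only a sketch of my reading of the 9900 carry flag (DEC carries unless the value was already zero), and the function name `for_next` is made up for illustration.

```python
def for_next(n):
    """Count how often the body of `n FOR ... NEXT` runs in this model."""
    index = n & 0xFFFF                 # FOR pushes the index on the return stack
    runs = 0
    while True:
        runs += 1                      # loop body executes
        carry = index >= 1             # dec adds >FFFF: carry unless index was 0
        index = (index - 1) & 0xFFFF   # dec *R7
        if not carry:                  # joc falls through, inct R7 cleans up
            break
    return runs


print(for_next(0), for_next(11))
```

Under that assumption the body runs n+1 times, which would fit the later `COUNT 1- FOR` usage in the thread, where the length is decremented before entering the loop.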
GDMike Posted April 7, 2021

Not a bad thing to happen.
TheBF Posted April 8, 2021

GDMike said: Not a bad thing to happen.

Indeed not. Lee has found so many bugs in my code I want to start calling him "Raid".
GDMike Posted April 8, 2021

TheBF said: Indeed not. Lee has found so many bugs in my code I want to start calling him "Raid".

Mine too, no matter what I was doing. The eyes have it with him. Lol
TheBF Posted April 8, 2021

POP/PUSH Optimization

This is something that I know should be part of a good Forth native code compiler, but I always created bugs when I tried it in the past. I think I have it working now, so I am going to explain it again, to myself and to anyone who cares to read about it, just to confirm my logic.

When you run a Forth machine with a cache register for the top of stack element, many Forth instructions end with an instruction to refill the cache register. This is effectively a DROP function in the Forth machine because you are POPPING the stack into the register.

Other Forth instructions need to use the cache register when they start, so they push the cache register onto the stack in memory first thing. This is effectively a DUP instruction on the Forth machine.

If a Forth instruction that ends with a DROP is followed immediately by an instruction that starts with a DUP, that is three useless instructions that just thrash the top element of the stack. Three extra instructions on the 9900 can really slow things down, especially inside a loop.

The solution was a "SMARTDUP" and I think I have the logic correct this time.

Spoiler
\ ************* optimizable operations ***************
COMPILER
: D= ( d d -- ?) ROT = -ROT = AND ;

: 1LOOKBACK ( n -- ? ) THERE 1 CELLS - @ = ;
: 2LOOKBACK ( d -- ? ) THERE 2 CELLS - 2@ D= ;
: REMOVE ( n -- ) CELLS NEGATE TALLOT ; \ remove n cells from program

: ADUP C584 0646 ; \ DUP is 2 instructions, 4 bytes

: !,    TOS SWAP @@ MOV, ;
: DROP, TOS POP, ;
: DUP,  TOS PUSH, ;
: C!,   TOS SWPB, TOS SWAP @@ MOVB, ;

\ POP/PUSH optimization:
\ Some words refill the stack with DROP. If the next word does a DUP
\ we should not have compiled the DROP, so SMARTDUP removes it.
COMPILER
: SMARTDUP
  OPTIMIZER @
  IF 0C136 1LOOKBACK \ did we just emit a drop?
     IF 1 REMOVE     \ YES, so remove it
     ELSE DUP,       \ NO, so we must DUP
     THEN
  ELSE DUP,          \ regular DUP is compiled
  THEN ;

TARGET
: ! ( n variable --)
  [CC] OPTIMIZER @
  IF ADUP 2LOOKBACK  \ look back for ADUP
     IF 2 REMOVE !,
     ELSE !, DROP,   \ un-optimized
     THEN
  ELSE !, DROP,      \ un-optimized
  THEN ;

TARGET
: C! ( c variable --)
  [CC] OPTIMIZER @
  IF ADUP 2LOOKBACK  \ look back for DUP
     IF 2 REMOVE C!, \ optimized
     ELSE C!, DROP,  \ un-optimized
     THEN
  ELSE C!, DROP,     \ un-optimized
  THEN ;

Using these concepts I also optimized ! and C! for expressions like:

1234 DUP X !

Since ! (store) consumes both of its arguments, it always ends with a DROP. I look back 2 cells in the program and if I find a DUP I can remove that DUP, since 1234 is sitting in R4 ready to go. And since I removed the DUP, I don't need the DROP after I store the number in X.

* ADUP in the code has the instructions in reverse order to match the way 2@ reads memory in 2LOOKBACK.
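For anyone who wants to poke at the idea outside the cross-compiler, here is a toy Python model of the look-back check. It is a sketch, not the author's code: the program is just a list of 16-bit cells, the opcodes are the ones that appear in the thread's listings, and `smart_dup` is a stand-in name.

```python
# Toy model of the pop/push peephole check on a list of 16-bit cells.
DROP = [0xC136]          # mov *R6+,R4
DUP = [0x0646, 0xC584]   # dect R6 / mov R4,*R6


def smart_dup(code, optimize=True):
    """Append a DUP, unless the previous cell is a DROP that cancels it."""
    if optimize and code[-1:] == DROP:
        del code[-1:]        # erase the DROP and emit nothing
    else:
        code.extend(DUP)     # regular two-instruction DUP
    return code


# A word that ends with DROP (here "+" followed by DROP), then a DUP.
plain = smart_dup([0xA136] + DROP, optimize=False)
tight = smart_dup([0xA136] + DROP, optimize=True)
saved_bytes = 2 * (len(plain) - len(tight))
print(saved_bytes)
```

One caveat of matching raw cells like this: a data word that happens to equal the DROP opcode (for example the operand of an li) would also match, so a real implementation may need to know where instruction boundaries are.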
senior_falcon Posted April 8, 2021

TheBF said: Indeed not. Lee has found so many bugs in my code I want to start calling him "Raid".

The nose knows!
TheBF Posted April 8, 2021

MACHFORTH is getting closer to being useful

This little program relocates the code to load at >A000, and it also steals the entire scratchpad for the Forth stacks and workspace. And it successfully saves the image to disk.

\ MFORTH DEMO #1b Use new workspace and stacks, save binary program
\ If running on Classic99 you will see R4 counting down
\ This demo shows:
\ - compile to >A000 origin
\ - create workspace and both stacks in scratchpad memory
\ - saves a finished program that can RUN from E/A Option 5
COMPILER
   NEW.
   HEX A000 ORIGIN.
   INCLUDE DSK2.BYE   \ a little code to exit program

TARGET
PROG: DEMO1
   0 LIMI,            \ disable interrupts to take over the machine
   8300 WORKSPACE
   8380 RSTACK
   8400 DSTACK
   FFFF #
   BEGIN
      1-              \ decrement top of data stack
   -UNTIL             \ -UNTIL DOES NOT consume the stack parameter
   DROP               \ clean up the stack
   BYE                \ Return to TI title screen
END.

SAVE DSK2.DEMO1C
TheBF Posted September 7, 2021

Sometimes I miss the oh so obvious.

I asked a question over on the GCC topic. Lots of help came back. @Tursi generously sent me a hello world program in C with all the trimmings. Thanks @Tursi

Here is the program:

// tiny hello world
// assume started from Editor/Assembler and so screen is already set up and cleared

// Write Address/Register
#define VDPWA *((volatile unsigned char*)0x8C02)
// Write Data
#define VDPWD *((volatile unsigned char*)0x8C00)

int main() {
    // define the string
    unsigned char *pTxt = "Hello World";

    // set the VDP address
    VDPWA = 0x00; // LSB
    VDPWA = 0x40; // MSB + "write" bit

    // now display the string. No need for delays cause 8-bit TI code
    while (*pTxt) {
        VDPWD = *(pTxt++);
    }

    // now just spin until the user resets us. Use inline assembly to enable interrupts
    __asm__("LIMI 2");
    for (;;) { } // spin forever

    // not reached
    return 0;
}

These two lines made me realize something I had never considered about interfacing to the TMS9918:

    // set the VDP address
    VDPWA = 0x00; // LSB
    VDPWA = 0x40; // MSB + "write" bit

These things are everyday for people who work in IT, but it was a revelation for me. Of course the '=' (assignment) operator in C can write to a memory mapped data port. It's just memory after all. I was stuck in the rut of thinking I needed to make a sub-routine to set VDP addresses.

Forth is a LOAD/STORE machine and the assignment operator for a byte is called 'c-store' (char store), written C!. So I translated this C program to my latest iteration of Machine Forth and it looks like this.

Deviations from standard Forth:
There are compiler directives to control the compiler, and TARGET makes things point to the memory image where the program is created.
The # operator handles literal numbers, loading them into R4, the top of stack cache.
#C! handles the address parameter with symbolic addressing to store a byte (versus taking the address from the Forth data stack).

\ tiny hello world in machine Forth
\ Translated from hello.c by Tursi for comparison
\ assume started from Editor/Assembler and so screen is already set up and cleared
COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET   \ code for the target binary program
\ Write Address port
HEX 8C02 EQU VDPWA
\ Write Data port
HEX 8C00 EQU VDPWD

\ define the string
CREATE TXT S" Hello World!" S,

HEX
PROG: MAIN
\ this compiler is dumb, so we need to setup the machine manually
   0 LIMI,
   3F00 WORKSPACE
   FE00 RSTACK
   FF00 DSTACK

   0 # VDPWA #C!    \ character store VDP address LSB
   40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

\ now display the string. No need for delays cause its 8-bit TI code
   TXT # COUNT 1- FOR COUNT VDPWD #C! NEXT
   DROP

\ Use Forth Assembler to enable interrupts
   2 LIMI,
\ Return to Forth for convenience
   8300 WORKSPACE
   NEXT,
END.

The C program compiles to 118 bytes. The Machine Forth program is 128 bytes with the push/pop optimizer OFF and 117 bytes with OPT-ON. Not too shabby for a very naïve compiler.

The C 'while' loop is very impressive. Machine Forth has an 'address' register with auto-increment that might let me do something similar.

The spoiler has the listing from Classic99. I manually added the 'Hello World!' string and the branch to explain it better. There are still some optimizations that could be made, but that will get hard real fast since the compiler is really dumb.

Spoiler
\ hello.fth machine Forth code with push/pop optimizer on
\ R4  is the 'cache register' for the top-of-stack in the Virtual machine
\ R6  DATA stack pointer
\ R7  Return stack pointer
\ R8  FOR/NEXT loop index
\ R10 points to code to return to Forth
2000        B    @>2012
2004        BYTE >0C
2005        TEXT 'Hello World!'
            EVEN
2012  0300  limi >0000
2016  02E0  lwpi >3f00        \ 3F00 WORKSPACE
201A  0207  li   R7,>fe00     \ FE00 RSTACK
201E  0206  li   R6,>ff00     \ FF00 DSTACK
2022  0646  dect R6           \ DUP
2024  C584  mov  R4,*R6
2026  0204  li   R4,>0000     \ 0 #
202A  06C4  swpb R4
202C  D804  movb R4,@>8c02    \ VDPWA #C!
2030  0204  li   R4,>0040     \ 40 #
2034  06C4  swpb R4
2036  D804  movb R4,@>8c02    \ VDPWA #C!
203A  0204  li   R4,>2004     \ TXT #
203E  0646  dect R6           \ COUNT
2040  C584  mov  R4,*R6
2042  0596  inc  *R6
2044  D114  movb *R4,R4
2046  0984  srl  R4,8
2048  0604  dec  R4           \ 1-
204A  0647  dect R7           \ FOR
204C  C5C8  mov  R8,*R7
204E  C204  mov  R4,R8
2050  C136  mov  *R6+,R4
2052  0646  dect R6           \ COUNT
2054  C584  mov  R4,*R6
2056  0596  inc  *R6
2058  D114  movb *R4,R4
205A  0984  srl  R4,8
205C  06C4  swpb R4
205E  D804  movb R4,@>8c00    \ VDPWD #C!
2062  C136  mov  *R6+,R4
2064  0608  dec  R8           \ NEXT
2066  18F5  joc  >2052
2068  05C7  inct R7
206A  C136  mov  *R6+,R4      \ DROP
206C  0300  limi >0002
2070  02E0  lwpi >8300
2074  045A  b    *R10
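The two-byte address setup in the hello program can be written out as a tiny Python helper. This is only an illustrative sketch: `vdp_write_address_bytes` is a made-up name, and the only facts used are the ones in the post (low byte first, then the high byte with the 0x40 "write" bit) plus the 14-bit VDP address range.

```python
def vdp_write_address_bytes(vaddr):
    """Return the two bytes sent to the VDP address port, in order."""
    if not 0 <= vaddr <= 0x3FFF:
        raise ValueError("VDP address must fit in 14 bits")
    low = vaddr & 0xFF                # first write: LSB
    high = (vaddr >> 8) | 0x40        # second write: MSB + "write" bit
    return [low, high]


print([hex(b) for b in vdp_write_address_bytes(0x0000)])
print([hex(b) for b in vdp_write_address_bytes(0x0300)])
```

For address 0 this gives the same 0x00, 0x40 pair that both the C program and the Machine Forth version store to VDPWA.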
TheBF Posted September 8, 2021

Chuck Moore added the 'A' register to his virtual machine in machine Forth. It's just a temp register that lets you use auto-increment and, on his CPU, auto-decrement. It can save a lot of stack juggling. The idea is great for the 9900, but making a syntax that I like might take me a while. :)

I re-wrote Hello to be a bit more idiomatic Forth by adding a TYPE sub-routine. The loop is much tighter, but I had to resort to some Assembly language to really make it good.

Here is a simple TYPE in Machine Forth with Assembly language:

: TYPE ( Caddr len -- )   \ Mixing machine Forth and Assembler for best use
   *SP+ AREG MOV,         \ pop address into Address register ie: R9
                          \ len remains in TOS register
   0 LIMI,
   BEGIN
      *AREG+ VDPWD @@ MOVB,
      1-                  \ dec TOS
   -UNTIL                 \ until tos=0
   DROP
;

Here is the code that machine Forth emitted:

\ type sub-routine code
        dect R7              \ enter sub-routine saves R11 on return stack
        mov  R11,*R7
        mov  *R6+,R9         \ pop the string address to AREG ie: R9
        limi >0000
>201C:  movb *R9+,@>8c00     \ write a byte to VDP port
        dec  R4              \ dec the length
        jne  >201c           \ loop until len=0
        mov  *R6+,R4         \ drop
        mov  *R7+,R11        \ POP r11
        b    *R11            \ return

To use it I did:

\ display the string with a sub-routine
TXT # COUNT TYPE

Although using *SP+ is more efficient for the 9900, this is still not how Chuck envisioned using the A register. I think TYPE should be more like:

: TYPE ( Caddr len -- )
   0 LIMI,
   SWAP A!
   BEGIN
      AC@+ VDPWD C!
      1-                  \ dec TOS
   -UNTIL                 \ until tos=0
   DROP
;

But the original Machine Forth did not have SWAP. Not sure how that worked. :)

I will continue refining how best to use this Machine Forth concept while adapting it to the 9900 instruction set. For example, SWAP might just be an alias for *SP or something like that.

Spoiler shows the Forthier version of Hello.

\ tiny hello world in machine Forth Demo 2
\ Creates a TYPE sub-routine with ASM and Address register
COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET   \ code for the target binary program
\ Write Address port
HEX 8C02 EQU VDPWA
\ Write Data port
HEX 8C00 EQU VDPWD

\ define the string
CREATE TXT S" Hello World!" S,

: TYPE ( Caddr len -- )   \ Mixing machine Forth and Assembler for best use
   *SP+ AREG MOV,         \ pop address into Address register ie: R9
                          \ len remains in TOS register
   0 LIMI,
   BEGIN
      *AREG+ VDPWD @@ MOVB,
      1-                  \ dec TOS
   -UNTIL                 \ until tos=0
   DROP
;

HEX
PROG: MAIN
\ this compiler is dumb, so we need to setup the machine manually
   0 LIMI,
   3F00 WORKSPACE
   FE00 RSTACK
   FF00 DSTACK

   0 # VDPWA #C!    \ character store VDP address LSB
   40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

\ display the string with a sub-routine
   TXT # COUNT TYPE

\ Return to Forth
   8300 WORKSPACE
   NEXT,
END.
TheBF Posted September 8, 2021

Smoke is clearing a little...

I think the big deal for me is to get this thing into stable form and not try to optimize everything all at once.

I reverted to the way things worked early on, where all data creating words push their address onto the data stack when invoked. This is just like normal Forth, but uses the LI instruction. The PUSH/POP optimizer makes this practical; otherwise it would always be three instructions to load R4.

(?DPUSH is the PUSH/POP optimizer. It looks back to see if there was a DROP in the previous cell and if so, it erases the DROP and does not DUP. This is a big improvement (essential?) in reducing code size in a stack machine where the TOS is cached in a register.)

\ Machine Forth data structure creation
COMPILER
: LIT, ( n -- ) ?DPUSH TOS SWAP LI, ;

: CONSTANT ( n -- n)       \ create the compiler's constant
   CREATE ,                \ remember the value
   DOES> ( pfa ) @ LIT, ;  \ compile constant as a literal no.

: CREATE ( -- addr)
   CREATE CHERE ,          \ remember the target address
   DOES> @ LIT, ;          \ pushes address onto stack

: VARIABLE ( -- addr) CREATE 0000 T, ;

The next thing was to make a better COUNT word, which I renamed $@ (string fetch) since it leaves ONLY the length on the data stack but puts the string address into register A. I can see just from this how using the A register simplifies things. At the moment it is inline code. At four instructions it could be a sub-routine if it was used a great deal.

\ ** WARNING ** puts the string address in register A
: $@ ( Caddr -- len) ( A: Caddr+1)
   TOS AREG MOV,   \ base address to register A
   AREG INC,       \ bump address past count byte
   C@              \ fetch byte count onto data stack
;

With these changes and putting TYPE inline, which is more like the C version, the hello program shrank to 85 bytes! (and it still worked) Getting closer...

New source code

Spoiler
\ tiny hello world in machine Forth Demo with $@   Sept 8 2021 Fox
\ compiles to 85 bytes
COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET   \ code for the target binary program
HEX 8C02 EQU VDPWA   \ Write Address port
HEX 8C00 EQU VDPWD   \ Write Data port

CREATE TXT S" Hello World!" S,

HEX
PROG: MAIN
\ setup Forth machine
   0 LIMI,
   3F00 WORKSPACE
   3D00 RSTACK
   3E00 DSTACK

   0 # VDPWA #C!    \ character store VDP address LSB
   40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

   TXT $@ BEGIN *AREG+ VDPWD @@ MOVB, 1- -UNTIL
   DROP

\ Return to Forth
   8300 WORKSPACE
   NEXT,
END.

Emitted code

Spoiler
2012  0300  limi >0000
2016  02E0  lwpi >3f00
201A  0207  li   R7,>3d00
201E  0206  li   R6,>3e00
2022  0646  dect R6
2024  C584  mov  R4,*R6
2026  0204  li   R4,>0000
202A  06C4  swpb R4
202C  D804  movb R4,@>8c02
2030  0204  li   R4,>0040
2034  06C4  swpb R4
2036  D804  movb R4,@>8c02
203A  0204  li   R4,>2004
203E  C244  mov  R4,R9
2040  0589  inc  R9
2042  D114  movb *R4,R4
2044  0984  srl  R4,8
2046  D839  movb *R9+,@>8c00
204A  0604  dec  R4
204C  16FC  jne  >2046
204E  C136  mov  *R6+,R4
2050  02E0  lwpi >8300
2054  045A  b    *R10
TheBF Posted September 8, 2021

If you've optimized one thing...

Working at compile time is really interesting. If you understand how to detect a situation that you don't like, it's easy to remove it and replace it with different code. This is new to me.

So I have a smart DUP that detects if there was a DROP in the previous instruction. This can save 6 bytes whenever two Forth primitives are connected together where the first one ends with DROP and the second one starts with DUP (to make room in R4).

Here is how the optimizer looks:

\ pop/push optimizer
HEX C136 CONSTANT 'DROP'            \ machine code for DROP

: DUP, ( n -- n n) TOS DPUSH, ;     \ normal dup
: LOOKBACK ( -- u) THERE 2- @ ;     \ fetch previous instruction code

: OPT-DUP, ( n -- n ?n)             \ SMART dup
   LOOKBACK 'DROP' =                \ look back for DROP
   IF   -2 TALLOT                   \ move target dictionary back 1 cell
   ELSE DUP,
   THEN ;

* TALLOT is like ALLOT but operates on the target memory image

I wanted to see how I could remove the assembly language in the Hello program print loop but continue to use the A register. I have landed on using 9900 type syntax, so the A register looks like a 9900 register in the Forth code but with extra characters that are from Forth.

A@ fetches register A to the top of the data stack.
A! stores the top of the data stack into the A register.
*A@ means fetch A, indirect address, to the top of the data stack.
*A@+ means fetch A, indirect with auto-incrementing.

This is different from Chuck Moore's CPU, but in order to get the performance out of the CPU we have to use its features.

\ A register Machine Operators for TMS9900
: A@   ( -- n)     ?DPUSH AREG TOS MOV, ;      \ Dpush(T) T=A
: *A@  ( -- n)     ?DPUSH *AREG TOS MOV, ;     \ Dpush(T) T=*A
: *A@+ ( -- n)     ?DPUSH *AREG+ TOS MOV, ;    \ Dpush(T) T=*A A=A+cell
: (A)@ ( u --)     ?DPUSH (AREG) TOS MOV, ;    \ Dpush(T) T=u@(A)

: #A!  ( addr --)  AREG SWAP LI, ;             \ load A with literal number BF addition
: A!   ( addr -- ) TOS AREG MOV, DROP ;        \ A! A=T Dpop(T)
: *A!  ( addr)     TOS *AREG MOV, DROP ;       \ !A [A]=T Dpop(T)
: *A!+ ( n --)     TOS *AREG+ MOV, DROP ;      \ !A+ [A]=T A=A+cell Dpop(T)
: (A)! ( n --)     TOS SWAP (AREG) MOV, DROP ;
\ addr A-plus-store for versatility.
: A+!  ( n -- )    TOS AREG ADD, DROP ;

Chuck's machine did not have byte access, so he did it in his code as needed. That's not right for the 9900, so I have these byte-wise operators, again with the 9900 addressing modes.

\ added byte operations. BFox
: *AC@  ( -- 0c00) ?DPUSH *AREG TOS MOVB, TOS 8 SRL, ;
: *AC@+ ( -- 0c00) ?DPUSH *AREG+ TOS MOVB, TOS 8 SRL, ;
: *AC!  ( 0c00 --) 1 (TOS) *AREG MOVB, DROP ;
: *AC!+ ( 0c00 --) 1 (TOS) *AREG+ MOVB, DROP ;

A problem arises when you do this:

*AC@+ VDPWD #C!

As seen above, *AC@+ ends with the SRL instruction to move the byte into the low half of TOS (ie: R4). But the #C! operator is this:

: #C! ( c addr --) TOS SWPB, TOS SWAP @@ MOVB, DROP ;

So we move the byte to one side only to swap it back to the other side. So I replaced TOS SWPB, with ?SWPB,

\ swap byte optimizer
: ?SWPB, ( n -- n)
   LOOKBACK 0984 =      \ look back for "SRL R4,8"
   IF   -2 TALLOT       \ remove SRL
   ELSE TOS SWPB,       \ we need SWPB
   THEN ;

It seems to work and the program is still pretty efficient. It wastes a move into R4 versus using the single Assembly language instruction. So the actual program looks like this with only machine Forth:

PROG: MAIN
\ setup Forth machine
   0 LIMI,
   3F00 WORKSPACE
   3D00 RSTACK
   3E00 DSTACK

   0 # VDPWA #C!    \ character store VDP address LSB
   40 # VDPWA #C!   \ character store VDP address MSB + "write" bit

   TXT $@ BEGIN *AC@+ VDPWD #C! 1- -UNTIL
   DROP

   8300 WORKSPACE
   NEXT,
END.

It's not normal Forth, but I think you can write some pretty fast programs with it. The next thing to tackle is tail-call optimization.
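A quick way to convince yourself that the SRL/SWPB pair cancels out is to model the three register operations in Python. This is a sketch with made-up helper names; it relies only on the fact that MOVB with a register source sends the register's high byte.

```python
def srl8(word):          # SRL R4,8
    return (word >> 8) & 0xFFFF

def swpb(word):          # SWPB R4
    return ((word << 8) | (word >> 8)) & 0xFFFF

def movb_source(word):   # MOVB R4,@port sends the high byte of R4
    return word >> 8


for byte in (0x00, 0x48, 0xFF):
    r4 = (byte << 8) | 0x5A          # byte just loaded by MOVB, junk in low half
    long_way = movb_source(swpb(srl8(r4)))   # SRL, then SWPB, then MOVB
    short_way = movb_source(r4)              # MOVB straight away
    assert long_way == short_way == byte
print("SRL 8 followed by SWPB is redundant before MOVB")
```

So when the look-back finds the SRL, dropping it and skipping the SWPB sends the same byte to the port, which is the effect described for ?SWPB, above.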
TheBF Posted September 8, 2021

I may be getting the hang of this.

I am going to quote an article from ForthWrite Magazine, from the Forth Interest Group UK, June 2000, Special Issue. I used it for reference and the author, John Tasgal, explains this better than I could.

-----------
"Tail-Recursion Optimisation

In any definition the return action of the word before a semicolon, and of the semicolon itself, can always be compiled into a single return.

word1 ..... lastword ;

As nothing happens between lastword returning and ';' returning, the lastword return is superfluous. A more elaborate example is the recursive call at the end of a WHILE loop.

If we have a series of nested calls then the last instruction is in each case a return. At runtime this produces '; ; ; ; ;' viz. a sequence of returns. The point is that when these calls unwind all that happens is that a sequence of returns are executed, one after the other. Nothing is done between them. The only necessary return is the first one pushed onto the return stack (and so the last to be executed). Removing these superfluous returns is known as tail-recursion optimisation. Most Machine Forth compilers (and also Color Forth) contain a 'tail-recursion optimiser'."
-----------

Machine Forth has a special semi-colon for this purpose called -;

Like most things Forth, it is up to you to use it where you want to. That would be whenever the last word in a definition is a COLON definition, ie: a sub-routine. It won't work if the last item is a constant, a variable or an inline primitive word, for example.

Here is how I implemented -; and it seems to work.
(H: and ;H are aliases for Camel99's (the Host) colon and semi-colon so I can keep my head on straight)

\ tail call removal semi-colon
H: -; ( -- )
   LOOKBACK ( addr ) >R   \ fetch & save sub-routine address
   -8 TALLOT              \ remove the call sequence (go back 8 bytes)
   R> @@ B,               \ compile a branch to the sub-routine
;H

Here is the test program that showed it working. It saves 32 bytes using tail-call optimization, which is a welcome bonus, and on the TI-99 that's 16 instructions of speed improvement too!

\ tail-call optimization test program   Sept 8 2021 Fox
COMPILER
   NEW.
   HEX 2000 ORIGIN.
   OPT-ON

TARGET   \ code for the target binary program
HEX 8C02 EQU VDPWA   \ Write Address port
HEX 8C00 EQU VDPWD   \ Write Data port

CREATE TXT S" Hello World!" S,

: HI
   0 # VDPWA #C!    \ character store VDP address LSB
   40 # VDPWA #C!   \ character store VDP address MSB + "write" bit
   TXT $@ BEGIN *AREG+ VDPWD @@ MOVB, 1- -UNTIL
   DROP
;

: LEVEL4 HI -;
: LEVEL3 LEVEL4 -;
: LEVEL2 LEVEL3 -;
: LEVEL1 LEVEL2 -;

HEX
PROG: MAIN
\ setup Forth machine
   0 LIMI,
   3F00 WORKSPACE
   3D00 RSTACK
   3E00 DSTACK

   LEVEL1

   8300 WORKSPACE
   NEXT,
END.
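The shape of the tail-call rewrite can be shown with a very small Python model. This is a sketch under simplifying assumptions, not the -; word itself: instructions are symbolic (mnemonic, operand) tuples rather than 9900 cells, the names `end_definition` and "RT" are invented for the example, and the R11 save/restore that this compiler wraps around sub-routines is ignored.

```python
def end_definition(code, tail_call=True):
    """Close a word whose body is a list of (mnemonic, operand) tuples."""
    if tail_call and code and code[-1][0] == "BL":
        target = code.pop()[1]         # remove the trailing call...
        code.append(("B", target))     # ...and jump; the callee's return serves both
    else:
        code.append(("RT", None))      # ordinary return
    return code


normal = end_definition([("BL", "LEVEL4")], tail_call=False)
tail = end_definition([("BL", "LEVEL4")], tail_call=True)
print(normal)
print(tail)

# A word ending in an inline primitive still gets an ordinary return.
print(end_definition([("BL", "HI"), ("INC", "R4")]))
```

The last case mirrors the restriction in the post: the rewrite only applies when the final item in the definition is a call to another colon definition.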
GDMike Posted September 8, 2021

I always try to keep all my constant and variable DEFS up front and prior to my structured DEFS, i.e., loop DEFS and other words.
TheBF Posted September 8, 2021

GDMike said: I always try to keep all my constant and variable DEFS up front and prior to my structured DEFS, i.e., loop DEFS and other words.

From the human perspective that makes perfect sense. You can see all the data at a glance. From the TI-99 perspective our old computer doesn't really care. It's a wild memory model, however, with a lot of different types of memory in the system. Many modern machines force a separation of code and data memory, so there's that.
TheBF Posted September 8, 2021

Wait. Were you just making a joke?
GDMike Posted September 9, 2021

Nope, just using my 3 cents. OK, I'm being a smart ass, but I'm trying to follow what you're saying. Some of it I'm gonna have to look harder at, but I'm trying to follow. lol
TheBF Posted September 9, 2021

GDMike said: Nope, just using my 3 cents. OK, I'm being a smart ass, but I'm trying to follow what you're saying. Some of it I'm gonna have to look harder at, but I'm trying to follow. lol

Well, I have had trouble following it myself. Some of the advanced stuff started to fall into place in the last week or so. Here is a summary:

Normal Forth has a bunch of Assembly language words that do stuff ( DUP SWAP OVER + - * / etc). These things are always "called", so there is some overhead to make everything go, but they only take 2 bytes in your program every time you use a word.

Machine Forth does the opposite. It uses these same short pieces of Assembly code, but instead of calling them, it copies them into RAM one after another. No calling unless you want that.

The magic is that the Forth colon definition lets you record Forth Assembler code as a Forth word. When you run that word it will run the Assembler code, which writes the code into memory. This would be called a macro in a modern "macro-assembler" language.

So when I want machine Forth to do addition I make this:

: + ( n n -- n) *SP+ TOS ADD, ;

It does not RUN the addition when you type + in your machine Forth program. When you use + in a machine Forth program it is as if you had typed in the assembly language, so the code gets written into RAM. In this case that is a separate memory block, not part of Camel Forth.

Make a bit more sense?

The rest is the details of getting the @#$!# thing to make an actual EA5 program image.
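If it helps, the "words are macros that write code" idea can be mimicked in a few lines of Python. This is only an analogy, not how Camel99 or the cross-compiler is built: `target`, `emit` and the word functions are invented names, and the opcodes are copied from the listings earlier in the thread.

```python
target = []              # stand-in for the separate target memory image


def emit(*cells):
    """Append 16-bit cells to the target image, like the assembler words do."""
    target.extend(cells)


def dup():               # DUP: dect R6 / mov R4,*R6
    emit(0x0646, 0xC584)

def one_plus():          # 1+ : inc R4
    emit(0x0584)

def plus():              # like  : + ( n n -- n)  *SP+ TOS ADD, ;
    emit(0xA136)         # a *R6+,R4


# "Compiling" DUP 1+ + runs each word, and each word writes its code inline.
for word in (dup, one_plus, plus):
    word()

print([f"{cell:04X}" for cell in target])
```

Nothing is added or duplicated when the loop runs; the only effect is that four cells of machine code land in the image, which is the point made above about + not running at compile time.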
GDMike Posted September 9, 2021

Ok. Gotcha. That's making sense.
TheBF Posted September 10, 2021

It turns out it is hard for a compiler that fits in 18.3K to beat GCC performance.

I thought I would try Tursi's Sprite benchmark with this new compiler. GCC did this benchmark in 5 seconds.

This version in generic Forth ran in 27 seconds:

DECIMAL
( more direct translation of Tursi ASM code to Forth)
: TURSI.OPT
   100 0 DO
      239 0 DO I $301 VC! LOOP
      175 0 DO I $300 VC! LOOP
      0 239 DO I $301 VC! -1 +LOOP
      0 175 DO I $300 VC! -1 +LOOP
   LOOP
;

This version uses the Camel99 inline optimizer and ran in 20 seconds:

( optimize inner loop code)
: TURSI.INLINE
   100 0 DO
      INLINE[ 239 0 ] DO INLINE[ I $301 VC! ] LOOP
      INLINE[ 175 0 ] DO INLINE[ I $300 VC! ] LOOP
      INLINE[ 0 239 ] DO INLINE[ I $301 VC! -1 ] +LOOP
      INLINE[ 0 175 ] DO INLINE[ I $300 VC! -1 ] +LOOP
   LOOP
;

This version in Machine Forth ran in 15 seconds. It uses the A register on two loops because the FOR NEXT loop as envisioned by Chuck Moore is a down-counter.

100 #
BEGIN
\ using register A for up counting
   0 #A! 239 # FOR $301 VDPA! A@ VDPWD #C! A1+! NEXT
   0 #A! 175 # FOR $300 VDPA! A@ VDPWD #C! A1+! NEXT
\ for/next index is a down-counter
   239 # FOR $301 VDPA! I@ VDPWD #C! NEXT
   175 # FOR $300 VDPA! I@ VDPWD #C! NEXT
   1-
-UNTIL

When I made a macro for VDPA! (VDP address store) it ran in 10 seconds. That was the best I could do so far.

I have also added some incrementors/decrementors for the A register because they are native instructions on the 9900: A1+! A1-! A2+! A2-!

My push/pop optimizer failed in this test as well, so more sleuthing is required.

Edit: Got the optimizer working on A@ and I@. That got it down to 8 seconds with VDPA! as a macro.
+TheBF Posted September 10, 2021 Author

My pop/push optimizer problem seems to have been my logic on when to invoke it. It seems to work reliably now. In this little program it was used 8 times, which saved 48 bytes!

To be clear, "Forth" is not in the program. It's just native code glued together by Forth.

So here is a little video of how it works. There is still lots of work to do to make it something someone else could use, but I have always wanted to know more about Forth generating native code, so this is a bit of a personal victory. It pales in comparison to XB256, but it is a compiler that can generate fast code, so it could be "library enabled".

Here is the entire benchmark program, using some tricks so that it runs as fast as I can make it go. The video shows it built and run. You could run it from within Forth and return to Forth, but I wanted to show the EA5 creation function.

I gotta go make pizza. Happy weekend.

    \ Tursi sprite benchmark in Machine Forth  Sept 8 2021 Fox
    \ INCLUDE DSK2.MFORTH,FTH

    COMPILER
    NEW.
    HEX 2000 ORIGIN.

    TARGET
    OPT-ON
    INCLUDE DSK2.TINYVDP

    \ A few screen variables
    VARIABLE C/L
    VARIABLE C/SCR
    VARIABLE VMODE

    0380 CONSTANT CTAB   \ colour table VDP address

    HEX
    : GRAPHICS
       0 # CTAB 0 # VFILL
       0E0 # DUP 83D4 #C!  1 # VWTR
       0 #  2 # VWTR          \ set VDP screen page
       0E # 3 # VWTR
       01 # 4 # VWTR
       06 # 5 # VWTR
       01 # 6 # VWTR
       CTAB 10 # 10 # VFILL   \ charset colors
       27 # 7 # VWTR          \ screen color
       20 # C/L !
       300 # C/SCR !
       1 # VMODE !
       0 # 300 # 20 # VFILL   \ clear screen
    ;

    HEX
    : MAGNIFY ( mag-factor -- )
       83D4 #C@ 0FC # AND +  DUP 1 # VWTR  83D4 #C! ;

    : SPRITE0 ( char colr x y -- ) \ create a SPRITE, sp# = 0..31
       300 # VC!   \ set Y position
       301 # VC!   \ set X position
       303 # VC!   \ set the sprite color
       302 # VC!   \ set the character pattern to use
    ;

    \ *COMPILE time trick*
    \ Use HOST Forth to make VDP addresses with write bit set and pre-swapped
    HOST 300 4000 OR ><  TARGET CONSTANT $300
    HOST 301 4000 OR ><  TARGET CONSTANT $301

    \ We can use the Host Forth colon to make a macro
    H: VDPA! ( Vaddr -- ) \ set vdp address (read mode)
       TOS VDPWA @@ MOVB,
       TOS SWPB,
       TOS VDPWA @@ MOVB,
       DROP
    ;H

    : TURSI
       DECIMAL
       GRAPHICS
       42 # 4 # 0 # 0 # SPRITE0
       1 # MAGNIFY
       0 LIMI,
       100 # FOR
          \ using register A for up counting
          0 #A!  239 # FOR  $301 VDPA!  A@ VDPWD #C!  A1+!  NEXT
          0 #A!  175 # FOR  $300 VDPA!  A@ VDPWD #C!  A1+!  NEXT
          \ for/next index is a down-counter
          239 # FOR  $301 VDPA!  I@ VDPWD #C!  NEXT
          175 # FOR  $300 VDPA!  I@ VDPWD #C!  NEXT
       NEXT
       BEGIN AGAIN   \ loop forever
    ;

    \ prog: names the entry address for the images
    PROG: MAIN
       HEX
       8300 WORKSPACE
       3FDE DSTACK   ( 20 cells)
       3FB6 RSTACK
       TURSI         \ call the program
    END.

    COMPILER
    SAVE DSK2.TURSI

    CR ." Optimizations: " OPTS ?

Video: MachineForthTest.mp4
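The compile-time trick with `$300` and `$301` can be checked with a little arithmetic. The Python sketch below is only a model, with made-up function names. It assumes `><` is a 16-bit byte swap and that `MOVB` from a register sends the register's high byte, so the macro's MOVB, SWPB, MOVB sequence delivers the low address byte first and then the high byte carrying the write bit, which is the order the VDP expects.

```python
VDP_WRITE_BIT = 0x4000   # set in the address when the VDP is to be written


def swap_bytes(word):
    """Model of a 16-bit byte swap (what >< and SWPB are assumed to do)."""
    return ((word << 8) | (word >> 8)) & 0xFFFF


def preswapped_write_addr(addr):
    """Model of `addr 4000 OR ><` evaluated on the host at compile time."""
    return swap_bytes((addr | VDP_WRITE_BIT) & 0xFFFF)


def bytes_sent(tos):
    """Bytes the macro writes to the address port, in order.

    MOVB sends the high byte of TOS, SWPB swaps, MOVB sends again.
    """
    first = tos >> 8
    second = swap_bytes(tos) >> 8
    return first, second


if __name__ == "__main__":
    for addr in (0x300, 0x301):
        const = preswapped_write_addr(addr)
        lo, hi = bytes_sent(const)
        print(f"addr {addr:04X} -> constant {const:04X} -> sends {lo:02X} then {hi:02X}")
```

Doing the OR and the swap on the host means the target code pays for neither at run time, which matters inside a loop that sets the address on every pass.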
+TheBF Posted September 24, 2021 Author

I am slowly expanding the code for this machine Forth compiler and adding more hi-level code that makes the transition from standard Forth a heck of a lot easier (for me anyway).

I have decided it is simpler to just assume parameters are in the TOS register, per normal Forth. This keeps the syntax less "creative". I could add a literal stack to the compiler and make smarter decisions on data, but that increases the complexity of the compiler quite a bit, which is not in the spirit of machine Forth.

I have made some sense of how to make POP/PUSH optimizations work and it's not rocket science. It turns out that when a literal number, a constant or a variable is used, everything is reliable IF you PUSH the TOS register first (which is a DUP operation). This is how Forth expects things to be done. It seems you can't get fancy and try to optimize that first PUSH away. It will bite you. Maybe with much more complicated analysis it could be done, but it's probably above my pay grade.

After that it just works. Code that ends with a DROP instruction and is followed by a DUP will trigger removal of both the DROP and the DUP, saving 6 bytes.

Results of this process are below in a simple test program.

    \ Difference with/without optimizer:
    \ 15.6 vs 12.35 seconds. 24.8% faster
    \ 12 bytes smaller
    \ ITC Forth runs equivalent program in 47.7 seconds. 4X slower
    TARGET
    VARIABLE X
    VARIABLE Y
    VARIABLE Z
    FFFF CONSTANT LOOPS

    PROG: DEMO5
       LOOPS
       BEGIN
          1-
       WHILE
          -3 # X +!
          Y 1+!
          X @ Y @ + Z !
       REPEAT
       DROP
       NEXT,   \ return to Camel99 Forth
    END.       \ end directive test program size, tests for stack junk

I have factored out setting the VDP address into the word VDPA!. With this it becomes possible to take advantage of the VDP's auto-incrementing address feature, so the code for TYPE as a primitive operation is below. (It is up to you to set the write bit on the address if you are setting the address for writing, but that is easy.)

TYPE is a mixture of machine Forth and Forth Assembler, which is really handy.

    : TYPE ( addr len )
       *SP+ AREG MOV,
       R3 8C00 LI,        \ 12% faster to use a register
       BEGIN
          *AREG+ R3 ** MOVB,
          1-
       -UNTIL
       DROP
    ;

Here is how this new TYPE is used in a test program:

    \ hello world in machine Forth Demo  Sept 23 2021 Fox
    \ compiles to 128 bytes
    COMPILER   \ Use compiler wordlist (for interpreted words)
    NEW.
    HEX A000 ORIGIN.
    OPT-ON

    TARGET     \ Use TARGET wordlist (to compile code)
    HEX 8C02 EQU VDPWA   \ Write Address port
    HEX 8C00 EQU VDPWD   \ Write Data port

    CREATE TXT  S" Hello World! " S,

    : VDPA! ( Vaddr -- ) \ set vdp address (read mode)
       0 LIMI,
       TOS SWPB,  TOS VDPWA @@ MOVB,
       TOS SWPB,  TOS VDPWA @@ MOVB,
       DROP ;

    HEX
    PROG: MAIN
       0 LIMI,          \ disable interrupts
       8300 WORKSPACE   \ Fast ram for registers
       83BE RSTACK      \ and return stack
       83FE DSTACK      \ and Data stack
       4000 # VDPA!     \ initial screen address + write bit
       DECIMAL
       50 # FOR
          TXT COUNT TYPE   \ VDP auto increments
       NEXT
       BEGIN AGAIN      \ loop forever
    END.

    COMPILER
    SAVE DSK2.HELLO4

That's all for now.
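To make the DROP/DUP removal concrete, here is a rough Python model of a peephole pass over a toy instruction list. It is not the compiler's code. It assumes a literal compiles to DUP followed by a load of TOS (modelled as `LI`), and, as an extra safety condition that the post does not state, it removes a DROP/DUP pair only when the next operation reloads TOS. The 6 bytes per pair is taken from the post; all names are hypothetical.

```python
BYTES_PER_PAIR = 6  # from the post: removing one DROP/DUP pair saves 6 bytes


def peephole(ops):
    """Remove DROP immediately followed by DUP when TOS is reloaded next.

    Returns the optimized list and the number of pairs removed.
    """
    out, removed, i = [], 0, 0
    while i < len(ops):
        is_pair = (
            i + 2 < len(ops)
            and ops[i] == ("DROP",)
            and ops[i + 1] == ("DUP",)
            and ops[i + 2][0] == "LI"   # assumption: next op overwrites TOS
        )
        if is_pair:
            removed += 1
            i += 2                      # skip both DROP and DUP
            continue
        out.append(ops[i])
        i += 1
    return out, removed


def run(ops, stack):
    """Tiny stack machine: the last list element plays the TOS register."""
    s = list(stack)
    for op in ops:
        if op[0] == "DROP":
            s.pop()                     # refill TOS from memory
        elif op[0] == "DUP":
            s.append(s[-1])             # push a copy of TOS
        elif op[0] == "LI":
            s[-1] = op[1]               # load a literal into TOS
    return s


if __name__ == "__main__":
    program = [("DUP",), ("LI", 1), ("DROP",), ("DUP",), ("LI", 2)]
    optimized, pairs = peephole(program)
    print(optimized, "saves", pairs * BYTES_PER_PAIR, "bytes")
    print(run(program, [7]) == run(optimized, [7]))
```

At 6 bytes per pair, the 12 bytes saved in the test program correspond to two removed pairs.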
GDMike Posted September 24, 2021

That's really a considerable speed difference.
+TheBF Posted September 24, 2021 Author

17 minutes ago, GDMike said: That's really a considerable speed difference.

Yeah, it makes a big difference when you remove even a few instructions from a small loop.

Here is another test I just did. It was a benchmark found by @speccery. I redid the timings on my machine with Lee's and Mark's systems so that everything was on the same Classic99 version and on the same machine. This is using my latest kernel, which I have not released. It seems to be a bit faster than previous versions.

    DECIMAL
    : FIB2        0 1 ROT 0 DO  OVER + SWAP  LOOP  DROP ;
    : FIB2-BENCH  1000 0 DO  I FIB2 DROP  LOOP ;

                Normal    INLINE[ OVER + SWAP ]    BOUNDS
    --------------------------------------------------------
    TForth      1:46
    Camel99     1:51      1:19                     0:59
    FbForth     1:53
    MachForth   0:43

The test program in MachForth had to have a DO LOOP added to it. I copied it into the test program because I was debugging it tonight. I want to see if I can make it faster by using the simpler FOR NEXT, which is how MachForth would do it natively.

Edit: Removed bad comment

    \ fibonacci benchmark in Camel Forth
    COMPILER   \ Set up environment
    HEX
    NEW.
    2000 ORIGIN.
    OPT-ON
    TARGET

    \ Machine Forth does not have DO/LOOP
    \ setup parameters on return stack
    H: (DO)
       R0 8000 LI,      \ load "fudge factor" to LIMIT
       *SP+ R0 SUB,     \ Pop limit, compute 8000h-limit "fudge factor"
       R0 TOS ADD,      \ loop ctr = index+fudge
       R0 RPUSH,
       TOS RPUSH,
       TOS DPOP,        \ refill TOS
    ;H

    H: DO ( limit indx -- )  (DO) BEGIN ;H

    H: UNLOOP   RP 4 AI, ;H

    H: LOOP ( addr --)
       *RP INC,                     \ increment the index number
       ( addr) THERE 0 JNO, <BACK   \ compute, compile the jump
       UNLOOP                       \ clean the return stack 2 items
    ;H

    H: +LOOP   TOS *RP ADD,  TOS DPOP,  LOOP ;H

    H: I   TOS DPUSH,  *RP TOS MOV,  2 (RP) TOS SUB, ;H

    \ Machine Forth doesn't normally have ROTate, we have to create one.
    : ROT ( n1 n2 n3 -- n2 n3 n1)
       2 (SP) R0 MOV,
       *SP 2 (SP) MOV,
       TOS *SP MOV,
       R0 TOS MOV,
    ;

    DECIMAL
    : FIB   0 # 1 # ROT  0 # DO  OVER + SWAP  LOOP  DROP ;

    PROG: MAIN
       1000 # 0 # DO  I FIB DROP  LOOP
       NEXT,   \ Return to Forth
    END.
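The "fudge factor" in `(DO)` is easier to follow with numbers. The Python sketch below models the arithmetic in 16 bits. It is a model under assumptions, not the compiler's code: it assumes `INC` signals overflow exactly when the counter steps from 7FFF to 8000 hex, that `JNO` loops back while no overflow occurred, and that `I` recovers the index as counter minus fudge. The name `do_loop_indices` is hypothetical.

```python
MASK = 0xFFFF  # 16-bit cells


def do_loop_indices(limit, index):
    """Return the values `I` yields for `limit index DO ... LOOP`.

    Model of (DO): fudge = 8000h - limit, counter = index + fudge.
    Model of LOOP: increment the counter, exit on signed overflow.
    """
    fudge = (0x8000 - limit) & MASK
    counter = (index + fudge) & MASK
    seen = []
    while True:
        seen.append((counter - fudge) & MASK)   # what the word I computes
        overflow = counter == 0x7FFF            # 7FFFh + 1 overflows
        counter = (counter + 1) & MASK          # *RP INC,
        if overflow:
            break                               # JNO falls through
    return seen


if __name__ == "__main__":
    print(do_loop_indices(10, 0))
    print(do_loop_indices(5, 2))
```

The point of the bias is that the loop test becomes free: the increment itself sets the overflow flag when the limit is reached, so LOOP needs only an INC and a conditional jump, with no compare instruction.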