Lee Stewart (author), posted November 4 (edited):

While changing how the values of a couple of user variables (UVs) are pushed to the stack, I noticed that the same code is used for ten of them. Here is the code for one of them:

    ;[*** SATR ***       ( --- vaddr )
    *        DATA CLTB_N
    * SATR_N .NAME_FIELD 4, 'SATR '
    SATR   DATA $+2
           DECT SP              ; make room on stack
           MOV  @$SATR(U),*SP   ; push VRAM address of SATR to stack
           B    *NEXT           ; return to inner interpreter
    ;]*

I reduced that code to 2 cells for each of those 10 UVs by putting the location of the ALC retrieval routine in the code field and the UV table offset of the UV in the parameter field:

    ;[*** SATR ***       ( --- vaddr )
    *        DATA CLTB_N
    * SATR_N .NAME_FIELD 4, 'SATR '
    SATR   DATA GETVAL          ;routine to retrieve User Variable value
           DATA $SATR           ;User Variable table offset of $SATR
    ;]*

Here is the ALC retrieval routine:

    *...The fbForth registers in use below are
    *......W (R10) points to the current cell of the currently executing word.
    *......U (R8) points to the start of the User Variable table.
    *......SP (R9) points to the top of the stack.
    GETVAL DECT SP
           MOV  *W,R0     ;copy parameter field of word that got us here (UV table offset)
           A    U,R0      ;correct to actual address in UV table
           MOV  *R0,*SP   ;push UV value to stack
           B    *NEXT     ;back to inner interpreter

It appears to work as it should. Anyone with deep Forth insight ( @TheBF @Willsy @FarmerPotato ??? ) see any possible problems?

...lee

Edited November 4 by Lee Stewart (code clarification)

Link to comment: https://forums.atariage.com/topic/210660-fbforth%E2%80%94ti-forth-with-file-based-block-io-post-1-updated-06052024/page/85/#findComment-5560293
TheBF, posted November 4:

If I understand this correctly, you have made something like a "USER CONSTANT" rather than a USER VARIABLE. If you are OK breaking the mold of FigForth, I think it's a good improvement if those USER variables are not changed very often.

For context, this is how many Forth systems in the 90s started doing VARIABLE, CONSTANT and USER. There is a separate code snippet for each data type that is compiled into the code field of each word. I think FigForth did it with <BUILDS ... DOES>? I don't remember.

Example: Camel Forth data words are defined:

    : CONSTANT ( n --)  HEADER  COMPILE DOCON   COMPILE, ;
    : USER     ( n --)  HEADER  COMPILE DOUSER  COMPILE, ;
    : CREATE   ( -- )   HEADER  COMPILE DOVAR ;
    : VARIABLE ( -- )   CREATE  0 COMPILE, ;
FarmerPotato, posted November 4:

1 hour ago, Lee Stewart said:

    GETVAL DECT SP

Took me a while to find your DECT above the comment. It's correct.

I see only one further optimization, which is self-modifying code! Rewrite the MOV to use the addressing mode @>3E(U) of the original MOV. Of course that won't work in ROM, so it is no help to you. It does eliminate one instruction, which is almost always better.

My code replaces >1234 with the value at *W:

    *RORG DATA ************************ Opcode Td dddd Ts ssss
    * 0000 C81A   MOV *W,@$+6         1100   10 0000 01 1010
      0002 0006r
      0004 C668   MOV @>1234(U),*SP   1100   01 1001 10 1000
      0006 1234
      0008 045F   B *NEXT             0000 0100 01   01 1111

If the CPU were a native Forth machine with a stack, it would be quite elegant compared to this:

    PUSH W    ( -- addr )
    FETCH     ( addr -- n )
    ADD U     ( n -- addr )
    FETCH     ( addr -- n )
    NEXT
FarmerPotato, posted November 4:

11 minutes ago, TheBF said:

    If understand this correctly you have made something like a "USER CONSTANT" rather than a USER VARIABLE. If you are ok breaking the mold of FigForth I think it's a good improvement if those USER variables are not changed very often. [...]

TI Forth has these VDP base addresses stored in "User Values": write-once, read-often variables. They are configured when GRAPHICS, SPLIT and TEXT set the VDP registers for the Sprite Attribute Table, Pattern Generator Table, and so on.

I learned to like "VALUE ... IS" in ANS, so I'm putting that in my Forth (TI fig-Forth).

    DOVALU
    * fetch from pfa
    * fetch
    * next
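For readers who have not used the word, here is a minimal sketch of the VALUE idea in ANS-style Forth (assumed to run under a standard system such as gforth; it is not TI Forth or fbForth source). Note that the ANS standard spells the store word TO; IS is the name some other systems use, and in Forth 2012 it belongs to DEFER. The name SATR and the numbers are only illustrative.

```forth
\ A VALUE pushes its contents, not its address, when executed.
768 VALUE SATR        \ define SATR holding 768 (illustrative number)
SATR .                \ displays 768
1024 TO SATR          \ ANS store word for a VALUE is TO
SATR .                \ displays 1024
```

Reading a VALUE costs one fetch at run time, which is the same trade FarmerPotato describes for write-once, read-often table addresses.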
TheBF, posted November 4:

16 minutes ago, FarmerPotato said:

    [...] If the CPU were a native Forth machine with a stack, it would be quite elegant compared to this. PUSH W ( -- addr ) FETCH ( addr -- n ) ADD U ( n -- addr ) FETCH ( addr -- n ) NEXT

There you go dreaming about good stuff again.

This raises the question: if your native Forth CPU has a TOS register and, as Chuck added in later designs, an A register for addresses, would it make sense to make an indexed addressing mode? Something like:

    MYARRAY A!    \ set the base address
    7 CELLS       \ TOS=14
    @(A)          \ get the value at TOS+A
    @(A)+         \ get the value at TOS+A, autoincr. A by 2
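The proposed indexed fetch can be modelled in software to see how it would read in use. This is only a sketch in ANS-style Forth, assuming a standard system such as gforth: the A register is simulated with an ordinary variable (A-REG is a made-up name), and @(A) and @(A)+ are defined here as normal colon words, not as hardware instructions.

```forth
VARIABLE A-REG                        \ stand-in for the hardware A register
: A!    ( addr -- )      A-REG ! ;    \ set the base address
: @(A)  ( offset -- n )  A-REG @ + @ ;             \ fetch from A + TOS
: @(A)+ ( offset -- n )  @(A)  1 CELLS A-REG +! ;  \ same, then bump A by one cell

CREATE MYARRAY  10 , 20 , 30 , 40 ,   \ illustrative data

MYARRAY A!
2 CELLS @(A)  .     \ displays 30
2 CELLS @(A)+ .     \ displays 30, then A advances by one cell
2 CELLS @(A)  .     \ displays 40
```

On a 16-bit machine such as the TI-99/4A, 1 CELLS is 2, which matches the "autoincr. A by 2" comment in the post.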
Lee Stewart (author), posted November 4:

52 minutes ago, TheBF said:

    [...] I think FigForth did it with <BUILD DOES> ? i don't remember. Example: Camel Forth data words are defined: : CONSTANT ( n --) HEADER COMPILE DOCON COMPILE, ; : USER ( n --) HEADER COMPILE DOUSER COMPILE, ; : CREATE ( -- ) HEADER COMPILE DOVAR ; : VARIABLE ( -- ) CREATE 0 COMPILE, ;

fbForth:

    : VARIABLE ( n -- )  <BUILDS , DOES> ;
    : CONSTANT ( n -- )  <BUILDS , DOES> @ ;
    : USER     ( n -- )  <BUILDS , DOES> @ U + ;

...lee
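The difference between a classic USER word and the GETVAL-style word discussed earlier in the thread can be sketched in high-level Forth. This is an ANS-style model (CREATE ... DOES> in place of fig-Forth's <BUILDS ... DOES>), assumed to run under a standard system such as gforth. UTABLE, UBASE, UVAR, UVAL, SATR-ADDR and the offset used are all hypothetical names and values chosen for the illustration, not fbForth identifiers.

```forth
\ Sketch only: ANS-style Forth, not fbForth source.
CREATE UTABLE  32 CELLS ALLOT      \ hypothetical stand-in for the User Variable table
: UBASE ( -- addr )  UTABLE ;      \ plays the role of register U

\ Classic USER behaviour: the child word pushes the ADDRESS of its slot.
: UVAR ( offset -- )  CREATE ,  DOES>  @ UBASE + ;
\ GETVAL-style behaviour: the child word pushes the VALUE in its slot.
: UVAL ( offset -- )  CREATE ,  DOES>  @ UBASE + @ ;

2 CELLS UVAR SATR-ADDR             \ illustrative offset into the table
2 CELLS UVAL SATR

HEX 300 SATR-ADDR !  DECIMAL       \ store a table address into the slot
SATR .                             \ displays 768
```

The only difference between the two defining words is the final fetch, which is why TheBF describes the result as closer to a "USER CONSTANT".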
FarmerPotato, posted November 4:

1 hour ago, TheBF said:

    There you go dreaming about good stuff again. This begs the question: If your native Forth CPU has a TOS register and, as Chuck added in later designs, an A register for addresses, would it make sense to make an indexed addressing mode? Something like:

Yeah, I think you mean indexing into the stack?

I stick to thinking of simple operations, knowing that at runtime they will actually work in parallel. For instance, the ALU calculates an indexed address from the TOS and TOS+1, then the result is simultaneously used for a memory load AND stored on the stack (or incremented again...).

In my CPU, stack operations (DUP, DROP, OVER) are optimized away wherever possible. The stack is more of a high-level programming concept.

I was impressed to learn of a CPU architecture called STRAIGHT. While not stack-oriented, it reminds me of Chuck's circular stack. STRAIGHT has a circular register file. Each instruction may push one new top register and refer to past registers by index. This allows scheduling multiple instructions, as all the data dependencies are written simply in the code. Modern processors have huge baggage, where the scheduler does "register renaming" on data dependencies and speculative execution. In STRAIGHT, the compiler just writes in the dependencies as stack indices. (Reminder: it's a circular stack, so there is no need to balance it.)

J1A (a Forth machine) has a little parallelism. One instruction feeds the ALU from stack indexes, then can adjust SP up or down, finally writing to TOS or TOS+1. One or several Forth words can be packed into one cycle. In particular, OVER, DROP and DUP are free when combined with another word: "DUP 1+" or "2OVER +" are one cycle. If you add a barrel shifter stage (J4, I think), then "DUP 2 SHIFT-LEFT OVER +" ( n -- n 5*n ) fits in one cycle. I bet Chuck thought of this long ago.

If your Forth machine just needs stack indices, then the stack pointer becomes less a part of the CPU architecture and more of a high-level language idea.

Maybe there's a use right now. Assume a big circular stack with no DROPs needed:

    1 2 +
        MOV @ONE,*SP+       SP points after TOS
        MOV @TWO,*SP
        A   @-2(SP),*SP+

If you used the WP to "window" the stack, then registers could even be used for indexes.
TheBF, posted November 4:

No, I meant indexing into the address in the A register with the TOS register. In later CPUs Chuck realized that using a register to hold memory addresses was faster than using the TOS register all the time.

I wonder if STRAIGHT took that idea from Chuck. All his machines used circular stacks for both data and return. It just makes sense to use a counter that rolls around as the index into the stack memory. I thought about making one in a Camel Kernel, but it would be pretty slow on our old machine.
FarmerPotato, posted November 5:

5 hours ago, TheBF said:

    No I meant indexing into the address in the A register with the TOS register. In later CPUs Chuck realized using a register to hold memory addresses was faster than using the TOS register all the time. [...]

Ah, I see it now.

"Register renaming" has led us to monstrous register files, multi-ported, with up to 5 simultaneous reads/writes. A scheduler allocates them when an instruction references an actual register. The scheduler makes queues of instructions with data dependencies (more temporary registers!), while looking ahead for independent instructions to execute (more temporary registers!). The guarantee is that they have cause/effect as if executed in order. When an instruction completes, or "retires", its result comes out of renaming and goes to the real register. Then somebody else stuffs that on the stack, good grief.

STRAIGHT eliminates that with a circular memory. With all the space saved, it implements more integer execution units for parallelism and, by some measurement, comes out way ahead on power burned.

I wrote some "STRAIGHT" style C code with a linear stack, then compared Clang's generated x86_64 vs ARM64 code at several optimization levels. Pseudocode:

    MAIN:
        SP = S0            // SP points at first unused stack cell
        PUSH A
        PUSH B
        PUSH C
        Call MPYADD
        PRINT SP[-1]
        END

    MPYADD:
        PUSH SP[-3] * SP[-2] + SP[-1]
        RETURN

I wrote it a couple of different ways. I tried to imagine register renaming going on with the x86_64 version, but the result was too twisted to contemplate. It relies on the 4 most common general registers, but has to stick one on the stack temporarily.

The ARM64 code was nice and clean whether I gave the MPYADD function some values or just the indexes. It is essentially one Multiply-Add instruction in the subroutine, using indexed addressing.

When I turned on optimization -O2, haha, the ARM64 was short: MAIN: "RETURN CONSTANT".

Anyhow, not a real code test; I just wanted to see some real instructions. I'm horrified at how much overhead the compiler creates.
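The MPYADD pseudocode maps fairly directly onto Forth if the operands are read by index and left in place instead of being consumed. A minimal sketch in ANS-style Forth (assumed to run under a standard system such as gforth; PICK is from the Core Extension word set), with illustrative operand values:

```forth
\ Reads the three cells below the top by index and pushes a*b+c,
\ leaving a, b and c where they are (no DROPs, as in the pseudocode).
: MPYADD ( a b c -- a b c a*b+c )
   2 PICK        \ copy a  (SP[-3] in the pseudocode)
   2 PICK        \ copy b  (SP[-2])
   *             \ a*b
   OVER +  ;     \ add a copy of c (SP[-1])

3 4 5 MPYADD .   \ displays 17; 3 4 5 remain on the stack
```

On a conventional linear Forth stack the leftover operands eventually have to be dropped; on the circular stack described in the thread they would simply be overwritten.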
TheBF, posted November 5:

Apologies to Lee for taking his thread in an adjacent direction.

9 hours ago, FarmerPotato said:

    I'm horrified at how much overhead the compiler creates.

Haha. I can't imagine what the compiler did to use all those indexed values in the calculations.
Lee Stewart (author), posted November 6:

I am not sure I have mused about this before, but I am wondering how significant an improvement there would be from putting the body of EXECUTE (6 bytes) in Scratchpad RAM with the inner interpreter. EXECUTE is used by INTERPRET (the text interpreter) and getCODE (the CODE: interpreter). I would just need to ensure I move all current temporary use of those 6 bytes elsewhere in RAM.

...lee
TheBF, posted November 6:

From the sound of things it would improve compile times a bit. But there is a lot of stuff going on in WORD and FIND that might swamp the improvement.

I am of the opinion that the runtime of common primitives is more important for overall performance. I used this list from Koopman's Stack Computers, section 6.3, "A Study of Forth Instruction Frequencies". For better or worse, I put my user variables after the workspace, which limited how much free space I had for primitives.

    NAMES       FRAC     LIFE     MATH   COMPILE      AVE
    ------------------------------------------------------
    CALL      11.16%   12.73%   12.59%   12.36%   12.21%
    EXIT      11.07%   12.72%   12.55%   10.60%   11.74%
    VARIABLE   7.63%   10.30%    2.26%    1.65%    5.46%
    @          7.49%    2.05%    0.96%   11.09%    5.40%
    0BRANCH    3.39%    6.38%    3.23%    6.11%    4.78%
    LIT        3.94%    5.22%    4.92%    4.09%    4.54%
    +          3.41%   10.45%    0.60%    2.26%    4.18%
    SWAP       4.43%    2.99%    7.00%    1.17%    3.90%
    R>         2.05%    0.00%   11.28%    2.23%    3.89%
    >R         2.05%    0.00%   11.28%    2.16%    3.87%
    CONSTANT   3.92%    3.50%    2.78%    4.50%    3.68%
    DUP        4.08%    0.45%    1.88%    5.78%    3.05%
    ROT        4.05%    0.00%    4.61%    0.48%    2.29%
    USER       0.07%    0.00%    0.06%    8.59%    2.18%
    C@         0.00%    7.52%    0.01%    0.36%    1.97%
    I          0.58%    6.66%    0.01%    0.23%    1.87%
    =          0.33%    4.48%    0.01%    1.87%    1.67%
    AND        0.17%    3.12%    3.14%    0.04%    1.61%
    BRANCH     1.61%    1.57%    0.72%    2.26%    1.54%
    EXECUTE    0.14%    0.00%    0.02%    2.45%    0.65%

Using this list I shoehorned the following into scratchpad, but there's no room for more.

    l: _exit     IP RPOP,                          \ >8388
    l: _next
    @@9:         *IP+ W MOV,
                 *W+ R5 MOV,
                 *R5 B,
    l: _enter    IP RPUSH,  W IP MOV,  @@9 JMP,
    l: _?branch  TOS DEC,  TOS POP,  @@2 JOC,
    l: _branch   *IP IP ADD,  @@9 JMP,
    @@2:         IP INCT,  @@9 JMP,
    l: _lit      TOS PUSH,  *IP+ TOS MOV,  @@9 JMP,
    l: _drop     TOS POP,  @@9 JMP,
    l: _DUP      TOS PUSH,  @@9 JMP,
    l: _PLUS     *SP+ TOS ADD,  @@9 JMP,
Lee Stewart (author), posted November 7:

Yeah, I agree about EXECUTE, but I think I may try to add some of those others to Scratchpad RAM, since they get hit a lot!

...lee
TheBF, posted November 7:

1 hour ago, Lee Stewart said:

    Yeah, I agree about EXECUTE, but I think I may try to add some of those others to Scratchpad RAM, since they get hit a lot! ...lee

I think you will see a marked improvement in general if you do that. I shot myself in the foot somewhat by putting those user variables above the registers, but it works so nicely for context switches that I still like the idea.
Lee Stewart (author), posted November 7:

21 hours ago, TheBF said:

    I used this list from Stack Machines, Koopman. Stack Computers: 6.3 A STUDY OF FORTH INSTRUCTION FREQUENCIES

I have room for 20 bytes that won’t cause me too much pain. Looking at the above reference, it seems LIT, @, and DUP (20 bytes) or LIT, 0BRANCH, and BRANCH (18 bytes) might be good additions. Thoughts?

...lee
TheBF, posted November 7:

I did some tests with 0BRANCH in scratchpad and it does speed up your loops, so that might have the broadest impact. The other three are important, but it is hard to pin down where they will really make a material difference. I would start with the BRANCH brothers.
Lee Stewart (author), posted November 7:

5 hours ago, Lee Stewart said:

    I have room for 20 bytes that won’t cause me too much pain. Looking at the above reference, it seems LIT, @, and DUP (20 bytes) or LIT, 0BRANCH, and BRANCH (18 bytes) might be good additions. Thoughts? ...lee

BRANCH and 0BRANCH take 12 bytes. If I include LIT, I will need to replace the inlined code for EXIT with a JMP to $NEXT to recover the extra bytes I need. That will only cost 333¹⁄₃ ns per instance. What do you think?

...lee
TheBF, posted November 7:

Inline EXIT affects every high-level word, so it's important. Can you fit @ with branch?
Lee Stewart (author), posted November 7:

44 minutes ago, TheBF said:

    Inline EXIT affects every hi-level word so It's important. Can you fit @ with branch?

I only have 16 bytes without replacing the inline code. The only word in the discussed list with 4 bytes is DROP.

I could actually put up to 64 bytes in the subroutine and data stack areas, but would, at least, need to save/restore for many of the GPLLNK and Floating Point Library calls. That might kill the speed advantage.

...lee
TheBF, posted November 8:

Well then, I think BRANCH, 0BRANCH and DROP are candidates for the experiment. I know the 0BRANCH will speed up anything with AGAIN, UNTIL or REPEAT, but of course if the loops are full of lots of other stuff it has less effect.

The joys of the scratchpad.
Willsy, posted November 12:

On 11/4/2024 at 5:00 PM, Lee Stewart said:

    While changing how the values of a couple of user variables (UVs) are pushed to the stack, I noticed that the same code is used for ten of them. [...] It appears to work as it should. Anyone with deep Forth insight ( @TheBF @Willsy @FarmerPotato ??? ) see any possible problems? ...lee

Looks good. TF used the same approach, if I've read your code correctly: a small stub to place the address of a var in a register, then a JMP to a common routine to do the push and subsequent return:

https://github.com/Mark-Wills/TurboForth/blob/main/bank0/0-14-Variables.asm