
While changing how the values of a couple of user variables (UVs) are pushed to the stack, I noticed that the same code is used for ten of them. Here is the code for one of them:

 

;[*** SATR ***       ( --- vaddr )
*        DATA CLTB_N
* SATR_N .NAME_FIELD 4, 'SATR '

SATR   DATA $+2
       DECT SP                  ; make room on stack
       MOV  @$SATR(U),*SP       ; push VRAM address of SATR to stack
       B    *NEXT               ; return to inner interpreter
;]*

 

I reduced that code to 2 cells for each of those 10 UVs by putting the address of the ALC retrieval routine in the code field and the UV's table offset in the parameter field:

 

;[*** SATR ***       ( --- vaddr )
*        DATA CLTB_N
* SATR_N .NAME_FIELD 4, 'SATR '

SATR   DATA GETVAL		;routine to retrieve User Variable value
       DATA $SATR		;User Variable table offset of $SATR
;]*

 

Here is the ALC retrieval routine:

 

*...The fbForth registers in use below are
*......W  (R10) points to the current cell of the currently executing word.
*......U  (R8)  points to the start of the User Variable table.
*......SP (R9)  points to the top of the stack.
GETVAL DECT SP
       MOV  *W,R0		;copy parameter field of word that got us here (UV table offset)
       A    U,R0		;correct to actual address in UV table
       MOV  *R0,*SP		;push UV value to stack
       B    *NEXT		;back to inner interpreter

 

It appears to work as it should. Anyone with deep Forth insight ( @TheBF @Willsy @FarmerPotato ??? ) see any possible problems?

 

...lee

Edited by Lee Stewart
code clarification

If I understand this correctly, you have made something like a "USER CONSTANT" rather than a USER VARIABLE.

If you are OK breaking the mold of FigForth, I think it's a good improvement, provided those USER variables are not changed very often.

 

For context, this is how many Forth systems in the '90s started doing VARIABLE, CONSTANT, and USER.

There is a separate code snippet for each data type that is compiled into the code field of each word.

I think FigForth did it with <BUILDS DOES> ? I don't remember.

 

Example: Camel Forth data words are defined:

 : CONSTANT  ( n --)  HEADER  COMPILE DOCON     COMPILE, ;
 : USER      ( n --)  HEADER  COMPILE DOUSER    COMPILE, ;
 : CREATE    ( -- )   HEADER  COMPILE DOVAR              ;
 : VARIABLE  ( -- )   CREATE                  0 COMPILE, ;
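
The matching runtime snippets are each just a few instructions. Roughly, in the register conventions from Lee's post above (a sketch, not CamelForth's actual source):

DOVAR  DECT SP              ; make room on stack
       MOV  W,*SP           ; push the pfa (address of the data)
       B    *NEXT
DOCON  DECT SP
       MOV  *W,*SP          ; push the value stored in the pfa
       B    *NEXT
DOUSER DECT SP
       MOV  *W,R0           ; pfa holds an offset into the user area
       A    U,R0            ; form the address in the user area
       MOV  R0,*SP          ; push the user variable's ADDRESS
       B    *NEXT

Note that DOUSER pushes an address (so @ and ! still work on it), whereas your GETVAL pushes the value, which is why it behaves like a "USER CONSTANT".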

 

1 hour ago, Lee Stewart said:
GETVAL DECT SP

 

Took me a while to find your DECT above the comment.

 

It's correct.

 

I see only one further optimization, which is self-modifying code!   Rewrite the MOV to use the addressing mode @>3E(U) of the original MOV.

Of course that won't work in ROM, so it's no help to you. It does eliminate one instruction, which is almost always better.

 

My code replaces >1234  with the value at *W

 

*RORG DATA ************************   Opcode       Td dddd Ts ssss
*
 0000 C81A      MOV  *W,@$+6          1100         10 0000 01 1010
 0002 0006r
 0004 CA19      MOV  @>1234(U),*SP    1100         10 1000 01 1001 
 0006 1234
 0008 045F      B    *NEXT            0000 0100 01         01 1111

 

 

If the CPU were a native Forth machine with a stack,  it would be quite elegant compared to this.

 

PUSH   W     (      -- addr )
FETCH        ( addr -- n    )
ADD    U     ( n    -- addr )
FETCH        ( addr -- n    )
NEXT

 

 

11 minutes ago, TheBF said:

If I understand this correctly, you have made something like a "USER CONSTANT" rather than a USER VARIABLE. …

 

TI Forth has these VDP base addresses (for the Sprite Attribute Table, Pattern Generator Table, and so on) stored in "User Values": write-once, read-often variables, configured when GRAPHICS, SPLIT, and TEXT set the VDP registers.

 

I learned to like "VALUE ... IS" in ANS, so I'm putting that in my Forth (TI fig-Forth).

DOVALU
* fetch from pfa
* fetch
* next
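
Spelled out in the register conventions from Lee's post (a sketch only; it assumes the pfa holds the address where the value lives, and the label is illustrative):

DOVALU DECT SP              ; make room on stack
       MOV  *W,R0           ; fetch the storage address from the pfa
       MOV  *R0,*SP         ; fetch the value there and push it
       B    *NEXT           ; back to inner interpreter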

 

16 minutes ago, FarmerPotato said:

Took me a while to find your DECT above the comment. … If the CPU were a native Forth machine with a stack, it would be quite elegant compared to this.

There you go dreaming about good stuff again.

 

This raises the question:

If your native Forth CPU has a TOS register and, as Chuck added in later designs, an A register for addresses,

would it make sense to make an indexed addressing mode?

Something like:

 
 

MYARRAY  A!      \ set the base address
7 CELLS          \ TOS=14
 @(A)            \ get the value at TOS+A 
 @(A)+           \ get the value at TOS+A, autoincr. A by 2

 

52 minutes ago, TheBF said:

Example: Camel Forth data words are defined: …

 

fbForth

: VARIABLE   ( n -- )   <BUILDS , DOES> ;
: CONSTANT   ( n -- )   <BUILDS , DOES> @ ;
: USER       ( n -- )   <BUILDS , DOES> @ U + ;

 

...lee

 

1 hour ago, TheBF said:

There you go dreaming about good stuff again. … would it make sense to make an indexed addressing mode?

 

Yeah, I think you mean indexing into the stack?

 

I stick to thinking of simple operations, knowing that at runtime they will actually work in parallel. For instance, the ALU calculates an indexed address from TOS and TOS+1, and the result is simultaneously used for a memory load AND stored on the stack (or incremented again...). In my CPU, stack operations (DUP, DROP, OVER) are optimized away wherever possible. The stack is more of a high-level programming concept.


I was impressed to learn of a CPU architecture called STRAIGHT.  While not stack-oriented, it reminds me of Chuck's circular stack.  STRAIGHT has a circular register file. Each instruction may push one new top register and refer to past registers by index.  This allows scheduling multiple instructions as all the data dependencies are written simply in the code.  Modern processors have  huge baggage, where the scheduler does "register renaming" on data dependencies and speculative execution.   In STRAIGHT, the compiler just writes in the dependencies as stack indices.  (Reminder, it's a circular stack, no need to balance it.)  

 

J1A (Forth machine) has a little parallelism. One instruction feeds the ALU from stack indexes, then can adjust SP up or down, finally writing to TOS or TOS+1. One or several Forth words can be packed into one cycle. In particular, OVER, DROP, and DUP are free when combined with another word: "DUP 1+" or "2OVER +" are one cycle. If you add a barrel shifter stage (J4, I think), then "DUP 2 SHIFT-LEFT OVER +" ( n -- n 5*n ) is one cycle.

 

I bet Chuck thought of this long ago.  If your Forth machine just needs stack indices, then the stack pointer becomes less a part of the CPU  architecture and more of a high-level language idea.  

 

Maybe there's a use right now:  Assume a big circular stack with no DROPs needed:

1 2 +

MOV  @ONE,*SP+       SP points after TOS
MOV  @TWO,*SP
A    @-2(SP),*SP+

 

If you used the WP to "window" the stack, then registers could even be used for indexes.
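
For instance (a sketch only; STKWS is a hypothetical fixed stack buffer, and you would have to LWPI back to the normal workspace afterward):

       LWPI STKWS           ; R0..R15 now overlay 16 cells of the stack
       A    R1,R2           ; add the cell at index 1 into the cell at index 2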

 

 

No, I meant indexing into the address in the A register with the TOS register.

In later CPUs Chuck realized using a register to hold memory addresses was faster than using the TOS register all the time.

 

I wonder if STRAIGHT took that idea from Chuck. All his machines used circular stacks for both data and return.

It just makes sense to use a counter that rolls around as the index into the stack memory.

I thought about making one in a Camel Kernel, but it would be pretty slow on our old machine.

 

5 hours ago, TheBF said:

No, I meant indexing into the address in the A register with the TOS register. …

 

Ah, I see it now. 

 

"Register Renaming" has led us to monstrous register files, multi-ported, up to simultaneous 5 read/writes.  A scheduler allocates them when an instruction references an  actual register.   The scheduler makes queues of instructions with data dependencies (more temporary registers!), while looking ahead for independent instructions to execute (more temporary registers!).   The guarantee is that they have cause/effect as if executed in order.  If an instruction completes, or "retires", its result comes out of Renaming and goes to the real register. Then somebody else stuffs that on the stack, good grief. 

 

STRAIGHT eliminates that with a circular memory. With all the space saved, it implements more integer execution units for parallelism and, by some measurements, comes out way ahead on power burned.

 

 

I wrote some "STRAIGHT"-style C code with a linear stack, then compared Clang's generated x86_64 vs ARM64 code at several optimization levels.

Pseudocode:

MAIN:
SP = S0          // SP points at first unused stack cell
PUSH A
PUSH B
PUSH C
Call MPYADD
PRINT SP[-1]
END

MPYADD:
PUSH SP[-3] * SP[-2] + SP[-1]
RETURN

 

 

I wrote it a couple different ways and compiled for x86_64 and ARM64, with several optimization levels. 

 

 

I tried to imagine register renaming going on with the x86_64 version, but the result was too twisted to contemplate.  It relies on the 4 most common general registers, but has to stick one on the stack temporarily.  

 

 

The ARM64 code was nice and clean whether I gave the MPYADD function some values or just the indexes.   It is essentially one Multiply-Add instruction in the subroutine, using indexed addressing.  When I turned on optimization -O2, haha, the ARM64 was short:  MAIN: "RETURN CONSTANT".

 

Anyhow, not a real code test, just wanted to see some real instructions.  I'm horrified at how much overhead the compiler creates.

 

 

 

Apologies to Lee for taking his thread in an adjacent direction.

 

9 hours ago, FarmerPotato said:

I'm horrified at how much overhead the compiler creates.

Haha. I can't imagine what the compiler did to use all those indexed values in the calculations. :)

 

I am not sure I have mused about this before, but I am wondering how significant an improvement there would be from putting the body of EXECUTE (6 bytes) in Scratchpad RAM with the inner interpreter. EXECUTE is used by INTERPRET (the text interpreter) and getCODE (the CODE: interpreter). I would just need to ensure I move all current temporary use of those 6 bytes elsewhere in RAM. 
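
For reference, those 6 bytes are just three instructions along these lines (label and scratch register are illustrative):

EXEC   MOV  *SP+,W          ; pop the execution token (cfa) into W
       MOV  *W+,R0          ; fetch the code field contents
       B    *R0             ; branch to the code routine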

 

...lee


From the sound of things it would improve compile times a bit.

But there is a lot of stuff going on in WORD and FIND that might swamp the improvement. 

 

I am of the opinion that the runtime of common primitives is more important for overall performance. 

 

I used this list from Stack Computers, Koopman.

Stack Computers: 6.3 A STUDY OF FORTH INSTRUCTION FREQUENCIES

 

For better or worse, I put my user variables after the workspace, which limited how much free space I had for primitives.

 

NAMES           FRAC     LIFE     MATH  COMPILE      AVE
----------------------------------------------------------
CALL           11.16%   12.73%   12.59%   12.36%   12.21%
EXIT           11.07%   12.72%   12.55%   10.60%   11.74%
VARIABLE        7.63%   10.30%    2.26%    1.65%    5.46%
@               7.49%    2.05%    0.96%   11.09%    5.40%
0BRANCH         3.39%    6.38%    3.23%    6.11%    4.78%
LIT             3.94%    5.22%    4.92%    4.09%    4.54%
+               3.41%   10.45%    0.60%    2.26%    4.18%
SWAP            4.43%    2.99%    7.00%    1.17%    3.90%
R>              2.05%    0.00%   11.28%    2.23%    3.89%
>R              2.05%    0.00%   11.28%    2.16%    3.87%
CONSTANT        3.92%    3.50%    2.78%    4.50%    3.68%
DUP             4.08%    0.45%    1.88%    5.78%    3.05%
ROT             4.05%    0.00%    4.61%    0.48%    2.29%
USER            0.07%    0.00%    0.06%    8.59%    2.18%
C@              0.00%    7.52%    0.01%    0.36%    1.97%
I               0.58%    6.66%    0.01%    0.23%    1.87%
=               0.33%    4.48%    0.01%    1.87%    1.67%
AND             0.17%    3.12%    3.14%    0.04%    1.61%
BRANCH          1.61%    1.57%    0.72%    2.26%    1.54%
EXECUTE         0.14%    0.00%    0.02%    2.45%    0.65%

 

 

Using this list I shoehorned the following into scratchpad, but there's no room for more. 

l: _exit    IP RPOP,    \ >8388
l: _next
@@9:        *IP+ W  MOV,
            *W+  R5 MOV,
            *R5  B,

l: _enter   IP RPUSH,
            W IP MOV,
            @@9 JMP,
l: _?branch
            TOS DEC,
            TOS POP,
            @@2 JOC,
l: _branch  *IP IP ADD,
            @@9 JMP,
@@2:        IP INCT,
            @@9 JMP,

l: _lit      TOS PUSH,
            *IP+ TOS MOV,
             @@9 JMP,

l: _drop    TOS POP,
            @@9 JMP,

l: _DUP     TOS PUSH,
            @@9 JMP,

l: _PLUS    *SP+ TOS ADD,
            @@9 JMP,

 

1 hour ago, Lee Stewart said:

Yeah, I agree about EXECUTE, but I think I may try to add some of those others to Scratchpad RAM—they get hit a lot!

 

...lee

I think you will see a marked improvement in general if you do that. 

I shot myself in the foot somewhat by putting those user variables above the registers, but it works so nicely for context switches that I still like the idea. :)

21 hours ago, TheBF said:

I used this list from Stack Computers, Koopman.

Stack Computers: 6.3 A STUDY OF FORTH INSTRUCTION FREQUENCIES

 

I have room for 20 bytes that won’t cause me too much pain. Looking at the above reference, it seems LIT, @, and DUP (20 bytes) or LIT, 0BRANCH, and BRANCH (18 bytes) might be good additions. Thoughts?

 

...lee

I did some tests with 0BRANCH in scratchpad and it does speed up your loops, so that might have the broadest impact.

The other three are important, but it is hard to pin down where they will really make a material difference.

 

I would start with the BRANCH brothers.

 

5 hours ago, Lee Stewart said:

 

I have room for 20 bytes that won’t cause me too much pain. Looking at the above reference, it seems LIT, @, and DUP (20 bytes) or LIT, 0BRANCH, and BRANCH (18 bytes) might be good additions. Thoughts?

 

...lee

 

BRANCH and 0BRANCH take 12 bytes. If I include LIT, I will need to replace the inlined code for EXIT with a JMP to $NEXT to recover the extra bytes I need. That will only cost 333¹⁄₃ ns per instance. What do you think?
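
For reference, sketches of those three in the same register conventions (taking IP as the interpretive pointer; not necessarily the exact ROM code):

LIT    DECT SP              ; 6 bytes: make room on stack
       MOV  *IP+,*SP        ; push the in-line literal, advance IP
       B    *NEXT
BRAN   A    *IP,IP          ; 4 bytes: add the in-line offset to IP
       B    *NEXT
ZBRAN  MOV  *SP+,R0         ; 8 bytes: pop the flag
       JEQ  BRAN            ; zero: take the branch
       INCT IP              ; nonzero: step over the in-line offset
       B    *NEXT

Those sizes line up with the 12- and 18-byte counts above.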

 

...lee

44 minutes ago, TheBF said:

Inline EXIT affects every hi-level word so it's important.

Can you fit @ with branch?

 

I only have 16 bytes without replacing the inline code. The only word in the discussed list with 4 bytes is DROP.
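
For what it's worth, DROP in these conventions is just:

DROP   INCT SP              ; discard the top of stack
       B    *NEXT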

 

I could actually put up to 64 bytes in the subroutine and data stack areas, but would, at least, need to save/restore for many of the GPLLNK and Floating Point Library calls. That might kill the speed advantage.

 

...lee


Well then I think branch, 0branch, and drop are candidates for the experiment.

I know the 0branch will speed up anything with AGAIN, UNTIL, or REPEAT, but of course if the loops are full of lots of other stuff it has less effect.

The joys of the scratchpad. :)

 

On 11/4/2024 at 5:00 PM, Lee Stewart said:

While changing how the values of a couple of user variables (UVs) are pushed to the stack, I noticed that the same code is used for ten of them. …

 

Looks good. TF used the same approach if I've read your code correctly. A small stub to place the address of a var in a register, then a JMP to a common routine to do the push and subsequent return: https://github.com/Mark-Wills/TurboForth/blob/main/bank0/0-14-Variables.asm
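
In outline the pattern is something like this (illustrative only, not the actual TurboForth source; see the link for that):

VAR1   DATA $+2
       LI   R0,VAL1         ; stub: load the address of this variable's cell
       JMP  PUSHVA          ; common routine does the push and return
VAL1   DATA 0               ; the variable's storage
*
PUSHVA DECT SP              ; make room on stack
       MOV  R0,*SP          ; push the address
       B    *NEXT           ; back to the inner interpreter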

 

 

