
BLWP vs BL


Willsy


Oh but of course if you need nested sub-routines on stack, it is really sad. 

 

DECT RP          10
MOV  R11,*RP     16
BL   @ABCD       20
-------------------------
                 46

* RETURN
MOV  *RP+,R11    22
B    *R11        16
-------------------------
                 38

* NO MEMORY WAIT STATES ASSUMED


1 hour ago, TheBF said:

Oh but of course if you need nested sub-routines on stack, it is really sad. 

 

DECT RP          10
MOV  R11,*RP     16
BL   @ABCD       20
-------------------------
                 46

* RETURN
MOV  *RP+,R11    22
B    *R11        16
-------------------------
                 38

* NO MEMORY WAIT STATES ASSUMED

Corrected slight timing errors

 

DECT RP          10
MOV  R11,*RP     18
BL   @ABCD       20
-------------------------
                 48

* RETURN
MOV  *RP+,R11    22
B    *R11        12
-------------------------
                 34

* NO MEMORY WAIT STATES ASSUMED

 

But if you then need to start shuffling data around in the subroutine, because you need more registers than you can "steal" from the calling procedure, you may quickly find yourself wading through slower molasses than if you had used BLWP. It's not just the call itself you have to look at, but also the effect of the call.

Edited by apersson850

Having more registers is always going to be slightly more efficient. With a fixed set of registers, whatever speed you gain on the call is given back in the long run.

Adding 10 more registers is very much like using larger disks instead of smaller ones. Yes, you can skimp and save, but in the long run you have to drop features.

BL uses the same registers, so you end up having to save some values and reload them later, which is inefficient.

BLWP is more like using a stack. As a matter of fact, it is very much like pushing and popping a stack.


I want to bring a different point of view to this discussion, which seems to favour BLWP over BL. At least personally, for assembler programs, I very much prefer BL over BLWP.

  • For leaf routines (which don't call other routines) you don't need the stack of return addresses, and you can just do BL @ABCD and B *R11 to return. 
  • Comparing BLWP with DECT RP, MOV R11,*RP, BL @routine is not, in my opinion, fair, since the latter supports recursion and is more flexible in that regard, while the former does not. One should add similar code to BLWP to support multiple levels of calls. And that really becomes involved, something like STWP Rx, AI Rx,-32, MOV Rx,@somewhere, BLWP @somewhere (the workspace pointer is the first word of the BLWP vector).
  • In general you want to keep your workspaces in the scratchpad memory. This leads to static allocation of workspaces in the scratchpad if you want to use BLWP, which is not as flexible as manual stacking of return addresses. Plus you run out of scratchpad space really fast when putting multiple workspaces there.
  • One also needs to be careful when using BLWP in a non-trivial program: if the routine called with BLWP shares its workspace with other routines, unexpected results might arise if one expects the workspace to preserve its state.
  • When using BL, since you're not changing the workspace pointer, you can quickly and easily access the registers of the calling routine with direct references (Rx), without using something more error prone with offsets like MOV @4(R13),R7 to get data from the caller's workspace.

For all of the reasons above I guess all compiler generated code uses BL and not BLWP.
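
The dynamic-workspace sequence mentioned in the list above (STWP, AI, MOV, BLWP) could look roughly like this. It is only a sketch: the labels RVEC and RENTRY are made up for illustration, the vector is assumed to sit in RAM so that it can be patched, and the 32 bytes below the current workspace are assumed to be free.

```asm
* Hypothetical labels (RVEC, RENTRY); not taken from any real program.
* Assumes the 32 bytes below the current workspace are unused RAM.
RVEC   DATA 0,RENTRY      BLWP vector: word 0 = new WP, word 1 = new PC

* --- call site ---
       STWP R1            R1 = address of the current workspace
       AI   R1,-32        step down 16 registers (32 bytes)
       MOV  R1,@RVEC      patch the WP word of the vector
       BLWP @RVEC         call RENTRY with the fresh workspace
*      execution resumes here after the RTWP
*      (the caller's code must not fall through into RENTRY)

* --- callee ---
RENTRY CLR  R0            callee has its own R0-R12 to work with
       RTWP               restore caller's WP, PC and ST from R13-R15
```

Each nesting level handled this way costs a full 32-byte workspace plus three extra instructions at the call site, whereas the stacked BL sequence costs two bytes per level.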

Edited by speccery

26 minutes ago, speccery said:
  • ...without using something more error prone with offsets like MOV @4(R13),R7 to get data from the caller's workspace.

That item is the only one I disagree with. Accessing the calling workspace based on the content of R13, or data after the call based on the content of R14, isn't any more error prone than using the register directly or accessing data based on the content of R11.

Rather, accessing registers via R13 is less error prone, as you are less likely to change a register you had better not mess with.


2 hours ago, apersson850 said:

That item is the only one I disagree with. Accessing the calling workspace based on the content of R13, or data after the call based on the content of R14, isn't any more error prone than using the register directly or accessing data based on the content of R11.

Rather, accessing registers via R13 is less error prone, as you are less likely to change a register you had better not mess with.

I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP? It appears to me to be bad practice for a child routine to know anything about the parent. With BL it's clear that the workspace is shared, so you know you have to be very specific about which registers you use as parameters and which registers the child routine is allowed to change. Admittedly that can also often become a mess.


4 minutes ago, Asmusr said:

I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP? It appears to me to be bad practice for a child routine to know anything about the parent. With BL it's clear that the workspace is shared, so you know you have to be very specific about which registers you use as parameters and which registers the child routine is allowed to change. Admittedly that can also often become a mess.

Indeed, I do the same with having comments at each entry point describing registers used and modified.  I'm on the verge of writing my own macro assembler or a compiler that will track/allocate registers automatically.

 

Regarding BLWP and the workspace in limited scratchpad memory, could you overlap the workspaces in a way to allow the callee access to the caller-passed registers as well as space for its own?

Edited by PeteE

Consider that the TI99/4A has a workspace at >83E0 for GPL, which does most of the work in a TI99/4A, and another at >83C0 for the interrupt routine.

So how come they did not use BL only for both of these, thus freeing up more scratch pad?

I think the most realistic answer is that BL, with its limited number of registers, is inefficient for complicated things.

Yes, BLWP takes up more memory, so BL is better for many things and is more efficient until it gets complicated.

That is, until you use memory to save registers or duplicate them for later use. Each time you save R11 with BL you have to have some place to put it, whereas BLWP/RTWP does this for you.


54 minutes ago, Asmusr said:

I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP?

Indeed you do. That's how you pass parameters. It's of course up to the programmer to make sure you don't write things where you should not, but apart from that, you can also return results the same way.

That kind of carefulness is just the normal standard, when you are programming in assembly language on this class of processors, where there is no protection from anything at all. You asked for it, you got it.
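
As a rough illustration of passing a parameter in and a result out through R13, and of picking up data placed after the call through R14, something like the following should work. The labels ADDWS, ADDVEC and ADDIT are hypothetical and the sketch has not been run on hardware.

```asm
* Hypothetical labels (ADDWS, ADDVEC, ADDIT); a sketch only.
ADDWS  BSS  32            workspace for the subroutine
ADDVEC DATA ADDWS,ADDIT   BLWP vector: WP first, then PC

* --- call site ---
       LI   R2,100        argument in the caller's R2
       BLWP @ADDVEC
       DATA 23            inline constant, fetched by the callee through R14
*      on return the caller's R2 holds 123

* --- callee ---
ADDIT  MOV  @4(R13),R1    caller's R2 sits at byte offset 2*2 = 4 from the old WP
       A    *R14+,R1      add the inline word and step the return PC past it
       MOV  R1,@4(R13)    hand the result back in the caller's R2
       RTWP
```

The callee only touches the one caller register it was told about, which is the kind of discipline described above.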

50 minutes ago, PeteE said:

Regarding BLWP and the workspace in limited scratchpad memory, could you overlap the workspaces in a way to allow the callee access to the caller-passed registers as well as space for its own?

Yes, you can overlap. In theory. The smallest offset is three registers, since you have to have a new set of R13, R14 and R15. But if you do that, then either your new R13, R14 and R15 will overlap the caller's R10-R12, and that's where the caller has its return linkage and CRU base, so you may not want to. Or you overlap in the other direction, but then the caller's return registers will be your return linkage and CRU base address, which you may not like any better.

You can have a larger offset, but no matter how you do, you'll lose three registers in one of the workspaces for the return linkage. I've never used that technology, since I think it creates more problems than it solves.
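
The two directions of overlap can be spelled out as address arithmetic. This is only a sketch: CALLWS, LOWWS and HIGHWS are made-up names, and >8320 is just an example address inside the scratch pad.

```asm
* Hypothetical names and addresses, for illustration only.
CALLWS EQU  >8320         caller's workspace: R0 at >8320 ... R15 at >833E

* Offset downward by 3 registers (6 bytes):
LOWWS  EQU  CALLWS-6      callee R0-R2 are new, callee R3-R15 = caller R0-R12
*                         BLWP stores WP/PC/ST in callee R13-R15,
*                         i.e. on top of the caller's R10-R12

* Offset upward by 3 registers (6 bytes):
HIGHWS EQU  CALLWS+6      callee R0-R12 = caller R3-R15, callee R13-R15 are new
*                         callee R11/R12 (BL link, CRU base) now alias
*                         the caller's R14/R15 (saved PC and ST)
```

Either way, three registers in one of the two workspaces end up occupied by linkage.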

44 minutes ago, RXB said:

Consider that the TI99/4A has a workspace at >83E0 for GPL, which does most of the work in a TI99/4A, and another at >83C0 for the interrupt routine.

So how come they did not use BL only for both of these, thus freeing up more scratch pad?

I think the most realistic answer is that BL, with its limited number of registers, is inefficient for complicated things.

But they do use BL with these workspaces. At least with GPLWS. I haven't checked the interrupt code to see if it uses BL too, but it's of course fully possible.

That there are two different workspaces is of course a given thing, considering their purpose.

Edited by apersson850

I find that if you have taken the time to reserve a register and some memory for a stack, it is easy to use and does not waste as much memory as using BLWP.

BLWP/RTWP IMHO is an amazing context switch mechanism, but most programs don't need a full context switch for sub-routines.

 

I will add an exception to that statement.

For data structures that need some local storage to maintain their state, a queue for example, it's nice to keep all the variables of the queue in registers.

This can be a good use case for BLWP.

 

For that purpose, I added an optional PROG: directive to my Assembler that lets you define a sub-program. You give it a workspace address and it allocates the vector.

All the children of PROG: can "call" (BLWP) themselves from Forth. 

I thought it was pretty slick.  This way you can write sub-programs that use a common workspace very easily.  These sub-programs behave like any other word in the language inside Forth. 

 

It looks like this:

HEX 20 MALLOC CONSTANT QWKSP  \ points to a workspace for Q operations

QWKSP PROG: INIT-QWKSP   \ code that initializes wksp
      <Assembler code ....
;PROG 

QWKSP PROG: ENQ ( c -- ? ) \ put byte in Q, return error code
      <Assembler code ....
;PROG 

QWKSP PROG: DEQ ( 0 -- c) \ returned char can be any byte value. [0..255]
      <Assembler code ....
;PROG 

QWKSP PROG: QSTAT ( 0 -- ?) \ true means data waiting
      <Assembler code ....
;PROG 


 

 


This isn't really an answer to the OP, but since others are talking about usage... When I was "teaching" assembly language in our TI user group setting, BLWP was often preferred by the "students". The discrete register workspaces were easier to deal with and seemed to reduce the bugs they introduced by poor register management. BLWP didn't necessarily save CPU clock cycles, but it often saved time for the programmer. Some people gravitated to BL once they felt comfortable with their coding or had a need to shave clock cycles for iterative routines.


24 minutes ago, InsaneMultitasker said:

This isn't really an answer to the OP, but since others are talking about usage... When I was "teaching" assembly language in our TI user group setting, BLWP was often preferred by the "students". The discrete register workspaces were easier to deal with and seemed to reduce the bugs they introduced by poor register management. BLWP didn't necessarily save CPU clock cycles, but it often saved time for the programmer. Some people gravitated to BL once they felt comfortable with their coding or had a need to shave clock cycles for iterative routines.

Yes, I agree. Quinton and I taught an assembly course for the TI99/4A club PUNN here in Portland, Oregon.

I am slowly converting RXB from GPL to assembly, using only the GPL registers and scratch pad, so everything has to work with ONLY the CONSOLE and the RXB cart.

Thus I have to avoid BLWP entirely and only use BL, and the very first problem Lee Stewart and I have run into is having to use up scratch pad.

You see, there are only 10 registers you can use (R11 to R15 are taken), and sometimes you need 14 to do something really complicated, so you have to use scratch pad locations instead.

Scratch pad is fast, but registers are way faster. More registers make it way easier to do more complicated things with fewer errors and fewer memory swaps.


BLWP, by having an isolated register set, is useful because it's sort of like a method in a high-level language like Java, C, C#, etc.

 

For BL/RT, I usually set up an internal stack and burn R10 as a pointer for it, so I can push and pop return addresses off if needed. That way BL routines can call other BL routines but still work their way back to the original caller.

 

What I did find when coding Realms of Antiquity was that I'd frequently run into issues where I was using registers to hold specific values at the top of a return stack, and either I'd end up running out of registers to use or worse, accidentally use one and cause a bug. That's where I would convert some subroutines into BLWP versions, because then they operated independently.

 

Also, I've used register sets in the regular CPU memory areas other than the scratch pad and not had performance problems. It's all about the context of what you're using them for.


This is an interesting discussion.

 

I was thinking about this idiomatic code that we often see:

 

        BL @THING
        ...
        ...
THING   MOV R11,@R11SAV  ; save return address
        ...
        ...
        ; do stuff that uses R11 in some way
        ...
        ...
        MOV @R11SAV,R11
        RT

By the time you've saved R11 and restored it afterwards, isn't it a wash in terms of performance against BLWP/RTWP?
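
For what it is worth, the two variants can be tallied in the same style as the timings earlier in the thread. The figures below use the instruction and addressing-mode cycle counts from the TMS 9900 data manual with no wait states assumed; THGVEC is a made-up label for the BLWP vector, and the lines are a timing table rather than a program to assemble.

```asm
* BL with R11 saved to memory (THING and R11SAV as in the listing above)
       BL   @THING            20
THING  MOV  R11,@R11SAV       22
       MOV  @R11SAV,R11       22
       RT                     12   (RT assembles as B *R11)
*                            ----
*                             76

* BLWP with a dedicated workspace (THGVEC is a hypothetical vector label)
       BLWP @THGVEC           34   (26 + 8 for symbolic addressing)
       RTWP                   14
*                            ----
*                             48
```

On these numbers BLWP/RTWP comes out ahead of saving R11 to a memory location. Saving R11 in a spare register instead (MOV R11,R12 at 14 cycles, B *R12 at 12) would bring the BL version to 46. On a real TI-99/4A the picture also depends on wait states, so treat the totals as a rough guide only.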

 

In a multi-tiered application (which could apply to most applications, such as games) such as:

[Image: layer diagram of a multi-tiered application; the text below refers to its IO, logic, and presentation layers]

Would there not be a distinct advantage* in using dedicated workspaces for each logical layer? At least in terms of programmer convenience. As Rich noted above, workspace linking is the closest we have to a stack. Using separate workspaces would make it much easier to (for example, referring to the layer drawing above) load some data from the IO layer, process it in some way through the logic layer, and then present it at the presentation layer, e.g. in a word processor or text editor.

 

Thoughts?


3 hours ago, Willsy said:

Frequently you can make it slightly more efficient.

 

        BL @THING
        ...
        ...
THING   MOV R11,R12  ; save return address
        ...
        ...
        ; do stuff that uses R11 in some way, like BL @somewhere
        ...
        ...
        B   *R12
 

 

If CRU access isn't used, then there's no problem using R12 to save the return address.

 

As a general thing, if you call a subroutine that's complex enough, then it usually pays to use BLWP. By complex enough I mean something which benefits from using several registers by itself. It may pay in easy-to-count clock cycles, as you don't have to move around so much data, and it may certainly pay in hours, as it's easier to avoid messing with the caller's data.

Especially in machines like mine, which has 16-bit wide RAM everywhere, there's no penalty for having the workspace outside of the scratch pad RAM.


I forgot to define my asterisk above :-)

 

* depends on your definition of 'advantage' I guess. I'm currently in favour of clarity and ease-of-coding over performance. For example, if you have a hand-rolled keyboard scanning routine, why bother worrying about 'performance' when by far the biggest time sink is the scanning of the keyboard in the first place. Surely better to let the keyboard scan be entirely independent and use its own workspace?


2 hours ago, Willsy said:

I forgot to define my asterisk above :-)

 

* depends on your definition of 'advantage' I guess. I'm currently in favour of clarity and ease-of-coding over performance. For example, if you have a hand-rolled keyboard scanning routine, why bother worrying about 'performance' when by far the biggest time sink is the scanning of the keyboard in the first place. Surely better to let the keyboard scan be entirely independent and use its own workspace?

I wonder if part of the issue with "ease-of-coding" is that for years we were using an assembler with no macro capability.

The assembler I use has PUSH, POP,  RPUSH and RPOP for the two Forth stacks. If we accept the performance hit, it's pretty easy to use.

And since the 9900 accesses memory so easily, even parameters sitting on the stack(s) can be modified with little effort.

 


2 hours ago, apersson850 said:

Since the p-system was my favorite environment, I enjoyed a macro assembler as soon as I got the p-code card. BLS (Branch with Link on Stack) and RLS (Return with Link from Stack) were some of my definitions.

I named mine CALL and RET  :) 

 


Yea I forgot the cost in CPU cycles having to save R11 and then put it back for a return.

MOV  R11,R9

B      *R9

 

Versus

BLWP @Address

RTWP

 

BL is going to run out of Registers to use fast and no way around that.

And each time you need to do a BL to another subroutine you need another register to save that address:

MOV   R11,R8

B       *R8

Also, you have to add those two lines to get back to the original routine that it came from, so how is that faster?

With BLWP you do not waste a register, nor do you need to use up all the registers to get back to the calling routine.


As you show, you don't have to put a saved return address back to return, if you saved it in a register. Just return via the register you saved it in, like you show with B *R9.

 

Simple devices like programmable calculators used to have 3-6 levels of subroutines (HP 67 and TI 59 are examples of that). Rarely were these limits a problem. How many levels deep do you normally need, if you run out of registers to use fast? It's not very often I'm more than 2-3 levels deep. Subroutines on the same level that don't call each other can use the same register. Here's an example of a main program which calls three different subroutines which in turn call three different subroutines, nesting max two levels deep. Since one level is handled by the normal R11 procedure, only one more level needs to be handled and that's done by using one single register.

 

; Main program

BL @SUBA
...
BL @SUBB
...
BL @SUBC
...
; End of main

SUBA
...
MOV R11,R9
BL @SUB1
BL @SUB2
B *R9

SUBB
...
MOV R11,R9
BL @SUB3
BL @SUB1
B *R9

SUBC
...
MOV R11,R9
BL @SUB2
BL @SUB3
BL @SUB1
B *R9

SUB1
...
B *R11

SUB2
...
B *R11

SUB3
...
B *R11

 


1 hour ago, RXB said:

Yea I forgot the cost in CPU cycles having to save R11 and then put it back for a return.

MOV  R11,R9

B      *R9

 

Versus

BLWP @Address

RTWP

 

BL is going to run out of Registers to use fast and no way around that.

And each time you need to do a BL to another subroutine you need another register to save that address:

MOV   R11,R8

B       *R8

Also, you have to add those two lines to get back to the original routine that it came from, so how is that faster?

With BLWP you do not waste a register, nor do you need to use up all the registers to get back to the calling routine.

There is a way around that. It is done this way on the ARM processor and other modern machines as well.

You allocate one register to be a stack pointer.

 

Then your call sequence is as @apersson850  showed previously.

RP   EQU R10 

* call *
DECT RP              10
MOV  R11,*RP         18
BL  @ABCD            20 
-------------------------
                     48 

* RETURN *
 MOV *RP+,R11        22
 B   *R11            12
----------------------------
                     34     

 

If you use a macro assembler these can be  turned into one line.

CALL @MYCODE  

MYCODE    BLAH
          BLAH
            .
            .
            .
          RET 

 

 

Nope. It's not faster than BLWP/RTWP, but instead of needing a workspace for every sub-routine you just reserve a few bytes as your return stack.

With 20 bytes you can nest sub-routines 10 deep.
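
A minimal sketch of setting up such a 20-byte return stack follows, assuming R10 is the stack pointer and the stack grows downward. The labels RSTACK, RSTOP, START, OUTER and INNER are invented for the example, and each non-leaf routine is assumed to push R11 on entry and pop it before returning.

```asm
* Hypothetical labels (RSTACK, RSTOP, START, OUTER, INNER); a sketch only.
RP     EQU  10            use R10 as the return stack pointer
RSTACK BSS  20            20 bytes = room for 10 return addresses
RSTOP  EQU  $             first address past the reserved area

START  LI   RP,RSTOP      stack grows downward, so start at the top
       BL   @OUTER
       JMP  $             park here when OUTER comes back

OUTER  DECT RP            push our return address
       MOV  R11,*RP
       BL   @INNER        R11 is overwritten by this call
       MOV  *RP+,R11      pop our return address
       B    *R11

INNER  B    *R11          leaf routine: no stack traffic needed
```

Nothing here checks for overflow, so an eleventh nested call would write below RSTACK.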

 

If you make macros for PUSH and POP, you can also use the return stack for temp storage anytime you need it.

      

       PUSH  R1   
       PUSH  R2
       
       LI R1,>1234
       LI R2,>5678
       A  R1,R2  
       MOV R2,@TOTAL 
       
       POP R2 
       POP R1 
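
Written out by hand, PUSH and POP would plausibly expand to the same instructions as the call sequence shown earlier, assuming RP is the stack pointer register. The cycle counts follow the same no-wait-state figures used in the thread.

```asm
* Assumed expansions; the actual macro definitions are not shown in the thread.

* PUSH R1
       DECT RP              10   make room for one word
       MOV  R1,*RP          18   store R1 at the new top of stack
*                          ----
*                           28

* POP R1
       MOV  *RP+,R1         22   fetch top of stack, then drop it
```

So a matched PUSH/POP pair costs about 50 cycles per register on these figures.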

 

With a stack you are using a small memory space over and over for multiple purposes. 

 

Edited by TheBF
CODE mistake

The drawback of the flexible stack method is that the TMS 9900 doesn't have "deferred indirect". If it did, you could have returned with a mechanism like B @*SP+, which should be read as branch to the address stored in the memory position pointed to by the stack pointer and increment the stack pointer by two. Such instructions do exist, but typically in 32-bit architectures.

We need to do as BF did above, MOV *SP+,R11 and then B *R11.
