Willsy, posted December 1, 2022

How much more expensive is BLWP than BL, in terms of clock cycles and/or µs? Ta.

TheBF, posted December 1, 2022

    BLWP      26 (+ addressing)
    RTWP      14 (+ addressing)
    BL        12 (+ addressing)
    B *R11    16 (RT)

TheBF, posted December 1, 2022

Oh, but of course, if you need nested sub-routines on a stack, it is really sad.

           DECT RP           10
           MOV  R11,*RP      16
           BL   @ABCD        20
           --------------------
                             46
    * RETURN
           MOV  *RP+,R11     22
           B    *R11         16
           --------------------
                             38
    * NO MEMORY WAIT STATES ASSUMED

apersson850, posted December 1, 2022 (edited)

1 hour ago, TheBF said:
> Oh, but of course, if you need nested sub-routines on a stack, it is really sad.
>
>            DECT RP           10
>            MOV  R11,*RP      16
>            BL   @ABCD        20
>            --------------------
>                              46
>     * RETURN
>            MOV  *RP+,R11     22
>            B    *R11         16
>            --------------------
>                              38
>     * NO MEMORY WAIT STATES ASSUMED

Corrected slight timing errors:

           DECT RP           10
           MOV  R11,*RP      18
           BL   @ABCD        20
           --------------------
                             48
    * RETURN
           MOV  *RP+,R11     22
           B    *R11         12
           --------------------
                             34
    * NO MEMORY WAIT STATES ASSUMED

But if you then need to start shuffling data around in the subroutine, because you need more registers than you can "steal" from the calling procedure, you may quickly find yourself in a slower morass than if you had used BLWP. It's not just the call itself you have to look at, but also the effect of the call.

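To put the corrected figures side by side and express them in microseconds as the original question asked, the arithmetic can be sketched in Python. This is only an illustration under stated assumptions: the cycle counts are the ones quoted in this thread, the 8-cycle cost of symbolic (@ADDR) addressing is inferred from BL being quoted as 12 base and 20 for BL @ABCD, RTWP is taken as a flat 14 because it has no operand, the 3 MHz figure is the nominal TI-99/4A CPU clock, and no memory wait states are counted. All names in the code are made up for the example.

```python
# Rough cost comparison of the call/return styles discussed in this thread.
# Assumptions: cycle counts as quoted in the thread, 3 MHz clock,
# no memory wait states.

CLOCK_MHZ = 3.0   # assumed nominal TI-99/4A CPU clock
SYMBOLIC = 8      # assumed extra cycles for @ADDR addressing (BL: 12 -> 20)


def to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Convert clock cycles to microseconds at the given clock rate."""
    return cycles / clock_mhz


# Leaf routine: BL @ABCD ... B *R11
leaf = (12 + SYMBOLIC) + 12

# Full context switch: BLWP @VECTOR ... RTWP
blwp = (26 + SYMBOLIC) + 14

# Return address kept on a software stack:
#   DECT RP / MOV R11,*RP / BL @ABCD ... MOV *RP+,R11 / B *R11
stacked = (10 + 18 + 20) + (22 + 12)

costs = {
    "BL / B *R11 (leaf)": leaf,
    "BLWP / RTWP": blwp,
    "BL with return stack": stacked,
}

for name, cycles in costs.items():
    print(f"{name:22s} {cycles:3d} cycles  {to_us(cycles):6.2f} us")
```

Under these assumptions a BLWP/RTWP pair costs 16 cycles more than a plain leaf call, but 34 cycles fewer than a BL call that keeps its return address on a stack. Wait states on a real console will shift the numbers.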
RXB, posted December 1, 2022

Having more registers is always going to be slightly more efficient; with a fixed number of registers, any speed gain comes at the cost of efficiency in the long term. Adding 10 more registers is very much like using larger disks instead of smaller ones. Yes, you can skimp and save, but in the long run you have to drop features.

A BL uses the same registers, so you end up having to save some stuff and then reload it later, which is inefficient. BLWP is more like using a stack; as a matter of fact it is very much like pushing and popping a stack.

speccery, posted December 2, 2022 (edited)

I want to bring a different point of view to this discussion, which seems to favour BLWP over BL. Personally, for assembler programs, I very much prefer BL over BLWP.

For leaf routines (which don't call other routines) you don't need a stack of return addresses; you can just do BL @ABCD and return with B *R11.

Comparing BLWP with DECT RP, MOV R11,*RP, BL @routine is not fair in my opinion, since the latter supports recursion and is more flexible in that regard, while the former does not. One should add similar code to BLWP to support multiple levels of calls. And that really becomes involved, something like STWP Rx, AI Rx,-32, MOV Rx,@somewhere+2, BLWP @somewhere.

In general you want to keep your workspace pointer pointing into scratchpad memory. If you want to use BLWP, this leads to static allocation of workspaces in the scratchpad, which is not as flexible as manually stacking return addresses. Plus you run out of scratchpad space really fast when putting multiple workspaces there.

One also needs to be careful when using BLWP in a non-trivial program: if the routine called with BLWP shares its workspace with other routines, unexpected results might arise if one expects the workspace to preserve its state.

When using BL, since you're not changing the workspace pointer, you can quickly and easily access the registers of the calling routine with direct references (Rx), without using something more error prone with offsets, like MOV @4(R13),R7, to get data from the caller's workspace.

For all of the reasons above, I guess all compiler-generated code uses BL and not BLWP.

apersson850, posted December 2, 2022

26 minutes ago, speccery said:
> ...without using something more error prone with offsets, like MOV @4(R13),R7, to get data from the caller's workspace.

That item is the only one I disagree with. Accessing the calling workspace based on the content of R13, or data after the call based on the content of R14, isn't any more error prone than using the register directly or accessing data based on the content of R11. Rather, accessing registers via R13 is less error prone, as you are less likely to change a register you'd better not mess with.

Asmusr, posted December 2, 2022

2 hours ago, apersson850 said:
> That item is the only one I disagree with. Accessing the calling workspace based on the content of R13, or data after the call based on the content of R14, isn't any more error prone than using the register directly or accessing data based on the content of R11. Rather, accessing registers via R13 is less error prone, as you are less likely to change a register you'd better not mess with.

I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP? It appears to me to be bad practice for a child routine to know anything about the parent.

With BL it's clear that the workspace is shared, so you know you have to be very specific about which registers you use as parameters and which registers the child routine is allowed to change. Admittedly that can also often become a mess.

PeteE, posted December 2, 2022 (edited)

4 minutes ago, Asmusr said:
> I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP? It appears to me to be bad practice for a child routine to know anything about the parent. With BL it's clear that the workspace is shared, so you know you have to be very specific about which registers you use as parameters and which registers the child routine is allowed to change. Admittedly that can also often become a mess.

Indeed, I do the same, with comments at each entry point describing the registers used and modified. I'm on the verge of writing my own macro assembler or a compiler that will track/allocate registers automatically.

Regarding BLWP and the workspace in limited scratchpad memory, could you overlap the workspaces in a way that allows the callee access to the caller-passed registers as well as space for its own?

RXB, posted December 2, 2022

Consider that the TI99/4A has >83E0 for the GPL registers, which do most of the workload in a TI99/4A, and >83C0 for the interrupt registers. So how come they did not use BL only for both of these, thus freeing up more Scratch Pad? I think the most realistic answer is the inefficiency of BL's limited number of registers for doing complicated things.

Yes, BLWP takes up more memory, so BL is better for many things and is more efficient until it gets complicated, i.e. using memory to save registers or duplicate them for later use. Each time you save R11 when using BL you have to have some place to put it, whereas BLWP/RTWP does this for you.

apersson850, posted December 2, 2022 (edited)

54 minutes ago, Asmusr said:
> I never use BLWP, so is it common to read (or even change) the caller's registers when using BLWP?

Indeed you do. That's how you pass parameters. It's of course up to the programmer to make sure you don't write things where you should not, but apart from that, you can also return results the same way. That kind of carefulness is just the normal standard when you are programming in assembly language on this class of processors, where there is no protection from anything at all. You asked for it, you got it.

50 minutes ago, PeteE said:
> Regarding BLWP and the workspace in limited scratchpad memory, could you overlap the workspaces in a way that allows the callee access to the caller-passed registers as well as space for its own?

Yes, you can overlap. In theory. The smallest offset is three registers, since you have to have a new set of R13, R14 and R15. But if you do that, then either your new R13, R14 and R15 will overlap R10-R12 of the caller's, and that's where the caller has return linkage and CRU base, so you may not want to. Or you overlap in the other direction, but then the caller's return registers will be your return linkage and CRU base address, which you may not like any better. You can have a larger offset, but no matter how you do it, you'll lose three registers in one of the workspaces for the return linkage.

I've never used that technique, since I think it creates more problems than it solves.

44 minutes ago, RXB said:
> Consider that the TI99/4A has >83E0 for the GPL registers, which do most of the workload in a TI99/4A, and >83C0 for the interrupt registers. So how come they did not use BL only for both of these, thus freeing up more Scratch Pad? I think the most realistic answer is the inefficiency of BL's limited number of registers for doing complicated things.

But they do use BL with these workspaces. At least with GPLWS. I haven't checked the interrupt code to see if it uses BL too, but it's of course fully possible. That there are two different workspaces is of course a given, considering their purpose.

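The two overlap directions described above can be checked with a little address arithmetic. The following Python sketch is only an illustration: the function name and the example addresses (a caller workspace at >8320, with the callee workspace 6 bytes below or above it) are made up here. The only facts it relies on are that a workspace is 16 consecutive words and that register Rn lives at WP + 2n.

```python
def shared_registers(caller_wp, callee_wp):
    """Map each callee register number to the caller register number
    that occupies the same memory word, for overlapping workspaces."""
    shared = {}
    for callee_reg in range(16):
        address = callee_wp + 2 * callee_reg   # Rn lives at WP + 2n
        offset = address - caller_wp
        if 0 <= offset < 32 and offset % 2 == 0:
            shared[callee_reg] = offset // 2
    return shared


CALLER_WP = 0x8320  # hypothetical caller workspace address

# Callee workspace three registers (6 bytes) below the caller's.
lower = shared_registers(CALLER_WP, CALLER_WP - 6)

# Callee workspace three registers (6 bytes) above the caller's.
higher = shared_registers(CALLER_WP, CALLER_WP + 6)

for label, mapping in (("callee below", lower), ("callee above", higher)):
    pairs = ", ".join(f"R{a}=caller R{b}" for a, b in sorted(mapping.items()))
    print(f"{label}: {pairs}")
```

With the callee workspace placed below, its R13-R15 (written by BLWP) land on the caller's R10-R12. Placed above, the callee's R10-R12 coincide with the caller's R13-R15. Either way three registers in one of the workspaces are given up, which matches the description in the post.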
TheBF, posted December 2, 2022

I find that if you have taken the time to reserve a register and some memory for a stack, it is easy to use and does not waste as much memory as using BLWP. BLWP/RTWP IMHO is an amazing context switch mechanism, but most programs don't need a full context switch for sub-routines.

I will add an exception to that statement. For data structures that need some local storage to maintain their state, a queue for example, it's nice to keep all the variables of the queue in registers. This can be a good use case for BLWP.

For that purpose, I added an optional PROG: directive to my Assembler that lets you define a sub-program. You give it a workspace address and it allocates the vector. All the children of PROG: can "call" (BLWP) themselves from Forth. I thought it was pretty slick.

This way you can write sub-programs that use a common workspace very easily. These sub-programs behave like any other word in the language inside Forth.

It looks like this:

    HEX
    20 MALLOC CONSTANT QWKSP    \ points to a workspace for Q operations

    QWKSP PROG: INIT-QWKSP      \ code that initializes wksp
        <Assembler code ....
    ;PROG

    QWKSP PROG: ENQ ( c -- ? )  \ put byte in Q, return error code
        <Assembler code ....
    ;PROG

    QWKSP PROG: DEQ ( 0 -- c)   \ returned char can be any byte value. [0..255]
        <Assembler code ....
    ;PROG

    QWKSP PROG: QSTAT ( 0 -- ?) \ true means data waiting
        <Assembler code ....
    ;PROG

InsaneMultitasker, posted December 2, 2022

This isn't really an answer to the OP, but since others are talking about usage...

When I was "teaching" assembly language in our TI user group setting, BLWP was often preferred by the "students". The discrete register workspaces were easier to deal with and seemed to reduce the bugs they introduced through poor register management. BLWP didn't necessarily save CPU clock cycles, but it often saved time for the programmer. Some people gravitated to BL once they felt comfortable with their coding or had a need to shave clock cycles off iterative routines.

RXB, posted December 2, 2022

24 minutes ago, InsaneMultitasker said:
> This isn't really an answer to the OP, but since others are talking about usage... When I was "teaching" assembly language in our TI user group setting, BLWP was often preferred by the "students". The discrete register workspaces were easier to deal with and seemed to reduce the bugs they introduced through poor register management. BLWP didn't necessarily save CPU clock cycles, but it often saved time for the programmer. Some people gravitated to BL once they felt comfortable with their coding or had a need to shave clock cycles off iterative routines.

Yes, I agree. Quinton and I taught an Assembly course for the TI99/4A club PUNN here in Portland, Oregon.

I am slowly converting RXB from GPL to Assembly and use only the GPL registers and Scratch Pad, so everything has to work with ONLY the CONSOLE and the RXB cart. Thus I constantly have to avoid BLWP and use only BL, and the very first problem Lee Stewart and I have run into is having to use up Scratch Pad.

You see, there are only 10 registers you can use (R11 to R15 are taken), and sometimes you need 14 to do something really complicated, so you have to use up Scratch Pad locations instead. Scratch Pad is fast, but registers are way faster. More registers make it much easier to do more complicated things with fewer errors and fewer memory swaps.

adamantyr, posted December 2, 2022

BLWP, by having an isolated register set, is useful because it's sort of like a method in a high-level language like Java, C, C#, etc.

For BL/RT, I usually set up an internal stack and burn R10 as a pointer for it, so I can push and pop return addresses if needed. That way BL routines can call other BL routines but still work their way back to the original caller.

What I did find when coding Realms of Antiquity was that I'd frequently run into issues where I was using registers to hold specific values at the top of a return stack, and either I'd end up running out of registers to use or, worse, accidentally use one and cause a bug. That's where I would convert some subroutines into BLWP versions, because then they operated independently.

Also, I've used register sets in regular CPU memory areas other than the scratch pad and not had performance problems. It's all about the context of what you're using them for.

Willsy, posted December 3, 2022 (author)

This is an interesting discussion. I was thinking about this idiomatic code that we often see:

           BL   @THING
           ...
           ...
    THING  MOV  R11,@R11SAV   ; save return address
           ...
           ... do stuff that uses R11 in some way
           ...
           MOV  @R11SAV,R11
           RT

By the time you've saved R11 and restored it afterwards, isn't it a wash in terms of performance against BLWP/RTWP?

In a multi-tiered application (which could apply to most applications, such as games) such as:

[Image not preserved: drawing of the application layers (IO, logic, presentation)]

Would there not be a distinct advantage* in using dedicated workspaces for each logical layer? At least in terms of programmer convenience. As Rich noted above, workspace linking is the closest we have to a stack. Using separate workspaces would make it much easier to (for example, referring to the layer drawing above) load some data from the IO layer, process it in some way through the logic layer, and then present it at the presentation layer. E.g. a word processor or text editor.

Thoughts?

apersson850, posted December 3, 2022

In reply to Willsy (3 hours earlier):

Frequently you can make it slightly more efficient.

           BL   @THING
           ...
           ...
    THING  MOV  R11,R12       ; save return address
           ...
           ... do stuff that uses R11 in some way, like BL @somewhere
           ...
           B    *R12

If CRU access isn't used, then there's no problem using R12 to save the return address.

As a general rule, if you call a subroutine that's complex enough, it usually pays to use BLWP. By complex enough I mean something which benefits from using several registers by itself. It may pay in easy-to-count clock cycles, as you don't have to move so much data around, and it may certainly pay in hours, as it's easier to avoid messing with the caller's data. Especially in machines like mine, which have 16-bit-wide RAM everywhere, there's no penalty for having the workspace outside of the scratch pad RAM.

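Whether saving and restoring R11 really is "a wash" against BLWP/RTWP can be estimated by adding up cycles. The Python sketch below is a rough illustration only. The base and addressing costs in the two tables are inferred from figures quoted earlier in this thread (MOV R11,*RP = 18, MOV *RP+,R11 = 22, BL @ABCD = 20 against a base of 12, B *R11 = 12, BLWP = 26 plus addressing, RTWP = 14); the helper name `cycles` is made up, and memory wait states are ignored.

```python
# Assumed base cycle counts and addressing-mode surcharges, inferred from
# the figures quoted in this thread; no memory wait states are counted.
BASE = {"MOV": 14, "BL": 12, "B": 8, "BLWP": 26, "RTWP": 14}
ADDRESSING = {"reg": 0, "indirect": 4, "autoinc": 8, "symbolic": 8}


def cycles(op, *modes):
    """Base cost of the instruction plus the cost of each operand's mode."""
    return BASE[op] + sum(ADDRESSING[mode] for mode in modes)


# BL @THING / MOV R11,@R11SAV ... MOV @R11SAV,R11 / RT (RT is B *R11)
save_in_memory = (
    cycles("BL", "symbolic")
    + cycles("MOV", "reg", "symbolic")
    + cycles("MOV", "symbolic", "reg")
    + cycles("B", "indirect")
)

# BL @THING / MOV R11,R12 ... B *R12
save_in_register = (
    cycles("BL", "symbolic")
    + cycles("MOV", "reg", "reg")
    + cycles("B", "indirect")
)

# BLWP @THING ... RTWP
context_switch = cycles("BLWP", "symbolic") + cycles("RTWP")

print("R11 saved in memory:  ", save_in_memory)
print("R11 saved in register:", save_in_register)
print("BLWP / RTWP:          ", context_switch)
```

On these assumed figures, saving R11 to a memory word and restoring it costs more than a BLWP/RTWP pair, while parking it in a spare register costs slightly less. The comparison covers only the call and return overhead, not the extra data shuffling inside the routine that the post mentions.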
Willsy, posted December 3, 2022 (author)

I forgot to define my asterisk above :-)

* depends on your definition of 'advantage', I guess. I'm currently in favour of clarity and ease-of-coding over performance. For example, if you have a hand-rolled keyboard scanning routine, why bother worrying about 'performance' when by far the biggest time sink is the scanning of the keyboard in the first place? Surely it's better to let the keyboard scan be entirely independent and use its own workspace?

TheBF, posted December 3, 2022

2 hours ago, Willsy said:
> I forgot to define my asterisk above :-) * depends on your definition of 'advantage', I guess. I'm currently in favour of clarity and ease-of-coding over performance. For example, if you have a hand-rolled keyboard scanning routine, why bother worrying about 'performance' when by far the biggest time sink is the scanning of the keyboard in the first place? Surely it's better to let the keyboard scan be entirely independent and use its own workspace?

I wonder if part of the issue with "ease-of-coding" is that for years we were using an assembler with no macro capability. The assembler I use has PUSH, POP, RPUSH and RPOP for the two Forth stacks. If we accept the performance hit, it's pretty easy to use. And since the 9900 accesses memory so easily, even parameters sitting on the stack(s) can be modified with little effort.

apersson850, posted December 3, 2022

Since the p-system was my favorite environment, I enjoyed a macro assembler as soon as I got the p-code card. BLS (Branch with Link on Stack) and RLS (Return with Link from Stack) were some of my definitions.

TheBF, posted December 3, 2022

2 hours ago, apersson850 said:
> Since the p-system was my favorite environment, I enjoyed a macro assembler as soon as I got the p-code card. BLS (Branch with Link on Stack) and RLS (Return with Link from Stack) were some of my definitions.

I named mine CALL and RET.

RXB, posted December 3, 2022

Yea, I forgot the cost in CPU cycles of having to save R11 and then put it back for a return.

           MOV  R11,R9
           B    *R9

versus

           BLWP @Address
           RTWP

BL is going to run out of registers to use fast, and there is no way around that. And each time you need to do a BL to another subroutine, you need another register to save that address:

           MOV  R11,R8
           B    *R8

Also, you have to add in those two lines to get back to the original routine that it came from, so how is that faster? With BLWP you do not waste a register, nor do you need to use up all registers to get back to the calling routine.

apersson850, posted December 3, 2022

As you show, you don't have to put a saved return address back to return, if you saved it in a register. Just return via the register you saved it in, like you show with B *R9.

Simple devices like programmable calculators used to have 3-6 levels of subroutines (the HP 67 and TI 59 are examples of that). Rarely were these limits a problem.

How many levels deep do you normally need, if you run out of registers to use fast? It's not very often I'm more than 2-3 levels deep. Subroutines on the same level that don't call each other can use the same register.

Here's an example of a main program which calls three different subroutines, which in turn call three different subroutines, nesting at most two levels deep. Since one level is handled by the normal R11 procedure, only one more level needs to be handled, and that's done by using one single register.

    ; Main program
           BL   @SUBA
           ...
           BL   @SUBB
           ...
           BL   @SUBC
           ...
    ; End of main

    SUBA   ...
           MOV  R11,R9
           BL   @SUB1
           BL   @SUB2
           B    *R9

    SUBB   ...
           MOV  R11,R9
           BL   @SUB3
           BL   @SUB1
           B    *R9

    SUBC   ...
           MOV  R11,R9
           BL   @SUB2
           BL   @SUB3
           BL   @SUB1
           B    *R9

    SUB1   ...
           B    *R11

    SUB2   ...
           B    *R11

    SUB3   ...
           B    *R11

TheBF, posted December 3, 2022 (edited)

1 hour ago, RXB said:
> Yea, I forgot the cost in CPU cycles of having to save R11 and then put it back for a return.
>
>            MOV  R11,R9
>            B    *R9
>
> versus
>
>            BLWP @Address
>            RTWP
>
> BL is going to run out of registers to use fast, and there is no way around that. And each time you need to do a BL to another subroutine, you need another register to save that address:
>
>            MOV  R11,R8
>            B    *R8
>
> Also, you have to add in those two lines to get back to the original routine that it came from, so how is that faster? With BLWP you do not waste a register, nor do you need to use up all registers to get back to the calling routine.

There is a way around that. It is done this way on the ARM processor and other modern machines as well. You allocate one register to be a stack pointer. Then your call sequence is as @apersson850 showed previously.

    RP     EQU  R10

    * call *
           DECT RP           10
           MOV  R11,*RP      18
           BL   @ABCD        20
           --------------------
                             48
    * RETURN *
           MOV  *RP+,R11     22
           B    *R11         12
           --------------------
                             34

If you use a macro assembler, these can be turned into one line each.

           CALL @MYCODE

    MYCODE BLAH
           BLAH
           . . .
           RET

Nope. It's not faster than BLWP/RTWP, but instead of needing a workspace for every sub-routine you just reserve a few bytes as your return stack. With 20 bytes you can nest sub-routines 10 deep.

If you make macros for PUSH and POP, you can also use the return stack for temp storage any time you need it.

           PUSH R1
           PUSH R2
           LI   R1,>1234
           LI   R2,>5678
           A    R1,R2
           MOV  R2,@TOTAL
           POP  R2
           POP  R1

With a stack you are using a small memory space over and over for multiple purposes.

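The "20 bytes gives 10 levels" claim is easy to see with a small model of the return stack. This Python sketch is purely illustrative: the class, its method names and the addresses are invented for the example, and the only thing it mimics is the push (DECT RP followed by MOV R11,*RP) and pop (MOV *RP+,R11) sequences shown above, with a stack that grows downward two bytes at a time.

```python
class ReturnStack:
    """Toy model of a downward-growing return stack of 16-bit words."""

    def __init__(self, base, size_bytes):
        self.limit = base              # lowest address the stack may use
        self.top = base + size_bytes   # RP value when the stack is empty
        self.rp = self.top
        self.memory = {}

    def push(self, word):
        """Models DECT RP followed by MOV Rx,*RP."""
        if self.rp - 2 < self.limit:
            raise OverflowError("return stack full")
        self.rp -= 2
        self.memory[self.rp] = word & 0xFFFF

    def pop(self):
        """Models MOV *RP+,Rx."""
        if self.rp >= self.top:
            raise IndexError("return stack empty")
        word = self.memory[self.rp]
        self.rp += 2
        return word


# Hypothetical 20-byte stack area; the addresses are arbitrary examples.
stack = ReturnStack(base=0x8300, size_bytes=20)

depth = 0
try:
    while True:
        stack.push(0xA000 + 2 * depth)   # pretend return addresses
        depth += 1
except OverflowError:
    pass

print("maximum nesting depth:", depth)
print("last return address pushed:", hex(stack.pop()))
```

The same few bytes are reused by every call chain, which is the point made in the post: the cost is paid in cycles per call rather than in a 32-byte workspace per routine.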
apersson850, posted December 3, 2022

The drawback of the flexible stack method is that the TMS 9900 doesn't have "deferred indirect". If it did, you could have returned with a mechanism like B @*SP+, which should be read as: branch to the address stored in the memory position pointed to by the stack pointer, and increment the stack pointer by two. Such instructions do exist, but typically in 32-bit architectures. We need to do as BF did above: MOV *SP+,R11 and then B *R11.
