+retroclouds Posted September 5, 2010

I'd like to use this thread for collecting any cool tricks you can do in TMS9900 assembly language. By "trick" I mean optimizing a statement for speed and/or size, or just doing something with an instruction you didn't think was possible at all. So how about it, any cool tricks you wanna share? I'll kick it off with a little trick I found on Thierry's page:

C instruction

Apart from comparison, this instruction can also be used to increment a register by four:

C *Rx+,*Rx+

This uses only one word of memory as opposed to the equivalent:

INCT Rx
INCT Rx

Note that the corresponding CB instruction would increment the register by two, but there is no advantage over a plain vanilla INCT in this case.
Tursi Posted September 5, 2010

Well, a similar trick to Thierry's with Compare is that you can increment /two/ registers in one instruction:

C  *R1+,*R2+   -- increment each register by 2
CB *R1+,*R2+   -- increment each register by 1

and since it supports full addressing modes, you can increment memory locations as well as registers. The idea is you get two for one of whatever it is.

One I've used a few times is the parity status bit to test a bit - sort of. It's not often thought of, but if you know that the number of set bits will change in a byte, you can use JOP rather than masking and comparing, since MOVB will set the parity bit if there are an odd number of set bits in the moved byte. (Doesn't work with words!)

Storing a commonly used memory address or even a commonly used number in a register can make a huge difference in the size of your program. Smaller programs on the 9900 execute faster. We have 15 general purpose registers - put them to work!

Registers should pretty much always be in scratchpad if you care at all about performance of the code, or you use no registers. Not really a trick, I guess, but pretty important. Likewise, if you can spare the space, copying commonly used code to scratchpad becomes a performance win after about 4 instructions.

This one is common to all architectures, but tail recursion is a common optimization, and it works very well on the 9900 where there's no stack. A common way to deal with a subroutine that calls a subroutine is to move R11 to another temporary register or memory location. Something like this:

SUB1  DO_SOME_WORK
      MOV  R11,R10   * Save return address
      BL   @SUB2
      B    *R10

SUB2  DO_SOME_OTHER_WORK
      B    *R11

If your second subroutine call is the last thing you do, then don't bother saving the return address. Branch to the subroutine instead and it will use YOUR R11 when it returns. Saves memory, and instructions.
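For readers without a TI handy, the JOP trick is easy to sanity-check on the desktop. This is a quick Python model of the odd-parity status bit a byte operation sets (illustrative only, not TMS9900 code):

```python
def odd_parity(byte):
    """Model of the 9900 OP status bit: set when the moved byte
    has an odd number of '1' bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1

# Flipping any single bit flips the parity, which is why JOP can
# stand in for a mask-and-compare when exactly one bit changed.
flag = 0b00010000
assert odd_parity(flag) != odd_parity(flag ^ 0b00000100)
```

The key assumption, as Tursi says, is that you know exactly one bit (or an odd number of bits) changed; otherwise the parity test tells you nothing.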
This does assume that SUB2 doesn't need R11 to point into SUB1 for any reason.

SUB1  DO_SOME_WORK
      B    @SUB2    * SUB2 will return for us

SUB2  DO_SOME_OTHER_WORK
      B    *R11

Another common-to-all-archs one: shifting is much faster than multiplying or dividing if you are working with a power of two. For instance, SLA R1,2 is much faster and saves a register over multiplying R1 by 4. If the multiplier is not a power of two, it is commonly reputed faster to break it into two shifts and an addition. For example, to multiply R1 by 10, use a second register to make a copy, multiply one by 8 and one by 2, then add them, like so:

* 10 can be done in powers of two as 8 + 2
R1X10 MOV  R1,R2
      SLA  R1,3    * multiply by 8
      SLA  R2,1    * multiply by 2
      A    R2,R1   * put result in R1

Of course, the TI has a funny architecture, and all other things being equal, the above doesn't work out for multiplies. The above code takes 14+18+14+14 cycles and 14 memory accesses, plus 8 bytes. In 8-bit RAM with registers in scratchpad, that would total 76 cycles. The MPY version would probably look like this:

R1X10 LI   R0,10
      MPY  R0,R1

This takes 12+52 cycles and 8 memory accesses, plus 6 bytes. In 8-bit RAM with registers in scratchpad, it would total 76 cycles. Note it's the same, and the MPY takes less code space (though remember the 32-bit product lands in the R1:R2 pair). If you don't need to load the 10 (ie: it's already loaded elsewhere), the MPY can actually be faster. Worse, if registers are in 8-bit RAM, the shift/shift/add approach takes 116 cycles, while the MPY approach only goes up to 96 cycles. So the rule there is: use SLA if the multiplier is a power of 2 - much faster, and it saves a register - but if you would need two shifts, the MPY is nearly always the better choice, unintuitively.

DIV, on the other hand, has a best case of 92 cycles (unless it overflows), and I don't think two shifts and an add work out - if you can find a way to avoid DIV, your code will appreciate it. On the other hand, DIV is the slowest instruction on the chip and could be used for delays.
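The shift/shift/add sequence is easy to verify off-hardware. A Python sketch of the same 16-bit arithmetic (a model, not a cycle-accurate simulation):

```python
MASK = 0xFFFF  # 16-bit registers

def mul10_shift_add(r1):
    """The shift/shift/add sequence from the post: R1*8 + R1*2,
    performed as 16-bit operations with wraparound."""
    r2 = r1                        # MOV R1,R2
    r1 = (r1 << 3) & MASK          # SLA R1,3  (multiply by 8)
    r2 = (r2 << 1) & MASK          # SLA R2,1  (multiply by 2)
    return (r1 + r2) & MASK        # A  R2,R1

# Matches an ordinary multiply modulo 2^16 for every input
for x in range(0, 0x10000, 257):
    assert mul10_shift_add(x) == (x * 10) & MASK
```

As Tursi notes, matching results doesn't mean matching cost: on the 9900 the two routes come out at the same cycle count in the best case, and MPY wins once registers live in 8-bit RAM.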
Remember that almost every instruction that touches memory does an implicit compare against zero. If you can make that meaningful, you can skip explicit compares after most operations. For instance, JEQ jumps on exactly zero and JNE on anything else, JLT/JGT jump based on the status of the highest bit (treating it as a sign bit), and JOP jumps based on the number of '1' bits if it was a byte operation. Arranging your data will almost always give you the best savings, if you can.

I've used this VDP trick once or twice - it works on hardware but not so much on some of the emulators. Remember that the only difference between setting a read address and setting a write address is whether the prefetch occurs. So if you are desperate and don't care about compatibility, setting a read address one less than the one you want will still have the correct result, because the prefetch will bump the address up before you write, and it is faster than the operations needed to set (and clear) the write bit. For instance:

DEC R1 / INC R1    - 10 cycles each (set the write address as a read address one less than desired)
ORI R1 / ANDI R1   - 14 cycles each plus an extra program memory read (slowest method)
XOR R1 / XOR R1    - 14 cycles each plus an extra memory read
SOCB / SZCB        - 14 cycles each plus an extra memory read

Unfortunately, to reiterate, this trick is not compatible with some emulation.
+retroclouds (Author) Posted September 5, 2010

Ah yes, those are some nice tricks. Learned some of them the hard way, especially the ones on register and scratchpad usage. I am a bit lost on the VDP trick, though. I suppose you are referring to reading a byte from VDP and in the next step writing a byte to the VDP without setting the write address specifically? Would you mind giving an example?
+InsaneMultitasker Posted September 5, 2010 (edited)

I often use ABS, CLR, INV, SETO for simple on/off flags. Sometimes STWP for a quick 'set' if I don't care about the value in the register. All except STWP can be used on non-registers [edited - originally, incorrectly wrote 'words'] equally well, though a little slower.

SETO R6       * set flag
ABS  R6       * test flag
JEQ  SETFLG   * jump if EQ (flag is clear)
CLR  R6       * clear flag

... or we could use INV R6, assuming we used SETO/CLR:

INV  R6       * flip the flag
ABS  R6       * is it on or off?

* sometimes I'll "set" the flag to nonzero like this - 8 cycles IIRC
STWP R6       * always !=0, but INV can't be used to turn the flag 'off'

Edited September 9, 2010 by InsaneMultitasker
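The nice property of the SETO/CLR/INV idiom is that the two flag states are exact one's complements of each other, so INV toggles cleanly and ABS's implicit zero test reads the flag. A small Python model of those semantics (not TMS9900 code):

```python
MASK = 0xFFFF

def SETO():
    return 0xFFFF            # SETO Rx: all bits set = flag on

def CLR():
    return 0x0000            # CLR Rx: flag off

def INV(r):
    return ~r & MASK         # INV Rx: one's complement = toggle

def eq_after_abs(r):
    return r == 0            # ABS Rx sets the EQ status bit iff the value is zero

flag = SETO()
assert not eq_after_abs(flag)     # flag is on
flag = INV(flag)                  # toggle: SETO -> CLR
assert eq_after_abs(flag)         # flag is off
assert INV(CLR()) == SETO()       # INV round-trips between the two states
```

The STWP variant breaks the INV toggle precisely because the workspace pointer is some arbitrary nonzero value, not >FFFF.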
Tursi Posted September 5, 2010 (edited)

(quoting retroclouds' question above)

No, you set an address without the prefetch-inhibit bit, but you write without reading first. A huge amount of confusion about the way the 9918 address counter works was caused by TI telling you to "set a read address" or "set a write address". When you get down to the lowest levels, there is no such thing; the bit that you set for a "write" address is actually a prefetch inhibit. So the normal approach to set a "write" address already stored in R1 (leaving R1 untouched in the end) is something like this:

VDPWB DATA >4000
      SOC  @VDPWB,R1   * Make it a 'write' address
      SWPB R1          * LSB first
      MOVB R1,@VDPWA   * Write to VDP address register
      SWPB R1          * Get MSB and delay
      MOVB R1,@VDPWA   * Write to VDP address register
      SZC  @VDPWB,R1   * Get rid of the 'write' bit

This works just as well, and is slightly faster:

      DEC  R1          * Make one less to account for VDP prefetch
      SWPB R1          * LSB first
      MOVB R1,@VDPWA   * Write to VDP address register
      SWPB R1          * Get MSB and delay
      MOVB R1,@VDPWA   * Write to VDP address register
      INC  R1          * Get back the original value

In emulation it only works on emulators that get the VDP prefetch correct. Emulators that get it wrong will also fail the Diagnostic cartridge memory "checkerboard test", and the game Popeye will leave graphical glitches when a bottle is thrown.

Knowing the way that VDP address register works can help your code a little, too, since you can freely change between reads and writes without changing the address register, if your data layout happens to work with that. For that matter, if you need to skip one or two bytes of VDP memory, it's generally faster to just read them than to set the address explicitly again.
(Since it takes two VDP writes to set the address again.) You can do this even if you are writing data, you just have to be careful of the address counter, since reads and writes increment it at different times (reads increment before you read the data due to prefetch, writes after you write it). For instance, let's say I want to move two sprites in the sprite table, but not touch color or character data. The sprite table layout is: Y, X, Char, Color. (For simplicity, assume sprite one's X and Y are in R0,R1, sprite two's are in R2,R3, and the SAL is at >0300):

SALTAB EQU  >0043      * another trick - pre-swapping the defined address saves a SWPB in the code.
                       * Note I've set >4000 here for write.

       LI   R5,SALTAB  * Get address of sprite attribute list
       MOVB R5,@VDPWA  * pre-swapped, don't need the first SWPB
       SWPB R5         * get MSB and delay
       MOVB R5,@VDPWA  * write MSB - address is now set. On a stock 99/4A we don't need to delay unless
                       * we use register-only addressing to access the VDP in the very next instruction,
                       * but you can if you are nervous or want to work on accelerated machines.
       MOVB R1,@VDPWD  * write sprite 0 Y. No delay needed between writes on a stock 99/4A
       MOVB R0,@VDPWD  * write sprite 0 X. The address pointer now points to Spr0.Char; we want to skip two.
       MOVB @VDPRD,R5  * read garbage from prefetch and increment the address pointer. Prefetch now has
                       * Spr0.Char and the address is Spr0.Color
       MOVB @VDPRD,R5  * read Spr0.Char from prefetch. Prefetch now has Spr0.Color and the address is
                       * Spr1.Y. No delay needed on a stock 99/4A between reads unless you are using
                       * register-only addressing (even that is on the edge).
       MOVB R3,@VDPWD  * write sprite 1 Y. The address counter increments as you expect.
       MOVB R2,@VDPWD  * write sprite 1 X. We're done.

The trick, really, is that the VDP doesn't have a "read mode" or a "write mode". It has an address register and a prefetch register, and a read port and a write port.
Accessing the read port returns the prefetch register, fetches the data at the address register into the prefetch register, then increments the address register. Accessing the write port writes the data byte to memory at the address register, then increments the address register (it may store the byte temporarily in the prefetch register, I need to test that still).

There is a caveat to all the above, though: later versions of the chip had separate read and write address pointers, meaning that these tricks will NOT work on the 9938 or 9958. If you want to be compatible with those, you do need to think of it in terms of how TI specified a "read address" and a "write address".

Edited September 5, 2010 by Tursi
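Tursi's description of the address counter and prefetch register can be captured in a toy model. This is a sketch of 9918-style behaviour only (by his own caveat it does not apply to the 9938/9958), written in Python for illustration:

```python
class TMS9918Model:
    """Toy model: one address counter, one prefetch register (9918 behaviour only)."""
    def __init__(self):
        self.mem = bytearray(0x4000)
        self.addr = 0
        self.prefetch = 0

    def set_address(self, addr, write=False):
        # The 'write' bit is really a prefetch-inhibit bit.
        self.addr = addr & 0x3FFF
        if not write:                             # read setup: prefetch, counter bumps
            self.prefetch = self.mem[self.addr]
            self.addr = (self.addr + 1) & 0x3FFF

    def read(self):
        v = self.prefetch                         # return the prefetch register...
        self.prefetch = self.mem[self.addr]       # ...refill it from the current address...
        self.addr = (self.addr + 1) & 0x3FFF      # ...then increment
        return v

    def write(self, v):
        self.mem[self.addr] = v & 0xFF            # write at the current address...
        self.addr = (self.addr + 1) & 0x3FFF      # ...then increment

vdp = TMS9918Model()
vdp.mem[0x0100] = 0xAA
# The DEC R1 trick: set a *read* address one less than the target;
# the setup prefetch bumps the counter, so the write lands on the target.
vdp.set_address(0x0100 - 1, write=False)
vdp.write(0x42)
assert vdp.mem[0x0100] == 0x42
```

The same model reproduces the sprite-table trick: interleaved reads and writes share the single counter, with reads running one step ahead because of the prefetch.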
matthew180 Posted September 6, 2010

Also related to what Tursi is talking about is the situation where you set up a write address but perform a read from the VDP. Since setting up the write address inhibits the prefetch of the data, what you actually read will not be the data at the address currently held in the VDP's address register. What you will get is *probably* the last byte that was read or written, but don't count on it.

Matthew
Opry99er Posted September 8, 2010

Tursi gave me a piece of advice today that may not be considered a "trick" per se, but it is quite helpful. It has primarily to do with "spreading out" your workload in your game loop. For instance, if you need to check for 16 collisions, do 8 per loop cycle and alternate between them: check for collisions 1-8, then on the next cycle check for 9-16. This doesn't work very well for collisions in XB, but in assembly it reduces your loop length and still allows for excellent accuracy in the checks. Since the ISR occurs 60(+-) times a second, that gives roughly 30 checks per second for each detection, as long as you tie the routine into the ISR. I hope I did not misunderstand this advice, but I think I have the concept now.
insomnia Posted September 8, 2010

I don't know if this is a trick or not, but I thought this was kind of neat. By using the BLWP instruction, you can have overlapping workspaces between the caller and callee. This allows parameters to be passed to the callee, but preserves some register values across the function call. This would be sort of like a "caller save" calling convention, where the caller saves all important info before calling a function.

Example: save four registers, then call a function which takes three arguments and returns a value. With the callee workspace placed 14 bytes (7 words) below the caller's, the registers line up like this:

Callee R0-R6   = below the caller's workspace: the callee's own context
Caller R0-R2   = callee R7-R9:    free for the callee's use
Caller R3-R5   = callee R10-R12:  the arguments (Arg3, Arg2, Arg1/Return)
Caller R6-R8   = callee R13-R15:  overwritten by BLWP with the old WP, PC and ST
Caller R9-R12  = above the callee workspace: caller saved registers

In this example, R9 to R12 are saved across the call, and the caller places arguments in R3, R4 and R5. A return value will be placed in R5 after the return. The caller registers R0 to R8 are destroyed by the call, but the callee has nearly all registers available for its use.

* Argument setup
      li   r3, 1
      li   r4, 2
      li   r5, 3
* Call setup:
      stwp r6            * 8     * Get current workspace
      ai   r6, -(4+3)*2  * 14+4  * Callee workspace sits 14 bytes below this one
      li   r7, FUNC      * 12+4  * Set jump address
      blwp r6            * 26    * Jump to called function
*                          ----
*                          68
* Return value in R5

Even though BLWP is slower than BL, this is pretty quick. The calling convention I came up with for my GCC port takes about 100 cycles for call setup, return and stack maintenance. Surprisingly, this is faster. I don't know of any other architecture where you can do something like this, so no compiler would support this kind of call. Assembly only for this guy. There are some other obvious drawbacks: the author would be restricted to using certain registers for certain purposes.
The register restrictions could change from call to call, making tricky assembly code REALLY confusing. The small amount of scratchpad memory restricts the "stack" usage and call tree depth, and wrapping around the top of scratchpad memory would result in hard-to-find memory errors. I'm not sure if "blwp r6" is valid or not, to be honest. If not, that's OK, it just means the call setup needs a few additional instructions. Honestly, I don't think this is a reasonable call method for the general case, but it is awfully cool.
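The overlap arithmetic in insomnia's scheme is mechanical and easy to check: with the callee workspace 14 bytes (7 words) below the caller's, callee register i occupies the same memory word as caller register i-7. A Python sanity check of that mapping (just index arithmetic, nothing TI-specific):

```python
WORD = 2
CALLEE_WP_OFFSET = -(4 + 3) * WORD   # the "ai r6, -(4+3)*2" from the post

def caller_reg_for(callee_reg):
    """Which caller register shares memory with a given callee register?
    A negative result means the word lies below the caller's workspace."""
    byte_offset = CALLEE_WP_OFFSET + callee_reg * WORD
    return byte_offset // WORD

# BLWP saves the old WP/PC/ST into callee R13-R15 -> caller R6-R8 are destroyed
assert [caller_reg_for(r) for r in (13, 14, 15)] == [6, 7, 8]
# Arguments in caller R3-R5 appear to the callee as R10-R12
assert [caller_reg_for(r) for r in (10, 11, 12)] == [3, 4, 5]
# The callee workspace tops out at caller R8, so caller R9+ survive the call
assert caller_reg_for(15) < 9
```

This also makes the drawback concrete: change the workspace offset and every register role shifts with it, which is exactly why the restrictions "could change from call to call".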
Willsy Posted September 8, 2010

(quoting InsaneMultitasker's flag tricks above)

He he, I never thought about using ABS. I would have used MOV R6,R6 in your example above. Sets the EQ bit if 0 :-)
Willsy Posted September 8, 2010

Very quick one to round a value in a register (in this case R0) up to an even (word) boundary:

INC  R0         ; add 1 to r0
ANDI R0,>FFFE   ; round down to an even (word) address

That's all I can think of right now! Brain fried!
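The two instructions compose into "round up to the next even value" because the INC overshoots only when the value is odd, and the ANDI then drops the stray low bit. A quick Python check of the idiom:

```python
def round_up_even(r0):
    """Willsy's two-instruction round-up: INC R0 then ANDI R0,>FFFE,
    modeled with 16-bit wraparound."""
    r0 = (r0 + 1) & 0xFFFF   # INC R0
    return r0 & 0xFFFE       # ANDI R0,>FFFE

assert round_up_even(5) == 6    # odd value rounds up
assert round_up_even(6) == 6    # even value is left alone
```

Note that >FFFF wraps to 0, as it would in a 16-bit register.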
insomnia Posted September 9, 2010

(quoting InsaneMultitasker's flag tricks and Willsy's MOV R6,R6 reply above)

I totally agree. In fact I'm shamelessly stealing the ABS test for a GCC optimization step. Every cycle counts, right?
Opry99er Posted September 9, 2010

This is cool stuff. I'll be on your level soon, boys... Then I'll be posting some cool tricks as well.
insomnia Posted September 10, 2010

How about computed jumps? I was thinking of ways to take advantage of the X instruction, and this is what I came up with. In BASIC:

ON INDEX GOTO 100, 110, 120, 130

In assembly:

* Assume "index" is stored in R0 (0-based)
* Validate input value
      ci   r0, 3
      jh   badval     * Index is negative or greater than 3, bad value
* Jump to the correct line
      ai   r0, >1000  * >1000 is the opcode for "JMP 0"
      x    r0         * Jump into the table below
      jmp  line100
      jmp  line110
      jmp  line120
      jmp  line130

Yeah, this is pretty much what X is for, but it's an easy instruction to overlook.
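The arithmetic behind the X trick: >1000 is a JMP with displacement 0, and each increment of the index adds one word to the displacement, so the executed JMP lands on the matching table entry. A Python sketch of the dispatch logic (the handler lambdas are made up for illustration):

```python
JMP_BASE = 0x1000  # opcode for "JMP 0" (continue at the next instruction)

def dispatch(index, table):
    """Model of: CI/JH bounds check, AI R0,>1000, X R0 into a table of JMPs."""
    if index > len(table) - 1:         # CI R0,n / JH BADVAL (unsigned compare,
        raise ValueError("bad value")  # so a "negative" index also fails)
    opcode = JMP_BASE + index          # AI R0,>1000
    displacement = opcode & 0xFF       # word displacement in the opcode's low byte
    return table[displacement]()       # the executed JMP lands on that entry

table = [lambda: 100, lambda: 110, lambda: 120, lambda: 130]
assert dispatch(2, table) == 120
```

The unsigned JH comparison is doing double duty here: any index with the sign bit set looks like a huge unsigned value, so "negative" inputs are rejected by the same check.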
+retroclouds (Author) Posted May 7, 2011

It has been a while since I've seen some abracadabra. Today I went through TI Intern (page 41, address >0EDC) and found this little snippet:

CLR  R0
MPY  R0,R1

Looks kinda weird? It's code optimized for space: it clears registers R0, R1 and R2. The cool thing is, the machine code for these 2 instructions is only 4 bytes. Makes me wonder how many programmers worked on the TI-99/4A operating system. Some things look real cool and others... well. Guess they did a good job on code size: there are still 18 bytes of free space at >1FEA.
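Why this clears three registers: MPY places its 32-bit product in the destination register pair (here R1:R2), and with R0 cleared first the product is always zero. A Python model of the sequence:

```python
def clr_mpy(r1_in):
    """TI Intern's 4-byte clear of R0, R1 and R2: CLR R0 then MPY R0,R1."""
    r0 = 0                         # CLR R0
    product = r0 * r1_in           # MPY R0,R1: 32-bit product of R0 and R1...
    r1 = (product >> 16) & 0xFFFF  # ...high word lands in R1
    r2 = product & 0xFFFF          # ...low word lands in R2
    return r0, r1, r2

assert clr_mpy(0xDEAD) == (0, 0, 0)   # all three registers end up zero
```

The space saving comes from CLR and MPY (register form) each being one-word instructions, versus three separate CLRs at six bytes.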
RXB Posted May 7, 2011 (edited)

In GPL similar shortcuts exist.

LN1  DEC @VARIABLE
LN2  BR  LN1

The DEC does a -1 and checks for zero at the same time, so the BR will work without a CZ @VARIABLE.

LN1  INC @VARIABLE
LN2  BR  LN1

This works the same way: when VARIABLE hits >FF it then goes to >00, so you get the same result without CZ @VARIABLE.

LN1  DECT @VARIABLE
LN2  BR   LN1

This one presents a problem: if the number is odd it never gets out of the loop, only if it is even, as >01 - >02 = >FF and it starts over. (Not that anyone besides me does any GPL.)

Edited May 7, 2011 by RXB
+retroclouds (Author) Posted May 17, 2011

OK, here's a little trick I use quite often now. I'm using shift instructions for clearing the X leftmost bits in a register. In the example below I clear the 3 leftmost bits in register R1:

SLA R1,3
SRL R1,3

Normally one would use a single SZC or ANDI instruction, but I find the above easier, and the machine code is also shorter: 4 bytes instead of at least 6 bytes. I haven't done the math on this yet, but in general shorter machine code takes less time to process, so the two instructions combined should perform as well as or better than an SZC or ANDI instruction. Note: by swapping the two instructions it should also work for the X rightmost bits in a register.
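The equivalence with ANDI is easy to confirm: shifting left by n discards the top n bits, and the logical right shift brings the rest back with zeros above. A Python check of the idiom:

```python
MASK = 0xFFFF

def clear_top_bits(r1, n):
    """The SLA/SRL pair from the post: shift the n leftmost bits out,
    then shift back with zero fill."""
    r1 = (r1 << n) & MASK   # SLA R1,n (bits fall off the top)
    return r1 >> n          # SRL R1,n (logical shift, zeros come in from the top)

# Same result as ANDI R1,>1FFF for clearing the top 3 bits
assert clear_top_bits(0xF96A, 3) == 0xF96A & 0x1FFF
```

As matthew180 points out in the next post, "shorter" does not mean "faster" here, since 9900 shifts cost 2 extra clocks per bit shifted.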
matthew180 Posted May 18, 2011

Actually, on the 9900 the shift instructions can be slow, since their execution time depends on the amount of shifting. Most CPUs have dedicated barrel shift units that make shifting really, really fast. But alas, just like the 9900's external registers, it does not have any dedicated internal shift units (actually it does, but it is used exclusively for the CRU I/O.) I think the 9900 was a test bed for all of TI's *ideas* about different CPU design...

Anyway, the shift instructions have a base execution time of 12 clocks, with an additional 2 clocks for every bit shifted. Both ANDI and SZC have an execution time of 14 clocks, and only SZC has a possible time modification based on the addressing mode of the two operands (which is zero if you go with registers for both.) So:

SLA  R1,3       18 clocks
SRL  R1,3       18 clocks
ANDI R1,>1FFF   14 clocks

I'm not sure how SZC comes into play for just clearing some bits? Am I missing something?

1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0   >F96A  some value, clear TOP 3 bits
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1   >1FFF  mask
---------------------------------------- AND
0 0 0 1 1 0 0 1 0 1 1 0 1 0 1 0   >196A

1 1 1 1 1 0 0 1 0 1 1 0 1 0 1 0   >F96A  some value, clear BOTTOM 3 bits
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0   >FFF8  mask
---------------------------------------- AND
1 1 1 1 1 0 0 1 0 1 1 0 1 0 0 0   >F968
+retroclouds (Author) Posted May 18, 2011

(quoting matthew180's timing breakdown above)

As you, I'm also a speed freak, so thanks for giving the facts. This is really an interesting topic. OK, even though it's slower than the ANDI instruction - which I really didn't expect - I'm still fine with it. For most purposes I'm clearing at most 3 bits and I'm not using it in loops, so in my case the bytes saved on instruction size are more important than the actual raw performance. Good to know that you have to consider the clock cycles when using it in loops, though.
+TheBF Posted July 11, 2022

Seems like it's time to bring this old topic into the current decade. I am wondering if there is a better way to duplicate a byte in a register to the other side of the same register. My current method is using three instructions. Yuk!

MOV  R4,R3
SWPB R3
A    R3,R4
Asmusr Posted July 11, 2022 (edited)

(replying to TheBF's question above)

MOVB R4,@R4LB

where R4LB is the address of the low byte of R4.

Edited July 11, 2022 by Asmusr
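What the one-instruction version does, modeled in Python: MOVB takes the high byte of the source, and with the destination pointed at the register's own low-byte address (R4LB = the workspace address of R4 plus one), the high byte lands on top of the old low byte:

```python
def dup_high_byte(r4):
    """Model of MOVB R4,@R4LB: copy the register's high byte
    into its own low byte."""
    high = r4 >> 8
    return (high << 8) | high

assert dup_high_byte(0x4200) == 0x4242
assert dup_high_byte(0x42FF) == 0x4242   # the old low byte is simply overwritten
```

Unlike the three-instruction MOV/SWPB/A version, this works no matter what the low byte held beforehand, which becomes relevant a few posts below.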
Willsy Posted July 11, 2022

^^^^ what he said
+TheBF Posted July 11, 2022

(quoting Asmusr's MOVB R4,@R4LB reply above)

Thank you guys. I see now that I have been here before with the VDP routines. If you have a single-threaded system where there is only one workspace, this method works great. That is not my case: the code I am making could be called by a task using a workspace anywhere in RAM, potentially. So if I want to do this I will have to use STWP Rx and indexed addressing to get to the other side of the register. It might still be better than three instructions, but probably just by a bit.

Edit: Looks like it will be about half the cycles using STWP and indexed addressing versus the three-instruction method. I'll take it.
+Lee Stewart Posted July 12, 2022

(quoting TheBF's three-instruction sequence:)

MOV  R4,R3
SWPB R3
A    R3,R4

I presume R4 (first instruction) is assured to always have an empty destination byte, to preclude trashing it with the last instruction.

...lee
+TheBF Posted July 12, 2022

(quoting Lee Stewart above)

Hmm. I thought I tested this and saw the correct result. Yes, I just redid the test. I was looking at how slow it was to fill memory byte by byte with FILL, so I made FILLW (fill word), but I needed code to duplicate the character argument in both bytes. Since my Forth keeps the TOS cached in R4, this was my first cut at it. Here is where I landed after the discussions here:

CODE FILLW ( adr len c --)
     *SP+ R0 MOV,
     *SP+ W MOV,
     R1 STWP,
     9 R1 () TOS MOVB,  \ dup c in both bytes of TOS
     BEGIN,
       TOS *W+ MOV,     \ 2 chars are in TOS register
       R0 DECT,         \ decr. count by two
     NC UNTIL,
     NEXT,
ENDCODE

It is much faster than FILL: I took 3 seconds off of 10 iterations of the Byte magazine sieve benchmark. Of course it will always fill an even number of bytes; the programmer is in charge of the outcome in Forth.

Note: my assembler has TI ASM addressing notation only for the SP, RP, TOS and W registers. The rest have to use the Forth-style addressing.
apersson850 Posted July 12, 2022 (edited)

(quoting TheBF: "Hmm. I thought I tested this and saw the correct result.")

I think Lee was referring to the fact that the value in R4 has to be 00XX in hex. Or XX00. If you had used >0199 in your example, it would have added >0199 to >9901, with the result >9A9A instead.

Edited July 12, 2022 by apersson850
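The failure mode apersson850 describes is easy to reproduce with a quick Python model of the three-instruction sequence:

```python
def three_instruction_dup(r4):
    """TheBF's original sequence: MOV R4,R3 / SWPB R3 / A R3,R4,
    with 16-bit wraparound on the add."""
    r3 = r4                                # MOV  R4,R3
    r3 = ((r3 << 8) | (r3 >> 8)) & 0xFFFF  # SWPB R3
    return (r4 + r3) & 0xFFFF              # A    R3,R4

assert three_instruction_dup(0x9900) == 0x9999   # fine when one byte is empty
assert three_instruction_dup(0x0199) == 0x9A9A   # apersson850's case: the add
                                                 # carries across the byte boundary
```

So the sequence only duplicates the byte correctly when the other byte of R4 is already zero, which is exactly Lee's point, and exactly why the MOVB R4,@R4LB approach (which overwrites rather than adds) is the safer idiom.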