djmips Posted November 15, 2005 Author Share Posted November 15, 2005 I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that. Yes, that is a nice one. At a quick glance, I thought it trashed A (but it doesn't as bit seven is safely tucked away in the carry and is restored in the ror) rol ; Cb into b0 eor #$01 ; toggle bit ror ; b0 into Cb Quote Link to comment Share on other sites More sharing options...
TROGDOR Posted December 21, 2005 Share Posted December 21, 2005 (edited) Here's my all-time favorite hack for the 6502. The "skip 2 bytes" pseudo-instruction, which piggybacks off the BIT instruction. It saves a byte when performing "either/or" branching. Consider the following example. You want to store a value in either Loc1 or Loc2, depending on whether it's negative. Normally, your code would look like: LDA MyNumber;2 BPL B1;2 STA Loc1;2 BMI Done;2 (This is effectively a BRA to Done.) B1 STA Loc2;2 Done It uses 10 bytes. Now consider this alternative: LDA MyNumber;2 BPL B1;2 STA Loc1;2 .BYTE $2C;1 (Skips the next 2 bytes, effectively doing a BRA to Done.) B1 STA Loc2;2 Done It only uses 9 bytes, and does the same thing. When the processor hits the ".BYTE $2C", it reads this as an Absolute BIT command, which is a 3 byte command. So, it uses the two bytes of the STA Loc2 command as the argument to the BIT command, performs a meaningless BIT test, and then continues to run, without ever executing the STA Loc2 command. The only thing to be careful with this is that the N, V, and Z flags will be corrupted. But the accumulator is not affected. So you've just saved a byte. Also be wary that one extra cycle is used. BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.) Edited December 21, 2005 by TROGDOR Quote Link to comment Share on other sites More sharing options...
supercat Posted December 21, 2005 Share Posted December 21, 2005 The only thing to be careful with this is that the N, V, and Z flags will be corrupted. But the accumulator is not affected. So you've just saved a byte. Also be wary that one extra cycle is used. BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.) 986490[/snapback] Another thing to beware of is that certain memory locations can be affected by reads. In a 4K cart, this isn't apt to be a problem (the only thing affected by a read is the timer-interrupt flag on the RIOT which is cleared by reading INTIM). In a 16K or 32K cart, skipping "INC nn,X" or "ISB nn,x" will trigger a bank switch if 'nn' is $1F, $3F, $5F, $7F, $9F, $BF, $DF, or $FF, but such an instruction isn't too likely to occur. In a 32K cart, skipping "SBC nn,X" will also trigger a bankswitch. None of those bankswitch hotspots is in practice likely to be a problem. On the other hand, RAM based carts open up a much bigger danger zone. On a Superchip cart, skipping any instruction whose opcode is $00-$7F and whose operand is $10, $30, $50, $70, $90, $B0, $D0, or $F0 will trash a byte of Superchip RAM. On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting". Quote Link to comment Share on other sites More sharing options...
supercat Posted December 21, 2005 Share Posted December 21, 2005 Yes, that is a nice one. At a quick glance, I thought it trashed A (but it doesn't as bit seven is safely tucked away in the carry and is restored in the ror) 965779[/snapback] BTW, if it's desired to copy bit 7 of an address into all bits of the accumulator (so it's 00 or FF): lda #$7F cmp Address ; Carry clear if >= $80 adc #$80 ; $00 if carry set; $FF if carry clear Quote Link to comment Share on other sites More sharing options...
TROGDOR Posted December 21, 2005 Share Posted December 21, 2005 The only thing to be careful with this is that the N, V, and Z flags will be corrupted. But the accumulator is not affected. So you've just saved a byte. Also be wary that one extra cycle is used. BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.) 986490[/snapback] Another thing to beware of is that certain memory locations can be affected by reads. In a 4K cart, this isn't apt to be a problem (the only thing affected by a read is the timer-interrupt flag on the RIOT which is cleared by reading INTIM). In a 16K or 32K cart, skipping "INC nn,X" or "ISB nn,x" will trigger a bank switch if 'nn' is $1F, $3F, $5F, $7F, $9F, $BF, $DF, or $FF, but such an instruction isn't too likely to occur. In a 32K cart, skipping "SBC nn,X" will also trigger a bankswitch. None of those bankswitch hotspots is in practice likely to be a problem. On the other hand, RAM based carts open up a much bigger danger zone. On a Superchip cart, skipping any instruction whose opcode is $00-$7F and whose operand is $10, $30, $50, $70, $90, $B0, $D0, or $F0 will trash a byte of Superchip RAM. On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting". 986508[/snapback] Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack. Quote Link to comment Share on other sites More sharing options...
+batari Posted December 21, 2005 Share Posted December 21, 2005 (edited) On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting". 986508[/snapback] Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack. No need to abandon the hack unless you are writing actual Supercharger games. If you create a regular 4k binary and run it on a Supercharger, you will have writes disabled anyway so this doesn't apply. Edited December 21, 2005 by batari Quote Link to comment Share on other sites More sharing options...
cd-w Posted December 21, 2005 Share Posted December 21, 2005 On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting". 986508[/snapback] Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack. No need to abandon the hack unless you are writing actual Supercharger games. If you create a regular 4k binary and run it on a Supercharger, you will have writes disabled anyway so this doesn't apply. 986549[/snapback] Ah, I have been wondering for some time why this trick was unreliable when I used it within my PoP code Thanks, Chris Quote Link to comment Share on other sites More sharing options...
cd-w Posted April 19, 2006 Share Posted April 19, 2006 Here's the software collision routine I worked up for Go Fish!A little different - I use it to check if two boxes overlap. Call it once for X values, then call it again with Y values. CheckBoundaries lda rect1.leftortop cmp rect2.leftortop [*] bmi Check2 cmp rect2.rightorbottom bmi InsideBoundingBox Check2 lda rect2.leftortop cmp rect1.leftortop [*] bmi NotInsideBoundingBox cmp rect1.rightorbottom bpl NotInsideBoundingBox InsideBoundingBox sec rts NotInsideBoundingBox clc rts I was looking back through this thread for a collision detection routine for my Juno First game, and the above routine looks like it will be ideal. However, if I am not mistaken, the comparisons marked [*] are the inverse of each other, so I think the code can be simplified as follows: CheckBoundaries lda rect1.leftortop cmp rect2.leftortop bmi Check2 cmp rect2.rightorbottom bmi InsideBoundingBox NotInsideBoundingBox clc rts Check2 lda rect2.leftortop cmp rect1.rightorbottom bpl NotInsideBoundingBox InsideBoundingBox sec rts Chris Quote Link to comment Share on other sites More sharing options...
vdub_bobby Posted April 19, 2006 Share Posted April 19, 2006 Don't know how I missed that; that looks nice. Quote Link to comment Share on other sites More sharing options...
cd-w Posted April 19, 2006 Share Posted April 19, 2006 Don't know how I missed that; that looks nice. Thanks - I haven't actually tried it yet, but it should work unless I am missing something Chris Quote Link to comment Share on other sites More sharing options...
supercat Posted April 19, 2006 Share Posted April 19, 2006 I was looking back through this thread for a collision detection routine for my Juno First game, and the above routine looks like it will be ideal. However, if I am not mistaken, the comparisons marked [*] are the inverse of each other, so I think the code can be simplified as follows: If one is trying to determine whether two objects of fixed size are overlapping, why not just use: ; Start with carry clear lda xpos1 sbc xpos2 ; Note will subtract n-1 sbc #SIZE2-1 adc #SIZE1+SIZE2-1 ; Carry set if overlap If SIZE1=4 and SIZE2=8, xpos1=20... - If xpos2=24, the first SBC yields $FB (no carry) and the second, $F3 (carry set). The ADC generates no carry. - If xpos2=12, the first SBC yields $07 (carry set) and the second, $00 (carry set). Again, the ADC generates no carry. - If xpos2=23, the first SBC yields $FC (no carry) and the second, $F4 (carry set). The ADC generates carry. - If xpos2=13, the first SBC yields $06 (carry set) and the second, $FF (carry clear). Again, the ADC generates carry. Quote Link to comment Share on other sites More sharing options...
cd-w Posted April 20, 2006 Share Posted April 20, 2006 (edited) If one is trying to determine whether two objects of fixed size are overlapping, why not just use: ; Start with carry clear lda xpos1 sbc xpos2; Note will subtract n-1 sbc #SIZE2-1 adc #SIZE1+SIZE2-1; Carry set if overlap If SIZE1=4 and SIZE2=8, xpos1=20... - If xpos2=24, the first SBC yields $FB (no carry) and the second, $F3 (carry set). The ADC generates no carry. - If xpos2=12, the first SBC yields $07 (carry set) and the second, $00 (carry set). Again, the ADC generates no carry. - If xpos2=23, the first SBC yields $FC (no carry) and the second, $F4 (carry set). The ADC generates carry. - If xpos2=13, the first SBC yields $06 (carry set) and the second, $FF (carry clear). Again, the ADC generates carry. This seems to be an even better approach, and is particularly useful for me as it has a constant cycle count. I wish I could write code like this, but I always seem to have trouble figuring the arithmetic Incidentally, I implemented the previous version which I posted above and it works fine. Chris Edited April 20, 2006 by cd-w Quote Link to comment Share on other sites More sharing options...
+batari Posted May 21, 2006 Share Posted May 21, 2006 Here's another one for the code wizards out there. I'm trying to write some good code for 7-bit binary->8-bit BCD. Here's two examples I've come up with: Obvious method (23 bytes): ; input: binary value in A ldx #$FF sec count sbc #10 inx bcs count adc #10 ora table,x ; output: BCD value in A table .byte $0,$10,$20,$30,$40,$50,$60,$70,$80,$90 Obfuscated method: 19 bytes ; input: binary value in A ldx #8 ldy #4 repeat dey bmi noadd cmp #$50 bcc noadd adc #$2f noadd cmp #$80 rol dex bne repeat ; output: BCD value in A Can anyone guess how #2 works? I'm not satisfied with either though. It seems that there should be a better way, in terms of speed and cycles... Maybe something using the "D" flag? Quote Link to comment Share on other sites More sharing options...
Rybags Posted May 21, 2006 Share Posted May 21, 2006 Quickly written... probably more cycles - table values adjusted since carry will be set for add. tay and #3; starting figure (0-7) pha tya lsr a lsr a lsr a sta ztemp pla ldx #4 loop: lsr ztemp bcc noadd adc dectab-1,x noadd: dex bne loop rts dectab .byte $63,$31,$15,$07 Another tip: Instead of using branch, load/or, etc in certain conditions, you can just use the Processor Status, e.g. preserve bit 7 of a location. Lda (screen),y php <other stuff> plp and #$80 ora newchar sta (screen),y I used a similar technique to preserve bit 7 of screen data in a softsprite routine - so that multicolour bitpair 11 still uses the right playfield. Handily, the N and V flags and bit 7 and 6, so something like a BIT instruction followed by PHP will preserve the bit values for later processing. Quote Link to comment Share on other sites More sharing options...
+batari Posted May 21, 2006 Share Posted May 21, 2006 (edited) Quickly written... probably more cycles - table values adjusted since carry will be set for add. Actually it might be quicker in terms of cycles, but it is quite a bit longer. After research, I came up with something really short (17 bytes) ; Binary in A sed sta temp1 lda #0 ldx #8 loop asl temp1 sta temp2 adc temp2 dex bne loop cld ; BCD in A What's cool about this one is that it actually will do 8-bit binary -> 9-bit BCD, with the 9th bit contained in the carry! Can this be improved any more, though? Edited May 21, 2006 by batari Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted May 22, 2006 Share Posted May 22, 2006 ehm...call me stupid but for what do i need 9th bit bcd? positive/negative range? Quote Link to comment Share on other sites More sharing options...
Rybags Posted May 22, 2006 Share Posted May 22, 2006 Theoretically, allowing full representation of 0-127, I suppose. Probably would want some modification to allow for calling and using the carry bit in each initial calculation. Anyway, stuff like that you only need once in a program and memory usage is mainly a 2600 consideration for stuff like BCD score / timers etc. Quote Link to comment Share on other sites More sharing options...
djmips Posted March 7, 2007 Author Share Posted March 7, 2007 This isn't exactly a 6502 Killer hack but I wanted to share this link. http://graphics.stanford.edu/~seander/bithacks.html There you will find some useful methods that apply to your 6502 coding. Quote Link to comment Share on other sites More sharing options...
vdub_bobby Posted March 9, 2007 Share Posted March 9, 2007 This isn't exactly a 6502 Killer hack but I wanted to share this link. http://graphics.stanford.edu/~seander/bithacks.html There you will find some useful methods that apply to your 6502 coding. Some of those are pretty cool. I thought I'd translate some of them to 6502 assembly: Counting bits set, Brian Kernighan's way unsigned int v; // count the number of bits set in v unsigned int c; // c accumulates the total bits set in v for (c = 0; v; c++) { v &= v - 1; // clear the least significant bit set } Assembly: ;--count set bits in v ; trashes v, accumulator, result is in c lda #0 sta c Loop lda v beq Done inc c dec v and v sta v bne Loop Done Seems like that should be improveable. Quote Link to comment Share on other sites More sharing options...
vdub_bobby Posted March 9, 2007 Share Posted March 9, 2007 (edited) Here's another: Swapping values with XOR #define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b))) This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable. ;--swap a and b lda a eor b sta a eor b sta b eor a sta a That right? Edited March 9, 2007 by vdub_bobby Quote Link to comment Share on other sites More sharing options...
+batari Posted March 9, 2007 Share Posted March 9, 2007 Here's another: Swapping values with XOR #define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b))) This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable. ;--swap a and b lda a eor b sta a eor b sta b eor a sta a That right? That does have some academic appeal, but this is how I usually do it: lda a ldy b sta b sty a I suppose if you absolutely can't use RAM or another register, the EOR trick would work. Quote Link to comment Share on other sites More sharing options...
vdub_bobby Posted March 9, 2007 Share Posted March 9, 2007 (edited) That does have some academic appeal That's what I'm going for here! Edited March 9, 2007 by vdub_bobby Quote Link to comment Share on other sites More sharing options...
Urchlay Posted March 9, 2007 Share Posted March 9, 2007 Assembly: ;--count set bits in v ; trashes v, accumulator, result is in c lda #0 sta c Loop lda v beq Done inc c dec v and v sta v bne Loop Done Seems like that should be improveable. I'll take a stab at it... Execution time for your version is 11 cycles if v is 0, or 5+n*23-1 cycles if v is not zero (where n is the number of bits set in v). Average time (4 bits set in input) is 96 cycles. One obvious improvement would be to use the X register instead of memory location c. Use "ldx #0" to initialize (saves 3 cycles, assuming "c" was a zero page address), and "inx" instead of "inc c" (saves 3 cycles per set bit in v). Also the code gets 2 bytes smaller. Of course, if you already had written a lot of code that used this as a subroutine, all that code would have to change (to look for the result in X instead of c). Depending on how the calling code used the result, you might end up losing a lot more than the 2 bytes you've saved. Another improvement would be to move the Loop label down two lines (below the "beq Done")... You only need to do "lda v : beq Done" once, before the first loop iteration (to catch the special case where the initial value of v is zero). At the end of the loop, you're making this check again (the "bne Loop")... if this branch is taken (back to Loop), it's redundant to check v for zero again (you're guaranteed it's not, since you just took the "bne Loop" branch). This will save you 4 cycles per set bit in v, and it won't break compatibility with (hypothetical) existing code that calls the routine, either. (It will add 4 constant cycles before the loop, though) Applying both improvements (X register and moving the label) would save 7 cycles per loop iteration, which is a 30% speedup (of just the loop)... not bad. Code looks like: ;--count set bits in v ; trashes v, accumulator, result is in X register ldx #0 lda v beq Done Loop inx dec v and v sta v bne Loop Done Timing is 8 cycles if v is 0, or 7+16*n-1 if v is not 0... Average case (4 bits set in input) is 70 cycles, 28% faster than the original, assuming I've gotten all the numbers right (if not, I apologize). Total code size is 15 bytes (3 bytes smaller than the original). However, there are a couple of other ways to count set bits that are faster at the expense of more code. I was messing around with this a while back, and came up with 5 other solutions (I never came up with the Kernighan method on my own, though). Here are a couple of them: ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; Count the number of set bits in A ; Unrolled version ; Result in A ; Uses one byte of RAM at tmp (preferably ZP) ; On exit: X and Y untouched, C/Z/N undefined ; Code size: 36 bytes (+1 for RTS = 37) ; Execution time: Constant, 5+(7*8) = 61 cycles (+6 for RTS = 67) countbits_3: sta tmp; 3 lda #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rol tmp; 5 adc #0 ; 2 rts ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; Count the number of set bits in A ; Nybble table version ; Result in A ; Uses one byte of RAM at tmp (preferably ZP) ; Execution time: Constant, 35 cycles (+6 for RTS) ; Code size: 17 bytes table + 19 bytes code = 36 bytes (+1 for RTS = 37) ; This is faster than the fully-unrolled version, ; but it trashes all the registers and flags. Could use PHA/PLA instead of ; TAY/TYA, to preserve the Y register, at the expense of 2 cycles. nyb_bits: byte 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 countbits_4: tay ; 2 and #$0F ; 3 tax ; 2 lda nyb_bits,x; 4 sta tmp ; 3 tya ; 2 lsr ; 2 lsr ; 2 lsr ; 2 lsr ; 2 tax ; 2 lda nyb_bits,x; 4 clc ; 2 adc tmp ; 3 rts ; 6 Neither of these is all that original. The unrolled version is obvious to any 6502'er, and the nybble-table idea came from some code I saw either here or on the [stella] list (I couldn't find the original code, so I rewrote it from scratch). Here's one I came up with on my own. It's not all that fast, but I like it because it fits in 11 bytes of code and doesn't use a temp location in RAM. Also, if I added a ROL before the RTS, the accumulator and carry flag would keep their original values, which might simplify things for the calling code. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; Count the number of set bits in A ; Result in Y ; Uses no RAM ; On exit: ; Y = result, X = 0, A = undefined, C = undefined, Z = 1, N = 0 ; Min execution time (input $00, 0 bits set): ; 4+(2+3+2+3)*8-1 = 83 cycles ; Avg execution time (input $AA, 4 bits set): ; 4+(2+3+2+3)*4+(2+2+2+2+3)*4-1 = 87 cycles ; Max execution time (input $FF, 8 bits set): ; 4+(2+2+2+2+3)*8-1 = 91 cycles ; (add 6 cycles for RTS to all 3 counts) ; Code size: 11 bytes (+1 for RTS = 12) countbits_2: ldx #8; 2 ldy #0; 2 cb2loop rol ; 2 bcc cb2_noiny; 2/3 iny ; 2 cb2_noiny dex ; 2 bne cb2loop ; 3 ; rol; adding this ROL preserves the original values of A and C (and sets Z and N according to the A value) rts; 6 Quote Link to comment Share on other sites More sharing options...
Urchlay Posted March 9, 2007 Share Posted March 9, 2007 I suppose if you absolutely can't use RAM or another register, the EOR trick would work. The EOR trick still uses RAM (locations a and b). Quote Link to comment Share on other sites More sharing options...
Urchlay Posted March 9, 2007 Share Posted March 9, 2007 A while ago, I wrote this bit of code to calculate even/odd parity. It could probably stand some improvement... I definitely don't consider it a "killer hack": it's coded for clarity more than speed or size, but it does implement something from the bithack page, so I thought someone might be interested. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; calc_parity will calculate even or odd parity for the low 7 bits of A, ; sets the high bit and N flag to the result. ; Input: ; A = data to calculate parity for (bit 7 ignored) ; C = carry set for odd parity, clear for even ; X, Y, other flags: ignored ; Output: ; A = data with correct parity bit in bit 7 (bits 0-6 unchanged) ; N = parity bit (same as A bit 7) ; C = parity bit (same as A bit 7 and N) ; Z = 1 if A is 0, or 0 otherwise ; X = 0 ; Y = bits 0-6 of input ; tmp = same as Y ; Memory usage: One byte at tmp. No stack used. ; If desired, the routine could easily be modified to preserve the ; Y register (and use another RAM temp instead). ; Execution time: Constant, 108 cycles, +6 for RTS. Add 10 cycles if tmp is ; not located in zero page. ; Code size: 25 bytes, +1 for RTS (+4 if tmp not zero page) ; Code: calc_parity: subroutine;{ and #$7F ; 2; throw away top bit (it will get replaced anyway) sta tmp ; 3; copy in tmp will be repeatedly shifted (destroyed), tay ; 2; so stash another copy in Y lda #0 ; 2; init accumulator: rol ; 2; A=1 if C was set, otherwise A=0 ldx #7 ; 2 ; =13 (+1 if tmp not ZP) ; { .loop ; loop executes 7 times... lsr tmp ; 5 adc #0 ; 2 dex ; 2 bne .loop; 3 ; =12*7-1 = 83, +7 if tmp not ZP ; } ; A now has count of set bits from bits 0-6 of the original argument, ; plus 1 if C flag was initially set. ; Bit 0 of A will be the new parity bit. ror ; 2; C = bit 0 of A lda #0 ; 2 ror ; 2; bit 7 of A = C, bits 0-6 are 0 sty tmp ; 3 ora tmp ; 3 ; =12 (+2 if tmp not ZP) rts ; 6 ;} BTW, the comments with the curly braces (lines like "; }") are a Stupid Editor Trick: My editor lets me jump from an opening { to a closing } quickly (very handy for C or Java code), so I've gotten in the habit of using them in 6502 asm code for easy navigation. I use Vim, and I know Emacs has support for this, too. I'd bet that any halfway decent code editor or IDE supports it, too. I've also got a vim macro that selects the current block of code (from the { to the }), for cut/copy or applying other commands. I just press F2 to highlight... and I've got F1 bound to a macro that comments out the block (or uncomments it, if it's already commented). Who says vi is hard to use? Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.