TROGDOR Posted July 19, 2006 Share Posted July 19, 2006 Greetings Atari Junkies, I'm continuing work on Stellar Fortress, and have encountered an interesting programming challenge. I need to count bits. Lots of bits. I need to count all the individual 1's across 28 bytes. I've been trying to write an optimally efficient algorithm to do this. My target is to get a good mix of minimal ROM usage and minimal cycles. Here's what I've come up with so far. The first method is brute force: LDY #0 ;2,2 AddBit0 LSR ;2,1 BCC AddBit1 ;2,2 INY ;2,1 AddBit1 LSR ;2,1 BCC AddBit2 ;2,2 INY ;2,1 AddBit2 LSR ;2,1 BCC AddBit3 ;2,2 INY ;2,1 AddBit3 LSR ;2,1 BCC AddBit4 ;2,2 INY ;2,1 AddBit4 LSR ;2,1 BCC AddBit5 ;2,2 INY ;2,1 AddBit5 LSR ;2,1 BCC AddBit6 ;2,2 INY ;2,1 AddBit6 LSR ;2,1 BCC AddBit7 ;2,2 INY ;2,1 AddBit7 LSR ;2,1 BCC DoneAddBit ;2,2 INY ;2,1 DoneAddBit The code assumes that the target byte is already stored in A. The brute force method isn't that bad. The worst case path uses 6 x 8 + 2 = 50 cycles and 4 x 8 + 2 = 34 bytes. The resulting bit count is in Y. An alternative method is to use a lookup table. The fastest lookup table would be 256 bytes, which would only require 6 cycles. But I really don't want to burn 1 / 16 th of my ROM on a bit counter, so I went with a compromise. A two-pass nibble lookup: LDY #0 ;2,2 STY Temp1 ;3,2 STA Temp2 ;3,2 AND #%00001111 ;2,2 TAY ;2,1 LDA BitCounter,Y ;4,2 STA Temp3 ;3,2 LDA Temp2 ;3,2 LSR ;2,1 LSR ;2,1 LSR ;2,1 LSR ;2,1 TAY ;2,1 LDA BitCounter,Y ;4,2 CLC ;2,1 ADC Temp3 ;3,2 ADC Temp1 ;3,2 STA Temp1 ;3,2 BitCounter dc.b #0 ;0000 dc.b #1 ;0001 dc.b #1 ;0010 dc.b #2 ;0011 dc.b #1 ;0100 dc.b #2 ;0101 dc.b #2 ;0110 dc.b #3 ;0111 dc.b #1 ;1000 dc.b #2 ;1001 dc.b #2 ;1010 dc.b #3 ;1011 dc.b #2 ;1100 dc.b #3 ;1101 dc.b #3 ;1110 dc.b #4 ;1111 47 cycles and 45 bytes. A little bit faster, but 11 bytes larger. (The extra storage in Temp1 assumes that I'll be using this across multiple bytes, and that I'll want to keep a running total. The same is true for the brute force approach, but the total is automatically maintained in Y.) At this point, I'm going with the brute force approach. I'm curious if there's a way to make this any smaller or faster, so I'm interested in seeing other people's bit counting tricks. While I'm asking, can anyone recommend a website of 8-bit algorithm implementations? Does such a thing exist? I've found techniques here and there with Google searches, but I've never found a good centralized database of code snipets. Quote Link to comment Share on other sites More sharing options...
djmips Posted July 19, 2006 Share Posted July 19, 2006 (edited) Here's an old trick coded into 6502. I haven't tested it or properly stepped through the code in my head. :-) edit: fix bugs... tay and #$55 sta temp tya lsr and #$55 clc adc temp tay and #$33 sta temp tya lsr lsr and #$33 clc adc temp tay and #$0F sta temp tya lsr lsr lsr lsr clc adc temp Edited July 19, 2006 by djmips Quote Link to comment Share on other sites More sharing options...
djmips Posted July 19, 2006 Share Posted July 19, 2006 Like your first attempt but branchless. lsr adc #0 lsr adc #0 lsr adc #0 lsr adc #0 lsr adc #0 lsr adc #0 lsr adc #0 lsr adc #0 Quote Link to comment Share on other sites More sharing options...
vdub_bobby Posted July 19, 2006 Share Posted July 19, 2006 (edited) Like your first attempt but branchless. lsr adc #0 lsr adc #0 lsr adc #0 etc. Er, that won't work! Try this instead: sta Temp lda #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 lsr Temp adc #0 EDIT: 7*8 + 5 = 61 cycles and 4*8 + 4 = 36 bytes. 'NOTHER EDIT: LDY #0 ;2,2 STY Temp1 ;3,2 STA Temp2 ;3,2 AND #%00001111 ;2,2 TAY ;2,1 LDA BitCounter,Y ;4,2 STA Temp3 ;3,2 LDA Temp2 ;3,2 LSR ;2,1 LSR ;2,1 LSR ;2,1 LSR ;2,1 TAY ;2,1 LDA BitCounter,Y ;4,2 CLC ;2,1 ADC Temp3 ;3,2 ADC Temp1 ;3,2 STA Temp1 ;3,2 Temp1 isn't needed. You could remove all references to it with no effect. Trims 9 cycles and 6 bytes and your answer ends up in Temp3. Edited July 19, 2006 by vdub_bobby Quote Link to comment Share on other sites More sharing options...
djmips Posted July 19, 2006 Share Posted July 19, 2006 (edited) - wishful thinking. That's what I meant! Edited July 19, 2006 by djmips Quote Link to comment Share on other sites More sharing options...
djmips Posted July 19, 2006 Share Posted July 19, 2006 (edited) Temp1 isn't needed. You could remove all references to it with no effect. Trims 9 cycles and 6 bytes and your answer ends up in Temp3. Isn't that 11 cycles trimmed? edit: maybe not, but in any case, he wants to keep a running total so.... Edited July 19, 2006 by djmips Quote Link to comment Share on other sites More sharing options...
jbanes Posted July 19, 2006 Share Posted July 19, 2006 I presume you are at some point setting the bits yourself? In which case, have you considered using a counter that you increment/decrement every time you set/unset a bit? It would be a lot faster than setting each bit. If that doesn't work for you, then the "brute force" method is probably the fastest. You could put it into a loop to make it smaller, but unrolling the loop saves cycles. Quote Link to comment Share on other sites More sharing options...
supercat Posted July 20, 2006 Share Posted July 20, 2006 I'd suggest doing something like once per pair of bytes to count half the bits, then do it again with the stack pointer located one higher to count the other half. pla asr #$AA tax tya adc table,y tay pla and #$55 tax tya adc table,y tay Thirty cycles per eight bits. Code is needed to save/restore the stack pointer and set it up of course; code is also needed to special-case the first and last byte. Also, you may want some form of looping construct. Note that while this may seem to require a 128-byte table, it actually only needs 16 bytes of that table. One could "nestle in" other tables at base addresses table+2, table+8, table+10, table+32, table+34, table+42, and table+44. Strat-O-Gems uses this technique to do the conversions from packed gem data into colors. Quote Link to comment Share on other sites More sharing options...
djmips Posted July 20, 2006 Share Posted July 20, 2006 I'd suggest doing something like once per pair of bytes to count half the bits, then do it again with the stack pointer located one higher to count the other half. pla asr #$AA tax tya adc table,y tay pla and #$55 tax tya adc table,y tay Thirty cycles per eight bits. Code is needed to save/restore the stack pointer and set it up of course; code is also needed to special-case the first and last byte. Also, you may want some form of looping construct. Note that while this may seem to require a 128-byte table, it actually only needs 16 bytes of that table. One could "nestle in" other tables at base addresses table+2, table+8, table+10, table+32, table+34, table+42, and table+44. Strat-O-Gems uses this technique to do the conversions from packed gem data into colors. cool John has executed a nice pass here at X-coder 2006. He's using the fact that the stack overlays the zero page memory quite nicely: fetching the next byte with a pla in only 3 cycles. Look at the use of an undocumented opcode to avoid an extra instruction penalty and allows him to share the table for each 4 bits. Having each 4 bit section only cost 15 cycles by keeping everything largely in registers. Running over the data twice gets mucho bonus points, but you've got to add in that setup cost... Interleaving other data into the 16 byte table also cost him a little but overall a great run... Quote Link to comment Share on other sites More sharing options...
supercat Posted July 20, 2006 Share Posted July 20, 2006 He's using the fact that the stack overlays the zero page memory quite nicely: fetching the next byte with a pla in only 3 cycles. It'd be neat if I could fetch data with a PLA in three cycles. Not quite sure how to manage that though Someday I'll learn to count. Look at the use of an undocumented opcode to avoid an extra instruction penalty and allows him to share the table for each 4 bits. I dodge the "extra instruction penalty" but then slam into the wall with an incorrect register usage. Oops. Still, 32 cycles per 8 bits is still doing pretty well. Of course, if the 256-byte table were acceptable, that would be much faster still (probably eight cycles per byte). Quote Link to comment Share on other sites More sharing options...
+batari Posted July 20, 2006 Share Posted July 20, 2006 Here's an old trick coded into 6502. I haven't tested it or properly stepped through the code in my head. :-) edit: fix bugs... I think I optimized your code. It's not as fast as Supercat's (36 cycles) but it's short (no tables), doesn't require fancy setup (all you need is X loaded with $33) and frees up the Y-reg for counting through the 28 bytes we want to count. sta temp asr #$AA ; clc (not needed I think) sbc temp eor #$FF sax temp lsr asr #$66 adc temp sta temp lsr lsr lsr lsr adc temp and #$0F Quote Link to comment Share on other sites More sharing options...
TROGDOR Posted December 29, 2007 Author Share Posted December 29, 2007 (edited) It's been a year and a half, but I thought about this problem again last night, and came up with an optimal solution. Here's what the 256 byte lookup table would look like in a loop: LDX #27 CLC CountShieldBits LDY Shield,X ; 4, 2 LDA count8bits,Y ; 4, 3 ADC BitTotal ; 3, 2 STA BitTotal ; 3, 2 DEX ; 2, 1 BPL CountShieldBits ; 3, 2, 19 cycles per pass, 15 + 256 bytes = 271 bytes 19 cycles is fast, but 271 bytes consumed is unacceptable. If I can just remove one bit, I can halve the size of the lookup table: LDX #27 CountShieldBits LDA Shield,X ; 4, 2 LSR ; 2, 1 Carry holds one bit. TAY ; 2, 1 LDA count7bits,Y ; 4, 3 ADC BitTotal ; 3, 2 STA BitTotal ; 3, 2 DEX ; 2, 1 BPL CountShieldBits ; 3, 2, 23 cycles per pass, 16 + 128 bytes = 144 bytes 23 cycles per pass is still very fast, and 144 bytes is more acceptable ROM use. In the end, I will likely use jbane's idea. This is used to count the bits in the shield for Stellar Fortress, so I'll just maintain a counter, adding to it when the shield is recharged, and subtracting from it when the player shoots holes in it. Otherwise I'll have to give up 736 cycles to add up all the bits. (I expanded the shield to 32 bytes from 28 since writing the original question.) The strength of the shield will determine the pitch of the shield hum, similar to the original Star Castle game. Edited December 29, 2007 by TROGDOR Quote Link to comment Share on other sites More sharing options...
Rybags Posted December 29, 2007 Share Posted December 29, 2007 (edited) Don't know if this will help, but: - a BIT instruction will set the N, V and Z flags. Z flag allows a BEQ NOBITS to skip the byte entirely N,V individually take care of bits 7 and 6 LDA, then LSR could take care of bits 0 thru 5. No idea if that will save time at all. Alternatively, you might find some compromise - use BIT initially. Then, use your lookup table for the lower nybble. Finally, you could just use AND to test bits 4 and 5. ed: actually, it might be quicker to reload the byte and ASL A twice, test N, ASL, test N. Edited December 29, 2007 by Rybags Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.