Jump to content
IGNORED

Bit Counter Challenge


TROGDOR

Recommended Posts

Greetings Atari Junkies,

 

I'm continuing work on Stellar Fortress, and have encountered an interesting programming challenge. I need to count bits. Lots of bits. I need to count all the individual 1's across 28 bytes. I've been trying to write an optimally efficient algorithm to do this. My target is to get a good mix of minimal ROM usage and minimal cycles. Here's what I've come up with so far.

 

The first method is brute force:

 

	LDY #0		 ;2,2
AddBit0
	LSR			;2,1
	BCC AddBit1	;2,2
	INY			;2,1
AddBit1
	LSR			;2,1
	BCC AddBit2	;2,2
	INY			;2,1
AddBit2
	LSR			;2,1
	BCC AddBit3	;2,2
	INY			;2,1
AddBit3
	LSR			;2,1
	BCC AddBit4	;2,2
	INY			;2,1
AddBit4
	LSR			;2,1
	BCC AddBit5	;2,2
	INY			;2,1
AddBit5
	LSR			;2,1
	BCC AddBit6	;2,2
	INY			;2,1
AddBit6
	LSR			;2,1
	BCC AddBit7	;2,2
	INY			;2,1
AddBit7
	LSR			;2,1
	BCC DoneAddBit ;2,2
	INY			;2,1
DoneAddBit

The code assumes that the target byte is already stored in A.

 

The brute force method isn't that bad. The worst case path uses 6 x 8 + 2 = 50 cycles and 4 x 8 + 2 = 34 bytes. The resulting bit count is in Y.

 

An alternative method is to use a lookup table. The fastest lookup table would be 256 bytes, which would only require 6 cycles. But I really don't want to burn 1 / 16 th of my ROM on a bit counter, so I went with a compromise. A two-pass nibble lookup:

 

	LDY #0				 ;2,2
	STY Temp1			  ;3,2
	STA Temp2			  ;3,2
	AND #%00001111		 ;2,2
	TAY					;2,1
	LDA BitCounter,Y	   ;4,2
	STA Temp3			  ;3,2

	LDA Temp2			  ;3,2
	LSR					;2,1
	LSR					;2,1
	LSR					;2,1
	LSR					;2,1
	TAY					;2,1
	LDA BitCounter,Y	   ;4,2
	CLC					;2,1
	ADC Temp3			  ;3,2
	ADC Temp1			  ;3,2
	STA Temp1			  ;3,2

BitCounter
	dc.b #0		;0000
	dc.b #1		;0001
	dc.b #1		;0010
	dc.b #2		;0011
	dc.b #1		;0100
	dc.b #2		;0101
	dc.b #2		;0110
	dc.b #3		;0111
	dc.b #1		;1000
	dc.b #2		;1001
	dc.b #2		;1010
	dc.b #3		;1011
	dc.b #2		;1100
	dc.b #3		;1101
	dc.b #3		;1110
	dc.b #4		;1111

 

47 cycles and 45 bytes. A little bit faster, but 11 bytes larger. (The extra storage in Temp1 assumes that I'll be using this across multiple bytes, and that I'll want to keep a running total. The same is true for the brute force approach, but the total is automatically maintained in Y.)

 

At this point, I'm going with the brute force approach. I'm curious if there's a way to make this any smaller or faster, so I'm interested in seeing other people's bit counting tricks.

 

While I'm asking, can anyone recommend a website of 8-bit algorithm implementations? Does such a thing exist? I've found techniques here and there with Google searches, but I've never found a good centralized database of code snipets.

Link to comment
Share on other sites

Here's an old trick coded into 6502. I haven't tested it or properly stepped through the code in my head.

 

:-)

 

edit: fix bugs...

 

 tay
 and #$55
 sta temp
 tya
 lsr
 and #$55
 clc
 adc temp
 tay
 and #$33
 sta temp
 tya
 lsr
 lsr
 and #$33
 clc
 adc temp
 tay
 and #$0F
 sta temp
 tya
 lsr
 lsr
 lsr
 lsr
 clc
 adc temp

Edited by djmips
Link to comment
Share on other sites

Like your first attempt but branchless.

 

lsr
adc #0
lsr
adc #0
lsr
adc #0
	etc.

Er, that won't work!

Try this instead:

   sta Temp
  lda #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0
  lsr Temp
  adc #0

EDIT: 7*8 + 5 = 61 cycles and 4*8 + 4 = 36 bytes.

 

'NOTHER EDIT:

		LDY #0				;2,2
	STY Temp1			 ;3,2
	STA Temp2			 ;3,2
	AND #%00001111		;2,2
	TAY				   ;2,1
	LDA BitCounter,Y	  ;4,2
	STA Temp3			 ;3,2

	LDA Temp2			 ;3,2
	LSR				   ;2,1
	LSR				   ;2,1
	LSR				   ;2,1
	LSR				   ;2,1
	TAY				   ;2,1
	LDA BitCounter,Y	  ;4,2
	CLC				   ;2,1
	ADC Temp3			 ;3,2
	ADC Temp1			 ;3,2
	STA Temp1			 ;3,2

Temp1 isn't needed. You could remove all references to it with no effect. Trims 9 cycles and 6 bytes and your answer ends up in Temp3.

Edited by vdub_bobby
Link to comment
Share on other sites

Temp1 isn't needed. You could remove all references to it with no effect. Trims 9 cycles and 6 bytes and your answer ends up in Temp3.

 

Isn't that 11 cycles trimmed?

 

edit: maybe not, but in any case, he wants to keep a running total so....

Edited by djmips
Link to comment
Share on other sites

I presume you are at some point setting the bits yourself? In which case, have you considered using a counter that you increment/decrement every time you set/unset a bit? It would be a lot faster than setting each bit. If that doesn't work for you, then the "brute force" method is probably the fastest. You could put it into a loop to make it smaller, but unrolling the loop saves cycles.

Link to comment
Share on other sites

I'd suggest doing something like once per pair of bytes to count half the bits, then do it again with the stack pointer located one higher to count the other half.

 pla
 asr #$AA
 tax
 tya
 adc table,y
 tay
 pla
 and #$55
 tax
 tya
 adc table,y
 tay

Thirty cycles per eight bits. Code is needed to save/restore the stack pointer and set it up of course; code is also needed to special-case the first and last byte. Also, you may want some form of looping construct. Note that while this may seem to require a 128-byte table, it actually only needs 16 bytes of that table. One could "nestle in" other tables at base addresses table+2, table+8, table+10, table+32, table+34, table+42, and table+44. Strat-O-Gems uses this technique to do the conversions from packed gem data into colors.

Link to comment
Share on other sites

I'd suggest doing something like once per pair of bytes to count half the bits, then do it again with the stack pointer located one higher to count the other half.

 pla
 asr #$AA
 tax
 tya
 adc table,y
 tay
 pla
 and #$55
 tax
 tya
 adc table,y
 tay

Thirty cycles per eight bits. Code is needed to save/restore the stack pointer and set it up of course; code is also needed to special-case the first and last byte. Also, you may want some form of looping construct. Note that while this may seem to require a 128-byte table, it actually only needs 16 bytes of that table. One could "nestle in" other tables at base addresses table+2, table+8, table+10, table+32, table+34, table+42, and table+44. Strat-O-Gems uses this technique to do the conversions from packed gem data into colors.

 

cool :thumbsup:

 

John has executed a nice pass here at X-coder 2006. He's using the fact that the stack overlays the zero page memory quite nicely: fetching the next byte with a pla in only 3 cycles. Look at the use of an undocumented opcode to avoid an extra instruction penalty and allows him to share the table for each 4 bits. Having each 4 bit section only cost 15 cycles by keeping everything largely in registers. Running over the data twice gets mucho bonus points, but you've got to add in that setup cost... Interleaving other data into the 16 byte table also cost him a little but overall a great run...

Link to comment
Share on other sites

He's using the fact that the stack overlays the zero page memory quite nicely: fetching the next byte with a pla in only 3 cycles.

 

It'd be neat if I could fetch data with a PLA in three cycles. Not quite sure how to manage that though :P Someday I'll learn to count.

 

Look at the use of an undocumented opcode to avoid an extra instruction penalty and allows him to share the table for each 4 bits.

 

I dodge the "extra instruction penalty" but then slam into the wall with an incorrect register usage. Oops. Still, 32 cycles per 8 bits is still doing pretty well. Of course, if the 256-byte table were acceptable, that would be much faster still (probably eight cycles per byte).

Link to comment
Share on other sites

Here's an old trick coded into 6502. I haven't tested it or properly stepped through the code in my head.

 

:-)

 

edit: fix bugs...

I think I optimized your code. It's not as fast as Supercat's (36 cycles) but it's short (no tables), doesn't require fancy setup (all you need is X loaded with $33) and frees up the Y-reg for counting through the 28 bytes we want to count.

 sta temp
 asr #$AA
; clc (not needed I think)
 sbc temp
 eor #$FF
 sax temp
 lsr
 asr #$66
 adc temp
 sta temp
 lsr
 lsr
 lsr
 lsr
 adc temp
 and #$0F

Link to comment
Share on other sites

  • 1 year later...

It's been a year and a half, but I thought about this problem again last night, and came up with an optimal solution. Here's what the 256 byte lookup table would look like in a loop:

 

	LDX #27
CLC
CountShieldBits
LDY Shield,X		; 4, 2
LDA count8bits,Y	; 4, 3
ADC BitTotal		; 3, 2
STA BitTotal		; 3, 2
DEX			; 2, 1
BPL CountShieldBits	; 3, 2, 19 cycles per pass, 15 + 256 bytes = 271 bytes

 

19 cycles is fast, but 271 bytes consumed is unacceptable. If I can just remove one bit, I can halve the size of the lookup table:

 

	LDX #27
CountShieldBits
LDA Shield,X		; 4, 2
LSR			; 2, 1 Carry holds one bit.
TAY			; 2, 1
LDA count7bits,Y	; 4, 3
ADC BitTotal		; 3, 2
STA BitTotal		; 3, 2
DEX			; 2, 1
BPL CountShieldBits	; 3, 2, 23 cycles per pass, 16 + 128 bytes = 144 bytes

 

23 cycles per pass is still very fast, and 144 bytes is more acceptable ROM use.

 

In the end, I will likely use jbane's idea. This is used to count the bits in the shield for Stellar Fortress, so I'll just maintain a counter, adding to it when the shield is recharged, and subtracting from it when the player shoots holes in it. Otherwise I'll have to give up 736 cycles to add up all the bits. (I expanded the shield to 32 bytes from 28 since writing the original question.)

 

The strength of the shield will determine the pitch of the shield hum, similar to the original Star Castle game. ;)

Edited by TROGDOR
Link to comment
Share on other sites

Don't know if this will help, but:

 

- a BIT instruction will set the N, V and Z flags.

Z flag allows a BEQ NOBITS to skip the byte entirely

 

N,V individually take care of bits 7 and 6

 

LDA, then LSR could take care of bits 0 thru 5.

 

No idea if that will save time at all.

 

Alternatively, you might find some compromise - use BIT initially.

 

Then, use your lookup table for the lower nybble.

 

Finally, you could just use AND to test bits 4 and 5.

 

ed: actually, it might be quicker to reload the byte and ASL A twice, test N, ASL, test N.

Edited by Rybags
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...