6502 Killer hacks

Thomas Jentzsch · May 22, 2007

And here is the code for when you are doing more bits at once. 3 in this example.
  lsr
 lsr
 lsr
 adc #$F0
 eor #$F0

Cool!

djmips · May 23, 2007

Getting the 2's complement of the accumulator, rudimentary, but the extra here is two implementations.

One version uses clc, the other sec so you could save the sec or clc if you know the prior state of the carry flag.

;2's complement x = x ^ $FF + 1
;assumes carry is clear
 eor #$FF
 adc #01

; 2's complement x = x - 1 ^ $FF  
; assumes carry is set
 sbc #01
 eor #$FF

Robert M · May 31, 2007

Cool! Here's another one:

Modulus.

I need to find n mod 20. Well, to be specific, I need to find if n mod 20 is zero or otherwise.

Here's how I do it:
   lda n
  sec
Loop
  sbc #20
  bcc Remainder
  bne Loop
;--remainder is zero
Which is slow. Is there a better way?

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

Modulo20
 ; input: A = the unsigned number to test (0 to 255)
  tax ; Need to set Z-flag for special case that A is already 0.  ( your original code did not do this so it may not be important.)
  sec
loop
  beq done
  sbc #40
  bcs loop
 ; C = 0
  adc #20
done
 ; Z = 1 if the input number was a multiple of 20.
 ; X = the original number
 ; A = garbage.

supercat · May 31, 2007

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

How about something like:

  sec
u160: ; Test 160 from above
 sbc #160
 bcc b80
u80:
 sbc #80
 bcc b40
u40:
 sbc #40
 bcc b20
u20:
 sbc #20
 bcc bzero
done:
 rts
b80: ; Test 80 from below
 adc #80
 bcs u40
b40:
 adc #40
 bcs u20
b20:
 adc #20
 bcs done
bzero:
 adc #20
 rts

That should take 10 cycles for adc/sbc's, 8-12 for branches, and six for the RTS.

vdub_bobby · January 23, 2008

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

What I've got is this:

  ; a.b * c.d = e.fg
  lda #0
  sta e
  sta f
  sta g
  ldx a
  ldy b
  clc
MultiplyLoop
  lda g
  adc d
  sta g
  lda f
  adc d
  sta f
  lda e
  adc #0
  sta e
  dey
  bne MultiplyLoop
  dex
  bne MultiplyLoop
 ;result in e.fg ... ignore g to get e.f as 8.8 result

(This ignores some zero cases.)

The main problem is that this is slower than dirt.

Thomas Jentzsch · January 23, 2008

Anyone have a killer hack for 8.8 multiplication?
I need to do 8.8 * 8.8 and have the result be 8.8.

How much space do you have?

Check this link: http://www.ffd2.com/fridge/chacking/c=hacking9.txt

This is one of my all time favourites.

Cybergoth · January 23, 2008

C= Hacking was brilliant, too bad it's been a few years already since the last issue. Still joping for its return

vdub_bobby · January 23, 2008

Anyone have a killer hack for 8.8 multiplication?
I need to do 8.8 * 8.8 and have the result be 8.8.

How much space do you have?

Check this link: http://www.ffd2.com/fridge/chacking/c=hacking9.txt

This is one of my all time favourites.

Do you mean the f(x) = (x^2)/4 method?

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space.

Edited January 23, 2008 by vdub_bobby

Ben_Larson · January 23, 2008

Anyone have a killer hack for 8.8 multiplication?
I need to do 8.8 * 8.8 and have the result be 8.8.

well if you're using fixed point notation, then 8.8 is the same as 16.0, right? I mean, it's just a matter of how you interpret it. with that said I think you could use this method (scroll down to 'Binary Multiplication'):

http://www.atariarchives.org/roots/chapter_10.php

Basically the algorithm is long multiplication like you'd do by hand, except in binary form.

supercat · January 24, 2008

(scroll down to 'Binary Multiplication'):

Three useful optimizations I would suggest:

Subtract 1 from the multiplicand stored in memory, to eliminate the in-loop CLC.
Store the other multiplicand into the product low byte location. In that way, the ROR that stores the product low bits can also sample the next bit of the multiplicand.
If code space permits, unroll the loop at least somewhat. Unrolling 2x will save about twenty cycles.

Such approaches still won't be as fast as the table-lookup approach, but they'll be pretty good.

Thomas Jentzsch · January 24, 2008

Do you mean the f(x) = (x^2)/4 method?

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space.

No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

Let's see if I can remember it...

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Edited January 24, 2008 by Thomas Jentzsch

vdub_bobby · January 24, 2008

No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

Let's see if I can remember it...

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Thanks.

It turns out what I really need is A0:8 * B8:8 and I don't need the least-significant byte.

So I think I can do this...

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

My next question!

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

Thomas Jentzsch · January 24, 2008

A0:8 * B0:8 + 256*(A0:8 * B8:0)
and then just throw away the bottom byte.

You have to double A0:8 * B8:0

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

Sure, 2 tables here. We are getting close to the 5 tables I remember.

And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

That would be another 2 tables, yes. Unless I miss something. :ponder:

vdub_bobby · January 24, 2008

A0:8 * B0:8 + 256*(A0:8 * B8:0)
and then just throw away the bottom byte.

You have to double A0:8 * B8:0

Nah, slightly different algorithm, I think.

Use base-10, let each letter be a digit: a * bc = a * ( ( 10 * b ) + c ) = 10 * a * b + a * c

Same thing, only where each "digit" is an 8-bit number, the base is 256.

So a * bc = a * ( ( 256 * b ) + c ) = 256 * a * b + a * c

In other words, I drop the "place" value of each digit and then add it back in at the end, kinda like when you do multiplication on paper:

	 a b
  * c d
  ------
  ad bd
ac bc
--------

Using a lot of large tables, I think I can get my A0:8 * B8:8 multiplication, throwing away the bottom byte of the result, in 77 cycles:

  ;--a * b:c
  lda a
  sta IntPtrPlus
  sta FractPtrPlus
  sta IntPtrMinus
  sta FractPtrMinus   ;four tables so I don't have to figure -c
  ldy c
  lda (IntPtrPlus),Y
  sec
  sbc (IntPtrMinus),Y
  sta ResultInt1
 ;--don't care about bottom byte of a * c since I'll drop it anyway
  ldy b
  lda (IntPtrPlus),Y
  sec
  sbc (IntPtrMinus),Y
  sta ResultInt2
  lda (FractPtrPlus),Y
  sbc (FractPtrMinus),Y
  clc
  adc ResultInt1
  sta ResultInt1
  lda ResultInt2
  adc #0
  sta ResultInt2

I *think* that will work...

Robert M · January 24, 2008

Anyone have a killer hack for 8.8 multiplication?
I need to do 8.8 * 8.8 and have the result be 8.8.

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things.

Cheers!

vdub_bobby · January 24, 2008

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things.

If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work.

Edited January 24, 2008 by vdub_bobby

Thomas Jentzsch · January 24, 2008

Nah, slightly different algorithm, I think.

You are right. My mistake.

Robert M · January 24, 2008

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things.

If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work.

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

Robert M · January 24, 2008

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think. :ponder:

vdub_bobby · January 24, 2008

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think.

I'm not sure if that is correct or not...it doesn't seem like it to me, but I have to think for a long time to wrap my head around these things :lol:

But in any case, I don't think it will help my current routine, other than reducing the size of my tables, since multiplying a nibble by a byte will be the same as multiplying a byte by a byte. Or... :ponder:

Edited January 24, 2008 by vdub_bobby

Robert M · January 24, 2008

Okay, I did a spreadsheet in Excel to prove this to myself.

0.8 * 8.8 = 8.16

but we are throwing the lower 8 bits of the answer away which makes the problem this:

0.8 * 8.8 = 8.16 / 256

We want to move the division by 256 to the left side of the equation, which because it is applied after the multiplication we must take the square root of 256. That is 16, so:

(0.8 / 16) * (8.8 / 16) = 8.16 / 256 [Excel says I am right :ponder: ]

So the problem reduces to:

0.4 * 8.4 = 8.8

I don't know if that makes the code easier, but it should get the same answer as the original (0.8 * 8.8)/256 minus a small rounding error.

vdub_bobby · January 24, 2008

Hmmm. Have to think about that...

EDIT: I'm not exactly following this. Doesn't 8.16 / 256 = 0.24 ?

EDIT: Actually, more generally, I think I follow your argument - but what is that rounding error? How big is it?

After all, 2.95 * 0.73 = 2.1535

If you throw away the bottom two digits, you get 2.15 which is within 1/2% of the true answer.

But if you truncate before multiplying you get 2.03, which is over 5% away from the actual answer.

And the difference would be larger for truncating 4 bits, so I dunno...a pre-truncating routine would have to be smoking fast to be worth the error, I think. :ponder:

Edited January 24, 2008 by vdub_bobby

Thomas Jentzsch · January 24, 2008

0.4 * 8.4 = 8.8

Or e.g. 0.2 * 8.6? :ponder:

Robert M · January 25, 2008

0.4 * 8.4 = 8.8

Or e.g. 0.2 * 8.6?

Okay my shorthand was too short

8.8 means 8 bits integer and 8 bits fraction.

so the original equation was ([0].[8] * [8].[8]) >> 8 , where >> 8 is the same as /256

I am saying you can divide the arguments before you multiply and get the same answer with less multiplying and the same amount of dividing.

(([0].[8] >> 4) * ([8].[8] >> 4)) == ([0].[8] * [8].[8]) >> 8

The original equation on the right is easier to round after the divide by bitshift since you can just add the .5 in the carry back into the answer.

The new equation has two shifts and you would need to remember the carry from both and round up the answer by one only if both were set.

The equation on the left after the shift is a nybble * 12 bits which I think should be easier than a byte times 16 bits. Just the rounding gets

tougher. I will try to come up with some code shortly.

Hornpipe2 · April 11, 2008

I personally like to use BRANCH instructions instead of JMP wherever possible. It saves a byte of ROM each time, and the cycle cost is the same. There are so many branch instructions you can usually find one that will execute 100% of the time for your given application.

6502 Killer hacks

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members