IGNORED

# 6502 Killer hacks

## Recommended Posts

And here is the code for when you are doing more bits at once. 3 in this example.

  lsr
lsr
lsr
adc #$F0 eor #$F0

Cool! ##### Share on other sites

Getting the 2's complement of the accumulator, rudimentary, but the extra here is two implementations.

One version uses clc, the other sec so you could save the sec or clc if you know the prior state of the carry flag.

;2's complement x = x ^ $FF + 1 ;assumes carry is clear eor #$FF

; 2's complement x = x - 1 ^ $FF ; assumes carry is set sbc #01 eor #$FF

##### Share on other sites

• 2 weeks later...
Cool! Here's another one:

Modulus.

I need to find n mod 20. Well, to be specific, I need to find if n mod 20 is zero or otherwise.

Here's how I do it:

   lda n
sec
Loop
sbc #20
bcc Remainder
bne Loop
;--remainder is zero

Which is slow. Is there a better way?

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

Modulo20
; input: A = the unsigned number to test (0 to 255)
tax ; Need to set Z-flag for special case that A is already 0.  ( your original code did not do this so it may not be important.)
sec
loop
beq done
sbc #40
bcs loop
; C = 0
done
; Z = 1 if the input number was a multiple of 20.
; X = the original number
; A = garbage.

##### Share on other sites

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

  sec
u160: ; Test 160 from above
sbc #160
bcc b80
u80:
sbc #80
bcc b40
u40:
sbc #40
bcc b20
u20:
sbc #20
bcc bzero
done:
rts
b80: ; Test 80 from below
bcs u40
b40:
bcs u20
b20:
bcs done
bzero:
rts

That should take 10 cycles for adc/sbc's, 8-12 for branches, and six for the RTS.

##### Share on other sites

• 7 months later...

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

What I've got is this:

  ; a.b * c.d = e.fg
lda #0
sta e
sta f
sta g
ldx a
ldy b
clc
MultiplyLoop
lda g
sta g
lda f
sta f
lda e
sta e
dey
bne MultiplyLoop
dex
bne MultiplyLoop
;result in e.fg ... ignore g to get e.f as 8.8 result

(This ignores some zero cases.)

The main problem is that this is slower than dirt.

##### Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

How much space do you have?

This is one of my all time favourites. ##### Share on other sites

C= Hacking was brilliant, too bad it's been a few years already since the last issue. Still joping for its return ##### Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

How much space do you have?

This is one of my all time favourites. Do you mean the f(x) = (x^2)/4 method?

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space. Edited by vdub_bobby
##### Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

well if you're using fixed point notation, then 8.8 is the same as 16.0, right? I mean, it's just a matter of how you interpret it. with that said I think you could use this method (scroll down to 'Binary Multiplication'):

http://www.atariarchives.org/roots/chapter_10.php

Basically the algorithm is long multiplication like you'd do by hand, except in binary form.

##### Share on other sites

(scroll down to 'Binary Multiplication'):

Three useful optimizations I would suggest:

• Subtract 1 from the multiplicand stored in memory, to eliminate the in-loop CLC.
• Store the other multiplicand into the product low byte location. In that way, the ROR that stores the product low bits can also sample the next bit of the multiplicand.
• If code space permits, unroll the loop at least somewhat. Unrolling 2x will save about twenty cycles.

Such approaches still won't be as fast as the table-lookup approach, but they'll be pretty good.

##### Share on other sites

Do you mean the f(x) = (x^2)/4 method?

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space. No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

Let's see if I can remember it...

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Edited by Thomas Jentzsch
##### Share on other sites

No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

Let's see if I can remember it...

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Thanks.

It turns out what I really need is A0:8 * B8:8 and I don't need the least-significant byte.

So I think I can do this...

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

My next question!

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

##### Share on other sites

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

You have to double A0:8 * B8:0

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

Sure, 2 tables here. We are getting close to the 5 tables I remember. And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

That would be another 2 tables, yes. Unless I miss something. ##### Share on other sites

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

You have to double A0:8 * B8:0

Nah, slightly different algorithm, I think.

Use base-10, let each letter be a digit: a * bc = a * ( ( 10 * b ) + c ) = 10 * a * b + a * c

Same thing, only where each "digit" is an 8-bit number, the base is 256.

So a * bc = a * ( ( 256 * b ) + c ) = 256 * a * b + a * c

In other words, I drop the "place" value of each digit and then add it back in at the end, kinda like when you do multiplication on paper:

	 a b
* c d
------
ac bc
--------

Using a lot of large tables, I think I can get my A0:8 * B8:8 multiplication, throwing away the bottom byte of the result, in 77 cycles:

  ;--a * b:c
lda a
sta IntPtrPlus
sta FractPtrPlus
sta IntPtrMinus
sta FractPtrMinus   ;four tables so I don't have to figure -c
ldy c
lda (IntPtrPlus),Y
sec
sbc (IntPtrMinus),Y
sta ResultInt1
;--don't care about bottom byte of a * c since I'll drop it anyway
ldy b
lda (IntPtrPlus),Y
sec
sbc (IntPtrMinus),Y
sta ResultInt2
lda (FractPtrPlus),Y
sbc (FractPtrMinus),Y
clc
sta ResultInt1
lda ResultInt2
sta ResultInt2

I *think* that will work...

##### Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. Cheers!

##### Share on other sites

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work. Edited by vdub_bobby
##### Share on other sites

Nah, slightly different algorithm, I think.

You are right. My mistake.

##### Share on other sites

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work. Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

##### Share on other sites

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think. ##### Share on other sites

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

Cheers!

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think. I'm not sure if that is correct or not...it doesn't seem like it to me, but I have to think for a long time to wrap my head around these things But in any case, I don't think it will help my current routine, other than reducing the size of my tables, since multiplying a nibble by a byte will be the same as multiplying a byte by a byte. Or... Edited by vdub_bobby
##### Share on other sites

Okay, I did a spreadsheet in Excel to prove this to myself.

0.8 * 8.8 = 8.16

but we are throwing the lower 8 bits of the answer away which makes the problem this:

0.8 * 8.8 = 8.16 / 256

We want to move the division by 256 to the left side of the equation, which because it is applied after the multiplication we must take the square root of 256. That is 16, so:

(0.8 / 16) * (8.8 / 16) = 8.16 / 256 [Excel says I am right ]

So the problem reduces to:

0.4 * 8.4 = 8.8

I don't know if that makes the code easier, but it should get the same answer as the original (0.8 * 8.8)/256 minus a small rounding error.

##### Share on other sites

Hmmm. Have to think about that...

EDIT: I'm not exactly following this. Doesn't 8.16 / 256 = 0.24 ?

EDIT: Actually, more generally, I think I follow your argument - but what is that rounding error? How big is it?

After all, 2.95 * 0.73 = 2.1535

If you throw away the bottom two digits, you get 2.15 which is within 1/2% of the true answer.

But if you truncate before multiplying you get 2.03, which is over 5% away from the actual answer.

And the difference would be larger for truncating 4 bits, so I dunno...a pre-truncating routine would have to be smoking fast to be worth the error, I think. Edited by vdub_bobby
##### Share on other sites

0.4 * 8.4 = 8.8

Or e.g. 0.2 * 8.6? ##### Share on other sites

0.4 * 8.4 = 8.8

Or e.g. 0.2 * 8.6? Okay my shorthand was too short 8.8 means 8 bits integer and 8 bits fraction.

so the original equation was (. * .) >> 8 , where >> 8 is the same as /256

I am saying you can divide the arguments before you multiply and get the same answer with less multiplying and the same amount of dividing.

((. >> 4) * (. >> 4)) == (. * .) >> 8

The original equation on the right is easier to round after the divide by bitshift since you can just add the .5 in the carry back into the answer.

The new equation has two shifts and you would need to remember the carry from both and round up the answer by one only if both were set.

The equation on the left after the shift is a nybble * 12 bits which I think should be easier than a byte times 16 bits. Just the rounding gets

tougher. I will try to come up with some code shortly.

##### Share on other sites

• 2 months later...

I personally like to use BRANCH instructions instead of JMP wherever possible. It saves a byte of ROM each time, and the cycle cost is the same. There are so many branch instructions you can usually find one that will execute 100% of the time for your given application.

## Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible. ×   Pasted as rich text.   Paste as plain text instead

Only 75 emoji are allowed.