Jump to content
IGNORED

6502 Killer hacks


djmips

Recommended Posts

Getting the 2's complement of the accumulator, rudimentary, but the extra here is two implementations.

 

One version uses clc, the other sec so you could save the sec or clc if you know the prior state of the carry flag.

 

;2's complement x = x ^ $FF + 1
;assumes carry is clear
 eor #$FF
 adc #01

; 2's complement x = x - 1 ^ $FF  
; assumes carry is set
 sbc #01
 eor #$FF

Link to comment
Share on other sites

  • 2 weeks later...
Cool! Here's another one:

 

Modulus.

 

I need to find n mod 20. Well, to be specific, I need to find if n mod 20 is zero or otherwise.

Here's how I do it:

   lda n
  sec
Loop
  sbc #20
  bcc Remainder
  bne Loop
;--remainder is zero

Which is slow. Is there a better way?

 

 

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

 

Modulo20
 ; input: A = the unsigned number to test (0 to 255)
  tax ; Need to set Z-flag for special case that A is already 0.  ( your original code did not do this so it may not be important.)
  sec
loop
  beq done
  sbc #40
  bcs loop
 ; C = 0
  adc #20
done
 ; Z = 1 if the input number was a multiple of 20.
 ; X = the original number
 ; A = garbage.

Link to comment
Share on other sites

That division by 5 is nasty for optimization. Instead I recommend dividing by 40 so that the average number of times through the loop is cut in half plus 1 add instruction. Its slower for some cases, but I estimate about 45% faster on average.

 

How about something like:

  sec
u160: ; Test 160 from above
 sbc #160
 bcc b80
u80:
 sbc #80
 bcc b40
u40:
 sbc #40
 bcc b20
u20:
 sbc #20
 bcc bzero
done:
 rts
b80: ; Test 80 from below
 adc #80
 bcs u40
b40:
 adc #40
 bcs u20
b20:
 adc #20
 bcs done
bzero:
 adc #20
 rts

That should take 10 cycles for adc/sbc's, 8-12 for branches, and six for the RTS.

Link to comment
Share on other sites

  • 7 months later...

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

 

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

 

What I've got is this:

  ; a.b * c.d = e.fg
  lda #0
  sta e
  sta f
  sta g
  ldx a
  ldy b
  clc
MultiplyLoop
  lda g
  adc d
  sta g
  lda f
  adc d
  sta f
  lda e
  adc #0
  sta e
  dey
  bne MultiplyLoop
  dex
  bne MultiplyLoop
 ;result in e.fg ... ignore g to get e.f as 8.8 result

(This ignores some zero cases.)

The main problem is that this is slower than dirt.

Link to comment
Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

How much space do you have?

 

Check this link: http://www.ffd2.com/fridge/chacking/c=hacking9.txt

 

This is one of my all time favourites. :)

Do you mean the f(x) = (x^2)/4 method?

 

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space. :)

Edited by vdub_bobby
Link to comment
Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

 

well if you're using fixed point notation, then 8.8 is the same as 16.0, right? I mean, it's just a matter of how you interpret it. with that said I think you could use this method (scroll down to 'Binary Multiplication'):

 

http://www.atariarchives.org/roots/chapter_10.php

 

Basically the algorithm is long multiplication like you'd do by hand, except in binary form.

Link to comment
Share on other sites

(scroll down to 'Binary Multiplication'):

 

Three useful optimizations I would suggest:

 

  • Subtract 1 from the multiplicand stored in memory, to eliminate the in-loop CLC.
  • Store the other multiplicand into the product low byte location. In that way, the ROR that stores the product low bits can also sample the next bit of the multiplicand.
  • If code space permits, unroll the loop at least somewhat. Unrolling 2x will save about twenty cycles.

Such approaches still won't be as fast as the table-lookup approach, but they'll be pretty good.

Link to comment
Share on other sites

Do you mean the f(x) = (x^2)/4 method?

 

That requires a table of squares, which is fine for 8 bits, but how do I translate that to 16 bits? The table I would need is enormous! I don't have that much space. :)

No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

 

Let's see if I can remember it...

 

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

 

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Edited by Thomas Jentzsch
Link to comment
Share on other sites

No, you only require those 8 bit tables. I did this back in the 80ies with my C64. IIRC I did use 5 256 bytes tables (some were used just to save a few cycles) for 32 bit FP math (Mandelbrot).

 

Let's see if I can remember it...

 

A8:8*B8:8 = A8:0*B8:0 + 2*A8:0*B0:8 + 2*A0:8*B8:0 + A0:8*B:08

 

So everything is reduced to 8 (or 9) bits. You can get away with just one 256 bytes table. Or you can do extra tables for the 9 bit results to save a few extra cycles.

Thanks.

 

It turns out what I really need is A0:8 * B8:8 and I don't need the least-significant byte.

 

So I think I can do this...

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

 

My next question!

 

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

 

And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

Link to comment
Share on other sites

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

You have to double A0:8 * B8:0

 

You need a table of the values (x^2)/4 for x=0 to 255. But what happens when (x^2)/4 > 255? I.e., for x>31? Don't you need two tables, one for the lower byte and one for the upper byte.

Sure, 2 tables here. We are getting close to the 5 tables I remember. :)

 

And don't your tables need to be ~512 bytes long, for when x+y>255? Since xy=f(x+y)-f(x-y)

That would be another 2 tables, yes. Unless I miss something. :ponder:

Link to comment
Share on other sites

A0:8 * B0:8 + 256*(A0:8 * B8:0)

and then just throw away the bottom byte.

You have to double A0:8 * B8:0

Nah, slightly different algorithm, I think.

 

Use base-10, let each letter be a digit: a * bc = a * ( ( 10 * b ) + c ) = 10 * a * b + a * c

 

Same thing, only where each "digit" is an 8-bit number, the base is 256.

So a * bc = a * ( ( 256 * b ) + c ) = 256 * a * b + a * c

 

In other words, I drop the "place" value of each digit and then add it back in at the end, kinda like when you do multiplication on paper:

 

	 a b
  * c d
  ------
  ad bd
ac bc
--------

 

Using a lot of large tables, I think I can get my A0:8 * B8:8 multiplication, throwing away the bottom byte of the result, in 77 cycles:

  ;--a * b:c
  lda a
  sta IntPtrPlus
  sta FractPtrPlus
  sta IntPtrMinus
  sta FractPtrMinus   ;four tables so I don't have to figure -c
  ldy c
  lda (IntPtrPlus),Y
  sec
  sbc (IntPtrMinus),Y
  sta ResultInt1
 ;--don't care about bottom byte of a * c since I'll drop it anyway
  ldy b
  lda (IntPtrPlus),Y
  sec
  sbc (IntPtrMinus),Y
  sta ResultInt2
  lda (FractPtrPlus),Y
  sbc (FractPtrMinus),Y
  clc
  adc ResultInt1
  sta ResultInt1
  lda ResultInt2
  adc #0
  sta ResultInt2

 

I *think* that will work...

Link to comment
Share on other sites

Anyone have a killer hack for 8.8 multiplication?

I need to do 8.8 * 8.8 and have the result be 8.8.

 

I know that the integer part of the result will not exceed 8 bits and I don't need a 16-bit remainder.

 

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

 

 

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. ;)

 

Cheers!

Link to comment
Share on other sites

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

 

 

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. ;)

If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work. :)

Edited by vdub_bobby
Link to comment
Share on other sites

So is this really a 4.8 * 4.8 multiply, or you can just guarantee that if the one value is bigger than 12 bits, then the other will be sufficiently small to prevent an 8-bit overflow?

 

 

If it is really just 4.8 * 4.8 and you are discarding the fraction, then you can just do 4.4 * 4.4. Because the lower 4 bits of the fractions can never effect the integer portion of the result. 4.4 * 4.4 is just a bit shifted 8*8. I think that will simplify things. ;)

If you scroll/read down a bit, you'll see that it is really 0.8 * 8.8 and I want an 8.8 result. And I wrote a ~ 1-scanline multiply routine, just waiting for someone to tell me (a) how to make it faster or (b) how it won't work. :)

 

 

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

 

Cheers!

Link to comment
Share on other sites

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

 

Cheers!

 

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think. :ponder:

Link to comment
Share on other sites

Since you are thowing away the bottom byte of the 8.16 result, doesn't the problem reduce to 0.4 * 8.4, and you can just use a lookup table with the 16 possible 8.8 results? Or is my thinking all screwy on this?

 

Cheers!

 

My thinking is screwy. The table won't work, but the reduction to 0.4 * 8.8 is still correct I think. :ponder:

I'm not sure if that is correct or not...it doesn't seem like it to me, but I have to think for a long time to wrap my head around these things :lol:

 

But in any case, I don't think it will help my current routine, other than reducing the size of my tables, since multiplying a nibble by a byte will be the same as multiplying a byte by a byte. Or...:ponder:

Edited by vdub_bobby
Link to comment
Share on other sites

Okay, I did a spreadsheet in Excel to prove this to myself.

 

0.8 * 8.8 = 8.16

 

but we are throwing the lower 8 bits of the answer away which makes the problem this:

 

0.8 * 8.8 = 8.16 / 256

 

We want to move the division by 256 to the left side of the equation, which because it is applied after the multiplication we must take the square root of 256. That is 16, so:

 

(0.8 / 16) * (8.8 / 16) = 8.16 / 256 [Excel says I am right :ponder: ]

 

So the problem reduces to:

 

0.4 * 8.4 = 8.8

 

I don't know if that makes the code easier, but it should get the same answer as the original (0.8 * 8.8)/256 minus a small rounding error.

Link to comment
Share on other sites

Hmmm. Have to think about that...

 

EDIT: I'm not exactly following this. Doesn't 8.16 / 256 = 0.24 ?

 

EDIT: Actually, more generally, I think I follow your argument - but what is that rounding error? How big is it?

 

After all, 2.95 * 0.73 = 2.1535

If you throw away the bottom two digits, you get 2.15 which is within 1/2% of the true answer.

But if you truncate before multiplying you get 2.03, which is over 5% away from the actual answer.

 

And the difference would be larger for truncating 4 bits, so I dunno...a pre-truncating routine would have to be smoking fast to be worth the error, I think. :ponder:

Edited by vdub_bobby
Link to comment
Share on other sites

0.4 * 8.4 = 8.8

Or e.g. 0.2 * 8.6? :ponder:

 

 

 

Okay my shorthand was too short ;)

 

8.8 means 8 bits integer and 8 bits fraction.

 

so the original equation was ([0].[8] * [8].[8]) >> 8 , where >> 8 is the same as /256

 

I am saying you can divide the arguments before you multiply and get the same answer with less multiplying and the same amount of dividing.

 

(([0].[8] >> 4) * ([8].[8] >> 4)) == ([0].[8] * [8].[8]) >> 8

 

The original equation on the right is easier to round after the divide by bitshift since you can just add the .5 in the carry back into the answer.

The new equation has two shifts and you would need to remember the carry from both and round up the answer by one only if both were set.

The equation on the left after the shift is a nybble * 12 bits which I think should be easier than a byte times 16 bits. Just the rounding gets

tougher. I will try to come up with some code shortly.

Link to comment
Share on other sites

  • 2 months later...

I personally like to use BRANCH instructions instead of JMP wherever possible. It saves a byte of ROM each time, and the cycle cost is the same. There are so many branch instructions you can usually find one that will execute 100% of the time for your given application.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...