6502 Killer hacks

+SvOlli · February 26, 2021

Not a killer hack, but still something I found useful. For easy notation: "A" means accumulator, "R" is a value stored in RAM.

So, doing "A minus R" is easy

SEC
SBC R

I was faced with the problem of doing "R minus A", remembering of something I learned at school called "2s complement", I came up with this:

EOR #$FF
SEC
ADC R

So, what I did there was "minus A plus R". And it worked like a charm with a "cost" of two bytes and two cycles.

And also:

On 2/25/2021 at 12:16 PM, c3po said:

Well, for the sake of hacks, I think in any long program it's a useful commodity to have a table like this:
[...]

This allows some interesting "new instructions":

That's a very nice idea. But "spending" 1/16 of a bank seems to be a bit expensive. But coding for coding for other 6502 based platforms (C128/C64 right now), I will definitely use this, as the table could be also created with a simple loop in RAM. (That would be a fun thing in SuperCharger.) Thanks!

c3po · February 27, 2021

19 hours ago, SvOlli said:
I was faced with the problem of doing "R minus A", remembering of something I learned at school called "2s complement", I came up with this:
EOR #$FF
SEC
ADC R

I like it.

You can still do a normal "A minus R" then complement it

SEC
SBC R
EOR #$FF
CLC
ADC #01

simplified to:

CLC
SBC R
EOR #$FF

which is useful if the carry is known to be cleared already.

Thomas Jentzsch · February 27, 2021

A simple hack: Use EOR BEQ/BNE instead of CMP BEQ/BNE.

This has two advantages which might be useful:

A is 0 if equal
The C and V flags are not changed

c3po · March 19, 2021

If you need to do a modulo operation where the divisor is a power of two, you can use AND: M mod N = M AND (N-1).

; M mod 8 (8 is a power of 2...)

LDA M

AND #7 ; 8-1

mzxrules · June 18, 2021

not sure if it counts as a killer hack, and maybe this has been thought of before but I'll post it anyway.

My game is structured such that I have a world composed of 256 rooms, and then the playfield of each room is composed of up to 64 page-aligned 8x16 sprites, which comes out to 4 pages of sprite data. Instead of storing an index from 0-63 in the world data, and then multiply it later by 16 to build the 2 byte room sprite pointer, I swap the first 4 least significant bits with the 4 most significant bits and store that in my world data.

For example, let's say my sprite data is at F000, and say I want to store an index to sprite 0x2D, which is located at F2D0.

If I store the index as 0xD2 instead, then I can generate the room sprite pointer without needing 2 byte multiplication:

ldx roomId
lda roomSpriteIdx, x
pha
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
pla
and #$0F
clc      ; These two lines can also just be 
adc #$F0 ; ora #$F0
sta roomSprPtrH

+Andrew Davie · June 18, 2021

11 hours ago, mzxrules said:

not sure if it counts as a killer hack, and maybe this has been thought of before but I'll post it anyway.

My game is structured such that I have a world composed of 256 rooms, and then the playfield of each room is composed of up to 64 page-aligned 8x16 sprites, which comes out to 4 pages of sprite data. Instead of storing an index from 0-63 in the world data, and then multiply it later by 16 to build the 2 byte room sprite pointer, I swap the first 4 least significant bits with the 4 most significant bits and store that in my world data.

For example, let's say my sprite data is at F000, and say I want to store an index to sprite 0x2D, which is located at F2D0.

If I store the index as 0xD2 instead, then I can generate the room sprite pointer without needing 2 byte multiplication:

A neat observation.

Just an aside: you can use the assembler to do the hard work for you in generating the swapped table - do not do it manually.

Something like this... makes it visually easier (you see actual sprite # in the human-readable code) and also much easier to change later

mac swapn
 .byte ({1}>>4)|(({1}&15)<<4)
endm


Table  ; list of sprite indexes
    swapn 22
    swapn 41
    swapn 12
    swapn 3

Omegamatrix · June 19, 2021

14 hours ago, mzxrules said:
not sure if it counts as a killer hack, and maybe this has been thought of before but I'll post it anyway.

My game is structured such that I have a world composed of 256 rooms, and then the playfield of each room is composed of up to 64 page-aligned 8x16 sprites, which comes out to 4 pages of sprite data. Instead of storing an index from 0-63 in the world data, and then multiply it later by 16 to build the 2 byte room sprite pointer, I swap the first 4 least significant bits with the 4 most significant bits and store that in my world data.

For example, let's say my sprite data is at F000, and say I want to store an index to sprite 0x2D, which is located at F2D0.

If I store the index as 0xD2 instead, then I can generate the room sprite pointer without needing 2 byte multiplication:
ldx roomId
lda roomSpriteIdx, x
pha
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
pla
and #$0F
clc      ; These two lines can also just be 
adc #$F0 ; ora #$F0
sta roomSprPtrH

I really like that one, but I think you meant three lines could be ORA #$F0 as:

ldx roomId
lda roomSpriteIdx, x
pha
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
pla
ora #$F0
sta roomSprPtrH

If you don't need X=roomId after this bit of code than you can use TAX TXA to save 3 cycles too:

ldx roomId
lda roomSpriteIdx, x
tax
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
txa
ora #$F0
sta roomSprPtrH

Using Y register instead for TAY TYA is also an option.

mzxrules · June 19, 2021

4 hours ago, Andrew Davie said:

A neat observation.

Just an aside: you can use the assembler to do the hard work for you in generating the swapped table - do not do it manually.

I'm definitely not doing it manually, that's for sure. But for my project, I'm using Python 3 instead of DASM to generate the final data that will be included in the rom. It offers me so much more flexibility with regards to how I manage my game's data than DASM does.

1 hour ago, Omegamatrix said:
I really like that one, but I think you meant three lines could be ORA #$F0 as:
ldx roomId
lda roomSpriteIdx, x
pha
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
pla
ora #$F0
sta roomSprPtrH
If you don't need X=roomId after this bit of code than you can use TAX TXA to save 3 cycles too:
ldx roomId
lda roomSpriteIdx, x
tax
and #$F0 ; a is now lower half of the pointer
sta roomSprPtrL
txa
ora #$F0
sta roomSprPtrH
Using Y register instead for TAY TYA is also an option.

The first one is rather clever. I can't use it in my own game though because I use the extra bits not used by the index to store other data.

For the second one, I already have it written as tax/txa in my own code, but you do bring up a good point; I hadn't realized that pha/pla take more cycles than tax/txa, presumably because pha/pla require memory access.

+Andrew Davie · June 19, 2021

1 hour ago, mzxrules said:

It offers me so much more flexibility with regards to how I manage my game's data than DASM does.

Maybe so, but skilled use of macros does offer a lot of flexibility too.

Omegamatrix · June 19, 2021

12 hours ago, mzxrules said:

The first one is rather clever. I can't use it in my own game though because I use the extra bits not used by the index to store other data.

For the second one, I already have it written as tax/txa in my own code, but you do bring up a good point; I hadn't realized that pha/pla take more cycles than tax/txa, presumably because pha/pla require memory access.

I also have thought about this since then, and while it won't work for your game I will mention that there is an illegal opcode option that is quicker, and uses less bytes:

ldx roomId
lda roomSpriteIdx, x
ldx #$F0
sax roomSprPtrL
ora #$F0
sta roomSprPtrH

azure · September 11, 2021

On 6/12/2005 at 4:08 PM, Alex H said:

Here's one I like:


; unsigned divide by 3

   sta    temp
   lsr
   lsr
   clc
   adc    temp
   ror
   lsr
   clc
   adc    temp
   ror
   lsr
   clc
   adc    temp
   ror
   lsr
   clc
   adc    temp
   ror
   lsr

Sorry for bumping an old thread, but I think the accuracy of this divide by 3 algorithm can be improved a lot by a simple fix. As it is it produces an incorrect result for all multiples of 3 greater than 0.

Just add 1 before each iteration, then the algorithm gives the correct answers up to 255/3. Only four iterations are actually necessary for 8-bit division, so there's a savings of 5 cycles. Five iterations will give accurate answers up to 1024/3 if extended into 16-bits.

    sta Temp    ; 3 (3)

    ; iteration 1: a = a * 1/4
    clc         ; 2 (5)
    adc #1      ; 2 (7)
    ror         ; 2 (9)
    lsr         ; 2 (11)

    ; iteration 2: a = a * 5/16
    sec         ; 2 (13)
    adc Temp    ; 3 (16)
    ror         ; 2 (18)
    lsr         ; 2 (20)

    ; iteration 3: a = a * 21/64
    sec         ; 2 (22)
    adc Temp    ; 3 (25)
    ror         ; 2 (27)
    lsr         ; 2 (29)

    ; iteration 4: a = a * 85/256
    sec         ; 2 (31)
    adc Temp    ; 3 (34)
    ror         ; 2 (36)
    lsr         ; 2 (38)

    ; iteration 5 is unnecessary for 8-bit division by 3

    ; iteration 5: a = a * 341/1024
    sec         ; 2 (40)
    adc Temp    ; 3 (43)
    ror         ; 2 (45)
    lsr         ; 2 (47)

The assembly can be verified here: https://skilldrick.github.io/easy6502/

I've written a simple test program in C, which implements the algorithm in 16-bits. It flags all incorrect results.

div3.c

div3.txt.zip

Edited September 11, 2021 by azure
Added .txt output

Omegamatrix · September 11, 2021

I've made a whole bunch of 8 bit unsigned integer division routines. I've posted them in my blog, and on Nesdev.

https://atariage.com/forums/blogs/entry/10805-unsigned-integer-division-routines/

For divide by 3 this is what I came up with:

;Divide by 3
;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #21
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

It's good for all values 0-255.

azure · September 12, 2021

2 hours ago, Omegamatrix said:

For divide by 3 this is what I came up with:

It's good for all values 0-255.

Hmm. That 21 addition is fascinating. I suppose you already knew, but 21 is the only number that works for that algorithm. +20 doesn't work for 252/3 and 255/3. And +22 doesn't work for 2/3 and 5/3. From what I can make of it, it's adding two adjustments to the summation series: +21/128 and it's also folding back in the carry. I'm not sure I can figure out what adjustment the carry is having, but it seems to make it all work for the full 8-bit range.

+batari · December 18, 2021

On 9/11/2021 at 4:27 PM, Omegamatrix said:
I've made a whole bunch of 8 bit unsigned integer division routines. I've posted them in my blog, and on Nesdev.

https://atariage.com/forums/blogs/entry/10805-unsigned-integer-division-routines/

For divide by 3 this is what I came up with:
;Divide by 3
;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #21
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
It's good for all values 0-255.

While trying to figure out a good divide by 3 in 16 bits I stumbled upon some possible explanations to why this one works!

First off, it is possible this is just doing a divide by 3 using the well known multiply and shift idea (don't divide by 3, multiply by 1/3!) that is, something like (x*(512/3))>>9 then mask out the lower 8 bits. I am not sure if this was the basis of the above code.

However, I learned something else that I don't think is directly related: For any division by an odd number in binary, there is exactly ONE number in the range (whether 8 bits, 16 bits, 32 bits or any number of bits for that matter), where if you multiply by that number, you get the same result as division. That is, each odd divisor has exactly one "magic number" you can multiply by and it functionally performs a divide.

For 8 bits, the magic number for 3 is 171 or $AB. And wouldn't you know it, the above looks a lot like a bitwise multiplication by $AB. Yeah, 512/3 is 171 so the number could have come from that.

I am not sure what the general formula for calculating the magic number. I used brute-force C code to figure out that for 3 it is 171 ($AB, or 0b10101011). For 5, the number seems to be 205 ($CD). For 7, it seems to be 183 ($B7). Note that you cannot get some of these numbers using the multiply and shift method above so this doesn't appear to be that simple. I wrote a quick bit of code to figure these out.

3's magic number: 171
5's magic number: 205
7's magic number: 183
9's magic number: 57
11's magic number: 163
13's magic number: 197
15's magic number: 239
17's magic number: 241
19's magic number: 27
21's magic number: 61
23's magic number: 167
25's magic number: 41

For example, for 12/3, do 12*171, then just look at the lower 8 bits, and you get 4. It should work for all numbers evenly divisible by 3.

Note that this algorithm, as I presented it, does not work correctly unless the numbers are evenly divisible. Perhaps the ADC #21 somehow corrects that problem. I am not exactly sure how that part works or how to tweak the other algorithms to correct that.

That all said, this idea is easily expandable to 16 bits, which was how I learned about this. I ran a few numbers for that:

3's magic number: 43691
5's magic number: 52429
7's magic number: 28087
9's magic number: 36409
11's magic number: 35747
13's magic number: 20165
15's magic number: 61167
17's magic number: 61681
19's magic number: 51739
21's magic number: 53053
23's magic number: 14247
25's magic number: 23593
27's magic number: 55827
29's magic number: 49717
31's magic number: 31711
33's magic number: 33761
35's magic number: 44939
37's magic number: 7085
39's magic number: 28567

If you needed to do a fast general 16-bit divide, you could put these numbers in a table then run a bitwise multiplication of the magic number.

As suggested this also does have the problem of not working right unless the numbers are evenly divisible. That was not a problem for what I was working on, as I didn't need to divide, but just a fast way to check if a number was evenly divisible on a CPU with a hardware multiplier but no divide.

There may be a way to tweak the idea to work with numbers that are not evenly divisible, but I am not sure what it is. I don't understand exactly how this all works, I am just stating something interesting I learned recently and I remembered this post.

Omegamatrix · December 18, 2021

Hello @batari

The math routines I wrote were all shift and add method with perbutations to make them shorter in bytes and execution if possible.

What you stumbled upon is reciprocal multiplication. Jones has a good article on it. Although a little off topic I will say I started a math file for the ARM games to make division by a unsigned integer easier for everyone to do:

// Example divide by 10:

   result = (dividend * DIV_BY_10) >> DIV_BY_10_SHIFT;

Here is the file. You'd have to look at the particular value to see if it is valid for the entire 16 bit range.

Math.h

+batari · December 18, 2021

1 hour ago, Omegamatrix said:
Hello @batari

The math routines I wrote were all shift and add method with perbutations to make them shorter in bytes and execution if possible.

What you stumbled upon is reciprocal multiplication. Jones has a good article on it. Although a little off topic I will say I started a math file for the ARM games to make division by a unsigned integer easier for everyone to do:
// Example divide by 10:

   result = (dividend * DIV_BY_10) >> DIV_BY_10_SHIFT;
Here is the file. You'd have to look at the particular value to see if it is valid for the entire 16 bit range.

Math.h 14.71 kB · 1 download

Nope, this is not reciprocal multiplication. The theory here is based on modular arithmetic.

Although many of the factors strangely turned out to be the same, this does not involve any shifting at all. This is literally only multiplication and the result is present in the lower 8 or 16 bits with no shifting involved at all.

As you found in math.h, reciprocal multiplication sometimes fails, and that is because it is an approximation. The theory I mentioned here never fails (for numbers evenly divisible) because according to theory, there is one and only one magic number for each odd divisor and this number is mathematically correct.

Since it only works for evenly divisible numbers, it's not as useful as reciprocal multiplication for general division but for some things (like a modulo operator on a CPU with no division instructions) it is very useful.

+batari · December 18, 2021

Here is the theory for those interested: https://en.wikipedia.org/wiki/Modular_multiplicative_inverse

I don't really understand how to do the math but I get what it's saying and can write code to brute-force magic numbers. I calculated the magic number for 7 in 32 bits: 3067833783.

That is, on a 32-bit machine, x/7 is the same as x*3067833783 for evenly divisible numbers. if 56/7=8, then 56*3067833783=8 (in mod 2^32). This number 3067833783 works for all evenly divisible values throughout the entire 32-bit range and it's mathematically correct and complete.

Omegamatrix · December 20, 2021

On 12/18/2021 at 2:53 AM, batari said:

Here is the theory for those interested: https://en.wikipedia.org/wiki/Modular_multiplicative_inverse

I don't really understand how to do the math but I get what it's saying and can write code to brute-force magic numbers. I calculated the magic number for 7 in 32 bits: 3067833783.

That is, on a 32-bit machine, x/7 is the same as x*3067833783 for evenly divisible numbers. if 56/7=8, then 56*3067833783=8 (in mod 2^32). This number 3067833783 works for all evenly divisible values throughout the entire 32-bit range and it's mathematically correct and complete.

Truthfully I don't understand the language either so I look around for examples. I made an excel sheet to play around with it.

Mod Test.xlsx

1) Using 8 bit as example you can make a table of (n*(2^8))+1, where I used a range of n of 0 to (2^8)-1. Looking at the results this range can probably be trimmed (like throwing out zero), but this is just for illustration.

2) Take the odd number divisor and and divide each number of the table starting from the lowest. Check each result, and stop at the first instance of an integer. It will be the magic number for that divisor within the masking of (2^8).

- An observation is that finding one magic number is actually finding 2. If 239 is the magic number for 15, then 15 is the magic number of 239.

- In the above example I am assuming it can easily be expanded to different bit sizes, 2^16, etc... though I didn't try.

- I ignored 1 as divisor, and (2^8)-1 = 255 as a divisor.

6502 Killer hacks

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members