+SvOlli Posted October 31, 2017 Share Posted October 31, 2017 Hello! For hacking some nice demo effect, I need to convert a sprite position (range 0-95) to a cpu cycle offset. This means division by 3. I also need to divide the range into three areas. For technical reasons, I chose 12 cycles = 36 "positions". So I need to divide by 36 as well. How should I do this? An idea would be to have a table of 96 bytes storing all possible results: the lower 0-4 bits contain the result of div3, bits 5-6 div36, so that I can AND out or LSR the desired result rather easy and fast. But that costs precious ROM. Creating the table in RAM also is no option, because I already need >80 bytes for other data. Right now I'm stuck with a subtraction, counting how many times I can subtract the desired value like this: lda value ldx #$ff sec loop36: inx sbc #36 bcs loop36 stx div36 lda value ldx #$ff sec loop3: inx sbc #3 bcs loop3 stx div3 This costs 24 bytes in total, but also costs quite some computing time, worst case should be 263 cpu cycles which is ~3.5 raster lines. Right now I can live with that, but my mind keeps banging if there's still an alternative worth checking out. One thing I found saved one byte: lda value ldx #$ff sec loop3: inx sbc #3 bcs loop3 stx div3 txa ldx #$ff sec loop12: inx sbc #12 bcs loop12 stx div36 Just divide by 3 and after that by 12. Gets to the same result, saves one byte by using "txa", but still needs the same amount of cycles. For now I focused on the div3 since that's where most cpu time is spent on a bad case. I looked around the internet already and found the idea of multiplying by 256/3 and throw away the low byte. That approach does not work entirely as the result sometimes gets "rounded up", which would be the death of my cycle exact code. Multiplying by 3 is easy: lda value asl clc ; only needed if value could be > 127 adc value sta mul3 So I was hoping either to come up wirh or find something similar like this for dividing by 3, but no luck yet. Any ideas? p.s.: i should be able modify my code on the 36 positions part to use there 33 positions instead, if someone got a better idea for dividing by 33 (or 11) instead of 36 (or 12). Other numbers are technically not possible. 1 Quote Link to comment Share on other sites More sharing options...
Kylearan Posted October 31, 2017 Share Posted October 31, 2017 (edited) Just thinking out loud here without testing it: The division into three areas can be done by manual comparisons (2 cmp instructions). If you do that first, then you could re-use the result for the division by 3, i.e. you only need a look-up table of size 12 (0, 3, ..., 33) and then add a base value depending in your previous cmp results (+0, +36, +72). This should be very fast and might be light on ROM. EDIT: Scratch that, this is nonsense - I should refrain from "thinking out loud" and think things through before posting instead. EDIT 2: Or maybe not. (I'm tired today...) After the manual comparisons for the div 36, you could subtract 36 or 72 from the sprite position if it's in area 2 or 3, and then use the smaller, 12 byte look-up table for the div 3 and add 12 or 24 again to the result. Edited October 31, 2017 by Kylearan Quote Link to comment Share on other sites More sharing options...
Nukey Shay Posted October 31, 2017 Share Posted October 31, 2017 Divide by 3 (Omegamatrix) ;18 bytes, 30 cycles sta temp lsr adc #$15 lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr 3 Quote Link to comment Share on other sites More sharing options...
Thomas Jentzsch Posted October 31, 2017 Share Posted October 31, 2017 For hacking some nice demo effect, I need to convert a sprite position (range 0-95) to a cpu cycle offset. You could do something like X * 1/4 + X * 1/16 + X * 1/64... or (X * 16 + X * 4 + X * 1) / 64 (rounding might become a problem). But why would you have to do that division anyway? Quote Link to comment Share on other sites More sharing options...
+SvOlli Posted October 31, 2017 Author Share Posted October 31, 2017 You could do something like X * 1/4 + X * 1/16 + X * 1/64... or (X * 16 + X * 4 + X * 1) / 64 (rounding might become a problem). Yes, that was one of the approaches I tried and ran into rounding problems. Divide by 3 (Omegamatrix) ;18 bytes, 30 cycles sta temp lsr adc #$15 lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr Nukey, thanks. I think that was one of the solutions I was looking for evaluation. I found another shortcut lda value sec ldx #$ff loop36: inx sbc #36 bcs loop36 stx div36 adc #36 ldx #$ff loop3: inx sbc #3 bcs loop3 txa ldx div36 adc tab12,x ; tab12: .byte 0,12,24 sta div3 I'm using only the remainder of div36 for div3, to lower the number of loops needed to 12 max. This lower the number of cycles in the worst case from 263 to 129 by increasing ROM neede from 24 to 32. Fair tradeoff. This is something I can work with, if I don't figure out anything better. But if I combine the subtract loop for div36 and omegamatrix' code for div3, I'll end up with 34 bytes and worst case of 65 cycles. Seems like the better deal. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 1, 2017 Share Posted November 1, 2017 29 bytes, and 42 cycles (constant). Good for dividing values of 0 to 107. ;Divide by 3 ;18 bytes, 30 cycles sta temp lsr adc #21 lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr ;Continue routine for divide by 36 ;11 bytes 12 cycles tax ; X = div3 lda #0 cpx #12 adc #0 cpx #24 adc #0 ; A = div36 More of my unsigned integer division routines can be found here. 1 Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 1, 2017 Share Posted November 1, 2017 I took a look, and for the the divide by 3 can be shortened for a limited range. So now it's 27 bytes and 40 cycles (constant), and valid over all for dividing values of 0 to 107. ;Divide by 3 (can be used for 0-128 range) ;16 bytes, 28 cycles sta temp lsr lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr ;Continue routine for dividing by 36 (good for 0-107 only) ;11 bytes 12 cycles tax ; X = div3 lda #0 cpx #12 adc #0 cpx #24 adc #0 ; A = div36 Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 1, 2017 Share Posted November 1, 2017 Another optimization, a trick I discovered long ago and forgotten for this particular situation. 26 bytes and 40 cycles. ;Divide by 3 (can be used for 0-128 range) ;16 bytes, 28 cycles sta temp lsr lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr ;Continue routine for dividing by 36 (good for 0-107 only) ;10 bytes 12 cycles tax ; X = div3 lda #0 cpx #12 rol cpx #24 adc #0 ; A = div36 Quote Link to comment Share on other sites More sharing options...
+SvOlli Posted November 1, 2017 Author Share Posted November 1, 2017 (edited) And by switching A and X for the results, I can even optimize that to 24 bytes and 38 cycles: ; prepare (division code suggested by omagematrix) lda xpos ;Divide by 3 (can be used for 0-107 range) ;16 bytes, 28 cycles sta temp lsr lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr ; A = div3 ;Continue routine for dividing by 36 (good for 0-107 only) ;8 bytes 10 cycles ldx #0 cmp #12 inx cmp #24 inx ; X = div36 Which turns out even better this way, because I have to run a subtraction with the div3 value. Edit: never write posts just after getting out of bed... The "inx" don't work of course... and fixing that with "bcc"s wil result in code that's slower and takes more ROM. Edited November 1, 2017 by SvOlli Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 3, 2017 Share Posted November 3, 2017 On a hunch I tried another optimization, and it saves 3 cycles. It just makes the 0-95 criteria. 26 bytes, and 37 cycles overall. ;Divide by 3 (good for 0-95 range only) ;16 bytes, 25 cycles cmp #33 adc #0 sta temp lsr lsr adc temp ror lsr adc temp ror lsr ;Continue routine for dividing by 36 (good for 0-107 range only) ;10 bytes 12 cycles tax ; X = div3 lda #0 cpx #12 rol cpx #24 adc #0 ; A = div36 1 Quote Link to comment Share on other sites More sharing options...
+SvOlli Posted November 3, 2017 Author Share Posted November 3, 2017 Thanks again. But I'm sticking with with "26 bytes and 40 cycles" versions, because what comes in via A has been calculated and needs to be stored to RAM anyway for later use. And since the "adc temp" doesn't change anything at "temp", I don't need a temporary storage, but just can "reuse" the original one. Quote Link to comment Share on other sites More sharing options...
+SvOlli Posted November 3, 2017 Author Share Posted November 3, 2017 The "rol" instead of the first "adc #$00" is a very nice trick to save 1 byte... included. Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 11, 2017 Share Posted November 11, 2017 I'm going to present one more version. It saves 3 bytes but costs 16 more cycles. Use whatever routine that makes the most sense for the situation. This one is 23 bytes and 56 cycles (constant cycles) overall. ;Divide by 3 (can be used for 0-128 range) ;13 bytes, 44 cycles sta temp lsr lsr ldx #3 .loopFinishDiv3: adc temp ror lsr dex bne .loopFinishDiv3 ;Continue routine for dividing by 36 (good for 0-107 only) ;10 bytes 12 cycles tax ; X = div3 lda #0 cpx #12 rol cpx #24 adc #0 ; A = div36 Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 11, 2017 Share Posted November 11, 2017 Also.... I always post and realize something else. If Y is available save 1 more byte. So 22 bytes and 56 cycles (constant). ;Divide by 3 (can be used for 0-128 range) ;13 bytes, 44 cycles sta temp lsr lsr ldy #3 .loopFinishDiv3: adc temp ror lsr dey bne .loopFinishDiv3 ;Continue routine for dividing by 36 (good for 0-107 only) ;9 bytes 12 cycles tax ; X = div3 tya ; A=0 cpx #12 rol cpx #24 adc #0 ; A = div36 Quote Link to comment Share on other sites More sharing options...
Omegamatrix Posted November 12, 2017 Share Posted November 12, 2017 Here's a 21 byte version. It adds another 9 cycles but hey if you got the cycles to spare then it's another byte saved. Use whatever routine works. 21 bytes and 65 cycles (constant) overall. ;Divide by 3 (can be used for 0-128 range) ;12 bytes, 53 cycles ldy #4 sta temp .byte $0C ; NOP skip 2 bytes .loopFinishDiv3: adc temp lsr lsr dey bne .loopFinishDiv3 ;Continue routine for dividing by 36 (good for 0-107 only) ;9 bytes 12 cycles tax ; X = div3 tya ; A=0 cpx #12 rol cpx #24 adc #0 ; A = div36 Quote Link to comment Share on other sites More sharing options...
vidak Posted November 22, 2017 Share Posted November 22, 2017 This thread is amazing!! Quote Link to comment Share on other sites More sharing options...
Just Jeff Posted June 9, 2019 Share Posted June 9, 2019 Divide by 3 (Omegamatrix) ;18 bytes, 30 cycles sta temp lsr adc #$15 lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr Hello.. I'm trying to divide the 160 color clocks by 6. If I wanted to divide by 6, should I just put an lsr right at the top of this (before storing to temp)? Should I do anything with the carry? Quote Link to comment Share on other sites More sharing options...
bogax Posted June 9, 2019 Share Posted June 9, 2019 Hello.. I'm trying to divide the 160 color clocks by 6. If I wanted to divide by 6, should I just put an lsr right at the top of this (before storing to temp)? Should I do anything with the carry? Can't say I've studied on it any, but I'd just add an lsr at the end, that way you keep the most bits for the divide by 3 1 Quote Link to comment Share on other sites More sharing options...
Windless Posted June 12, 2019 Share Posted June 12, 2019 (edited) Could this work ? After line 192, you VBLANK, then : -init Y at zero (2) and A to proper collision mask (2) -position a missile at pixel X (11 cycles) -position a ball at start of line (3 cycles) -set HMBL at 3 (5 cycles) -at each new scanline, HMOVE (3 cycles) and increment Y (2) until collision detected (6/5 cycles) total : 28 + result * 11, max about 380 cycles, but definitively more fun. Note you can divide many number at once using this techic, or divide a number by severals number, saving many cycles. Edited June 12, 2019 by Windless Quote Link to comment Share on other sites More sharing options...
Just Jeff Posted June 14, 2019 Share Posted June 14, 2019 Divide by 3 (Omegamatrix) ;18 bytes, 30 cycles sta temp lsr adc #$15 lsr adc temp ror lsr adc temp ror lsr adc temp ror lsr Well I've looked at this multiple times this week, manually worked through it with various values and it always works. I made a spreadsheet to try to observe what happens, why and when.. and I still have no solid idea why this works. My guess is that the adc #$15 (%00010101) is seeding the value to affect the carry for some reason and that repeatedly dividing by four, then adding the original amount back in, somehow works out to dividing by 3. Also thinking that binary 3 behaves differently than decimal 3, being that its a fraction of the base in decimal, but 1.5 times the base in binary. Would someone care to comment it so I might get a better grip on it? Also, did I get any of it right? Thanks Quote Link to comment Share on other sites More sharing options...
bogax Posted June 15, 2019 Share Posted June 15, 2019 (edited) I didn't try to work it out but.. 1/3 is 1/4 + 1/16 + 1/64 + 1/256 ... You're multiplying by reciprocal 3 $15 is the first three of those That will (roughly) have the effect of rounding up if the fractional part is > .66 ie you're multiplying by .3333... and adding .333 (approximately, I've not accounted for truncation and rounding) Edited June 15, 2019 by bogax Quote Link to comment Share on other sites More sharing options...
Just Jeff Posted June 16, 2019 Share Posted June 16, 2019 (edited) 19 hours ago, bogax said: 1/3 is 1/4 + 1/16 + 1/64 + 1/256 ... Ahh thanks, that was the missing piece. I saw Thomas mention something like that but it didn’t click because I had no idea that was a thing. Everyone here seems to know it without having to say. All those years of math in school and I don’t recall ever seeing it. Here’s my interpretation: Div6: SUBROUTINE ; divide player position by 6 ; rors are necessary for larger numbers sta temp ; Store the value for recycling. 1/3 is the convergence of 1/4 + 1/16 + 1/64 + 1/256... lsr ; Shift significant bits (of 2 bit pairs) over 1 adc #$15 ; = %00010101. Round up if the 2 bit pair was 3? lsr ; Now we are at 1/4 the original number. The results here will be divided 3 more times for the 1/256 portion adc temp ; Add that to the original number for round 2 (Original # plus 1/4) ror ; lsr ; Divide that by 4. The results here will be divided 2 more times for the 1/64 portion adc temp ; Add that to the total number for round 3 (Original # plus 1/4 plus 1/16) ror ; lsr ; Divide that by 4. The results here will be divided 1 more time for the 1/16 portion adc temp ; Add that to the total number for round 4 (Original # plus 1/4 plus 1/16 plus 1/64) ror ; lsr ; Divide that by 4 for final result. Original can be discarded now. Result is 1/4 plus 1/16 plus 1/64 plus 1/256 ;Make it divide by 6 lsr ; Edited June 16, 2019 by BNE Jeff Fractions were wrong Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.