catsfolly, posted July 12, 2012:

Here is a fractional multiply routine that I wrote for Air Hockey. It takes the number in register r3, which can be signed, and multiplies it by the number in r1 (which has to be less than 256). The result is divided by 256:

    (r1 * r3) / 256 -> r0

This is useful for multiplying numbers by sines or cosines, or for scaling numbers by arbitrary values. Feel free to use this routine (or optimize it) if you like...

Catsfolly

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;;; umult / 256
    ;;;
    ;;; unsigned multiply (r1 * r3) / 256  (r3 can be signed)
    ;;; r1 * r3, result in r0
    ;;; if r1 > 255, returns r3 value in r0
    ;;; r2, r3, r4 trashed
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ; shifted value in r3, r2
                    sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    swap    r1              ; move multiplier bits to top of word
                    beq     @@done
                    tstr    r1              ; test true sign bit
                    bmi     @@loop_entry

    @@loop:         sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    sll     r1,1            ; get next bit of multiplier
                    tstr    r1              ; test sign and zero
                    beq     @@done
                    bpl     @@loop

    @@loop_entry    ; add value in (r3, r2) to total in (r0, r4)
                    addr    r2,r4           ; add lsbits
                    adcr    r0              ; add carry if any
                    addr    r3,r0           ; add msbits
                    b       @@loop

    @@done          jr      r5
                    endp
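The shift/add structure of umultd256 can be sanity-checked away from the Intellivision. Here is a small Python model (my own sketch, not from the thread) that mirrors the loop: the multiplicand is arithmetic-shifted right once per multiplier bit, MSB first, and added into a 32-bit-style accumulator whenever the bit is set:

```python
def umultd256_model(mult8, signed16):
    """Model of the fractional multiply: (mult8 * signed16) >> 8.

    mult8    -- unsigned multiplier, 0..255 (r1)
    signed16 -- signed 16-bit multiplicand (r3)
    """
    assert 0 <= mult8 <= 255
    acc = 0
    m = signed16 << 16          # multiplicand in the high word, like r3:r2
    for bit in range(7, -1, -1):
        m >>= 1                 # sarc r3 / rrc r2 (Python >> is arithmetic)
        if mult8 & (1 << bit):
            acc += m            # addr r2,r4 / adcr r0 / addr r3,r0
    return acc >> 16            # high word of the accumulator, like r0
```

Since the shifted copy carries 16 fractional bits and at most 8 shifts ever happen, no bits fall off the bottom, and the model agrees exactly with the floor of (r1 * r3) / 256.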
intvnut, posted July 12, 2012:

Just a quick bit of feedback... You don't need the "tstr r1" in your @@loop, as the SLL will set the flags you need. :-)

Also, if you shift the multiplicand right rather than left, you can avoid the full 32-bit add as well, needing only the ADCR. This does mean you'll have to left-shift by 8 after the loop, but I think it may be a win overall. If you don't mind a shade more code size in the name of speed, you can have two versions of the inner loop:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r2              ; result in r0, r2
                    b       @@first

    @@loop_add:     sllc    r2              ;\__ shift 32-bit result left
                    rlc     r0              ;/
                    addr    r3,r2           ; add in multiplicand
                    adcr    r0              ; ...and carry
                    sarc    r1
                    bc      @@loop_add
                    beq     @@done

    @@loop_noadd    sllc    r2              ;\__ shift 32-bit result left
                    rlc     r0              ;/
    @@first:        sarc    r1
                    bc      @@loop_add
                    bneq    @@loop_noadd

                    ; when we get here, product r0:r2 needs to be
                    ; left-shifted by 8 more
    @@done          swap    r0
                    swap    r2
                    andi    #$FF00,r0
                    andi    #$00FF,r2
                    xorr    r2,r0
                    jr      r5
                    endp

I admit, I haven't tested this yet...
catsfolly, posted July 12, 2012:

(quoting intvnut:) "You don't need the 'tstr r1' in your @@loop, as the SLL will set the flags you need."

You're right! I guess when I tried it the first time (without the "tstr r1") I had some other error in the code, because I thought it didn't work. But I just tried it again with the tstr commented out and it worked fine.

Your approach is interesting - thanks for tackling the problem. I was trying to avoid the fixup at the end. It's a tradeoff - having 1 extra instruction in the "add loop" versus 5 instructions at the end. If the multiplier has exactly 5 bits set in it (out of a possible 8), then the "add loop" is executed five times, and the routines break even (more or less)... If fewer bits are set, my routine is faster. If more bits are set, your routine is faster... (IMHO)

Thanks for the feedback.

Catsfolly
intvnut, posted July 12, 2012:

(quoting catsfolly's break-even analysis above)

I admit, I threw that version together very quickly. :-)

At any rate, is the cross-over point really that far out? If you remove the TSTR in your loop, I count 34 cycles for a 0, 59 cycles for a 1. I count mine at 34 cycles also for a 0, but only 39 cycles for a 1. The merge at the end costs 34 cycles. Counterbalancing that is 1 less CLRR and 2 less shifts, for an outside-the-loop penalty of only 16 cycles. So, I should be faster in all cases, shouldn't I? Or am I counting wrong?

Staying closer to your original, you could rotate the loop like this:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ; shifted value in r3, r2
                    sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    swap    r1              ; move multiplier bits to top of word
                    beq     @@done
                    sllc    r1,1            ; test first bit of multiplier
                    bnc     @@loop_no_add

    @@loop_add:     ; add value in (r3, r2) to total in (r0, r4)
                    addr    r2,r4           ; add lsbits
                    adcr    r0              ; add carry if any
                    addr    r3,r0           ; add msbits

    @@loop_no_add:  sarc    r3,1            ; divide multiplicand by 2
                    rrc     r2,1            ; extend to lsb word
                    sllc    r1,1            ; get next bit of multiplier
                    bc      @@loop_add
                    bneq    @@loop_no_add

    @@done          jr      r5
                    endp

This would cost 45 cycles per 1 bit, and 34 cycles per 0, which if nothing else reduces the penalty of 1s by swapping the branch order.
catsfolly, posted July 13, 2012:

(quoting intvnut's cycle counts above)

Oops. I guess I hadn't counted all the cycles. I was just thinking that, theoretically, my approach would have just one extra add in the "add loop". (Which was true in your final "rearranged" version.) But, as you pointed out, my approach requires more setup instructions.

Still, it seems more "elegant" to me to avoid the fixup at the end. But faster is faster...

Also, with your approach we could leave off the fixup and have a general unsigned multiply routine. Maybe have different entry points to do the fixup or not.

I'll give your version a try. (The speed is not so critical for this routine, but why waste cycles?)

Thanks,

Catsfolly
catsfolly, posted July 13, 2012:

I think we can save a swap!

    ; when we get here, product r0:r2 needs to be left shifted by 8 more
    swap r0
    swap r2
    andi #$FF00,r0
    andi #$00FF,r2
    xorr r2,r0

could be:

    ; when we get here, product r0:r2 needs to be left shifted by 8 more
    andi #$00FF,r0
    andi #$FF00,r2
    xorr r2,r0
    swap r0
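Both merge sequences extract the same thing: bits 8..23 of the 32-bit product held in r0:r2. A quick Python check of the two instruction sequences (my own sketch, not from the thread) confirms they agree with each other and with a direct shift:

```python
def swap16(x):
    """SWAP: exchange the two bytes of a 16-bit word."""
    return ((x << 8) | (x >> 8)) & 0xFFFF

def merge_five_ops(r0, r2):
    """swap r0 / swap r2 / andi #$FF00,r0 / andi #$00FF,r2 / xorr r2,r0"""
    r0 = swap16(r0) & 0xFF00    # low byte of r0 into the high position
    r2 = swap16(r2) & 0x00FF    # high byte of r2 into the low position
    return r0 ^ r2

def merge_four_ops(r0, r2):
    """andi #$00FF,r0 / andi #$FF00,r2 / xorr r2,r0 / swap r0 (one SWAP saved)"""
    r0 = (r0 & 0x00FF) ^ (r2 & 0xFF00)   # combine first, with bytes crossed
    return swap16(r0)                    # one SWAP straightens both at once
```

The trick in the shorter version is that the XOR combines the two bytes while they are still in the "wrong" halves, so a single SWAP at the end straightens both at once.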
intvnut, posted July 13, 2012:

(quoting catsfolly's swap-saving version above)

Indeed!

BTW, I got thinking about it, and I realized there were some great routines put together by everyone on INTVPROG back when Arnauld and I were experimenting with "IntvWolf." You can probably search on that in the INTVPROG archives for more discussion details.

Here's a multiply routine that computes a 48-bit product between 32-bit and 16-bit fixed point values. I don't know who wrote it -- it may have been Carl. It costs only 49 cycles for a 1 and 37 for a 0 and has almost no setup overhead. It's rather impressive, and nearly as cost effective as the 8x16 => 24 code above.

    ; Compute 48-bit product (A * B), where A is 32 bits, and B is 16 bits.
    ; Return upper 32 bits in R0:R2.
    ;
    ; Note, because we right-shift 'A' through the multiply process, the
    ; final result is an approximation.
    ;
    ; R3:R0 == A
    ; R1    == B
    ;
    ; R0:R2 == (A * B) >> 16
    m321616 PROC
            CLRR    R2
            CLRR    R4
            B       @@test

    @@add:  CLRC
            RRC     R3
            RRC     R0
            ADDR    R0, R2
            ADCR    R4
            ADDR    R3, R4

    @@test: SLLC    R1, 1
            BC      @@add
            BEQ     @@done
            RRC     R3
            RRC     R0
            B       @@test

    @@done: MOVR    R4, R0
            JR      R5
            ENDP

And here's a plain 16x16 => 16 multiply I wrote that goes 2 bits at a time. I was pretty sure I had written such a routine, and it came to me while I was at work where I'd find it:

    ;; MPY16   Multiply signed 16-bit by signed 16-bit into 16-bit result.
    ;; MPYSU16 Multiply signed 16-bit by unsigned 16-bit into 16-bit result.
    ;; MPYU16  Multiply unsigned 16-bit by unsigned 16-bit into 16-bit result.
    ;;
    ;; R0 = 16-bit (signed for MPY16, MPYSU16; unsigned for MPYU16)
    ;; R1 = 16-bit (signed for MPY16; unsigned for MPYSU16, MPYU16)
    ;;
    ;; R2 = result
    ;;
    ;; Author: J. Zbiciak
    mpy16:  PROC
            TSTR    R1          ; is R1 negative?
            BPL     @@core      ;
            NEGR    R1          ;\__ toggle sign on R0 and force
            NEGR    R0          ;/   R1 to be +ve.
    mpysu16:
    mpyu16:
    @@core: CLRR    R2
            SARC    R2, 2
            RRC     R1, 2
            BOV     @@add_32
            BC      @@add_1
            B       @@test

    @@add_32:
            BNC     @@add_2
    @@add_3:
            ADDR    R0, R2
    @@add_2:
            ADDR    R0, R2
    @@add_1:
            ADDR    R0, R2
    @@test: SLL     R0, 2
    @@cont: SARC    R1, 2
            BOV     @@add_32
            BC      @@add_1
            BNEQ    @@test
            JR      R5
            ENDP

And for 16x16 => 32-bit and 32x16 => 32, here's one I wrote that drops back to 1 bit at a time:

    ;; MPY   Multiply signed 16-bit by signed 16-bit into signed 32-bit result.
    ;; MPYIL Multiply 32-bit by signed 16-bit into signed 32-bit result.
    ;;
    ;; R3:R0 = 32-bit (signed 16-bit in R0 for MPY)
    ;; R1    = signed 16-bit
    ;;
    ;; R4:R2 = result
    ;;
    ;; R3 trashed
    ;;
    ;; Author: J. Zbiciak
    ;;
    ;; (Fun fact: Routine named after instruction on C6200 DSP.)
    mpy     PROC
            CLRR    R3          ;\
            TSTR    R0          ; |__ sign extend R0 into R3.
            BPL     mpyil       ; |
            DECR    R3          ;/

    mpyil:  CLRR    R4
            CLRR    R2
            TSTR    R1
            BPL     @@first
            NEGR    R1          ; \
            COMR    R3          ; |__ toggle R3:R0's sign to force R1 +ve
            NEGR    R0          ; |
            ADCR    R3          ; /
            B       @@first

    @@add:  ADDR    R0, R2
            ADCR    R4
            ADDR    R3, R4

    @@test: SLLC    R0, 1
            RLC     R3, 1

    @@first:
            SARC    R1, 1
            BC      @@add
            BNEQ    @@test
            JR      R5
            ENDP

Since there's other potentially interesting code in there, I've attached the ZIP archive. I'm not 100% sure it's the most recent version, but it's still interesting nonetheless.

IntvWolf.zip
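The 2-bits-at-a-time idea behind mpy16 is easier to see with the flag tricks stripped away. This Python sketch (an illustration of the radix-4 technique for the unsigned case, not a register-level translation of the routine) consumes two multiplier bits per iteration and adds 0, 1, 2, or 3 copies of the current multiplicand:

```python
def mpy_u16_radix4(a, b):
    """Unsigned 16x16 => 16 multiply, two multiplier bits per iteration.

    Each step adds 0..3 copies of the multiplicand (the @@add_1/2/3
    chain), then shifts the multiplicand left by 2 (SLL R0, 2).
    The result wraps mod 2**16, like the 16-bit register it lands in.
    """
    result = 0
    while b:
        result += a * (b & 3)   # 0..3 adds of the multiplicand
        a = (a << 2) & 0xFFFF   # multiplicand left by two, 16-bit wrap
        b >>= 2                 # consume two multiplier bits
    return result & 0xFFFF
```

The assembly version dispatches on the two bits with the overflow and carry flags set by SARC R1, 2, falling through the add chain to do one, two, or three ADDRs; the arithmetic is the same as above.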
Rev, posted July 13, 2012:

This thread is gettin' me all hot and bothered...
catsfolly, posted July 13, 2012:

(quoting intvnut on the INTVPROG/IntvWolf routines above)

Thanks. I didn't know about these routines. In the intv library code there are some divide routines, and macros for multiplying by fixed constants, but I didn't see any general-purpose multiply routines, so I wrote my own.

I'll check out the archive...

Catsfolly
intvnut, posted July 13, 2012:

(quoting catsfolly's note that the library has no general-purpose multiply)

You're right, there isn't. I had actually started writing a library of multiplies (on paper), and in fact, that's exactly what I was doing when I first heard the Twin Towers got hit, sitting outside a Starbucks, waiting to meet a friend before work. I guess I didn't get back to that notebook...
Carl Mueller Jr, posted July 13, 2012:

Yeah, I did notice the date on most of your code – September 2001. I was starting to wonder what had happened during that month to inspire you to write so much code! There must be quite a story.
Rev, posted July 14, 2012:

(quoting intvnut's Twin Towers recollection above)

The Twin Towers is the "where were you" of our generation. I personally was in downtown Atlanta on a construction site in the middle of tall buildings; after about an hour or two the site was empty!
intvnut, posted July 14, 2012:

(quoting Carl's comment on the September 2001 dates)

Actually, that was also around the time I started working on what became Space Patrol, shortly after the successful release of 4-Tris. We had also just announced samples of the first processor I contributed instructions to, the C6416 DSP, like the day before. (Tom Engibous rang the closing bell on Wall Street that day.) I was in the mood to write a library to go with SDK-1600, and develop games and such.

Of course, life being what it was, my priorities shifted for more reasons than 9/11. That whole "life away from the computer" thing. But, yeah, I spent the rest of that month kinda wrapping up what I had typed in, and not much on finishing and copying in what was in that notebook.
catsfolly, posted July 14, 2012:

I tried Joe's version of the multiply routine, but it didn't work (he said it was untested, but the code looked so nice and sensible that I thought it had a good chance of running...).

Turns out you can't just shift the result when you find a zero in the multiplier. You have to shift the multiplicand (at least that is what I did to get it to work...).

Example walkthrough (old way), if the multiplier is %00000100:

    result = 0
    lsb of multiplier is zero, so result = result * 2 = 0
    next bit of multiplier is zero, so result = result * 2 = original result * 4 = 0
    next bit of multiplier is 1, so result = result * 2 + multiplicand = 1 * multiplicand

Example walkthrough (new way), if the multiplier is %00000100:

    result = 0
    lsb of multiplier is zero, so multiplicand = multiplicand * 2
    next bit of multiplier is zero, so multiplicand = multiplicand * 2 = original multiplicand * 4
    next bit of multiplier is 1, so result = result + multiplicand = original multiplicand * 4

Here is the version that worked:

    umultd256       proc
                    cmpi    #255,r1
                    bnc     @@r1ok          ; branch if 255 or less
                    movr    r3,r0           ; r1 is 256 or greater or negative
                    jr      r5              ; done

    @@r1ok          clrr    r0              ; clear result
                    clrr    r4              ; result in r0, r4
                    clrr    r2              ;\
                    tstr    r3              ; |__ sign extend r3 into r2.
                    bpl     @@first         ; |
                    decr    r2              ;/
                    b       @@first

    @@loop_add:     addr    r3,r0           ; add in multiplicand
                    adcr    r4              ; ...and carry
                    addr    r2,r4           ;

    @@loop_noadd    sllc    r3              ;\__ shift 32-bit multiplicand left
                    rlc     r2              ;/

    @@first:        sarc    r1
                    bc      @@loop_add
                    bneq    @@loop_noadd

                    ; when we get here, product r4:r0 needs to be
                    ; left-shifted by 8 more
    @@done          movr    r4,r2
                    andi    #$FF00,r0
                    andi    #$00FF,r2
                    xorr    r2,r0
                    swap    r0
                    jr      r5
                    endp
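The corrected data flow -- multiplier consumed LSB-first, 32-bit multiplicand shifted left, byte-merge fixup at the end -- can be modeled in Python. This is my own re-creation mirroring the walkthrough above, not code from the thread:

```python
def umultd256_fixed_model(mult8, signed16):
    """Model of the working routine: result is the 16-bit two's-complement
    value of (mult8 * signed16) / 256, i.e. bits 8..23 of the 32-bit product.

    mult8    -- unsigned multiplier, 0..255 (r1)
    signed16 -- signed 16-bit multiplicand (r3, sign-extended into r2)
    """
    acc = 0
    m = signed16        # sign-extended 32-bit multiplicand (r2:r3)
    b = mult8
    while b:
        if b & 1:
            acc += m    # @@loop_add: addr r3,r0 / adcr r4 / addr r2,r4
        m <<= 1         # @@loop_noadd: sllc r3 / rlc r2
        b >>= 1         # sarc r1 peels off the next multiplier bit
    prod32 = acc & 0xFFFFFFFF            # 32-bit product in r4:r0
    return (prod32 >> 8) & 0xFFFF        # the movr/andi/xorr/swap fixup
```

For a negative multiplicand the returned word is the two's-complement encoding of the (floored) quotient, which is exactly what the register pair holds after the fixup.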
intvnut, posted July 14, 2012:

(quoting catsfolly's corrected version above)

I see my problem, and yeah, you're right. If you only want to iterate once per multiplier bit (in the 8-bit multiplier), then with the 8-bit multiplier shifting right, the 16-bit multiplicand has to shift left. Alternately, you could shift the product being accumulated right, but then you'd have a variable fixup at the end.

See, I told you I wrote that too quickly. :-)
intvnut, posted July 14, 2012:

I've gone mad with speed! Specifically, I decided to try writing a "radix-16" version of the fractional multiply routine. Now, this one only does unsigned 16 x unsigned 8 => unsigned 16, but it's trivial to wrap it to handle signed x unsigned.

I tested it over the full range of inputs ($0000 x $00 through $FFFF x $FF), and it passed every test. (The test program ran for 11,313,832,018 cycles... good thing I used --macho=0, since that's 3.5 hours in real life.)

Anyway, the resulting code averaged about 385.25 cycles/call across all those inputs. It seems like it can still go faster. Anyone want to give it a go? About 264 cycles of that 385 is in the code outside the shift/add, so I think there's a few opportunities in there...

mpy_u8_u16_q8.zip
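One way to read "radix-16" here: consume the 8-bit multiplier a nibble at a time, using a table of the 16 small multiples of the multiplicand, so two lookups replace up to eight shift/add iterations. This Python sketch illustrates the idea only; it is not a translation of the attached assembly, whose internals aren't shown in the thread:

```python
def mpy_u8x16_q8(mult8, u16):
    """Radix-16 sketch of the fractional multiply: (mult8 * u16) >> 8, unsigned.

    mult8 -- unsigned multiplier, 0..255
    u16   -- unsigned 16-bit multiplicand
    """
    table = [u16 * k for k in range(16)]   # the 16 multiples 0*u16 .. 15*u16
    hi, lo = mult8 >> 4, mult8 & 0xF
    prod = (table[hi] << 4) + table[lo]    # (hi*16 + lo) * u16 == mult8 * u16
    return prod >> 8
```

The flat timing the benchmark shows matches this structure: the per-call cost is dominated by building the table (the setup), while the two lookups cost the same no matter how many multiplier bits are set.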
intvnut, posted July 15, 2012:

Ok, so this is a tad depressing. I compared the updated version of catsfolly's code above (with a minor fix, changing CMPI #255 to CMPI #256) to mine and benchmarked it over the full range of multiply inputs also. It comes in at 399.7 cycles average. My fancy radix-16 multiply hardly seems worth it for only 15 cycles gain. Hmmm....
catsfolly, posted July 15, 2012:

Well, you certainly are testing the code this time! I'm surprised that the radix-16 version didn't do better. I wonder what the fastest and slowest times were?

I would guess that the "simple" version (I can't call it my version, because you basically wrote it (based on my version) and then I fixed it) would be fastest when just one bit was set (in the 8-bit multiplier), and take longer when many bits were set. On the other hand, the radix version would tend to take about the same amount of time for all cases, because a lot of the time is in the setup.

The "simple" version works for me. Thanks.

Catsfolly
intvnut, posted July 15, 2012:

(quoting catsfolly's surprise about the radix-16 timing)

As was I! Actually, when we looked at multiplies before, almost a decade ago, I think we came to a similar conclusion -- as you increase radix, you don't really get faster on average, you just flatten the cycle count between the best/worst cases. I'm usually good at remembering things, but some neurons fire better than others. In any case, I haven't forgotten Michael Abrash's admonition to measure, measure, measure. :-) (I read the Zen of Assembly Language, Zen of Code Optimization and Zen of Graphics Programming cover to cover back when they came out. Great books.)

(catsfolly:) "I wonder what the fastest and slowest times were?"

Good question. I don't actually have a good way of measuring that right now. My way of measuring the average was pretty crude: jzIntv, with history logging enabled, will dump out a profile of cycles spent in each instruction. I just summed them all up and divided by 2^24. Maybe I should consider adding some sort of proper profiler to jzIntv someday.

If you think about it, there are twice as many multiplicands of bit length N as there are of length N-1. So, in the averages, longer multiplicands will get weighted more heavily. If your inputs are evenly distributed, this is OK. But, if shorter multiplicands are more common, then the "simple" version really has a chance to pull ahead, since the radix-16 execution time is somewhat flat, but the "simple" multiply's time is proportional to the magnitude of the multiplicand. I'm going to guess that with a physics simulation, the distribution looks closer to uniform, or maybe even biased toward longer lengths if the 8-bit value comes from a sin/cos table.

Another interesting observation: every multiplicand (except 0) has at least one '1' in it: the leading 1. If you assume a uniform distribution of multiplicands, though, the bits trailing the leading 1 will be split 50/50 between 0s and 1s. So, the looping cost of a multiplicand of length N averages out to:

    (N-1) * (cost_of_one + cost_of_zero) / 2 + cost_of_one

I first noticed the "leading one bias" when analyzing divide routines with Carl. (Which reminds me: Carl, I still need to look at IDIV9.) It suggests that you might get a marginal gain if you can make 1s cheaper than 0s in the multiply loop. For example, initialize your product to 2^N, and subtract for every 0. This is only likely to pay off if you have a floating point representation, where the leading 1 is in a fixed position at the left of the mantissa. As I recall, when I tried to take advantage of this in the divide routines I wrote, Carl was always able to beat me with the more straightforward approach. So much for number theory. Benchmarks can be quite humbling. :-)

(catsfolly:) "I would guess the 'simple' version would be fastest when just one bit was set (in the 8-bit multiplier), and take longer when many bits were set."

Yep.

(catsfolly:) "On the other hand, the radix version would tend to take about the same amount of time for all cases, because a lot of the time is in the setup. The 'simple' version works for me."

Yeah, there really isn't a compelling reason to use the radix-16 version at the moment, unless folks can see a way to get a lot of cycles out. Unrolling it would eliminate a fair bit of overhead, perhaps giving you back 40-50 cycles, but at a large code-size cost. If you're hurting for cycles, that's a tradeoff worth making. Most games won't need it, though.

In Space Patrol, had I needed such a multiply, I would have unrolled, etc., for the speed. I needed every cycle I could get. I joked at the time that I must be borrowing cycles from all the other microprocessors in the room, because I was astonished I managed to make it all fit. (As it stands, it doesn't actually all fit all the time. There are some creative hacks in there that hide the very occasional times when it doesn't fit, preventing glitches that would call attention to this fact. The cycle loading is very "spiky", and there's enough cycles if you average certain things over a few frames.)
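The leading-one argument is easy to check by brute force. Using the cycle counts quoted earlier for the rotated loop (45 cycles per 1 bit, 34 per 0), this Python check (my own, not from the thread) compares the closed-form average against an exhaustive count over every multiplier of a given bit length:

```python
def avg_closed_form(n, cost_one=45, cost_zero=34):
    """(N-1) * (cost_of_one + cost_of_zero) / 2 + cost_of_one, from the post."""
    return (n - 1) * (cost_one + cost_zero) / 2 + cost_one

def avg_exhaustive(n, cost_one=45, cost_zero=34):
    """Average loop cost over all n-bit multipliers (leading bit = 1):
    each of the n loop iterations costs cost_one or cost_zero per bit."""
    ms = range(1 << (n - 1), 1 << n)
    total = 0
    for m in ms:
        ones = bin(m).count("1")                  # guaranteed >= 1: leading bit
        total += ones * cost_one + (n - ones) * cost_zero
    return total / len(ms)
```

For full 8-bit multipliers both give 321.5 cycles of loop time: one guaranteed 1 plus 3.5 expected 1s and 3.5 expected 0s among the seven trailing bits.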