Jump to content
IGNORED

Fast multiply, division and remainder routines


Recommended Posts

Hi everyone :D

 

I've been playing with Intellivision assembler code for the near release of IntyBASIC v1.1 and I've came with these pretty optimum routines for multiplication, and unsigned division and remainder.

 

I've tried to optimize speed of the routines for size of operands instead of a fixed cycle operation that always use same time.

 

Feel free to use in your own projects. ;)

; Fast multiplication
; Assembly code by Oscar Toledo G. Jul/10/2015
; Released to public domain.

    ; Entry: R0 = Multiplicand, R1 = Multiplicator
    ; Output: R0 = Result
    ; Best case: 44 cycles, worst case 702 cycles.
uf_mul16:    PROC
    CLRR R4
@@0:
    CLRC
    RRC R0,1
    BNC @@1
    ADDR R1,R4
@@1:
    ADDR R1,R1
    TSTR R0
    BNE @@0
    MOVR R4,R0
    JR R5
    ENDP

And here is the other routine:

; Fast unsigned division/remainder
; Assembly code by Oscar Toledo G. Jul/10/2015
; Released to public domain.

    ; Ultrafast unsigned division/remainder operation
    ; Entry: R0 = Dividend
    ;        R1 = Divisor
    ; Output: R0 = Quotient
    ;         R2 = Remainder
    ; Worst case: 6 + 6 + 9 + 496 = 517 cycles
    ; Best case: 6 + (6 + 7) * 16 = 214 cycles

uf_udiv16:    PROC
    CLRR R2        ; 6
    ADDR R0,R0    ; 6
    BC @@1        ; 7/9
    ADDR R0,R0    ; 6
    BC @@2        ; 7/9
    ADDR R0,R0    ; 6
    BC @@3        ; 7/9
    ADDR R0,R0    ; 6
    BC @@4        ; 7/9
    ADDR R0,R0    ; 6
    BC @@5        ; 7/9
    ADDR R0,R0    ; 6
    BC @@6        ; 7/9
    ADDR R0,R0    ; 6
    BC @@7        ; 7/9
    ADDR R0,R0    ; 6
    BC @@8        ; 7/9
    ADDR R0,R0    ; 6
    BC @@9        ; 7/9
    ADDR R0,R0    ; 6
    BC @@10        ; 7/9
    ADDR R0,R0    ; 6
    BC @@11        ; 7/9
    ADDR R0,R0    ; 6
    BC @@12        ; 7/9
    ADDR R0,R0    ; 6
    BC @@13        ; 7/9
    ADDR R0,R0    ; 6
    BC @@14        ; 7/9
    ADDR R0,R0    ; 6
    BC @@15        ; 7/9
    ADDR R0,R0    ; 6
    BC @@16        ; 7/9
    JR R5

@@1:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@2:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@3:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@4:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@5:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@6:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@7:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@8:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@9:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@10:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@11:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@12:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@13:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@14:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@15:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
@@16:    RLC R2,1    ; 6
    CMPR R1,R2    ; 6
    BNC $+3        ; 7/9
    SUBR R1,R2    ; 6
    RLC R0,1    ; 6
    JR R5

    ENDP
  • Like 3
Link to comment
Share on other sites

I can confirm that these present no major problems (I tend to commit crimes when I use math, so I get hit first) :D

 

Oscar, are these good for "any" size operands? And does this get rid of the "optimized for 2,4,8 etc" operations? ie: can we just go crazy with everything now? :)

 

In fact: no.

 

IntyBASIC implements different levels of optimizations trying to get most speed of everything instead of a single solution.

 

* Multiplication by constant: optimized by powers of 2 and also for constants <= 127.

* Other multiplication uses now the new shifting routine (except if you activated JLP support that is faster)

* If you use 'var' by 'var' multiplication it will use a routine authored by intvnut that is amazingly faster for arbitrary numbers (not used if you activated JLP)

 

For division and remainder.

 

* Optimized by powers of 2.

* Still repeated substraction for small constants and other expressions. (pretty good if your maximum number for dividend goes up to 20 times your divisor) (also it will remove this and use JLP if activated)

* If you use 'var' / 'var' or 'var' % 'var' it will use the new routine. (not used if JLP is activated)

 

This way, let us suppose you have a = b / 20, it will use repeated substraction and it will be very fast, more than my new routine. But if B was 16 bits, then it would be better to write:

#c = 20
#a = #b / #c    ' If #b can approach 65000 this is ten of times faster than the other option.

Why this approach? because the programmer has extra information that the compiler doesn't have and can choose between ways of usage that will improve considerably the speed of the program.

Link to comment
Share on other sites

OK :) Someone smarter than me will have to come up with a quick explanation with examples, to really flesh this out. I rarely find computation speed to be my problem, my brain is usually the limiting factor :P

 

If powers of 2 are still optimized, then I'll just code for that and adjust accordingly (ie: speed or difficulty setting can be 2,4,8... instead of 1,2,3). It's good to know about that in advance like you say.

  • Like 1
Link to comment
Share on other sites

Ok, let me get the silliest of nitpicks out of the way:

.

    ; Entry: R0 = Quotient
    ;        R1 = Divisor
    ; Output: R0 = Result
    ;         R2 = Remainder

.

On entry, R0 is the dividend, and only on output is R0 the quotient. (Quotient is the result after dividing.) I sometimes use "numerator" and "denominator" for the names of inputs because it's easier for me to keep straight than dividend/divisor. (Although keeping dividend/divisor straight is easier than keeping minuend and subtrahend or abscissa and ordinate straight. I haven't used those terms since high school, but by gosh they drilled them into us as if they were actually important!)

 

Other than that, these look really good. :)

 

On the opening part of the subtract, I would probably use "SLLC R0, R0" instead of "ADDR R0, R0". It will be slightly faster in practice, because SLLC is interruptible, and therefore it'll reduce the amount of STIC cycle stealing by a minor fraction. (Barely enough to be measurable, though.)

 

Writing divides on this architecture always makes me a little sad, having pairs of CMPR/SUBR, since I know under the hood, CMPR is a SUBR. It just feels like such a waste! But, all the fast ways of using the carry bit (ADCR, RLC) also destroy it. Only branches leave it intact.

 

This particular divide approach doesn't translate into a non-restoring implementation easily either.

 

On the multiply, I'm wondering if it's possible to optimize the quarter-square approach for 16x16 multiply with some early-exits. The main advantage your code has over my other 16x16 qsqr multiply is that it's faster for small multiplicands that just happen to be in 16-bit variables. (ie. 8 bit values in 16-bit variables.) If the qsqr code "early exited" when it detects one side is small (ie. fits in 8 bits), I think it could be faster than yours for nearly all multiplicands.

Edited by intvnut
Link to comment
Share on other sites

Ok, let me get the silliest of nitpicks out of the way:

.

    ; Entry: R0 = Quotient
    ;        R1 = Divisor
    ; Output: R0 = Result
    ;         R2 = Remainder
.

On entry, R0 is the dividend, and only on output is R0 the quotient. (Quotient is the result after dividing.) I sometimes use "numerator" and "denominator" for the names of inputs because it's easier for me to keep straight than dividend/divisor. (Although keeping dividend/divisor straight is easier than keeping minuend and subtrahend or abscissa and ordinate straight. I haven't used those terms since high school, but by gosh they drilled them into us as if they were actually important!)

 

Woops! my fault, I was going to check it but I forgot it altogether. Corrected! :) :thumbsup:

 

Other than that, these look really good. :)

Thanks!

 

On the opening part of the subtract, I would probably use "SLLC R0, R0" instead of "ADDR R0, R0". It will be slightly faster in practice, because SLLC is interruptible, and therefore it'll reduce the amount of STIC cycle stealing by a minor fraction. (Barely enough to be measurable, though.)

Yep, but I would need to add a CLRC at start. I wanted to keep R0=0 and R2=0 in case of dividend R0=0

 

Writing divides on this architecture always makes me a little sad, having pairs of CMPR/SUBR, since I know under the hood, CMPR is a SUBR. It just feels like such a waste! But, all the fast ways of using the carry bit (ADCR, RLC) also destroy it. Only branches leave it intact.

I've been cracking my head trying to find a way to do everything in a single instruction, even thinking if there is an ARM like instruction for conditional execution. But it appears like is the only way.

 

Also a way to preserve Z flag in multiplication would save me the TSTR at end. In fact this multiplication is so reasonable that it is generated inline in next version of IntyBASIC.

 

This particular divide approach doesn't translate into a non-restoring implementation easily either.

I'm not sure to understand "non-restoring"

 

On the multiply, I'm wondering if it's possible to optimize the quarter-square approach for 16x16 multiply with some early-exits. The main advantage your code has over my other 16x16 qsqr multiply is that it's faster for small multiplicands that just happen to be in 16-bit variables. (ie. 8 bit values in 16-bit variables.) If the qsqr code "early exited" when it detects one side is small (ie. fits in 8 bits), I think it could be faster than yours for nearly all multiplicands.

Oh yes! that would be pretty cool! :)

Link to comment
Share on other sites

Yep, but I would need to add a CLRC at start. I wanted to keep R0=0 and R2=0 in case of dividend R0=0

No, you wouldn't. The two are literally interchangeable. SLLC shifts in a 0. Are you confusing it with RLC? Also, in my original post, I meant to say SLLC is non-interruptible. And also I meant to say "opening part of the divide." I was up way past my bedtime, apparently.

 

So, to be clear what I was suggesting:

.

uf_udiv16:    PROC
    CLRR R2        ; 6
    SLLC R0       ; 6
    BC @@1        ; 7/9
    SLLC R0       ; 6
    BC @@2        ; 7/9
    SLLC R0       ; 6
    BC @@3        ; 7/9

.

That doesn't require a CLRC at the start.

 

I'm not sure to understand "non-restoring"

 

In a restoring divide, you do the subtract conditionally: You do a trial subtract, and if the remaining dividend went negative, you restore the previous value. (In this case, CMPR lets us try the subtract without committing to keeping the result.)

 

In a non-restoring divide, you do the subtract unconditionally. If the remaining dividend goes negative, you don't restore the previous value. You keep the negative value. In order to make this work, you switch between doing subtracts while the remaining dividend is negative, and adds when the remaining dividend is positive. Wikipedia has a concise example.

 

In context here, if you could convert to a non-restoring implementation somehow, you could replace CMPR with SUBR, and if the result went negative, the branch would take you to alternate code that knows to ADDR next. But, given the way you shift in quotient bits combined with how you shift along dividend bits, I don't think it could be made to work profitably here. That, and the code size would roughly double.

Link to comment
Share on other sites

I have never in my entire life heard the terms "minuend and subtrahend". Not even in school. And I didn't take "new math" or anything. So weird, because I'm familiar with all of the others.

 

I wonder why this is.

 

No idea. It seems like we used these terms for a short while as vocab test words, and then never used them again after the test. I don't recall using the terms "ordinate" and "abscissa" outside that unit, except as trivia items.

Link to comment
Share on other sites

In a non-restoring divide, you do the subtract unconditionally. If the remaining dividend goes negative, you don't restore the previous value. You keep the negative value. In order to make this work, you switch between doing subtracts while the remaining dividend is negative, and adds when the remaining dividend is positive. Wikipedia has a concise example.

 

 

Ah, thats like the divide-by-ten algorithm in GroovyBee's score display routine. :) It took me a while to understand it back then, but it's a neat trick.

 

-dZ.

Edited by DZ-Jay
Link to comment
Share on other sites

Ok, I don't see how the best case on this guy is 44 cycles:

.

uf_mul16:   PROC
            CLRR R4             ;   6
@@0:
            CLRC                ;   6
            RRC     R0, 1       ;   6
            BNC     @@1         ;   7 (9)
            ADDR    R1, R4      ;   6
@@1:
            ADDR    R1, R1      ;   6
            TSTR    R0          ;   6
            BNE     @@0         ;   9 (7)
            MOVR    R4, R0      ;   6
            JR      R5          ;   7 
            ENDP

.

In the best-case path through this, the first branch is taken and the second branch is not. That adds up to 59 for me.

 

BTW, if you rotate the loop, you can also take advantage of the fact you can use SARC after the first iteration, and avoid both the TSTR in the loop and the CLRC. At least I think so. I haven't tested it but this code looks plausible:

.

    ; Entry:  R0 = Multiplicand, R1 = Multiplier
    ; Output: R0 = Result
    ; Best case: 49 cycles, worst case 561 cycles.
uf_mul16:   PROC

            CLRR    R4          ;   6
            CLRC                ;   6
            RRC     R0, 1       ;   6
            BEQ     @@2         ;   7 (9)
                                ; ---
                                ;  25 (27)

@@0:        BNC     @@1         ;   7 (9)
            ADDR    R1, R4      ;   6

@@1:        SLL     R1, 1       ;   6
            SARC    R0, 1       ;   6
            BNEQ    @@0         ;   9
                                ; ---
                                ;  34 (30) per bit; -2 on last iter

@@2:        BNC     @@3         ;   7 (9)
            ADDR    R1, R4      ;   6

@@3:        MOVR    R4, R0      ;   6
            JR      R5          ;   7
                                ; ---
                                ;  26 (22)
            ENDP

.

I need to double check the best/worst case numbers. Consider them rough stabs...

Edited by intvnut
Link to comment
Share on other sites

I see now SLLC. Why it's advantageous over ADDR?

 

This is the most micro of micro-optimizations. When the display is active, the STIC pauses the CPU with a bus request 13 or 14 times per frame. It steals up to 1400 cycles per frame this way. It doesn't need all those cycles though.

 

Non-interruptible instructions delay responding to a bus request until an interruptible boundary. So, when you pick a non-interruptible instruction in lieu of its interruptible equivalent, you prevent the STIC from stealing some of those cycles occasionally. This doesn't hurt anything, and it gives you more usable cycles.

 

This is related to the "non-interruptible instruction display glitch" issue. If you have too many non-interruptible instructions, then the CPU won't relinquish the bus to the STIC in time. The STIC does need some cycles to do its work, just not as many as it asks for.

 

So, if your game is doing a lot of multiplies with this loop (assuming you're referring to the SLL in the multiply code), you might get back an extra 9 cycles per frame per multiply, by delaying a bus request until after the branch. Not much, which is why I say it's the most micro of microoptimizations. Similar comments apply to the SLLC substitution I suggested in the divide routine.

Edited by intvnut
Link to comment
Share on other sites

This is the most micro of micro-optimizations. When the display is active, the STIC pauses the CPU with a bus request 13 or 14 times per frame. It steals up to 1400 cycles per frame this way. It doesn't need all those cycles though.

 

Non-interruptible instructions delay responding to a bus request until an interruptible boundary. So, when you pick a non-interruptible instruction in lieu of its interruptible equivalent, you prevent the STIC from stealing some of those cycles occasionally. This doesn't hurt anything, and it gives you more usable cycles.

 

This is related to the "non-interruptible instruction display glitch" issue. If you have too many non-interruptible instructions, then the CPU won't relinquish the bus to the STIC in time. The STIC does need some cycles to do its work, just not as many as it asks for.

 

So, if your game is doing a lot of multiplies with this loop (assuming you're referring to the SLL in the multiply code), you might get back an extra 9 cycles per frame per multiply, by delaying a bus request until after the branch. Not much, which is why I say it's the most micro of microoptimizations. Similar comments apply to the SLLC substitution I suggested in the divide routine.

Ok, just implemented :) thanks! :thumbsup:

Link to comment
Share on other sites

off topic really, but way back when as a college freshman I had a TA (Asad Kallada, I believe though my spelling may be wrong). I will never forget him standing at the chalk board pointing to a simple division problem and saying over and over "this is the numerator, (moves pointer) this is the denominator" and getting upset as the class sat with blank stares.

 

the thing was he had a very strong accent and all we were hearing was "Dis is de numerator (moves pointer) Dis is denumerator". People who had not heard the terms were very confused. When I explained it to him after class he found it very funny.

 

(this isn't intended to make fun of accents or any of the wonderful people who learn english as a second or third language. Just a recollection from my youth that, looking back, seems humorous to me.)

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...