Quarter Square Multiplication: Blindingly fast multiplies!

freewheel · April 2, 2015

If all that abstraction, power, creative empowerment, and productivity costs a few thousand DECLEs, I say, bring it on.

DZ: don't get me wrong. I'm not arguing against functionality and abstraction in the slightest.

However when I look at 40, 50, 100 possible instructions in something like IntyBASIC... and if each time we casually just say "hey, it's only a thousand words"... when is the point when it's too much? Does this point even exist, or do we just plow ahead and not worry about it?

I guess I just found it strange that multiplication would be the end-all and be-all of feature additions of that size. And that it's bleedingly obvious to everyone but me that in this case it makes total sense, but that it's unlikely to creep much beyond that.

intvnut · April 3, 2015

DZ: don't get me wrong. I'm not arguing against functionality and abstraction in the slightest.

However when I look at 40, 50, 100 possible instructions in something like IntyBASIC... and if each time we casually just say "hey, it's only a thousand words"... when is the point when it's too much? Does this point even exist, or do we just plow ahead and not worry about it?

I guess I just found it strange that multiplication would be the end-all and be-all of feature additions of that size. And that it's bleedingly obvious to everyone but me that in this case it makes total sense, but that it's unlikely to creep much beyond that.

I admit wanting to eke out as much performance as possible in the math primitives, and then back up from that if they get too huge. I'm a math library nerd I guess, and all you've really heard is my voice. I don't see multiplies as being a 'killer feature' per se, although they can make some algorithms elegant.

For example, rand8( val ) has a pretty elegant and cheap implementation if you have a cheap multiply. And I can see certain kinds of games using that type of 'dice roll' rand all over the place. For example, whatever First Spear is working on in that other thread has a nice table-driven approach that's low cost if you have a cheap bounded random number generator.

Aside: If there are a lot of rand8() calls—more than one per frame—then perhaps each call should advance the random number state too. A SARC / BC / XORI / MVO would take care of that nicely.

Acceleration curves, heading computation, scoring, certain kinds of AI... these are all places that could potentially benefit. Many of these have clever multiply-free algorithms, mind you, but anyway.

Will most games be affected if multiplies take 0.75ms instead of 0.25ms? Unlikely. (That's the approximate performance of the shift/add vs. table-based multiply.)

My personal gut feel is to have a couple options handy, and pick one until we figure out why it sucks more than the others. The super-fast table based multiply seems more in line with the current IntyBASIC code generation strategies, such as open coding PRINTs as an unrolled series of MVIIs instead of having a PRINT function somewhere that formats characters in a loop at run time.

Maybe larger games would benefit from a -Osize flag someday, while more action-oriented games would prefer -Ospeed (which would match the current default assumptions). There are many things IntyBASIC open-codes that could be factored out to library calls at a small cycle expense but a big codesize benefit.

freewheel · April 3, 2015

Maybe larger games would benefit from a -Osize flag someday, while more action-oriented games would prefer -Ospeed (which would match the current default assumptions). There are many things IntyBASIC open-codes that could be factored out to library calls at a small cycle expense but a big codesize benefit.

Isn't it remarkable that we're already having a discussion about compiler optimization flags??? Amazing progress if you ask me. When I first approached IntyBASIC I expected something to simplify the (to me) tedium of MV, ADD, CMP ad naseum. And maybe, if I was lucky, to cover some of the drudge work that DZ mentions (writing to the screen etc). I was ridiculously impressed when I first downloaded 0.8 or so, and It's astounding how far this has come since then.

Consider me a convert on the mult speedup if it allows a convenient rand(range) function. Also if I can get a=b^c /deadhorse

intvnut · April 3, 2015

Consider me a convert on the mult speedup if it allows a convenient rand(range) function. Also if I can get a=b^c /deadhorse

Tell me more about what you'd want out of an exponentiation operator. Just integer exponentiation? By a variable integer amount?

freewheel · April 3, 2015

Tell me more about what you'd want out of an exponentiation operator. Just integer exponentiation? By a variable integer amount?

Well if you're taking requests...

I don't know, really. It's something that comes up occasionally that would be really handy when it's needed. I often find myself seeing clever opportunities for 2^n (n being determined at runtime, so yeah, fully variable), but without spending any actual time looking at your table mult implementation part of my brain wonders if it lends itself to the generalized case.

In a perfect world, a=b^c where b and c could be "anything"? I realize how stupidly impractical that gets due to massive overflow once c becomes anything sizable. It's certainly not something worth spending a lot of time/resources on, but if there was a quick win? It'd be swell. It's part of the basic math operands in my head - I can't remember what language I used to use it a lot in but it was just part of the syntax.

intvnut · April 3, 2015

Well if you're taking requests...

I don't know, really. It's something that comes up occasionally that would be really handy when it's needed. I often find myself seeing clever opportunities for 2^n (n being determined at runtime, so yeah, fully variable), but without spending any actual time looking at your table mult implementation part of my brain wonders if it lends itself to the generalized case.

In a perfect world, a=b^c where b and c could be "anything"? I realize how stupidly impractical that gets due to massive overflow once c becomes anything sizable. It's certainly not something worth spending a lot of time/resources on, but if there was a quick win? It'd be swell. It's part of the basic math operands in my head - I can't remember what language I used to use it a lot in but it was just part of the syntax.

Well, 2ⁿ is really, really easy, and probably best served with a 16 entry lookup table. Something like this:

.

Pow2    DECLE 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768

.

Then Pow2(n) would give you 2ⁿ, as long as 0 ≤ n ≤ 15. You can do that today with the existing IntyBASIC. With an official operator, I suppose you could get fancier, adding an implicit test to handle n < 0 or n > 15 by clamping to 1 or 32768, for example.

The other exponentiation that comes up regularly in my experience is n², often in distance computation. With the table lookup, we can actually do that cheaper than n * n, since the table the multiply uses is a table of squares for 0² through 255². So, if IntyBASIC adopts the Quarter Square multiply, it can also provide a Sqr(n) function that returns n² super-cheap. I also have a fast integer square root laying around to go with it that IntyBASIC is welcome to adopt if it wants a Sqrt() function.

If you're actually interested in Euclidean distance, though, it might be interesting to adopt the dist_fast routine I wrote as an IntyBASIC built-in. It's accurate within 4% and way faster than squares and square roots. (I didn't come up with the algorithm; I implemented the algorithm published by Cognex* in Graphics Gems.) And since the dist_fast routine is really two 8-bit multiplies and a subtract, it might get very slightly faster with the lookup table from Quarter Squares. Not sure.

As for which languages had a general exponentiation operator? I'm pretty sure Commodore BASIC had it. Otherwise, it's actually pretty rare in most languages I've encountered. FORTRAN had it though.

_______________________________________________________________

* Unrelated, but potentially interesting: Cognex is a really cool company doing neat stuff, and one of our customers at work. I really had a blast talking low level optimization tricks with their engineers. I wrote the dist_fast routine, though, long before I had the opportunity to meet the Cognex engineers.

Edited April 3, 2015 by intvnut

freewheel · April 3, 2015

Yeah, a lookup table is basically what I construct (I don't even bother with going past 8 values). 2^n as a function is more for just aesthetics and convenience if the syntax gets implemented. n^2 - that's kinda why I started thinking about your tables. If they're speeding up multiplies to begin with, then it should extend to at least that. Not sure how much further - n^3, 4, etc...

Random question, while we're sort of on this subject: address management, and where all this library code ends up. The prologue starts right at $5000, then my code gets inserted. The epilogue (at least in the example I'm looking at) gets tacked onto the end of the block that starts at $C100 (because I'm pushing the end of my code there). So the compiler already has some degree of smarts to figure out how to dynamically shift code blocks around within the memory map, even if I have to manually manage my own stuff if it overflows any "allowed" address space.

If IntyBASIC ends up with a lot more of this stuff included, then these address ranges will become a bit tricky to manage (ie: if the prologue grows to more than 8k, where does it end up?). Which means perhaps a bit of "smarts" to manage it. Which means the possibility of automatic memory management within the compiler itself?

Something I didn't articulate earlier because it was a half-formed thought - if the prologue gets terribly large, pretty much every IntyBASIC programmer is going to have to deal with this (the static ASM ORG $address statements). Either that, or IntyBASIC learns to handle it all for the programmer. Which would be an awesome addition. One of my fears with "hey, let's add this routine in, it's only 700 words" is that with every IntyBASIC release, re-compiling could break a lot of programs. I'd have to move all the ORG statements around every time, and while it's a fairly simple thing to figure out, it's just more maintenance. To every single program (I rarely write anything under 8k).

I know it's been discussed before but I don't know where it's at or even the feasibility. It's certainly something I'd peg as a requirement if the language inclusions grow by much. Right now it'd just be a "nice to have" because things are mostly static; I only have to move ORGs around when *I* add things. I noticed that the current mult routines are in the prologue which is what prompted this minor worry in the first place.

intvnut · April 3, 2015

Yeah, a lookup table is basically what I construct (I don't even bother with going past 8 values). 2^n as a function is more for just aesthetics and convenience if the syntax gets implemented. n^2 - that's kinda why I started thinking about your tables. If they're speeding up multiplies to begin with, then it should extend to at least that. Not sure how much further - n^3, 4, etc...

Random question, while we're sort of on this subject: address management, and where all this library code ends up. The prologue starts right at $5000, then my code gets inserted. The epilogue (at least in the example I'm looking at) gets tacked onto the end of the block that starts at $C100 (because I'm pushing the end of my code there). So the compiler already has some degree of smarts to figure out how to dynamically shift code blocks around within the memory map, even if I have to manually manage my own stuff if it overflows any "allowed" address space.

If IntyBASIC ends up with a lot more of this stuff included, then these address ranges will become a bit tricky to manage (ie: if the prologue grows to more than 8k, where does it end up?). Which means perhaps a bit of "smarts" to manage it. Which means the possibility of automatic memory management within the compiler itself?

Something I didn't articulate earlier because it was a half-formed thought - if the prologue gets terribly large, pretty much every IntyBASIC programmer is going to have to deal with this (the static ASM ORG $address statements). Either that, or IntyBASIC learns to handle it all for the programmer. Which would be an awesome addition. One of my fears with "hey, let's add this routine in, it's only 700 words" is that with every IntyBASIC release, re-compiling could break a lot of programs. I'd have to move all the ORG statements around every time, and while it's a fairly simple thing to figure out, it's just more maintenance. To every single program (I rarely write anything under 8k).

I know it's been discussed before but I don't know where it's at or even the feasibility. It's certainly something I'd peg as a requirement if the language inclusions grow by much. Right now it'd just be a "nice to have" because things are mostly static; I only have to move ORGs around when *I* add things. I noticed that the current mult routines are in the prologue which is what prompted this minor worry in the first place.

You're hitting on the topic that I think needs to be dealt with next, honestly: ASM ORG is a bloody, fragile hack.

If I'm not mistaken, IntyBASIC outputs the following more or less:

The prologue file, more or less verbatim
The generated code
The epilogue file, more or less verbatim

The prologue code is just the minimum to get a cartridge header on the ROM image and boot the program. All the real library meat is in the epilogue file. The compiler has almost no smarts at all about ROM address management at the moment that I'm aware of. (It does know something about the available RAM.) So, the epilogue will just show up after the last bit of your generated code. If you sign off your program with an ASM ORG statement, the epilogue will end up there.

I've been toying with some ideas of how to handle shoving an IntyBASIC-compiled program into an inconvenient memory map as a post-processing step for some time. (This is without getting into page-flipping, which is the next challenge.) I do have some interesting ideas.

As much as I think having to know too much about how multiply is implemented is against the ethos of BASIC, I think having to finagle the address map is an order of magnitude worse.

Edited April 3, 2015 by intvnut

JohnPCAE · April 7, 2015

Have you taken a look at the multiplication routine in my raycasting demo?

intvnut · April 7, 2015

Have you taken a look at the multiplication routine in my raycasting demo?

I have, briefly. I think the quarter square table-based 8x8 multiply beats the value-dependent dynamic dispatch, actually. You have a fixed 23 cycle overhead here to dispatch:

.

        MOVR    R1,     R2
        ADDI    #@@FracFPMultTable, R2
        MVI@    R2,     R7

.

and because of the structure of each of the leaves, you pretty much need to JSR / JR R5 to leverage the kernel in your fixed point multiply. JSR + JR adds 20 more cycles. You're at 43 cycles just to multiply by 0.

If you extract the 8x8 kernel from Quarter Square, it's small enough you could inline it in your fixed-point multiply routine. It's a flat 42 cycles if you do that.

.

            MOVR    R0,             R2      ;   6
            ADDI    #QSQR8_TBL.mid, R1      ;   8
            ADDR    R1,             R2      ;   6   a + b
            SUBR    R0,             R1      ;   6   a - b
            MVI@    R2,             R2      ;   8
            SUB@    R1,             R2      ;   8
                                            ;----
                                            ;  42

.

When I saw Quarter Square, I was like "Where have you been all my classic-gaming programming life?" It really is that cool.

Edited April 7, 2015 by intvnut

intvnut · April 7, 2015

Have you taken a look at the multiplication routine in my raycasting demo?

BTW, I don't mean to downplay what you achieved in the raycasting demo. It's actually really cool. And I couldn't think of a faster approach to multiplication than the one you took. Maybe microoptimizations around the edges, but that's it.

I was shocked when I totaled up the cycles for quarter square. It's in its own league. I can't take credit for it. I found it by accident on Wikipedia.

+DZ-Jay · April 7, 2015

Now for the kicker... John how do you think your raycasting demo would perform with PURE ULTIMATE MULTIPLICATION POWAH?

intvnut · April 7, 2015

Have you taken a look at the multiplication routine in my raycasting demo?

Ok, I couldn't resist it. Here's a draft fixed-point multiply to replace your 8.8 * 8.8 fixed-point multiply. I have not tested this. I have not even attempted to assemble it yet. But, it looks pretty good. It comes in around 196 cycles worst case. This should be a solid improvement over your current multiply code. I eyeballed the cycle count for FracFixedMultiply and it looks like it ranges from around 250 - 400 cycles/call, which is really darn good. (If you have a better analysis of the cycle count, by all means post it.) My code below does stomp on a few registers without saving/restoring them, so if you do save/restore R2, R3 and R4, you're looking at ~250 cycles for mine.

.

;;  R1 = R0 * R1 >> 8, where R1 in range -1..1.  R0 and R1 are signed.
;;  Note: R0 = b, R1 = a
qs_fx_mpy16x8   PROC

            PSHR    R5                  ;   9

            MOVR    R1,         R2      ;   6  \_ Extract lo(a)
            ANDI    #$FF,       R2      ;   8  /
            XORR    R2,         R1      ;   6  -- Extract hi(a)
            BEQ     @@pos               ;   7 (9) is hi(b) 0xFE,0xFF,0x00,0x01
                                        ;----
                                        ;  36 (38)

            ADDI    #$100,      R1      ;   8  \_ +1: just multiply by 0x1.00
            BGT     @@mpy_by_1          ;   7  /
            BEQ     @@neg               ;   9  R1 is -ve, but > -1.00
                                        ;----
                                        ;  60

            NEGR    R0                  ; R1 was -1.00, so just negate R0
@@mpy_by_1: MOVR    R0,         R1      ; or was +1.00, so just move R0 to R1
            JR      R5                  ;


@@pos:      MOVR    R0,         R3      ;   6  \_ Extract lo(b) => R0
            ANDI    #$FF,       R0      ;   8  /
            XORR    R0,         R3      ;   6  \_ Extract hi(b) => R3
            SWAP    R3                  ;   6  /
                                        ;----
                                        ;  26
                                        ;  36 (carried forward)
                                        ;----
                                        ;  62

            ; R2 = lo(a), R0 = lo(b)
            ; lo(a) * lo(b).
            MOVR    R0,         R4      ;   6  R4 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8  offset lo(a) into table.
            MOVR    R2,         R5      ;   6  ... save for lo*hi

            ADDR    R2,         R4      ;   6  lo(a) + lo(b)
            SUBR    R0,         R2      ;   6  lo(a) - lo(b)
            MVI@    R4,         R1      ;   8  \_ low 16 bits of product
            SUB@    R2,         R1      ;   8  /

            SWAP    R1                  ;   6  \_ shift right 8 bits
            ANDI    #$FF,       R1      ;   8  /
                                        ;----
                                        ;  62
                                        ;  62 (carried forward)
                                        ;----
                                        ; 124


            ;
            ; hi(a) * lo(b)
            MOVR    R3,         R4      ;   6  R4 = R3 = hi(b)
            ADDR    R5,         R4      ;   6  lo(a) + hi(b)
            SUBR    R3,         R5      ;   6  lo(a) - hi(b)
            ADD@    R4,         R1      ;   6
            SUB@    R5,         R1      ;   6
                                        ;----
                                        ;  30
                                        ; 124
                                        ;----
                                        ; 154

            PULR    PC                  ;  12
                                        ;----
                                        ; 176

@@neg:      MOVR    R0,         R3      ;   6  \_ Extract lo(b) => R0
            ANDI    #$FF,       R0      ;   8  /
            XORR    R0,         R3      ;   6  \_ Extract hi(b) => R3
            SWAP    R3                  ;   6  /
                                        ;----
                                        ;  26
                                        ;  60 (carried forward)
                                        ;----
                                        ;  86

            ; R2 = lo(a), R0 = lo(b)
            ; lo(a) * lo(b).
            MOVR    R0,         R4      ;   6  R4 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8  offset lo(a) into table.
            MOVR    R2,         R5      ;   6  ... save for lo*hi

            ADDR    R2,         R4      ;   6  lo(a) + lo(b)
            SUBR    R0,         R2      ;   6  lo(a) - lo(b)
            MVI@    R4,         R1      ;   8  \_ low 16 bits of product
            SUB@    R2,         R1      ;   8  /

            SWAP    R1                  ;   6  \_ shift right 8 bits
            ANDI    #$FF,       R1      ;   8  /
                                        ;----
                                        ;  62
                                        ;  86 (carried forward)
                                        ;----
                                        ; 148


            ;
            ; hi(a) * lo(b)
            MOVR    R3,         R4      ;   6  R4 = R3 = hi(b)
            ADDR    R5,         R4      ;   6  lo(a) + hi(b)
            SUBR    R3,         R5      ;   6  lo(a) - hi(b)
            ADD@    R4,         R1      ;   6
            SUB@    R5,         R1      ;   6

            NEGR    R1                  ;   6
                                        ;----
                                        ;  36
                                        ; 148
                                        ;----
                                        ; 184

            PULR    PC                  ;  12
                                        ;----
                                        ; 196
            ENDP

.

Note that I don't have any code to handle R0 < 0. I don't actually think that's needed, as you're returning the middle 16 bits of the 32-bit product.

intvnut · April 8, 2015

Ok, I couldn't resist it. Here's a draft fixed-point multiply to replace your 8.8 * 8.8 fixed-point multiply. I have not tested this. I have not even attempted to assemble it yet. But, it looks pretty good. It comes in around 196 cycles worst case. This should be a solid improvement over your current multiply code. I eyeballed the cycle count for FracFixedMultiply and it looks like it ranges from around 250 - 400 cycles/call, which is really darn good. (If you have a better analysis of the cycle count, by all means post it.) My code below does stomp on a few registers without saving/restoring them, so if you do save/restore R2, R3 and R4, you're looking at ~250 cycles for mine.
...

Note that I don't have any code to handle R0 < 0. I don't actually think that's needed, as you're returning the middle 16 bits of the 32-bit product.

Ok, I was wrong about not needing to handle R0 < 0. Below is the corrected code, with a ham-fisted test for R0 < 0 followed by post-negate. It seems to mostly behave the same way as the original fixed point multiply, although I think the rounding may be different somewhere, since it isn't pixel-perfect to the original.

Here's the latest code, tested inside the raycaster. You'll note my hamfisted test at the top that lacks cycle counting, as I believe this test and the required compensation can be integrated into the rest of the code more cheaply than what I did here to test correctness. This version saves/restores R3, R4. Its run time (ignoring the hamfisted bit at the top) is 208 to 229 cycles. I imagine a proper bit of code to compensate for R0 < 0 would add 20-30 cycles or so to that, as opposed to the expensive temporary code I have there now.

.

;;  R1 = R0 * R1 >> 8, where R1 in range -1..1.  R0 and R1 are signed.
;;  Note: R0 = b, R1 = a
;;  Trashes R0, R2.  Preserves R3, R4.
qs_fx_mpy16x8   PROC

            ; quick hack
            TSTR    R0
            BPL     @@ok
            NEGR    R0
            PSHR    R5
            CALL    @@ok
            NEGR    R1
            PULR    PC


@@ok:       MOVR    R1,         R2      ;   6  \_ Extract lo(a)
            ANDI    #$FF,       R2      ;   8  / 
            XORR    R2,         R1      ;   6  -- Extract hi(a)
            BEQ     @@pos               ;   7 (9) is hi(b) 0xFE,0xFF,0x00,0x01?
                                        ;----
                                        ;  27 (29)

            ADDI    #$100,      R1      ;   8  
            BEQ     @@neg               ;   7 (9) 0xFF: R1 is -ve, but > -1.00
                                        ;----
                                        ;  15 (17)
                                        ;  27
                                        ;----
                                        ;  42 (44)

            BGT     @@mpy_by_1          ;   7 (9) 0x01: R1 is 1.00
                                        ;----
                                        ;  49 (51)

            NEGR    R0                  ; R1 was -1.00, so just negate R0
@@mpy_by_1: MOVR    R0,         R1      ; or was +1.00, so just move R0 to R1
            JR      R5                  ; 
            

@@pos:      PSHR    R5                  ;   9
            PSHR    R4                  ;   9
            PSHR    R3                  ;   9

            MOVR    R0,         R3      ;   6  \_ Extract lo(b) => R0
            ANDI    #$FF,       R0      ;   8  / 
            XORR    R0,         R3      ;   6  \_ Extract hi(b) => R3
            SWAP    R3                  ;   6  /
                                        ;----
                                        ;  53
                                        ;  29 (carried forward)
                                        ;----
                                        ;  82

            ; R2 = lo(a), R0 = lo(b)
            ; lo(a) * lo(b).
            MOVR    R0,         R4      ;   6  R4 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8  offset lo(a) into table.
            MOVR    R2,         R5      ;   6  ... save for lo*hi

            ADDR    R2,         R4      ;   6  lo(a) + lo(b)
            SUBR    R0,         R2      ;   6  lo(a) - lo(b)
            MVI@    R4,         R1      ;   8  \_ low 16 bits of product
            SUB@    R2,         R1      ;   8  /

            SWAP    R1                  ;   6  \_ shift right 8 bits
            ANDI    #$FF,       R1      ;   8  /
                                        ;----
                                        ;  62
                                        ;  82 (carried forward)
                                        ;----
                                        ; 144

                                            
            ; 
            ; hi(a) * lo(b)
            MOVR    R3,         R4      ;   6  R4 = R3 = hi(b)
            ADDR    R5,         R4      ;   6  lo(a) + hi(b)
            SUBR    R3,         R5      ;   6  lo(a) - hi(b)
            ADD@    R4,         R1      ;   6
            SUB@    R5,         R1      ;   6
                                        ;----
                                        ;  30
                                        ; 144
                                        ;----
                                        ; 174

            PULR    R3                  ;  11
            PULR    R4                  ;  11
            PULR    PC                  ;  12
                                        ;----
                                        ; 208

@@neg:      PSHR    R5                  ;   9
            PSHR    R4                  ;   9
            PSHR    R3                  ;   9

            MOVR    R0,         R1      ;   6  Save 'b' in R0 for -ve hi*hi
            MOVR    R1,         R3      ;   6  \_ Extract lo(b) => R0
            ANDI    #$FF,       R1      ;   8  / 
            XORR    R1,         R3      ;   6  \_ Extract hi(b) => R3
            SWAP    R3                  ;   6  /
                                        ;----
                                        ;  59
                                        ;  44 (carried forward)
                                        ;----
                                        ; 103

            ; R2 = lo(a), R1 = lo(b)
            ; lo(a) * lo(b).
            MOVR    R1,         R4      ;   6  R4 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8  offset lo(a) into table.
            MOVR    R2,         R5      ;   6  ... save for lo*hi

            ADDR    R2,         R4      ;   6  lo(a) + lo(b)
            SUBR    R1,         R2      ;   6  lo(a) - lo(b)
            MVI@    R4,         R1      ;   8  \_ low 16 bits of product
            SUB@    R2,         R1      ;   8  /

            SWAP    R1                  ;   6  \_ shift right 8 bits
            ANDI    #$FF,       R1      ;   8  /
            SUBR    R0,         R1      ;   6  a * hi(b) -1 multiply
                                        ;----
                                        ;  62
                                        ; 103 (carried forward)
                                        ;----
                                        ; 165

                                            
            ; 
            ; hi(a) * lo(b)
            MOVR    R3,         R4      ;   6  R4 = R3 = hi(b)
            ADDR    R5,         R4      ;   6  lo(a) + hi(b)
            SUBR    R3,         R5      ;   6  lo(a) - hi(b)
            ADD@    R4,         R1      ;   6
            SUB@    R5,         R1      ;   6

                                        ;----
                                        ;  30
                                        ; 165
                                        ;----
                                        ; 195

            PULR    R3                  ;  11
            PULR    R4                  ;  11
            PULR    PC                  ;  12
                                        ;----
                                        ; 229
            ENDP

.

I've attached the latest qsqr3.asm that includes the lookup table and other multiply routines.

qsqr3.asm

intvnut · April 9, 2015

Have you taken a look at the multiplication routine in my raycasting demo?

Ok, so I've optimized the quarter-square multiply technique to do the signed 8.8 * signed 8.8 fixed-point multiply, where the right-hand side is limited to the range -1.00 to 1.00. Its cycle count now ranges from 228 to 271 cycles (not counting the early-exit trivial cases for multiply by -1.00 or +1.00). Aside from the cost incurred by compensating for the signs of the multiplicands, the multiply performance is otherwise flat due to the table lookup.

And, I've hooked it into the raycasting demo as a test, and so far it appears to be more or less pixel-perfect.

I've attached the final version. It's likely possible to eliminate another dozen or so cycles out of this with micro-optimizations. Still, I think this is pretty good. It feels slightly snappier at some angles in the raycasting demo. It's not night-and-day different, but a reasonable improvement. As I said before, your multiply implementation was really darn good. The table lookup approach from quarter-square just flattens the cost for an 8x8 unsigned multiply down to nearly the same as multiplying by 0x01.

This version uses a straightforward "negate R0 and negate the product" approach for dealing with a negative R0 value. I tried to use a more clever approach based on some observations in Hacker's Delight, but I botched something along the way and it didn't work. Fixing the botch looked to be more costly than this straightforward approach, cycle-wise. Hacker's Delight is an awesome book, but it basically assumes a 1-cycle-per-instruction RISC processor with a barrel shifter for many of its nifty hacks. That hardly describes our beloved Intellivision.

EDIT: It isn't pixel perfect. I think there may be a rounding/truncation difference but I'm not sure exactly where. Here's a couple of screen shots to demonstrate.

qsqr3.asm

Edited April 9, 2015 by intvnut

+nanochess · April 23, 2015

I guess there isn't an easy way to say a table lookup only returns an 8-bit value, is there? For First Spear's use case (the one that inspired all this, actually), if he uses the approach I suggested, he would have a lookup table that will return small values (fit well within 8 bits), but are served well by a lookup table.

Alternately, does it make sense to have a rand8(val) primitive as part of the language that returns a random 8 bit value that's in the range [ 0, val ), and that could use the 8 x 8 multiply? ie.

.
    rand_result = rand8( val )
.

that expands to something like

.
        ; assume 'val' in R0; result in R0.  Trashes R1, R2.
        ANDI    #$FF,           R0    ;   8
        MOVR    R0,             R2    ;   6
        MVII    #QSQR8_TBL.mid, R1    ;   8
        ADD     _rand,          R1    ;  10
        ADDR    R1,             R2    ;   6
        SUBR    R0,             R1    ;   6
        MVI@    R2,             R0    ;   8
        SUB@    R1,             R0    ;   8
        SWAP    R0                    ;   6
        ANDI    #$FF,           R0    ;   8
                                      ;----
                                      ;  74
.

A flat 74 cycles gives you an 8 bit random number in the desired range, with minimal bias.

This usage of the 8-bit multiplication via RAND(limit) and the var1 * var2 expression compiled to a call to your subroutine are now included in IntyBASIC v1.0.4

Thanks for your effort! I think these routines are very valuable

intvnut · April 24, 2015

This usage of the 8-bit multiplication via RAND(limit) and the var1 * var2 expression compiled to a call to your subroutine are now included in IntyBASIC v1.0.4

Thanks for your effort! I think these routines are very valuable

Awesome!

I particularly think the RAND() function is just generally useful. RAND (or rather RND) was always a cornerstone in all my old BASIC games, which made it conspicuous by its absence in IntyBASIC. Until now, that is.

(Ok, there's the 0..255 random variable, but the scaled random is what I always used. INT( limit * RND(1) ) was an idiom I employed quite often back in the day.)

Edited April 24, 2015 by intvnut

Quarter Square Multiplication: Blindingly fast multiplies!

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members