Quarter Square Multiplication: Blindingly fast multiplies!

intvnut · April 1, 2015

While looking for something else the other night, I stumbled on this Wikipedia article that describes Quarter Square Multiplication. I had never seen this algorithm before. For a machine like the Intellivision, it appears to be a perfect fit!

I won't repeat the Wikipedia article here, but the short version: With a few additions, subtractions and table lookups, you can perform arbitrary multiplication wicked fast. For an 8-bit x 8-bit multiplication (16-bit result), we're talking ~70 cycles. For 16x16 multiplication (also 16-bit result), we're talking ~302 cycles.

It didn't take me very long to code it up at all. The algorithm is that simple. It just needs a 511 word lookup table.

Here's the implementation of both 8x8 => 16 and 16x16 => 16 multiplies.

.

; Quarter Square Multiplication
; Assembly code by Joe Zbiciak, 2015
; Released to public domain.


QSQR8_TBL   PROC
            DECLE   $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C
            DECLE   $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038
            DECLE   $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084
            DECLE   $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0
            DECLE   $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C
            DECLE   $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228
            DECLE   $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4
            DECLE   $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0
            DECLE   $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC
            DECLE   $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618
            DECLE   $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764
            DECLE   $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0
            DECLE   $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C
            DECLE   $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08
            DECLE   $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4
            DECLE   $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0
            DECLE   $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC
            DECLE   $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8
            DECLE   $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644
            DECLE   $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0
            DECLE   $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C
            DECLE   $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8
            DECLE   $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4
            DECLE   $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0
            DECLE   $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC
            DECLE   $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8
            DECLE   $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24
            DECLE   $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090
            DECLE   $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C
            DECLE   $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8
            DECLE   $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94
            DECLE   $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80
            DECLE   $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C
            DECLE   $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8
            DECLE   $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04
            DECLE   $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070
            DECLE   $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC
            DECLE   $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8
            DECLE   $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74
            DECLE   $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360
            DECLE   $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C
            DECLE   $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98
            DECLE   $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4
            DECLE   $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850
            DECLE   $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC
            DECLE   $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388
            DECLE   $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954
            DECLE   $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40
            DECLE   $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C
            DECLE   $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78
            DECLE   $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4
            DECLE   $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830
            DECLE   $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC
            DECLE   $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568
            DECLE   $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34
            DECLE   $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320
            DECLE   $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C
            DECLE   $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158
            DECLE   $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4
            DECLE   $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010
            DECLE   $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C
            DECLE   $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48
            DECLE   $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714
            DECLE   $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01  
            ENDP

; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values
; Destroys R1
qs_mpy8     PROC
            MOVR    R0,     R2      ;   6
            ADDR    R1,     R2      ;   6   a + b
            SUBR    R0,     R1      ;   6   a - b
            BPL     @@ok            ;   7
            NEGR    R1              ;   6
@@ok:       ADDI    #QSQR8_TBL, R2  ;   8
            ADDI    #QSQR8_TBL, R1  ;   8
            MVI@    R2,         R2  ;   8
            SUB@    R1,         R2  ;   8
            JR      R5              ;   7
                                    ;----
                                    ;  70
            ENDP
            

; R1 = R0 * R1, where R0 and R1 are 16-bit values
; destroys R0, R2, R3, R4, R5
qs_mpy16    PROC
            PSHR    R5              ;   9

            MVII    #QSQR8_TBL, R5  ;   8

            MOVR    R0,         R2  ;   6   R2 is orig 16-bit a
            MOVR    R1,         R3  ;   6   R3 is orig 16-bit b
                                
            ; lo * lo           
            ANDI    #$FF,       R0  ;   8   R0 is lo(a)
            MOVR    R0,         R4  ;   6   R4 is lo(a)
            ANDI    #$FF,       R1  ;   8   R1 is lo(b)
            PSHR    R1              ;   9   save lo(b)

            ADDR    R1,         R4  ;   6   R4 = lo(a) + lo(b)
            SUBR    R0,         R1  ;   6   R1 = lo(a) - lo(b)
            BPL     @@pos_ll        ;   7
            NEGR    R1              ;   6
                                
@@pos_ll:   ADDR    R5,         R4  ;   6   R4 = &qstbl[lo(a)+lo(b)]
            ADDR    R5,         R1  ;   6   R1 = &qstbl[lo(a)-lo(b)]
            MVI@    R4,         R4  ;   8   R4 = qstbl[lo(a)+lo(b)]
            SUB@    R1,         R4  ;   8   R4 = lo(a)*lo(b)
                                    ;----
                                    ; 113

            ; lo * hi
            SWAP    R3              ;   6   \_ R3 = hi(b)
            ANDI    #$FF,       R3  ;   8   /
            MOVR    R0,         R1  ;   6   R0 = R1 = lo(a)
            ADDR    R3,         R1  ;   6   R1 = hi(b) + lo(a)
            SUBR    R0,         R3  ;   6   R3 = hi(b) - lo(a)
            BPL     @@pos_lh        ;   7
            NEGR    R3              ;   6

@@pos_lh:   ADDR    R5,         R1  ;   6   R1 = &qstbl[hi(b)+lo(a)]
            ADDR    R5,         R3  ;   6   R3 = &qstbl[hi(b)-lo(a)]
            MVI@    R1,         R1  ;   8   R1 = qstbl[hi(b)-lo(a)]
            SUB@    R3,         R1  ;   8   R1 = lo(a)*hi(b)
                                    ;----
                                    ;  73
                                    ; 113 (carried forward)
                                    ;----
                                    ; 186

            ; hi * lo
            SWAP    R2              ;   6
            ANDI    #$FF,       R2  ;   8   R2 = hi(a)
            PULR    R3              ;  11   R3 = lo(b)
            MOVR    R3,         R0  ;   6   R0 = lo(b)
            ADDR    R2,         R3  ;   6   R3 = hi(a) + lo(b)
            SUBR    R0,         R2  ;   6   R2 = hi(a) - lo(b)
            BPL     @@pos_hl        ;   7
            NEGR    R2              ;   6
             
@@pos_hl:   ADDR    R5,         R3  ;   6   R3 = &qstbl[hi(a)+lo(b)]
            ADDR    R5,         R2  ;   6   R2 = &qstbl[hi(a)-lo(b)]
            ADD@    R3,         R1  ;   8   \_ R1 = lo(a)*hi(b) + hi(a)*lo(b)
            SUB@    R2,         R1  ;   8   /
                                    ;----
                                    ;  84
                                    ; 186 (carried forward)
                                    ;----
                                    ; 270

            SWAP    R1              ;   6   \_ shift upper product left 8
            ANDI    #$FF00,     R1  ;   8   /
            ADDR    R4,         R1  ;   6   final product
            PULR    PC              ;  12
                                    ;----
                                    ;  32
                                    ; 270 (carried forward)
                                    ;----
                                    ; 302
            ENDP

.

I spent maybe 20 minutes on the 8x8 implementation, and another 40 - 60 minutes on the 16x16 implementation. (I first did a naive implementation that called qs_mpy8, and then a smarter implementation that inlined everything.) Tonight, I ran both through literally millions of random tests, comparing against JLP's multiply accelerator.

(Side note: If you run jzIntv with rate control off (-r0), audio off (-a0), and minimize the window, it runs very, very fast. About 280x on my machine. 270x if you forget and leave audio on.)

As the comment says, I release this to the public domain. If you use this in one of your games / programs / whatever, I don't mind a tip of the hat my way if you happen to think of it. And if not, eh, we still all get more cool games.

If any of you find ways to improve this code (I'm sure it's possible), post it here and share with everyone!

Homework assignment for the motivated: Add an 8x16 multiply to the library. With all three, then programs like IntyBASIC have a nice arsenal for implementing multiplication efficiently, with no surprises for oddball multiplicands. (That's presuming IntyBASIC can figure out if the arguments to the multiply come from 8 bit variables or 16-bit variables/intermediate results.) I think adding an 8x16 multiply mainly requires removing the 'hl' portion of the 16x16 multiply code and a PSHR, saving ~93 cycles.

quarter_square_multiplication.asm

Edited April 1, 2015 by intvnut

intvnut · April 1, 2015

One thought I had while stuck in a 7AM meeting this morning: If you're willing to spend for an add'l 255 table entries, you can shave ~13 cycles off the 8x8 multiply and ~39 cycles off the worst-case 16x16 multiply by eliminating the BPL/NEGR statements.

For the 8x8 multiply, that brings it down to 57 cycles. It's a small enough hunk of code that you could inline it, making it only 50 cycles inline. That's not bad if your game needs an arbitrary multiply.

Note that if you have a fixed multiplier (ie. you're multiplying by 20 or by 50 or by some other constant), you're better off using a precomputed set of shifts and adds. I believe dZ-Jay has the most up to date version of the set of macros I published for constant multiplies up to 127, and it's likely elsewhere in this forum too if you dig for it.

The quarter-square technique is simple enough that if you really wanted to optimize the routine and table-size for a specific game, it's fairly trivial to do so. I wish I had known about the technique sooner!

Final note: Here's the C model I had written to experiment with it as part of writing the assembly code. It really is this simple.

.

#include <stdio.h>

int qsqr8_tbl[511];

// 8x8 unsigned multiply
int qsqr8( int a, int b )
{
    int s = a + b;
    int d = a - b;

    if ( d < 0 ) d = -d;

    return qsqr8_tbl[s] - qsqr8_tbl[d];
}

int qsqr16( int a, int b )
{
    int plo = qsqr8( a & 0xFF, b & 0xFF );
    int phi = qsqr8( a & 0xFF, b >> 8 ) + qsqr8( a >> 8, b & 0xFF );

    return (plo + (phi << ) & 0xFFFF;
}

int main()
{
    int a, b, i;

    for ( i = 0; i < 511; i++ )
        qsqr8_tbl[i] = i*i >> 2;

    for ( a = 0; a < 256; a++ )
        for ( b = 0; b < 256; b++ )
        {
            int p = a * b;
            int q = qsqr8( a, b );

            if ( p != q )
                printf(" %.2X %.2X %.4X %.4X\n", a, b, p, q );
        }

    for ( a = 0; a < 65536; a++ )
        for ( b = 0; b < 65536; b++ )
        {
            int p = (a * b) & 0xFFFF;
            int q = qsqr16( a, b );

            if ( p != q )
                printf(" %.4X %.4X %.4X %.4X\n", a, b, p, q );
        }

    return 0;
}

.

intvnut · April 2, 2015

Ok, if you're willing to blow an extra 255 words for the lookup table, you can make this really fast. 49 cycles for 8x8=>16, and 225 cycles for 16x16=>16.

.

; Quarter Square Multiplication
; Assembly code by Joe Zbiciak, 2015
; Released to public domain.

QSQR8_TBL   PROC
            DECLE   $3F80, $3F01, $3E82, $3E04, $3D86, $3D09, $3C8C, $3C10
            DECLE   $3B94, $3B19, $3A9E, $3A24, $39AA, $3931, $38B8, $3840
            DECLE   $37C8, $3751, $36DA, $3664, $35EE, $3579, $3504, $3490
            DECLE   $341C, $33A9, $3336, $32C4, $3252, $31E1, $3170, $3100
            DECLE   $3090, $3021, $2FB2, $2F44, $2ED6, $2E69, $2DFC, $2D90
            DECLE   $2D24, $2CB9, $2C4E, $2BE4, $2B7A, $2B11, $2AA8, $2A40
            DECLE   $29D8, $2971, $290A, $28A4, $283E, $27D9, $2774, $2710
            DECLE   $26AC, $2649, $25E6, $2584, $2522, $24C1, $2460, $2400
            DECLE   $23A0, $2341, $22E2, $2284, $2226, $21C9, $216C, $2110
            DECLE   $20B4, $2059, $1FFE, $1FA4, $1F4A, $1EF1, $1E98, $1E40
            DECLE   $1DE8, $1D91, $1D3A, $1CE4, $1C8E, $1C39, $1BE4, $1B90
            DECLE   $1B3C, $1AE9, $1A96, $1A44, $19F2, $19A1, $1950, $1900
            DECLE   $18B0, $1861, $1812, $17C4, $1776, $1729, $16DC, $1690
            DECLE   $1644, $15F9, $15AE, $1564, $151A, $14D1, $1488, $1440
            DECLE   $13F8, $13B1, $136A, $1324, $12DE, $1299, $1254, $1210
            DECLE   $11CC, $1189, $1146, $1104, $10C2, $1081, $1040, $1000
            DECLE   $0FC0, $0F81, $0F42, $0F04, $0EC6, $0E89, $0E4C, $0E10
            DECLE   $0DD4, $0D99, $0D5E, $0D24, $0CEA, $0CB1, $0C78, $0C40
            DECLE   $0C08, $0BD1, $0B9A, $0B64, $0B2E, $0AF9, $0AC4, $0A90
            DECLE   $0A5C, $0A29, $09F6, $09C4, $0992, $0961, $0930, $0900
            DECLE   $08D0, $08A1, $0872, $0844, $0816, $07E9, $07BC, $0790
            DECLE   $0764, $0739, $070E, $06E4, $06BA, $0691, $0668, $0640
            DECLE   $0618, $05F1, $05CA, $05A4, $057E, $0559, $0534, $0510
            DECLE   $04EC, $04C9, $04A6, $0484, $0462, $0441, $0420, $0400
            DECLE   $03E0, $03C1, $03A2, $0384, $0366, $0349, $032C, $0310
            DECLE   $02F4, $02D9, $02BE, $02A4, $028A, $0271, $0258, $0240
            DECLE   $0228, $0211, $01FA, $01E4, $01CE, $01B9, $01A4, $0190
            DECLE   $017C, $0169, $0156, $0144, $0132, $0121, $0110, $0100
            DECLE   $00F0, $00E1, $00D2, $00C4, $00B6, $00A9, $009C, $0090
            DECLE   $0084, $0079, $006E, $0064, $005A, $0051, $0048, $0040
            DECLE   $0038, $0031, $002A, $0024, $001E, $0019, $0014, $0010
            DECLE   $000C, $0009, $0006, $0004, $0002, $0001, $0000
@@mid:
            DECLE   $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C
            DECLE   $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038
            DECLE   $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084
            DECLE   $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0
            DECLE   $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C
            DECLE   $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228
            DECLE   $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4
            DECLE   $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0
            DECLE   $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC
            DECLE   $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618
            DECLE   $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764
            DECLE   $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0
            DECLE   $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C
            DECLE   $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08
            DECLE   $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4
            DECLE   $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0
            DECLE   $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC
            DECLE   $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8
            DECLE   $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644
            DECLE   $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0
            DECLE   $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C
            DECLE   $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8
            DECLE   $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4
            DECLE   $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0
            DECLE   $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC
            DECLE   $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8
            DECLE   $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24
            DECLE   $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090
            DECLE   $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C
            DECLE   $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8
            DECLE   $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94
            DECLE   $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80
            DECLE   $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C
            DECLE   $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8
            DECLE   $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04
            DECLE   $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070
            DECLE   $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC
            DECLE   $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8
            DECLE   $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74
            DECLE   $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360
            DECLE   $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C
            DECLE   $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98
            DECLE   $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4
            DECLE   $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850
            DECLE   $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC
            DECLE   $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388
            DECLE   $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954
            DECLE   $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40
            DECLE   $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C
            DECLE   $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78
            DECLE   $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4
            DECLE   $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830
            DECLE   $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC
            DECLE   $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568
            DECLE   $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34
            DECLE   $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320
            DECLE   $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C
            DECLE   $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158
            DECLE   $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4
            DECLE   $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010
            DECLE   $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C
            DECLE   $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48
            DECLE   $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714
            DECLE   $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01
            ENDP

; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values
; Destroys R1
qs_mpy8     PROC
            MOVR    R0,             R2      ;   6
            ADDI    #QSQR8_TBL.mid, R1      ;   8
            ADDR    R1,             R2      ;   6   a + b
            SUBR    R0,             R1      ;   6   a - b
@@ok:       MVI@    R2,             R2      ;   8
            SUB@    R1,             R2      ;   8
            JR      R5                      ;   7
                                            ;----
                                            ;  49
            ENDP
            

; R1 = R0 * R1, where R0 and R1 are 16-bit values
; destroys R0, R2, R3, R4, R5
qs_mpy16    PROC
            PSHR    R5                  ;   9
                                   
            ; Unpack lo/hi
            MOVR    R0,         R2      ;   6   
            ANDI    #$FF,       R0      ;   8   R0 is lo(a)
            XORR    R0,         R2      ;   6   
            SWAP    R2                  ;   6   R2 is hi(a)

            MOVR    R1,         R3      ;   6   R3 is orig 16-bit b
            ANDI    #$FF,       R1      ;   8   R1 is lo(b)
            MOVR    R1,         R5      ;   6   R5 is lo(b)
            XORR    R1,         R3      ;   6   
            SWAP    R3                  ;   6   R3 is hi(b)
                                        ;----
                                        ;  67
                                        
            ; lo * lo                   
            MOVR    R0,         R4      ;   6   R4 is lo(a)
            ADDI    #QSQR8_TBL.mid, R1  ;   8
            ADDR    R1,         R4      ;   6   R4 = lo(a) + lo(b)
            SUBR    R0,         R1      ;   6   R1 = lo(a) - lo(b)
                                        
@@pos_ll:   MVI@    R4,         R4      ;   8   R4 = qstbl[lo(a)+lo(b)]
            SUB@    R1,         R4      ;   8   R4 = lo(a)*lo(b)
                                        ;----
                                        ;  42
                                        ;  67 (carried forward)
                                        ;----
                                        ; 109
                                       
            ; lo * hi                  
            MOVR    R0,         R1      ;   6   R0 = R1 = lo(a)
            ADDI    #QSQR8_TBL.mid, R3  ;   8
            ADDR    R3,         R1      ;   6   R1 = hi(b) + lo(a)
            SUBR    R0,         R3      ;   6   R3 = hi(b) - lo(a)
                                       
@@pos_lh:   MVI@    R1,         R1      ;   8   R1 = qstbl[hi(b)-lo(a)]
            SUB@    R3,         R1      ;   8   R1 = lo(a)*hi(b)
                                        ;----
                                        ;  42
                                        ; 109 (carried forward)
                                        ;----
                                        ; 151
                                       
            ; hi * lo                  
            MOVR    R5,         R0      ;   6   R5 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8
            ADDR    R2,         R5      ;   6   R3 = hi(a) + lo(b)
            SUBR    R0,         R2      ;   6   R2 = hi(a) - lo(b)
                                       
@@pos_hl:   ADD@    R5,         R1      ;   8   \_ R1 = lo(a)*hi(b)+hi(a)*lo(b)
            SUB@    R2,         R1      ;   8   /
                                        ;----
                                        ;  42
                                        ; 151 (carried forward)
                                        ;----
                                        ; 193
                                       
            SWAP    R1                  ;   6   \_ shift upper product left 8
            ANDI    #$FF00,     R1      ;   8   /
            ADDR    R4,         R1      ;   6   final product
            PULR    PC                  ;  12
                                        ;----
                                        ;  32
                                        ; 193 (carried forward)
                                        ;----
                                        ; 225
            ENDP

.

And yes, it's tested.

EDIT: ...and updated to the version with the corrected cycle counting. 225 cycles!

qsqr3.asm

Edited April 2, 2015 by intvnut

intvnut · April 2, 2015

BTW... if you're following this thread, I did make a couple edits and got the final count to 216 cycles. Calling it a night on this version, as dinner's ready.

EDIT: GAH! 225 cycles. I missed the 9 cycles for the PSHR at the top when counting cycles in the most recent version. What's 9 cycles among friends?

Edited April 2, 2015 by intvnut

+nanochess · April 2, 2015

Ok, if you're willing to blow an extra 255 words for the lookup table, you can make this really fast. 49 cycles for 8x8=>16, and 229 cycles for 16x16=>16.

.

; Quarter Square Multiplication
; Assembly code by Joe Zbiciak, 2015
; Released to public domain.

QSQR8_TBL   PROC
            DECLE   $3F80, $3F01, $3E82, $3E04, $3D86, $3D09, $3C8C, $3C10
            DECLE   $3B94, $3B19, $3A9E, $3A24, $39AA, $3931, $38B8, $3840
            DECLE   $37C8, $3751, $36DA, $3664, $35EE, $3579, $3504, $3490
            DECLE   $341C, $33A9, $3336, $32C4, $3252, $31E1, $3170, $3100
            DECLE   $3090, $3021, $2FB2, $2F44, $2ED6, $2E69, $2DFC, $2D90
            DECLE   $2D24, $2CB9, $2C4E, $2BE4, $2B7A, $2B11, $2AA8, $2A40
            DECLE   $29D8, $2971, $290A, $28A4, $283E, $27D9, $2774, $2710
            DECLE   $26AC, $2649, $25E6, $2584, $2522, $24C1, $2460, $2400
            DECLE   $23A0, $2341, $22E2, $2284, $2226, $21C9, $216C, $2110
            DECLE   $20B4, $2059, $1FFE, $1FA4, $1F4A, $1EF1, $1E98, $1E40
            DECLE   $1DE8, $1D91, $1D3A, $1CE4, $1C8E, $1C39, $1BE4, $1B90
            DECLE   $1B3C, $1AE9, $1A96, $1A44, $19F2, $19A1, $1950, $1900
            DECLE   $18B0, $1861, $1812, $17C4, $1776, $1729, $16DC, $1690
            DECLE   $1644, $15F9, $15AE, $1564, $151A, $14D1, $1488, $1440
            DECLE   $13F8, $13B1, $136A, $1324, $12DE, $1299, $1254, $1210
            DECLE   $11CC, $1189, $1146, $1104, $10C2, $1081, $1040, $1000
            DECLE   $0FC0, $0F81, $0F42, $0F04, $0EC6, $0E89, $0E4C, $0E10
            DECLE   $0DD4, $0D99, $0D5E, $0D24, $0CEA, $0CB1, $0C78, $0C40
            DECLE   $0C08, $0BD1, $0B9A, $0B64, $0B2E, $0AF9, $0AC4, $0A90
            DECLE   $0A5C, $0A29, $09F6, $09C4, $0992, $0961, $0930, $0900
            DECLE   $08D0, $08A1, $0872, $0844, $0816, $07E9, $07BC, $0790
            DECLE   $0764, $0739, $070E, $06E4, $06BA, $0691, $0668, $0640
            DECLE   $0618, $05F1, $05CA, $05A4, $057E, $0559, $0534, $0510
            DECLE   $04EC, $04C9, $04A6, $0484, $0462, $0441, $0420, $0400
            DECLE   $03E0, $03C1, $03A2, $0384, $0366, $0349, $032C, $0310
            DECLE   $02F4, $02D9, $02BE, $02A4, $028A, $0271, $0258, $0240
            DECLE   $0228, $0211, $01FA, $01E4, $01CE, $01B9, $01A4, $0190
            DECLE   $017C, $0169, $0156, $0144, $0132, $0121, $0110, $0100
            DECLE   $00F0, $00E1, $00D2, $00C4, $00B6, $00A9, $009C, $0090
            DECLE   $0084, $0079, $006E, $0064, $005A, $0051, $0048, $0040
            DECLE   $0038, $0031, $002A, $0024, $001E, $0019, $0014, $0010
            DECLE   $000C, $0009, $0006, $0004, $0002, $0001, $0000
@@mid:
            DECLE   $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C
            DECLE   $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038
            DECLE   $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084
            DECLE   $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0
            DECLE   $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C
            DECLE   $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228
            DECLE   $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4
            DECLE   $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0
            DECLE   $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC
            DECLE   $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618
            DECLE   $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764
            DECLE   $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0
            DECLE   $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C
            DECLE   $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08
            DECLE   $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4
            DECLE   $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0
            DECLE   $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC
            DECLE   $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8
            DECLE   $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644
            DECLE   $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0
            DECLE   $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C
            DECLE   $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8
            DECLE   $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4
            DECLE   $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0
            DECLE   $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC
            DECLE   $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8
            DECLE   $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24
            DECLE   $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090
            DECLE   $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C
            DECLE   $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8
            DECLE   $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94
            DECLE   $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80
            DECLE   $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C
            DECLE   $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8
            DECLE   $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04
            DECLE   $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070
            DECLE   $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC
            DECLE   $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8
            DECLE   $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74
            DECLE   $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360
            DECLE   $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C
            DECLE   $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98
            DECLE   $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4
            DECLE   $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850
            DECLE   $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC
            DECLE   $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388
            DECLE   $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954
            DECLE   $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40
            DECLE   $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C
            DECLE   $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78
            DECLE   $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4
            DECLE   $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830
            DECLE   $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC
            DECLE   $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568
            DECLE   $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34
            DECLE   $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320
            DECLE   $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C
            DECLE   $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158
            DECLE   $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4
            DECLE   $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010
            DECLE   $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C
            DECLE   $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48
            DECLE   $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714
            DECLE   $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01
            ENDP

; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values
; Destroys R1
qs_mpy8     PROC
            MOVR    R0,             R2      ;   6
            ADDI    #QSQR8_TBL.mid, R1      ;   8
            ADDR    R1,             R2      ;   6   a + b
            SUBR    R0,             R1      ;   6   a - b
@@ok:       MVI@    R2,             R2      ;   8
            SUB@    R1,             R2      ;   8
            JR      R5                      ;   7
                                            ;----
                                            ;  49
            ENDP
            

; R1 = R0 * R1, where R0 and R1 are 16-bit values
; destroys R0, R2, R3, R4, R5
qs_mpy16    PROC
            PSHR    R5                  ;   9
                                   
            MOVR    R0,         R2      ;   6   R2 is orig 16-bit a
            MOVR    R1,         R3      ;   6   R3 is orig 16-bit b
                                        
            ; lo * lo                   
            ANDI    #$FF,       R0      ;   8   R0 is lo(a)
            MOVR    R0,         R4      ;   6   R4 is lo(a)
            ANDI    #$FF,       R1      ;   8   R1 is lo(b)
            MOVR    R1,         R5      ;   6   save lo(b)
                                       
            ADDI    #QSQR8_TBL.mid, R1  ;   8
            ADDR    R1,         R4      ;   6   R4 = lo(a) + lo(b)
            SUBR    R0,         R1      ;   6   R1 = lo(a) - lo(b)
                                        
@@pos_ll:   MVI@    R4,         R4      ;   8   R4 = qstbl[lo(a)+lo(b)]
            SUB@    R1,         R4      ;   8   R4 = lo(a)*lo(b)
                                        ;----
                                        ;  85
                                       
            ; lo * hi                  
            SWAP    R3                  ;   6   \_ R3 = hi(b)
            ANDI    #$FF,       R3      ;   8   /
            MOVR    R0,         R1      ;   6   R0 = R1 = lo(a)
            ADDI    #QSQR8_TBL.mid, R3  ;   8
            ADDR    R3,         R1      ;   6   R1 = hi(b) + lo(a)
            SUBR    R0,         R3      ;   6   R3 = hi(b) - lo(a)
                                       
@@pos_lh:   MVI@    R1,         R1      ;   8   R1 = qstbl[hi(b)-lo(a)]
            SUB@    R3,         R1      ;   8   R1 = lo(a)*hi(b)
                                        ;----
                                        ;  56
                                        ;  85 (carried forward)
                                        ;----
                                        ; 141
                                       
            ; hi * lo                  
            SWAP    R2                  ;   6
            ANDI    #$FF,       R2      ;   8   R2 = hi(a)
            MOVR    R5,         R0      ;   6   R5 = R0 = lo(b)
            ADDI    #QSQR8_TBL.mid, R2  ;   8
            ADDR    R2,         R5      ;   6   R3 = hi(a) + lo(b)
            SUBR    R0,         R2      ;   6   R2 = hi(a) - lo(b)
                                       
@@pos_hl:   ADD@    R5,         R1      ;   8   \_ R1 = lo(a)*hi(b)+hi(a)*lo(b)
            SUB@    R2,         R1      ;   8   /
                                        ;----
                                        ;  56
                                        ; 141 (carried forward)
                                        ;----
                                        ; 197
                                       
            SWAP    R1                  ;   6   \_ shift upper product left 8
            ANDI    #$FF00,     R1      ;   8   /
            ADDR    R4,         R1      ;   6   final product
            PULR    PC                  ;  12
                                        ;----
                                        ;  32
                                        ; 197 (carried forward)
                                        ;----
                                        ; 229
            ENDP

.

And yes, it's tested.

I've been reading your posts and I've to say this routine is pretty impressive . Very nice work!

I'll try to include it in the next version of IntyBASIC, maybe with some case detection to include it only if it's used a var1*var2 expression

intvnut · April 2, 2015

I've been reading your posts and I've to say this routine is pretty impressive . Very nice work!

I'll try to include it in the next version of IntyBASIC, maybe with some case detection to include it only if it's used a var1*var2 expression

I guess there isn't an easy way to say a table lookup only returns an 8-bit value, is there? For First Spear's use case (the one that inspired all this, actually), if he uses the approach I suggested, he would have a lookup table that will return small values (fit well within 8 bits), but are served well by a lookup table.

Alternately, does it make sense to have a rand8(val) primitive as part of the language that returns a random 8 bit value that's in the range [ 0, val ), and that could use the 8 x 8 multiply? ie.

.

    rand_result = rand8( val )

.

that expands to something like

.

        ; assume 'val' in R0; result in R0.  Trashes R1, R2.
        ANDI    #$FF,           R0    ;   8
        MOVR    R0,             R2    ;   6
        MVII    #QSQR8_TBL.mid, R1    ;   8
        ADD     _rand,          R1    ;  10
        ADDR    R1,             R2    ;   6
        SUBR    R0,             R1    ;   6
        MVI@    R2,             R0    ;   8
        SUB@    R1,             R0    ;   8
        SWAP    R0                    ;   6
        ANDI    #$FF,           R0    ;   8
                                      ;----
                                      ;  74

.

A flat 74 cycles gives you an 8 bit random number in the desired range, with minimal bias.

Edited April 2, 2015 by intvnut

+nanochess · April 2, 2015

I guess there isn't an easy way to say a table lookup only returns an 8-bit value, is there? For First Spear's use case (the one that inspired all this, actually), if he uses the approach I suggested, he would have a lookup table that will return small values (fit well within 8 bits), but are served well by a lookup table.

Alternately, does it make sense to have a rand8(val) primitive as part of the language that returns a random 8 bit value that's in the range [ 0, val ), and that could use the 8 x 8 multiply? ie.

.
    rand_result = rand8( val )
.

that expands to something like

.
        ; assume 'val' in R0; result in R0.  Trashes R1, R2.
        ANDI    #$FF,           R0    ;   8
        MOVR    R0,             R2    ;   6
        MVII    #QSQR8_TBL.mid, R1    ;   8
        ADD     _rand,          R1    ;  10
        ADDR    R1,             R2    ;   6
        SUBR    R0,             R1    ;   6
        MVI@    R2,             R0    ;   8
        SUB@    R1,             R0    ;   8
        SWAP    R0                    ;   6
        ANDI    #$FF,           R0    ;   8
                                      ;----
                                      ;  74
.

A flat 74 cycles gives you an 8 bit random number in the desired range, with minimal bias.

Pretty reasonable indeed!

Let me think about it.

freewheel · April 2, 2015

We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster

255 words to speed up multiplication? That's a substantial amount of ROM for one calculation - hopefully there's a way to easily toggle inclusion on or off depending on a person's needs. Yeah, most games can spare it, but still. If performance isn't a big worry, that's an entire screen of graphics

intvnut · April 2, 2015

We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster

255 words to speed up multiplication? That's a substantial amount of ROM for one calculation - hopefully there's a way to easily toggle inclusion on or off depending on a person's needs. Yeah, most games can spare it, but still. If performance isn't a big worry, that's an entire screen of graphics

The total table is 766 words, actually.

But really, these days, are there any games that can't spare it? The allophone library is about twice that size, for comparison. I'd say CPU cycles are in shorter supply.

If you're going all-out for ROM space, you can get up to around 50K before you have to page-flip. 766 words is about 1.5% of that. But from a performance perspective, a 225 cycle multiply regardless of the operands is waaay cheaper than the current cost (up to ~90 seconds if multiplying by -1!!). 225 cycles is, coincidentally, about 1.5% of a frame time.

Ok, ok, I posted iterative shift-and-add methods that are about 3x the cycles but have no lookup table. Still, ROM seems so cheaaaaap. I'd rather bloat the multiply and work on solving page-flipped ROMs in IntyBASIC than worrying about shrinking the multiply codesize while retaining the lovely cycle count. Solving page-flipped ROMs opens more doors.

Edited April 2, 2015 by intvnut

intvnut · April 2, 2015

We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster

BTW, multiplication may be better than modulo, depending on the multiplier. If your random value falls in an 8 bit range, you might consider (rand * range) / 256. This is especially true if 'range' is a compile-time constant.

freewheel · April 2, 2015

The total table is 766 words, actually.

But really, these days, are there any games that can't spare it? The allophone library is about twice that size, for comparison. I'd say CPU cycles are in shorter supply.

766? Yikes. That's almost a quarter of the size of entire GAMES from back in the day

I'd say that it depends on the game, and the intended target. Are we assuming that every place a person will ever use a ROM in the future will have gobs and gobs of storage? That may be the case, but who knows. I've yet to come even close to heavy cycle counting - maybe I just don't write computationally intensive games.

Don't get me wrong, I love the idea. As you've been able to eke out, I'm using a LOT of table-driven ROM waste to simplify things (OK, sometimes it saves ROM space too...). I'm a big fan of being smart with resources. But adding 1.5% to every single homebrew from here on in, for a single function that may not even be necessary? I guess I'm more worried about the precedent it sets. And about a future where a one line "Hello World" INTV game uses 20KB just to fit in all the library functions that we're sure "everyone" needs. How quickly does one become three become ten? I didn't realize the allophones were being added regardless of Voice support. That seems a bit... excessive. Maybe it's time to figure out how to *easily* include/exclude these extra features?

Of course things might get complicated once we start discussing the ability to turn on "fast multiplication", or similar things. I'm not sure how easy it would be to have the compiler use "standard" multiplication unless that library/function is included/selected.

Also I've lost track of why there's such a performance hit here. I don't follow where a multiplication can take 90 seconds, at any value. I ran this and it happens well within a single frame (that's as granular as I could bother to look):

a=a+1
b=b-1
test=a*b

Now that I think about it, I've definitely gotten lost somewhere because there's no way that this is what you're talking about.

intvnut · April 2, 2015

766? Yikes. That's almost a quarter of the size of entire GAMES from back in the day

I'd say that it depends on the game, and the intended target. Are we assuming that every place a person will ever use a ROM in the future will have gobs and gobs of storage? That may be the case, but who knows. I've yet to come even close to heavy cycle counting - maybe I just don't write computationally intensive games.

Don't get me wrong, I love the idea. As you've been able to eke out, I'm using a LOT of table-driven ROM waste to simplify things (OK, sometimes it saves ROM space too...). I'm a big fan of being smart with resources. But adding 1.5% to every single homebrew from here on in, for a single function that may not even be necessary? I guess I'm more worried about the precedent it sets. And about a future where a one line "Hello World" INTV game uses 20KB just to fit in all the library functions that we're sure "everyone" needs. How quickly does one become three become ten? I didn't realize the allophones were being added regardless of Voice support. That seems a bit... excessive. Maybe it's time to figure out how to *easily* include/exclude these extra features?

Of course things might get complicated once we start discussing the ability to turn on "fast multiplication", or similar things. I'm not sure how easy it would be to have the compiler use "standard" multiplication unless that library/function is included/selected.

Also I've lost track of why there's such a performance hit here. I don't follow where a multiplication can take 90 seconds, at any value. I ran this and it happens well within a single frame (that's as granular as I could bother to look):
a=a+1
b=b-1
test=a*b
Now that I think about it, I've definitely gotten lost somewhere because there's no way that this is what you're talking about.

URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables!

Still, 90+ ticks is fairly unacceptable for responsiveness in a game.

Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source:

.

        CLS
        #START = FRAME
        A = A * -1
        #TICKS = FRAME - #START
        PRINT "Done:  ", <5> #TICKS
done:   GOTO done

.

This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds.

(One moment... still wiping some egg off... anyway.)

On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure.

On "Hello World" costing 20KB.... That seems like a specious argument. You may as well include in that measure the 4K words (8K bytes?) of the EXEC, because that's there too. Throw in the 2K bytes of GROM while you're at it. Just getting to the title screen on an unmodified Intellivision requires 10K bytes of code (without so much as a byte of IntyBASIC) by those rules.

I think it's perfectly acceptable to have a healthy support library under you when writing a game. Yes, it can make trivial "Hello World" examples look appallingly large vs. a true bare-metal example—grab your soldering iron to make it happen!—but that's a shade ingenuine. Any reasonable game will leverage that library code to great effect.

GWBASIC.EXE (remember that?) was nearly 80K. Larger when you consider the CGA font and BIOS in ROM that it leveraged. We're still operating on a bargain budget in comparison, with better results! I'd like to see you write GoatNom in GWBASIC on a PC XT and have it be as fluid. (And yes, I consider IntyBASIC to be more in the GWBASIC territory—or at least some of the later 16K and 32K BASICs—with its graphics and music capability, as compared to the simpler 8K and smaller text-only BASICs.)

Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision.

Edited April 2, 2015 by intvnut

+nanochess · April 2, 2015

Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision.

That's my main objective with IntyBASIC. That the core functions are efficient, specially the most used functions.

This way the programmer doesn't lose his/her time in optimization purposes or thinking in reworking in assembler code, because he/she would get only a tiny fraction of speed-up. IntyBASIC tries to put a reasonable balance between assembler and compiled code.

Also the support library must be fast and of reasonable size. It wouldn't be very useful (for example) to have a music tracker that is so slow, big or complex that it's effectively unusable.

freewheel · April 2, 2015

URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables!

Still, 90+ ticks is fairly unacceptable for responsiveness in a game.

Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source:

.
        CLS
        #START = FRAME
        A = A * -1
        #TICKS = FRAME - #START
        PRINT "Done:  ", <5> #TICKS
done:   GOTO done
.

This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds.

Ah. There's my problem. I'd never in a million years be doing multiplication with 16 bit variables (or negative constant operands) in an environment such as this. I had to go to this to replicate what you're seeing (and yeah it's pretty painful):

#a=#a+1
#b=-1
#test=#a*#b    'or just -1 in place of #b

I'm not sure I'll use the word "contrived" for that, but it doesn't seem like something I'd be using terribly often. I use 255 as -1 all the time and it works swimmingly for most calculations, including multiplications. It helps that in my world, the land of negative numbers really only exists in the -1,0,1 band - and that I pretty much never do complex math with the very few 16-bit variables available

On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure.

Is this conditional assembling? I thought it was conditional compilation, too - but when I looked last night, the code is all there in the final .asm. Is that what IF DEFINED does during final assembly? Because many sections of the Intellivoice driver don't seem to be enclosed with that directive (although the allophones most definitely are).

Edited April 2, 2015 by freeweed

intvnut · April 2, 2015

P.S. My apologies if I sounded a little cranky last night. Been fighting with some awful code at work. Bleh.

intvnut · April 2, 2015

Ah. There's my problem. I'd never in a million years be doing multiplication with 16 bit variables (or negative constant operands) in an environment such as this. I had to go to this to replicate what you're seeing (and yeah it's pretty painful):
#a=#a+1
#b=-1
#test=#a*#b    'or just -1 in place of #b
I'm not sure I'll use the word "contrived" for that, but it doesn't seem like something I'd be using terribly often. I use 255 as -1 all the time and it works swimmingly for most calculations, including multiplications. It helps that in my world, the land of negative numbers really only exists in the -1,0,1 band - and that I pretty much never do complex math with the very few 16-bit variables available

You don't need 16 bit variables on the LHS to make it happen. A = A * -1 is enough to see it.

Honestly, a negative value is more likely to pop out of an intermediate result, as those are 16 bits. I just wrote the example as I did to make it clear. In theory, IntyBASIC could flip the negative constant in my example to positive before multiplying (and thus be able to match the constant against the optimized constant routines), followed by a NEGR on the product. That'd fix my example without fixing the underlying issue.

If someone happens to write a * (b - c), and (b - c) is occasionally negative, they could be left wondering why their program suddenly went pear shaped. That's what I mean by surprising. If they're able to narrow it down to that expression (and be honest, until these performance threads, would you have been able to identify that expression as troublesome?), would they think to try (b - c) * a as a fix? Why should it even be a fix? But, in IntyBASIC 1.0.3 and earlier, it is.

To me, it just seems to go against the BASIC ethos—Beginner's All-purpose Symbolic Instruction Code—that there might be such big booby traps in innocuous looking statements.

Is this conditional assembling? I thought it was conditional compilation, too - but when I looked last night, the code is all there in the final .asm. Is that what IF DEFINED does during final assembly? Because many sections of the Intellivoice driver don't seem to be enclosed with that directive (although the allophones most definitely are).

IF DEFINED is very similar to #ifdef in C/C++. If the corresponding symbol is defined, then that hunk of code gets brought in. Since AS1600 doesn't include a linker, the IntyBASIC epilogue file just includes all the library components, and then disables pieces with conditional assembly directives.

In principle it should be possible to leave the voice driver out in addition to the allophones. If it's not doing that, then that's a potential avenue for improvement.

Edited April 2, 2015 by intvnut

freewheel · April 2, 2015

Hmm. I suspect that my subconscious has hundreds of little optimization secrets (ie: "things to avoid doing on hardware or in languages that are possibly inefficient") that I don't even realize are there, and that's why I habitually avoid these issues without even realizing it. I'd never, ever use a=a*-1. In fact knowing that we're dealing with unsigned integers, I just automatically sub in 255 whenever this happens (and it's usually controlled by an 8-bit variable anyway, controlling something like direction or speed). So I've just never run across it. Other programmers might not have these sorts of things burned into their memory. So yeah, I can see where this could trip someone up.

And "pear shaped" is putting it mildly. It absolutely brings the program to a screeching halt. *I* would be able to narrow it down because I have a knack for spotting the problem line very quickly when things go haywire (it helps that I compile and run after damned near any remotely complex addition, comp sci programming practices 101 forever!), but yeah. Some people wouldn't in a million years think that a simple math operation could go so crazy, so I could see someone spending hours trying to figure things out. I'm surprised it hasn't been discovered before - either IntyBASIC still has a fairly limited audience of people who instinctively avoid these sorts of traps, or we've just been lucky so far.

intvnut · April 2, 2015

And "pear shaped" is putting it mildly. It absolutely brings the program to a screeching halt. *I* would be able to narrow it down because I have a knack for spotting the problem line very quickly when things go haywire (it helps that I compile and run after damned near any remotely complex addition, comp sci programming practices 101 forever!), but yeah. Some people wouldn't in a million years think that a simple math operation could go so crazy, so I could see someone spending hours trying to figure things out. I'm surprised it hasn't been discovered before - either IntyBASIC still has a fairly limited audience of people who instinctively avoid these sorts of traps, or we've just been lucky so far.

I'm going to go with a little bit of "lucky" and a little bit of "I don't know why this didn't work, so I just rewrote it randomly several times until it did." (Nearly 20 years in engineering has exposed me to many, many people in that second category. And even I have been in that second category occasionally...)

Edited April 2, 2015 by intvnut

freewheel · April 2, 2015

I'm going to go with a little bit of "lucky" and a little bit of "I don't know why this didn't work, so I just rewrote it randomly several times until it did." (Nearly 20 years in engineering has exposed me to many, many people in that second category. And even I have been in that second category occasionally...)

Put me solidly in the 2nd category, although with a fair bit of "good hunches usually pay off" mixed in. I'm used to designing around the limitations of any system I deal with, even if I don't much understand them, so I fly by the seat of my pants while doing this. And usually succeed.

I look at this as similar to one of my favourite optimizations that Oscar put in: SCREEN label[,origin_offset,target_offset,cols,rows].

He could have sat back and said "let's keep this simple and easy for newbies, and who cares, consuming 240 words for every screen isn't THAT wasteful in the grand scheme of things, ROM is cheap". Instead, he's built in a granularity that allows me as programmer to decide if and when I want to use the ROM space, while still using the simplicity of the SCREEN command. Obviously I could write ranged POKE loops if I wanted (or whatever would work best) but this was just so damned elegant. And in the case of my mega-project, it's saved me close to 3.5k words (I'm creating a lot of half screens). Just that one thought of "hey, why be wasteful" has given me a tremendous amount of space back - space that I can use with more graphics, text, whatever, just to make things that much more grand in scope. Even if that wasn't his original intention when writing that syntax, it's how it played out

Like I said earlier, I have no idea how you'd toggle something like math optimizations. Should it be included by default? Almost certainly. But I just don't think "you can always edit it out of the epilogue" is a reasonable answer for those that want to be a bit more efficient when they can. Surely there's a middle ground here, but as someone with exceedingly poor grasp of language syntax and compiler design, I'm the last guy to figure out how. So instead I complain And I'll admit that I'm on a bit of a storage reclaim mission right now, so my priorities may be a bit out of whack.

intvnut · April 2, 2015

I look at this as similar to one of my favourite optimizations that Oscar put in: SCREEN label[,origin_offset,target_offset,cols,rows].

He could have sat back and said "let's keep this simple and easy for newbies, and who cares, consuming 240 words for every screen isn't THAT wasteful in the grand scheme of things, ROM is cheap". Instead, he's built in a granularity that allows me as programmer to decide if and when I want to use the ROM space, while still using the simplicity of the SCREEN command. Obviously I could write ranged POKE loops if I wanted (or whatever would work best) but this was just so damned elegant. And in the case of my mega-project, it's saved me close to 3.5k words (I'm creating a lot of half screens). Just that one thought of "hey, why be wasteful" has given me a tremendous amount of space back - space that I can use with more graphics, text, whatever, just to make things that much more grand in scope. Even if that wasn't his original intention when writing that syntax, it's how it played out

I actually see that a bit differently. SCREENs scale in the size of the game. With each SCREEN statement, you increase the ROM size accordingly. So, you have something that scales with the size of the project. (There's the additional detail that this syntax allows you to update just a portion of the screen, which by itself is useful irrespective of ROM size / cycle count arguments.)

In contrast, library code is fixed in size, no matter what the size of your project. "Hello World" and "War And Peace: The RPG" both bring the same library with them. (Granted, conditional assembly tricks may drop out unused library bits, but that effect is limited.)

It makes a ton of sense to optimize your data structures and layout, because those will grow as you add more content to your game. The efficiency of your data structures determines rate at which they grow as you add more content to the game. The multiply library, however, will stay the same size, no matter how many multiplies you use. It's a fixed overhead. (I suppose if you use zero multiplies, it could be possible to make it disappear.)

One easy way to allow configurability, perhaps, is to leverage the conditional assembly directives and some creativity in the intybasic_prologue.asm and intybasic_epilogue.asm. Consider this idea:

.

;; Put this in intybasic_prologue.asm

_mathlib  PROC
@@FAST    EQU   0
@@COMPACT EQU   1
          ENDP

_mathlib.type SET _mathlib.FAST        ; default
          
MACRO MATHLIB %strategy%
_mathlib.type SET _mathlib.%strategy%
ENDM

.

;; Put this in intybasic_epilogue.asm

    IF _mathlib.type = _mathlib.FAST
        ; put code for super fast multiply, etc. here
    ELSE
        ; put more compact shift/add implementations here
    ENDI

.

Then, in your BASIC program, if you want to switch strategies, you can add a single ASM statement at the top:

.

ASM MATHLIB COMPACT

.

Of course, if the rand8() I proposed requires that lookup table, then rand8() will always have to make a CALL to do the multiply so the IF/ELSE/ENDI can work its magic.

freewheel · April 2, 2015

The multiply library, however, will stay the same size, no matter how many multiplies you use. It's a fixed overhead. (I suppose if you use zero multiplies, it could be possible to make it disappear.)

You're technically correct, which is the best kind of correct. But just because it's a fixed overhead doesn't automatically mean it always makes sense.

I could design a fixed overhead library function that consumes 20k words and saves 1% of your cycles. And it would be idiotic to do that, except in that once instance where you're trying to wring out every last cycle. Yes, it's an extreme example, but I've seen that kind of thinking before (*avoids early JAVA rant*).

Don't get me wrong - there's an obvious and massive difference in scope between a fairly small fixed cost and something that scales. My point is that it's easy to say "eh, it's just a few bytes" (albeit added to every single program until the end of time), while solving a problem that may or may not ever come up. I multiply all over the place, often in very stupid/lazy ways, and have never run into this. Fundamentally I agree with you though. Basic math is just too damned ... basic. And you'll really get me onboard if this library could also mean support for exponentiation

Which opens a broader question - is this some sort of quirk with the way IntyBASIC was designed that we have this gaping performance hole? Or is this situation semi-common on older architectures/languages?

In terms of implementing it as an option, I figured it would be something a bit kludgy like that. Not sure how maintainable that gets, especially as more of these library functions are added. So I'd probably prefer just losing the ROM space in every program.

intvnut · April 2, 2015

You're technically correct, which is the best kind of correct. But just because it's a fixed overhead doesn't automatically mean it always makes sense.

I could design a fixed overhead library function that consumes 20k words and saves 1% of your cycles. And it would be idiotic to do that, except in that once instance where you're trying to wring out every last cycle. Yes, it's an extreme example, but I've seen that kind of thinking before (*avoids early JAVA rant*).

True, and it's a fair point. Fixed costs are acceptable if you can amortize them effectively. It's rather hard to amortize 20K words over a total 50K word budget. It's maybe a little easier to amortize 800 words. We can all build our strawmen as tall as we like. ;-)

Which opens a broader question - is this some sort of quirk with the way IntyBASIC was designed that we have this gaping performance hole? Or is this situation semi-common on older architectures/languages?

I think it's the growing pains of a young and evolving language. Recall all the performance issues folks had trying to print decimal numbers, and then we brought in the PRNUM routines. To me this process looks like steady improvement on the things that can benefit, ideally attacked in Pareto order of importance. And that order gets determined by what people actually use.

In this particular case, nanochess put in a very simple multiply routine that was guaranteed to work, and most importantly, he shipped it. That means everyone gets to use it, and that by itself is useful. A program that works (even if it has some 'gotchas') is far more useful than a program that doesn't exist yet because you're trying to make it perfect. "Ship early, ship often" as the Linux folks say. It can always be refined in a future version, and as we've seen, that's exactly what's happened.

That's a lesson I sometimes fail to heed. (Granted, the rules are a bit different when shipping in an actual ROM / EPROM. There, you really do need to get it as close to 'perfect' as possible before shipping because you can't upgrade in the field.)

As for whether this particular performance quirk was common elsewhere? I can't say I've ever run into a multiply like that in other languages. But, I haven't intersected those languages this early in their lifetimes either.

freewheel · April 2, 2015

I guess I meant my question more in a generic sense, not specifically about IntyBASIC. I assume this comes about because the architecture has no native multiply instruction. How is this handled in other situations? Does everyone come up with their own multiply routines?

I think part of my consternation with the whole thing is that I'm kinda naively assuming that basic math functions are a "solved problem". Yeah, there's always some clever algorithm being discovered to shave a bit here and there, but this one is a whopper (in specific cases). The print numbers routine was something a lot less fundamental to mathematics (and kinda architecture-dependent). It's no surprise that that came in a later version. PRINT, in general, will probably always have room for optimization (I've done it myself via some quick and dirty lookup tables).

All of which makes me realize that I've never actually spent the time to see just what "comes with" IntyBASIC - ie: what does a very simple program compile to. I naively assume that things like math just translate into a tiny handful of instructions, and the things that all languages share are implemented very similarly - so I spent my time eyeballing the more "language-specific" features. And I figured I'd ask you because if anyone knows how things like low-level arithmetic routines get implemented on low-end hardware, you seem like the right guy to ask

Edited April 2, 2015 by freeweed

+DZ-Jay · April 2, 2015

URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables!

Still, 90+ ticks is fairly unacceptable for responsiveness in a game.

Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source:

.
        CLS
        #START = FRAME
        A = A * -1
        #TICKS = FRAME - #START
        PRINT "Done:  ", <5> #TICKS
done:   GOTO done
.
This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds.

(One moment... still wiping some egg off... anyway.)

On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure.

On "Hello World" costing 20KB.... That seems like a specious argument. You may as well include in that measure the 4K words (8K bytes?) of the EXEC, because that's there too. Throw in the 2K bytes of GROM while you're at it. Just getting to the title screen on an unmodified Intellivision requires 10K bytes of code (without so much as a byte of IntyBASIC) by those rules.

I think it's perfectly acceptable to have a healthy support library under you when writing a game. Yes, it can make trivial "Hello World" examples look appallingly large vs. a true bare-metal examplegrab your soldering iron to make it happen!but that's a shade ingenuine. Any reasonable game will leverage that library code to great effect.

GWBASIC.EXE (remember that?) was nearly 80K. Larger when you consider the CGA font and BIOS in ROM that it leveraged. We're still operating on a bargain budget in comparison, with better results! I'd like to see you write GoatNom in GWBASIC on a PC XT and have it be as fluid. (And yes, I consider IntyBASIC to be more in the GWBASIC territoryor at least some of the later 16K and 32K BASICswith its graphics and music capability, as compared to the simpler 8K and smaller text-only BASICs.)

Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision.

Very well put. In my framework, every time that I get the urge to cringe about the amount of ROM or RAM it consumes in just low-level, under-the-bonnet libraries; I remind myself that it is a rather large chunk of code that the programmer will not have to write for his game.

Let's take for example the pseudo-random number generator in P-Machinery. The programmer would still need to consume storage and cycles writing a specialized PRNG for his game. All things being equal, his routine may be smaller, sure.

However, all things are not equal. The savings in development effort and time, not to mention the absence of another cognitive burden orthogonal to the game itself, should more than offset whatever DECLEs the framework consumes.

In your example about the 20K "Hello World" program, it is wasteful if the majority of programs are as trivial. However, the cost savings and efficiencies are significant for any non-trivial application where the programmer can print a string of text on the screen with a single statement.

As someone who wrote a game in Assembly Language (and a game development framework pretty much from scratch), I cannot overstate how insanely arcane and soul crushing it is to have to *think* about BACKTAB bitmaps, address mappings, and even ASCII character tables, and how to copy them--one by one--when trying to put a title on the screen. And don't get me started with STIC handling without abstractions.

If all that abstraction, power, creative empowerment, and productivity costs a few thousand DECLEs, I say, bring it on.

After all, none of us grew up dreaming of writing lame and boring boilerplate code to move bits around. We're here for the games!

dZ.

+DZ-Jay · April 2, 2015

How is this handled in other situations? Does everyone come up with their own multiply routines?

Actually, yes. However, due to being hand-crafted for purpose, they tend to be highly specialized.

For example, in Christmas Carol I got away with avoiding multiplication by thinking in terms of look-up tables, power-of-two boundaries, and optimized data structures; rather than general indices or arithmetic.

Except in one single spot, where I had the need to multiply by 29. So I built a single-purpose code block that multiplied an 8-bit value in a register by 29, highly optimized and compact.

I will tell you one thing, if I had a fast general-purpose multiplication routine then, I would had tried it first; and only rolled out my own if it proved insufficient or too slow for my program.

dZ.

Quarter Square Multiplication: Blindingly fast multiplies!

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members