intvnut Posted April 1, 2015 Share Posted April 1, 2015 (edited) While looking for something else the other night, I stumbled on this Wikipedia article that describes Quarter Square Multiplication. I had never seen this algorithm before. For a machine like the Intellivision, it appears to be a perfect fit! I won't repeat the Wikipedia article here, but the short version: With a few additions, subtractions and table lookups, you can perform arbitrary multiplication wicked fast. For an 8-bit x 8-bit multiplication (16-bit result), we're talking ~70 cycles. For 16x16 multiplication (also 16-bit result), we're talking ~302 cycles. It didn't take me very long to code it up at all. The algorithm is that simple. It just needs a 511 word lookup table. Here's the implementation of both 8x8 => 16 and 16x16 => 16 multiplies. . ; Quarter Square Multiplication ; Assembly code by Joe Zbiciak, 2015 ; Released to public domain. QSQR8_TBL PROC DECLE $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C DECLE $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038 DECLE $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084 DECLE $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0 DECLE $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C DECLE $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228 DECLE $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4 DECLE $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0 DECLE $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC DECLE $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618 DECLE $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764 DECLE $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0 DECLE $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C DECLE $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08 DECLE $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4 DECLE $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0 DECLE $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC DECLE $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8 DECLE $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644 DECLE $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0 DECLE $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C DECLE $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8 DECLE $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4 DECLE $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0 DECLE $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC DECLE $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8 DECLE $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24 DECLE $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090 DECLE $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C DECLE $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8 DECLE $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94 DECLE $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80 DECLE $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C DECLE $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8 DECLE $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04 DECLE $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070 DECLE $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC DECLE $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8 DECLE $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74 DECLE $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360 DECLE $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C DECLE $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98 DECLE $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4 DECLE $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850 DECLE $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC DECLE $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388 DECLE $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954 DECLE $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40 DECLE $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C DECLE $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78 DECLE $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4 DECLE $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830 DECLE $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC DECLE $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568 DECLE $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34 DECLE $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320 DECLE $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C DECLE $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158 DECLE $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4 DECLE $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010 DECLE $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C DECLE $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48 DECLE $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714 DECLE $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01 ENDP ; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values ; Destroys R1 qs_mpy8 PROC MOVR R0, R2 ; 6 ADDR R1, R2 ; 6 a + b SUBR R0, R1 ; 6 a - b BPL @@ok ; 7 NEGR R1 ; 6 @@ok: ADDI #QSQR8_TBL, R2 ; 8 ADDI #QSQR8_TBL, R1 ; 8 MVI@ R2, R2 ; 8 SUB@ R1, R2 ; 8 JR R5 ; 7 ;---- ; 70 ENDP ; R1 = R0 * R1, where R0 and R1 are 16-bit values ; destroys R0, R2, R3, R4, R5 qs_mpy16 PROC PSHR R5 ; 9 MVII #QSQR8_TBL, R5 ; 8 MOVR R0, R2 ; 6 R2 is orig 16-bit a MOVR R1, R3 ; 6 R3 is orig 16-bit b ; lo * lo ANDI #$FF, R0 ; 8 R0 is lo(a) MOVR R0, R4 ; 6 R4 is lo(a) ANDI #$FF, R1 ; 8 R1 is lo(b) PSHR R1 ; 9 save lo(b) ADDR R1, R4 ; 6 R4 = lo(a) + lo(b) SUBR R0, R1 ; 6 R1 = lo(a) - lo(b) BPL @@pos_ll ; 7 NEGR R1 ; 6 @@pos_ll: ADDR R5, R4 ; 6 R4 = &qstbl[lo(a)+lo(b)] ADDR R5, R1 ; 6 R1 = &qstbl[lo(a)-lo(b)] MVI@ R4, R4 ; 8 R4 = qstbl[lo(a)+lo(b)] SUB@ R1, R4 ; 8 R4 = lo(a)*lo(b) ;---- ; 113 ; lo * hi SWAP R3 ; 6 \_ R3 = hi(b) ANDI #$FF, R3 ; 8 / MOVR R0, R1 ; 6 R0 = R1 = lo(a) ADDR R3, R1 ; 6 R1 = hi(b) + lo(a) SUBR R0, R3 ; 6 R3 = hi(b) - lo(a) BPL @@pos_lh ; 7 NEGR R3 ; 6 @@pos_lh: ADDR R5, R1 ; 6 R1 = &qstbl[hi(b)+lo(a)] ADDR R5, R3 ; 6 R3 = &qstbl[hi(b)-lo(a)] MVI@ R1, R1 ; 8 R1 = qstbl[hi(b)-lo(a)] SUB@ R3, R1 ; 8 R1 = lo(a)*hi(b) ;---- ; 73 ; 113 (carried forward) ;---- ; 186 ; hi * lo SWAP R2 ; 6 ANDI #$FF, R2 ; 8 R2 = hi(a) PULR R3 ; 11 R3 = lo(b) MOVR R3, R0 ; 6 R0 = lo(b) ADDR R2, R3 ; 6 R3 = hi(a) + lo(b) SUBR R0, R2 ; 6 R2 = hi(a) - lo(b) BPL @@pos_hl ; 7 NEGR R2 ; 6 @@pos_hl: ADDR R5, R3 ; 6 R3 = &qstbl[hi(a)+lo(b)] ADDR R5, R2 ; 6 R2 = &qstbl[hi(a)-lo(b)] ADD@ R3, R1 ; 8 \_ R1 = lo(a)*hi(b) + hi(a)*lo(b) SUB@ R2, R1 ; 8 / ;---- ; 84 ; 186 (carried forward) ;---- ; 270 SWAP R1 ; 6 \_ shift upper product left 8 ANDI #$FF00, R1 ; 8 / ADDR R4, R1 ; 6 final product PULR PC ; 12 ;---- ; 32 ; 270 (carried forward) ;---- ; 302 ENDP . I spent maybe 20 minutes on the 8x8 implementation, and another 40 - 60 minutes on the 16x16 implementation. (I first did a naive implementation that called qs_mpy8, and then a smarter implementation that inlined everything.) Tonight, I ran both through literally millions of random tests, comparing against JLP's multiply accelerator. (Side note: If you run jzIntv with rate control off (-r0), audio off (-a0), and minimize the window, it runs very, very fast. About 280x on my machine. 270x if you forget and leave audio on.) As the comment says, I release this to the public domain. If you use this in one of your games / programs / whatever, I don't mind a tip of the hat my way if you happen to think of it. And if not, eh, we still all get more cool games. If any of you find ways to improve this code (I'm sure it's possible), post it here and share with everyone! Homework assignment for the motivated: Add an 8x16 multiply to the library. With all three, then programs like IntyBASIC have a nice arsenal for implementing multiplication efficiently, with no surprises for oddball multiplicands. (That's presuming IntyBASIC can figure out if the arguments to the multiply come from 8 bit variables or 16-bit variables/intermediate results.) I think adding an 8x16 multiply mainly requires removing the 'hl' portion of the 16x16 multiply code and a PSHR, saving ~93 cycles. quarter_square_multiplication.asm Edited April 1, 2015 by intvnut 3 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/ Share on other sites More sharing options...
intvnut Posted April 1, 2015 Author Share Posted April 1, 2015 One thought I had while stuck in a 7AM meeting this morning: If you're willing to spend for an add'l 255 table entries, you can shave ~13 cycles off the 8x8 multiply and ~39 cycles off the worst-case 16x16 multiply by eliminating the BPL/NEGR statements. For the 8x8 multiply, that brings it down to 57 cycles. It's a small enough hunk of code that you could inline it, making it only 50 cycles inline. That's not bad if your game needs an arbitrary multiply. Note that if you have a fixed multiplier (ie. you're multiplying by 20 or by 50 or by some other constant), you're better off using a precomputed set of shifts and adds. I believe dZ-Jay has the most up to date version of the set of macros I published for constant multiplies up to 127, and it's likely elsewhere in this forum too if you dig for it. The quarter-square technique is simple enough that if you really wanted to optimize the routine and table-size for a specific game, it's fairly trivial to do so. I wish I had known about the technique sooner! Final note: Here's the C model I had written to experiment with it as part of writing the assembly code. It really is this simple. . #include <stdio.h> int qsqr8_tbl[511]; // 8x8 unsigned multiply int qsqr8( int a, int b ) { int s = a + b; int d = a - b; if ( d < 0 ) d = -d; return qsqr8_tbl[s] - qsqr8_tbl[d]; } int qsqr16( int a, int b ) { int plo = qsqr8( a & 0xFF, b & 0xFF ); int phi = qsqr8( a & 0xFF, b >> 8 ) + qsqr8( a >> 8, b & 0xFF ); return (plo + (phi << ) & 0xFFFF; } int main() { int a, b, i; for ( i = 0; i < 511; i++ ) qsqr8_tbl[i] = i*i >> 2; for ( a = 0; a < 256; a++ ) for ( b = 0; b < 256; b++ ) { int p = a * b; int q = qsqr8( a, b ); if ( p != q ) printf(" %.2X %.2X %.4X %.4X\n", a, b, p, q ); } for ( a = 0; a < 65536; a++ ) for ( b = 0; b < 65536; b++ ) { int p = (a * b) & 0xFFFF; int q = qsqr16( a, b ); if ( p != q ) printf(" %.4X %.4X %.4X %.4X\n", a, b, p, q ); } return 0; } . 2 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210235 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) Ok, if you're willing to blow an extra 255 words for the lookup table, you can make this really fast. 49 cycles for 8x8=>16, and 225 cycles for 16x16=>16. . ; Quarter Square Multiplication ; Assembly code by Joe Zbiciak, 2015 ; Released to public domain. QSQR8_TBL PROC DECLE $3F80, $3F01, $3E82, $3E04, $3D86, $3D09, $3C8C, $3C10 DECLE $3B94, $3B19, $3A9E, $3A24, $39AA, $3931, $38B8, $3840 DECLE $37C8, $3751, $36DA, $3664, $35EE, $3579, $3504, $3490 DECLE $341C, $33A9, $3336, $32C4, $3252, $31E1, $3170, $3100 DECLE $3090, $3021, $2FB2, $2F44, $2ED6, $2E69, $2DFC, $2D90 DECLE $2D24, $2CB9, $2C4E, $2BE4, $2B7A, $2B11, $2AA8, $2A40 DECLE $29D8, $2971, $290A, $28A4, $283E, $27D9, $2774, $2710 DECLE $26AC, $2649, $25E6, $2584, $2522, $24C1, $2460, $2400 DECLE $23A0, $2341, $22E2, $2284, $2226, $21C9, $216C, $2110 DECLE $20B4, $2059, $1FFE, $1FA4, $1F4A, $1EF1, $1E98, $1E40 DECLE $1DE8, $1D91, $1D3A, $1CE4, $1C8E, $1C39, $1BE4, $1B90 DECLE $1B3C, $1AE9, $1A96, $1A44, $19F2, $19A1, $1950, $1900 DECLE $18B0, $1861, $1812, $17C4, $1776, $1729, $16DC, $1690 DECLE $1644, $15F9, $15AE, $1564, $151A, $14D1, $1488, $1440 DECLE $13F8, $13B1, $136A, $1324, $12DE, $1299, $1254, $1210 DECLE $11CC, $1189, $1146, $1104, $10C2, $1081, $1040, $1000 DECLE $0FC0, $0F81, $0F42, $0F04, $0EC6, $0E89, $0E4C, $0E10 DECLE $0DD4, $0D99, $0D5E, $0D24, $0CEA, $0CB1, $0C78, $0C40 DECLE $0C08, $0BD1, $0B9A, $0B64, $0B2E, $0AF9, $0AC4, $0A90 DECLE $0A5C, $0A29, $09F6, $09C4, $0992, $0961, $0930, $0900 DECLE $08D0, $08A1, $0872, $0844, $0816, $07E9, $07BC, $0790 DECLE $0764, $0739, $070E, $06E4, $06BA, $0691, $0668, $0640 DECLE $0618, $05F1, $05CA, $05A4, $057E, $0559, $0534, $0510 DECLE $04EC, $04C9, $04A6, $0484, $0462, $0441, $0420, $0400 DECLE $03E0, $03C1, $03A2, $0384, $0366, $0349, $032C, $0310 DECLE $02F4, $02D9, $02BE, $02A4, $028A, $0271, $0258, $0240 DECLE $0228, $0211, $01FA, $01E4, $01CE, $01B9, $01A4, $0190 DECLE $017C, $0169, $0156, $0144, $0132, $0121, $0110, $0100 DECLE $00F0, $00E1, $00D2, $00C4, $00B6, $00A9, $009C, $0090 DECLE $0084, $0079, $006E, $0064, $005A, $0051, $0048, $0040 DECLE $0038, $0031, $002A, $0024, $001E, $0019, $0014, $0010 DECLE $000C, $0009, $0006, $0004, $0002, $0001, $0000 @@mid: DECLE $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C DECLE $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038 DECLE $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084 DECLE $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0 DECLE $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C DECLE $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228 DECLE $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4 DECLE $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0 DECLE $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC DECLE $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618 DECLE $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764 DECLE $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0 DECLE $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C DECLE $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08 DECLE $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4 DECLE $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0 DECLE $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC DECLE $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8 DECLE $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644 DECLE $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0 DECLE $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C DECLE $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8 DECLE $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4 DECLE $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0 DECLE $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC DECLE $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8 DECLE $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24 DECLE $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090 DECLE $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C DECLE $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8 DECLE $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94 DECLE $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80 DECLE $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C DECLE $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8 DECLE $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04 DECLE $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070 DECLE $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC DECLE $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8 DECLE $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74 DECLE $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360 DECLE $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C DECLE $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98 DECLE $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4 DECLE $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850 DECLE $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC DECLE $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388 DECLE $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954 DECLE $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40 DECLE $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C DECLE $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78 DECLE $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4 DECLE $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830 DECLE $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC DECLE $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568 DECLE $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34 DECLE $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320 DECLE $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C DECLE $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158 DECLE $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4 DECLE $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010 DECLE $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C DECLE $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48 DECLE $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714 DECLE $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01 ENDP ; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values ; Destroys R1 qs_mpy8 PROC MOVR R0, R2 ; 6 ADDI #QSQR8_TBL.mid, R1 ; 8 ADDR R1, R2 ; 6 a + b SUBR R0, R1 ; 6 a - b @@ok: MVI@ R2, R2 ; 8 SUB@ R1, R2 ; 8 JR R5 ; 7 ;---- ; 49 ENDP ; R1 = R0 * R1, where R0 and R1 are 16-bit values ; destroys R0, R2, R3, R4, R5 qs_mpy16 PROC PSHR R5 ; 9 ; Unpack lo/hi MOVR R0, R2 ; 6 ANDI #$FF, R0 ; 8 R0 is lo(a) XORR R0, R2 ; 6 SWAP R2 ; 6 R2 is hi(a) MOVR R1, R3 ; 6 R3 is orig 16-bit b ANDI #$FF, R1 ; 8 R1 is lo(b) MOVR R1, R5 ; 6 R5 is lo(b) XORR R1, R3 ; 6 SWAP R3 ; 6 R3 is hi(b) ;---- ; 67 ; lo * lo MOVR R0, R4 ; 6 R4 is lo(a) ADDI #QSQR8_TBL.mid, R1 ; 8 ADDR R1, R4 ; 6 R4 = lo(a) + lo(b) SUBR R0, R1 ; 6 R1 = lo(a) - lo(b) @@pos_ll: MVI@ R4, R4 ; 8 R4 = qstbl[lo(a)+lo(b)] SUB@ R1, R4 ; 8 R4 = lo(a)*lo(b) ;---- ; 42 ; 67 (carried forward) ;---- ; 109 ; lo * hi MOVR R0, R1 ; 6 R0 = R1 = lo(a) ADDI #QSQR8_TBL.mid, R3 ; 8 ADDR R3, R1 ; 6 R1 = hi(b) + lo(a) SUBR R0, R3 ; 6 R3 = hi(b) - lo(a) @@pos_lh: MVI@ R1, R1 ; 8 R1 = qstbl[hi(b)-lo(a)] SUB@ R3, R1 ; 8 R1 = lo(a)*hi(b) ;---- ; 42 ; 109 (carried forward) ;---- ; 151 ; hi * lo MOVR R5, R0 ; 6 R5 = R0 = lo(b) ADDI #QSQR8_TBL.mid, R2 ; 8 ADDR R2, R5 ; 6 R3 = hi(a) + lo(b) SUBR R0, R2 ; 6 R2 = hi(a) - lo(b) @@pos_hl: ADD@ R5, R1 ; 8 \_ R1 = lo(a)*hi(b)+hi(a)*lo(b) SUB@ R2, R1 ; 8 / ;---- ; 42 ; 151 (carried forward) ;---- ; 193 SWAP R1 ; 6 \_ shift upper product left 8 ANDI #$FF00, R1 ; 8 / ADDR R4, R1 ; 6 final product PULR PC ; 12 ;---- ; 32 ; 193 (carried forward) ;---- ; 225 ENDP . And yes, it's tested.EDIT: ...and updated to the version with the corrected cycle counting. 225 cycles! qsqr3.asm Edited April 2, 2015 by intvnut 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210672 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) BTW... if you're following this thread, I did make a couple edits and got the final count to 216 cycles. Calling it a night on this version, as dinner's ready. EDIT: GAH! 225 cycles. I missed the 9 cycles for the PSHR at the top when counting cycles in the most recent version. What's 9 cycles among friends? Edited April 2, 2015 by intvnut Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210678 Share on other sites More sharing options...
+nanochess Posted April 2, 2015 Share Posted April 2, 2015 Ok, if you're willing to blow an extra 255 words for the lookup table, you can make this really fast. 49 cycles for 8x8=>16, and 229 cycles for 16x16=>16. . ; Quarter Square Multiplication ; Assembly code by Joe Zbiciak, 2015 ; Released to public domain. QSQR8_TBL PROC DECLE $3F80, $3F01, $3E82, $3E04, $3D86, $3D09, $3C8C, $3C10 DECLE $3B94, $3B19, $3A9E, $3A24, $39AA, $3931, $38B8, $3840 DECLE $37C8, $3751, $36DA, $3664, $35EE, $3579, $3504, $3490 DECLE $341C, $33A9, $3336, $32C4, $3252, $31E1, $3170, $3100 DECLE $3090, $3021, $2FB2, $2F44, $2ED6, $2E69, $2DFC, $2D90 DECLE $2D24, $2CB9, $2C4E, $2BE4, $2B7A, $2B11, $2AA8, $2A40 DECLE $29D8, $2971, $290A, $28A4, $283E, $27D9, $2774, $2710 DECLE $26AC, $2649, $25E6, $2584, $2522, $24C1, $2460, $2400 DECLE $23A0, $2341, $22E2, $2284, $2226, $21C9, $216C, $2110 DECLE $20B4, $2059, $1FFE, $1FA4, $1F4A, $1EF1, $1E98, $1E40 DECLE $1DE8, $1D91, $1D3A, $1CE4, $1C8E, $1C39, $1BE4, $1B90 DECLE $1B3C, $1AE9, $1A96, $1A44, $19F2, $19A1, $1950, $1900 DECLE $18B0, $1861, $1812, $17C4, $1776, $1729, $16DC, $1690 DECLE $1644, $15F9, $15AE, $1564, $151A, $14D1, $1488, $1440 DECLE $13F8, $13B1, $136A, $1324, $12DE, $1299, $1254, $1210 DECLE $11CC, $1189, $1146, $1104, $10C2, $1081, $1040, $1000 DECLE $0FC0, $0F81, $0F42, $0F04, $0EC6, $0E89, $0E4C, $0E10 DECLE $0DD4, $0D99, $0D5E, $0D24, $0CEA, $0CB1, $0C78, $0C40 DECLE $0C08, $0BD1, $0B9A, $0B64, $0B2E, $0AF9, $0AC4, $0A90 DECLE $0A5C, $0A29, $09F6, $09C4, $0992, $0961, $0930, $0900 DECLE $08D0, $08A1, $0872, $0844, $0816, $07E9, $07BC, $0790 DECLE $0764, $0739, $070E, $06E4, $06BA, $0691, $0668, $0640 DECLE $0618, $05F1, $05CA, $05A4, $057E, $0559, $0534, $0510 DECLE $04EC, $04C9, $04A6, $0484, $0462, $0441, $0420, $0400 DECLE $03E0, $03C1, $03A2, $0384, $0366, $0349, $032C, $0310 DECLE $02F4, $02D9, $02BE, $02A4, $028A, $0271, $0258, $0240 DECLE $0228, $0211, $01FA, $01E4, $01CE, $01B9, $01A4, $0190 DECLE $017C, $0169, $0156, $0144, $0132, $0121, $0110, $0100 DECLE $00F0, $00E1, $00D2, $00C4, $00B6, $00A9, $009C, $0090 DECLE $0084, $0079, $006E, $0064, $005A, $0051, $0048, $0040 DECLE $0038, $0031, $002A, $0024, $001E, $0019, $0014, $0010 DECLE $000C, $0009, $0006, $0004, $0002, $0001, $0000 @@mid: DECLE $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C DECLE $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038 DECLE $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084 DECLE $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0 DECLE $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C DECLE $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228 DECLE $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4 DECLE $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0 DECLE $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC DECLE $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618 DECLE $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764 DECLE $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0 DECLE $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C DECLE $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08 DECLE $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4 DECLE $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0 DECLE $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC DECLE $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8 DECLE $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644 DECLE $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0 DECLE $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C DECLE $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8 DECLE $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4 DECLE $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0 DECLE $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC DECLE $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8 DECLE $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24 DECLE $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090 DECLE $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C DECLE $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8 DECLE $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94 DECLE $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80 DECLE $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C DECLE $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8 DECLE $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04 DECLE $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070 DECLE $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC DECLE $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8 DECLE $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74 DECLE $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360 DECLE $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C DECLE $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98 DECLE $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4 DECLE $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850 DECLE $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC DECLE $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388 DECLE $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954 DECLE $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40 DECLE $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C DECLE $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78 DECLE $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4 DECLE $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830 DECLE $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC DECLE $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568 DECLE $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34 DECLE $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320 DECLE $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C DECLE $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158 DECLE $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4 DECLE $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010 DECLE $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C DECLE $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48 DECLE $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714 DECLE $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01 ENDP ; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values ; Destroys R1 qs_mpy8 PROC MOVR R0, R2 ; 6 ADDI #QSQR8_TBL.mid, R1 ; 8 ADDR R1, R2 ; 6 a + b SUBR R0, R1 ; 6 a - b @@ok: MVI@ R2, R2 ; 8 SUB@ R1, R2 ; 8 JR R5 ; 7 ;---- ; 49 ENDP ; R1 = R0 * R1, where R0 and R1 are 16-bit values ; destroys R0, R2, R3, R4, R5 qs_mpy16 PROC PSHR R5 ; 9 MOVR R0, R2 ; 6 R2 is orig 16-bit a MOVR R1, R3 ; 6 R3 is orig 16-bit b ; lo * lo ANDI #$FF, R0 ; 8 R0 is lo(a) MOVR R0, R4 ; 6 R4 is lo(a) ANDI #$FF, R1 ; 8 R1 is lo(b) MOVR R1, R5 ; 6 save lo(b) ADDI #QSQR8_TBL.mid, R1 ; 8 ADDR R1, R4 ; 6 R4 = lo(a) + lo(b) SUBR R0, R1 ; 6 R1 = lo(a) - lo(b) @@pos_ll: MVI@ R4, R4 ; 8 R4 = qstbl[lo(a)+lo(b)] SUB@ R1, R4 ; 8 R4 = lo(a)*lo(b) ;---- ; 85 ; lo * hi SWAP R3 ; 6 \_ R3 = hi(b) ANDI #$FF, R3 ; 8 / MOVR R0, R1 ; 6 R0 = R1 = lo(a) ADDI #QSQR8_TBL.mid, R3 ; 8 ADDR R3, R1 ; 6 R1 = hi(b) + lo(a) SUBR R0, R3 ; 6 R3 = hi(b) - lo(a) @@pos_lh: MVI@ R1, R1 ; 8 R1 = qstbl[hi(b)-lo(a)] SUB@ R3, R1 ; 8 R1 = lo(a)*hi(b) ;---- ; 56 ; 85 (carried forward) ;---- ; 141 ; hi * lo SWAP R2 ; 6 ANDI #$FF, R2 ; 8 R2 = hi(a) MOVR R5, R0 ; 6 R5 = R0 = lo(b) ADDI #QSQR8_TBL.mid, R2 ; 8 ADDR R2, R5 ; 6 R3 = hi(a) + lo(b) SUBR R0, R2 ; 6 R2 = hi(a) - lo(b) @@pos_hl: ADD@ R5, R1 ; 8 \_ R1 = lo(a)*hi(b)+hi(a)*lo(b) SUB@ R2, R1 ; 8 / ;---- ; 56 ; 141 (carried forward) ;---- ; 197 SWAP R1 ; 6 \_ shift upper product left 8 ANDI #$FF00, R1 ; 8 / ADDR R4, R1 ; 6 final product PULR PC ; 12 ;---- ; 32 ; 197 (carried forward) ;---- ; 229 ENDP . And yes, it's tested. I've been reading your posts and I've to say this routine is pretty impressive . Very nice work! I'll try to include it in the next version of IntyBASIC, maybe with some case detection to include it only if it's used a var1*var2 expression 2 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210679 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) I've been reading your posts and I've to say this routine is pretty impressive . Very nice work! I'll try to include it in the next version of IntyBASIC, maybe with some case detection to include it only if it's used a var1*var2 expression I guess there isn't an easy way to say a table lookup only returns an 8-bit value, is there? For First Spear's use case (the one that inspired all this, actually), if he uses the approach I suggested, he would have a lookup table that will return small values (fit well within 8 bits), but are served well by a lookup table. Alternately, does it make sense to have a rand8(val) primitive as part of the language that returns a random 8 bit value that's in the range [ 0, val ), and that could use the 8 x 8 multiply? ie. . rand_result = rand8( val ) . that expands to something like . ; assume 'val' in R0; result in R0. Trashes R1, R2. ANDI #$FF, R0 ; 8 MOVR R0, R2 ; 6 MVII #QSQR8_TBL.mid, R1 ; 8 ADD _rand, R1 ; 10 ADDR R1, R2 ; 6 SUBR R0, R1 ; 6 MVI@ R2, R0 ; 8 SUB@ R1, R0 ; 8 SWAP R0 ; 6 ANDI #$FF, R0 ; 8 ;---- ; 74 . A flat 74 cycles gives you an 8 bit random number in the desired range, with minimal bias. Edited April 2, 2015 by intvnut 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210743 Share on other sites More sharing options...
+nanochess Posted April 2, 2015 Share Posted April 2, 2015 I guess there isn't an easy way to say a table lookup only returns an 8-bit value, is there? For First Spear's use case (the one that inspired all this, actually), if he uses the approach I suggested, he would have a lookup table that will return small values (fit well within 8 bits), but are served well by a lookup table. Alternately, does it make sense to have a rand8(val) primitive as part of the language that returns a random 8 bit value that's in the range [ 0, val ), and that could use the 8 x 8 multiply? ie. . rand_result = rand8( val ) . that expands to something like . ; assume 'val' in R0; result in R0. Trashes R1, R2. ANDI #$FF, R0 ; 8 MOVR R0, R2 ; 6 MVII #QSQR8_TBL.mid, R1 ; 8 ADD _rand, R1 ; 10 ADDR R1, R2 ; 6 SUBR R0, R1 ; 6 MVI@ R2, R0 ; 8 SUB@ R1, R0 ; 8 SWAP R0 ; 6 ANDI #$FF, R0 ; 8 ;---- ; 74 . A flat 74 cycles gives you an 8 bit random number in the desired range, with minimal bias. Pretty reasonable indeed! Let me think about it. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210765 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster 255 words to speed up multiplication? That's a substantial amount of ROM for one calculation - hopefully there's a way to easily toggle inclusion on or off depending on a person's needs. Yeah, most games can spare it, but still. If performance isn't a big worry, that's an entire screen of graphics 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210797 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster 255 words to speed up multiplication? That's a substantial amount of ROM for one calculation - hopefully there's a way to easily toggle inclusion on or off depending on a person's needs. Yeah, most games can spare it, but still. If performance isn't a big worry, that's an entire screen of graphics The total table is 766 words, actually. But really, these days, are there any games that can't spare it? The allophone library is about twice that size, for comparison. I'd say CPU cycles are in shorter supply. If you're going all-out for ROM space, you can get up to around 50K before you have to page-flip. 766 words is about 1.5% of that. But from a performance perspective, a 225 cycle multiply regardless of the operands is waaay cheaper than the current cost (up to ~90 seconds if multiplying by -1!!). 225 cycles is, coincidentally, about 1.5% of a frame time. Ok, ok, I posted iterative shift-and-add methods that are about 3x the cycles but have no lookup table. Still, ROM seems so cheaaaaap. I'd rather bloat the multiply and work on solving page-flipped ROMs in IntyBASIC than worrying about shrinking the multiply codesize while retaining the lovely cycle count. Solving page-flipped ROMs opens more doors. Edited April 2, 2015 by intvnut 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210807 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 We've definitely talked about rand(range) before. An 8-bit value would be fine. Right now I'm doing wasteful (modulo) tricks to make it happen. I'd love to have something slightly more nice, syntax-wise, and way faster BTW, multiplication may be better than modulo, depending on the multiplier. If your random value falls in an 8 bit range, you might consider (rand * range) / 256. This is especially true if 'range' is a compile-time constant. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210819 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 The total table is 766 words, actually. But really, these days, are there any games that can't spare it? The allophone library is about twice that size, for comparison. I'd say CPU cycles are in shorter supply. 766? Yikes. That's almost a quarter of the size of entire GAMES from back in the day I'd say that it depends on the game, and the intended target. Are we assuming that every place a person will ever use a ROM in the future will have gobs and gobs of storage? That may be the case, but who knows. I've yet to come even close to heavy cycle counting - maybe I just don't write computationally intensive games. Don't get me wrong, I love the idea. As you've been able to eke out, I'm using a LOT of table-driven ROM waste to simplify things (OK, sometimes it saves ROM space too...). I'm a big fan of being smart with resources. But adding 1.5% to every single homebrew from here on in, for a single function that may not even be necessary? I guess I'm more worried about the precedent it sets. And about a future where a one line "Hello World" INTV game uses 20KB just to fit in all the library functions that we're sure "everyone" needs. How quickly does one become three become ten? I didn't realize the allophones were being added regardless of Voice support. That seems a bit... excessive. Maybe it's time to figure out how to *easily* include/exclude these extra features? Of course things might get complicated once we start discussing the ability to turn on "fast multiplication", or similar things. I'm not sure how easy it would be to have the compiler use "standard" multiplication unless that library/function is included/selected. Also I've lost track of why there's such a performance hit here. I don't follow where a multiplication can take 90 seconds, at any value. I ran this and it happens well within a single frame (that's as granular as I could bother to look): a=a+1 b=b-1 test=a*b Now that I think about it, I've definitely gotten lost somewhere because there's no way that this is what you're talking about. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210822 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) 766? Yikes. That's almost a quarter of the size of entire GAMES from back in the day I'd say that it depends on the game, and the intended target. Are we assuming that every place a person will ever use a ROM in the future will have gobs and gobs of storage? That may be the case, but who knows. I've yet to come even close to heavy cycle counting - maybe I just don't write computationally intensive games. Don't get me wrong, I love the idea. As you've been able to eke out, I'm using a LOT of table-driven ROM waste to simplify things (OK, sometimes it saves ROM space too...). I'm a big fan of being smart with resources. But adding 1.5% to every single homebrew from here on in, for a single function that may not even be necessary? I guess I'm more worried about the precedent it sets. And about a future where a one line "Hello World" INTV game uses 20KB just to fit in all the library functions that we're sure "everyone" needs. How quickly does one become three become ten? I didn't realize the allophones were being added regardless of Voice support. That seems a bit... excessive. Maybe it's time to figure out how to *easily* include/exclude these extra features? Of course things might get complicated once we start discussing the ability to turn on "fast multiplication", or similar things. I'm not sure how easy it would be to have the compiler use "standard" multiplication unless that library/function is included/selected. Also I've lost track of why there's such a performance hit here. I don't follow where a multiplication can take 90 seconds, at any value. I ran this and it happens well within a single frame (that's as granular as I could bother to look): a=a+1 b=b-1 test=a*b Now that I think about it, I've definitely gotten lost somewhere because there's no way that this is what you're talking about. URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables! Still, 90+ ticks is fairly unacceptable for responsiveness in a game. Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source: . CLS #START = FRAME A = A * -1 #TICKS = FRAME - #START PRINT "Done: ", <5> #TICKS done: GOTO done . This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds. (One moment... still wiping some egg off... anyway.) On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure. On "Hello World" costing 20KB.... That seems like a specious argument. You may as well include in that measure the 4K words (8K bytes?) of the EXEC, because that's there too. Throw in the 2K bytes of GROM while you're at it. Just getting to the title screen on an unmodified Intellivision requires 10K bytes of code (without so much as a byte of IntyBASIC) by those rules. I think it's perfectly acceptable to have a healthy support library under you when writing a game. Yes, it can make trivial "Hello World" examples look appallingly large vs. a true bare-metal example—grab your soldering iron to make it happen!—but that's a shade ingenuine. Any reasonable game will leverage that library code to great effect. GWBASIC.EXE (remember that?) was nearly 80K. Larger when you consider the CGA font and BIOS in ROM that it leveraged. We're still operating on a bargain budget in comparison, with better results! I'd like to see you write GoatNom in GWBASIC on a PC XT and have it be as fluid. (And yes, I consider IntyBASIC to be more in the GWBASIC territory—or at least some of the later 16K and 32K BASICs—with its graphics and music capability, as compared to the simpler 8K and smaller text-only BASICs.) Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision. Edited April 2, 2015 by intvnut 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210838 Share on other sites More sharing options...
+nanochess Posted April 2, 2015 Share Posted April 2, 2015 Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision. That's my main objective with IntyBASIC. That the core functions are efficient, specially the most used functions. This way the programmer doesn't lose his/her time in optimization purposes or thinking in reworking in assembler code, because he/she would get only a tiny fraction of speed-up. IntyBASIC tries to put a reasonable balance between assembler and compiled code. Also the support library must be fast and of reasonable size. It wouldn't be very useful (for example) to have a music tracker that is so slow, big or complex that it's effectively unusable. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210967 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 (edited) URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables! Still, 90+ ticks is fairly unacceptable for responsiveness in a game. Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source: . CLS #START = FRAME A = A * -1 #TICKS = FRAME - #START PRINT "Done: ", <5> #TICKS done: GOTO done . This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds. Ah. There's my problem. I'd never in a million years be doing multiplication with 16 bit variables (or negative constant operands) in an environment such as this. I had to go to this to replicate what you're seeing (and yeah it's pretty painful): #a=#a+1 #b=-1 #test=#a*#b 'or just -1 in place of #b I'm not sure I'll use the word "contrived" for that, but it doesn't seem like something I'd be using terribly often. I use 255 as -1 all the time and it works swimmingly for most calculations, including multiplications. It helps that in my world, the land of negative numbers really only exists in the -1,0,1 band - and that I pretty much never do complex math with the very few 16-bit variables available On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure. Is this conditional assembling? I thought it was conditional compilation, too - but when I looked last night, the code is all there in the final .asm. Is that what IF DEFINED does during final assembly? Because many sections of the Intellivoice driver don't seem to be enclosed with that directive (although the allophones most definitely are). Edited April 2, 2015 by freeweed Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210985 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 P.S. My apologies if I sounded a little cranky last night. Been fighting with some awful code at work. Bleh. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3210986 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) Ah. There's my problem. I'd never in a million years be doing multiplication with 16 bit variables (or negative constant operands) in an environment such as this. I had to go to this to replicate what you're seeing (and yeah it's pretty painful): #a=#a+1 #b=-1 #test=#a*#b 'or just -1 in place of #b I'm not sure I'll use the word "contrived" for that, but it doesn't seem like something I'd be using terribly often. I use 255 as -1 all the time and it works swimmingly for most calculations, including multiplications. It helps that in my world, the land of negative numbers really only exists in the -1,0,1 band - and that I pretty much never do complex math with the very few 16-bit variables available You don't need 16 bit variables on the LHS to make it happen. A = A * -1 is enough to see it. Honestly, a negative value is more likely to pop out of an intermediate result, as those are 16 bits. I just wrote the example as I did to make it clear. In theory, IntyBASIC could flip the negative constant in my example to positive before multiplying (and thus be able to match the constant against the optimized constant routines), followed by a NEGR on the product. That'd fix my example without fixing the underlying issue. If someone happens to write a * (b - c), and (b - c) is occasionally negative, they could be left wondering why their program suddenly went pear shaped. That's what I mean by surprising. If they're able to narrow it down to that expression (and be honest, until these performance threads, would you have been able to identify that expression as troublesome?), would they think to try (b - c) * a as a fix? Why should it even be a fix? But, in IntyBASIC 1.0.3 and earlier, it is. To me, it just seems to go against the BASIC ethos—Beginner's All-purpose Symbolic Instruction Code—that there might be such big booby traps in innocuous looking statements. Is this conditional assembling? I thought it was conditional compilation, too - but when I looked last night, the code is all there in the final .asm. Is that what IF DEFINED does during final assembly? Because many sections of the Intellivoice driver don't seem to be enclosed with that directive (although the allophones most definitely are). IF DEFINED is very similar to #ifdef in C/C++. If the corresponding symbol is defined, then that hunk of code gets brought in. Since AS1600 doesn't include a linker, the IntyBASIC epilogue file just includes all the library components, and then disables pieces with conditional assembly directives. In principle it should be possible to leave the voice driver out in addition to the allophones. If it's not doing that, then that's a potential avenue for improvement. Edited April 2, 2015 by intvnut Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211000 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 Hmm. I suspect that my subconscious has hundreds of little optimization secrets (ie: "things to avoid doing on hardware or in languages that are possibly inefficient") that I don't even realize are there, and that's why I habitually avoid these issues without even realizing it. I'd never, ever use a=a*-1. In fact knowing that we're dealing with unsigned integers, I just automatically sub in 255 whenever this happens (and it's usually controlled by an 8-bit variable anyway, controlling something like direction or speed). So I've just never run across it. Other programmers might not have these sorts of things burned into their memory. So yeah, I can see where this could trip someone up. And "pear shaped" is putting it mildly. It absolutely brings the program to a screeching halt. *I* would be able to narrow it down because I have a knack for spotting the problem line very quickly when things go haywire (it helps that I compile and run after damned near any remotely complex addition, comp sci programming practices 101 forever!), but yeah. Some people wouldn't in a million years think that a simple math operation could go so crazy, so I could see someone spending hours trying to figure things out. I'm surprised it hasn't been discovered before - either IntyBASIC still has a fairly limited audience of people who instinctively avoid these sorts of traps, or we've just been lucky so far. 2 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211011 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 (edited) And "pear shaped" is putting it mildly. It absolutely brings the program to a screeching halt. *I* would be able to narrow it down because I have a knack for spotting the problem line very quickly when things go haywire (it helps that I compile and run after damned near any remotely complex addition, comp sci programming practices 101 forever!), but yeah. Some people wouldn't in a million years think that a simple math operation could go so crazy, so I could see someone spending hours trying to figure things out. I'm surprised it hasn't been discovered before - either IntyBASIC still has a fairly limited audience of people who instinctively avoid these sorts of traps, or we've just been lucky so far. I'm going to go with a little bit of "lucky" and a little bit of "I don't know why this didn't work, so I just rewrote it randomly several times until it did." (Nearly 20 years in engineering has exposed me to many, many people in that second category. And even I have been in that second category occasionally...) Edited April 2, 2015 by intvnut Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211024 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 I'm going to go with a little bit of "lucky" and a little bit of "I don't know why this didn't work, so I just rewrote it randomly several times until it did." (Nearly 20 years in engineering has exposed me to many, many people in that second category. And even I have been in that second category occasionally...) Put me solidly in the 2nd category, although with a fair bit of "good hunches usually pay off" mixed in. I'm used to designing around the limitations of any system I deal with, even if I don't much understand them, so I fly by the seat of my pants while doing this. And usually succeed. I look at this as similar to one of my favourite optimizations that Oscar put in: SCREEN label[,origin_offset,target_offset,cols,rows]. He could have sat back and said "let's keep this simple and easy for newbies, and who cares, consuming 240 words for every screen isn't THAT wasteful in the grand scheme of things, ROM is cheap". Instead, he's built in a granularity that allows me as programmer to decide if and when I want to use the ROM space, while still using the simplicity of the SCREEN command. Obviously I could write ranged POKE loops if I wanted (or whatever would work best) but this was just so damned elegant. And in the case of my mega-project, it's saved me close to 3.5k words (I'm creating a lot of half screens). Just that one thought of "hey, why be wasteful" has given me a tremendous amount of space back - space that I can use with more graphics, text, whatever, just to make things that much more grand in scope. Even if that wasn't his original intention when writing that syntax, it's how it played out Like I said earlier, I have no idea how you'd toggle something like math optimizations. Should it be included by default? Almost certainly. But I just don't think "you can always edit it out of the epilogue" is a reasonable answer for those that want to be a bit more efficient when they can. Surely there's a middle ground here, but as someone with exceedingly poor grasp of language syntax and compiler design, I'm the last guy to figure out how. So instead I complain And I'll admit that I'm on a bit of a storage reclaim mission right now, so my priorities may be a bit out of whack. 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211057 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 I look at this as similar to one of my favourite optimizations that Oscar put in: SCREEN label[,origin_offset,target_offset,cols,rows]. He could have sat back and said "let's keep this simple and easy for newbies, and who cares, consuming 240 words for every screen isn't THAT wasteful in the grand scheme of things, ROM is cheap". Instead, he's built in a granularity that allows me as programmer to decide if and when I want to use the ROM space, while still using the simplicity of the SCREEN command. Obviously I could write ranged POKE loops if I wanted (or whatever would work best) but this was just so damned elegant. And in the case of my mega-project, it's saved me close to 3.5k words (I'm creating a lot of half screens). Just that one thought of "hey, why be wasteful" has given me a tremendous amount of space back - space that I can use with more graphics, text, whatever, just to make things that much more grand in scope. Even if that wasn't his original intention when writing that syntax, it's how it played out I actually see that a bit differently. SCREENs scale in the size of the game. With each SCREEN statement, you increase the ROM size accordingly. So, you have something that scales with the size of the project. (There's the additional detail that this syntax allows you to update just a portion of the screen, which by itself is useful irrespective of ROM size / cycle count arguments.) In contrast, library code is fixed in size, no matter what the size of your project. "Hello World" and "War And Peace: The RPG" both bring the same library with them. (Granted, conditional assembly tricks may drop out unused library bits, but that effect is limited.) It makes a ton of sense to optimize your data structures and layout, because those will grow as you add more content to your game. The efficiency of your data structures determines rate at which they grow as you add more content to the game. The multiply library, however, will stay the same size, no matter how many multiplies you use. It's a fixed overhead. (I suppose if you use zero multiplies, it could be possible to make it disappear.) One easy way to allow configurability, perhaps, is to leverage the conditional assembly directives and some creativity in the intybasic_prologue.asm and intybasic_epilogue.asm. Consider this idea: . ;; Put this in intybasic_prologue.asm _mathlib PROC @@FAST EQU 0 @@COMPACT EQU 1 ENDP _mathlib.type SET _mathlib.FAST ; default MACRO MATHLIB %strategy% _mathlib.type SET _mathlib.%strategy% ENDM . ;; Put this in intybasic_epilogue.asm IF _mathlib.type = _mathlib.FAST ; put code for super fast multiply, etc. here ELSE ; put more compact shift/add implementations here ENDI . Then, in your BASIC program, if you want to switch strategies, you can add a single ASM statement at the top: . ASM MATHLIB COMPACT . Of course, if the rand8() I proposed requires that lookup table, then rand8() will always have to make a CALL to do the multiply so the IF/ELSE/ENDI can work its magic. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211072 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 The multiply library, however, will stay the same size, no matter how many multiplies you use. It's a fixed overhead. (I suppose if you use zero multiplies, it could be possible to make it disappear.) You're technically correct, which is the best kind of correct. But just because it's a fixed overhead doesn't automatically mean it always makes sense. I could design a fixed overhead library function that consumes 20k words and saves 1% of your cycles. And it would be idiotic to do that, except in that once instance where you're trying to wring out every last cycle. Yes, it's an extreme example, but I've seen that kind of thinking before (*avoids early JAVA rant*). Don't get me wrong - there's an obvious and massive difference in scope between a fairly small fixed cost and something that scales. My point is that it's easy to say "eh, it's just a few bytes" (albeit added to every single program until the end of time), while solving a problem that may or may not ever come up. I multiply all over the place, often in very stupid/lazy ways, and have never run into this. Fundamentally I agree with you though. Basic math is just too damned ... basic. And you'll really get me onboard if this library could also mean support for exponentiation Which opens a broader question - is this some sort of quirk with the way IntyBASIC was designed that we have this gaping performance hole? Or is this situation semi-common on older architectures/languages? In terms of implementing it as an option, I figured it would be something a bit kludgy like that. Not sure how maintainable that gets, especially as more of these library functions are added. So I'd probably prefer just losing the ROM space in every program. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211082 Share on other sites More sharing options...
intvnut Posted April 2, 2015 Author Share Posted April 2, 2015 You're technically correct, which is the best kind of correct. But just because it's a fixed overhead doesn't automatically mean it always makes sense. I could design a fixed overhead library function that consumes 20k words and saves 1% of your cycles. And it would be idiotic to do that, except in that once instance where you're trying to wring out every last cycle. Yes, it's an extreme example, but I've seen that kind of thinking before (*avoids early JAVA rant*). True, and it's a fair point. Fixed costs are acceptable if you can amortize them effectively. It's rather hard to amortize 20K words over a total 50K word budget. It's maybe a little easier to amortize 800 words. We can all build our strawmen as tall as we like. Which opens a broader question - is this some sort of quirk with the way IntyBASIC was designed that we have this gaping performance hole? Or is this situation semi-common on older architectures/languages? I think it's the growing pains of a young and evolving language. Recall all the performance issues folks had trying to print decimal numbers, and then we brought in the PRNUM routines. To me this process looks like steady improvement on the things that can benefit, ideally attacked in Pareto order of importance. And that order gets determined by what people actually use. In this particular case, nanochess put in a very simple multiply routine that was guaranteed to work, and most importantly, he shipped it. That means everyone gets to use it, and that by itself is useful. A program that works (even if it has some 'gotchas') is far more useful than a program that doesn't exist yet because you're trying to make it perfect. "Ship early, ship often" as the Linux folks say. It can always be refined in a future version, and as we've seen, that's exactly what's happened. That's a lesson I sometimes fail to heed. (Granted, the rules are a bit different when shipping in an actual ROM / EPROM. There, you really do need to get it as close to 'perfect' as possible before shipping because you can't upgrade in the field.) As for whether this particular performance quirk was common elsewhere? I can't say I've ever run into a multiply like that in other languages. But, I haven't intersected those languages this early in their lifetimes either. 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211099 Share on other sites More sharing options...
freewheel Posted April 2, 2015 Share Posted April 2, 2015 (edited) I guess I meant my question more in a generic sense, not specifically about IntyBASIC. I assume this comes about because the architecture has no native multiply instruction. How is this handled in other situations? Does everyone come up with their own multiply routines? I think part of my consternation with the whole thing is that I'm kinda naively assuming that basic math functions are a "solved problem". Yeah, there's always some clever algorithm being discovered to shave a bit here and there, but this one is a whopper (in specific cases). The print numbers routine was something a lot less fundamental to mathematics (and kinda architecture-dependent). It's no surprise that that came in a later version. PRINT, in general, will probably always have room for optimization (I've done it myself via some quick and dirty lookup tables). All of which makes me realize that I've never actually spent the time to see just what "comes with" IntyBASIC - ie: what does a very simple program compile to. I naively assume that things like math just translate into a tiny handful of instructions, and the things that all languages share are implemented very similarly - so I spent my time eyeballing the more "language-specific" features. And I figured I'd ask you because if anyone knows how things like low-level arithmetic routines get implemented on low-end hardware, you seem like the right guy to ask Edited April 2, 2015 by freeweed Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211115 Share on other sites More sharing options...
+DZ-Jay Posted April 2, 2015 Share Posted April 2, 2015 URGH. First let me wipe some egg off my face. Multiplying by -1 costs over 92 clock ticks not 92 seconds. There's 60 ticks in a second and 60 seconds in a minute, and somewhere along the way I conflated the two. Also, your multiply is measuring 1 * 255, not 1 * -1. 8 bit variables! Still, 90+ ticks is fairly unacceptable for responsiveness in a game. Also, when you factor in the cost of the interrupt handler, it's actually longer. I benchmarked it just now to be sure. Here's my source: . CLS #START = FRAME A = A * -1 #TICKS = FRAME - #START PRINT "Done: ", <5> #TICKS done: GOTO done .This benchmark reports that multiplying by -1 cost 114 ticks. Compile it and see if you agree with the result. I admit my egregious units error. But hopefully you agree that taking almost 2 seconds for a multiply might surprise a few people, even if it isn't 90 seconds. (One moment... still wiping some egg off... anyway.) On the voice stuff: It's guarded by a conditional compile directive. If you don't use voice, the allophones don't come in. I was just trying to give a relative measure. On "Hello World" costing 20KB.... That seems like a specious argument. You may as well include in that measure the 4K words (8K bytes?) of the EXEC, because that's there too. Throw in the 2K bytes of GROM while you're at it. Just getting to the title screen on an unmodified Intellivision requires 10K bytes of code (without so much as a byte of IntyBASIC) by those rules. I think it's perfectly acceptable to have a healthy support library under you when writing a game. Yes, it can make trivial "Hello World" examples look appallingly large vs. a true bare-metal examplegrab your soldering iron to make it happen!but that's a shade ingenuine. Any reasonable game will leverage that library code to great effect. GWBASIC.EXE (remember that?) was nearly 80K. Larger when you consider the CGA font and BIOS in ROM that it leveraged. We're still operating on a bargain budget in comparison, with better results! I'd like to see you write GoatNom in GWBASIC on a PC XT and have it be as fluid. (And yes, I consider IntyBASIC to be more in the GWBASIC territoryor at least some of the later 16K and 32K BASICswith its graphics and music capability, as compared to the simpler 8K and smaller text-only BASICs.) Am I suggesting we should be abjectly wasteful? NO. But for core functions that most folks consider "table stakes," such as basic math operations or small random numbers (ie. rolling dice), there should be efficient primitives. And yes, some of those might be fairly large. As I said before, cycles (both CPU and programmer cycles) seem to be in far shorter supply than bytes of ROM. Fast primitives lower the onramp, reduce surprising performance anomalies, and really don't cost much of anything in the long run. Even on the Intellivision. Very well put. In my framework, every time that I get the urge to cringe about the amount of ROM or RAM it consumes in just low-level, under-the-bonnet libraries; I remind myself that it is a rather large chunk of code that the programmer will not have to write for his game. Let's take for example the pseudo-random number generator in P-Machinery. The programmer would still need to consume storage and cycles writing a specialized PRNG for his game. All things being equal, his routine may be smaller, sure. However, all things are not equal. The savings in development effort and time, not to mention the absence of another cognitive burden orthogonal to the game itself, should more than offset whatever DECLEs the framework consumes. In your example about the 20K "Hello World" program, it is wasteful if the majority of programs are as trivial. However, the cost savings and efficiencies are significant for any non-trivial application where the programmer can print a string of text on the screen with a single statement. As someone who wrote a game in Assembly Language (and a game development framework pretty much from scratch), I cannot overstate how insanely arcane and soul crushing it is to have to *think* about BACKTAB bitmaps, address mappings, and even ASCII character tables, and how to copy them--one by one--when trying to put a title on the screen. And don't get me started with STIC handling without abstractions. If all that abstraction, power, creative empowerment, and productivity costs a few thousand DECLEs, I say, bring it on. After all, none of us grew up dreaming of writing lame and boring boilerplate code to move bits around. We're here for the games! dZ. 1 Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211149 Share on other sites More sharing options...
+DZ-Jay Posted April 2, 2015 Share Posted April 2, 2015 How is this handled in other situations? Does everyone come up with their own multiply routines? Actually, yes. However, due to being hand-crafted for purpose, they tend to be highly specialized. For example, in Christmas Carol I got away with avoiding multiplication by thinking in terms of look-up tables, power-of-two boundaries, and optimized data structures; rather than general indices or arithmetic. Except in one single spot, where I had the need to multiply by 29. So I built a single-purpose code block that multiplied an 8-bit value in a register by 29, highly optimized and compact. I will tell you one thing, if I had a fast general-purpose multiplication routine then, I would had tried it first; and only rolled out my own if it proved insufficient or too slow for my program. dZ. Quote Link to comment https://forums.atariage.com/topic/236848-quarter-square-multiplication-blindingly-fast-multiplies/#findComment-3211157 Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.