Tursi Posted March 12, 2015 Share Posted March 12, 2015 Surprising what those 6502 guru's can squeeze out of that little 8-bitter, huh? Chuck Peddle would be proud! Yeah, the A800XL was the one I had the most exposure to, and I always thought it was a capable little beast. dmsc over there has his code down to 21 seconds now, although when I said 30 kicked my butt, I had forgotten that the TI code ran in 26 seconds. (I was thinking 40). That said, he's beat 26 now, too, and he's also counting bytes (750 or so, IIRC). So, I undertook to port his version, which does away with all the squaring in terms of smarter step counts, and runs at 3.13 bits fixed point, giving very nice curves. The code ported and seems to run, but I have some debugging to do. /IF/ I can make it work, we'll get our 16-bitter back on top, it seems to be running the loops in about 16 seconds and in 670 bytes of code, no scratchpad code. (And I can optimize the sine table better than the 8-bit, cause we can have bigger lookup tables ). If I can fix the bugs, first. It's a bit harder piece of code to follow. I haven't tested that the new sine or new square root function even work yet, those are the right places to start. There's a clear uninitialized variable problem, because alternate runs produce different outputs. But to share: * Fixed point of 3.13 * Port from DMSC's Atari800XL 6502 code * Port to TI-99/4A 9900 by Tursi * Assemble with something that likes labels > 6 chars DEF START * Constants (INTEGER) - TI SX = 256 cx equ 128 * cx = 160 ; sx / 2 cy equ 90 * cy = 90 ; sy * 15/32 startXs equ 211 * startXs = 224 ; sx/5 + cx startYs equ 154 * startYs = 154 ; sx/5 + cy numZi equ 128 * numZi = 128 ; (-64+64) * Constants (Fixed 3.13) initAzt equ >ff02 * initAZt = $FF02 ; FP (1/64)^2 - 2/64 initZt2 equ 8192 * initZt2 = 8192 ; FP 1 stepZt2 equ 4 * stepZt2 = 4 ; FP 2 * (1/64)^2 initAXf equ 158 * initAXf = 102 ; FP 256 * (20/(9*sx))^2 = 256 * (1/144)^2 stepXf2 equ 316 * stepXf2 = 204 ; FP 256 * 2*(20/(9*sx))^2 = 256 * 2*(1/144)^2 * storing in registers? Going to try. zt2 EQU 15 azt EQU 14 xs EQU 13 * byte value TODO: check for wrap on byte regs! ys EQU 12 * byte value RET EQU 11 * for BL zi EQU 10 * byte value x1 EQU 9 x2 EQU 8 y1 EQU 7 * byte value axf EQU 5 di EQU 4 tmp1 EQU 3 tmp2 EQU 2 R1 EQU 1 * scratch registers R0 EQU 0 * this one is easier to store in memory * but we'll stick in scratchpad after the regs * note we reserve 4 to preserve alignment, but * it is big endian, and left aligned (4th byte is unused) xf2 EQU >8320 * 3 bytes long START LWPI >8300 * set up graphics and sine table BL @BITMAP BL @initsine * erase the pattern table CLR R0 CLR R1 LI R2,>1800 BL @VDPFILL * set the color table to white on black LI R0,>2000 LI R1,>F100 LI R2,>1800 BL @VDPFILL * init defaults li zt2,initZt2 li azt,initAZt li xs,startXs li ys,startYs li zi,numZi * ; Outer for loop: loopZi * ; x1 = x2 = xs mov xs,x1 mov xs,x2 * ; xf2 = 0 clr @xf2 clr @xf2+2 * ; axf = initAXf li axf,initAXf * ; Inner loop: loopX * ; AX = xf^2 = xf2 >> 8 * have to be careful with xf2 - in the original code it's * 3 bytes, little endian, but here, it's 3 bytes, big endian. * so this code takes the two MOST significant bytes * This is going to end up in tmp1, so we work there mov @xf2,tmp1 * ; tmp1 = xf*xf + zt*zt a zt2,tmp1 * ; di = sqrt(tmp1) * sqrt will take in and return in R0 (must preserve tmp1) mov tmp1,r0 bl @sqrt mov r0,di * ; tmp1 = di * 3; di = di * 2 sla di,1 mov di,tmp1 a di,tmp1 * ; tmp2 = sin(tmp1) * sine will take in and return in R0 mov tmp1,r0 bl @sine mov r0,tmp2 * ; AX = tmp1 * 3 mov tmp1,r0 sla r0,1 a tmp1,r0 * ; AX = sin(AX) = sin(9*di) bl @sine * ; AX = (1/2 - 1/8 + 1/32) * AX = 0.40625 * sin(9*di) * This is one case where the MPY instruction is a clear winner - just reading * all the instructions on the 9900 would take longer than to execute the * multiply. The tricky part is not stomping on any important registers! mpy @point4,r0 * result in r0,r1 - part we care about (LSW) is R1 * ; tmp2 += AX = sin(3*di) + 0.375 * sin(9*di) a r1,tmp2 * ; tmp2 * 320 * 7 / 40 = tmp2 * 56 = tmp2 * 64 - tmp2 * 8 * another multiply candidate - for us it's tmp2*256*7/40 = tmp2*44.8 (truncate to 44) * MPY's lack of sign handling is okay if we never care about the MSW mov tmp2,r0 mpy @num44,r0 * LSW in r1, so 'A' is LSB mov r1,tmp2 * ; y1 = -A + ys andi r1,>00FF neg r1 a ys,r1 mov r1,y1 * ; plot x1, y1 mov x1,r1 mov y1,r0 bl @plot * ; plot x2, y1 mov x2,r1 mov y1,r0 bl @plot * ; Condition: if(2*di>=2.0) break; ci di,>4000 jhe endLoopX * ; End of inner loop: x1++, x2--, xf2+=axf, axf+=stepXf2 inc x1 dec x2 mov axf,r0 * get MSB of axf swpb r0 andi r0,>00ff mov axf,r1 * get LSB of axf swpb r1 ab r1,@xf2+2 * add to LSB of xf2 jnc nocarry inc r0 * if there was carry, increment the MSB add nocarry a r0,@xf2 * add MSB (and carry if there was one) ai axf,stepXf2 jmp loopX endLoopX * ; End of outer loop: zt2+=azt, azt+=stepZt2, xs--, ys-- dec xs dec ys a azt,zt2 ai azt,stepZt2 * ; Condition: zi--; if( zi<0 ) break; dec zi joc loopZi * ; End of program end LIMI 2 JMP end ************************************************************************************* * utility code ************************************************************************************* * VDP access * Write single byte to R0 from MSB R1 * Destroys R0 (actually just oRs it) VSBW ORI R0,>4000 SWPB R0 MOVB R0,@>8C02 SWPB R0 MOVB R0,@>8C02 MOVB R1,@>8C00 B *R11 * Write R2 bytes from R1 to VDP R0 * Destroys R0,R1,R2 VDPFILL ORI R0,>4000 SWPB R0 MOVB R0,@>8C02 SWPB R0 MOVB R0,@>8C02 VMBWLP MOVB R1,@>8C00 DEC R2 JNE VMBWLP B *R11 * Write address or register VDPWA SWPB R0 MOVB R0,@>8C02 SWPB R0 MOVB R0,@>8C02 B *R11 * load regs list to VDP address, end on >0000 and write >D0 (for sprites) * address of table in R1 (destroyed) LOADRG LOADLP MOV *R1+,R0 JEQ LDRDN SWPB R0 MOVB R0,@>8C02 SWPB R0 MOVB R0,@>8C02 JMP LOADLP LDRDN LI R1,>D000 MOVB R1,@>8C00 B *R11 * Setup for normal bitmap mode BITMAP MOV R11,@SAVE * set display and disable sprites LI R1,BMREGS BL @LOADRG * set up SIT - We load the standard 0-255, 3 times LI R0,>5800 BL @VDPWA CLR R2 NQ# CLR R1 LP# MOVB R1,@>8C00 AI R1,>0100 CI R1,>0000 JNE LP# INC R2 CI R2,3 JNE NQ# MOV @SAVE,R11 B *R11 * IN AND OUT IN T1 * T1 in = 3.13 signed fixed point * T1 out = 3.13 signed fixed point (3.12 accuracy) * Uses T2,X1,Y1,T32 * http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf * because we need an even number of bits, we calculate to 3.12 and then shift SQWP EQU >8324 * we need some workspace, this preserves calling regs SQRT MOV R0,@SQWP LWPI SQWP still have r0! CLR r1 root CLR r2 remHi (r0 is remLo) LI r3,14 count = ((WORD/2-1)+(FRACBITS>>1)) -> 7+6, +1 for loop SQRT0 SLA r2,2 remHi = (remHi << 2) | (remLo >> 14); MOV r0,r4 SRL r4,14 SOC r4,r2 SLA r0,2 remLo <<= 2; SLA r1,1 root <<= 1; MOV r1,r4 testDiv = (root << 1) + 1; SLA r4,1 INC r4 C r2,r4 if (remHi >= testDiv) { JL SQRT2 S r4,r2 remHi -= testDiv; INC r1 root += 1; SQRT2 DEC r3 while (--count != 0); JNE SQRT0 MOV r1,@>8300 return( root); LWPI >8300 SLA r0,1 Get it up to x.13 fixed point B *R11 * INPUT R1,R0 - kills TMP1,TMP2 as well PLOT * use the E/A routine for address MOV R0,tmp1 R0 is the Y value. SLA tmp1,5 SOC R0,tmp1 ANDI tmp1,>FF07 MOV R1,tmp2 R1 is the X value. ANDI tmp2,7 A R1,tmp1 tmp1 is the byte offset. S tmp2,tmp1 tmp2 is the bit offset. * inline VDP! SWPB tmp1 set up read address MOVB tmp1,@>8C02 SWPB tmp1 MOVB tmp1,@>8C02 ORI tmp1,>4000 we need this later, and provides a VDP delay MOVB @>8800,R1 read the byte from VDP SWPB tmp1 set up write address MOVB tmp1,@>8C02 SWPB tmp1 MOVB tmp1,@>8C02 SOCB @BITS(tmp2),R1 or the bit and provide VDP delay MOVB R1,@>8C00 write the byte back B *R11 * sine code ported from dmsc's 6502 code as well * my input and output in R0 as 3.13 fixed point sine * stx mul2+1 * asl a * rol mul2+1 * asl a * rol mul2+1 * ror mul2+2 * asl a * rol mul2+1 * ; mul2+1 has the fractional part, mul2+2 has the sign (1 bits), C is the reflection * I think there was a bug in the original code in that mul2+2 ends up with the msb integer, * rather than the sign bit (ror should be after the first rol?). But as long as the * integer was always less than 2, it would still have always worked. Ultimately, * this code is taking this value: S B1 B2 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 * it wants the sign bit S at mul2+2 (assuming the comment is correct) * it wants the 'reflection' bit from B2 into the Carry * and it takes the fractional bits F1-F8 for the index into mul2+1 * We need index * 2, so we'll just take an extra bit * this is a lot more expensive than a bigger lookup table... it gives a 512 entry * table in 512 bytes of RAM (instead of 1024).. but using 1k of memory this * would just be a single lookup, rather than all the tests and inversions. mov r0,r1 srl r1,4 andi r1,>1fe * mask out integer and mask out odd bit * bcc @noreflect coc @reflect,r0 jne noreflect * lda mul2+1 * eor #255 * tay * lda sineLo+1,y * ldx sineHi+1,y xor @flipidx,r1 mov @sinetab+2(r1),r1 * bcs @invert * we know the carry is set from the above test, nothing changed it, * so this is an always branch. jmp invert noreflect * ldy mul2+1 * lda sineLo,y * ldx sineHi,y mov @sinetab(r1),r1 invert * ldy mul2+2 * bpl @ok coc @negativ,r0 jne ok * clc * eor #255 * adc #1 * pha * txa * eor #255 * adc #0 * tax * pla inv r1 ok * rts mov r1,r0 b *r11 * init the sine table (if we're going to build a sine table, * why not build the whole thing instead of just half of it?) * iniSum = 54 iniSum equ 54 initSine mov r11,@SAVE * lda #iniSum * sta tmp1 li tmp1,iniSum * lda #0 * sta mtmp * sta mtmp+1 * tay * tax * no reason not to use the registers here... clr r1 clr x1 nextbyte * ; Read next byte * pha * lda genTable,x * sta tmp2 * pla clr r0 movb @genTable(x1),tmp2 * inx * stx mtmp+1 * ldx mtmp inc r1 * jsr genFour * jsr genFour * we don't have a stack, easier to do it inline bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne * stx mtmp * ldx mtmp+1 * cpx #32 * bne nextbyte * what we DO have is lots of registers ci r1,32 jne nextbyte * lda #0 * sta sineLo+256 * lda #$20 * sta sineHi+256 li r0,>2000 mov r0,@sinetab+512 * rts mov @SAVE,r11 B *R11 * ; Generates one value *genFour: * jsr genTwo *genTwo: * jsr genOne * cute trick genOne * ; AX = AX + sum * sta sineLo, y * pha * txa * sta sineHi, y * pla * iny mov x1,@sinetab(y1) inct y1 * clc * adc tmp1 * bcc @noinc * inx *@noinc: a tmp1,x1 * ; Read bit, test if sum must be decreased * asl tmp2 * bcc @nodec * dec tmp1 *@nodec: sla tmp2,1 jnc nodec dec tmp1 nodec * rts B *R11 * just a label so I can check sizes DATASPC * (my own - 0.4 in 3.13 fixed point) * data reference so we can MPY directly point4 DATA 3276 * And this is just a 44 to mpy against num44 DATA 44 * bits for pixel BITS DATA >8040,>2010,>0804,>0201 * registers for bitmap (and 5A00 is the address of the sprite table) * background is transparent (the only color never redefined) * PDT - >0000 * SIT - >1800 * SDT - >1800 * CT - >2000 * SAL - >1B00 BMREGS DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000 * misc bits for testing purposes reflect data >2000 negativ data >8000 flipidx data >01FE * data for sine generation genTable data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221 data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912 * spot to save return addresses SAVE bss 2 * spot to store the sine table sinetab bss 514 END 4 Quote Link to comment Share on other sites More sharing options...
sometimes99er Posted March 12, 2015 Share Posted March 12, 2015 Actually a nice computer generated image. 1 Quote Link to comment Share on other sites More sharing options...
+OLD CS1 Posted March 12, 2015 Share Posted March 12, 2015 Surprising what those 6502 guru's can squeeze out of that little 8-bitter, huh? Chuck Peddle would be proud! I will happily raise a glass to our 8-bit pioneer forefathers. The 6500-,6800-, and Z80 family processors are just incredible. Without diminishing what came before, I hazard to say the work on those CPUs represented a much larger leap in technology than anything being made today. 3 Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 13, 2015 Share Posted March 13, 2015 Close... but not quite. If I follow the error patterns right, this is a sign error or a step error in the sine table... I'm too sleepy to trace much more (and getting kind of tired of this hat ). If anyone knows offhand which variable I screwed up to get this pattern, let me know and we can fix it! Otherwise, it's almost there. This is a rewrite port of dmcs's version of the hat, using 3.13 fixed point. While he eliminated all multiplies for the 6502, I put a few back in because the 9900 can do them more efficiently than a lot of shifting (although I still need some shifts anyway, so I might be wrong on that). I also extended his sine table to 1024 entries to allow a straight lookup rather than mirroring and reflecting (because we can have indexes larger than 255), and created a second sine table that was pre-scaled to save an inline multiply. Assuming it's running the right number of cycles, it runs in just shy of 20 seconds, and takes 768 bytes (code+data, not including the test block at the end). Once we get it working correctly, I can optimize it a bit more and shove most of the inner loop into scratchpad, I think. * Fixed point of 3.13 in most cases * Port from DMSC's Atari800XL code * Port to TI-99/4A 9900 by Tursi * Assemble with something that likes labels > 6 chars DEF START,TEST * Constants (INTEGER) - TI SX = 256, SY=192 cx equ 128 sx / 2 cy equ 90 sy * 15 / 32 fy equ 56 sy * 7 / 24 * For the TI version.. Ys is an integer (step is 1), and Xs is 8.8 (step 0. * the stepYs is thus optimized out and just inc/dec used. stepXs equ >00cc TOFP(sx / 320.0) * these constants replace the squaring with simple counting loops initZt2 equ >2000 TOFP(1.0) deltaZt equ >0002 TOFP(1.0/(64.0*64.0)) initAZt equ >ff02 deltaZt - TOFP(2.0 / 64.0) stepZt2 equ >0004 2 * deltaZt * The values of delta² are too small for 3.13 bits of fixed-point, so we * use 8 more bits, for an 3.21 bits of precision. This is not slow because * we only use additions. deltaXf equ >009e TOFP( 20.0*20.0 / (9.0*9.0*sx*sx) * 256 ) - the *256 adds 8 bits of zeros initAXf equ >009e deltaXf stepXf2 equ >013c 2 * deltaXf * variables in registers vdpadr equ 15 zt2 EQU 14 azt EQU 13 xs EQU 12 RET EQU 11 * for BL ys EQU 10 zi EQU 9 x1 equ 8 x2 equ 7 axf equ 6 di equ 5 tmp equ 4 z1 equ 3 y1 equ 2 r1 equ 1 r0 equ 0 * for plot tmp1 equ 4 tmp2 equ 3 * this one is easier to store in memory * but we'll stick in scratchpad after the regs * note we reserve 4 to preserve alignment, but * it is big endian, and left aligned (4th byte is unused) xf2 EQU >8320 * 3 bytes long START LWPI >8300 * frequently used VDP address li vdpadr,>8c02 * set up graphics and sine table BL @BITMAP BL @initsine * clear out the oldY table (entries of 192) li r0,oldY li r1,>C000 li r2,256 rlp movb r1,*r0+ dec r2 jne rlp * erase the pattern table CLR R0 CLR R1 LI R2,>1800 BL @VDPFILL * set the color table to white on black LI R0,>2000 LI R1,>F100 LI R2,>1800 BL @VDPFILL * init defaults li zt2,initZt2 li azt,initAZt li xs,>3333 li ys,>0040 li zi,128 * ; Outer for loop: loopZi * x1 = (xs>> + cx; // SIGNED mov xs,x1 sra x1,8 ai x1,cx * x2 = x1; mov x1,x2 * ** only xf2 needs 24 bits ** * xf2 = 0; clr @xf2 clr @xf2+2 * axf = initAXf; li axf,initAXf * inner for loop loopXi * di = sqrti( (xf2>> + zt2 ); mov @xf2,r0 * two MS bytes here, LSB in next byte, so no shift needed a zt2,r0 bl @sqrt * return in r0 mov r0,di * we save it cause we need to check it later anyway * tmp = (0x3244 * di) >> 13; // * (PI/2) - 3.13 * 3.13 = 6.26, shift and truncate li tmp,>3244 mpy tmp,r0 * output in r0,r1 srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) * tmp = (tmp*489)>>13; // multiplier to go from fixed point max range (1.57) to 3/4 circle (768) mov r1,r0 li tmp,489 mpy tmp,r0 * output in r0,r1 srl r1,12 * not shifting all the way to get a multiply by 2 for the index sla r0,4 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) * z1 = sinetab[tmp]; mov @sinetab(r1),z1 * tmp = tmp + tmp + tmp; mov r1,tmp a r1,tmp a r1,tmp * z1 += sinefour[tmp&0x3ff]; andi tmp,>03ff a @sinefour(tmp),z1 * tmp = ((z1 * fy) >> 13); // SIGNED li r0,fy mov z1,z1 jgt inotneg neg z1 mpy z1,r0 * 3.13 x 16.0 = 19.13, LSW is already correct srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) neg r1 jmp idone inotneg mpy z1,r0 srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) idone * y1 = ys + cy - tmp; mov ys,y1 ai y1,cy s r1,y1 * if( oldY[x1] > y1 ) c @oldY(x1),y1 jl noplot1 * oldY[x1] = y1; mov y1,@oldY(x1) * dc->SetPixel(x1, y1, RGB(0,0,0)); mov x1,r1 mov y1,r0 bl @plot noplot1 * if( oldY[x2] > y1 ) c @oldY(x2),y1 jl noplot2 * oldY[x2] = y1; mov y1,@oldY(x2) * dc->SetPixel(x2, y1, RGB(0,0,0)); mov x2,r1 mov y1,r0 bl @plot noplot2 * end of inner loop processing (normally after di, but it doesn't change) * x1++, x2--, xf2 += axf, axf += stepXf2 inc x1 dec x2 * xf2 needs to be done bytewise, cause we have to split axf for it mov axf,r0 swpb r0 andi r0,>ff00 * LSB in MSB position for LSB of xf2 mov axf,r1 swpb r1 andi r1,>00FF * MSB in LSB position for MSW of xf2 ab r0,@xf2+2 * add the LSB jnc nocarry inc @xf2 * add in the carry to the MSW nocarry a r1,@xf2 * and add in the MSB ai axf,stepXf2 * if( di >= 0x2000 ) break; (inner loop) ci di,>2000 jlt loopXi * outer loop end * zt2 += azt, azt += stepZt2, xs -= stepXs, --ys, zi-- (condition) a azt,zt2 ai azt,stepZt2 ai xs,-stepXs dec ys dec zi jne loopZi * ; End of program end LIMI 2 JMP end ************************************************************************************* * utility code ************************************************************************************* * VDP access * Write single byte to R0 from MSB R1 * Destroys R0 (actually just oRs it) VSBW ORI R0,>4000 SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr MOVB R1,@>8C00 B *R11 * Write R2 bytes from R1 to VDP R0 * Destroys R0,R1,R2 VDPFILL ORI R0,>4000 SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr VMBWLP MOVB R1,@>8C00 DEC R2 JNE VMBWLP B *R11 * Write address or register VDPWA SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr B *R11 * load regs list to VDP address, end on >0000 and write >D0 (for sprites) * address of table in R1 (destroyed) LOADRG LOADLP MOV *R1+,R0 JEQ LDRDN SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr JMP LOADLP LDRDN LI R1,>D000 MOVB R1,@>8C00 B *R11 * Setup for normal bitmap mode BITMAP MOV R11,@SAVE * set display and disable sprites LI R1,BMREGS BL @LOADRG * set up SIT - We load the standard 0-255, 3 times LI R0,>5800 BL @VDPWA CLR R2 NQ# CLR R1 LP# MOVB R1,@>8C00 AI R1,>0100 CI R1,>0000 JNE LP# INC R2 CI R2,3 JNE NQ# MOV @SAVE,R11 B *R11 * IN AND OUT IN R0 * fractions only - returns 1.0 for values >0.99999 * adapted from dmsc's code * R0 in = 3.13 signed fixed point * Uses separate workspace - looks similar to following * http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf SQWP EQU >8324 * we need some workspace, this preserves calling regs SQRT ci r0,>2000 1.0 in 3.13 jl sqrt1 li r0,>2000 too big - early out b *r11 sqrt1 MOV R0,@SQWP LWPI SQWP still have r0! (x) CLR r1 root (r) CLR r2 remHi (h) (r0 is remLo) * clr r4 (q) SLA R0,3 lose the integer part LI r3,13 count = (7+FPSCALE/2) -> 7+6 SQRT0 sla r1,1 r = r<<1; mov r1,r4 q = h + (0xFFFF ^ r); inv r4 a r2,r4 jlt sqrt2 if( q >= 0 ) { r += 2; h = q; } inct r1 mov r4,r2 sqrt2 sla r2,2 h = (h << 2) | (x>>14); mov r0,r5 srl r5,14 soc r5,r2 sla r0,2 x <<= 2; DEC r3 while (--count != 0); JNE SQRT0 MOV r1,@>8300 return r; LWPI >8300 B *R11 * INPUT R1,R0 - kills TMP1,TMP2 as well PLOT * use the E/A routine for address MOV R0,tmp1 R0 is the Y value. SLA tmp1,5 SOC R0,tmp1 ANDI tmp1,>FF07 MOV R1,tmp2 R1 is the X value. ANDI tmp2,7 A R1,tmp1 tmp1 is the byte offset. S tmp2,tmp1 tmp2 is the bit offset. * inline VDP! SWPB tmp1 set up read address MOVB tmp1,*vdpadr SWPB tmp1 MOVB tmp1,*vdpadr ORI tmp1,>4000 we need this later, and provides a VDP delay MOVB @>8800,R1 read the byte from VDP SWPB tmp1 set up write address MOVB tmp1,*vdpadr SWPB tmp1 MOVB tmp1,*vdpadr SOCB @BITS(tmp2),R1 or the bit and provide VDP delay MOVB R1,@>8C00 write the byte back B *R11 * init the sine tables * r1 - temp for reflected offset (0-510) * r2 - add value * r3 - current output value * r4 - current change table entry * r5 - table output offset (0-510) * r6 - temp for negative output value * r7,r8 - temp for x0.4 output * r9 - loop counter initSine mov r11,@SAVE li r2,54 * starting value * no reason not to use the registers here... clr r9 clr r3 nextbyte clr r4 movb @genTable(r9),r4 * we don't have a stack, easier to do it inline bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne * what we DO have is lots of registers inc r9 ci r9,32 jne nextbyte mov @SAVE,r11 B *R11 * set all four points on the curve, and load both tables genOne li r1,510 s r5,r1 * reflection offset mov r3,r6 neg r6 * negative version mov r3,@sinetab(r5) mov r3,@sinetab+512(r1) mov r6,@sinetab+1024(r5) mov r6,@sinetab+1536(r1) * make the *0.4 version li r6,>0ccd * 0.4 in 3.13 mov r3,r7 jgt gnotneg abs r7 mpy r6,r7 srl r8,13 * shift fraction sla r7,3 * shift int soc r7,r8 * make 3.13 neg r8 jmp gdone gnotneg mpy r6,r7 srl r8,13 * shift fraction sla r7,3 * shift int soc r7,r8 * make 3.13 gdone mov r8,r6 neg r6 mov r8,@sinefour(r5) mov r8,@sinefour+512(r1) mov r6,@sinefour+1024(r5) mov r6,@sinefour+1536(r1) inct r5 a r2,r3 * ; Read bit, test if sum must be decreased sla r4,1 jnc nodec dec r2 nodec B *R11 * just a label so I can check sizes DATASPC * (my own - 0.40625 in 0.16 fixed point) * data reference so we can MPY directly point4 DATA >6800 * And this is just a 44 to mpy against num44 DATA 44 * bits for pixel BITS DATA >8040,>2010,>0804,>0201 * registers for bitmap (and 5A00 is the address of the sprite table) * background is transparent (the only color never redefined) * PDT - >0000 * SIT - >1800 * SDT - >1800 * CT - >2000 * SAL - >1B00 BMREGS DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000 * misc bits for testing purposes reflect data >2000 negativ data >4000 flipidx data >01FE * data for sine generation genTable data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221 data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912 * BSS section * spot to save return addresses SAVE bss 2 * spot to store the sine table (full 1024 entries) sinetab bss 2048 * sine divided by 4, to remove a multiply inline sinefour bss 2048 * row table for hidden surface (one entry per column) oldY bss 512 ****************************************************************************** TEST LWPI >8300 * frequently used VDP address li vdpadr,>8c02 * set up graphics and sine table BL @BITMAP BL @initsine * erase the pattern table CLR R0 CLR R1 LI R2,>1800 BL @VDPFILL * set the color table to white on black LI R0,>2000 LI R1,>F100 LI R2,>1800 BL @VDPFILL * draw a sine wave clr x1 sinlp mov x1,r1 sla r1,3 mov @sinetab(r1),r0 sra r0,7 ai r0,96 mov x1,r1 bl @plot inc x1 ci x1,256 jne sinlp * draw a sine wave clr x1 sinlp2 mov x1,r1 sla r1,3 mov @sinefour(r1),r0 sra r0,7 ai r0,96 mov x1,r1 bl @plot inc x1 ci x1,256 jne sinlp2 b @end END 2 Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 13, 2015 Share Posted March 13, 2015 I stand (well, sit actually) in awe of this....! Quote Link to comment Share on other sites More sharing options...
Asmusr Posted March 13, 2015 Share Posted March 13, 2015 If anyone knows offhand which variable I screwed up to get this pattern, let me know and we can fix it! Looks like the 2nd sine function call. This controls the number of 'humps' on the hat. Quote Link to comment Share on other sites More sharing options...
dmsc Posted March 13, 2015 Share Posted March 13, 2015 Hi! I browsed your program, without really understanding the 9900 assembly, but here are some comments: srl r1,12 * not shifting all the way to get a multiply by 2 for the index sla r0,4 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) * z1 = sinetab[tmp]; mov @sinetab(r1),z1 Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only? ... * tmp = tmp + tmp + tmp; mov r1,tmp a r1,tmp a r1,tmp * z1 += sinefour[tmp&0x3ff]; andi tmp,>03ff a @sinefour(tmp),z1 Also here, I thought that you should and with 0x7fe. ... * set all four points on the curve, and load both tables genOne li r1,510 s r5,r1 * reflection offset mov r3,r6 neg r6 * negative version mov r3,@sinetab(r5) mov r3,@sinetab+512(r1) mov r6,@sinetab+1024(r5) mov r6,@sinetab+1536(r1) Note that the reflection is not about 255, but about 256 (i.e., the point at "0" is the same that the point at "512", so I suspect that you should use "li r1,512", so that with "r1=0" you store the reflection at 512+512 = 1024. * make the *0.4 version li r6,>0ccd * 0.4 in 3.13 mov r3,r7 jgt gnotneg This is not needed, as r3 is always positive (you are in the first quadrant). Hope it helps! 1 Quote Link to comment Share on other sites More sharing options...
Asmusr Posted March 13, 2015 Share Posted March 13, 2015 >> Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only? A word operation like MOV (as opposed to a byte operation like MOVB) will automatically discard the least significant bit. I.e. if you try to read a word at 1001 you will read the word at 1000 instead. Quote Link to comment Share on other sites More sharing options...
+mizapf Posted March 13, 2015 Share Posted March 13, 2015 >> Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only? A word operation like MOV (as opposed to a byte operation like MOVB) will automatically discard the least significant bit. I.e. if you try to read a word at 1001 you will read the word at 1000 instead. When reading more about other architectures (like MIPS) I learned that this not a necessity for all architectures. Others react with a "bus error" trap. The masking of the LSB is obviously a consequence that the TMS9900 processor does not even have an A15 line; it is created by the chipset on the board. Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 14, 2015 Share Posted March 14, 2015 I love coming back to such problems after SLEEP. Things are so much more clear. Looks like the 2nd sine function call. This controls the number of 'humps' on the hat. Narrows it down! Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only? Yeah, it does. The 9900 actually only has 15 address bits, so the least significant bit is never presented to the address bus. I cheated there. Also here, I thought that you should and with 0x7fe. And that's it, yes. As soon as I read that, combined with Rasmus' comment, I knew this was it. I was masking out half the sine table. (The 9900 may understand it's a word access, but it still makes us provide a byte-sized offset ) Corrected, and it's fixed. It draws a hat now! Note that the reflection is not about 255, but about 256 (i.e., the point at "0" is the same that the point at "512", so I suspect that you should use "li r1,512", so that with "r1=0" you store the reflection at 512+512 = 1024. I tried that at first, and it leaves gaps in the table if I don't store the 1.0 value like your table had. I was a little unclear on that, so yeah, I cheated. I've retested now with the benefit of sleep, and yes, you are correct. Will fix. This is not needed, as r3 is always positive (you are in the first quadrant). Ooh, good info, thank you! Will post an update shortly. (Instead of, you know, doing something useful with my day. ) Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 14, 2015 Share Posted March 14, 2015 (edited) So, porting dmsc's algorithm to the TI seems to work well! Here are the working versions. Here's the standard code - all in 8-bit RAM, takes about 20 seconds to run (and 730 bytes of code+data). The Classic99 heatmap is fun to watch while it's running - you can see the accesses to the sine tables as a sort of sparkling effect. * Fixed point of 3.13 in most cases * Port from DMSC's Atari800XL code * Port to TI-99/4A 9900 by Tursi * Assemble with something that likes labels > 6 chars DEF START * Constants (INTEGER) - TI SX = 256, SY=192 cx equ 128 sx / 2 cy equ 90 sy * 15 / 32 fy equ 56 sy * 7 / 24 * For the TI version.. Ys is an integer (step is 1), and Xs is 8.8 (step 0. * the stepYs is thus optimized out and just inc/dec used. stepXs equ >00cc TOFP(sx / 320.0) * these constants replace the squaring with simple counting loops initZt2 equ >2000 TOFP(1.0) deltaZt equ >0002 TOFP(1.0/(64.0*64.0)) initAZt equ >ff02 deltaZt - TOFP(2.0 / 64.0) stepZt2 equ >0004 2 * deltaZt * The values of delta² are too small for 3.13 bits of fixed-point, so we * use 8 more bits, for an 3.21 bits of precision. This is not slow because * we only use additions. deltaXf equ >009e TOFP( 20.0*20.0 / (9.0*9.0*sx*sx) * 256 ) - the *256 adds 8 bits of zeros initAXf equ >009e deltaXf stepXf2 equ >013c 2 * deltaXf * variables in registers vdpadr equ 15 zt2 EQU 14 azt EQU 13 xs EQU 12 RET EQU 11 * for BL ys EQU 10 zi EQU 9 x1 equ 8 x2 equ 7 axf equ 6 *di equ 5 * optimized away tmp equ 4 z1 equ 3 y1 equ 2 r1 equ 1 r0 equ 0 * for plot tmp1 equ 4 tmp2 equ 3 * this one is easier to store in memory * but we'll stick in scratchpad after the regs * note we reserve 4 to preserve alignment, but * it is big endian, and left aligned (4th byte is unused) xf2 EQU >8320 * 3 bytes long START LWPI >8300 * frequently used VDP address li vdpadr,>8c02 * set up graphics and sine table BL @BITMAP BL @initsine * clear out the oldY table (entries of 192) li r0,oldY li r1,>C0C0 li r2,128 rlp mov r1,*r0+ dec r2 jne rlp * erase the pattern table LI R0,>4000 write address >0000 CLR R1 LI R2,>1800 BL @VDPFILL * set the color table to white on black LI R0,>6000 write address >2000 LI R1,>F100 LI R2,>1800 BL @VDPFILL * init defaults li zt2,initZt2 li azt,initAZt li xs,>3333 li ys,>0040 li zi,128 * ; Outer for loop: loopZi * x1 = (xs>> + cx; // SIGNED mov xs,x1 sra x1,8 ai x1,cx * x2 = x1; mov x1,x2 * ** only xf2 needs 24 bits ** * xf2 = 0; clr @xf2 clr @xf2+2 * axf = initAXf; li axf,initAXf * inner for loop loopXi * di = sqrti( (xf2>> + zt2 ); mov @xf2,r0 * two MS bytes here, LSB in next byte, so no shift needed a zt2,r0 bl @sqrt * return in r0 * if( di >= 0x2000 ) break; (inner loop - moved from the end, no visible difference but frees di) ci r0,>2000 jhe lXiEnd * tmp = (0x3244 * di) >> 13; // * (PI/2) - 3.13 * 3.13 = 6.26, shift and truncate li tmp,>3244 mpy tmp,r0 * output in r0,r1 srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) * tmp = (tmp*489)>>13; // multiplier to go from fixed point max range (1.57) to 3/4 circle (768) mov r1,r0 li tmp,489 mpy tmp,r0 * output in r0,r1 srl r1,12 * not shifting all the way to get a multiply by 2 for the index sla r0,4 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) * z1 = sinetab[tmp]; mov @sinetab(r1),z1 * tmp = tmp + tmp + tmp; mov r1,tmp a r1,tmp a r1,tmp * z1 += sinefour[tmp&0x3ff]; andi tmp,>07fe * a @sinefour(tmp),z1 * delayed till below * tmp = ((z1 * fy) >> 13); // SIGNED li r0,fy * mov z1,z1 a @sinefour(tmp),z1 * moved from above so we can add and test in one step jlt iisneg mpy z1,r0 srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) jmp idone iisneg neg z1 mpy z1,r0 * 3.13 x 16.0 = 19.13, LSW is already correct srl r1,13 sla r0,3 soc r0,r1 * merge the two words back to 3.13 (not saved to tmp yet) neg r1 idone * y1 = ys + cy - tmp; mov ys,y1 ai y1,cy s r1,y1 * if( oldY[x1] > y1 ) c @oldY(x1),y1 jl noplot1 * oldY[x1] = y1; mov y1,@oldY(x1) * dc->SetPixel(x1, y1, RGB(0,0,0)); mov x1,r1 mov y1,r0 bl @plot noplot1 * if( oldY[x2] > y1 ) c @oldY(x2),y1 jl noplot2 * oldY[x2] = y1; mov y1,@oldY(x2) * dc->SetPixel(x2, y1, RGB(0,0,0)); mov x2,r1 mov y1,r0 bl @plot noplot2 * end of inner loop processing (normally after di, but it doesn't change) * x1++, x2--, xf2 += axf, axf += stepXf2 inc x1 dec x2 * xf2 needs to be done bytewise, cause we have to split axf for it mov axf,r0 swpb r0 andi r0,>ff00 * LSB in MSB position for LSB of xf2 mov axf,r1 swpb r1 andi r1,>00FF * MSB in LSB position for MSW of xf2 ab r0,@xf2+2 * add the LSB jnc nocarry inc @xf2 * add in the carry to the MSW nocarry a r1,@xf2 * and add in the MSB ai axf,stepXf2 jmp loopXi lXiEnd * outer loop end * zt2 += azt, azt += stepZt2, xs -= stepXs, --ys, zi-- (condition) a azt,zt2 ai azt,stepZt2 ai xs,-stepXs dec ys dec zi jne loopZi * ; End of program end LWPI >83E0 * GPLWS BL @>000E * SCAN (so you can cancel screen blank) LIMI 2 LIMI 0 JMP end ************************************************************************************* * utility code ************************************************************************************* * VDP access * Write R2 bytes from R1 to VDP R0 * Destroys R0,R1,R2 VDPFILL SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr VMBWLP MOVB R1,@>8C00 DEC R2 JNE VMBWLP B *R11 * load regs list to VDP address, end on >0000 and write >D0 (for sprites) * address of table in R1 (destroyed) LOADRG LOADLP MOV *R1+,R0 JEQ LDRDN SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr JMP LOADLP LDRDN LI R1,>D000 MOVB R1,@>8C00 B *R11 * Setup for normal bitmap mode BITMAP MOV R11,@SAVE * set display and disable sprites LI R1,BMREGS BL @LOADRG * set up SIT - We load the standard 0-255, 3 times LI R0,>5800 SWPB R0 MOVB R0,*vdpadr SWPB R0 MOVB R0,*vdpadr LI R2,3 CLR R1 LP# MOVB R1,@>8C00 AI R1,>0100 JNE LP# DEC R2 JNE LP# MOV @SAVE,R11 B *R11 * IN AND OUT IN R0 * fractions only - returns 1.0 for values >0.99999 * adapted from dmsc's code * R0 in = 3.13 signed fixed point * Uses separate workspace - looks similar to following * http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf * uses regs r0-r5 in new workspace SQWP EQU >8324 we need some workspace, this preserves calling regs SQRT ci r0,>2000 1.0 in 3.13 jl sqrt1 li r0,>2000 too big - early out b *r11 sqrt1 MOV R0,@SQWP LWPI SQWP still have r0! (x) CLR r1 root (r) CLR r2 remHi (h) (r0 is remLo) * clr r4 (q) (doesn't need init, this line just for reference) SLA R0,3 lose the integer part LI r3,13 count = (7+FPSCALE/2) -> 7+6 SQRT0 sla r1,1 r = r<<1; mov r1,r4 q = h + (0xFFFF ^ r); inv r4 a r2,r4 jlt sqrt2 if( q >= 0 ) { r += 2; h = q; } inct r1 mov r4,r2 sqrt2 sla r2,2 h = (h << 2) | (x>>14); mov r0,r5 srl r5,14 soc r5,r2 sla r0,2 x <<= 2; DEC r3 while (--count != 0); JNE SQRT0 MOV r1,@>8300 return r; LWPI >8300 B *R11 * INPUT R1,R0 - kills TMP1,TMP2 as well PLOT * use the E/A routine for address MOV R0,tmp1 R0 is the Y value. SLA tmp1,5 SOC R0,tmp1 ANDI tmp1,>FF07 MOV R1,tmp2 R1 is the X value. ANDI tmp2,7 A R1,tmp1 tmp1 is the byte offset. S tmp2,tmp1 tmp2 is the bit offset. * inline VDP! SWPB tmp1 set up read address MOVB tmp1,*vdpadr SWPB tmp1 MOVB tmp1,*vdpadr ORI tmp1,>4000 we need this later, and provides a VDP delay MOVB @>8800,R1 read the byte from VDP SWPB tmp1 set up write address MOVB tmp1,*vdpadr SWPB tmp1 MOVB tmp1,*vdpadr SOCB @BITS(tmp2),R1 or the bit and provide VDP delay MOVB R1,@>8C00 write the byte back B *R11 * init the sine tables * r1 - temp for reflected offset (0-510) * r2 - add value * r3 - current output value * r4 - current change table entry * r5 - table output offset (0-510) * r6 - temp for negative output value * r7,r8 - temp for x0.4 output * r9 - loop counter initSine mov r11,@SAVE * need this to get home! li r2,54 * starting value clr r9 clr r3 nextbyte clr r4 movb @genTable(r9),r4 * we don't have a stack, easier to do it inline bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne bl @genOne * what we DO have is lots of registers inc r9 ci r9,32 jne nextbyte li r3,>2000 bl @genone mov @SAVE,r11 B *R11 * set all four points on the curve, and load both tables genOne li r1,512 s r5,r1 * reflection offset mov r3,r6 neg r6 * negative version mov r3,@sinetab(r5) mov r3,@sinetab+512(r1) mov r6,@sinetab+1024(r5) mov r6,@sinetab+1536(r1) * make the *0.4 version (r3 is always positive here) li r6,>0ccd * 0.4 in 3.13 mov r3,r7 mpy r6,r7 srl r8,13 * shift fraction sla r7,3 * shift int soc r7,r8 * make 3.13 gdone mov r8,r6 neg r6 mov r8,@sinefour(r5) mov r8,@sinefour+512(r1) mov r6,@sinefour+1024(r5) mov r6,@sinefour+1536(r1) inct r5 a r2,r3 * ; Read bit, test if sum must be decreased sla r4,1 jnc nodec dec r2 nodec B *R11 * bits for pixel BITS DATA >8040,>2010,>0804,>0201 * registers for bitmap (and 5A00 is the address of the sprite table) * background is transparent (the only color never redefined) * PDT - >0000 * SIT - >1800 * SDT - >1800 * CT - >2000 * SAL - >1B00 BMREGS DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000 * data for sine generation genTable data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221 data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912 * BSS section * spot to save return addresses SAVE bss 2 * spot to store the sine table (full 1024 entries) sinetab bss 2048 * sine divided by 4, to remove a multiply inline sinefour bss 2048 * row table for hidden surface (one word per column) oldY bss 512 END The scratchpad version moves sqrt, plot, and the end of the inner loop into scratchpad RAM. This one completes in 17 seconds. (It kind of makes me wonder what a playground-like approach of swapping code in and out would be like - faster or slower?) The overhead of copy loops makes it larger, 790 bytes. Now, interestingly, I tried the suggestion of drawing to a CPU buffer first -- in both the scratchpad and non-scratchpad version, it saved only 1 second of runtime. (Probably because we control the number of pixels drawn so well). But, it freed up a bit more scratchpad space to not have the VDP access there, so I snuck a bit more of the inner loop into it. The net result is a bit under 16 seconds. Files (and all three source) are on the attached disk: DMSCASM_O, DMSCSCRT_O and DMSCHIDE_O, program name START (disk tested). HATASM.zip Note with DMSCHIDE that the screen will display garbage for 16 seconds, then the graphic will appear all at once. But in my opinion, it feels like less of a wait when you get to watch it draw, so the /17/ seconds of DMSCSCRT is infinitely more interesting. And DMSCASM should be an easy port to the F18A GPU. I've included the WinAsm99 source separately in the zip because it won't build with Editor/Assembler anyway. Edited March 14, 2015 by Tursi 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted March 14, 2015 Share Posted March 14, 2015 (edited) Before you celebrate your win over Atari... isn't the Atari plotting for a width of 320 where you are plotting for a width of 256?The Atari is having to draw more pixels. *edit*NIce work though! Edited March 14, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 14, 2015 Share Posted March 14, 2015 Before you celebrate your win over Atari... isn't the Atari plotting for a width of 320 where you are plotting for a width of 256? The Atari is having to draw more pixels. *edit* NIce work though! yes, but they are running the same number of loops. (ie: there's no place where a loop is 64 steps smaller). The difference is made in the /step/ along the X axis, which is '1' on the Atari and 0.8 on the TI. In truth the TI is doing MORE work because of the scaling. 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted March 14, 2015 Share Posted March 14, 2015 (edited) yes, but they are running the same number of loops. (ie: there's no place where a loop is 64 steps smaller). The difference is made in the /step/ along the X axis, which is '1' on the Atari and 0.8 on the TI. In truth the TI is doing MORE work because of the scaling. I haven't looked at the assembly but the BASIC code I posted in 186 doesn't step by 0.8. *edit* BTW, compare the image output by the assembly to that of the BASIC version. Then look at the last pic of the assembly output on the Atari. Edited March 14, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 14, 2015 Share Posted March 14, 2015 I haven't looked at the assembly but the BASIC code I posted in 186 doesn't step by 0.8. *edit* BTW, compare the image output by the assembly to that of the BASIC version. Then look at the last pic of the assembly output on the Atari. No, I think I'm pretty much done with it. I've invested literally days in drawing this silly image. Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 14, 2015 Share Posted March 14, 2015 I'm not yet done with it because I am learning a lot about fixed-point arithmetic and trying to wrap my head around the CORDIC algorithm. I want to implement a fixed-point, CORDIC algorithm for sine. ...lee 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted March 14, 2015 Share Posted March 14, 2015 No, I think I'm pretty much done with it. I've invested literally days in drawing this silly image. I can understand that. Just porting the BASIC took more time than I should have spent on it. 1 Quote Link to comment Share on other sites More sharing options...
bfollett Posted March 14, 2015 Author Share Posted March 14, 2015 The Atari graphics chip steals cycles from the cpu and the speed penalty varies depending on the graphics mode being used. The 320 mode is the worst and I think you use about 30% of the CPU time. There is an option to have a narrow playfield on the Atari that would be 256 pixels instead of 320 and should speed up the program a bit. That would be a more fair comparison of speed if someone rewrote the Atari demo to match the TI resolution using the narrow playfield. The Atari also has the option to shut the display off completely if you didn't want to see the hat being drawn. You'd gain the full 30% speed increase that way. You just turn the screen on when it's done drawing. Bob 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 15, 2015 Share Posted March 15, 2015 (edited) You guys diving in and saying "oh it's not fair", what the heck?? (Was trying to put a real-world example of the difference here, but I seem to have broken my test app It was just for example. Drawing the entire image costs the TI 1 second. Drawing 64 fewer columns, in those cases where it doesn't plot the pixel (it still calculates them), means a difference at absolute most of 0.25 seconds. Since I'm measuring by hand with a stopwatch, I can't even measure that.) Saying that it's not fair because DMA steals cycles -- that's like saying it's not fair because the 6502 is clocked slower than the 9900. Well, I say it's not fair because the Atari has direct access to video RAM, and because the 9900 is crippled by 4 waits states on every memory access, and because our instructions are 16-bits wide instead of 8. It's a machine comparison - architecture is part of it. But it's all a wash, isn't it? The code sizes are nearly identical. Furthermore, in the best case on both platforms (DMA off on the Atari, render to offscreen buffer on the TI), the runtime IS THE SAME. 16 seconds. It IS a fair comparison, but even if it was not, it was just for fun. The winning code in both cases is by the same guy, after all! The 8-bit wars were thirty years ago. Edited March 15, 2015 by Tursi 1 Quote Link to comment Share on other sites More sharing options...
dmsc Posted March 15, 2015 Share Posted March 15, 2015 Hi!, Good to know that it worked, and relatively fast... You guys diving in and saying "oh it's not fair", what the heck?? ... snip ... It IS a fair comparison, but even if it was not, it was just for fun. The winning code in both cases is by the same guy, after all! The 8-bit wars were thirty years ago. I hereby declare the winner as my current notebook, that draws the complete 320x192 figure with the integer algorithm in 0.00067 seconds, so it's 25000 times faster than any of the other contenders 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted March 15, 2015 Share Posted March 15, 2015 Perhaps fair is the wrong terminology. Since the image isn't the same it's a bit of an Apples and Oranges comparison though. Quote Link to comment Share on other sites More sharing options...
bfollett Posted March 15, 2015 Author Share Posted March 15, 2015 Well if trying to draw to different resolutions screens doesn't make comparisons less valid/unfair, why not use the 64X48 mode on the TI. I bet that would shave some time off the results. Bob Quote Link to comment Share on other sites More sharing options...
Tursi Posted March 15, 2015 Share Posted March 15, 2015 Holy crap, you guys. You weren't complaining when I was drawing only 40% of the image in TI BASIC. 1 Quote Link to comment Share on other sites More sharing options...
+Ksarul Posted March 15, 2015 Share Posted March 15, 2015 The key is the interval between X-axis iterations--.8 vs 1.0. Though the TI version used a smaller screen size to display the data, it performed a similar number of calculations overall to compute the data. Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 15, 2015 Share Posted March 15, 2015 The multipy by .8 in is an extra step that is required in the TI code that is not required in the Atari code. In addition the TI had the additional overhead of writing its video via a port, which is on the upper 8 bits of the data bus only. Additional overhead that the Atari version doesn't have. Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.