Bitmap mode.

Tursi · March 12, 2015

Surprising what those 6502 guru's can squeeze out of that little 8-bitter, huh? Chuck Peddle would be proud!

Yeah, the A800XL was the one I had the most exposure to, and I always thought it was a capable little beast.

dmsc over there has his code down to 21 seconds now, although when I said 30 kicked my butt, I had forgotten that the TI code ran in 26 seconds. (I was thinking 40). That said, he's beat 26 now, too, and he's also counting bytes (750 or so, IIRC). So, I undertook to port his version, which does away with all the squaring in terms of smarter step counts, and runs at 3.13 bits fixed point, giving very nice curves.

The code ported and seems to run, but I have some debugging to do. /IF/ I can make it work, we'll get our 16-bitter back on top, it seems to be running the loops in about 16 seconds and in 670 bytes of code, no scratchpad code. (And I can optimize the sine table better than the 8-bit, cause we can have bigger lookup tables ).

If I can fix the bugs, first. It's a bit harder piece of code to follow.

I haven't tested that the new sine or new square root function even work yet, those are the right places to start. There's a clear uninitialized variable problem, because alternate runs produce different outputs. But to share:

* Fixed point of 3.13

* Port from DMSC's Atari800XL 6502 code

* Port to TI-99/4A 9900 by Tursi

* Assemble with something that likes labels > 6 chars

DEF START

* Constants (INTEGER) - TI SX = 256

cx equ 128 * cx = 160 ; sx / 2

cy equ 90 * cy = 90 ; sy * 15/32

startXs equ 211 * startXs = 224 ; sx/5 + cx

startYs equ 154 * startYs = 154 ; sx/5 + cy

numZi equ 128 * numZi = 128 ; (-64+64)

* Constants (Fixed 3.13)

initAzt equ >ff02 * initAZt = $FF02 ; FP (1/64)^2 - 2/64

initZt2 equ 8192 * initZt2 = 8192 ; FP 1

stepZt2 equ 4 * stepZt2 = 4 ; FP 2 * (1/64)^2

initAXf equ 158 * initAXf = 102 ; FP 256 * (20/(9*sx))^2 = 256 * (1/144)^2

stepXf2 equ 316 * stepXf2 = 204 ; FP 256 * 2*(20/(9*sx))^2 = 256 * 2*(1/144)^2

* storing in registers? Going to try.

zt2 EQU 15

azt EQU 14

xs EQU 13 * byte value TODO: check for wrap on byte regs!

ys EQU 12 * byte value

RET EQU 11 * for BL

zi EQU 10 * byte value

x1 EQU 9

x2 EQU 8

y1 EQU 7 * byte value

axf EQU 5

di EQU 4

tmp1 EQU 3

tmp2 EQU 2

R1 EQU 1 * scratch registers

R0 EQU 0

* this one is easier to store in memory

* but we'll stick in scratchpad after the regs

* note we reserve 4 to preserve alignment, but

* it is big endian, and left aligned (4th byte is unused)

xf2 EQU >8320 * 3 bytes long

START

LWPI >8300

* set up graphics and sine table

BL @BITMAP

BL @initsine

* erase the pattern table

CLR R0

CLR R1

LI R2,>1800

BL @VDPFILL

* set the color table to white on black

LI R0,>2000

LI R1,>F100

LI R2,>1800

BL @VDPFILL

* init defaults

li zt2,initZt2

li azt,initAZt

li xs,startXs

li ys,startYs

li zi,numZi

* ; Outer for loop:

loopZi

* ; x1 = x2 = xs

mov xs,x1

mov xs,x2

* ; xf2 = 0

clr @xf2

clr @xf2+2

* ; axf = initAXf

li axf,initAXf

* ; Inner loop:

loopX

* ; AX = xf^2 = xf2 >> 8

* have to be careful with xf2 - in the original code it's

* 3 bytes, little endian, but here, it's 3 bytes, big endian.

* so this code takes the two MOST significant bytes

* This is going to end up in tmp1, so we work there

mov @xf2,tmp1

* ; tmp1 = xf*xf + zt*zt

a zt2,tmp1

* ; di = sqrt(tmp1)

* sqrt will take in and return in R0 (must preserve tmp1)

mov tmp1,r0

bl @sqrt

mov r0,di

* ; tmp1 = di * 3; di = di * 2

sla di,1

mov di,tmp1

a di,tmp1

* ; tmp2 = sin(tmp1)

* sine will take in and return in R0

mov tmp1,r0

bl @sine

mov r0,tmp2

* ; AX = tmp1 * 3

mov tmp1,r0

sla r0,1

a tmp1,r0

* ; AX = sin(AX) = sin(9*di)

bl @sine

* ; AX = (1/2 - 1/8 + 1/32) * AX = 0.40625 * sin(9*di)

* This is one case where the MPY instruction is a clear winner - just reading

* all the instructions on the 9900 would take longer than to execute the

* multiply. The tricky part is not stomping on any important registers!

mpy @point4,r0 * result in r0,r1 - part we care about (LSW) is R1

* ; tmp2 += AX = sin(3*di) + 0.375 * sin(9*di)

a r1,tmp2

* ; tmp2 * 320 * 7 / 40 = tmp2 * 56 = tmp2 * 64 - tmp2 * 8

* another multiply candidate - for us it's tmp2*256*7/40 = tmp2*44.8 (truncate to 44)

* MPY's lack of sign handling is okay if we never care about the MSW

mov tmp2,r0

mpy @num44,r0 * LSW in r1, so 'A' is LSB

mov r1,tmp2

* ; y1 = -A + ys

andi r1,>00FF

neg r1

a ys,r1

mov r1,y1

* ; plot x1, y1

mov x1,r1

mov y1,r0

bl @plot

* ; plot x2, y1

mov x2,r1

mov y1,r0

bl @plot

* ; Condition: if(2*di>=2.0) break;

ci di,>4000

jhe endLoopX

* ; End of inner loop: x1++, x2--, xf2+=axf, axf+=stepXf2

inc x1

dec x2

mov axf,r0 * get MSB of axf

swpb r0

andi r0,>00ff

mov axf,r1 * get LSB of axf

swpb r1

ab r1,@xf2+2 * add to LSB of xf2

jnc nocarry

inc r0 * if there was carry, increment the MSB add

nocarry

a r0,@xf2 * add MSB (and carry if there was one)

ai axf,stepXf2

jmp loopX

endLoopX

* ; End of outer loop: zt2+=azt, azt+=stepZt2, xs--, ys--

dec xs

dec ys

a azt,zt2

ai azt,stepZt2

* ; Condition: zi--; if( zi<0 ) break;

dec zi

joc loopZi

* ; End of program

end

LIMI 2

JMP end

*************************************************************************************

* utility code

*************************************************************************************

* VDP access

* Write single byte to R0 from MSB R1

* Destroys R0 (actually just oRs it)

VSBW

ORI R0,>4000

SWPB R0

MOVB R0,@>8C02

SWPB R0

MOVB R0,@>8C02

MOVB R1,@>8C00

B *R11

* Write R2 bytes from R1 to VDP R0

* Destroys R0,R1,R2

VDPFILL

ORI R0,>4000

SWPB R0

MOVB R0,@>8C02

SWPB R0

MOVB R0,@>8C02

VMBWLP

MOVB R1,@>8C00

DEC R2

JNE VMBWLP

B *R11

* Write address or register

VDPWA

SWPB R0

MOVB R0,@>8C02

SWPB R0

MOVB R0,@>8C02

B *R11

* load regs list to VDP address, end on >0000 and write >D0 (for sprites)

* address of table in R1 (destroyed)

LOADRG

LOADLP

MOV *R1+,R0

JEQ LDRDN

SWPB R0

MOVB R0,@>8C02

SWPB R0

MOVB R0,@>8C02

JMP LOADLP

LDRDN

LI R1,>D000

MOVB R1,@>8C00

B *R11

* Setup for normal bitmap mode

BITMAP

MOV R11,@SAVE

* set display and disable sprites

LI R1,BMREGS

BL @LOADRG

* set up SIT - We load the standard 0-255, 3 times

LI R0,>5800

BL @VDPWA

CLR R2

NQ#

CLR R1

LP#

MOVB R1,@>8C00

AI R1,>0100

CI R1,>0000

JNE LP#

INC R2

CI R2,3

JNE NQ#

MOV @SAVE,R11

B *R11

* IN AND OUT IN T1

* T1 in = 3.13 signed fixed point

* T1 out = 3.13 signed fixed point (3.12 accuracy)

* Uses T2,X1,Y1,T32

* http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf

* because we need an even number of bits, we calculate to 3.12 and then shift

SQWP EQU >8324 * we need some workspace, this preserves calling regs

SQRT

MOV R0,@SQWP

LWPI SQWP still have r0!

CLR r1 root

CLR r2 remHi (r0 is remLo)

LI r3,14 count = ((WORD/2-1)+(FRACBITS>>1)) -> 7+6, +1 for loop

SQRT0

SLA r2,2 remHi = (remHi << 2) | (remLo >> 14);

MOV r0,r4

SRL r4,14

SOC r4,r2

SLA r0,2 remLo <<= 2;

SLA r1,1 root <<= 1;

MOV r1,r4 testDiv = (root << 1) + 1;

SLA r4,1

INC r4

C r2,r4 if (remHi >= testDiv) {

JL SQRT2

S r4,r2 remHi -= testDiv;

INC r1 root += 1;

SQRT2

DEC r3 while (--count != 0);

JNE SQRT0

MOV r1,@>8300 return( root);

LWPI >8300

SLA r0,1 Get it up to x.13 fixed point

B *R11

* INPUT R1,R0 - kills TMP1,TMP2 as well

PLOT

* use the E/A routine for address

MOV R0,tmp1 R0 is the Y value.

SLA tmp1,5

SOC R0,tmp1

ANDI tmp1,>FF07

MOV R1,tmp2 R1 is the X value.

ANDI tmp2,7

A R1,tmp1 tmp1 is the byte offset.

S tmp2,tmp1 tmp2 is the bit offset.

* inline VDP!

SWPB tmp1 set up read address

MOVB tmp1,@>8C02

SWPB tmp1

MOVB tmp1,@>8C02

ORI tmp1,>4000 we need this later, and provides a VDP delay

MOVB @>8800,R1 read the byte from VDP

SWPB tmp1 set up write address

MOVB tmp1,@>8C02

SWPB tmp1

MOVB tmp1,@>8C02

SOCB @BITS(tmp2),R1 or the bit and provide VDP delay

MOVB R1,@>8C00 write the byte back

B *R11

* sine code ported from dmsc's 6502 code as well

* my input and output in R0 as 3.13 fixed point

sine

* stx mul2+1

* asl a

* rol mul2+1

* asl a

* rol mul2+1

* ror mul2+2

* asl a

* rol mul2+1

* ; mul2+1 has the fractional part, mul2+2 has the sign (1 bits), C is the reflection

* I think there was a bug in the original code in that mul2+2 ends up with the msb integer,

* rather than the sign bit (ror should be after the first rol?). But as long as the

* integer was always less than 2, it would still have always worked. Ultimately,

* this code is taking this value: S B1 B2 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13

* it wants the sign bit S at mul2+2 (assuming the comment is correct)

* it wants the 'reflection' bit from B2 into the Carry

* and it takes the fractional bits F1-F8 for the index into mul2+1

* We need index * 2, so we'll just take an extra bit

* this is a lot more expensive than a bigger lookup table... it gives a 512 entry

* table in 512 bytes of RAM (instead of 1024).. but using 1k of memory this

* would just be a single lookup, rather than all the tests and inversions.

mov r0,r1

srl r1,4

andi r1,>1fe * mask out integer and mask out odd bit

* bcc @noreflect

coc @reflect,r0

jne noreflect

* lda mul2+1

* eor #255

* tay

* lda sineLo+1,y

* ldx sineHi+1,y

xor @flipidx,r1

mov @sinetab+2(r1),r1

* bcs @invert

* we know the carry is set from the above test, nothing changed it,

* so this is an always branch.

jmp invert

noreflect

* ldy mul2+1

* lda sineLo,y

* ldx sineHi,y

mov @sinetab(r1),r1

invert

* ldy mul2+2

* bpl @ok

coc @negativ,r0

jne ok

* clc

* eor #255

* adc #1

* pha

* txa

* eor #255

* adc #0

* tax

* pla

inv r1

ok

* rts

mov r1,r0

b *r11

* init the sine table (if we're going to build a sine table,

* why not build the whole thing instead of just half of it?)

* iniSum = 54

iniSum equ 54

initSine

mov r11,@SAVE

* lda #iniSum

* sta tmp1

li tmp1,iniSum

* lda #0

* sta mtmp

* sta mtmp+1

* tay

* tax

* no reason not to use the registers here...

clr r1

clr x1

nextbyte

* ; Read next byte

* pha

* lda genTable,x

* sta tmp2

* pla

clr r0

movb @genTable(x1),tmp2

* inx

* stx mtmp+1

* ldx mtmp

inc r1

* jsr genFour

* we don't have a stack, easier to do it inline

bl @genOne

* stx mtmp

* ldx mtmp+1

* cpx #32

* bne nextbyte

* what we DO have is lots of registers

ci r1,32

jne nextbyte

* lda #0

* sta sineLo+256

* lda #$20

* sta sineHi+256

li r0,>2000

mov r0,@sinetab+512

* rts

mov @SAVE,r11

B *R11

* ; Generates one value

*genFour:

* jsr genTwo

*genTwo:

* jsr genOne

* cute trick

genOne

* ; AX = AX + sum

* sta sineLo, y

* pha

* txa

* sta sineHi, y

* pla

* iny

mov x1,@sinetab(y1)

inct y1

* clc

* adc tmp1

* bcc @noinc

* inx

*@noinc:

a tmp1,x1

* ; Read bit, test if sum must be decreased

* asl tmp2

* bcc @nodec

* dec tmp1

*@nodec:

sla tmp2,1

jnc nodec

dec tmp1

nodec

* rts

B *R11

* just a label so I can check sizes

DATASPC

* (my own - 0.4 in 3.13 fixed point)

* data reference so we can MPY directly

point4 DATA 3276

* And this is just a 44 to mpy against

num44 DATA 44

* bits for pixel

BITS

DATA >8040,>2010,>0804,>0201

* registers for bitmap (and 5A00 is the address of the sprite table)

* background is transparent (the only color never redefined)

* PDT - >0000

* SIT - >1800

* SDT - >1800

* CT - >2000

* SAL - >1B00

BMREGS DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000

* misc bits for testing purposes

reflect data >2000

negativ data >8000

flipidx data >01FE

* data for sine generation

genTable data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221

data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912

* spot to save return addresses

SAVE bss 2

* spot to store the sine table

sinetab bss 514

END

sometimes99er · March 12, 2015

Actually a nice computer generated image.

+OLD CS1 · March 12, 2015

Surprising what those 6502 guru's can squeeze out of that little 8-bitter, huh? Chuck Peddle would be proud!

I will happily raise a glass to our 8-bit pioneer forefathers. The 6500-,6800-, and Z80 family processors are just incredible. Without diminishing what came before, I hazard to say the work on those CPUs represented a much larger leap in technology than anything being made today.

Tursi · March 13, 2015

Close... but not quite. If I follow the error patterns right, this is a sign error or a step error in the sine table... I'm too sleepy to trace much more (and getting kind of tired of this hat ).

If anyone knows offhand which variable I screwed up to get this pattern, let me know and we can fix it! Otherwise, it's almost there. This is a rewrite port of dmcs's version of the hat, using 3.13 fixed point. While he eliminated all multiplies for the 6502, I put a few back in because the 9900 can do them more efficiently than a lot of shifting (although I still need some shifts anyway, so I might be wrong on that). I also extended his sine table to 1024 entries to allow a straight lookup rather than mirroring and reflecting (because we can have indexes larger than 255), and created a second sine table that was pre-scaled to save an inline multiply.

Assuming it's running the right number of cycles, it runs in just shy of 20 seconds, and takes 768 bytes (code+data, not including the test block at the end). Once we get it working correctly, I can optimize it a bit more and shove most of the inner loop into scratchpad, I think.

* Fixed point of 3.13 in most cases
* Port from DMSC's Atari800XL code
* Port to TI-99/4A 9900 by Tursi
* Assemble with something that likes labels > 6 chars
    DEF START,TEST
* Constants (INTEGER) - TI SX = 256, SY=192
cx   equ  128  sx / 2
cy   equ  90  sy * 15 / 32
fy   equ  56  sy * 7 / 24
* For the TI version.. Ys is an integer (step is 1), and Xs is 8.8 (step 0.
* the stepYs is thus optimized out and just inc/dec used.
stepXs equ  >00cc TOFP(sx / 320.0)
* these constants replace the squaring with simple counting loops
initZt2 equ  >2000 TOFP(1.0)
deltaZt equ  >0002 TOFP(1.0/(64.0*64.0))
initAZt equ  >ff02 deltaZt - TOFP(2.0 / 64.0)
stepZt2 equ  >0004 2 * deltaZt
* The values of delta² are too small for 3.13 bits of fixed-point, so we
* use 8 more bits, for an 3.21 bits of precision. This is not slow because
* we only use additions.
deltaXf equ  >009e TOFP( 20.0*20.0 / (9.0*9.0*sx*sx) * 256 ) - the *256 adds 8 bits of zeros
initAXf equ  >009e deltaXf
stepXf2 equ  >013c 2 * deltaXf
* variables in registers
vdpadr equ 15
zt2  EQU 14
azt  EQU 13
xs  EQU 12
RET  EQU 11  * for BL
ys  EQU 10
zi  EQU 9
x1  equ 8
x2  equ 7
axf  equ 6
di  equ 5
tmp  equ 4
z1  equ 3
y1  equ 2
r1  equ 1
r0  equ 0
* for plot
tmp1 equ 4
tmp2 equ 3
* this one is easier to store in memory
* but we'll stick in scratchpad after the regs
* note we reserve 4 to preserve alignment, but
* it is big endian, and left aligned (4th byte is unused)
xf2  EQU >8320 * 3 bytes long
START
  LWPI >8300
* frequently used VDP address 
  li vdpadr,>8c02
* set up graphics and sine table
  BL @BITMAP
  BL @initsine
  
* clear out the oldY table (entries of 192)
  li r0,oldY
  li r1,>C000
  li r2,256
rlp
  movb r1,*r0+
  dec r2
  jne rlp
* erase the pattern table
  CLR R0
  CLR R1
  LI R2,>1800
  BL @VDPFILL
* set the color table to white on black
  LI R0,>2000
  LI R1,>F100
  LI R2,>1800
  BL @VDPFILL
* init defaults
  li zt2,initZt2
  li azt,initAZt
  li xs,>3333
  li ys,>0040
  li zi,128
*        ; Outer for loop:
loopZi
* x1 = (xs>> + cx;  // SIGNED
  mov xs,x1
  sra x1,8
  ai x1,cx
* x2 = x1;
  mov x1,x2
*   ** only xf2 needs 24 bits **
* xf2 = 0;
  clr @xf2
  clr @xf2+2
 
*   axf = initAXf;
  li axf,initAXf
  
* inner for loop
loopXi
* di = sqrti( (xf2>> + zt2 );
  mov @xf2,r0   * two MS bytes here, LSB in next byte, so no shift needed
  a zt2,r0
  bl @sqrt   * return in r0
  mov r0,di   * we save it cause we need to check it later anyway
*  tmp = (0x3244 * di) >> 13; // * (PI/2) - 3.13 * 3.13 = 6.26, shift and truncate
  li tmp,>3244
  mpy tmp,r0   * output in r0,r1
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
* tmp = (tmp*489)>>13;  // multiplier to go from fixed point max range (1.57) to 3/4 circle (768)
  mov r1,r0
  li tmp,489
  mpy tmp,r0   * output in r0,r1
  srl r1,12   * not shifting all the way to get a multiply by 2 for the index
  sla r0,4
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
* z1 = sinetab[tmp];
  mov @sinetab(r1),z1
  
*  tmp = tmp + tmp + tmp;
  mov r1,tmp
  a r1,tmp
  a r1,tmp
  
* z1 += sinefour[tmp&0x3ff];
  andi tmp,>03ff
  a @sinefour(tmp),z1
* tmp =  ((z1 * fy) >> 13); // SIGNED
  li r0,fy
  mov z1,z1
  jgt inotneg
  
  neg z1
  mpy z1,r0   * 3.13 x 16.0 = 19.13, LSW is already correct
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
  neg r1
  jmp idone
  
inotneg
  mpy z1,r0
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
idone
* y1 = ys + cy - tmp;
  mov ys,y1
  ai y1,cy
  s r1,y1
* if( oldY[x1] > y1 )
  c @oldY(x1),y1
  jl noplot1
  
* oldY[x1] = y1;
  mov y1,@oldY(x1)
* dc->SetPixel(x1, y1, RGB(0,0,0));
  mov x1,r1
  mov y1,r0
  bl @plot
  
noplot1
         
*   if( oldY[x2] > y1 )
  c @oldY(x2),y1
  jl noplot2
  
* oldY[x2] = y1;
  mov y1,@oldY(x2)
 
* dc->SetPixel(x2, y1, RGB(0,0,0));
  mov x2,r1
  mov y1,r0
  bl @plot
noplot2
* end of inner loop processing (normally after di, but it doesn't change)
* x1++, x2--, xf2 += axf, axf += stepXf2
  inc x1
  dec x2
 
* xf2 needs to be done bytewise, cause we have to split axf for it
  mov axf,r0
  swpb r0
  andi r0,>ff00  * LSB in MSB position for LSB of xf2
  mov axf,r1
  swpb r1
  andi r1,>00FF  * MSB in LSB position for MSW of xf2
  ab r0,@xf2+2  * add the LSB
  jnc nocarry
  inc @xf2   * add in the carry to the MSW
nocarry 
  a r1,@xf2   * and add in the MSB
 
  ai axf,stepXf2
   
* if( di >= 0x2000 ) break; (inner loop)
  ci di,>2000
  jlt loopXi
 
* outer loop end
* zt2 += azt, azt += stepZt2, xs -= stepXs, --ys, zi-- (condition)
  a azt,zt2
  ai azt,stepZt2
  ai xs,-stepXs
  dec ys
  
  dec zi
  jne loopZi
*        ; End of program
end
  LIMI 2
  JMP end
  
*************************************************************************************
* utility code
*************************************************************************************
* VDP access
* Write single byte to R0 from MSB R1
* Destroys R0 (actually just oRs it)
VSBW
    ORI R0,>4000
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    MOVB R1,@>8C00
    B *R11
* Write R2 bytes from R1 to VDP R0
* Destroys R0,R1,R2
VDPFILL
    ORI R0,>4000
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
VMBWLP
    MOVB R1,@>8C00
    DEC R2
    JNE VMBWLP
    B *R11
   
* Write address or register
VDPWA
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    B *R11 
   
* load regs list to VDP address, end on >0000 and write >D0 (for sprites)
* address of table in R1 (destroyed)
LOADRG
LOADLP
    MOV *R1+,R0
    JEQ LDRDN
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    JMP LOADLP
LDRDN
    LI R1,>D000
    MOVB R1,@>8C00
    B *R11
* Setup for normal bitmap mode
BITMAP
    MOV R11,@SAVE
* set display and disable sprites
    LI R1,BMREGS
    BL @LOADRG
   
* set up SIT - We load the standard 0-255, 3 times
    LI R0,>5800
    BL @VDPWA
    CLR R2
NQ#
    CLR R1
LP#
    MOVB R1,@>8C00
    AI R1,>0100
    CI R1,>0000
    JNE LP#
    INC R2
    CI R2,3
    JNE NQ#
   
    MOV @SAVE,R11
    B *R11
* IN AND OUT IN R0
* fractions only - returns 1.0 for values >0.99999
* adapted from dmsc's code
* R0 in = 3.13 signed fixed point
* Uses separate workspace - looks similar to following
* http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf
SQWP EQU >8324   * we need some workspace, this preserves calling regs
SQRT
 ci r0,>2000   1.0 in 3.13
 jl sqrt1
 li r0,>2000   too big - early out
 b *r11
sqrt1 
 MOV R0,@SQWP
 LWPI SQWP   still have r0! (x)
 
    CLR r1              root (r)
    CLR r2              remHi (h) (r0 is remLo)
*  clr r4    (q)
    SLA R0,3   lose the integer part
    LI r3,13            count = (7+FPSCALE/2) -> 7+6
SQRT0
    sla r1,1            r = r<<1;
    mov r1,r4   q = h + (0xFFFF ^ r);
    inv r4
    a r2,r4    
    jlt sqrt2   if( q >= 0 ) { r += 2; h = q; }
    inct r1
    mov r4,r2
sqrt2
   
    sla r2,2   h = (h << 2) | (x>>14);
    mov r0,r5
    srl r5,14
    soc r5,r2
    sla r0,2   x <<= 2;
   
    DEC r3              while (--count != 0);
    JNE SQRT0
   
    MOV r1,@>8300       return r;
    LWPI >8300
  
    B *R11
* INPUT R1,R0 - kills TMP1,TMP2 as well
PLOT
* use the E/A routine for address
    MOV  R0,tmp1        R0 is the Y value.
    SLA  tmp1,5
    SOC  R0,tmp1
    ANDI tmp1,>FF07
    MOV  R1,tmp2        R1 is the X value.
    ANDI tmp2,7
    A    R1,tmp1        tmp1 is the byte offset.
    S    tmp2,tmp1      tmp2 is the bit offset.
   
* inline VDP!
    SWPB tmp1                set up read address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    ORI tmp1,>4000        we need this later, and provides a VDP delay
    MOVB @>8800,R1        read the byte from VDP
    SWPB tmp1             set up write address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    SOCB @BITS(tmp2),R1   or the bit and provide VDP delay
    MOVB R1,@>8C00        write the byte back
    B *R11
* init the sine tables
* r1 - temp for reflected offset (0-510)
* r2 - add value
* r3 - current output value
* r4 - current change table entry
* r5 - table output offset (0-510)
* r6 - temp for negative output value
* r7,r8 - temp for x0.4 output
* r9 - loop counter
initSine
  mov r11,@SAVE
  
  li r2,54   * starting value
* no reason not to use the registers here...
  clr r9
  clr r3
nextbyte
  clr r4
  movb @genTable(r9),r4
* we don't have a stack, easier to do it inline
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
* what we DO have is lots of registers 
  inc r9
  ci r9,32
  jne nextbyte
  
  mov @SAVE,r11
  B *R11
* set all four points on the curve, and load both tables
genOne
  li r1,510
  s r5,r1     * reflection offset
  mov r3,r6
  neg r6     * negative version
  mov r3,@sinetab(r5)
  mov r3,@sinetab+512(r1)
  mov r6,@sinetab+1024(r5)
  mov r6,@sinetab+1536(r1)
  
* make the *0.4 version
  li r6,>0ccd    * 0.4 in 3.13
  mov r3,r7
  jgt gnotneg
  
  abs r7
  mpy r6,r7
  srl r8,13    * shift fraction
  sla r7,3    * shift int
  soc r7,r8    * make 3.13
  neg r8
  jmp gdone
  
gnotneg
  mpy r6,r7
  srl r8,13    * shift fraction
  sla r7,3    * shift int
  soc r7,r8    * make 3.13
gdone
  mov r8,r6
  neg r6
  mov r8,@sinefour(r5)
  mov r8,@sinefour+512(r1)
  mov r6,@sinefour+1024(r5)
  mov r6,@sinefour+1536(r1)
  
  inct r5
  a r2,r3
*        ; Read bit, test if sum must be decreased
  sla r4,1
  jnc nodec
  dec r2
nodec
  B *R11
* just a label so I can check sizes
DATASPC
* (my own - 0.40625 in 0.16 fixed point)
* data reference so we can MPY directly
point4 DATA >6800
* And this is just a 44 to mpy against
num44 DATA 44
* bits for pixel
BITS
    DATA >8040,>2010,>0804,>0201
* registers for bitmap (and 5A00 is the address of the sprite table)
* background is transparent (the only color never redefined)
* PDT - >0000
* SIT - >1800
* SDT - >1800
* CT  - >2000
* SAL - >1B00
BMREGS DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000
* misc bits for testing purposes
reflect data >2000
negativ data >4000
flipidx data >01FE
* data for sine generation
genTable data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221
         data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912
* BSS section
* spot to save return addresses
SAVE bss 2
* spot to store the sine table (full 1024 entries)
sinetab bss 2048
* sine divided by 4, to remove a multiply inline
sinefour bss 2048
* row table for hidden surface (one entry per column)
oldY bss 512
******************************************************************************
TEST
  LWPI >8300
* frequently used VDP address 
  li vdpadr,>8c02
* set up graphics and sine table
  BL @BITMAP
  BL @initsine
  
* erase the pattern table
  CLR R0
  CLR R1
  LI R2,>1800
  BL @VDPFILL
* set the color table to white on black
  LI R0,>2000
  LI R1,>F100
  LI R2,>1800
  BL @VDPFILL
* draw a sine wave
  clr x1
sinlp
  mov x1,r1
  sla r1,3
  mov @sinetab(r1),r0
  sra r0,7
  ai r0,96
  mov x1,r1
  bl @plot
  inc x1
  ci x1,256
  jne sinlp
  
* draw a sine wave
  clr x1
sinlp2
  mov x1,r1
  sla r1,3
  mov @sinefour(r1),r0
  sra r0,7
  ai r0,96
  mov x1,r1
  bl @plot
  inc x1
  ci x1,256
  jne sinlp2
  b @end
  END

Willsy · March 13, 2015

I stand (well, sit actually) in awe of this....!

Asmusr · March 13, 2015

If anyone knows offhand which variable I screwed up to get this pattern, let me know and we can fix it!

Looks like the 2nd sine function call. This controls the number of 'humps' on the hat.

dmsc · March 13, 2015

Hi!

I browsed your program, without really understanding the 9900 assembly, but here are some comments:

  srl r1,12   * not shifting all the way to get a multiply by 2 for the index
  sla r0,4
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
* z1 = sinetab[tmp];
  mov @sinetab(r1),z1

Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only?

...

*  tmp = tmp + tmp + tmp;
  mov r1,tmp
  a r1,tmp
  a r1,tmp
  
* z1 += sinefour[tmp&0x3ff];
  andi tmp,>03ff
  a @sinefour(tmp),z1

Also here, I thought that you should and with 0x7fe.

...

* set all four points on the curve, and load both tables
genOne
  li r1,510
  s r5,r1     * reflection offset
  mov r3,r6
  neg r6     * negative version
  mov r3,@sinetab(r5)
  mov r3,@sinetab+512(r1)
  mov r6,@sinetab+1024(r5)
  mov r6,@sinetab+1536(r1)

Note that the reflection is not about 255, but about 256 (i.e., the point at "0" is the same that the point at "512", so I suspect that you should use "li r1,512", so that with "r1=0" you store the reflection at 512+512 = 1024.

  
* make the *0.4 version
  li r6,>0ccd    * 0.4 in 3.13
  mov r3,r7
  jgt gnotneg

This is not needed, as r3 is always positive (you are in the first quadrant).

Hope it helps!

Asmusr · March 13, 2015

>> Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only?

A word operation like MOV (as opposed to a byte operation like MOVB) will automatically discard the least significant bit. I.e. if you try to read a word at 1001 you will read the word at 1000 instead.

+mizapf · March 13, 2015

>> Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only?

A word operation like MOV (as opposed to a byte operation like MOVB) will automatically discard the least significant bit. I.e. if you try to read a word at 1001 you will read the word at 1000 instead.

When reading more about other architectures (like MIPS) I learned that this not a necessity for all architectures. Others react with a "bus error" trap. The masking of the LSB is obviously a consequence that the TMS9900 processor does not even have an A15 line; it is created by the chipset on the board.

Tursi · March 14, 2015

I love coming back to such problems after SLEEP. Things are so much more clear.

Looks like the 2nd sine function call. This controls the number of 'humps' on the hat.

Narrows it down!

Don't you need something like "r1 = r1 & 0X7fe", or the indexes automatically address words only?

Yeah, it does. The 9900 actually only has 15 address bits, so the least significant bit is never presented to the address bus. I cheated there.

Also here, I thought that you should and with 0x7fe.

And that's it, yes. As soon as I read that, combined with Rasmus' comment, I knew this was it. I was masking out half the sine table. (The 9900 may understand it's a word access, but it still makes us provide a byte-sized offset ) Corrected, and it's fixed. It draws a hat now!

Note that the reflection is not about 255, but about 256 (i.e., the point at "0" is the same that the point at "512", so I suspect that you should use "li r1,512", so that with "r1=0" you store the reflection at 512+512 = 1024.

I tried that at first, and it leaves gaps in the table if I don't store the 1.0 value like your table had. I was a little unclear on that, so yeah, I cheated. I've retested now with the benefit of sleep, and yes, you are correct. Will fix.

This is not needed, as r3 is always positive (you are in the first quadrant).

Ooh, good info, thank you!

Will post an update shortly. (Instead of, you know, doing something useful with my day. )

Tursi · March 14, 2015

So, porting dmsc's algorithm to the TI seems to work well! Here are the working versions.

Here's the standard code - all in 8-bit RAM, takes about 20 seconds to run (and 730 bytes of code+data).

The Classic99 heatmap is fun to watch while it's running - you can see the accesses to the sine tables as a sort of sparkling effect.

* Fixed point of 3.13 in most cases
* Port from DMSC's Atari800XL code
* Port to TI-99/4A 9900 by Tursi
* Assemble with something that likes labels > 6 chars
    DEF START
* Constants (INTEGER) - TI SX = 256, SY=192
cx   equ  128  sx / 2
cy   equ  90  sy * 15 / 32
fy   equ  56  sy * 7 / 24
* For the TI version.. Ys is an integer (step is 1), and Xs is 8.8 (step 0.
* the stepYs is thus optimized out and just inc/dec used.
stepXs equ  >00cc TOFP(sx / 320.0)
* these constants replace the squaring with simple counting loops
initZt2 equ  >2000 TOFP(1.0)
deltaZt equ  >0002 TOFP(1.0/(64.0*64.0))
initAZt equ  >ff02 deltaZt - TOFP(2.0 / 64.0)
stepZt2 equ  >0004 2 * deltaZt
* The values of delta² are too small for 3.13 bits of fixed-point, so we
* use 8 more bits, for an 3.21 bits of precision. This is not slow because
* we only use additions.
deltaXf equ  >009e TOFP( 20.0*20.0 / (9.0*9.0*sx*sx) * 256 ) - the *256 adds 8 bits of zeros
initAXf equ  >009e deltaXf
stepXf2 equ  >013c 2 * deltaXf
* variables in registers
vdpadr equ 15
zt2  EQU 14
azt  EQU 13
xs  EQU 12
RET  EQU 11  * for BL
ys  EQU 10
zi  EQU 9
x1  equ 8
x2  equ 7
axf  equ 6
*di  equ 5  * optimized away
tmp  equ 4
z1  equ 3
y1  equ 2
r1  equ 1
r0  equ 0
* for plot
tmp1 equ 4
tmp2 equ 3
* this one is easier to store in memory
* but we'll stick in scratchpad after the regs
* note we reserve 4 to preserve alignment, but
* it is big endian, and left aligned (4th byte is unused)
xf2  EQU >8320 * 3 bytes long
START
  LWPI >8300
* frequently used VDP address 
  li vdpadr,>8c02
* set up graphics and sine table
  BL @BITMAP
  BL @initsine
  
* clear out the oldY table (entries of 192)
  li r0,oldY
  li r1,>C0C0
  li r2,128
rlp
  mov r1,*r0+
  dec r2
  jne rlp
* erase the pattern table
  LI R0,>4000  write address >0000
  CLR R1
  LI R2,>1800
  BL @VDPFILL
  
* set the color table to white on black
  LI R0,>6000  write address >2000
  LI R1,>F100
  LI R2,>1800
  BL @VDPFILL
* init defaults
  li zt2,initZt2
  li azt,initAZt
  li xs,>3333
  li ys,>0040
  li zi,128
*        ; Outer for loop:
loopZi
* x1 = (xs>> + cx;  // SIGNED
  mov xs,x1
  sra x1,8
  ai x1,cx
* x2 = x1;
  mov x1,x2
*   ** only xf2 needs 24 bits **
* xf2 = 0;
  clr @xf2
  clr @xf2+2
 
*   axf = initAXf;
  li axf,initAXf
  
* inner for loop
loopXi
* di = sqrti( (xf2>> + zt2 );
  mov @xf2,r0   * two MS bytes here, LSB in next byte, so no shift needed
  a zt2,r0
  bl @sqrt   * return in r0
* if( di >= 0x2000 ) break; (inner loop - moved from the end, no visible difference but frees di)
  ci r0,>2000
  jhe lXiEnd
*  tmp = (0x3244 * di) >> 13; // * (PI/2) - 3.13 * 3.13 = 6.26, shift and truncate
  li tmp,>3244
  mpy tmp,r0   * output in r0,r1
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
* tmp = (tmp*489)>>13;  // multiplier to go from fixed point max range (1.57) to 3/4 circle (768)
  mov r1,r0
  li tmp,489
  mpy tmp,r0   * output in r0,r1
  srl r1,12   * not shifting all the way to get a multiply by 2 for the index
  sla r0,4
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
* z1 = sinetab[tmp];
  mov @sinetab(r1),z1
  
*  tmp = tmp + tmp + tmp;
  mov r1,tmp
  a r1,tmp
  a r1,tmp
  
* z1 += sinefour[tmp&0x3ff];
  andi tmp,>07fe
*  a @sinefour(tmp),z1 * delayed till below
* tmp =  ((z1 * fy) >> 13); // SIGNED
  li r0,fy
*  mov z1,z1
  a @sinefour(tmp),z1 * moved from above so we can add and test in one step
  jlt iisneg
  
  mpy z1,r0
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
  jmp idone
iisneg  
  neg z1
  mpy z1,r0   * 3.13 x 16.0 = 19.13, LSW is already correct
  srl r1,13
  sla r0,3
  soc r0,r1   * merge the two words back to 3.13 (not saved to tmp yet)
  neg r1
idone
* y1 = ys + cy - tmp;
  mov ys,y1
  ai y1,cy
  s r1,y1
* if( oldY[x1] > y1 )
  c @oldY(x1),y1
  jl noplot1
  
* oldY[x1] = y1;
  mov y1,@oldY(x1)
* dc->SetPixel(x1, y1, RGB(0,0,0));
  mov x1,r1
  mov y1,r0
  bl @plot
  
noplot1
         
*   if( oldY[x2] > y1 )
  c @oldY(x2),y1
  jl noplot2
  
* oldY[x2] = y1;
  mov y1,@oldY(x2)
 
* dc->SetPixel(x2, y1, RGB(0,0,0));
  mov x2,r1
  mov y1,r0
  bl @plot
noplot2
* end of inner loop processing (normally after di, but it doesn't change)
* x1++, x2--, xf2 += axf, axf += stepXf2
  inc x1
  dec x2
 
* xf2 needs to be done bytewise, cause we have to split axf for it
  mov axf,r0
  swpb r0
  andi r0,>ff00  * LSB in MSB position for LSB of xf2
  mov axf,r1
  swpb r1
  andi r1,>00FF  * MSB in LSB position for MSW of xf2
  ab r0,@xf2+2  * add the LSB
  jnc nocarry
  inc @xf2   * add in the carry to the MSW
nocarry 
  a r1,@xf2   * and add in the MSB
 
  ai axf,stepXf2
   
  jmp loopXi
lXiEnd
* outer loop end
* zt2 += azt, azt += stepZt2, xs -= stepXs, --ys, zi-- (condition)
  a azt,zt2
  ai azt,stepZt2
  ai xs,-stepXs
  dec ys
  
  dec zi
  jne loopZi
*        ; End of program
end
  LWPI >83E0   * GPLWS
  BL @>000E   * SCAN (so you can cancel screen blank)
  LIMI 2
  LIMI 0
  JMP end
  
*************************************************************************************
* utility code
*************************************************************************************
* VDP access
* Write R2 bytes from R1 to VDP R0
* Destroys R0,R1,R2
VDPFILL
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
VMBWLP
    MOVB R1,@>8C00
    DEC R2
    JNE VMBWLP
    B *R11
  
* load regs list to VDP address, end on >0000 and write >D0 (for sprites)
* address of table in R1 (destroyed)
LOADRG
LOADLP
    MOV *R1+,R0
    JEQ LDRDN
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    JMP LOADLP
LDRDN
    LI R1,>D000
    MOVB R1,@>8C00
    B *R11
* Setup for normal bitmap mode
BITMAP
    MOV R11,@SAVE
* set display and disable sprites
    LI R1,BMREGS
    BL @LOADRG
   
* set up SIT - We load the standard 0-255, 3 times
    LI R0,>5800
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    LI R2,3
    CLR R1
LP#
    MOVB R1,@>8C00
    AI R1,>0100
    JNE LP#
    DEC R2
    JNE LP#
   
    MOV @SAVE,R11
    B *R11
* IN AND OUT IN R0
* fractions only - returns 1.0 for values >0.99999
* adapted from dmsc's code
* R0 in = 3.13 signed fixed point
* Uses separate workspace - looks similar to following
* http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf
* uses regs r0-r5 in new workspace
SQWP EQU >8324   we need some workspace, this preserves calling regs
SQRT
 ci r0,>2000   1.0 in 3.13
 jl sqrt1
 li r0,>2000   too big - early out
 b *r11
sqrt1 
 MOV R0,@SQWP
 LWPI SQWP   still have r0! (x)
 
    CLR r1              root (r)
    CLR r2              remHi (h) (r0 is remLo)
*  clr r4    (q) (doesn't need init, this line just for reference)
    SLA R0,3   lose the integer part
    LI r3,13            count = (7+FPSCALE/2) -> 7+6
SQRT0
    sla r1,1            r = r<<1;
    mov r1,r4   q = h + (0xFFFF ^ r);
    inv r4
    a r2,r4    
    jlt sqrt2   if( q >= 0 ) { r += 2; h = q; }
    inct r1
    mov r4,r2
sqrt2
   
    sla r2,2   h = (h << 2) | (x>>14);
    mov r0,r5
    srl r5,14
    soc r5,r2
    sla r0,2   x <<= 2;
   
    DEC r3              while (--count != 0);
    JNE SQRT0
   
    MOV r1,@>8300       return r;
    LWPI >8300
  
    B *R11
* INPUT R1,R0 - kills TMP1,TMP2 as well
PLOT
* use the E/A routine for address
    MOV  R0,tmp1        R0 is the Y value.
    SLA  tmp1,5
    SOC  R0,tmp1
    ANDI tmp1,>FF07
    MOV  R1,tmp2        R1 is the X value.
    ANDI tmp2,7
    A    R1,tmp1        tmp1 is the byte offset.
    S    tmp2,tmp1      tmp2 is the bit offset.
   
* inline VDP!
    SWPB tmp1             set up read address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    ORI tmp1,>4000        we need this later, and provides a VDP delay
    MOVB @>8800,R1        read the byte from VDP
    SWPB tmp1             set up write address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    SOCB @BITS(tmp2),R1   or the bit and provide VDP delay
    MOVB R1,@>8C00        write the byte back
    B *R11
* init the sine tables
* r1 - temp for reflected offset (0-510)
* r2 - add value
* r3 - current output value
* r4 - current change table entry
* r5 - table output offset (0-510)
* r6 - temp for negative output value
* r7,r8 - temp for x0.4 output
* r9 - loop counter
initSine
  mov r11,@SAVE  * need this to get home!
  
  li r2,54   * starting value
  clr r9
  clr r3
nextbyte
  clr r4
  movb @genTable(r9),r4
* we don't have a stack, easier to do it inline
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
  bl @genOne
* what we DO have is lots of registers 
  inc r9
  ci r9,32
  jne nextbyte
  
  li r3,>2000
  bl @genone
  
  mov @SAVE,r11
  B *R11
* set all four points on the curve, and load both tables
genOne
  li r1,512
  s r5,r1     * reflection offset
  mov r3,r6
  neg r6     * negative version
  mov r3,@sinetab(r5)
  mov r3,@sinetab+512(r1)
  mov r6,@sinetab+1024(r5)
  mov r6,@sinetab+1536(r1)
  
* make the *0.4 version (r3 is always positive here)
  li r6,>0ccd    * 0.4 in 3.13
  mov r3,r7
  mpy r6,r7
  srl r8,13    * shift fraction
  sla r7,3    * shift int
  soc r7,r8    * make 3.13
gdone
  mov r8,r6
  neg r6
  mov r8,@sinefour(r5)
  mov r8,@sinefour+512(r1)
  mov r6,@sinefour+1024(r5)
  mov r6,@sinefour+1536(r1)
  
  inct r5
  a r2,r3
*        ; Read bit, test if sum must be decreased
  sla r4,1
  jnc nodec
  dec r2
nodec
  B *R11
* bits for pixel
BITS
  DATA >8040,>2010,>0804,>0201
* registers for bitmap (and 5A00 is the address of the sprite table)
* background is transparent (the only color never redefined)
* PDT - >0000
* SIT - >1800
* SDT - >1800
* CT  - >2000
* SAL - >1B00
BMREGS  
  DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000
* data for sine generation
genTable
  data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221
        data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912
* BSS section
* spot to save return addresses
SAVE 
  bss 2
* spot to store the sine table (full 1024 entries)
sinetab 
  bss 2048
* sine divided by 4, to remove a multiply inline
sinefour
  bss 2048
* row table for hidden surface (one word per column)
oldY 
  bss 512
  END

The scratchpad version moves sqrt, plot, and the end of the inner loop into scratchpad RAM. This one completes in 17 seconds. (It kind of makes me wonder what a playground-like approach of swapping code in and out would be like - faster or slower?)

The overhead of copy loops makes it larger, 790 bytes.

Now, interestingly, I tried the suggestion of drawing to a CPU buffer first -- in both the scratchpad and non-scratchpad version, it saved only 1 second of runtime. (Probably because we control the number of pixels drawn so well). But, it freed up a bit more scratchpad space to not have the VDP access there, so I snuck a bit more of the inner loop into it. The net result is a bit under 16 seconds.

Files (and all three source) are on the attached disk: DMSCASM_O, DMSCSCRT_O and DMSCHIDE_O, program name START (disk tested).

HATASM.zip

Note with DMSCHIDE that the screen will display garbage for 16 seconds, then the graphic will appear all at once. But in my opinion, it feels like less of a wait when you get to watch it draw, so the /17/ seconds of DMSCSCRT is infinitely more interesting. And DMSCASM should be an easy port to the F18A GPU.

I've included the WinAsm99 source separately in the zip because it won't build with Editor/Assembler anyway.

Edited March 14, 2015 by Tursi

JamesD · March 14, 2015

Before you celebrate your win over Atari... isn't the Atari plotting for a width of 320 where you are plotting for a width of 256?
The Atari is having to draw more pixels.

*edit*
NIce work though!

Edited March 14, 2015 by JamesD

Tursi · March 14, 2015

Before you celebrate your win over Atari... isn't the Atari plotting for a width of 320 where you are plotting for a width of 256?

The Atari is having to draw more pixels.

*edit*

NIce work though!

yes, but they are running the same number of loops. (ie: there's no place where a loop is 64 steps smaller). The difference is made in the /step/ along the X axis, which is '1' on the Atari and 0.8 on the TI. In truth the TI is doing MORE work because of the scaling.

JamesD · March 14, 2015

yes, but they are running the same number of loops. (ie: there's no place where a loop is 64 steps smaller). The difference is made in the /step/ along the X axis, which is '1' on the Atari and 0.8 on the TI. In truth the TI is doing MORE work because of the scaling.

I haven't looked at the assembly but the BASIC code I posted in 186 doesn't step by 0.8.

*edit*

BTW, compare the image output by the assembly to that of the BASIC version. Then look at the last pic of the assembly output on the Atari.

Edited March 14, 2015 by JamesD

Tursi · March 14, 2015

I haven't looked at the assembly but the BASIC code I posted in 186 doesn't step by 0.8.

*edit*

BTW, compare the image output by the assembly to that of the BASIC version. Then look at the last pic of the assembly output on the Atari.

No, I think I'm pretty much done with it. I've invested literally days in drawing this silly image.

+Lee Stewart · March 14, 2015

I'm not yet done with it because I am learning a lot about fixed-point arithmetic and trying to wrap my head around the CORDIC algorithm. I want to implement a fixed-point, CORDIC algorithm for sine.

...lee

JamesD · March 14, 2015

No, I think I'm pretty much done with it. I've invested literally days in drawing this silly image.

I can understand that. Just porting the BASIC took more time than I should have spent on it.

bfollett · March 14, 2015

The Atari graphics chip steals cycles from the cpu and the speed penalty varies depending on the graphics mode being used. The 320 mode is the worst and I think you use about 30% of the CPU time. There is an option to have a narrow playfield on the Atari that would be 256 pixels instead of 320 and should speed up the program a bit. That would be a more fair comparison of speed if someone rewrote the Atari demo to match the TI resolution using the narrow playfield. The Atari also has the option to shut the display off completely if you didn't want to see the hat being drawn. You'd gain the full 30% speed increase that way. You just turn the screen on when it's done drawing.

Bob

Tursi · March 15, 2015

You guys diving in and saying "oh it's not fair", what the heck??

(Was trying to put a real-world example of the difference here, but I seem to have broken my test app It was just for example. Drawing the entire image costs the TI 1 second. Drawing 64 fewer columns, in those cases where it doesn't plot the pixel (it still calculates them), means a difference at absolute most of 0.25 seconds. Since I'm measuring by hand with a stopwatch, I can't even measure that.)

Saying that it's not fair because DMA steals cycles -- that's like saying it's not fair because the 6502 is clocked slower than the 9900. Well, I say it's not fair because the Atari has direct access to video RAM, and because the 9900 is crippled by 4 waits states on every memory access, and because our instructions are 16-bits wide instead of 8. It's a machine comparison - architecture is part of it.

But it's all a wash, isn't it? The code sizes are nearly identical. Furthermore, in the best case on both platforms (DMA off on the Atari, render to offscreen buffer on the TI), the runtime IS THE SAME. 16 seconds.

It IS a fair comparison, but even if it was not, it was just for fun. The winning code in both cases is by the same guy, after all! The 8-bit wars were thirty years ago.

Edited March 15, 2015 by Tursi

dmsc · March 15, 2015

Hi!,

Good to know that it worked, and relatively fast...

You guys diving in and saying "oh it's not fair", what the heck??

... snip ...

It IS a fair comparison, but even if it was not, it was just for fun. The winning code in both cases is by the same guy, after all! The 8-bit wars were thirty years ago.

I hereby declare the winner as my current notebook, that draws the complete 320x192 figure with the integer algorithm in 0.00067 seconds, so it's 25000 times faster than any of the other contenders

JamesD · March 15, 2015

Perhaps fair is the wrong terminology.
Since the image isn't the same it's a bit of an Apples and Oranges comparison though.

bfollett · March 15, 2015

Well if trying to draw to different resolutions screens doesn't make comparisons less valid/unfair, why not use the 64X48 mode on the TI. I bet that would shave some time off the results.

Bob

Tursi · March 15, 2015

Holy crap, you guys.

You weren't complaining when I was drawing only 40% of the image in TI BASIC.

+Ksarul · March 15, 2015

The key is the interval between X-axis iterations--.8 vs 1.0. Though the TI version used a smaller screen size to display the data, it performed a similar number of calculations overall to compute the data.

Willsy · March 15, 2015

The multipy by .8 in is an extra step that is required in the TI code that is not required in the Atari code. In addition the TI had the additional overhead of writing its video via a port, which is on the upper 8 bits of the data bus only. Additional overhead that the Atari version doesn't have.

Bitmap mode.

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members