Benchmarking Languages


Tursi

I guess it makes sense: putting the constants VDPWA and VDPWD into two registers and using register indirect mode in the loop saves some time, since the CPU no longer has to fetch an address word after each instruction (and step the PC past it)...

 

5 minutes 57 seconds (357 seconds), or 3.57 seconds for 100.

 

Still rounds up to 4.  I wonder if there's 8 seconds in there somewhere, so it rounds down :)

 

        AORG >A000
* assumes startup from Editor/Assembler

  DEF START
  REF VDPWA,VDPWD
  
* make it work as EA5 if desired
  B @START

START
  lwpi >8300
  li r1,>8320
  li r0,l140
! mov *R0+,*R1+
  ci R1,>8400
  jne -!

* call clear
  li r0,>0040     * write address >0000
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
  li r1,>2000
  li r2,768
lp1
  movb r1,@VDPWD
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>c181     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
* call sprite(#1,42,2,1,1)
  li r0,>0186     * vdp register 6 to >01 (sprite descriptor table to >0800)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA

  li r0,>0043     * write address >0300
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
 
  li r0,>002A     * 1,1 (minus 1) and 42
  movb r0,@VDPWD
  nop
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
  li r0,>01d0     * color 2 (-1) and list terminator
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
* cnt=100 (10000 passes are timed here; divide the result by 100)
  li r5,10000
  
  li r0,>0100     * write address >0301 (X pos)
  li r1,>0000     * write address >0300 (Y pos)
  li r6,>4300
  li r7,VDPWA
  li r8,VDPWD
  clr r4
  B @>8320
l140
  li r3,>0100

* for x=1 to 240 (minus 1 for asm)
xlp1

* call locate(#1,1,x)
  movb r0,*r7
  nop
  movb r6,*r7
  nop
  movb r3,*r8
  
* next x
  ai r3,>0100
  ci r3,>ef00
  jne xlp1

  movb r0,*r7
  nop
  movb r6,*r7
  nop
  movb r3,*r8
  jmp opt1
  
* for y=1 to 176
ylp1

* call locate(#1,y,240)
  movb r1,*r7
  nop
  movb r6,*r7
  nop
  movb r4,*r8
  
* next y
opt1
  ai r4,>0100
  ci r4,>af00
  jne ylp1

  movb r1,*r7
  nop
  movb r6,*r7
  nop
  movb r4,*r8
  jmp opt3
  
* for x=240 to 1 step -1
xlp2

* call locate(#1,176,x)
  movb r0,*r7
  nop
  movb r6,*r7
  nop
  movb r3,*r8
  
* next x
opt3
  ai r3,>ff00
  jne xlp2

  movb r0,*r7
  nop
  movb r6,*r7
  nop
  movb r3,*r8
  jmp opt2
  
* for y=176 to 1 step -1
  
ylp2
* call locate(#1,y,1)
  movb r1,*r7
  nop
  movb r6,*r7
  nop
  movb r4,*r8
  
* next y
opt2
  ai r4,>ff00
  jne ylp2

  movb r1,*r7
  nop
  movb r6,*r7
  nop
  movb r4,*r8
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  clr  @>83C4
  blwp @>0000
  
  end
  

 

Edited by JasonACT
Add the code

1 hour ago, Asmusr said:

I think it will work without the NOPs.

:) I had forgotten to add the NOP in an earlier version, where I now have a JMP opt3 (write data, then moving on to write address) and I don't think that made much difference in the sprite pattern you see on real iron.  Some might actually be safe to remove.  BITD I had so many switches on my console, 16bit/8bit internal memory expansion, GROMs frequently used, 8KB DSR RAM.. the list goes on.. there were at least 8 switches.  I have a feeling one was a 4MHz override, but my memory is now cloudy because I know for sure I stuck a 16MHz '68000 in my Amiga with a switch a few years later.

 

I do recall having to patch the TI Extended Basic ROM (in my home made GRAM/RAM device) for either GROM or VRAM access it was doing, which was flaky, when the ROM was on the 16bit bus.  I can't say if it was due to that and a 4MHz upgrade though now.

 

I've promised myself there will be no switches added to my current TI-99/4A.


2 hours ago, Asmusr said:

I think it will work without the NOPs.

Mine had no NOPs and it worked fine on real iron. And I even removed the byte swapping when setting the VDP address.

I think there might be an issue if the whole program ran on 16 bit memory. ??


2 hours ago, TheBF said:

Mine had no NOPs and it worked fine on real iron. And I even removed the byte swapping when setting the VDP address.

I think there might be an issue if the whole program ran on 16 bit memory. ??

AFAIK, the only way to overrun the VDP on the TI (including 16 bit memory) is to read immediately after setting a read address. I have never seen any issues with writing.
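For reference, here is a small Python sketch of how the two bytes sent to VDPWA are formed on the 9918A: the low address byte goes first, then the high byte with >40 set for a write (clear for a read), and a register write sends the value first and then >80 plus the register number. The function names are made up for illustration; the checks at the end use values from the listing earlier in the thread.

```python
def vdp_write_address_bytes(addr):
    """Bytes to send to VDPWA, in order, to set up a VRAM write at addr."""
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("VRAM address must fit in 14 bits")
    return addr & 0xFF, (addr >> 8) | 0x40   # low byte, then high byte + write flag


def vdp_read_address_bytes(addr):
    """Bytes to send to VDPWA, in order, to set up a VRAM read at addr."""
    if not 0 <= addr <= 0x3FFF:
        raise ValueError("VRAM address must fit in 14 bits")
    return addr & 0xFF, addr >> 8            # same layout, write flag clear


def vdp_register_bytes(reg, value):
    """Bytes to send to VDPWA, in order, to load a VDP register."""
    if not 0 <= reg <= 7:
        raise ValueError("the 9918A has registers 0-7")
    return value & 0xFF, 0x80 | reg          # value first, then >80 + register


# Values taken from the listing: X position of sprite #1 lives at >0301.
low, high = vdp_write_address_bytes(0x0301)
print(f">{low:02X} >{high:02X}")              # matches r0 = >0100 and r6 = >4300
print([f">{b:02X}" for b in vdp_register_bytes(1, 0xC1)])  # matches li r0,>c181
```

Only the read form forces the VDP to fetch from VRAM straight away, which fits the observation that a read right after setting a read address is the risky case.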


I don't know exactly what it's doing when it happens, but if you play the game Tennis on a console with 16-bit wide RAM all over, then it will not work properly.

Switch to 8-bit and it does.

The issue is that the players split in two, with the legs running one way and the torso another. I could very much understand that if a write to the VDP was messed up, but I find it harder to understand how a failed read would do the same damage. But I don't know how the program handles VDP RAM, so I don't really know.


16 hours ago, Asmusr said:

AFAIK, the only way to overrun the VDP on the TI (including 16 bit memory) is to read immediately after setting a read address. I have never seen any issues with writing.

The sprite pattern without the NOPs does seem to be very stable. I can also see the console ROM sometimes sets the address with 2 sequential instructions, so I'm sold on the idea.

 

5 minutes and 1 second (301 seconds) or 3.01 seconds for 100, and I don't see any need to round up :) 

 

        AORG >A000
* assumes startup from Editor/Assembler

  DEF START
  REF VDPWA,VDPWD
  
* make it work as EA5 if desired
  B @START

START
  lwpi >8300
  li r1,>8320
  li r0,l140
! mov *R0+,*R1+
  ci R1,>8400
  jne -!

* call clear
  li r0,>0040     * write address >0000
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
  li r1,>2000
  li r2,768
lp1
  movb r1,@VDPWD
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>c181     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
* call sprite(#1,42,2,1,1)
  li r0,>0186     * vdp register 6 to >01 (sprite descriptor table to >0800)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA

  li r0,>0043     * write address >0300
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
 
  li r0,>002A     * 1,1 (minus 1) and 42
  movb r0,@VDPWD
  nop
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
  li r0,>01d0     * color 2 (-1) and list terminator
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
* cnt=100 (10000 passes are timed here; divide the result by 100)
  li r5,10000
  
  li r0,>0100     * write address >0301 (X pos)
  li r1,>0000     * write address >0300 (Y pos)
  li r6,>4300
  li r7,VDPWA
  li r8,VDPWD
  li r9,>ef00
  li r10,>ff00
  li r11,>af00
  clr r4
  B @>8320
l140
  mov r0,r3

* for x=1 to 240 (minus 1 for asm)
xlp1

* call locate(#1,1,x)
  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  
* next x
  a r0,r3
  c r3,r9
  jne xlp1

  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  jmp opt1
  
* for y=1 to 176
ylp1

* call locate(#1,y,240)
  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  
* next y
opt1
  a r0,r4
  c r4,r11
  jne ylp1

  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  jmp opt3
  
* for x=240 to 1 step -1
xlp2

* call locate(#1,176,x)
  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  
* next x
opt3
  a r10,r3
  jne xlp2

  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  jmp opt2
  
* for y=176 to 1 step -1
  
ylp2
* call locate(#1,y,1)
  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  
* next y
opt2
  a r10,r4
  jne ylp2

  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  clr  @>83C4
  blwp @>0000
  
  end
  

 

Edited by JasonACT
Actual code tested here had some more changes (R9, R10, R11) but didn't yield a benefit, best to post it though.

12 hours ago, apersson850 said:

But I don't know how the program handles VDP RAM, so I don't really know.

If the files are mtennisc.bin and mtennisd.bin, try editing the mtennisc.bin:

 

File position 0x1eae=

02 01 83 84 d8 00 8c 02

to be

d8 00 8c 02 02 01 83 84

 

This change moves the "set address high byte delay instruction" (which we now know isn't time sensitive) to be in a position to instead delay the 1st data read instruction.

 

It's the only thing I can see in the disassembly, all other reads & writes do it by the book.
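A minimal Python sketch of that edit, assuming the file really is the mtennisc.bin described above and that the eight bytes at 0x1eae match the listed values. The function name is hypothetical, and the instruction mnemonics in the comments are my own reading of the opcodes. It refuses to touch the image if the bytes differ.

```python
OFFSET = 0x1EAE
# As I read the opcodes: LI R1,>8384 followed by MOVB R0,@>8C02
OLD = bytes.fromhex("02018384d8008c02")
# The same two instructions in the opposite order
NEW = bytes.fromhex("d8008c0202018384")


def swap_tennis_read_delay(image: bytes) -> bytes:
    """Return a copy of image with the two instructions at OFFSET swapped."""
    found = image[OFFSET:OFFSET + len(OLD)]
    if found != OLD:
        raise ValueError(f"unexpected bytes at {OFFSET:#x}: {found.hex()}")
    # Same length in and out, so nothing after the patch moves.
    return image[:OFFSET] + NEW + image[OFFSET + len(OLD):]
```

To use it, read the file in binary mode, pass the bytes through the function, and write the result to a new file so the original stays untouched.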


That could be interesting, just to see. Not useful, though, since the game then looks more like table tennis and is virtually unplayable for people. But if you watch the computer play with itself, you get a fast match.

I can do that, since I also have stuff in my computer which, when enabled, traps the sensitive read and delays it in hardware. If I enable that, Tennis runs correctly, but pingpingpingping instead of pjoff - pjoff - pjoff - pjoff.

Edited by apersson850

I made a program to test VDP overruns using writes. It copies some code to scratch pad that sets up a VDP write address in just two instructions, and then it writes 4 bytes to the VDP. This continues until the screen is full and then it loops. The result looks like this:

 

[Screenshot: js99er-20230930114300.png]

This image is from an emulator, but I ran the tests on an unmodified console with a 9918A VDP. If the VDP is overrun we would expect the picture to flicker.

 

I made 4 variations of the code, and the cartridge image for each test is attached.

 

1. The first variation writes the 4 bytes using MOVB, and this doesn't seem to overrun the VDP.

 

*      Attempting to overrun the VDP by writing

       aorg >6000

vdpwd:                                 ; VDP write data
       equ  >8c00
vdpwa:                                 ; VDP set read/write address
       equ  >8c02
pad_code:
       equ  >8320

*      Cartridge header
       byte >AA,1,1,0
       data 0
       data program
       data 0,0
program:
       data 0
       data start
       byte 13
       text 'VDP OVERRUN 1'

*      Main code
start:
       limi 0
       lwpi >8300
*      Copy code to scratch pad
       li   r0,code
       li   r1,pad_code
       li   r2,code_end-code
copy_loop:
       mov  *r0+,*r1+
       dect r2
       jne  copy_loop
*      Execute the code
       b    @pad_code

*      This code runs from scratch pad
code:
       clr  r0                         ; Low byte of VDP write address
       li   r1,>4000                   ; High byte of VDP write address
       li   r2,>2000                   ; Byte to write (space character)
       li   r3,>2100                   ; Another byte to write (exclamation mark)
       li   r4,>300/4                  ; Number of loops
       li   r5,vdpwa                   ; Cache for faster access
       li   r6,vdpwd                   ; Cache for faster access
code_loop:
       movb r0,*r5                     ; Write low byte of VDP write address
       movb r1,*r5                     ; Write high byte of VDP write address
       movb r2,*r6                     ; (26) Write byte
       movb r3,*r6                     ; (26) Write other byte
       movb r2,*r6                     ; (26) Write byte
       movb r3,*r6                     ; (26) Write other byte
       ai   r0,>0400                   ; Increment write address low byte
       jnc  code_1                     ; Skip incrementing high byte if low byte hasn't wrapped
       ai   r1,>0100                   ; Increment write address high byte
code_1:
       ai   r2,>0100                   ; Increment byte to write
       ci   r2,>6000                   ; Did we reach the empty characters?
       jlt  code_2                     ; Skip ahead if not
       li   r2,>2000                   ; Reset to space
code_2:
       dec  r4                         ; Decrement loop counter
       jne  code_loop                  ; Inner loop
       jmp  code                       ; Repeat forever

code_end:
       equ  $

       end start

 

2. For the next variation I replaced the first write with a CLR instruction. This doesn't seem to overrun either.

 

       clr  *r6                        ; (22) Clear byte
       movb r2,*r6                     ; (26) Write byte
       movb r3,*r6                     ; (26) Write other byte
       movb r2,*r6                     ; (26) Write byte

 

3. I then tried to move the CLR so it wasn't the first write. Now the first character of each column is flickering, indicating that 22 clock cycles is too fast for the previous write to finish. So using CLR from scratch pad, it is possible to overrun the VDP.

 

       movb r2,*r6                     ; (26) Write byte
       clr  *r6                        ; (22) Clear byte
       movb r2,*r6                     ; (26) Write byte
       movb r3,*r6                     ; (26) Write other byte

 

4. Since I thought it would come up, I also tried moving the workspace to VDPWD and writing with LI. Now the first 3 characters of each column are flickering, which is not surprising since LI only takes 16 clock cycles.

 

       lwpi vdpwd
       li   r0,>2100                   ; (16) Write byte
       li   r0,>2200                   ; (16) Write byte
       li   r0,>2300                   ; (16) Write byte
       li   r0,>2400                   ; (16) Write byte
       lwpi >8300

 

My conclusion is that it is possible to overrun the VDP when writing data using certain instructions executed from scratch pad, but it's not possible using MOVB instructions. There doesn't seem to be any need to use delays when you set up a write address, even with the fastest possible code.

 

What then if you clear the screen using an unrolled loop of CLR instructions from scratch pad? I have used that several times, and it seems to work fine. I think it's because as long as you write the same byte every time, there isn't any problem.

 

P.S. For consoles that have been modified to have zero wait states on writes to VDP, the timing will be different, but that doesn't really interest me.

 

overrun1.bin overrun2.bin overrun3.bin overrun4.bin

Edited by Asmusr

Regarding VDP overruns, I thought about the StrangeCart, which AFAIK uses LI instructions to write to the VDP. However, those instructions are not from scratch pad but from cartridge space, which means they take 24 cycles instead of 16  (one wait state to read the instruction and another to read the argument). I assume that the StrangeCart works on an unmodified console, so the threshold for writing to the VDP is between 24 (LI working) and 22 (CLR not working) cycles. 


My last go at this, 4 minutes and 45 seconds (285 seconds) or 2.85 seconds for 100.

 

        AORG >A000
* assumes startup from Editor/Assembler

  DEF START
  REF VDPWA,VDPWD
  
* make it work as EA5 if desired
  B @START

START
  lwpi >8300
  li r1,>8320
  li r0,l140
! mov *R0+,*R1+
  ci R1,>8400
  jne -!

* call clear
  li r0,>0040     * write address >0000
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
  li r1,>2000
  li r2,768
lp1
  movb r1,@VDPWD
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>c181     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
* call sprite(#1,42,2,1,1)
  li r0,>0186     * vdp register 6 to >01 (sprite descriptor table to >0800)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA

  li r0,>0043     * write address >0300
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
 
  li r0,>002A     * 1,1 (minus 1) and 42
  movb r0,@VDPWD
  nop
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
  li r0,>01d0     * color 2 (-1) and list terminator
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
* cnt=100 (10000 passes are timed here; divide the result by 100)
  li r5,10000
  
  li r0,>0100     * write address >0301 (X pos)
  li r1,>0000     * write address >0300 (Y pos)
  li r6,>4300
  li r7,VDPWA
  li r8,VDPWD
  li r9,>ef00
  li r10,>ff00
  li r11,>af00
  li r12,>ee00
  li r13,>ae00
  B @>8320
l140
  mov r0,r3

* for x=1 to 240 (minus 1 for asm)
xlp1

* call locate(#1,1,x)
  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  a r0,r3
  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  
* next x
  a r0,r3
  c r3,r9
  jne xlp1

  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  mov r0,r4
  
* for y=1 to 176
ylp1

* call locate(#1,y,240)
  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  a r0,r4
  movb r1,*r7
  movb r6,*r7
  movb r4,*r8

* next y
  a r0,r4
  c r4,r11
  jne ylp1

  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  mov r12,r3
  
* for x=240 to 1 step -1
xlp2

* call locate(#1,176,x)
  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  
* next x
  a r10,r3
  jne xlp2

  movb r0,*r7
  movb r6,*r7
  movb r3,*r8
  mov r13,r4
  
* for y=176 to 1 step -1
  
ylp2
* call locate(#1,y,1)
  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  
* next y
  a r10,r4
  jne ylp2

  movb r1,*r7
  movb r6,*r7
  movb r4,*r8
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  clr  @>83C4
  blwp @>0000
  
  end
  

 

Again, there are more optimisations that didn't do much; unrolling the first 2 incrementing loops (only comparing the odd values) gave the quicker result.

 

Still rounds up to 3 :) 
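For anyone comparing the three versions, here is the conversion as a quick Python sketch. The only assumption is the one stated in the posts: 10000 passes are timed and the result is scaled back to the benchmark's 100. The labels are mine.

```python
LOOPS_TIMED = 10_000       # passes actually run by the assembly listings
LOOPS_IN_BENCHMARK = 100   # passes the benchmark asks for

# Stopwatch totals in seconds, as reported in this thread
runs = {
    "register indirect, with NOPs": 357,
    "NOPs removed": 301,
    "first two loops unrolled": 285,
}


def seconds_per_benchmark(total_seconds):
    """Scale a 10000-pass stopwatch time down to the 100-pass benchmark."""
    return total_seconds * LOOPS_IN_BENCHMARK / LOOPS_TIMED


for label, total in runs.items():
    t = seconds_per_benchmark(total)
    print(f"{label:30s} {t:.2f} s, rounds to {round(t)}")
```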

Edited by JasonACT

On 9/30/2023 at 4:13 AM, Asmusr said:

My conclusion is that it is possible to overrun the VDP when writing data using certain instructions executed from scratch pad, but it's not possible using MOVB instructions. There doesn't seem to be any need to use delays when you set up a write address, even with the fastest possible code.

 

Nice test! But yes, my assertion, frequently abbreviated, is "without contrived code cases in scratchpad" it's not possible to overrun the VDP. The test with CLR is neat to see, I didn't expect it.

 

You can't get an overrun on the VDP unless it needs to access VRAM - that's why having the CLR first didn't cause any issues. Setting the VDP address in 'write mode' doesn't need to access VRAM, so the next access doesn't have any restrictions because you're still just talking to VDP registers. The cause of the overrun is simply overwriting the VDP CPU buffer register before a memory access is completed (or reading it too early). I did a lot of testing on the ColecoVision since it's fast enough to overrun the VDP in many cases, and I'm very sure of that conclusion.

 

So if you think carefully about when a VRAM access is necessary, that informs where the timing matters. The datasheet doesn't explicitly say it, but seems to back it up since it only goes into detail about delays where VRAM is involved (talking about CPU access cycles).

 

It's also worth remembering that that timing window is shorter in text mode (since sprites are disabled), and it's nearly 0 (2us, IIRC) during blank or when the screen is blanked deliberately. Though, that matters less on the 4A since you have to work at overrun... but for a fast clear it looks like it might help!

 

On 9/30/2023 at 3:38 PM, Asmusr said:

Regarding VDP overruns, I thought about the StrangeCart, which AFAIK uses LI instructions to write to the VDP. However, those instructions are not from scratch pad but from cartridge space, which means they take 24 cycles instead of 16  (one wait state to read the instruction and another to read the argument). I assume that the StrangeCart works on an unmodified console, so the threshold for writing to the VDP is between 24 (LI working) and 22 (CLR not working) cycles. 

Yeah, I remember that threw me since it should be right on the edge... 24 cycles is 8us. The 9918 datasheet tells us that the worst case access is 7.95us. I usually remember it as 8us, but seems like that 0.05us makes enough difference here! Taking in the CPU clock's 5% tolerance, it suggests that it is possible for a machine to execute the LI in 7.6us, but I guess till we see a machine fail... It's also possible/probable that there is some engineer slack in the datasheet numbers!

 

22 cycles is 7.3us, so that's much shorter (relatively). Every cycle is 1/3 of a microsecond.
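The cycle counts from the tests can be lined up against that datasheet figure with a few lines of Python. This is only a sketch: it assumes the nominal 3 MHz clock (1/3 us per cycle, as above) and treats 7.95 us as a hard threshold, which real hardware may not do.

```python
CPU_HZ = 3_000_000          # nominal clock: one cycle is 1/3 of a microsecond
VDP_WORST_CASE_US = 7.95    # worst-case access time quoted from the datasheet


def cycles_to_us(cycles, clock_hz=CPU_HZ):
    """Convert CPU clock cycles to microseconds."""
    return cycles * 1_000_000 / clock_hz


# Instruction timings as given in the test posts above
timings = [
    ("LI from scratch pad", 16),
    ("CLR *R6 from scratch pad", 22),
    ("LI from cartridge space", 24),
    ("MOVB Rx,*R6 from scratch pad", 26),
]

for name, cycles in timings:
    us = cycles_to_us(cycles)
    verdict = "too fast" if us < VDP_WORST_CASE_US else "slow enough"
    print(f"{name:30s} {cycles:2d} cycles = {us:5.2f} us ({verdict})")

# A clock running 5% fast shortens the 24-cycle LI to about 7.6 us.
fast_li = cycles_to_us(24, CPU_HZ * 1.05)
print(f"24 cycles at +5% clock = {fast_li:.2f} us")
```

The verdicts line up with what was seen on real iron: 16 and 22 cycles flicker, 24 and 26 do not.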

 


On 9/30/2023 at 4:13 AM, Asmusr said:

What then if you clear the screen using an unrolled loop of CLR instructions from scratch pad? I have used that several times, and it seems to work fine. I think it's because as long as you write the same byte every time, there isn't any problem.

That's an interesting observation... I think you're right that the same byte every time might largely work because of it being the same value, since the VDP can process the request while the CPU is prepping the next write... though I would expect that if you wrote enough of them in a row, you would eventually lose an address increment. If that theory makes any sense, though, then that would suggest 8/0.66 which is only 12 instructions until it was possible to get two instructions in the same delay time. (If that happens, then the address will only increment once since the internal flag is only checked during the access cycle). Was your unrolled loop that big?

 

Although again, if there is engineer slack in there, then the delta could be much smaller than 0.66us. If it was closer to 0.33us then we'd need 24 CLR instructions to lose one. It might be interesting to determine what the delta is - we could use that to determine how much slack really exists. ;)
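The back-of-envelope numbers above work out like this in Python. It is a sketch of the untested theory only: each write arrives `delta` microseconds earlier than the VDP can absorb it, and once the accumulated shortfall reaches one full access window, two writes land in the same window. The function name is made up for illustration.

```python
from fractions import Fraction
from math import ceil


def writes_until_slip(window_us, delta_us):
    """Back-to-back writes needed before the shortfall adds up to one window."""
    if delta_us <= 0:
        raise ValueError("no shortfall, so the writes never catch up")
    return ceil(Fraction(window_us) / Fraction(delta_us))


WINDOW_US = 8  # access window, rounded as in the post

# CLR at 22 cycles is 2 cycles (2/3 us) short of 24 cycles (8 us)
print(writes_until_slip(WINDOW_US, Fraction(2, 3)))   # the 8/0.66 case

# With slack in the datasheet figure the shortfall might be only 1/3 us
print(writes_until_slip(WINDOW_US, Fraction(1, 3)))   # the 8/0.33 case
```

Using exact fractions avoids the rounding noise that 0.66 and 0.33 would introduce.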

 


9 hours ago, sometimes99er said:

Interesting stuff. Well, actually not exactly the benchmarking of languages, but the VDP access. Always nice to know, how to move fast on a good old plain console. 😉
 

Well this shows GPL vs Assembly in console only as this runs from Console.


I stumbled upon this last night...

 

http://ftp.whtech.com/user groups/Hunter Valley/

 

88'04 I was mentioned as demonstrating 16 bit 32KB RAM to a visitor I was introduced to, but had never met before.

88'06 I had advertised my whole collection, and it immediately sold (with an agreement to me writing 5 pages of doco on how to operate it all). (ACT - Canberra, Australia.)

 

The Foundation 128KB card, as I recall, had the 24KB space paged via CRUs...  I adapted the TI disk DSR ROM to do sector access to that RAM, and the rest "just worked"...

 

I'll never ask a magician how his tricks work...  I'll just practice.

 

Damn, AU$700 including postage, I wish, now...


On 10/5/2023 at 2:21 PM, JasonACT said:

Did anyone here have 16 bit 32KB ram before 88'04?  Curious?  There's some threads here I've now read that are not quite so positive?

I don't know the date exactly (didn't write it on the schematics, which I still have), but I had 16 bit RAM in the console in 1987. Not 32 KB though, but 64. It's still there, it still works.

That was my own design, though. I learned that somebody had managed to put it inside the console, so I thought that if they can, then I can give it a go too.
