4 minutes ago, senior_falcon said:

Getting back to the speed tests, I would be very interested in seeing the results of this program running in XB from VDP and using the 32K expansion.

10 N=N+1
20 CALL KEY(0,K,S)::CALL GCHAR(1,1,G)::CALL SOUND(10,110,10)::CALL SCREEN(3)
30 GOTO 10

I don't trust any timing results I get from Classic99 on my computer. (A short test program looping 1000 times took 20 seconds, then 27, then 25 and after opening Classic99 2 more times it took 33 seconds)

I tried JS99er and for a 1 minute run found that the program ran just over 1% faster running from 32K vs VDP. But I don't know how accurate the timing is on JS99er.

Of course, you could use a for next loop if that worked better for you.

I predict this program will have the same execution time with both memory types.

... Forever.  ;-) 

5 minutes ago, mizapf said:

I got 106 seconds with 32K and 107 seconds with VDP only (current MAME, 50 Hz).

107.64 seconds on real iron. Console only.  XB Cartridge.

 

Last line changed to 

30 IF N<1000 THEN 10 

 

 

5 minutes ago, TheBF said:

107.64 seconds on real iron. Console only.  XB Cartridge.

 

Last line changed to 


30 IF N<1000 THEN 10 

 

105.88 seconds with PEB, XB cartridge.

 

(I have a screen shot from my phone but it's HEIC format and I don't have a conversion utility)

 

 

 

10 minutes ago, mizapf said:

The CALL SOUND is a blocking call of at least 10 ms (duration), so it dominates the runtime.

I may be ignorant here, but each loop is about 105 ms, so I don't see why that would be a problem.

You could take out the CALL SOUND and put in CALL HCHAR(1,1,42)::CALL VCHAR(1,1,42) and see what happens.
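The per-pass figure can be checked against the two real-hardware totals reported earlier in the thread. This is only a rough sketch in Python, not something anyone here ran. It assumes both totals are for the 1000-pass version (line 30 changed to IF N<1000 THEN 10) and that the PEB run had the 32K expansion active. The function names are made up for illustration.

```python
# Totals in seconds as reported earlier in the thread (assumed: 1000 passes each).
CONSOLE_ONLY_S = 107.64   # real hardware, console only, XB cartridge
WITH_PEB_S = 105.88       # real hardware with PEB, XB cartridge
PASSES = 1000             # from the modified line 30: IF N<1000 THEN 10


def per_pass_ms(total_seconds, passes=PASSES):
    """Average time for one pass through the loop, in milliseconds."""
    return total_seconds * 1000 / passes


def percent_faster(slow_seconds, fast_seconds):
    """How much shorter the fast run is, as a percentage of the slow run."""
    return 100 * (slow_seconds - fast_seconds) / slow_seconds


print(f"console only: {per_pass_ms(CONSOLE_ONLY_S):.2f} ms per pass")
print(f"with PEB:     {per_pass_ms(WITH_PEB_S):.2f} ms per pass")
print(f"difference:   {percent_faster(CONSOLE_ONLY_S, WITH_PEB_S):.1f}%")
```

Under those assumptions each pass takes a bit over 100 ms, which is well above the 10 ms sound duration.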

1 hour ago, mizapf said:

The CALL SOUND is a blocking call of at least 10 ms (duration), so it dominates the runtime.

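That's true. The duration should be changed to -10, since a negative duration makes the sound start immediately instead of waiting for the previous one to finish.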

But there is nothing more to see here IMHO.

1 hour ago, senior_falcon said:

I may be ignorant here, but each loop is about 105 ms, so I don't see why that would be a problem.

You could take out the CALL SOUND and put in CALL HCHAR(1,1,42)::CALL VCHAR(1,1,42) and see what happens.

If you want to measure memory performance, this CALL SOUND instruction is a rather big chunk of your total, and it does not depend on the RAM type. I'd rather increase the loop count and leave out this constant term.

23 hours ago, TheBF said:

In the test above the array was all in VDP ram. That was the only variable that is used when the program runs.

All this shows that the big difference between different memory types doesn't matter much in interpreted programs.

 

Wow, you just stated the exact opposite, unless a 5% difference in performance on a computer counts as nothing?

Tell you what: tell computer geeks this and see if they agree.

5% in games is obvious to see, 5% in applications can be seen, and 5% has never been 0%.

25 minutes ago, mizapf said:

If you want to measure memory performance, this CALL SOUND instruction is a rather big chunk of your total, and it does not depend on the RAM type. I'd rather increase the loop count and leave out this constant term.

I don't see how this is any different than the others. They all depend on the RAM type. The program is either stored in VDP or CPU RAM. The token for CALL is read from ram/vram, then the name (KEY, GCHAR, SOUND, SCREEN) is read from ram/vram. The routine starts, then other tokens and the values are read from ram/vram and acted on as necessary. With SOUND, the sound generators are set up for the interrupt routine to handle, then on to the next instruction until the loop comes back to SOUND. At that point (>100 ms later) the 10 ms CALL SOUND should be done and so it starts up again.

 

The real question is "how much does a typical XB program benefit from expansion RAM?" A program with nothing but arithmetic operations would not be a typical program. A program with nothing but CALLs would not be a typical program. Most programs are a mixture of these. You could try to predict how much the speed might increase, but the actual change would depend on the program.

Edited by senior_falcon
22 hours ago, Reciprocating Bill said:

The Extended BASIC manual, page 169 (on the Size command):

 

"If the Memory Expansion is attached, the space available in the stack is the amount of space left after the space taken up by string values, information about variables, and the like is subtracted. Program space is the amount of space left after the space taken up by the program and the values of numeric variables is subtracted."

I am a GPL programmer who has worked on the XB source code in GPL and Assembly for 30 years now, and you are going to quote the XB manual, which has many errors in it?

I have fixed many of those errors in RXB for a reason: they were errors or bad programming ideas.

A perfect example: in XB, CALL LOAD will not work unless you use CALL INIT first. This is beyond stupid, as it blocks the use of CALL LOAD from the console.

It is also stupid because it forces you to use another routine, CALL INIT, when it neither matters nor is useful.

22 hours ago, senior_falcon said:

In post #74 Rasmus asked if these were the facts:

  • Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up
  • If you need to set up the read address first, reading from GROM/VDP is a lot slower than reading from CPU memory
  • It's about 5% faster to execute an XB program from 32K RAM instead of from VDP RAM only
  • An XB routine programmed in assembly is usually much faster than a similar routine written in GPL
  • Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory
  • Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM

Do you agree that this is accurate? If not then please indicate which of these points you feel is inaccurate.

 

 

"Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up"

Most games in XB do not use sequential VDP writes or reads, because sprite and character movements mostly require the read address to be set up each time.

 

"An XB routine programmed in assembly is usually much faster than a similar routine written in GPL"

XB is GPL programming. If you change a routine to Assembly instead, it either costs memory or has to be compiled, and neither of these will work on the bare console.

Unless you do it in ROM instead, which is how I have approached the issue.

 

"Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory"

Four CPU cycles repeated hundreds of times add up pretty fast: run a simple loop 100 times and 4 cycles per pass becomes 400, versus the previous 100.

That is a huge difference in many instances, especially in XB games, where it could be easy to see.

 

"Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM"

As I am not attacking the problem with hardware but with programming, my choices are limited, but the results are way better than pure GPL.

After all, ROM is faster than GPL, and the assembly routines in ROM in RXB, such as CALL HCHAR or CALL HPUT, show that the effect is more than a slight increase in speed.

Over time I just need to put more XML routines in place of GPL routines to speed it up; it is a long, laborious process.

 

3 hours ago, TheBF said:

I predict this program will have the same execution time with both memory types.

... Forever.  ;-) 

What kind of computer are you running with Classic99?

I have never had a difference once.

 

I am running a Windows 10 PC with an Asus motherboard, 32 GB of 3900 MHz RAM, a 2 TB M.2 Rocket as the main OS drive, an RTX 2070 Super video card and an AMD 3900 12-core CPU at 4.43 GHz.

 

29 minutes ago, senior_falcon said:

I don't see how this is any different than the others. They all depend on the RAM type. The program is either stored in VDP or CPU RAM. The token for CALL is read from ram/vram, then the name (KEY, GCHAR, SOUND, SCREEN) is read from ram/vram. The routine starts, then other tokens and the values are read from ram/vram and acted on as necessary. With SOUND, the sound generators are set up for the interrupt routine to handle, then on to the next instruction until the loop comes back to SOUND. At that point (>100 ms later) the 10 ms CALL SOUND should be done and so it starts up again.

 

The real question is "how much does a typical XB program benefit from expansion RAM?" A program with nothing but arithmetic operations would not be a typical program. A program with nothing but CALLs would not be a typical program. Most programs are a mixture of these. You could try to predict how much the speed might increase, but the actual change would depend on the program.

Yes exactly!

This is why you do a spreadsheet for each command and time each of them in all types of memory to produce a chart.

That will 100% show what is going on, without nitpicking it to death for what someone wants to see. Instead you get pure data that cannot be disputed.

53 minutes ago, RXB said:

Wow, you just stated the exact opposite, unless a 5% difference in performance on a computer counts as nothing?

Tell you what: tell computer geeks this and see if they agree.

5% in games is obvious to see, 5% in applications can be seen, and 5% has never been 0%.

IMHO 5% is not nothing, but very close to it. :)

 

I want optimizations that move the needle more than that if I have to do much work to make them happen.

For example the inline optimizer on my system doubles the speed of code fragments and on number crunching programs it nets you +40% or so typically.

 


The difference between individual consoles can be as much as 10% - the clock is rated for +/-5%, and I've measured consoles at both ends of that.

 

GROM reads are roughly 4 times slower than 8-bit CPU RAM bytes (about 25 cycles versus 6 cycles) -- but they make up such a small percentage of the memory cycles that it's usually lost in the noise. Even a tight loop copying data from GROM to RAM will spend far more time on the CPU than the GROM hold. Consider this loop in scratchpad, copying TO scratchpad:

LP   MOVB *R1,*R2+   ; copy one byte from the source (e.g. GRMRD) to the scratchpad target
     DEC  R3         ; count down
     JNE  LP         ; loop until R3 reaches zero

Ignoring the actual wait states for a moment, the movb takes 14+4+6=24 cycles, the dec takes 10, and the jump (when taken) 10 more. So every byte has an overhead of 44 cycles.

If R1 points to zero wait-state memory, then it's a total of 44 cycles per byte moved. It's 100% overhead.
If R1 points to 8-bit memory, that adds 4 cycles for a total of 48. It's 92% overhead.
If R1 points to VDP memory, that still adds only 4 cycles, still 48. Still 92% overhead.
If R1 points to GROM, that adds roughly 23 cycles (it can be +/- 1 depending on clock sync). Total is 67 cycles, 66% overhead.

So in the tightest possible copy loop, 66% of the time is still spent NOT accessing GROM. Slow down the loop in any manner, and the overhead grows. This is why the slow GROM is not the reason that the GPL interpreter seems to chug along - it's touching the GROM so rarely that it doesn't really matter. The GPL interpreter was written for size, not speed, and spends a lot of CPU time jumping around. (Incidentally, MOV instead of MOVB would halve the overhead if it was possible, AND halve the number of loops needed! This is also why unrolling helps - you are reducing that overhead by doing the DEC and JNE less often. ;) )
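The overhead percentages above can be reproduced with a few lines of Python. This is just an arithmetic check using the cycle counts given in the post. The table layout and function name are illustrative, not from any real tool.

```python
# Cycle counts taken from the post above; the names here are illustrative.
LOOP_CYCLES = 24 + 10 + 10  # MOVB *R1,*R2+ (14+4+6), DEC, JNE when taken

# Extra cycles added by the source read, per byte, for each memory type.
EXTRA_READ_CYCLES = {
    "zero wait-state memory": 0,
    "8-bit memory": 4,
    "VDP memory": 4,
    "GROM": 23,  # roughly; can be +/- 1 depending on clock sync
}


def overhead_percent(extra_cycles):
    """Share of the per-byte time that is NOT spent waiting on the source read."""
    total = LOOP_CYCLES + extra_cycles
    return round(100 * LOOP_CYCLES / total)


for source, extra in EXTRA_READ_CYCLES.items():
    total = LOOP_CYCLES + extra
    print(f"{source}: {total} cycles per byte, {overhead_percent(extra)}% overhead")
```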

As for optimizations. It's a numbers game - PLURAL. One 5% optimization is nothing, and probably not worth it. But five of them are something really special. ;)



 

8 hours ago, RXB said:

Yes exactly!

This is why you do a spreadsheet for each command and time each of them in all types of memory to produce a chart.

That will 100% show what is going on, without nitpicking it to death for what someone wants to see. Instead you get pure data that cannot be disputed.

That would be a colossal waste of time. We already know that going from VDP to CPU RAM offers a modest increase in speed. Some tests show an average of 5% faster, but none of those use CALLs. Tests doing nothing but CALLs show an increase of 1% or slightly more. So a real program might run 3% faster, give or take. What you propose would take centuries, and to what end? How is it useful to know in advance exactly how much faster the program would run? You may differ with me and think this is a golden opportunity to gain fame and notoriety, but for me, as Bert Williams sang, "It’s a wonderful chance for somebody, but it’s got to be somebody else, not me."
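The "3% give or take" estimate reads like a blend of the two measured extremes. A small Python sketch shows how such a blend could be computed. It assumes a simple linear mix weighted by the share of run time spent in CALLs, which is an assumption and not something measured in this thread.

```python
# Figures quoted in the post above; the mix of CALL time is purely an assumption.
GAIN_NON_CALL_PCT = 5.0   # tests without CALLs
GAIN_CALL_PCT = 1.0       # tests doing nothing but CALLs


def blended_gain(call_share):
    """Linear blend of the two gains, weighted by the share of run time spent in CALLs."""
    if not 0.0 <= call_share <= 1.0:
        raise ValueError("call_share must be between 0 and 1")
    return call_share * GAIN_CALL_PCT + (1.0 - call_share) * GAIN_NON_CALL_PCT


for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{share:.0%} of time in CALLs -> about {blended_gain(share):.1f}% faster")
```

An even split between the two kinds of work lands on 3%. Any real program would sit somewhere along that line depending on its mix.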

Edited by senior_falcon
6 hours ago, RXB said:

What kind of computer are you running with Classic99?

I have never had a difference once.

 

I am running a Windows 10 PC with an Asus motherboard, 32 GB of 3900 MHz RAM, a 2 TB M.2 Rocket as the main OS drive, an RTX 2070 Super video card and an AMD 3900 12-core CPU at 4.43 GHz.

 

I still have not received an answer to the question I posed most recently in the RXB thread, page 58, post 1442.

 

(RXB) I ran this test multiple times with same results: (this was looping 10000 times)

TI Basic 13 Minute 44 seconds

XB  13 Minute 9 seconds

XB 2.9 13 Minute 9 seconds

RXB 2022A 13 Minute 8 seconds

 

(RXB) OK RAN A 100000 LOOP TEST SO RESULTS ARE:

XB       = 37 Minutes 5 Seconds

XB 2.9 = 37 Minutes 6 Seconds

RXB     = 26 Minutes 2 Seconds

 

Please explain how looping 10x more only takes 3x longer for XB and 2x longer for RXB?

 

I bring this up because I am getting erratic timing results from Classic99 and would like to know if our problems may be related.
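To put numbers on the discrepancy, here is a short Python check of how the quoted times scale between the 10000-loop and 100000-loop runs. It assumes the "RXB 2022A" and "RXB" entries refer to the same build. The helper names are made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds reading to total seconds."""
    return minutes * 60 + seconds


# (time for 10000 loops, time for 100000 loops), as quoted above.
RESULTS = {
    "XB": (to_seconds(13, 9), to_seconds(37, 5)),
    "XB 2.9": (to_seconds(13, 9), to_seconds(37, 6)),
    "RXB": (to_seconds(13, 8), to_seconds(26, 2)),
}


def scaling_ratio(short_run, long_run):
    """How many times longer the long run took than the short run."""
    return long_run / short_run


for name, (short_run, long_run) in RESULTS.items():
    ratio = scaling_ratio(short_run, long_run)
    print(f"{name}: {ratio:.2f}x longer for 10x the loops")
```

With a steady clock, ten times the loops should take close to ten times as long, so ratios near 2 or 3 point at the timing source and not at the interpreter.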


Here are data points derived from programs I wrote in TI Extended BASIC decades ago, programs that were typical of the mix of things I was interested in at the time. These are larger programs than the typical benchmark, and entail more operations of the sort that XB made possible using calls to its many subprograms. 

 

"SpinBounce" moved an animated sprite on the screen first in an oval pattern, then in a bouncing pattern. The trig was done first (and displayed on the screen), then came the spinning and the bouncing. Important stuff. "Christmas" was an animated Christmas card to my parents that used character graphics to draw their house and place stars randomly in the sky and sprites to provide falling snow. "Microhorse" put mountains, trees, moving clouds, a moving sun, and two running horses on the screen. "TurnBox" used character graphics to draw a box on the screen, first in one orientation, then in another. "Queens" solved the NQueens puzzle. Elsewhere I reported running a Maze generator, which I'll also reproduce here. 

 

I timed each program from "Run" until an arbitrarily selected event occurred (e.g. until "God Rest Ye Merry Gentlemen" started in "Christmas").

 

I used an unmodified, unexpanded console in one instance and my 16-bit enhanced console in the other. This would serve, if anything, to exaggerate the differences between the two environments, although in reality the 16-bit modification makes no significant difference relative to the standard 32K expansion when running Extended BASIC.

 

Program        Stock    Expanded
SpinBounce     35.5     35.0
Christmas      39.1     39.0
MicroHorse     15.2     15.2
TurnBox        19.4     18.8
Queens         38.7     37.7
Maze           47.4     46.1

 

These are the sorts of numbers that disappointed me in 1982*, and prompted me to report a difference of "not much more than 1 or 2%." The fact of the matter was that adding expansion RAM did not significantly improve the performance of TI Extended BASIC for my "applications." 
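For reference, the percentage differences implied by the table can be worked out with a few lines of Python. The timings are copied from the table. The unit is not stated in the post and does not matter for the percentages. The function name is illustrative.

```python
# (stock, expanded) timings copied from the table above.
TIMINGS = {
    "SpinBounce": (35.5, 35.0),
    "Christmas": (39.1, 39.0),
    "MicroHorse": (15.2, 15.2),
    "TurnBox": (19.4, 18.8),
    "Queens": (38.7, 37.7),
    "Maze": (47.4, 46.1),
}


def percent_saved(stock, expanded):
    """Run time saved by the expanded console, as a percentage of the stock time."""
    return 100 * (stock - expanded) / stock


for program, (stock, expanded) in TIMINGS.items():
    print(f"{program:<11} {percent_saved(stock, expanded):4.1f}%")

average = sum(percent_saved(s, e) for s, e in TIMINGS.values()) / len(TIMINGS)
print(f"{'average':<11} {average:4.1f}%")
```

The individual savings range from nothing to roughly 3%, with the average a little under 2%.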

 

*I've gotten over it.

 

 

 

Edited by Reciprocating Bill
A couple numbers reversed

That nails it for me Bill. Thanks.

 

What's even worse is that I accidentally proved to myself, in the middle of this exercise, that I can fill 4K of VDP memory a tiny bit faster than I can fill Expansion RAM using an Assembler program.

Somebody else can try it with conventional tools and prove me wrong but after the initial VDP address setting the byte write speeds seem very close with VDP slightly faster.

2 hours ago, senior_falcon said:

I bring this up because I am getting erratic timing results from Classic99 and would like to know if our problems may be related.

Turn on the VDP FPS display and make sure it's maintaining 60fps. (Video->Show FPS). That'll rule out performance issues.

 

 

9 minutes ago, Tursi said:

Turn on the VDP FPS display and make sure it's maintaining 60fps. (Video->Show FPS). That'll rule out performance issues.

Mid 30s. Like I said, the computer is old and Classic99 is running on Windows XP in a virtual box.

Well, this is embarrassing - I see it is running at 50 Hz. Turning that off gets it running in the high 40s.

Edited by senior_falcon
1 hour ago, senior_falcon said:

Mid 30s. Like I said, the computer is old and Classic99 is running on Windows XP in a virtual box.

Well, this is embarrassing - I see it is running at 50 Hz. Turning that off gets it running in the high 40s.

That's fair, yeah. The emulation no longer attempts to scale down. If it can't keep up, timing cannot be relied on.

 

And 50 Hz ... I've spent so little time on that, that I don't trust it at all. ;)

 

7 hours ago, TheBF said:

That nails it for me Bill. Thanks.

 

What's even worse is that I accidentally proved to myself, in the middle of this exercise, that I can fill 4K of VDP memory a tiny bit faster than I can fill Expansion RAM using an Assembler program.

Somebody else can try it with conventional tools and prove me wrong but after the initial VDP address setting the byte write speeds seem very close with VDP slightly faster.

If my memory isn't too confused, the TMS 9900 can only wait an integer number of cycles. The clock generator doesn't stop running just because there's a WAIT request to the CPU. Thus either it's the same speed, or it's off by one or more full cycles. The CPU checks the wait input on each cycle, so a wait lasting one and a half cycles costs two full cycles. Nothing gets done during the half cycle after WAIT was released, before it is checked again.
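The rounding described here amounts to taking the ceiling of the wait time measured in clock cycles. Below is a minimal Python sketch of that idea. It is a simplified model that ignores where in the cycle the wait begins, and the function name is made up for illustration.

```python
import math


def stall_cycles(wait_in_cycles):
    """Whole clock cycles lost to a wait, assuming the CPU samples WAIT once per cycle."""
    if wait_in_cycles < 0:
        raise ValueError("wait time cannot be negative")
    return math.ceil(wait_in_cycles)


for wait in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"wait of {wait} cycles -> CPU stalls for {stall_cycles(wait)} cycle(s)")
```

In this model two memory types either cost the same or differ by at least one full cycle, never by a fraction of one.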
