
Timing normal 32K RAM vs. 32K 16-bit RAM


RXB


"Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up"

Most games in XB do not use sequential VDP writes or reads, as sprite and character movements mostly need the read address set up each time.

 

But the reading and writing of data in VRAM happens regardless of whether the code runs from CPU RAM or from VDP RAM. That's why the difference is barely noticeable.

 

"An XB routine programmed in assembly is usually much faster than a similar routine written in GPL"

XB is GPL programming, and if you change it to assembly instead it either costs memory or has to be compiled; neither of those will work on an unexpanded console.

Unless you do it in ROM instead, which is how I have approached the issue.

 

Which is the same as saying you agree.

 

"Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory"

4 CPU cycles repeated hundreds of times adds up pretty fast: do a simple loop 100 times and 4 extra cycles per pass add up to 400 cycles on top of the original 100.

That is a huge difference in many instances, especially in XB games, where it could be easy to see.

 

It's the number of wait cycles in relation to the instruction's total cycles that's important. The other cycles also take a hundred times longer for a hundred instructions.

Since Extended BASIC spends so much time doing other things than accessing CPU RAM, you don't even notice the difference. For pure assembly, it's more than twice as fast in 16-bit memory.
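
For a rough sense of scale, with illustrative figures (a register-to-register MOV on the TMS9900 takes 14 clock cycles and makes four memory accesses, and each access over the console's 8-bit multiplexed bus adds 4 wait cycles):

8-bit RAM, code and workspace: 14 + 4 x 4 = 30 cycles
16-bit RAM, code and workspace: 14 cycles, so roughly 2.1 times as fast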

 

"Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM"

As I am not attacking the problem with hardware but with programming, my choices are limited, but the results are way better than pure GPL.

After all, ROM is faster than GPL, and assembly run from ROM in RXB shows the effect is more than a slight increase in speed, for example CALL HCHAR or CALL HPUT.

Over time I just need to put more XML routines in place of GPL routines to speed it up; it is a long, laborious process.

 

There's no question assembly can be faster. The point is it's even faster running from ROM in the console, compared to ROM in the cartridge.


16 hours ago, senior_falcon said:

I still have not received an answer to the question I posed most recently in the RXB thread, page 58, post 1442.

 

(RXB) I ran this test multiple times with same results: (this was looping 10000 times)

TI BASIC: 13 minutes 44 seconds

XB: 13 minutes 9 seconds

XB 2.9: 13 minutes 9 seconds

RXB 2022A: 13 minutes 8 seconds

 

(RXB) OK RAN A 100000 LOOP TEST SO RESULTS ARE:

XB: 37 minutes 5 seconds

XB 2.9: 37 minutes 6 seconds

RXB: 26 minutes 2 seconds

 

Please explain how looping 10x more only takes 3x longer for XB and 2x longer for RXB?

 

I bring this up because I am getting erratic timing results from Classic99 and would like to know if our problems may be related.
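
The arithmetic behind that question, using the figures quoted above:

XB: 13 min 9 s = 789 s and 37 min 5 s = 2225 s, so ten times the loops took 2225 / 789, or about 2.8 times as long
RXB: 13 min 8 s = 788 s and 26 min 2 s = 1562 s, so ten times the loops took 1562 / 788, or about 2.0 times as long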

The more loops, the more the difference adds up.

A result under 1 ms does not tell you much about cost; if the loop is only 1000 iterations, a small difference is invisible because it is less than the test can detect.

So logically, the larger the loop, the more apparent the difference becomes.

A larger loop makes even small differences easier to see; otherwise you will not see the difference at all and the test fails.

 

And again, what computer are you running? I have never gotten different results using Classic99 on my Mac Pro or my PC.

How is your computer so inconsistent?

A 10,000-iteration loop will ferret out more of the difference and make it easier to see.
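
A minimal sketch of the kind of magnifying loop being described (a hypothetical listing; the loop count in line 110 and the statement being timed in line 120 are placeholders):

100 PRINT "START"
110 FOR I=1 TO 10000
120 CALL HCHAR(12,16,42)
130 NEXT I
140 PRINT "DONE"

Time from START to DONE, then rerun with a larger limit in line 110; any per-pass difference is multiplied by the loop count.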


14 hours ago, Reciprocating Bill said:

Here are data points derived from programs I wrote in TI Extended BASIC decades ago, programs that were typical of the mix of things I was interested in at the time. These are larger programs than the typical benchmark, and entail more operations of the sort that XB made possible using calls to its many subprograms. 

 

"SpinBounce" moved an animated sprite on the screen first in an oval pattern, then in a bouncing pattern. The trig was done first (and displayed on the screen), then came the spinning and the bouncing. Important stuff. "Christmas" was an animated Christmas card to my parents that used character graphics to draw their house and place stars randomly in the sky and sprites to provide falling snow. "Microhorse" put mountains, trees, moving clouds, a moving sun, and two running horses on the screen. "TurnBox" used character graphics to draw a box on the screen, first in one orientation, then in another. "Queens" solved the NQueens puzzle. Elsewhere I reported running a Maze generator, which I'll also reproduce here. 

 

I timed each program from "Run" until an arbitrarily selected event occurred (e.g. until "God Rest Ye Merry Gentlemen" started in "Christmas").

 

I used an unmodified, unexpanded console in one instance and my 16-bit enhanced console in the other. This would serve, if anything, to exaggerate the differences between the two environments, although in reality the 16-bit modification makes no significant difference relative to the standard 32K expansion when running Extended BASIC.

 

                  Stock    Expanded   (times in seconds)

SpinBounce         35.5      35.0
Christmas          39.1      39.0
MicroHorse         15.2      15.2
TurnBox            19.4      18.8
Queens             38.7      37.7
Maze               47.4      46.1

 

These are the sorts of numbers that disappointed me in 1982*, and prompted me to report a difference of "not much more than 1 or 2%." The fact of the matter was that adding expansion RAM did not significantly improve the performance of TI Extended BASIC for my "applications." 

 

*I've gotten over it.
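
Worked out from the table, the stock versus expanded differences come to:

SpinBounce: 0.5 / 35.5, about 1.4%
Christmas: 0.1 / 39.1, about 0.3%
MicroHorse: 0.0 / 15.2, exactly 0%
TurnBox: 0.6 / 19.4, about 3.1%
Queens: 1.0 / 38.7, about 2.6%
Maze: 1.3 / 47.4, about 2.7%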

 

 

 

Your test does not have enough loops to even spot a difference; the difference may very well be under 1/3 ms, so such a short, small test will never show any difference.

To a human this is like looking at microwaves: we cannot see them because the waves are too small to see. Your test has the same issue.

A loop will show that small difference magnified, so the larger the loop, the more it shows that difference.

Do you not understand sample size in science? Too small a sample size is just as invalid as too large a one.

The sample size has to be enough to show change; otherwise, in science, the test is invalid.

 


6 hours ago, apersson850 said:

"Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up"

Most games in XB do not use sequential VDP writes or reads as Sprite and Character movements mostly need read address set up.

 

But the reading and writing to data in VRAM happens regardless of whether the code runs in CPU RAM or in VDP RAM. That's why the difference is barely noticeable.

 

"An XB routine programmed in assembly is usually much faster than a similar routine written in GPL"

XB is GPL programming, and if you change it to Assembly instead it costs memory or has to be compiled either of these will not work in Console.

Unless you do it in ROM instead like I have approached the issue.

 

Which is the same as saying you agree.

 

"Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory"

4 CPU cycles by hundreds of times will add up pretty fast, do a simple loop 100 times and 4 per cycle becomes 400 vs previous 100

That is a huge difference in many instances especially in XB games that could be easy to see.

 

It's the number of cycles in relation to the total instruction cycle that's important. They also takes a hundred times longer for a hundred instructions.

Since Extended BASIC spends so much time doing other things than accessing CPU RAM, you don't even notice the difference. For pure assembly, it's more than twich as fast in 16-bit memory.

 

"Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM"

As I am not attacking the problem with hardware but with programming my choices are limited, but the results are way better then pure GPL.

After all ROM is faster then GPL and Assembly from ROM per RXB show the effect is more then slightly increases speed like CALL HCHAR or CALL HPUT

Over time I just need to put more XML routines in place of GPL routines to speed it up, it is a laborious long process.

 

There's no question assembly can be faster. The point is it's even faster running from ROM in the console, compared to ROM in the cartridge.

And how is that relevant to this post? The discussion was VDP speed vs. RAM or ROM.

Also, even in those flawed tests there is a difference, but the test is so short that it cannot resolve values below 1 ms at all; that makes the test invalid.

And yet you base your opinion on this invalid test?


Yes, and that's what we're discussing. Well, the rest of us, at least. Memory-mapped autoincrementing memory vs. "normal" memory.

 

I presume you aren't so daft as to think a difference of a millisecond matters when we have program runtimes of a hundred seconds, are you? It takes a full second to make one percent, and nobody will notice one percent.

It's a valid test. Everybody else here has already understood that.


12 minutes ago, RXB said:

Your test does not have enough loops to even spot a difference; the difference may very well be under 1/3 ms, so such a short, small test will never show any difference.

To a human this is like looking at microwaves: we cannot see them because the waves are too small to see. Your test has the same issue.

A loop will show that small difference magnified, so the larger the loop, the more it shows that difference.

30 seconds is enough to show if there was a difference. Since there's hardly any, there's no point in running for 30 minutes.


21 minutes ago, RXB said:

And again, what computer are you running? I have never gotten different results using Classic99 on my Mac Pro or my PC.

How is your computer so inconsistent?

Yours is too, since looping ten times as much took only two or three times as long to execute. Or so you claimed.


33 minutes ago, apersson850 said:

Yours is too, since looping ten times as much took only two or three times as long to execute. Or so you claimed.

As apersson points out and as anyone can see, there is something very wrong here. There are 3 possibilities that I can see that could account for this:

1 - Besides the loop count (10000 vs 100000) the programs were different somehow.

2 - For some reason Classic99 on your computer has some major errors when measuring time.

3 - You just made up the numbers.

Maybe there is another explanation that I haven't thought of.


42 minutes ago, RXB said:

Your test does not have enough loops to even spot a difference; the difference may very well be under 1/3 ms, so such a short, small test will never show any difference.

Each program had processes to complete (mathematical computations for some, character redefinitions for others, data retrievals for others, etc.) that consumed a significant amount of time (between 15 and 47 seconds). Adding expansion RAM did not make a significant difference in the time to completion for those processes.

 

Would increasing the number of loops change that judgment? It would not. Let's say we run the maze example 100 times, rather than once. Then the test completes in 1 hour, 19 minutes in one instance versus about 1 hour, 17 minutes in the other. Is that a significant difference? It is not. Without running the processes side by side, and without the use of a stopwatch or other formal timing, you'd be hard pressed to say which was which as they ran.
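
(Checking the arithmetic: 47.4 s x 100 = 4740 s, about 1 hour 19 minutes, and 46.1 s x 100 = 4610 s, about 1 hour 17 minutes, matching the figures above.)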


Another way to understand why CPU RAM, or faster CPU RAM (16-bit wide, no wait states) has so little influence is to see how it's used.

If you run a pure assembly program in CPU RAM (the memory expansion), with both code and workspace in that RAM, then the speed increase is around 110% if you have 16-bit memory instead of 8-bit. Put differently, it executes in a little less than half the time.

 

Now if this program is using data in VDP RAM, it would still execute at twice the speed, because it's the code itself that runs faster. That the access to VDP RAM is more cumbersome doesn't matter. The instructions doing it still run at twice the speed.

 

The fact that there's no noticeable difference for Extended BASIC is because it doesn't run a single instruction in this memory. Nor does it use it for any workspace.

 

So, what is going on with Extended BASIC, then? Where is it running?

First, the workspace is in the RAM PAD, which is 16-bit wide already. No change there.

Second, when the GPL interpreter runs, it's in 16-bit wide ROM in the console. No change there.

When it's executing GPL code in the Extended BASIC cartridge, it doesn't matter where these GROMs are. It's the same timing. No change there.

When it's executing assembly programs in the same cartridge, they are in ROM with 8-bit access no matter what. No change there.

When it's accessing data in VDP memory, it's the same timing as always. No change there.

It's only when it's reading the bytes to interpret from CPU RAM, where stepping the address is simpler, that there is a tiny improvement.

Numeric variables are likewise just as easily accessed, since there's no special address setup for them either. There's a tiny improvement there too, probably worth more than the code access itself.

Finally, the fact that changing the VDP address doesn't disturb the address used to read the next BASIC instruction also saves a few instructions now and then. There's a tiny improvement.

 

So, a few tiny improvements, and most everything stays exactly the same. That's why four wait cycles at each CPU RAM access are negligible in this case.
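
To put rough, purely illustrative numbers on that (assuming the 99/4A's TMS9900 at 3 MHz, one XB statement costing on the order of 30,000 cycles, about 10 ms, to interpret, and a few dozen CPU RAM byte reads per statement):

assumed CPU RAM reads per statement: 40
wait-cycle penalty at 4 cycles each: 40 x 4 = 160 cycles
share of the statement's 30,000 cycles: about 0.5%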


14 hours ago, Tursi said:

And 50hz ... I've spent so little time on that, that I don't trust it at all. ;)

I have had only good results when testing Extended Parsec on a 50 Hz FlashROM99 console compared to Classic99. As it uses a lot of "automotion" sprites, timing is critical.

 

On 6/5/2022 at 8:12 PM, TheBF said:

IMHO 5% is not nothing, but very close to it. :)

I still try to teach my son not to buy new hardware if it does not double a KPI like size, speed, etc. 5% is nice, at most.


4 hours ago, RXB said:

Your test does not have enough loops to even spot a difference; the difference may very well be under 1/3 ms, so such a short, small test will never show any difference.

To a human this is like looking at microwaves: we cannot see them because the waves are too small to see. Your test has the same issue.

A loop will show that small difference magnified, so the larger the loop, the more it shows that difference.

Do you not understand sample size in science? Too small a sample size is just as invalid as too large a one.

The sample size has to be enough to show change; otherwise, in science, the test is invalid.

1/3 ms in 30 seconds works out to 1.000011x faster. Not exactly a scorching increase in speed.

Sounds like you have your work cut out for you. I showed you how to run a program with the 32K turned off. To recap:

First save the program you are testing to disk, using the 32K expansion. (This is the default mode in Classic99)

CALL INIT

CALL LOAD(-31868,0,0,0,0)   

OLD DSK1.PROGRAM       (for this to work you cannot enter the program; you have to load it from disk)

SIZE   -   you will see that the 32K is not being used

Now you can test both with and without 32K, and are no longer dependent on buffoons with stopwatches to do your testing for you. Use a program with a typical mix of CALLs, arithmetic operations, DISPLAY AT, etc. You can loop 1,000,000 times or even 1,000,000,000 if that makes you feel better.

Be sure to report on what you find out.
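
A hypothetical test program of the sort described, mixing arithmetic, CALLs and DISPLAY AT (the statements and the loop count are placeholders; save it to disk first so it can be reloaded with OLD as described above):

100 PRINT "START"
110 FOR I=1 TO 10000
120 X=SQR(I)+I/3
130 CALL HCHAR(1,1,65,32)
140 DISPLAY AT(24,1):X
150 NEXT I
160 PRINT "DONE"

Run it once with the 32K active and once with it turned off per the steps above, and compare the two times.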


15 hours ago, apersson850 said:

30 seconds is enough to show if there was a difference. Since there's hardly any, there's no point in running for 30 minutes.

Hmmm, do you not understand what sample size means?

And are you still using a stopwatch, like watching a horse race?


15 hours ago, apersson850 said:

Yours is too, since looping ten times as much took only two or three times as long to execute. Or so you claimed.

I never said I get inconsistent times on my PC or Mac Pro, not once.


14 hours ago, apersson850 said:

Another way to understand why CPU RAM, or faster CPU RAM (16-bit wide, no wait states), has so little influence is to see how it's used.

[...]

So, a few tiny improvements, and most everything stays exactly the same. That's why four wait cycles at each CPU RAM access are negligible in this case.

LOL, you act as if the 4 cycles never add up, as if it were only 4 instructions in total.

Of course you would not notice; with a sample size so small, your timer is useless.



1 hour ago, apersson850 said:

I hope they have mail in hell, because that's where I'll be before that's done.

That would be too large a sample size; it would show the difference, but doing the math, it is unlikely to get any more accurate.

Too small a sample size is just as bad, as then you are just guesstimating.


56 minutes ago, RXB said:

Hmmm, do you not understand what sample size means?

And are you still using a stopwatch, like watching a horse race?

Yes, I do understand sample size, but you don't. It's relevant when you have variations between the samples. When you run x identical loops, there's no variation.
In this case a stopwatch will do. When I do real testing on the real 99/4A, I use the clock in the computer. Those who don't have that use a stopwatch.


53 minutes ago, RXB said:

LOL, you act as if the 4 cycles never add up, as if it were only 4 instructions in total.

Of course you would not notice; with a sample size so small, your timer is useless.

Your addition skills are limited (laughing loudly). There are many more cycles than the wait cycles, so the other cycles add up to a larger number. That's why the wait cycles are barely noticeable even in the long run. When you run Extended BASIC, that is, since it doesn't use that memory much. Even if you run for hours, the relative difference is equally irrelevant.


3 hours ago, apersson850 said:

So what caused ten times the loops to take only twice the time, not ten times the time, then? Bad math?

This doesn't even tell the whole story. The thread was "Byte Magazine Sieve Benchmark".

 

In post #41 Rasmus reports:

  These are my results using Classic99 QI399.057:

  TI BASIC: 10:24

  XB: 3:43

  RXB 2022A: 8:39

 

In post #45, Bill reports:

  ...on real iron* running out of a FinalGrom, I get: (this was looping 1000 times)

  TI-BASIC 60 seconds                x10= 10 minutes

  Extended Basic 23 seconds       x10= 3 minutes 50 seconds

  RXB 2020  23 seconds

  RXB 2022  52 seconds              x10= 8 minutes 40 seconds

 

Yet in post #43 Rich reports:

  I ran this test multiple times with same results:

  TI BASIC: 13 minutes 44 seconds

  XB: 13 minutes 9 seconds

  RXB 2022A: 13 minutes 8 seconds

 

And in post #44 Rich reports:

  OK RAN A 100000 LOOP TEST SO RESULTS ARE:

  XB: 37 minutes 5 seconds

  XB 2.9: 37 minutes 6 seconds

  RXB: 26 minutes 2 seconds

 

On Classic99, Rasmus gets results very similar to Bill's results on a real TI99. Rich's tests do not even remotely resemble those results. Tursi has shown why my geriatric computer gives inaccurate results. I think we can rule out performance issues on Rich's computer, so that leaves just 2 possibilities that I can see:

1 - The programs he is running are not the same as the one he posted that Rasmus and Bill ran, and not even the same in his two tests (10000 loop vs 100000 loop)

2 - He has fabricated those numbers

Maybe there is another explanation? Perhaps he is using a different number system, such as quinary or undecimal?

