
Timing normal 32K RAM vs 32K 16bit RAM



Actually, most of the GROM space on the p-Code card is being used as a GROM-disk holding a lot of the start-up data to initialize the p-System environment; there isn't much GPL code in there at all. I've seen the memory map that shows what is what in there, although I don't have it handy right now. I may have to go dig that out again. I think it is in the p-System specifications document. . .


17 hours ago, Willsy said:

Does this summarize the facts?

This list of facts is far from complete.  

 

     * As a result of the crisis of the late 16th century in Russia, many service-class landowners were left with no peasants working their land.

     * June 3 is Martyrs Day in Uganda.

     * Time flies like an arrow. Fruit flies like a banana. 

     * Myth: Something that never was true, and always will be. 

     * William Shakespeare once described a shepherd's hat as a "platted hive of straw."

 

There are probably more. 

Edited by Reciprocating Bill

1 hour ago, Asmusr said:

Does this summarize the facts?

  • Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up
  • If you need to set up the read address first, reading from GROM/VDP is a lot slower than reading from CPU memory
  • It's about 5% faster to execute an XB program from 32K RAM instead of from VDP RAM only
  • An XB routine programmed in assembly is usually much faster than a similar routine written in GPL
  • Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory
  • Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM

 

After 3 pages of bloviation, that seems like a succinct recap. The one thing I don't know is this: "Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory", but if you say it is so, that is good enough for me.

You pay closer attention to clock cycles than I do, which is one reason your programs run faster than mine!


I may be pulling this off the original topic a bit, but as I was hunting the p-System GROM memory map, I came across an interesting tidbit in the GPL Programmer's Guide (on page H-6). Most everyone has always worked on the assumption that we have exactly 5 GROMs available to a program in the cartridge port (8 if we are overriding the Console GROMs). The specification says otherwise. You could technically use the 5 cartridge GROMs in all 16 GROM base addresses as part of a single program using the CALL procedure. The UberGROM gives you enough GROM space to fully populate three bases, allowing a GPL program of up to 120K in size. CALLS may be nested as well, for additional flexibility.


3 hours ago, Asmusr said:

Does this summarize the facts?

  • It's about 5% faster to execute an XB program from 32K RAM instead of from VDP RAM only

For a full understanding, we should also realize that it's not just reading sequential code bytes that happens here. Things like reading the line number table and accessing variables are faster in CPU RAM, because in VDP RAM the read address has to be reloaded much more frequently for those accesses than when running linear code. Executing GPL from GROM is less affected, as the GROM read address is separate from the VDP read address.

Now it was a long time ago, but if my memory is correct, Extended BASIC moves numeric variables to CPU RAM too, doesn't it? Only strings remain in VDP RAM, I think. Correct me if I'm wrong here.


1 hour ago, senior_falcon said:

After 3 pages of bloviation, that seems like a succinct recap. The one thing I don't know is this: "Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory", but if you say it is so, that is good enough for me.

You pay closer attention to clock cycles than I do, which is one reason your programs run faster than mine!

 

The point is, although we see the recurring notion that the 8-bit bus is the problem, it is in fact the wait states that slow the machine down. The TI console inserts 2 wait states for each byte access cycle, so a MOV from the external bus comprises 6 CPU cycles (two for the byte transfers, and four wait states). The 16-bit access with no wait states uses 2 cycles (the minimum time).
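To put numbers on that, here is a minimal Forth sketch using only the figures quoted in this thread; the constant and word names below are mine, not from any TI source:

\ Per-word memory access cost, as described above.  Sketch only.
DECIMAL
2 CONSTANT 16BIT-CYCLES              \ 16-bit bus, 0 wait states
2 2 2 * + CONSTANT 8BIT-CYCLES       \ 2 byte transfers + 2x2 wait states = 6
: .PENALTY ( -- )
   CR 8BIT-CYCLES 16BIT-CYCLES - .
   ." extra clock cycles per word on the 8-bit bus" ;   \ prints 4

That difference of 4 cycles per memory access is where the "exactly 4 CPU cycles faster" figure in the summary above comes from.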


3 hours ago, Ksarul said:

Actually, most of the GROM space on the p-Code card is being used as a GROM-disk holding a lot of the start-up data to initialize the p-System environment; there isn't much GPL code in there at all. I've seen the memory map that shows what is what in there, although I don't have it handy right now. I may have to go dig that out again. I think it is in the p-System specifications document. . .

That's "near truth", but not fully accurate.

The p-code card has 12 K ROM. There you find the bulk of the PME (P-Machine Emulator, the p-code interpreter), the implementation of special intrinsics like moveleft and treesearch, and some BIOS routines.

 

The GROM chips have two purposes.

 

They are a repository for assembly code, which is transferred to the 8 K RAM expansion. This is done in two stages. First, the system transfers boot code to this area. This is run during the startup of the system, to set up a lot of things required before p-code can be interpreted.

Then another transfer takes place, to move code used during operation of the p-system to the same memory area. This time the screen is getting operational too, so the first 1920 bytes of the 8 K RAM are left untouched by this code. The reason the p-system needs code here is that it's implemented on an expansion box card, not in a cartridge. Hence the p-code card has to be disabled to be able to access the RS232 card and the disk controller. The code handling that obviously has to run from somewhere other than the p-code card.

 

The second purpose is to implement the GROM-disk, which presents itself as blocked device #14:, called OS:. It holds three files, SYSTEM.MISCINFO, SYSTEM.CHARAC and SYSTEM.PASCAL.

They hold information about p-system settings, the character definitions for the system and the operating system itself. They are of course fixed in GROM, but the system is designed in such a way that if you do have a *SYSTEM.MISCINFO or *SYSTEM.CHARAC file on your root disk, #4:, it will be used instead of the corresponding file on OS:. This applies to the entire SYSTEM.MISCINFO and SYSTEM.CHARAC files. SYSTEM.PASCAL is different. It's a segmented code file, which can be updated by creating a *SYSTEM.PASCAL file on the root disk on a segment basis. That is, you can create a *SYSTEM.PASCAL file which contains only the segment COMMANDIO and store that on the root disk. If you do, then the system will run the segment COMMANDIO from the *SYSTEM.PASCAL file on the root disk, but all other segments will still be used from the fixed file OS:SYSTEM.PASCAL.

Unlike earlier versions, the p-system IV doesn't allocate code space on the heap. Instead, it has a separate concept called the code pool. On single memory systems, this pool resides between the heap and the stack. On systems with more than 64 K RAM, the heap and stack can reside in one such 64 K RAM segment and the code pool in another. This allows the p-system to make good use of up to 128 K RAM, in spite of addresses being only 16 bits.
On the 99/4A, there are instead two code pools. One is the normal one, between the heap and stack in the 24 K RAM. The other one is in VDP RAM, in the area between what's needed for the screen and the disk controller buffer.

 

The PME can interpret code regardless of whether it's in VDP RAM or CPU RAM. The interpreter's core exists in more than one version. The core is moved to the scratchpad RAM (PAD) for the highest possible speed, and is simply replaced by the appropriate version, depending on where the p-code to interpret is located.

The PME is able to interpret p-code not only from the code pools in VDP and CPU RAM, but also directly from GROM. I've not investigated the system enough to figure out if it's really running p-code in the SYSTEM.PASCAL file directly from GROM, instead of first moving it to a code pool. The latter would be easier, since there would be no difference compared to any other code. But making the effort to move only the code environment (variables and such) to RAM, while running the code itself from GROM, would save memory space. For this discussion, it doesn't really matter.

 

Anyway, the important thing here, when it comes to the p-system's relation to GPL, is that there is none. The only thing they have in common is that data is stored in GROM chips. There's not a single GPL instruction in the p-system. Which is pretty obvious, since the system was available for several different computers, none of which had any GPL at all.

Edited by apersson850

5 hours ago, RXB said:

LOL so GROM in the PCODE card is just data?

Yes, a repository for assembly code, which is moved to RAM to be executed, and the files making up the OS: volume. The files contain data and the operating system, which is written in Pascal and compiled to p-code. That p-code is in the GROMs.


48 minutes ago, Ksarul said:

I may be pulling this off the original topic a bit, but as I was hunting the p-System GROM memory map, I came across an interesting tidbit in the GPL Programmer's Guide (on page H-6). Most everyone has always worked on the assumption that we have exactly 5 GROMs available to a program in the cartridge port (8 if we are overriding the Console GROMs). The specification says otherwise. You could technically use the 5 cartridge GROMs in all 16 GROM base addresses as part of a single program using the CALL procedure. The UberGROM gives you enough GROM space to fully populate three bases, allowing a GPL program of up to 120K in size. CALLS may be nested as well, for additional flexibility.

The p-code card has a different GROM access address. It's within the p-code card's address space, 4000-5FFF. Thus it doesn't interfere with the console GROMs, and the p-code card has 8 GROM chips, for a total of 48 K GROM. Then there's 12 K ROM, of which the second half is paged. So in a way this is the 17th GROM base address in the system.


27 minutes ago, mizapf said:

 

The point is, although we see the recurring notion that the 8-bit bus is the problem, it is in fact the wait states that slow the machine down. The TI console inserts 2 wait states for each byte access cycle, so a MOV from the external bus comprises 6 CPU cycles (two for the byte transfers, and four wait states). The 16-bit access with no wait states uses 2 cycles (the minimum time).

Yes, for each word accessed there is one wait state per byte to manage the juggling of the bytes needed to present the 8-bit data on the 16-bit data bus, and one more per byte just to wait for the memory timing.


32 minutes ago, apersson850 said:

Anyway, the important thing here, when it comes to the p-system's relation to GPL, is that there is none. The only thing they have in common is that data is stored in GROM chips. There's not a single GPL instruction in the p-system. Which is pretty obvious, since the system was available for several different computers, none of which had any GPL at all.

You gave a much fuller explanation (many thanks), but we were thinking the same things here. I didn't mention the ROMs (serious omission as it initiates the process), but you hit on everything I remembered and then some.


5 hours ago, RXB said:

You know this how? Can you show me how you know this with some of the code? As a GPL programmer I would like to see it.

From very extensive studies of the system. It's completely beyond any reasonable scope to post enough about the system here to show this. The p-system is far more complex than the GPL system in the console.

Remember that the p-system is not an application. It's an operating system. Once it has booted, it never runs the machine's native operating system in the console ROMs. The only thing it does is call some console routines for floating point operations, keyboard scanning and similar stuff. It never runs the GPL interpreter. It never reaches the normal startup screen with the color bars.

Edited by apersson850

Just now, Ksarul said:

You gave a much fuller explanation (many thanks), but we were thinking the same things here. I didn't mention the ROMs (serious omission as it initiates the process), but you hit on everything I remembered and then some.

Yes, your post was generally correct. But since we've discussed a lot of details in this thread, I considered that it was important to get the details correct too.

In spite of the fact that I've done a lot of digging into the p-system, I've still far from touched all of it. It's a pity the source code for the implementation on the 99/4A has never been published, but for commercial reasons that will probably never happen. Even if the commercial value doesn't exist any longer.


1 hour ago, apersson850 said:

Yes, your post was generally correct. But since we've discussed a lot of details in this thread, I considered that it was important to get the details correct too.

In spite of the fact that I've done a lot of digging into the p-system, I've still far from touched all of it. It's a pity the source code for the implementation on the 99/4A has never been published, but for commercial reasons that will probably never happen. Even if the commercial value doesn't exist any longer.

Actually, I need to dive into a couple of TI-99/8 documents I have that I haven't had a chance to scan yet. One is a binder of system source code, but I think it is mostly for the Extended BASIC II interpreter. The biggest problem in getting source code for any V4.x p-System implementation is that all of the system adaptation work was done by Softech Microsystems (and later, Pecan). They generally only provided the finished product, not the associated source code, to the system vendor commissioning the port.


Another data point on the question of the speed of access to VDP RAM versus CPU RAM.

 

I've adapted my assembly version of the BYTE Sieve of Eratosthenes such that the 8192 byte data array is stored in VDP RAM. This is a task that, for the most part, can't utilize the auto-incrementing of VDP addresses because reads and writes are not to sequential addresses (an exception being the initialization of the array at the outset of each iteration, but that is a small percentage of the time required to complete the Sieve.) If I'm getting this right, the array is read/written to 313,810 times during 10 iterations of the Sieve (including initialization).

 

The results:

 

The unmodified Sieve (utilizing CPU RAM) executes in 6.4 seconds on my 16-bit console. On the same console, the Sieve takes 17.7 seconds* when accessing VDP RAM for the data array (as the stopwatch flies). It executes in 23.5 seconds on Classic99 with "normal" CPU throttling.

 

*a couple optimizations got this down to 16.6 seconds.

Edited by Reciprocating Bill

10 hours ago, RXB said:

LOL so GROM in the PCODE card is just data?

You know this how? Can you show me how you know this with some of the code? As a GPL programmer I would like to see it.

We have quite a few cartridge images that do exactly this, Rich. They lob the data in GROM into the 32K space and execute from there. Because it is in cartridge space, it requires a short GPL DSR to lob the data, but you don't need that bit of GPL code in the p-System, because it already has a ROM DSR in Assembly to handle the data moves. As noted, the data in GROM space is device #14 in the p-System.


11 hours ago, Ksarul said:

I may be pulling this off the original topic a bit, but as I was hunting the p-System GROM memory map, I came across an interesting tidbit in the GPL Programmer's Guide (on page H-6). Most everyone has always worked on the assumption that we have exactly 5 GROMs available to a program in the cartridge port (8 if we are overriding the Console GROMs). The specification says otherwise. You could technically use the 5 cartridge GROMs in all 16 GROM base addresses as part of a single program using the CALL procedure. The UberGROM gives you enough GROM space to fully populate three bases, allowing a GPL program of up to 120K in size. CALLS may be nested as well, for additional flexibility.

Yeah, I have been telling people this for many years now; I demonstrated it using Classic99 a number of years ago.

SWGR source,destination switches GROM base page. 

RTGR address returns to the original GROM base page.


9 hours ago, Reciprocating Bill said:

Another data point on the question of the speed of access to VDP RAM versus CPU RAM.

 

I've adapted my assembly version of the BYTE Sieve of Eratosthenes such that the 8192 byte data array is stored in VDP RAM. This is a task that, for the most part, can't utilize the auto-incrementing of VDP addresses because reads and writes are not to sequential addresses (an exception being the initialization of the array at the outset of each iteration, but that is a small percentage of the time required to complete the Sieve.) If I'm getting this right, the array is read/written to 313,810 times during 10 iterations of the Sieve (including initialization).

 

The results:

 

The unmodified Sieve (utilizing CPU RAM) executes in 6.4 seconds on my 16-bit console. On the same console, the Sieve takes 17.7 seconds* when accessing VDP RAM for the data array (as the stopwatch flies). It executes in 23.5 seconds on Classic99 with "normal" CPU throttling.

 

*a couple optimizations got this down to 16.6 seconds.

Very nice.

If I do that in Forth, the difference should be smaller than in your hand-coded program, if what we have been saying is correct, because the interpreter is about 50% of the runtime in indirect-threaded Forth.

So VDP memory access is a smaller percentage of the total runtime.
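A rough way to see why (my own back-of-envelope model, not anything from the thread's code): if a fraction f of the runtime is array access and only that part slows down by a factor s, the overall slowdown is (100 - f) + f*s/100 with everything scaled to percent. A small Forth word to play with the numbers:

\ Back-of-envelope model only.
\ f = percent of runtime spent on array access
\ s = slowdown factor of that part, times 100 (280 = 2.8x, roughly the
\     assembly sieve result above)
\ Result: overall runtime, times 100 (100 = unchanged).
: SLOWDOWN ( f s -- n )   OVER 100 SWAP -  ROT ROT 100 */  + ;

10 280 SLOWDOWN .   \ if only 10% of the time touches the array: 118, i.e. ~1.18x

The 10% is just an illustrative guess, but it shows the point: the smaller the share of time spent actually touching the array, the less the VDP penalty shows up in the total.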

I will give it a shot.


That was pretty easy. I only had to change 4 things.  (edit: missed one operator. Has no impact on Runtime.  ) 

Change the base address to a VDP location and replace the four memory operators with their VDP equivalents.


\ VDP memory address used for array
HEX 1000 CONSTANT FLAGS   ( SIZE ALLOT)  0 FLAGS V!

DECIMAL
 8190 CONSTANT SIZE

: DO-PRIME
   FLAGS SIZE  1 VFILL  ( set array )
   0        ( counter )
   SIZE 0
   DO FLAGS I + VC@
     IF I DUP +  3 +  DUP I +
        BEGIN
          DUP SIZE <
        WHILE
           0 OVER FLAGS +  VC!
           OVER +
        REPEAT
        DROP DROP
        1+
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

: PRIMES ( -- )
   ."  10 Iterations"
   10 0 DO  DO-PRIME  LOOP
   CR ." Done!"
;

 

 

So on my indirect threaded Forth, the one with the biggest interpreter, (3 instructions) I re-ran the regular sieve code on real iron this time.

Time was 2:08 ( 128 seconds made with a stop-watch rounded up to seconds)  

 

Putting the 8K array in VDP RAM and using VDP operators to read/write it slowed it down to 2:27 (147 seconds).

 

So a slowdown of ~1.15X (15%)

 

The Assembler program slowed down 2.8X.  (176%)

So the theory holds. 

The difference in using VDP memory is smaller once an interpreter is involved.

Looks like the bigger the interpreter the smaller the difference as we see in BASIC with only a 5% slowdown. 

 

 

 


3 hours ago, TheBF said:

That was pretty easy. I only had to change 4 things.  (edit: missed one operator. Has no impact on Runtime.  ) 

Change the base address to a VDP location and replace the four memory operators with their VDP equivalents.



\ VDP memory address used for array
HEX 1000 CONSTANT FLAGS   ( SIZE ALLOT)  0 FLAGS V!

DECIMAL
 8190 CONSTANT SIZE

: DO-PRIME
   FLAGS SIZE  1 VFILL  ( set array )
   0        ( counter )
   SIZE 0
   DO FLAGS I + VC@
     IF I DUP +  3 +  DUP I +
        BEGIN
          DUP SIZE <
        WHILE
           0 OVER FLAGS +  VC!
           OVER +
        REPEAT
        DROP DROP
        1+
     THEN
   LOOP
   CR SPACE . ." Primes"  ;

: PRIMES ( -- )
   ."  10 Iterations"
   10 0 DO  DO-PRIME  LOOP
   CR ." Done!"
;

 

 

So on my indirect threaded Forth, the one with the biggest interpreter, (3 instructions) I re-ran the regular sieve code on real iron this time.

Time was 2:08 ( 128 seconds made with a stop-watch rounded up to seconds)  

 

Putting the 8K array in VDP RAM and using VDP operators to read/write it slowed it down to 2:27 (147 seconds).

 

So a slowdown of ~1.15X (15%)

 

The Assembler program slowed down 2.8X.  (176%)

So the theory holds. 

The difference in using VDP memory is smaller once an interpreter is involved.

Looks like the bigger the interpreter the smaller the difference as we see in BASIC with only a 5% slowdown. 

 

 

 

So as this all started it was 5% not 1% ?

And VDP is not the same speed as RAM nor would it ever compare in speed to 16bit RAM.

 

Also, comparing an array test in Forth to TI Basic is not ideal, because Forth does not run the program from VDP like Basic does, and it does not store variables in VDP like Basic does.

After all in Basic, the tokens are in VDP, the Variables are in VDP, the DSR is in VDP and the program itself is in VDP.

 

Now XB can be like TI Basic or it can move the Token/Program to RAM in upper 24K.

Some variables like numeric constants may reside in RAM not VDP, but most variables like numbers and strings are all in VDP.

The biggest slowdown for me, being a GPL programmer for XB for 30 years, is VDP address >0820, where it pushes the crunched line.

If this was in RAM instead of VDP it would considerably speed up XB.

Reading and writing RAM does not require 3 or 4 extra instructions per access the way reading or writing VDP does.
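To make that concrete, here is a minimal sketch in Forth of what a single random VDP byte read has to do, versus a plain CPU RAM fetch. The port addresses are the standard console ones; the constant and word names are mine, and the assumption is a Forth on real hardware where C@ and C! reach the memory-mapped VDP ports directly. It is roughly what a word like the VC@ used in the sieve listing above has to do under the hood:

\ Sketch only - illustrates the address setup cost described above.
HEX
8C02 CONSTANT VDPWA          \ VDP write-address port
8800 CONSTANT VDPRD          \ VDP read-data port

: VDP-C@ ( vaddr -- c )      \ read one byte from VDP RAM
   DUP VDPWA C!              \ 1: send low byte of the VDP address
   100 / VDPWA C!            \ 2: send high byte (top bits 00 = read)
   VDPRD C@ ;                \ 3: only now fetch the byte itself
DECIMAL

\ A CPU RAM read is just  addr C@  - one fetch, no setup.  Sequential
\ VDP reads can skip steps 1 and 2 because the VDP address
\ auto-increments, which is why linear code or data in VDP is cheap
\ while scattered accesses (variables, the crunch buffer at >0820) are not.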

 

 


1 hour ago, RXB said:

So as this all started it was 5% not 1% ?

And VDP is not the same speed as RAM nor would it ever compare in speed to 16bit RAM.

 

Also, comparing an array test in Forth to TI Basic is not ideal, because Forth does not run the program from VDP like Basic does, and it does not store variables in VDP like Basic does.

After all in Basic, the tokens are in VDP, the Variables are in VDP, the DSR is in VDP and the program itself is in VDP.

 

In the test above, the array was all in VDP RAM. That was the only variable used when the program runs.

All this shows that the big difference between different memory types doesn't matter much in interpreted programs.

 


1 hour ago, RXB said:

Some variables like numeric constants may reside in RAM not VDP, but most variables like numbers and strings are all in VDP.

The Extended BASIC manual, page 169 (on the Size command):

 

"If the Memory Expansion is attached, the space available in the stack is the amount of space left after the space taken up by string values, information about variables, and the like is subtracted. Program space is the amount of space left after the space taken up by the program and the values of numeric variables is subtracted."


1 hour ago, RXB said:

So as this all started it was 5% not 1% ?

And VDP is not the same speed as RAM nor would it ever compare in speed to 16bit RAM.

Also, comparing an array test in Forth to TI Basic is not ideal, because Forth does not run the program from VDP like Basic does, and it does not store variables in VDP like Basic does.

After all in Basic, the tokens are in VDP, the Variables are in VDP, the DSR is in VDP and the program itself is in VDP.

Now XB can be like TI Basic or it can move the Token/Program to RAM in upper 24K.

Some variables like numeric constants may reside in RAM not VDP, but most variables like numbers and strings are all in VDP.

The biggest slowdown for me, being a GPL programmer for XB for 30 years, is VDP address >0820, where it pushes the crunched line.

If this was in RAM instead of VDP it would considerably speed up XB.

Reading and writing RAM does not require 3 or 4 extra instructions per access the way reading or writing VDP does.

In post #74 Rasmus asked if these were the facts:

  • Reading from GROM/VDP is almost as fast as reading from CPU memory (ROM or RAM) if the read address has already been set up
  • If you need to set up the read address first, reading from GROM/VDP is a lot slower than reading from CPU memory
  • It's about 5% faster to execute an XB program from 32K RAM instead of from VDP RAM only
  • An XB routine programmed in assembly is usually much faster than a similar routine written in GPL
  • Reading from 16-bit memory is exactly 4 CPU cycles faster than reading from 8-bit memory
  • Because console ROMs are 16-bit, they can be faster than similar assembly routines in cartridge ROM

Do you agree that this is accurate? If not then please indicate which of these points you feel is inaccurate.

 

 


Getting back to the speed tests, I would be very interested in seeing the results of this program running in XB from VDP and using the 32K expansion.

10 N=N+1

20 CALL KEY(0,K,S)::CALL GCHAR(1,1,G)::CALL SOUND(10,110,10)::CALL SCREEN(3)

30 GOTO 10

I don't trust any timing results I get from Classic99 on my computer. (A short test program looping 1000 times took 20 seconds, then 27, then 25 and after opening Classic99 2 more times it took 33 seconds)

I tried JS99er and for a 1 minute run found that the program ran just over 1% faster running from 32K vs VDP. But I don't know how accurate the timing is on JS99er.

Of course, you could use a for next loop if that worked better for you.

Edited by senior_falcon
