
How long would the following instructions take?


Willsy


  • 3 weeks later...

Inspired by this discussion on saving clock cycles I added some cycle counting support to xas99:

 

XAS99 CROSS-ASSEMBLER   VERSION 1.4.2
0001               * TIMING
0002               
0003 0000 02E0  18 START  LWPI >8300
     0002 8300 
0004               
0005 0004 C081  18        MOV  R1,R2
0006 0006 C811  46        MOV  *R1,@0
     0008 0000 
0007 000A CC60  50        MOV  @0,*R1+
     000C 0000 
0008 000E DC60  48        MOVB @0,*R1+
     0010 0000 
0009               
0010 0012 A820  54        A    @0,@2
     0014 0000 
     0016 0002 
0011 0018 78A1  54        SB   @0(R1),@2(R2)
     001A 0000 
     001C 0002 
0012 001E 0241  22        ANDI R1,0
     0020 0000 
0013 0022 2860  34        XOR  @0,R1
     0024 0000 
0014               
0015 0026 0431  46        BLWP *R1+
0016 0028 0420  54        BLWP @0
     002A 0000 
The fourth column of the list file now contains the number of cycles required for execution, including memory access. For some mnemonics such as MPY and SLA the worst case is shown.

 

From my understanding of the TMS9900 Data Manual and the TI 99 architecture

 

  • access to scratch pad RAM has no wait states,
  • all other memory accesses incur 4 wait states, and
  • before every write there is a read.

 

So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM. This is obviously overly pessimistic, and could be slightly improved for absolute addressing, but you still wouldn't be able to account for *R and *R+.

Now, with the above assumptions, the instructions

CLR *R0+
ANDI R0,IMM

have memory accesses W / Z, W / W / Z and W / Z, W / Z / -, respectively (instruction / read / write / increment; W = access with 4 wait states, Z = access with 0 wait states).
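
The accounting above can be sketched in a few lines of Python. This is only an illustration of the model, not the xas99 implementation: the function name `cycles` and the table `ADDR_CLOCKS` are made up for this example, and the base clocks (14 for MOV, A and ANDI, 10 for CLR) and address-modification clocks are the values as I read them from the data manual table, so please check them against your copy.

```python
# Sketch of the cycle model: clocks from the data manual table plus
# 4 wait states for every memory access outside scratch pad RAM.
WAIT_STATES = 4  # per slow ("W") access, as assumed in the post

# Address-modification clocks per operand (assumed from the data manual;
# "*R+" is the word variant, the byte variant would be 6).
ADDR_CLOCKS = {"R": 0, "*R": 4, "*R+": 8, "@": 8}

def cycles(base, operands, slow_accesses):
    """Total clocks for one instruction.

    base          -- base clocks of the mnemonic
    operands      -- addressing mode of each general operand
    slow_accesses -- number of "W" accesses (instruction fetch included)
    """
    modification = sum(ADDR_CLOCKS[mode] for mode in operands)
    return base + modification + WAIT_STATES * slow_accesses

# Counts for statements that appear in the list file above.
mov_r1_r2 = cycles(14, ("R", "R"), 1)    # only the fetch is slow
mov_ind_sym = cycles(14, ("*R", "@"), 5)  # MOV *R1,@0
a_sym_sym = cycles(14, ("@", "@"), 6)     # A @0,@2
andi_r1 = cycles(14, ("R",), 2)           # fetch + immediate are slow
clr_autoinc = cycles(10, ("*R+",), 3)     # CLR *R0+: W / Z, W / W / Z
```

With these inputs the sketch gives 18, 46, 54 and 22 for the four listed statements, which agrees with the fourth column of the list file. BLWP is deliberately left out, since it does not fit this simple scheme.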

 

It wasn't straightforward to separate W- and Z-type memory accesses based on the Data Manual alone, so I had to consult the E/A Manual as well. Still, I'm not sure if CRU addressing is comparable to memory access with respect to timing. Or I may have mixed up read and write arguments for some mnemonics.

 

Does anyone know of reference timings for a variety of statements that I could use for comparison?


Inspired by this discussion on saving clock cycles I added some cycle counting support to xas99:

Nice! I did the same in the Disassembler Editor of the TI-Disk Manager, which I recently released. ;-)

 

From my understanding of the TMS9900 Data Manual and the TI 99 architecture

  • access to scratch pad RAM has no wait states,
  • all other memory accesses incur 4 wait states, and
  • before every write there is a read.

Which Data Manual do you mean? (There is no shame in citing references.)

I think that is not quite right.

I already started a discussion of clock cycle calculation in this thread, so it is better to use the references named there: Chapter 4 of 9900-FamilySystemsDesign-1stEdition (page 94) explains how each instruction is executed in the CPU, including all memory accesses. There you can find all the necessary information, including that for CRU operations.

 

So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM.

The assembler should know which address the WP of the assembled program has. The AORG/RORG directives also give your algorithm additional information about the location where an instruction is executed. You can check the WP and PC addresses against the areas from >2000 to >7FFF and >A000 to >FFFF, where wait states occur. So, depending on this information, you can calculate the clock cycles more exactly!
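
The address check could look roughly like the following Python sketch. The function names are hypothetical, the two slow ranges are the ones named above, the figure of 4 wait states per access is taken from the first post, and treating every other address as zero wait states is a simplification made only for this example.

```python
# Ranges where wait states occur, as named in the post above.
SLOW_RANGES = ((0x2000, 0x7FFF), (0xA000, 0xFFFF))
WAIT_STATES = 4  # per access, assumption taken from the first post

def wait_states(addr):
    """Wait states for one access at the given CPU address."""
    for low, high in SLOW_RANGES:
        if low <= addr <= high:
            return WAIT_STATES
    return 0  # simplification: everything else counts as fast

def fetch_and_register_penalty(pc, wp, register_accesses):
    """Extra clocks for one instruction fetch plus its register accesses."""
    return wait_states(pc) + register_accesses * wait_states(wp)

# Code at >A000 with the workspace at >8300: only the fetch is slow.
penalty_fast_ws = fetch_and_register_penalty(0xA000, 0x8300, 3)
# Same code with the workspace at >2000: register accesses are slow too.
penalty_slow_ws = fetch_and_register_penalty(0xA000, 0x2000, 3)
```

An assembler would take `pc` from the current AORG/RORG location and `wp` from an LWPI operand or a hint, while indirect operands (*R, *R+) would still have to be guessed.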

Edited by HackMac

So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM.

 

Registers can be wherever the programmer puts them. “All other memory accesses” can certainly include scratchpad RAM.

This is obviously overly pessimistic, and could be slightly improved for absolute addressing, but you still wouldn't be able to account for *R and *R+.

 

See page 28 of the TMS9900 Data Manual.

 

...lee


  • access to scratch pad RAM has no wait states,
  • all other memory accesses incur 4 wait states, and
  • before every write there is a read.

 

System ROM (>0000->1FFF) also has zero wait states. Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)


Registers can be wherever the programmer puts them. “All other memory accesses” can certainly include scratchpad RAM.

 

 

Lee, I'm aware of that, that's why it's an assumption. ;-) But maybe it wasn't very clearly phrased: It's an assumption for my calculation, not about the way the TI works.
Now with that out of the way, I think it's a reasonable assumption for registers -- for other accesses not so much, which is why I called it "overly pessimistic".
Yet this can hardly be improved upon. The assembler could check absolute symbolic accesses (and I might add that feature), but there is no way to tell if indirect addressing refers to fast or slow RAM.

 

See page 28 of the TMS9900 Data Manual.

 

 

That very page is the basis of my implementation. But please note that the table lumps all memory accesses together and assumes the same number of wait states for each access. That's why I had to figure out the number of "W" and "Z" accesses (see my initial post) for each instruction. (And then the BLWP instruction requires even further tweaking as it doesn't fit the table A/B schema.)
But my revised tables might be wrong, which is why I was looking for a way to validate my results ...

System ROM (>0000->1FFF) also has zero wait states. Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)

 

Ah, thanks a lot, I didn't know either fact!

 

Thankfully zero wait-state ROM doesn't invalidate my assumptions (== "makes my computed result even less useful") too much, as you might not read from (or write into ;)) ROM all that much. (YMMV.)

 

As for the immediate instructions, it also doesn't matter, as register accesses don't count. So for LI R,IMM and others I'm counting only two slow memory accesses, one for reading the instruction and one for reading the value (again, assuming that the instruction is not in scratch pad RAM). But I made a mental note about the no-read.


Lee, I'm aware of that, that's why it's an assumption. ;-) But maybe it wasn't very clearly phrased: It's an assumption for my calculation, not about the way the TI works.

 

I understand that it is an assumption; but, I do not understand why that is useful. DSRLNK, for example, does not use scratchpad RAM for its registers. Another example is TI Forth and my fbForth: Though the Forth system registers are in scratchpad RAM, there is no room for another set of registers. Also, scratchpad RAM is used for the Forth inner interpreter's program code, which runs more than any other piece of code in the system and is there precisely because of the speed of scratchpad RAM.

 

...lee


I understand that it is an assumption; but, I do not understand why that is useful.

 

 

Ah, I see ... Well, I'm undecided about that myself.

 

The idea was to give assembly programmers a quick way to check how fast their critical code would run, especially when deciding between two alternatives. For this I would probably not report per-statement timings but aggregate blocks of code.
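
As a rough sketch of that kind of aggregation (not something xas99 does in this form), the per-statement counts from a list file could be summed per block and converted to time. The function name and the two example blocks are made up for illustration. The counts are taken from the list file in my first post, and the 3 MHz clock is the standard TI-99/4A CPU clock.

```python
CLOCK_MHZ = 3.0  # TI-99/4A CPU clock

def block_time_us(cycle_counts, clock_mhz=CLOCK_MHZ):
    """Execution time in microseconds for a straight-line block of code."""
    return sum(cycle_counts) / clock_mhz

# Two arbitrary blocks built from statements in the list file; they are
# not equivalent code, they only demonstrate the comparison.
block_a = [46, 50]  # MOV *R1,@0 ; MOV @0,*R1+
block_b = [18, 34]  # MOV R1,R2  ; XOR @0,R1

time_a = block_time_us(block_a)
time_b = block_time_us(block_b)
faster = "A" if time_a < time_b else "B"
```

Loops would need the block total multiplied by the iteration count, which the assembler cannot know in general, so this only works for comparing straight-line alternatives.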

 

Still, your objections remain largely valid. The PC could be inferred, and register location could be hinted at by directives or command-line parameters, but indirect memory addressing still remains a killer ...

 

[EDIT: Wrote WP, meant PC.]

Edited by ralphb

 

 

System ROM (>0000->1FFF) also has zero wait states. Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)

 

Would taking the contents of memory directly after the instruction to place into the register location not technically qualify as read-before-write? Or is there something in the processor that normally dictates a location to be written is read first with no deviation such as reading a different location? Or something else (I hate close-ended questions like that :)?)


Extended Basic really suffers from using Scratch Pad for Storage of Variables that rarely change.


[0208]               ***********************************************************
[0209]               *    VDP addresses
[0210] 02E2          NLNADD EQU  >02E2             New LiNe ADDress
[0211] 02FE          ENDSCR EQU  >02FE             END of SCReen address
[0212] 0371          LODFLG EQU  >0371             Auto-boot needed flag
[0213] 0372          START  EQU  >0372             Line to start execution at
[0214] 0376          SYMBOL EQU  >0376             Saved symbol table pointer
[0215] 0382          SPGMPT EQU  >0382             Saved PGMPTR for continue
[0216] 0384          SBUFLV EQU  >0384             Saved BUFLEV for contiue
[0217] 0386          SEXTRM EQU  >0386             Saved EXTRAM for continue
[0218]               * SAVEVP EQU  >0388           Saved VSPRT for continue
[0219]               * ERRLN  EQU  >038A           On-error line pointer
[0220] 038C          BUFSRT EQU  >038C             Edit recall start addr (VARW)
[0221] 038E          BUFEND EQU  >038E             Edit recall end addr (VARA)
[0222] 0392          TABSAV EQU  >0392             Saved main symbol table ponte
[0223] 0396          SLSUBP EQU  >0396             Saved LSUBP for continue
[0224] 0398          SFLAG  EQU  >0398             Saved on-warning/break bits
[0225] 039A          SSTEMP EQU  >039A             To save subprogram program ta
[0226] 039C          SSTMP2 EQU  >039C             Same as above. Used in SUBPRO
[0227] 039E          MRGPAB EQU  >039E             MERGEd temporary for pab ptr
[0228]               *----------------------------------------------------------
[0229]               * Added 6/8/81 for NOPSCAN feature
[0230] 03B7          PSCFG  EQU  >03B7
[0231]               *----------------------------------------------------------
[0232]               * RXB PATCH CODE SWAP CONFLG & >35D7 FOR cONSOLE MENU FLAG
[0233]               *    Flag 0:  99/4  console, 5/29/81
[0234]               *         1:  99/4A console
[0235] 03BB          CONFLG EQU  >03BB
[0236]               *----------------------------------------------------------
[0237]               * Temporary
[0238] 0374          NOTONE EQU  >0374             NO-TONE for SIZE in ACCEPT us
[0239]               *                              in FLMGRS (4 bytes used)
[0240] 0388          SAVEVP EQU  >0388
[0241] 038A          ERRLN  EQU  >038A
[0242] 03AC          ACCVRW EQU  >03AC             Temoporary used in ERRZZ, als
[0243]               *                              used in FLMGRS
[0244] 03B0          VALIDP EQU  >03B0             Use as two values passing fro
[0245] 03B2          VALIDL EQU  >03B2             VALIDATE code to READL1
[0246] 03BC          OLDTOP EQU  >03BC             Temporary used in ERRZZ, also
[0247] 0820          CRNBUF EQU  >0820             CRuNch BUFfer address
[0248] 08BE          CRNEND EQU  >08BE             CRuNch buffer END
[0249] 08C0          RECBUF EQU  >08C0             Edit RECall BUFfer
[0250] 0958          VRAMVS EQU  >0958             Default base of value stack
[0251] 0390          CNSTMP EQU  >0390             Use as temporary stored place
[0252] 03C0          VROAZ  EQU  >03C0             Temporary VDP Roll Out Are

Look at all the VDP addresses that are accessed instead most of the time; no wonder XB is so freaking slow!

 

And that is twice as bad in TI Basic.


Inspired by this discussion, last night I reviewed every opcode in Classic99 with a view to sorting out the memory accesses. I was pleased to note that most were correct, but I changed the cycle counting for memory accesses slightly and I think it's slightly more accurate now. (I did not review the addressing mode penalties yet, and I need to run some timing tests on real hardware for certain instructions to verify assumptions. That can't happen soon either, unfortunately.)

 

That said, I think I found some interesting details.

 

One surprise to me was that there are actually only four instructions that do an "unnecessary" read before write of the destination address (unnecessary in that they don't use the value they read): MOV, MOVB, SETO and CLR. MOVB is arguable since it needs to do the read in order to change just one byte, but since it often comes up that the 8-bit variants don't need to do this, I listed it. (Sadly, these are some of the most common opcodes ;) ). All other opcodes, such as A, need that destination word in order to do the math.

 

LI indeed does not do a read before write - it has three memory cycles. They are: read instruction, read immediate argument, write destination. When we talk about "read-before-write", we're talking about reading the destination address, so reading the immediate argument doesn't technically count.
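
To make the difference concrete, here is a small Python sketch that lists the memory cycles of LI next to those of a register-to-register MOV. It is not Classic99 code: the names are made up, the W/Z marking follows the assumption from earlier in the thread (instruction stream in slow memory, workspace in scratch pad RAM), and the base clocks of 12 for LI and 14 for MOV are as I read them from the data manual.

```python
WAIT_STATES = 4  # per slow access, assumption from earlier in the thread

# Each memory cycle is (description, kind): "W" = slow, "Z" = scratch pad.
LI_ACCESSES = [
    ("read instruction", "W"),
    ("read immediate argument", "W"),
    ("write destination register", "Z"),
]
MOV_RR_ACCESSES = [
    ("read instruction", "W"),
    ("read source register", "Z"),
    ("read destination register", "Z"),  # the "unnecessary" read
    ("write destination register", "Z"),
]

def total_clocks(base, accesses):
    """Base clocks plus wait states for every slow memory cycle."""
    slow = sum(1 for _, kind in accesses if kind == "W")
    return base + WAIT_STATES * slow

li_clocks = total_clocks(12, LI_ACCESSES)       # 12 + 2 * 4
mov_clocks = total_clocks(14, MOV_RR_ACCESSES)  # 14 + 1 * 4
```

Under these assumptions the extra read in MOV costs nothing in wait states as long as the workspace is in scratch pad RAM; it only starts to matter when the destination is in slow memory.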

 

I also did the research and worked out the "likely" candidate for the DIV algorithm, and coded it so cycle counts will be accurate. This should be provable with a few choice timing tests on hardware, too.


  • 8 years later...
On 6/25/2015 at 3:21 AM, HackMac said:

The TI has no clock that can measure the loop time. Or can it be done by the 9901?

I think, what I need is a logic analyzer. so I can count clock cycles between memory accesses. 😞

 

But if there is a shy guy here in the forum, who is under cover and has any idea, please be courageous and contribute your part. (I believe there are more people than Lee and Michael.)

I know this is a really old thread, but I found this file in a "test" folder while trying to reorganize my stuff.

I followed the link in my file. I don't think I published this, but I created the file in March 2020.

I made a test program to try to answer this question with the 9901 timer, but I probably was not very confident in my system or my ability back then.

 

So here is a result measured on my real TI-99 using the 9901 timer.

The overhead of the Forth interpreter is subtracted from the result. (~106 uS, because the 9901 code is executed twice to measure the loop.)

I run the loop 100 times to reduce jitter.  There are still some random fluctuations in the loop even though interrupts are off.  (hardware?) 

 

On Classic99 I see both instructions showing 14 clocks each, so T = 28 * 0.333 = 9.324 uS per loop iteration.

 

My test is showing 8.3 to 8.95 uS so pretty close.
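
For anyone who wants to redo the arithmetic, here is a small Python sketch of the numbers involved. It assumes the 3 MHz CPU clock and a 9901 timer that decrements once every 64 clocks (which gives the 21.3 uS per tick used in the Forth code); the variable and function names are just for this example.

```python
CLOCK_MHZ = 3.0      # CPU clock
TIMER_DIVIDER = 64   # 9901 timer decrements every 64 clocks

def clocks_to_us(clocks):
    """Convert CPU clocks to microseconds."""
    return clocks / CLOCK_MHZ

tick_us = clocks_to_us(TIMER_DIVIDER)   # length of one 9901 tick
loop_us = clocks_to_us(14 + 14)         # DEC R2 + JNE, 14 clocks each

iterations = 100
ticks_expected = iterations * loop_us / tick_us  # timer ticks for the run
resolution_us = tick_us / iterations             # per-iteration resolution
```

With exact thirds the loop comes out at about 9.33 uS rather than 9.324, and 100 iterations should take about 43.75 timer ticks. One tick more or less shifts the per-iteration result by roughly 0.21 uS, which may account for part of the spread in the measurements.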

 

So there's another answer @Willsy and only 9 years late! 

\ Willsy's question 2015 

\ https://atariage.com/forums/topic/183479-how-long-would-the-following-instructions-take/
\
\ LOOP DEC R2
\   JNE LOOP
\
\    In uS? Thanks my lovelies :-)?
\ ==================================================

\ Test in Camel Forth /TTY using 9901 timer

NEEDS DEC,  FROM DSK1.ASM9900
NEEDS $:    FROM DSK1.ASMLABELS
NEEDS MARKER FROM DSK1.MARKER 

MARKER /WILLSY 

DECIMAL
 \ camel99 USER variables return addresses in current workspace and beyond
4 USER 'R2      \ returns address of register 2 in current workspace 

\ convert 9901 ticks to micro-seconds (21.3uS/tick)
: >uS  ( ticks -- uS)  213 10 */ ;

CODE LIMI0    0 LIMI,  NEXT, ENDCODE  \ disable interrupts

\ read 9901 twice to get Forth interpreter overhead (~106 uS)
: OVERHEAD ( -- n) LIMI0  TMR@ TMR@ -  ;

CODE WILLSY
1 $:        R2  DEC,
            1 $ JNE,   
    
            NEXT,      \ return Forth
            ENDCODE

: TEST  ( -- n)
   100 'R2 !                    \ load R2 from Forth for 100 iterations
   LIMI0                        \ STOP interrupts while we time
   TMR@ WILLSY TMR@ - >uS       \ measure execution time.
   OVERHEAD 2 * >uS -           \ subtract overhead x 2
   CR DUP . ." uS per 100 Iterations"  
;

 

 

 

 

 

 

 

 

 
