+mizapf Posted June 25, 2015

No, use a stopwatch. That is why I suggested running the loop millions of times. You can safely tell apart times that differ by more than 2 seconds.
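The stopwatch approach works because the per-iteration cost falls out of a simple division once the iteration count is large. A minimal sketch, assuming the 3 MHz CPU clock (0.333 µs per cycle) quoted elsewhere in this thread; the function names and the 28-cycle loop cost are illustrative and not taken from mizapf's post:

```python
CLOCK_HZ = 3_000_000  # assumed TI-99/4A CPU clock (0.333 us per cycle)

def loop_seconds(cycles, iterations, clock_hz=CLOCK_HZ):
    """Predicted wall-clock time for a loop costing `cycles` per iteration."""
    return cycles * iterations / clock_hz

def cycles_per_iteration(stopwatch_seconds, iterations, clock_hz=CLOCK_HZ):
    """Cycle cost per iteration recovered from a stopwatch reading."""
    return stopwatch_seconds * clock_hz / iterations

if __name__ == "__main__":
    # A hypothetical 28-cycle loop run one million times.
    print(f"{loop_seconds(28, 1_000_000):.2f} s")
    # Going the other way: a 9.0 s stopwatch reading for the same count.
    print(f"{cycles_per_iteration(9.0, 1_000_000):.1f} cycles")
```

With a million iterations, a difference of a single cycle per iteration shifts the total by about a third of a second, so stopwatch resolution of a couple of seconds only separates loops that differ by several cycles.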
ralphb Posted July 12, 2015

Inspired by this discussion on saving clock cycles I added some cycle counting support to xas99:

    XAS99 CROSS-ASSEMBLER   VERSION 1.4.2
    0001               * TIMING
    0002
    0003 0000 02E0 18  START  LWPI >8300
         0002 8300
    0004
    0005 0004 C081 18         MOV  R1,R2
    0006 0006 C811 46         MOV  *R1,@0
         0008 0000
    0007 000A CC60 50         MOV  @0,*R1+
         000C 0000
    0008 000E DC60 48         MOVB @0,*R1+
         0010 0000
    0009
    0010 0012 A820 54         A    @0,@2
         0014 0000
         0016 0002
    0011 0018 78A1 54         SB   @0(R1),@2(R2)
         001A 0000
         001C 0002
    0012 001E 0241 22         ANDI R1,0
         0020 0000
    0013 0022 2860 34         XOR  @0,R1
         0024 0000
    0014
    0015 0026 0431 46         BLWP *R1+
    0016 0028 0420 54         BLWP @0
         002A 0000

The fourth column of the list file now contains the number of cycles required for execution, including memory access. For some mnemonics such as MPY and SLA the worst case is shown.

From my understanding of the TMS9900 Data Manual and the TI 99 architecture, access to scratch pad RAM has no wait states, all other memory accesses incur 4 wait states, and before every write there is a read.

So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM. This is obviously overly pessimistic, and could be slightly improved for absolute addressing, but you still wouldn't be able to account for *R and *R+.

Now with the above assumptions, the instructions

    CLR  *R0+
    ANDI R0,IMM

have memory accesses W / Z W / W / Z and W / Z, W / Z / -, respectively (instruction/read/write/increment, W = 4 wait states, Z = 0 wait states).

It wasn't straightforward to separate W- and Z-type memory accesses based on the Data Manual alone, so I had to consult the E/A Manual as well. Still, I'm not sure if CRU addressing is comparable to memory access with respect to timing. Or I may have mixed up read and write arguments for some mnemonic.

Does anyone know of reference timings for a variety of statements that I could use for comparison?
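The counting scheme described in this post can be sketched as a base clock count plus 4 wait states for every access that is not to scratch pad RAM. This is only an illustration, not xas99's actual code: the base clock figures are my reading of the TMS9900 Data Manual tables (14 clocks for MOV, A and XOR; +4 for *R; +8 for symbolic addressing), and the names `cycles` and `EXAMPLES` are made up here.

```python
WAIT_STATES = 4  # assumed wait states per slow (non-scratchpad) access

def cycles(base_clocks, slow_accesses, wait_states=WAIT_STATES):
    """Base clock count plus wait states for each slow memory access."""
    return base_clocks + wait_states * slow_accesses

# (base clocks including address modification, number of slow accesses),
# with registers in scratch pad RAM and everything else in slow memory.
EXAMPLES = {
    "MOV R1,R2":  (14, 1),          # only the instruction fetch is slow
    "MOV *R1,@0": (14 + 4 + 8, 5),  # fetch, *R1 read, address word, @0 read, @0 write
    "A @0,@2":    (14 + 8 + 8, 6),  # fetch, two address words, two reads, one write
    "XOR @0,R1":  (14 + 8, 3),      # fetch, address word, @0 read
}

if __name__ == "__main__":
    for text, (base, slow) in EXAMPLES.items():
        print(f"{text:12} {cycles(base, slow)}")
```

Under these assumptions the four statements come out at 18, 46, 54 and 34 cycles, which agrees with the fourth column of the listing above.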
HackMac Posted July 13, 2015 (edited July 13, 2015 by HackMac)

ralphb said: Inspired by this discussion on saving clock cycles I added some cycle counting support to xas99:

Nice! I did the same in the Disassembler Editor of the TI-Disk Manager, which I recently released.

ralphb said: From my understanding of the TMS9900 Data Manual and the TI 99 architecture access to scratch pad RAM has no wait states, all other memory accesses incur 4 wait states, and before every write there is a read.

Which Data Manual do you mean? (There is no shame in citing a reference.) I think that is not quite right. I already started a discussion of clock cycle calculation in this thread, so it's better to use the references named there: Chapter 4 of 9900-FamilySystemsDesign-1stEdition (page 94) explains how each instruction is executed in the CPU, with all memory accesses. There you can find all the necessary information, including that for CRU operations.

ralphb said: So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM.

The assembler should know which WP address the assembled program uses. The AORG/RORG directives also give your algorithm additional information about the location where an instruction is executed. You can check whether the WP and PC addresses fall in the areas from >2000 to >7FFF and >A000 to >FFFF, where wait states appear. So, depending on this information, you can do a more exact calculation of the clock cycles!
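The address check suggested here could look roughly like the following. It is a sketch only: the two ranges are the ones HackMac lists, while the function names and the idea of passing PC and WP as plain integers are assumptions for illustration.

```python
def has_wait_states(address):
    """True if the address lies in one of the ranges listed as slow."""
    return 0x2000 <= address <= 0x7FFF or 0xA000 <= address <= 0xFFFF

def slow_fetch_and_registers(pc, wp):
    """Report whether instruction fetches (PC) and register accesses (WP)
    would incur wait states, each judged by its own address."""
    return has_wait_states(pc), has_wait_states(wp)

if __name__ == "__main__":
    # Code assembled at >A000 with its workspace at >8300.
    print(slow_fetch_and_registers(0xA000, 0x8300))
```

An assembler could take `pc` from the current AORG location and `wp` from the operand of the most recent LWPI, falling back to a pessimistic default when neither is known.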
+Lee Stewart Posted July 13, 2015

ralphb said: So for my calculation I assume that registers are always located in scratch pad RAM, whereas all other memory accesses are in slow RAM.

Registers can be wherever the programmer puts them. “All other memory accesses” can certainly include scratchpad RAM.

ralphb said: This is obviously overly pessimistic, and could be slightly improved for absolute addressing, but you still wouldn't be able to account for *R and *R+.

See page 28 of the TMS9900 Data Manual.

...lee
Tursi Posted July 13, 2015

ralphb said: access to scratch pad RAM has no wait states, all other memory accesses incur 4 wait states, and before every write there is a read.

System ROM (>0000->1FFF) also has zero wait states.

Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)
ralphb Posted July 13, 2015

Lee Stewart said: Registers can be wherever the programmer puts them. “All other memory accesses” can certainly include scratchpad RAM.

Lee, I'm aware of that, that's why it's an assumption. But maybe it wasn't very clearly phrased: It's an assumption for my calculation, not about the way the TI works.

With that out of the way, I think it's a reasonable assumption for registers -- for other accesses not so much, which is why I called it "overly pessimistic". Yet this can hardly be improved upon. The assembler could check absolute symbolic accesses (and I might add that feature), but there is no way to tell if indirect addressing refers to fast or slow RAM.

Lee Stewart said: See page 28 of the TMS9900 Data Manual.

That very page is the basis of my implementation. But please note that the table lumps all memory accesses together and assumes the same number of wait states for each access. That's why I had to figure out the number of "W" and "Z" accesses (see my initial post) for each instruction. (And then the BLWP instruction requires even further tweaking as it doesn't fit the table A/B schema.)

But my revised tables might be wrong, which is why I was looking for a way to validate my results ...
ralphb Posted July 13, 2015

Tursi said: System ROM (>0000->1FFF) also has zero wait states. Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)

Ah, thanks a lot, I didn't know either fact!

Thankfully zero wait-state ROM doesn't invalidate my assumptions (== "makes my computed result even less useful") too much, as you might not read from (or write into) ROM all that much. (YMMV.)

As for the immediate instructions, it also doesn't matter, as register accesses don't count. So for LI R,IMM and others I'm counting only two slow memory accesses, one for reading the instruction and one for reading the value (again, assuming that the instruction is not in scratch pad RAM). But I made a mental note about the no-read.
+Lee Stewart Posted July 13, 2015

ralphb said: Lee, I'm aware of that, that's why it's an assumption. But maybe it wasn't very clearly phrased: It's an assumption for my calculation, not about the way the TI works.

I understand that it is an assumption, but I do not understand why that is useful. DSRLNK, for example, does not use scratchpad RAM for its registers. Another example is TI Forth and my fbForth: Though the Forth system registers are in scratchpad RAM, there is no room for another set of registers. Also, scratchpad RAM is used for the Forth inner interpreter's program code, which runs more than any other piece of code in the system and is there precisely because of the speed of scratchpad RAM.

...lee
ralphb Posted July 13, 2015 (edited July 14, 2015 by ralphb)

Lee Stewart said: I understand that it is an assumption; but, I do not understand why that is useful.

Ah, I see ... Well, I'm undecided about that myself.

The idea was to give assembly programmers a quick way to check how fast their critical code would run, especially when deciding between two alternatives. For this I would probably not report per-statement timings but aggregate blocks of code.

Still, your objections remain largely valid. The PC could be inferred, and register location could be hinted at by directives or command-line parameters, but indirect memory addressing still remains a killer ...

[EDIT: Wrote WP, meant PC.]
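One way to aggregate a block while staying honest about indirect addressing is to report a range instead of a single number. The sketch below is purely illustrative and not something ralphb describes implementing: `Timing`, `statement_timing` and `block_timing` are hypothetical names, and the 4 wait states per slow access are the assumption used earlier in the thread.

```python
from dataclasses import dataclass

@dataclass
class Timing:
    best: int   # cycles if every unknown access hits fast RAM
    worst: int  # cycles if every unknown access hits slow RAM

def statement_timing(base_clocks, known_slow, unknown, wait_states=4):
    """Cycle range for one statement.

    known_slow: accesses known to be slow (for example the instruction fetch)
    unknown:    accesses through *R or *R+ whose target cannot be inferred
    """
    fixed = base_clocks + wait_states * known_slow
    return Timing(fixed, fixed + wait_states * unknown)

def block_timing(statements):
    """Sum the best and worst cases over a block of statements."""
    return Timing(sum(s.best for s in statements),
                  sum(s.worst for s in statements))

if __name__ == "__main__":
    block = [
        statement_timing(14, 1, 0),  # register-only statement
        statement_timing(26, 4, 1),  # one indirect access of unknown speed
    ]
    print(block_timing(block))
```

Comparing two alternative code blocks then means comparing two ranges; if they do not overlap, the decision holds regardless of where the indirect accesses land.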
+OLD CS1 Posted July 13, 2015

Tursi said: System ROM (>0000->1FFF) also has zero wait states. Also, certain instructions apparently do NOT read before write (LI has been quoted as such, but I haven't got that into Classic99.)

Would taking the contents of memory directly after the instruction to place into the register location not technically qualify as read-before-write? Or is there something in the processor that normally dictates that a location to be written is read first, with no deviation such as reading a different location? Or something else? (I hate close-ended questions like that.)
RXB Posted July 14, 2015

Extended Basic really suffers from using Scratch Pad for Storage of Variables that rarely change.

         ***********************************************************
         * VDP addresses
    02E2 NLNADD EQU  >02E2             New LiNe ADDress
    02FE ENDSCR EQU  >02FE             END of SCReen address
    0371 LODFLG EQU  >0371             Auto-boot needed flag
    0372 START  EQU  >0372             Line to start execution at
    0376 SYMBOL EQU  >0376             Saved symbol table pointer
    0382 SPGMPT EQU  >0382             Saved PGMPTR for continue
    0384 SBUFLV EQU  >0384             Saved BUFLEV for contiue
    0386 SEXTRM EQU  >0386             Saved EXTRAM for continue
         * SAVEVP EQU  >0388           Saved VSPRT for continue
         * ERRLN  EQU  >038A           On-error line pointer
    038C BUFSRT EQU  >038C             Edit recall start addr (VARW)
    038E BUFEND EQU  >038E             Edit recall end addr (VARA)
    0392 TABSAV EQU  >0392             Saved main symbol table ponte
    0396 SLSUBP EQU  >0396             Saved LSUBP for continue
    0398 SFLAG  EQU  >0398             Saved on-warning/break bits
    039A SSTEMP EQU  >039A             To save subprogram program ta
    039C SSTMP2 EQU  >039C             Same as above. Used in SUBPRO
    039E MRGPAB EQU  >039E             MERGEd temporary for pab ptr
         *----------------------------------------------------------
         * Added 6/8/81 for NOPSCAN feature
    03B7 PSCFG  EQU  >03B7
         *----------------------------------------------------------
         * RXB PATCH CODE SWAP CONFLG & >35D7 FOR cONSOLE MENU FLAG
         * Flag 0: 99/4 console, 5/29/81
         *      1: 99/4A console
    03BB CONFLG EQU  >03BB
         *----------------------------------------------------------
         * Temporary
    0374 NOTONE EQU  >0374             NO-TONE for SIZE in ACCEPT us
         *                             in FLMGRS (4 bytes used)
    0388 SAVEVP EQU  >0388
    038A ERRLN  EQU  >038A
    03AC ACCVRW EQU  >03AC             Temoporary used in ERRZZ, als
         *                             used in FLMGRS
    03B0 VALIDP EQU  >03B0             Use as two values passing fro
    03B2 VALIDL EQU  >03B2             VALIDATE code to READL1
    03BC OLDTOP EQU  >03BC             Temporary used in ERRZZ, also
    0820 CRNBUF EQU  >0820             CRuNch BUFfer address
    08BE CRNEND EQU  >08BE             CRuNch buffer END
    08C0 RECBUF EQU  >08C0             Edit RECall BUFfer
    0958 VRAMVS EQU  >0958             Default base of value stack
    0390 CNSTMP EQU  >0390             Use as temporary stored place
    03C0 VROAZ  EQU  >03C0             Temporary VDP Roll Out Are

Look at all the VDP addresses that are used instead for access most of the time; no wonder XB is so freaking slow!

And that is twice as bad in TI Basic.
Tursi Posted July 14, 2015

Inspired by this discussion, last night I reviewed every opcode in Classic99 with a view to sorting out the memory accesses. I was pleased to note that most were correct, but I changed the cycle counting for memory accesses slightly and I think it's slightly more accurate now. (I have not reviewed the addressing mode penalties yet, and I need to run some timing tests on real hardware for certain instructions to verify assumptions. That can't happen soon either, unfortunately.)

That said, I think I found some interesting details. One surprise to me was that there are actually only four instructions that do an "unnecessary" read before write of the destination address (in that they didn't need to do the read, as they don't use the value): MOV, MOVB, SETO and CLR. MOVB is arguable since it needs to do the read in order to change just one byte, but since it often comes up that the 8-bit variants don't need to do this, I listed it. (They are, sadly, some of the most common opcodes, though.) All other opcodes, such as A, need that destination word in order to do the math.

LI indeed does not do a read before write: it has three memory cycles. They are: read instruction, read immediate argument, write destination. When we talk about "read-before-write", we're talking about reading the destination address, so the immediate read doesn't technically count.

I also did the research and worked out the "likely" candidate for the DIV algorithm, and coded it so cycle counts will be accurate. This should be provable with a few choice timing tests on hardware, too.
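For a cycle counter, these findings reduce to a small lookup. The sketch below is not Classic99 code: the four mnemonics and LI's three memory cycles are as listed in this post, the 4 wait states per slow access are the assumption from earlier in the thread, and the names are invented for illustration.

```python
# Instructions that read the destination although they never use the value.
UNNEEDED_DEST_READ = {"MOV", "MOVB", "SETO", "CLR"}

# LI performs no destination read at all.
LI_MEMORY_CYCLES = ("read instruction", "read immediate argument",
                    "write destination")

def wasted_wait_states(mnemonic, dest_is_slow, wait_states=4):
    """Wait states spent on a destination read whose value is discarded."""
    if mnemonic.upper() in UNNEEDED_DEST_READ and dest_is_slow:
        return wait_states
    return 0

if __name__ == "__main__":
    for op in ("MOV", "CLR", "A", "LI"):
        print(op, wasted_wait_states(op, dest_is_slow=True))
```

The penalty only appears when the destination is in slow memory; with the destination in scratch pad RAM the extra read costs no wait states.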
+TheBF Posted March 8

On 6/25/2015 at 3:21 AM, HackMac said: The TI has no clock that can measure the loop time. Or can it be done by the 9901? I think what I need is a logic analyzer, so I can count clock cycles between memory accesses. 😞 But if there is a shy guy here in the forum, who is under cover and has any idea, please be courageous and contribute your part. (I believe there are more people than Lee and Michael.)

I know this is a really old thread but I found this file in a "test" folder while trying to reorganize my stuff. I followed the link in my file. I don't think I published this but I created the file in March 2020.

I made a test program to try to measure this question with the 9901 timer but I probably was not very confident in my system or my ability back then.

So here is a result measured on my real TI-99 using the 9901 timer. There is compensation for the Forth interpreter subtracted from the result (~106 uS, because the 9901 code is executed twice to measure the loop). I run the loop 100 times to reduce jitter. There are still some random fluctuations in the loop even though interrupts are off. (Hardware?)

On Classic 99 I see both instructions showing (14) clocks, so T = 28 * 0.333 = 9.324 uS. My test is showing 8.3 to 8.95 uS, so pretty close.

So there's another answer @Willsy and only 9 years late!

    \ Willsy's question 2015
    \ https://atariage.com/forums/topic/183479-how-long-would-the-following-instructions-take/
    \
    \ LOOP DEC R2
    \      JNE LOOP
    \
    \ In uS? Thanks my lovelies :-)
    \ ==================================================
    \ Test in Camel Forth /TTY using 9901 timer

    NEEDS DEC,   FROM DSK1.ASM9900
    NEEDS $:     FROM DSK1.ASMLABELS
    NEEDS MARKER FROM DSK1.MARKER

    MARKER /WILLSY

    DECIMAL
    \ camel99 USER variables return addresses in current workspace and beyond
    4 USER 'R2   \ returns address of register 2 in current workspace

    \ convert 9901 ticks to micro-seconds (21.3uS/tick)
    : >uS ( ticks -- uS) 213 10 */ ;

    CODE LIMI0   0 LIMI,  NEXT,  ENDCODE \ disable interrupts

    \ read 9901 twice to get Forth interpreter overhead (~106 uS)
    : OVERHEAD ( -- n) LIMI0 TMR@ TMR@ - ;

    CODE WILLSY
    1 $:  R2 DEC,
          1 $ JNE,
          NEXT,            \ return to Forth
    ENDCODE

    : TEST ( -- n)
       100 'R2 !                \ load R2 from Forth for 100 iterations
       LIMI0                    \ STOP interrupts while we time
       TMR@ WILLSY TMR@ - >uS   \ measure execution time.
       OVERHEAD 2 * >uS -       \ subtract overhead x 2
       CR DUP . ." uS per 100 Iterations" ;

Attachment: COM1_19200bps - TI-99 VT100 VT 2024-03-08 16-16-27.mp4
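The arithmetic behind this measurement can be checked off the machine. A minimal sketch in Python (the Forth above is what actually ran on hardware): `ticks_to_us` mirrors the `213 10 */` scaling of the word `>uS` for non-negative tick counts, and `predicted_us` reproduces the 28 × 0.333 estimate; both function names are made up for illustration.

```python
TICK_US_X10 = 213  # 21.3 us per 9901 timer tick, scaled by 10 as in >uS

def ticks_to_us(ticks):
    """Integer equivalent of the Forth phrase `213 10 */` for ticks >= 0."""
    return ticks * TICK_US_X10 // 10

def predicted_us(clocks, cycle_us=0.333):
    """Execution time predicted from a clock count at 0.333 us per cycle."""
    return clocks * cycle_us

if __name__ == "__main__":
    print(ticks_to_us(5))              # a 5-tick reading
    print(round(predicted_us(28), 3))  # DEC + JNE at 14 clocks each
```

A 5-tick difference converts to 106 µs, in line with the roughly 106 µs overhead mentioned in the post. With only 21.3 µs of resolution per tick, a single pass through the loop cannot be resolved, which is one reason to time many iterations.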