Willsy Posted October 8, 2010

    ; wait 12uS - see editor assembler page 349, paragraph 5.
    nop ; 2 bytes
    nop ; 2 bytes
    nop ; 2 bytes
    rt  ; 2 bytes

I need the same delay time (or a little bit longer...) but in fewer instructions (I need to save 2 bytes). Can it be done? BTW: This is running in 16-bit scratch-pad RAM.

Would something like this work?

    ; wait 12uS - see editor assembler page 349, paragraph 5.
    mov *r0,*r0 ; 2 bytes
    rt          ; 2 bytes

The second bit of code is non-destructive, and 50% smaller. But how long does it take to execute? Classic99 indicates 22 cycles, but what is that in actual time? (And how do you work it out? I'll write it down this time, I promise.)

    mov r0,r0 ; 2 bytes

The above code (spot the subtle difference) requires 14 cycles according to Classic99 - how long is that?

Thanks
matthew180 Posted October 9, 2010

1 cycle in the 99/4A is about 333ns (nanoseconds), which is the period for a 3MHz clock. There are 1000ns in 1us (microsecond). So, 14 cycles is 333ns * 14 = 4.662us.

*NOTE* This all assumes scratch pad no wait-state RAM, as you indicated. As soon as you step into any other RAM, you have to add wait states to these timings.

The MOV instruction without any modifiers (symbolic, indirect, indexed, and autoincrement) is 14 clock cycles. So:

    MOV R0,R0 is 4.662us.

Adding indirect addressing to either parameter will add 4 clock cycles and 1 memory access.

    MOV *R0,R0  - 18 clock cycles = 333ns * 18 = 5.994us
    MOV *R0,*R0 - 22 clock cycles = 333ns * 22 = 7.326us

Autoincrement adds an additional 4 clocks on top of the indirect addressing:

    MOV *R0+,R0   - 22 clock cycles = 333ns * 22 = 7.326us
    MOV *R0+,*R0  - 26 clock cycles = 333ns * 26 = 8.658us
    MOV *R0+,*R0+ - 30 clock cycles = 333ns * 30 = 9.990us

What you really need though is 36 clock cycles to get 12us (12us / 333ns = 36.036). Use a shift instruction, since they take 12 clocks + 2C (C == the count) clocks. That gives you some fine grain control, and the count is stored in the instruction so it is still only 2 bytes. For example:

    SRC R1,1  - 12 clocks + 2 * 1  = 14 clocks = 4.662us
    SRC R1,2  - 12 clocks + 2 * 2  = 16 clocks = 5.328us
    SRC R1,3  - 12 clocks + 2 * 3  = 18 clocks = 5.994us
    SRC R1,4  - 12 clocks + 2 * 4  = 20 clocks = 6.660us
    ...
    SRC R1,12 - 12 clocks + 2 * 12 = 36 clocks = 11.988us
    ...
    SRC R1,15 - 12 clocks + 2 * 15 = 42 clocks = 13.986us

Just keep in mind to not use a count of zero! A zero count means "use bits 12 through 15 of R0 for the count". Counts can only be 1 to 15. A count of 16 == 0. If you use a zero count, and bits 12 through 15 of R0 are also zero, the shift executes 16 times and takes 52 clocks.

Anyway, just stick with 1 to 15 and you have a delay between 4.662us and 13.986us.

Matthew

Edited October 13, 2010 by matthew180
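As a rough way to reproduce the arithmetic above, here is a minimal Python sketch. It assumes the rounded 333ns cycle time and the 12 + 2C shift timing quoted in the post, and it only applies to code and workspace in zero wait-state scratch pad. The function names are made up for illustration and do not come from any TI tool.

```python
CYCLE_NS = 333  # rounded clock period quoted in the post (about 3MHz)


def cycles_to_us(cycles):
    """Convert a clock-cycle count to microseconds at 333ns per cycle."""
    return cycles * CYCLE_NS / 1000


def src_clocks(count):
    """Clocks for a shift with an explicit count of 1..15 (12 + 2C)."""
    if not 1 <= count <= 15:
        raise ValueError("explicit shift counts are 1..15")
    return 12 + 2 * count


def smallest_count_for(delay_us):
    """Smallest explicit count whose delay is at least delay_us, or None."""
    for count in range(1, 16):
        if cycles_to_us(src_clocks(count)) >= delay_us:
            return count
    return None


if __name__ == "__main__":
    for count in (1, 4, 12, 15):
        clocks = src_clocks(count)
        print(f"SRC R1,{count}: {clocks} clocks = {cycles_to_us(clocks):.3f}us")
```

With the rounded 333ns figure a count of 12 comes out at 11.988us, a hair under 12us, so `smallest_count_for(12)` picks 13. Whether that difference matters depends on how exact the real clock period is, which the thread only gives as "about 333ns".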
Willsy Posted October 9, 2010

Great stuff - thanks for taking the time, Matthew, that's a great help!

Mark
sometimes99er Posted October 9, 2010

Willsy wrote:
> mov *r0,*r0 ; 2 bytes
> The second bit of code is non-destructive, and 50% smaller. But how long does it take to execute? Classic99 indicates 22 cycles, but what is that in actual time? (And how do you work it out - I'll write it down this time, I promise

Isn't it potentially dangerous? Like reading and writing memory-mapped devices.
Willsy Posted October 9, 2010

sometimes99er wrote:
> Isn't it potentially dangerous ? - Like reading and writing memory-mapped devices.

It could be, but I'm in control of the 'code path' anyway, and I know at that point that R0 isn't pointing to anything dangerous!
Willsy Posted October 9, 2010

Matthew

Could I impose on you to time these instructions for me (the ones indicated)? I would do it, but I don't have the data book, and I don't know of an online source where this information is available (in fact, that would make a GREAT addition to the programming resources thread here on Atariage!)

    ; convert the word to nybbles and send to the speech synth...
           li   r2,4         ; 4 nybbles to load
    loadlp src  r0,4         ; start with least significant nybble
           mov  r0,r1        ; copy it
           src  r1,4         ; get target nybble into correct position
           andi r1,>0f00     ; mask out the nybble of interest
           ori  r1,>4000     ; put in 4x00 format for speech synth
           movb r1,@spchwt   ; send it to the speech synth
    -->    dec  r2           ; finished?
    -->    jne  loadlp       ; do next nybble if not
    -->    li   r1,>4000     ; load 'speak from rom' opcode to speech synth
           movb r1,@spchwt   ; send it to the speech synth... synth is now talking
    romspx rt                ; return from interrupt

I've been reading the speech synth section in the Editor Assembler manual. That section is rather poorly written IMO and you have to read it carefully. According to section 22.1.1, page 349:

The delay time from loading an address until the next command is 42 microseconds.

Fine, I can live with that. No problem. However, on the last iteration of the loop above, just after the last address byte is written out to SPCHWT, the instructions indicated can actually be considered as part of the 42 uS delay, since they have to be decoded and executed. So if they add up to 42uS or more, I'm all set and I don't need to worry about it any more. At worst I may need to add a NOP or a MOV R0,R0 or something to spin the wheels a bit longer.

Seems pedantic? Well yes, but I don't like redundant code, so a delay loop, if not required, is just silly, and I'm at the point where bytes matter!

BTW: These instructions are executing from 8-bit CPU ROM (cartridge space).

Thanks again,

Mark
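To see which bytes the loop in this post ends up writing to SPCHWT, here is a small Python model of it. This is only a sketch: it mirrors the listed instructions step by step (SRC as a 16-bit rotate right, ANDI, ORI, and MOVB sending the high byte), and the function names are invented for illustration.

```python
def rotate_right_16(value, count):
    """Model SRC: rotate a 16-bit word right by count bits."""
    count %= 16
    value &= 0xFFFF
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


def load_address_bytes(address):
    """Return the bytes the listing would send to SPCHWT for one address."""
    r0 = address & 0xFFFF
    sent = []
    for _ in range(4):                     # li r2,4 / dec r2 / jne loadlp
        r0 = rotate_right_16(r0, 4)        # src r0,4
        r1 = rotate_right_16(r0, 4)        # mov r0,r1 / src r1,4
        r1 = (r1 & 0x0F00) | 0x4000        # andi r1,>0f00 / ori r1,>4000
        sent.append(r1 >> 8)               # movb sends the high byte
    sent.append(0x4000 >> 8)               # final li r1,>4000 / movb
    return sent


if __name__ == "__main__":
    print([f">{b:02X}" for b in load_address_bytes(0x1234)])
```

For the address >1234 the model sends >44, >43, >42, >41 and then >40, which matches the "start with least significant nybble" comment in the listing.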
+retroclouds Posted October 9, 2010

Willsy wrote:
> I would do it, but I don't have the data book, and I don't know of an online source where this information is available (in fact, that would make a GREAT addition to the programming resources thread here on Atariage!)

Mark, did you look at section 3.6 (TMS9900 INSTRUCTION EXECUTION TIMES), page 28 of the TMS9900 Microprocessor data manual? I know it's a bit brief, but it might be what you are looking for.

I have the "9900 Family Systems Design and Data Book" at home. That's a good book that also goes into detail on instruction timing. Unfortunately I don't have this available as PDF. If someone has a PDF copy, please send it to me and I'll upload it to the Development Resources thread.
Willsy Posted October 9, 2010

Thanks, I've downloaded the PDF. The PDF will help to give the instruction timings for the 9900, but it doesn't help in the 99/4a environment, especially when you have wait-state/non-wait-state RAM & ROM!
Willsy Posted October 9, 2010

According to the data sheet, that 3 instruction phrase gives (where WS=wait-states=0):

    DEC R2     = 10 + (3*WS) = 13
    JNE LOADLP =  8 + (1*WS) =  9
    LI R1,4000 = 12 + (3*WS) = 15
                               --
                               37 cycles

In 0 wait-state memory I think that would be:

    37 cycles @ 333ns per cycle = 12321ns / 1000 = 12.321us

Is that correct? Or is it 30 cycles in 0 wait-state memory? (because WS=0 and x*0 is always 0?)

How does it work for 8-bit memory? Is it 666ns/cycle?

Edited October 9, 2010 by Willsy
matthew180 Posted October 9, 2010

The wait state generator adds 4 wait states to the ~2 clock cycles that make up a memory operation. So you use 4 as the wait state variable in the equation from the data manual. A wait state basically suspends the operation for 1 cycle, i.e. 333ns. So the 4 wait states add about 1.3us to *each* memory operation. That includes the instruction fetch, reading any immediate or symbolic operands, and if the workspace is in 8-bit RAM, each register access gets hit too.

Thierry breaks it down in detail here: http://nouspikel.group.shef.ac.uk/ti99/wait.htm

Matthew

Edited October 9, 2010 by matthew180
Willsy Posted October 9, 2010

Thanks Matthew,

So I make that (where WS=wait-states=4):

    DEC R2     = 10 + (3*WS) = 22
    JNE LOADLP =  8 + (1*WS) = 12
    LI R1,4000 = 12 + (3*WS) = 24
                               --
                               58 cycles

    58 cycles @ 333ns per cycle = 19314ns / 1000 = 19.314us

~36% slower?

Does the fact that the workspace is in 0 wait-state memory complicate matters?

Sorry to hassle about this - there are not many people that can answer this stuff!

Mark
matthew180 Posted October 9, 2010

All memory access in the 99/4A causes wait states *except* the system ROM (>0000 to >1FFF) and the 256 bytes of scratch pad RAM. So, if the workspace pointer points to scratch pad (as it always should), then any operands that read the value of the registers will not have wait states. However, something like *R0 will not have a wait state when reading the register value, but the indirection address will cause a wait state (assuming it does not point to scratch pad memory).

Yes, the wait states do cause a huge performance hit, as you have discovered. 36% is probably conservative.

Matthew
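The wait-state arithmetic from the last few posts can be sketched in Python as follows. The base clocks and memory-access counts are the ones Willsy quotes from the data sheet. The last column is an assumption made here for illustration: it guesses how many of those accesses actually hit 8-bit memory when the code runs from cartridge ROM and the workspace is in scratch pad (instruction and immediate fetches waited, register accesses not). The thread does not confirm that split.

```python
CYCLE_NS = 333  # rounded clock period quoted in the thread

# (instruction, base clocks, total memory accesses,
#  accesses ASSUMED to hit 8-bit memory with the workspace in scratch pad)
INSTRUCTIONS = [
    ("DEC R2",      10, 3, 1),
    ("JNE LOADLP",   8, 1, 1),
    ("LI R1,>4000", 12, 3, 2),
]


def to_us(cycles):
    """Convert clock cycles to microseconds at 333ns per cycle."""
    return cycles * CYCLE_NS / 1000


def total_cycles(wait_states, workspace_in_pad=False):
    """Sum base clocks plus wait_states for every waited memory access."""
    column = 3 if workspace_in_pad else 2
    return sum(row[1] + row[column] * wait_states for row in INSTRUCTIONS)


if __name__ == "__main__":
    for label, cycles in [
        ("no wait states", total_cycles(0)),
        ("4 wait states on every access", total_cycles(4)),
        ("4 wait states, workspace in pad (assumed split)",
         total_cycles(4, workspace_in_pad=True)),
    ]:
        print(f"{label}: {cycles} cycles = {to_us(cycles):.3f}us")
```

Under these assumptions the three marked instructions take 30 cycles with no wait states, 58 cycles if every access is waited, and 46 cycles if only the ROM fetches are waited. All three figures are well short of the 126 cycles that 42us works out to at 333ns per cycle.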
+retroclouds Posted October 10, 2010

Matthew, instruction timing & memory access would make a *very* interesting topic for your book on assembly language.
matthew180 Posted October 12, 2010

Sorry Willsy, I have not had any time to calculate timings lately. Did you work out your numbers?

Retroclouds: I was thinking about a chapter on instruction timing, but I don't know if that goes too low level. You have to dig into the operation of the CPU at a level lower than assembly to really understand and calculate the timing. I like that kind of stuff though, and it can't hurt to have it in the book. :-)

Matthew
+retroclouds Posted October 12, 2010

Hi Mark,

I presume you are currently looking at driving the speech synthesizer from the cartridge ROM space >6000->7FFF? Also, is it safe to assume you can do that without having to copy part of the speech player code into scratch-pad memory? That would only be required if your program code is residing in memory located "behind" the speech synthesizer (32K memory in PEB)?

I know that timing for the code running from the cartridge space is different from when it is located in scratchpad. But other than that, it should be possible I guess? Sure hope so, because I need to save on scratchpad memory.

I'm starting work on the speech player now, using some code I got from you in 2009. This is an interesting area I don't know much about yet, but I'm starting to learn now.

How big are the differences between the TMS5220 and TMS5200? Were all TI-99/4A speech synthesizers driven by the TMS5200, or are there also any using the TMS5220? I think that classic99 and MESS are running on TMS5220 but this seems to cause "glitches".

Anyway, I just found the TMS5220 preliminary data manual at bitsavers. Check here. Might be worth adding to the Development Resources thread.

EDIT: The thing about player code having to reside in scratch-pad memory only applies for the "speak external" command. When using the normal "speak" command with built-in vocabulary you are safe. The question is: Is any of the built-in vocabulary worth listening to in games?

Edited October 12, 2010 by retroclouds
+retroclouds Posted October 12, 2010

Here is a cool demo of the speak 'n spell. I suppose its TMS5100 is relatively close to the TMS5200?

EDIT: Would be cool doing a TI-99/4A speech game with the speak & spell voice. Sorry, getting carried away now.

Edited October 12, 2010 by retroclouds
Willsy Posted October 12, 2010

Hi RetroClouds

The *only* Speech Synth code that needs to go in scratch-pad ram is the code to READ data from the synth. Basically, to cut a long story (that I only just about understand) short, the SS is rather slow. It seems to take a long time to either latch data in, or gate data out (can't remember now). But when reading, you cannot use the 8 bit bus, and that includes the 8 bit memory in the cartridge space, as (according to Tursi) accesses in the cartridge space will still trigger the multiplexer and cause the wrong data to be latched.

There are other timing restrictions, but they do not require execution from pad ram. To summarise:

When reading data or status: 12uS - 8 bit bus cannot be used, so the read and the delay must be executed from pad.

When writing external data into the speech synth (i.e. streaming LPC speech data into it) 10uS is required. In practice, no action is necessary, because you will be streaming data in a loop, and the loop will take > 10uS to execute.

The delay after loading a command until giving it some more work/data is 42uS. So, to speak a word from ROM, load the 4 address nybbles, then load >40, then wait 42uS, then issue a "speak from rom" (>50) command.

It's a bit of a pain to work with! If your scratch-pad space is low, consider overwriting other code in scratch pad with synth code when you need it, then restore the old code. In practice, you don't need much code. With Matthew's help, I arrived at the following code which does the job with minimal byte usage:

           movb @spchrd,@spdata ; move data from speech synth to location spdata
           src  R0,12           ; wait 12uS - see editor assembler page 349, paragraph 5.
           rt
    spdata data 0               ; place to store the data read from the synth

That's 10 bytes, plus 2 bytes (only one byte *actually* needed) to store the data read from the synth.

Note: You need to store the data read from the synth in scratch-pad ram, or else you will trigger the 8-bit bus! Of course, once your program has returned via the RT you can move it from pad and do anything.

You could make the above code 4 bytes shorter if you use registers:

           movb *r0,*r1         ; move data from speech synth to location spdata
           src  R0,12           ; wait 12uS - see editor assembler page 349, paragraph 5.
           rt
    spdata data 0               ; place to store the data read from the synth

But of course, you would have to initialise r0 and r1 appropriately before calling the subroutine in pad.

Hope this helps.

Mark
+retroclouds Posted October 13, 2010

Hi Mark,

thank you for your answer, it'll save me quite some time.

It's a bit of a bummer that I'll have to use scratchpad memory for reading/status polling. But it's ok, I probably can use the same scratchpad area I also use for tight loops.

Yeah, it's a wonder how they managed to drive the synth with the TI-99/4A in the first place. Makes me appreciate what Parsec does a lot more.
+retroclouds Posted October 13, 2010

I'm trying to calculate how the "SRC R0,12" is used to get to 12 uS. Assuming the SRC is located in scratchpad:

From the data book:

    T     = Total instruction execution time
    tc(o) = clock cycle time
    C     = number of clock cycles for instruction execution plus address modification
    W     = number of required wait states per memory access for instruction execution plus address modification
    M     = number of memory accesses

    T = tc(o) * (C + W*M)

In our case that would read:

    tc(o) = 0.333 uS on the TI-99/4A
    C     = 52 clock cycles for executing "SRC" + increasing PC
    W     = 0 (we are in scratchpad)
    M     = 4

    T = 0.333 * (52 + 0 * 4) = 17.316 uS

So I must be doing something wrong here, because 17.316 uS is more than 12 uS? I'm confused.
sometimes99er Posted October 13, 2010

SRC is a "Shift" operation. Look it up in table 3, page 28, in the http://www.retroclouds.de/atariage/tms9900_microprocessor_data_manual.pdf

There's a difference when the count is zero or not. Also see post #2.

Edited October 13, 2010 by sometimes99er
matthew180 Posted October 13, 2010

The binary encoding (machine code) of a shift instruction looks like this:

    | 0 1 2 3 4 5 6 7 | 08 09 10 11 | 12 13 14 15 |
    +-----------------+-------------+-------------+
    |     OPCODE      |      C      |      W      |
    +-----------------+-------------+-------------+

    C is the count
    W is the register to shift

Note that C and W only have 4 bits, thus the only possible values are between 0 and 15. There are three variations of the shift instructions:

1. The C (count) is NOT zero: 12 + 2C clocks
2. The C is zero and bits 12 through 15 of R0 *ARE* zero: 52 clocks
3. The C is zero and bits 12 through 15 of R0 *are NOT* zero: 20 + 2N clocks

The N parameter in #3 is the count value, 1 to 15, from bits 12 through 15 of R0.

R0 when used for the count:

    | 0 1 2 3 4 5 6 7 8 9 10 11 | 12 13 14 15 |
    +---------------------------+-------------+
    |   XXXX DON'T CARE XXXX    |      N      |
    +---------------------------+-------------+

Remember, a zero parameter to a shift instruction means "get the shift count from R0". But, since the shift count can only be between 0 and 15, only bits 12 through 15 of R0 are used. If that value is also zero, then the shift count will be 16, hence the longest instruction time for #2 above. This is the ONLY way to get a shift count of 16.

Note the difference between #1 and #3 is 8 cycles, which is due to having to read R0 for the count in the case of #3 (in case #1 the count is encoded as part of the instruction.)

I think using R0 in a shift instruction with a count of zero is illegal (SRC R0,0), but maybe only for the assembler; I think the CPU will execute it. I can not find anything in the datasheet that says you can't do that.

If you try to use a count greater than 15 in a shift instruction, the assembler will do a "mod 16" on the value, so you will always have a shift between 0 and 15. However, you have to be careful with that, since, if you try to code something like this:

    SRC R1,16

That 16 will really become 0 (16 mod 16 = 0), and the instruction will use bits 12 through 15 of R0 for the count, which is probably *not* what you intended. IMO, this kind of situation should cause the assembler to generate an error, but I'm pretty sure it does not.

Matthew

Edited October 13, 2010 by matthew180
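The three shift-timing cases described in this post can be written down as a short Python sketch. It assumes zero wait-state memory and the clock counts given above; `shift_clocks` is a hypothetical helper name, and the mod-16 wrap models the assembler behaviour the post describes rather than anything verified here.

```python
def shift_clocks(count, r0=0):
    """Clocks for a TMS9900 shift, per the three cases in the post.

    count is the value written in the source; it is reduced mod 16 to
    model the 4-bit count field. r0 is the current value of R0, which
    only matters when the encoded count is zero.
    """
    encoded = count % 16
    if encoded != 0:
        return 12 + 2 * encoded          # case 1: count in the instruction
    n = r0 & 0x000F                      # bits 12..15 in TI bit numbering
    if n == 0:
        return 52                        # case 2: shift count of 16
    return 20 + 2 * n                    # case 3: count taken from R0


if __name__ == "__main__":
    print("SRC R0,12         :", shift_clocks(12), "clocks")
    print("SRC R1,0  R0=>0000:", shift_clocks(0, 0x0000), "clocks")
    print("SRC R1,16 R0=>0005:", shift_clocks(16, 0x0005), "clocks")
```

This also covers the earlier question about SRC R0,12: the explicit count of 12 puts it in case 1, so it takes 36 clocks, and the 52-clock figure only applies to case 2.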
+retroclouds Posted October 13, 2010

Thanks for the answers people! This community is just plain awesome.
Tursi Posted October 19, 2010

I just wanted to note that the information attributed to me above isn't cut-and-dried. I don't know what the speech synth does on the bus or why the E/A manual says you can't touch the 8-bit bus during certain operations. In the attributed conversation I was just confirming that the cartridge port IS the 8-bit bus since it sits on the back-end of the multiplexer.

It's nice to have someone else working on the timing counts, hehe.