Kchula-Rrit · Posted March 21, 2021

Since the low-byte data lines are not connected to the VDP port, could a word access (MOV R1,@VDPWD) be used instead of a byte access (MOVB R1,@VDPWD)? Just curious; I thought a word access was faster than a byte access, especially for a write, since the CPU has to read the word, plug the new byte into it, and then write the word out.

K-R.
Fredrik Öhrström · Posted March 21, 2021

I believe the 9900 CPU is a bit stupid: it will always perform a read before a write.
+mizapf · Posted March 21, 2021

Yes. All format 1 instructions have a read-before-write, even MOV.
Kchula-Rrit · Posted March 21, 2021 (Author)

Just checked the 9900 data sheet for MOV/MOVB execution times; the time is essentially the same.

Thanks, K-R.
+mizapf · Posted March 21, 2021

That's the big advantage of the TMS9995 in the Geneve; it does not need any read-before-write because of the 8-bit data bus.
+Lee Stewart · Posted March 21, 2021

> Kchula-Rrit: Just checked the 9900 data sheet for MOV/MOVB execution times; the time is essentially the same.

...except for indirect, auto-incrementing register writes:

MOVB *R1+,@VDPWD    <<<--- faster by 2 clock cycles

--vs--

MOV  *R1+,@VDPWD

Of course, writing every other byte in a loop with the latter would rarely make sense, so a programmer would be unlikely to choose it anyway.

...lee
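[Editor's note: the faster MOVB form above is what a VDP copy loop is typically built on. A minimal sketch follows; the >8C00 port address is the standard 99/4A one, but the register assignments are assumptions for illustration only.]

```asm
VDPWD  EQU  >8C00          VDP write-data port on the 99/4A
* Copy R2 bytes from CPU RAM at the address in R1 to VDP RAM.
* Assumes the VDP write address has already been set up.
COPY   MOVB *R1+,@VDPWD    write one byte, auto-increment the source
       DEC  R2             count down the bytes remaining
       JNE  COPY           loop until R2 reaches zero
```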
Kchula-Rrit · Posted March 21, 2021 (Author)

You're right; I hadn't thought it through. This is off topic, but is the read-before-write the reason the VDP (among other peripherals) uses different read and write addresses? If, for example, you were writing an address to the VDP, it seems to me that while writing the second byte, the read-before-write would reset the sequencer inside the VDP and the address would never be updated. Sometimes I still have trouble wrapping my head around the way the TI does things.

K-R.
+mizapf · Posted March 21, 2021

Yes, and that is why, in the native mode of the Geneve, the video ports are the same for reading and writing: it was not necessary to pick two different addresses. These are the things that you simply accept at first, and only decades later learn why they had to be done that way.
Tursi · Posted March 22, 2021

> Kchula-Rrit: This is off topic, but is the read-before-write the reason the VDP (among other peripherals) uses different read and write addresses?

That's correct. Another downside is that VDP access triggers the multiplexer, so you suffer the extra wait states on both read and write, even though you don't need them.
Kchula-Rrit · Posted March 22, 2021 (Author)

It's been a while since I looked at the schematics, but I thought the multiplexer was only used for the side port. Well, that should help with the "time-waste" required between VDP writes.

K-R.
PeteE · Posted March 22, 2021

There is one way I've found that skips the read before write, but it has some downsides: one byte can be written in 24 cycles by setting the workspace pointer to the VDP write-data address, then using a chain of "LI R0,>XX00" instructions, one for each byte, where XX is the byte to write. The LI instruction is 4 bytes long, so the code size will be four times the number of bytes written.

PS. Don't forget that you do need a NOP time-waste when reading a byte from the VDP after setting the address to read from. It may work fine in emulation or on an F18A, but a real 9918A will sometimes return wrong values if you read too soon.
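[Editor's note: spelled out, the trick above looks something like the following sketch. The >8C00 port address is the standard one; the WS label for the normal workspace, the interrupt handling, and the example byte values are assumptions, not from the post.]

```asm
* Unrolled VDP write via LI, which skips the read-before-write.
* With WP = >8C00, R0 occupies the word at >8C00, i.e. VDPWD:
* each LI writes its immediate word there. The high byte >XX
* reaches the VDP; the >00 low byte lands on the undecoded odd
* address >8C01 and is ignored.
       LIMI 0              mask interrupts: the console ISR touches
*                          the VDP and would disturb the transfer
       LWPI >8C00          workspace pointer onto the VDP write port
       LI   R0,>4100       writes >41 ('A') to VDP data
       LI   R0,>4200       writes >42 ('B')
       LI   R0,>4300       writes >43 ('C')
       LWPI WS             restore the normal workspace (WS assumed)
```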
apersson850 · Posted March 22, 2021

The reason for the read-before-write approach is obvious when handling bytes. Since the CPU can only write a full word, it has to read the word in first, modify the byte being accessed, and then write the whole thing back. That it does this for word accesses too was to save space in the CPU design: the same logic can be used to access both word and byte values. Performance suffers, but that's the reason.

Since I have a console with 64 Kbytes of 16-bit-wide RAM inside, I can see when programmers have relied on the 16-to-8-bit multiplexing to slow things down enough for the VDP to cope without the NOP TI recommended you put in your code. These programs fail if I don't switch back to the standard memory expansion (I designed my 64 Kbyte RAM expansion so that I can disable it with CRU bits). I can also turn on a piece of hardware that detects VDP access and inserts a wait state in those cases. Then the programs run correctly, just faster than from 8-bit RAM.
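[Editor's note: to make the byte case concrete, here is the bus-level sequence for a single MOVB, written out as comments. This is an illustration only; the >A000 address is arbitrary.]

```asm
* MOVB R1,@>A001 causes roughly this at the bus level:
*   1. read the word at >A000       (fetch the destination word)
*   2. merge the high byte of R1 into the low byte of that word
*   3. write the word back to >A000
* The same read happens before MOV R1,@>A000 as well, where it
* is logically unnecessary: that is the read-before-write.
       MOVB R1,@>A001
```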
matthew180 · Posted March 22, 2021

> PeteE: PS. Don't forget that you do need a NOP time-waste when reading a byte from the VDP after setting the address to read from. It may work fine in emulation or on an F18A, but a real 9918A will sometimes return wrong values if you read too soon.

IIRC it was proven that on the 99/4A you don't need a time-waste, simply because the 9900's access time to fetch the next instruction (which might be the VDP read) is longer than the delay the VDP needs to pre-fetch the first byte after the read address is set. Somewhere in the forum is a thread about this, with testing. I guess I should probably try to find the reference.

However, unless you are using unrolled loops, there will probably already be instructions between setting the VDP read address and reading VDP data, so the traditional NOPs are not necessary; and when a delay is required, you can typically find actually useful instructions to put there instead. Most VDP access will probably go through some sort of function, like VWTR, VSBR, VMBR, etc., so there will already be plenty of instructions between setting the address and reading the data.
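[Editor's note: for reference, a routine in the usual VSBR style looks something like this sketch, following the common Editor/Assembler calling convention (address in R0, byte returned in the high byte of R1). The port addresses are the standard 99/4A ones; the exact instruction sequence is illustrative, not taken from the console ROM.]

```asm
VDPWA  EQU  >8C02          VDP write-address port
VDPRD  EQU  >8800          VDP read-data port
* Read the VDP byte at the address in R0 into the high byte of R1.
VSBR   SWPB R0
       MOVB R0,@VDPWA      send the low byte of the address first
       SWPB R0
       MOVB R0,@VDPWA      then the high byte (read mode: bit >40 clear)
       NOP                 traditional delay for a real 9918A pre-fetch
       MOVB @VDPRD,R1      fetch the pre-read byte
       B    *R11           return to caller
```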
Tursi · Posted March 22, 2021

> Kchula-Rrit: It's been a while since I looked at the schematics, but I thought the multiplexer was only used for the side port. Well, that should help with the "time-waste" required between VDP writes.

It does; we've gone into detail about that in other threads.
Tursi · Posted March 22, 2021

> matthew180: IIRC it was proven that on the 99/4A you don't need a time-waste, simply because the 9900's access time to fetch the next instruction (which might be the VDP read) is longer than the delay the VDP needs to pre-fetch the first byte after the read address is set. [...]

The case PeteE listed is the one case where it's easily possible, at least in scratchpad RAM and using registers: the turnaround time between writing the second byte of the address and reading back the cache register from the VDP can be shorter than the maximum 8 µs the VDP needs to fetch the data. It's /extremely/ common for programmers to put the address set and the read inline, so you do need to make sure something is in there. As you note, it's easy to find something useful to do, and if you're using subroutines, yes, the RT counts! No other case is an issue unless you're doing something deliberately tricky. LIs count as tricky.

Of course, if you create your own memory expansion with 64 K of 16-bit-wide memory that can be turned on and off via CRU, then yes, you may have compatibility issues, and you probably should install a VDP that can keep up with your accelerated system. But even there, it's usually the address-set/read-data combination that has trouble... except that now ALL your memory is essentially scratchpad.

Even without wait states, the usual instruction sequence on the 9900 is slow enough not to overrun the VDP for other accesses. We could get into an accelerated 9900 as well, but it seems strange to tell everyone they need to slow down their software so your faster system works correctly...
apersson850 · Posted March 24, 2021 (edited)

I didn't bother trying to modify the VDP just because I have fast memory all over the place. I designed my expansion so that if I disable the 32 K part, which corresponds to the normal memory expansion, then whatever is "below" becomes visible. So my internal memory expansion can coexist with the standard 32 K RAM expansion: just set the appropriate CRU bits and my internal expansion disappears.

The 64 K of RAM covers the entire addressable range. When the 8-bit latch that holds the memory-enable bits is reset, it pages out all internal RAM where there shouldn't be any, but pages it in where the normal 32 K RAM would be. Thus the default is using 32 of the 64 K of RAM where RAM should be, but not where ROM and other things should be. But I can page 8 K chunks in over the monitor ROM, over the DSR space, etc. if I want to, just as I can page memory out of the 8 K and/or 24 K RAM banks. Assuming there is a standard memory expansion in the machine, setting the correct bits is all it takes to go into compatibility mode. This scheme makes it possible to copy the monitor ROM to RAM and then change things like interrupt vectors, for example.

As you say, very few programs actually fail with fast memory everywhere. The only one I've encountered is the game Tennis. Running in my fast RAM, the players split at the hip: the upper body runs in one direction while the legs run in another. Enabling only the hardware wait-state generation on VDP access makes the game run, but then it looks more like table tennis than lawn tennis... You can just watch it run the demo, since beating it is virtually impossible in that case.

Edited March 24, 2021 by apersson850
Kchula-Rrit · Posted March 27, 2021 (Author)

> apersson850: The reason for the read-before-write approach is obvious when handling bytes. Since the CPU can only write a full word, it has to read it in first, then modify the byte to access and write the whole thing back. [...]

My response is a bit late, but that makes sense. I'd wondered about the read-before-write.

Thanks, K-R.
Airshack · Posted April 15, 2021

> PeteE: It may work fine in emulation or on an F18A, but a real 9918A will sometimes return wrong values if you read too soon.

Oops! Never considered this.
PeteE · Posted April 15, 2021

> Airshack: Oops! Never considered this.

Funny story: at Fest West 2017, your system with the F18A was the very first chance I had to test Tilda on real hardware, and it worked! Later I was showing it to someone with a 9918A system, and the scrolling caused wrong characters to appear all over the screen, all because there was no delay between setting the VDP read address and reading the data. I added the delay instruction, recompiled the code on my laptop, and tried again. Success!