marc.hull Posted June 12, 2010 (edited June 12, 2010)
To revisit the "Can the TI outpace the VDP" question... The answer is a tentative yes (with a caveat). When the computer is running with 16-bit memory it can definitely (I think) outrun the VDP. See the video link below. Not that this is earth-shattering news by any means; I just thought it was interesting.
The code reads a byte from the VDP, shifts that byte once, then writes back the address and the data. I used Matt's example of fast VSBR/VSBW, using register addressing and symbolic addressing to avoid a SWPB. Unless I am mistaken (what are the odds of that ;-) ), the program is writing data faster than the VDP can handle it...
Here is the link to the video. Theories?
Sorry for the audio quality of the narrator. He has a voice just made for the printed word.
Tursi Posted June 12, 2010
We need to see the code to answer that question. Bitmap mode is the most memory-intensive VDP mode, so it is of course the worst-case scenario. But once the code is posted we can run through the cycle counting and see whether it makes sense or whether this is new information.
matthew180 Posted June 12, 2010 (edited June 12, 2010)
The theory of the 99/4A not being able to outrun the VDP is partly based on information from Thierry's website, where he states that the memory addresses mapped to the VDP trigger the wait states. That means it does not matter if the source or destination is the 16-bit, no-wait-state scratchpad RAM: the read or write to the VDP will incur the wait state, and thus never be fast enough to outpace the VDP even in the worst case. I too would like to see the code you used.
Edit: If you have a console where you put 32K on the 16-bit bus and/or disabled the wait-state generator as per Thierry's tech pages, then yes, you can overrun the VDP. I suppose if you overclock the console you could also overrun it.
Matthew
sometimes99er Posted June 13, 2010
The effect looks good. It looks like the read and write addresses come through all right; you're hitting the right spots in the bitmap. I guess you're demoing exactly the same code, only running from 16-bit memory instead of 8-bit. I don't suspect a missing clear of the LSB; the effect would have been different anyway (from what I can see in the video). So if the address is okay, then the read of the data, the write of the data, or both are failing (if the problem does not lie elsewhere).
First, try putting a single NOP before reading the data. Then one NOP before the write of data. Then more NOPs. The results would be interesting: if you get it working correctly with one or a few NOPs, then we may need to take care in general when running from ScratchPad.
sometimes99er Posted June 13, 2010
matthew180 said: Edit: If you have a console where you put 32K on the 16-bit bus and/or disabled the wait-state generator as per Thierry's tech pages, then yes, you can overrun the VDP. I suppose if you overclock the console you could also overrun.
Oops, does ScratchPad issue wait-states?
Willsy Posted June 13, 2010
Velly intelesting. TF keeps VDPWA in a register to read and write to/from the VDP during VMBR, VMBW and VSBWM. It doesn't bother for single-byte stuff. I have two consoles, one UK spec, one USA spec. Neither is set up, as I don't have a PEB. I guess I should get one hacked to run 16-bit fast RAM.
marc.hull (Author) Posted June 13, 2010
This seems to be the offending piece of code.
    CLR  R1
    MOVB @R0LB,@VDPWA
    MOVB R0,@VDPWA
    MOVB @VDPRD,R1
    SLA  R1,1
    ORI  R0,>4000
    MOVB @R0LB,@VDPWA
    MOVB R0,@VDPWA
    MOVB R1,@VDPWD
When run from 16-bit zero-wait memory it causes the issue. Anyone out there with real gear want to confirm?
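For readers less familiar with the VDP port protocol, the bytes this routine sends can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not a model of the timing: it assumes a 14-bit VRAM address held in R0, and the function names `vdp_address_bytes` and `shift_like_sla` are made up for illustration.

```python
def vdp_address_bytes(vram_addr, write):
    """Return the two bytes sent to VDPWA, in the order the code sends them.

    Assumes a 14-bit VRAM address. For a write, >4000 is ORed in first,
    which is what ORI R0,>4000 does in the routine above.
    """
    if not 0 <= vram_addr <= 0x3FFF:
        raise ValueError("VRAM address must fit in 14 bits")
    word = (vram_addr | 0x4000) if write else vram_addr
    low = word & 0xFF          # MOVB @R0LB,@VDPWA  (low byte first)
    high = (word >> 8) & 0xFF  # MOVB R0,@VDPWA     (high byte second)
    return low, high


def shift_like_sla(byte):
    """Mimic MOVB @VDPRD,R1 / SLA R1,1 / MOVB R1,@VDPWD on one byte."""
    r1 = (byte & 0xFF) << 8   # MOVB loads the high byte; CLR R1 zeroed the low byte
    r1 = (r1 << 1) & 0xFFFF   # SLA R1,1 shifts the whole 16-bit register
    return r1 >> 8            # MOVB R1,@VDPWD sends the high byte


print([hex(b) for b in vdp_address_bytes(0x1234, write=False)])
print([hex(b) for b in vdp_address_bytes(0x1234, write=True)])
print(hex(shift_like_sla(0x81)))
```

The point of the sketch is that the read and the write each need their own address setup, so every pass through the routine hits the address port four times and the data port twice.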
sometimes99er Posted June 13, 2010
Willsy said: Velly intelesting.
Eh, I hope my English is not the cause of the commotion?
matthew180 Posted June 13, 2010
marc.hull said: When in 16 bit zero wait memory it causes the issue. Any one out there with real gear want to confirm?
I'll test it on my real hardware, but it will have to be a little later.
Matthew
Opry99er Posted June 13, 2010
I am new to assembly and actually very interested in this topic. I kept up with the whole Y! thread and I've been reading here as well. It seems as though we are in need of a super-fast game to test this out. I propose "Crack-Mario" without any wait states or delay loops. He smokes a bunch of crack and flies through the levels at 7000 miles an hour. All 9 worlds should take about 3.9 seconds to complete.
Willsy Posted June 13, 2010
Try this:
    CLR  R1
    MOVB @R0LB,@VDPWA
    MOVB R0,@VDPWA
    NOP                  ; chill
    MOVB @VDPRD,R1
    SLA  R1,1
    ORI  R0,>4000
    MOVB @R0LB,@VDPWA
    MOVB R0,@VDPWA
    NOP                  ; just chillin'
    MOVB R1,@VDPWD
I reckon it's the period between the address being written and the data being read/written.
OR... You could run your code on the VDP interrupt. You get a VDP interrupt when the VDP is entering the vertical refresh period (VRAM isn't being accessed, and thus the CPU window is open for 4.3 milliseconds). You could get a fair bit of VRAM thrashing done in that window. You might have to sit down with a pen and paper and work out how many cycles your code needs and how long it takes. Then you can work out how many times you can loop in the 4.3 milliseconds. One advantage of this method would be that it should work on both modified and stock consoles.
By the way, don't bother with that LIMI 0 crap:
    LIMI 0
    [ do some work ]
    LIMI 2
LIMI is a slow instruction, and eats into your interrupt window. Just work out how much code (or how many loops) you can run in the window. Classic99 has a feature that will tell you how many cycles a section of code takes: T(address-address). That will help you to work out how long a section of code takes.
Lastly: It strikes me that you may be able to use your discovery to 'discover' whether the host machine has 16-bit RAM or not. Thrash some data into VRAM like a chipmunk on speed, then read it back. If you don't get what you wrote, you're on 16-bit RAM, or you've done too much shit.
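The pen-and-paper estimate suggested above can be sketched in Python. This is a rough sketch under stated assumptions: the 4.3 ms window is the figure from the post, the 3 MHz clock is assumed for a stock console, and the per-loop cycle count is a placeholder to be replaced with a real measurement (for example from Classic99's timing feature). The function name `loops_in_window` is hypothetical.

```python
def loops_in_window(cycles_per_loop, window_us=4300, clock_mhz=3):
    """Estimate how many whole loop iterations fit in the refresh window.

    window_us -- window length in microseconds (4.3 ms, from the post)
    clock_mhz -- CPU clock in MHz (assumed 3 MHz for a stock console)
    """
    if cycles_per_loop <= 0:
        raise ValueError("cycles_per_loop must be positive")
    available_cycles = window_us * clock_mhz  # 1 MHz = 1 cycle per microsecond
    return available_cycles // cycles_per_loop


# 230 cycles per pass is only a placeholder value for illustration.
print(loops_in_window(230))
```

Any interrupt entry and exit overhead would also have to come out of the window, so a real budget would be somewhat smaller than this figure.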
Willsy Posted June 13, 2010
Opry99er said: I am new to assembly and actually very interested in this topic. I kept up with the whole Y! thread and I've been reading here as well. It seems as though we are in need of a super-fast game to test this out. I propose "Crack-Mario" without any wait states or delay loops. He smokes a bunch of crack and flies through the levels at 7000 miles an hour. All 9 worlds should take about 3.9 seconds to complete.
HA HA HA! LMAO! I just peed in my pants. It's not a pretty sight!
marc.hull (Author) Posted June 13, 2010
Willsy said: Try this: [NOP-padded version of the code, as posted above] I reckon it's the period between the address being written and the data being read/written. OR... You could run your code on the VDP interrupt. [...]
Hey K... This is not really a problem I am encountering as much as a curiosity. The actual code runs OK in 8-bit RAM. I happened to leave my machine in 16-bit mode when I started the program the other night and noticed the anomaly.
I believe that the 32K16 mod is equivalent to running in the scratchpad. Since I championed the claim that you couldn't overrun the VDP on a stock TI, I thought I should retract it in light of some new evidence, because evidently you can (provided it bears scrutiny).
Tursi Posted June 14, 2010
matthew180 said: The theory of the 99/4A not being able to outrun the VDP is partly based on information from Thierry's website where he states that the memory addresses mapped to the VDP trigger the wait states.
I thought you guys (Marc and Matthew) were both there on whichever list it was where we hammered this into the ground, getting the /actual/ timing from the console during VDP access (i.e., there's no need to "rely on information from Thierry's website"). When we left it, all that remained was verifying the theories, which none of us did. Unless you guys don't trust my results; I know I get chewed out every few months by someone for "pretending" to know what I'm talking about.
sometimes99er said: Oops, does ScratchPad issue wait-states ?
No, it doesn't, so it should be the same as Marc's test. So, assuming registers are also in 0-wait-state RAM:
    CLR  R1              10 cycles
    MOVB @R0LB,@VDPWA    14 + 8 symbolic + 8 symbolic + 4 read vdp + 4 write vdp
    MOVB R0,@VDPWA       14 + 8 symbolic + 4 read vdp + 4 write vdp
    MOVB @VDPRD,R1       14 + 8 symbolic + 4 read vdp
    SLA  R1,1            12 + 2
    ORI  R0,>4000        14
    MOVB @R0LB,@VDPWA    14 + 8 + 8 + 4 + 4
    MOVB R0,@VDPWA       14 + 8 + 4 + 4
    MOVB R1,@VDPWD       14 + 8 + 4 + 4
The 4 clocks are the wait states imposed, and every write has a read-before-write cycle. A cycle is 0.333 µs.
According to the datasheet, the VDP needs 2 µs after setting an address before it is ready to make the data transfer. It then takes anywhere from 2 to 8 µs for the transfer to actually occur. Bitmap mode is the worst case, so the full 8 µs is more likely to occur than in other modes. That's a total of 10 µs, which takes 30 CPU cycles. This information is all out of the respective datasheets. Note that the 2 µs delay after setting the address doesn't apply to subsequent reads or writes with respect to the VDP; only the 2-8 µs window applies to those.
Most likely, it's the read that is failing. But it's not going to fail 100% reliably, which makes this sample code a little tough. Sometimes it will work. It depends on exactly when the CPU request happens compared to what the VDP is currently doing on the screen.
Anyway, the reason it's likely the read is that it has the shortest time between writing the address and accessing the data register. Since the address write is the last thing the previous MOVB does, and the data read is the first thing the next MOVB does, you've only got about 20 cycles or so between the write of the address and the read.
Even so, I did prove in that thread that READS could outstrip the VDP; only writes appeared to be safe (and based on this information, maybe not after setting the address). To quote:
VDP Writes - our fastest write is 8.65 µs. This means that writes to the VDP at any speed should always be safe, as this is greater than the worst-case access speed to the VDP. Confirmed theoretically, just need some proof.
VDP Reads - our fastest read is 7.32 µs. This is potentially close enough to the edge for tight loops to occasionally miss during graphics I or II. It's easy to add 4 cycles to make it safe by using the symbolic addressing mode instead of register indirect. They are safe in text (3.1 µs max) and multicolor (3.5 µs max) mode, however.
But this was talking about sequential accesses; you need a little extra time after setting the address. The write might just be safe, as the actual write will occur right around the 30-clock mark, but the read is definitely too early to be reliable.
(Also note the above treated the fastest instruction as a MOVB R0,*R1, where R1 contains the VDP address, or vice versa; that is, using register indirect to access the VDP and a register for the data. We talked about other hacks like LIMI later.)
Of course, with the program in wait-state RAM, you gain 4 extra clock cycles for every word of program, which pushes your read up to 30 cycles between the address set and the read.
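The arithmetic in the post above can be checked with a short Python sketch. The cycle counts are copied from the table as posted, the 3 MHz clock follows from the stated 0.333 µs cycle, and the 20 and 30 cycle gaps are the approximate figures from the post rather than independently derived values. All names in the code are illustrative.

```python
CLOCK_MHZ = 3  # 0.333 microseconds per cycle, as stated in the post

# Per-instruction cycle components, copied from the table in the post
# (program and registers assumed to be in 0-wait-state RAM).
SEQUENCE = [
    ("CLR  R1",           [10]),
    ("MOVB @R0LB,@VDPWA", [14, 8, 8, 4, 4]),
    ("MOVB R0,@VDPWA",    [14, 8, 4, 4]),
    ("MOVB @VDPRD,R1",    [14, 8, 4]),
    ("SLA  R1,1",         [12, 2]),
    ("ORI  R0,>4000",     [14]),
    ("MOVB @R0LB,@VDPWA", [14, 8, 8, 4, 4]),
    ("MOVB R0,@VDPWA",    [14, 8, 4, 4]),
    ("MOVB R1,@VDPWD",    [14, 8, 4, 4]),
]


def total_cycles(sequence):
    """Sum every cycle component of every instruction."""
    return sum(sum(parts) for _, parts in sequence)


def us_to_cycles(microseconds, clock_mhz=CLOCK_MHZ):
    """Convert microseconds to CPU cycles (1 MHz = 1 cycle per microsecond)."""
    return microseconds * clock_mhz


ADDRESS_SETUP_US = 2      # delay after setting the address (from the post)
WORST_CASE_ACCESS_US = 8  # worst-case transfer time (from the post)
REQUIRED_GAP = us_to_cycles(ADDRESS_SETUP_US + WORST_CASE_ACCESS_US)


def gap_is_sufficient(gap_cycles):
    """True if the gap between address write and data access covers the worst case."""
    return gap_cycles >= REQUIRED_GAP


print("one pass:", total_cycles(SEQUENCE), "cycles")
print("required gap:", REQUIRED_GAP, "cycles")
print("~20 cycle gap (0-wait RAM) sufficient:", gap_is_sufficient(20))
print("~30 cycle gap (wait-state RAM) sufficient:", gap_is_sufficient(30))
```

On these numbers the zero-wait case falls short of the worst-case requirement while the wait-state case lands right at the edge, which matches the observation that the failure is intermittent and only shows up in 16-bit mode.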
matthew180 Posted June 14, 2010
Yeah, I was part of that. I did the real hardware testing and you smacked it down with the data analyzer. We came to a resolution, I wrote my routines, and promptly forgot everything we figured out. I'm still not clear whether Marc's computer is modified, and how. Does the 32K on the 16-bit bus remove the wait states altogether? It won't matter in a few months anyway... oops, shhhh.
Matthew
marc.hull (Author) Posted June 14, 2010
Tursi said: I know I get chewed out every few months from someone for "pretending" to know what I'm talking about.
Hmmmmmm... Erik© Brent. Kinda catchy. Just kidding bro...
The test I ran earlier WAS entirely sequential and did not mix reads and writes. Sometimes it just takes a while for us less-than-guru-status peeps to catch on.
Marcus...
marc.hull (Author) Posted June 14, 2010
matthew180 said: Yeah, I was part of that, I did the real hardware testing and you smacked it down with the data analyzer. We came to a resolution, I wrote my routines, and promptly forgot everything we figured out. I'm still not clear if Marc's computer is modified and how, and does the 32K on the 16-bit bus remove the wait states all together? It won't matter in a few months anyway... oops, shhhh. Matthew
My console is modded to run in either normal 8-bit-wide mode with wait states or 16-bit-wide mode without wait states. I did NOT put this code into the scratchpad and run it in normal mode; I only switched the console between the two. Normally when I program I assemble in "fast mode" to cut the time in half and switch over to slow mode to run the executable. A couple of beers got in the way the other night, I forgot to return to the slower operation, and I saw the anomaly. BTW this does not occur when the console is running at 3.58 MHz (to answer an earlier query). I'll leave it to you to do the scratchpad test...
Tursi Posted June 23, 2010
Scratchpad and RAM with no wait states run at the same speed, so there's no need to retest that (unless you really want to). The mod just adds the normal 32K RAM areas to the circuit that disables the wait-state generator.
The 3.58 MHz tweak is interesting. Are you saying it does not have any problems running at 3.58 MHz with the wait states disabled (i.e., running full speed)?
marc.hull (Author) Posted June 23, 2010
Tursi said: Scratchpad and RAM with no wait states run at the same speed, there's no need to retest that (unless you really want to). The mod just adds the normal 32k RAM areas to the circuit that disables the wait state generator. The 3.58MHz tweak is interesting - are you saying it does not have any problems running at 3.58MHz with the wait states disabled (ie: running full speed)?
    32K16 enabled / 3.00 MHz speed = fouled video
    32K8 enabled  / 3.58 MHz speed = good video
    32K16 enabled / 3.58 MHz speed = fouled video (obviously)
    32K8 enabled  / 3.00 MHz speed = good video (again obviously, just stating for the record...)
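The four results are easier to read when grouped by each variable in turn. The following Python sketch simply re-tabulates the outcomes reported above; the variable and function names are illustrative and nothing here is new data.

```python
# (memory mode, clock in MHz, observed video) as reported in the post above
RESULTS = [
    ("32K16", 3.00, "fouled"),
    ("32K8",  3.58, "good"),
    ("32K16", 3.58, "fouled"),
    ("32K8",  3.00, "good"),
]


def outcomes_by(column):
    """Group the set of observed outcomes by one column (0 = memory, 1 = clock)."""
    groups = {}
    for row in RESULTS:
        groups.setdefault(row[column], set()).add(row[2])
    return groups


by_memory = outcomes_by(0)
by_clock = outcomes_by(1)

for key, seen in by_memory.items():
    print(key, sorted(seen))
for key, seen in by_clock.items():
    print(key, sorted(seen))
```

Grouped this way, the memory mode alone separates the good runs from the fouled ones, while each clock speed shows both outcomes.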
matthew180 Posted June 24, 2010
Cool, thanks for the breakdown. So what I'm seeing is that a stock TI can't overrun the VDP. Am I missing anything?
Matthew
Willsy Posted June 24, 2010
marc.hull said: 32K16 enabled / 3.00 MHz speed = fouled video; 32K8 enabled / 3.58 MHz speed = good video; 32K16 enabled / 3.58 MHz speed = fouled video (obviously); 32K8 enabled / 3.00 MHz speed = good video (again obviously, just stating for the record...)
You're in bitmap mode, right Marc? Bitmap mode being the worst case in terms of bus activity (i.e., fewer CPU access windows).
sometimes99er Posted June 24, 2010
And why should bitmap mode be worse than ordinary graphic mode?
matthew180 Posted June 24, 2010
Actually, for Graphics Mode I and II it is the same: the CPU access window during the active display is between 2 µs and 8 µs. The problem is, you never know how long the access is going to take, and the VDP does not have a READY or HOLD pin... Why not, I have no idea. During the vertical retrace, VRAM can be accessed every 2 µs, which is faster than the CPU. But even the worst-case 8 µs during the active display is faster than the CPU, except in very particular circumstances (like 32K on the 16-bit bus, perhaps).
Matthew
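One way to make the comparison concrete is to put the worst-case access windows next to the fastest access times Tursi quoted earlier in the thread. The Python sketch below uses only those quoted figures and assumes a 3 MHz clock; it covers back-to-back data accesses only, not the extra 2 µs needed after an address write. The names are illustrative.

```python
CLOCK_MHZ = 3  # assumed stock clock, 0.333 microseconds per cycle

# Worst-case VDP access window per mode, in microseconds (figures from the thread)
WORST_CASE_WINDOW_US = {
    "text": 3.1,
    "multicolor": 3.5,
    "graphics": 8.0,  # Graphics I and II (bitmap)
}

FASTEST_READ_US = 7.32   # fastest sequential read quoted in the thread
FASTEST_WRITE_US = 8.65  # fastest sequential write quoted in the thread


def is_safe(access_interval_us, mode):
    """True if the CPU's access interval is at least the worst-case window."""
    return access_interval_us >= WORST_CASE_WINDOW_US[mode]


def with_extra_cycles(access_interval_us, cycles, clock_mhz=CLOCK_MHZ):
    """Lengthen an access interval by a number of CPU cycles."""
    return access_interval_us + cycles / clock_mhz


for mode in WORST_CASE_WINDOW_US:
    print(mode,
          "read safe:", is_safe(FASTEST_READ_US, mode),
          "write safe:", is_safe(FASTEST_WRITE_US, mode))

padded_read = with_extra_cycles(FASTEST_READ_US, 4)
print("read + 4 cycles safe in graphics:", is_safe(padded_read, "graphics"))
```

With these figures only the fastest read in the graphics modes falls inside the worst-case window, and four extra cycles are enough to move it back out.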
marc.hull (Author) Posted June 24, 2010
matthew180 said: Cool, thanks for the break down. So what I'm seeing is that a stock TI can't over run the VDP. Am I missing anything? Matthew
I believe it can overrun the VDP if you shoehorn the code into the scratchpad, which I think Mike stated runs at zero wait states.
matthew180 Posted June 24, 2010
Yeah, the scratchpad runs at 0 wait states, but according to Thierry's site, the VDP triggers the wait states. So you still have a wait state for half of the MOVB instruction. I'll have to go back and read it all again and dig out the schematics to make sure, though. Tursi put his logic analyzer on it, but I don't remember all the details, such as whether he saw a wait state on half of the accesses. I'm pretty sure my VDP overrun tests were done via code in the scratchpad. I'll go dig them up.
Matthew