ARRGGGGHHH! VIDEO CORRUPTION

digress · August 23, 2016

Nah I ruled it out this afternoon by disabling all pattern updates during gameplay and it still happened.

It could be the origin of the problem if ICVGM gives this message in the first place.

digress · August 23, 2016

"That looks very much like a VDP register being altered... "

Yes, I'm convinced this is what is happening, but I'm certainly not doing it on purpose. Something is making the VDP point elsewhere abruptly. I can still drive around the original scene even though it looks like a mess. The pickups work only where they were originally placed, but I cannot interact with anything on the mess screen.

So it's like the screen swapped to a messy area of memory, but the game continues on in the correct area.

I'll try a recovery key to see if I can swap back to the first screen. I'll post the result in a few minutes.

That looks very much like a VDP register being altered... Are you sure you are not trying at some point to write to a negative VDP address (either by accident or on purpose)? I did this in Super Space Acer, on purpose, naively thinking that the address register would be masked anyway, and forgetting that a negative VDP address would be interpreted as register writes.

I would be very happy to reproduce this error message with ICVGM on my computer. I've never seen this before.

I sent you the file that caused the error. I'm using Windows 7 64-bit. It could be something incompatible with my setup.

Tursi · August 23, 2016

What I'm suggesting is that if you have any code that calculates a VDP address (for instance, looking up row and column, or even a pattern address), just review that code and make very sure it can never go negative. A value of -1 will change register 7 to F (turning the screen white), for instance. That last screenshot really looks like external video was turned on, which is in register 0. It's a really easy bug to introduce and will cause all sorts of seemingly random events to occur.
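To make the "negative address becomes a register write" point concrete, here is a minimal sketch in plain C. It is a simplified model, not real VDP code: it assumes the command port takes the low byte first and that bit 7 of the second byte selects a register write, with the low three bits naming the register. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Result of decoding one two-byte command (hypothetical type, for illustration). */
typedef struct {
    int is_register_write;  /* 1 if the command is treated as a register write */
    uint8_t reg;            /* register number (valid for register writes) */
    uint8_t value;          /* byte stored in that register */
    uint16_t address;       /* 14-bit VRAM address (valid for address setups) */
} vdp_command;

/* Decode a 16-bit "address" the way it reaches the command port:
   low byte first, then high byte.
   Assumption: bit 7 of the second byte means "register write". */
static vdp_command decode_vdp_command(int16_t addr)
{
    uint16_t raw = (uint16_t)addr;
    uint8_t lsb = (uint8_t)(raw & 0xFF);
    uint8_t msb = (uint8_t)(raw >> 8);
    vdp_command c = {0, 0, 0, 0};

    if (msb & 0x80) {
        c.is_register_write = 1;
        c.reg = msb & 0x07;     /* low three bits select the register */
        c.value = lsb;          /* first byte sent becomes the register data */
    } else {
        c.address = raw & 0x3FFF;
    }
    return c;
}
```

In this model, decode_vdp_command(-1) reports a write of 0xFF to register 7 (color F in both nibbles), while decode_vdp_command(0x2800) is an ordinary address setup.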

Otherwise, it sounds like you've got a good approach for the troubleshooting - disable large blocks of code and see if it still happens. Also, see if it reproduces under an emulator with a debugger... BlueMSX might not reproduce such a bug based on my experience, but I believe there's a newer one? If you can make it happen, you can see what it changes the registers to and maybe narrow down when it happens (save states may help too).

Kiwi · August 24, 2016

I do wish I could help you further. Another thing I know can cause a crash is writing data past the bounds of an array, like storing a value into variable[6] when variable[6] isn't allocated. The very first entry of the sound table needs to be SOUNDAREA1, or else there will be a crash. And the sound data has to end with 0x10,0x50,0x90,0xd0, or else the sound pointer will get lost and cause havoc. Maybe the crash is related to the sound engine/voice. I can't be sure.
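For the out-of-bounds array case, a guarded setter is one cheap way to find or rule out overruns while debugging. This is only a sketch: MAX_TANKS, tank_x and set_tank_x are made-up names, not anything from the game or from cvlib.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TANKS 6                 /* hypothetical array size */

static uint8_t tank_x[MAX_TANKS];   /* valid indices are 0..5 */

/* Store x for one tank. Returns 1 on success, 0 if the index is out of
   range, in which case nothing is written. */
static int set_tank_x(size_t index, uint8_t x)
{
    if (index >= MAX_TANKS) {
        return 0;                   /* index 6 and up would overwrite other data */
    }
    tank_x[index] = x;
    return 1;
}
```

Routing writes through a function like this during troubleshooting makes it easy to set a breakpoint on the failing branch.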

The ICVGM graphic window looked like it is falling in and not filling in the empty space. Alternatively, CV Sprite 2, which comes with the CV graphic tool kit, can make ColecoVision sprites.

The only other thing I can suggest is to have another person you trust look over your code. I think it helps to have another pair of eyes look over the code and give helpful suggestions.

digress · August 24, 2016

Overrunning arrays is a good thought. I haven't checked that out yet. I did have a problem with this before, where it caused trouble on a previous project.

On Mr. Turtle I also had a problem where the boulder went off the bottom of the screen but kept drawing offscreen, which caused corruption, so I've been searching out any possibilities of that as well.

A couple more things to check.

I might get someone to check over my code. It's pretty complex at this point though. I'm not sure who would want to.

I have found I can get things working by simplifying what is going on, say only 1 tank on the screen or no music, so I will get there.

Tursi · August 24, 2016

No music might be a good clue... the music player is quite expensive. I always make sure to run it AFTER all VDP access is finished in the blank, otherwise you're almost guaranteed to be running during the display when you can't write so quickly.

I'd be willing to give it a try on my modified BlueMSX to see if we can rule in/rule out the VDP address overruns I mentioned.

Kiwi · August 25, 2016

"Over running arrays is a good thought. I haven't checked that out yet. I did have a problem with this before where it caused a problem on a previous project."

One other thing I forgot, which I had a problem with on Computer Space: one of the functions was missing a }, so that was causing issues too.

"I also on mr turtle had a problem where the boulder went off the bottom of the screen but kept drawing offscreen which caused corruption so I've been searching out any possibilities of that as well."

put_frame() is one of the BIOS functions. It only checks the left and right sides of the screen to determine whether to print a whole row of tiles, part of it, or none at all if the x-coordinate is over 32.

It'll keep writing data into VRAM if the y-coordinate is off screen. Right below the screen is the sprite attribute table, under that is screen 2, and further on are the color table, the sprite table and lastly the pattern table.
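A wrapper that clips on both axes avoids this. The sketch below is not the BIOS put_frame; put_frame_clipped is a hypothetical stand-in that writes to a plain array instead of VRAM so the clipping logic can be checked on a PC.

```c
#include <stdint.h>

#define SCREEN_COLS 32
#define SCREEN_ROWS 24

/* Stand-in for the 32x24 name table in VRAM (hypothetical, for illustration). */
static uint8_t name_table[SCREEN_ROWS * SCREEN_COLS];

/* Copy a w-by-h block of tiles to (x, y), skipping every tile that falls
   outside the visible 32x24 area. Returns the number of tiles written. */
static int put_frame_clipped(const uint8_t *tiles, int x, int y, int w, int h)
{
    int written = 0;
    for (int row = 0; row < h; row++) {
        int sy = y + row;
        if (sy < 0 || sy >= SCREEN_ROWS) {
            continue;               /* row is above or below the screen */
        }
        for (int col = 0; col < w; col++) {
            int sx = x + col;
            if (sx < 0 || sx >= SCREEN_COLS) {
                continue;           /* column is left or right of the screen */
            }
            name_table[sy * SCREEN_COLS + sx] = tiles[row * w + col];
            written++;
        }
    }
    return written;
}
```

With this, a 2x2 boulder drawn at row 23 only writes its top row, and one drawn at row 24 writes nothing, so nothing spills into the tables that follow the screen.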

"I might get someone to check over my code. It's pretty complex at this point though. I'm not sure who would want to."

I did let groovybee check my code for Rockcutter to see if there was anything I could do to get the game size down from 32KB. He taught me new things, like using switch/case to generate a jump table instead of a bunch of ifs to gain a lot of cycles back, and using data tables. I was shy about showing my code at first, but I'm happy it worked out.

And it is hard to pinpoint what exactly causes the corruption/crash without looking at the code. So that's why I suggested that.

Tursi · August 27, 2016

I need to correct myself on one point earlier:

"Reading the VDP status register has the side effect of changing the internal VDP address. This is why, even if you don't do anything else, the NMI can corrupt your VDP access."

Reading the status register does NOT change the VDP address... I tested this out myself yesterday. However, reading the status register DOES reset the internal address load flipflop, which means that if an NMI triggers in the middle of setting an address or changing any VDP register (AND that NMI is allowed to read the status register), that set or update will be corrupted and will change the VDP address to something other than was intended. (It also will not update the VDP register in question).

It seems like a pretty tight race, but I observed it happening a lot. Murphy's out to get you when you're a programmer.

Edited August 27, 2016 by Tursi

digress · August 27, 2016

This problem has been resolved, thanks to everyone's suggestions and some coding help with the NMI problem from Tursi. I should now be able to get back to finishing the project rather than troubleshooting the corruption issue.

The latest build had no corruption issues after a long play in all areas. I might even be able to put a few things back in that I had taken out while trying to fix this.

alekmaul · August 28, 2016

Could you explain how you resolved it? It could be useful for others if they have the same problem.

Tursi · August 28, 2016

The cvlib code that was being used is a nice piece of coding, and there's a ton of work put into it, but the handling of the NMI is not safe at all. There are a few reasons for this:

1) The NMI handler in the crt /always/ reads the VDP status register, even if the NMI is disabled. This can break address setup and register writes if they are interrupted -- which happens a lot more often than you'd expect.

2) The NMI is not normally disabled. There is a no_nmi flag, but it is only used to prevent re-entrancy into the NMI. Worse, if re-entrancy occurs and the NMI is skipped, the skipped NMI still reads the VDP status register, even though the previous NMI is not complete. (Technically, though, this case shouldn't happen).

3) The library provides a disable_nmi and enable_nmi function, but these functions operate by turning the interrupt on and off in the VDP itself. While this DOES prevent the NMI from ever firing (yes, there is a way!), the problem is that this requires a register write, which can be interrupted itself and thus not happen.

If you'll forgive a longer post, let me talk again about how the VDP works and how it ties into the Coleco's NMI. Understanding this is key to preventing these corruption issues.

So, the basics: when the VDP finishes rendering a frame and reaches the bottom of the visible screen (ie: line 192), it sets an 'end of frame' bit in the status register. The VDP has the ability to raise a physical output pin high when this bit is set (ie: to tell the CPU vertical blank has begun). This operation is controlled by an "interrupt enable" bit in VDP register 1. Any time the status bit and the control bit are both high, the output pin on the VDP is high. If either bit is low, the output pin is low.

The status bit is cleared by reading the VDP status byte (ie: with an in a,(0xbf)). This gives you a copy of the status and clears the end of frame, sprite collision and 5 sprite flags. You can do this at any time (although, unrelated, it may be slightly racy to poll this register too quickly...)

VDP register 1, of course, is updated by writing two bytes to the command port, one specifying the data, and one specifying the register to update as well as a control bit to indicate register write. Because this takes two separate writes, the VDP internally maintains a counter that indicates whether it is expecting the first or second byte of a command. This counter is also reset any time the status register is read, or the data port accessed (for read or write).

In the ColecoVision, the output pin from the VDP is connected to the Z80's NMI line. NMI is the only interrupt on the Z80 that is edge driven rather than level driven - that is, it will only trigger when the line transitions from low to high. (I believe that other interrupts will continuously re-trigger as long as the input is high - this may be why they selected NMI). Because it's a non-maskable interrupt, the CPU can not prevent the NMI from interrupting operation and jumping to the NMI handler.

The normal, expected behavior for an NMI handler, then, is to perform any operation that you need to do regularly or during vertical blank, and then read the VDP status register. Reading the VDP status register will clear the end of frame bit, which in turn will lower the interrupt line. This will then enable the NMI to happen again next time the line goes high (ie: end of next frame).

Of course, if you never read the status register, then the high will stay high. Even if you return from the NMI, it will never retrigger because only a low-to-high transition can trigger it.
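The pin and edge behaviour described above can be modelled in a few lines of plain C. This is a sketch of the logic only, using the same high/low convention as the explanation; the struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int frame_flag;        /* 'end of frame' bit in the status register */
    int interrupt_enable;  /* interrupt enable bit in VDP register 1 */
    int last_pin;          /* previous level of the interrupt output pin */
    int nmi_count;         /* how many NMIs the CPU has taken */
} vdp_irq_model;

/* The pin is high only while both bits are high. */
static int pin_level(const vdp_irq_model *m)
{
    return m->frame_flag && m->interrupt_enable;
}

/* Re-sample the pin; an NMI is counted only on a low-to-high transition. */
static void update_pin(vdp_irq_model *m)
{
    int pin = pin_level(m);
    if (pin && !m->last_pin) {
        m->nmi_count++;
    }
    m->last_pin = pin;
}

/* The VDP reaches the bottom of the visible screen. */
static void end_of_frame(vdp_irq_model *m)
{
    m->frame_flag = 1;
    update_pin(m);
}

/* The CPU reads the status register, which clears the end-of-frame bit. */
static void read_status(vdp_irq_model *m)
{
    m->frame_flag = 0;
    update_pin(m);
}
```

Calling end_of_frame twice without read_status in between counts only one NMI in this model, which is the "never retriggers" case.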

The main reason that VDP corruption occurs is when an NMI occurs between the two writes of an address setup or register command. It may seem like a really narrow race, but it will happen more often than you expect. When I checked out Tank Mission, for instance, it was happening almost every frame during the sparkly part of the title page, since the address was changing a lot.

So what happens when a command is interrupted? The first byte written is always loaded to the LSB of the address register before anything else happens. If the NMI then occurs and the status register is read, then the command counter is reset. When the NMI returns, the unaware code writes what it thinks is the second byte of the command, but the VDP thinks it's the first byte and copies it to the LSB of the address register. Then the program reads or writes VDP memory, completely unaware that it's going to the wrong place. The VDP is okay with that, too, the access to the data port just resets the command counter again.

For example, let's say the VDP address is 0x0320, because you just loaded a title page and color table. You want to change to 0x2800 and write a sprite table. The command for a write address requires you to OR in 0x40 in the MSB, so you're going to need to write 0x6800, LSB first. So first you write 0x00 to the command port. The VDP drops that into the LSB of the address register, internally changing it to 0x0300. If you next wrote the 0x68, the VDP would drop that into the MSB, stripping off the command bits, and the address would be 0x2800 as desired. But instead, let's say an NMI happens. The NMI is careful not to mess with anything else, but it still reads the status register, which has the side effect of resetting the command counter. Now it returns, and NOW your code sends the 0x68. Well, the command counter is expecting the LSB of a command, so it copies that to the LSB of the address register, making it 0x0368. Now you write your sprite table, and it's nowhere near the 0x2800 you expected. Same thing will happen with register writes.
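The same example can be replayed with a tiny software model of the address register and command counter. It is a sketch built only from the behaviour described above (first byte to the LSB, second byte to the MSB with the command bits stripped, a status read resetting the counter). Register writes are left out and all names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint16_t address;   /* 14-bit VRAM address register */
    int second_byte;    /* 0: next command byte is the first, 1: the second */
} vdp_model;

/* Write one byte to the (modelled) command port. */
static void vdp_write_command(vdp_model *v, uint8_t b)
{
    if (!v->second_byte) {
        /* First byte always lands in the LSB of the address register. */
        v->address = (uint16_t)((v->address & 0x3F00) | b);
        v->second_byte = 1;
    } else {
        /* Second byte: strip the two command bits, load the MSB. */
        v->address = (uint16_t)(((b & 0x3F) << 8) | (v->address & 0x00FF));
        v->second_byte = 0;
    }
}

/* Reading the status register resets the command counter. */
static uint8_t vdp_read_status(vdp_model *v)
{
    v->second_byte = 0;
    return 0;
}

/* Set a write address, optionally with an NMI (modelled as a status read)
   landing between the two command bytes. Returns the resulting address. */
static uint16_t set_write_address(vdp_model *v, uint16_t addr, int nmi_between)
{
    vdp_write_command(v, (uint8_t)(addr & 0xFF));
    if (nmi_between) {
        (void)vdp_read_status(v);
    }
    vdp_write_command(v, (uint8_t)((addr >> 8) | 0x40));
    return v->address;
}
```

Starting from 0x0320, set_write_address(&v, 0x2800, 0) ends at 0x2800, while the interrupted version ends at 0x0368, matching the walkthrough.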

There are a few ways to deal with this.

First, you can just keep the VDP out of the main code. If you do all your VDP access (and I mean ALL of it) in the NMI function, then you can't ever interrupt VDP accesses in the main code (because there isn't any). In practice this can be tough, but there is a trick... just do ALL your game code in the NMI. All your main code needs to do is sit in a loop... if the NMI does not read the VDP status register until the end, it will still be safe even if it takes more than a frame to execute (although it will run more slowly). A lot of NES games work this way.

Second, you can disable the interrupt enable in the VDP permanently, and poll the status register for blanking status. You have to be careful that you do the interrupt disable at a safe point (ie: during the NMI function) so that it itself can not be interrupted. I don't like this answer too much because we've found evidence that heavy polling of the status register can sometimes miss changes. In general it seems to work though, you won't often notice if like 1/500 frames gets missed. This means there's no NMI code at all and nothing you do can ever be pre-empted, but it's harder to get a consistent timing or catch the vertical blank, you have to plan for it.
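A polling loop for this second approach might look like the sketch below. The status port is simulated by a counter so the code runs on a PC; on real hardware read_vdp_status would be a read of the status port instead. The names and the max_polls timeout are assumptions for illustration.

```c
#include <stdint.h>

#define STATUS_FRAME 0x80           /* end-of-frame bit in the status byte */

/* Simulation only: how many more reads return "not yet in blank". */
static int reads_until_vblank = 3;

/* Simulated status read (a real one would read the VDP status port). */
static uint8_t read_vdp_status(void)
{
    if (reads_until_vblank > 0) {
        reads_until_vblank--;
        return 0;
    }
    return STATUS_FRAME;
}

/* Poll until the end-of-frame bit shows up. Returns the number of polls
   it took, or -1 if the bit was not seen within max_polls reads. */
static int wait_for_vblank(int max_polls)
{
    for (int polls = 1; polls <= max_polls; polls++) {
        if (read_vdp_status() & STATUS_FRAME) {
            return polls;
        }
    }
    return -1;
}
```

The timeout is there because, as noted above, a polled frame can occasionally be missed, and the game should not hang when that happens.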

Third, and this is the approach I went with in my library (and what I adapted for this fix), you can control whether the NMI is allowed to access the VDP, in essence creating a critical section. When setting this, you have to be really strict: if the NMI is not allowed to access the VDP, it does not touch it at all. Since you can't prevent the interrupt from happening, you need to make sure it doesn't break anything.

I first did a proof of concept to test the theory that it really was the NMI interrupting at a bad point. I changed the CRT0 code to do NOTHING to the VDP if the no_nmi variable was set, not even read the status register. It would just exit. Then I added two macros to control no_nmi. "BLOCK_NMI" simply set no_nmi to 1. "RELEASE_NMI" had to do slightly more: if an NMI was skipped then the status bit would still be high. Since this was just a dumb set, it would first set no_nmi to 0 (to allow NMIs to occur), and then it just blindly read the status register to release any old, missed one. The order of those two operations is important: if it was the other way around, a new NMI could occur before no_nmi was cleared, and the status bit would be stuck high.
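As a rough illustration of that proof of concept, here is a PC-runnable sketch. The macro names and no_nmi come from the description above, but the bodies are my reconstruction, and the status register, handler and vblank are simulated with hypothetical helpers rather than the real CRT0 code.

```c
#include <stdint.h>

static volatile uint8_t no_nmi;   /* 1 while main code is talking to the VDP */
static int status_frame_flag;     /* simulated end-of-frame status bit */
static int nmi_work_done;         /* frames the handler actually processed */

/* Simulated status read: returns the flag and clears it. */
static uint8_t read_vdp_status(void)
{
    uint8_t s = status_frame_flag ? 0x80 : 0x00;
    status_frame_flag = 0;
    return s;
}

#define BLOCK_NMI()   do { no_nmi = 1; } while (0)
/* Order matters: allow NMIs first, then release any missed one. */
#define RELEASE_NMI() do { no_nmi = 0; (void)read_vdp_status(); } while (0)

static void nmi_handler(void)
{
    if (no_nmi) {
        return;                   /* blocked: do not touch the VDP at all */
    }
    nmi_work_done++;              /* per-frame work would go here */
    (void)read_vdp_status();      /* re-arm the interrupt for the next frame */
}

/* Simulation: the NMI only fires if the status bit was low before. */
static void simulate_vblank(void)
{
    int was_high = status_frame_flag;
    status_frame_flag = 1;
    if (!was_high) {
        nmi_handler();
    }
}
```

A blocked frame is simply dropped here, which is why this simple version ran slower when the blocked case was hit often.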

I then updated just the title page code with the new macros around the VDP access, and gave it a run. This worked, clearing up the corruption that I saw. But because it hit the 'no_nmi' case so often, it ran a lot slower. So I implemented the more complex fix from my lib.

The more complex fix is, in a nutshell, just an extension of the above. It changes no_nmi from a byte flag to two bit flags. The least significant bit indicates if NMIs are blocked, and the most significant bit is set if an NMI was missed. This let me use the Z80 bit commands to update them -- I could also have used two bytes. The important thing is that because the flags are updated by two different processes (the main code and the NMI), any writes need to happen in a single instruction. It's not safe to read/update/write, because that could be pre-empted by the other process, changed, and then you come back and finish the original change, losing what the other process did.
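The read/update/write hazard can be shown with a deterministic toy model. Nothing here is real interrupt code: the "NMI" is just a parameter that says where the handler's update lands, and the flag values and function names are assumptions.

```c
#include <stdint.h>

#define FLAG_BLOCKED 0x01   /* least significant bit: NMIs are blocked */
#define FLAG_MISSED  0x80   /* most significant bit: an NMI was missed */

/* Unsafe: read, modify and write are three separate steps, and the NMI
   lands after the read. Returns the final value of the flags. */
static uint8_t clear_blocked_unsafe(uint8_t flags, int nmi_in_between)
{
    uint8_t copy = flags;                   /* 1. read */
    if (nmi_in_between) {
        flags |= FLAG_MISSED;               /* handler updates the real flags */
    }
    copy &= (uint8_t)~FLAG_BLOCKED;         /* 2. modify the stale copy */
    flags = copy;                           /* 3. write: handler's bit is lost */
    return flags;
}

/* Safe: the update is one indivisible step (a single bit instruction on
   the Z80), so the NMI can only land before or after it. */
static uint8_t clear_blocked_atomic(uint8_t flags, int nmi_before)
{
    if (nmi_before) {
        flags |= FLAG_MISSED;
    }
    flags &= (uint8_t)~FLAG_BLOCKED;
    return flags;
}
```

In the unsafe version the missed-NMI bit disappears, so the frame would never be made up.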

Anyway, the extension, then, is to use these two bits. I updated the above macros, but because Tank Mission already used them a lot, I also rewrote the library's enable_nmi() and disable_nmi() to do this instead of updating the VDP. disable_nmi() just needs to set the least significant bit. enable_nmi() first clears the disable bit - this must happen first! Then it checks the most significant bit. If it's set, indicating a missed NMI, we jump directly over to the NMI (note: in user context!) and process the work we missed, including the clearing of the status register. If it was not set, then we are done. This allows the game code to make up a missed interrupt, as long as it's not too late. It's important to keep the blocking durations as short as possible for good performance.
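Here is a sketch of the two-bit scheme, again simulated so it runs on a PC. The function names enable_nmi and disable_nmi follow the description, but the bodies are my reconstruction, not the actual library code. The bit updates are written with C operators; on the Z80 each would need to be a single bit instruction, as explained above.

```c
#include <stdint.h>

#define NMI_BLOCKED 0x01    /* least significant bit: NMI must not touch the VDP */
#define NMI_MISSED  0x80    /* most significant bit: an NMI arrived while blocked */

static volatile uint8_t nmi_flags;
static int status_frame_flag;   /* simulated end-of-frame status bit */
static int frames_processed;    /* frames whose NMI work actually ran */

/* The per-frame work; the (simulated) status read comes last. */
static void nmi_work(void)
{
    frames_processed++;
    status_frame_flag = 0;
}

static void nmi_handler(void)
{
    if (nmi_flags & NMI_BLOCKED) {
        nmi_flags |= NMI_MISSED;    /* remember it, leave the VDP alone */
        return;
    }
    nmi_work();
}

static void disable_nmi(void)
{
    nmi_flags |= NMI_BLOCKED;
}

static void enable_nmi(void)
{
    nmi_flags &= (uint8_t)~NMI_BLOCKED;     /* this must happen first */
    if (nmi_flags & NMI_MISSED) {
        nmi_flags &= (uint8_t)~NMI_MISSED;
        nmi_work();                         /* catch up, in user context */
    }
}

/* Simulation: the NMI only fires on a low-to-high status transition. */
static void simulate_vblank(void)
{
    int was_high = status_frame_flag;
    status_frame_flag = 1;
    if (!was_high) {
        nmi_handler();
    }
}
```

Typical use is disable_nmi(), a short burst of VDP access, then enable_nmi(); a frame that arrives in between is processed late instead of being lost.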

I also needed to update the delay() function to account for my changes, as it was used a lot.

It is important to note that my 'catch up' mechanism is not compatible with code that MUST run during vertical blank, such as code that needs to write at full speed or wants to double buffer. In Super Space Acer that's only a problem during setup functions, otherwise everything is kept short enough that it's safe. My NMI handler checks there whether an interrupt is being called in a 'catch up' context and skips high speed writes like the sprite table for those frames.

All of the above are just the usual race issues that you see with multi-threaded programs, and that's essentially what a title using the NMI is. You need to assume, in a multi-threaded application, that literally any code running can be interrupted by the other threads at any time, and ensure that an interruption won't break what you're doing.

There was one other thing I noted in cvlib that I don't think was causing issues here, but not being aware of it could lead to random strangeness which is very hard to track down. It makes a lot of use of ColecoVision BIOS calls which apparently store a bit of data around 0x73C0 and up... things like a random number seed and VDP register backups. If your program uses a lot of data up to this point, that data can be overwritten by the BIOS calls and/or the library could make decisions based on your data instead of the cached data it expected. This is also only about 60 bytes from the top of RAM, meaning that it wouldn't take very much stack use to hit this area. If the data changed while the stack was that low, you could see a random crash, maybe even some time after the corruption occurred. Just watch out for that; the map file will help you see how close you are to it. I don't use the BIOS in my lib so I wasn't aware of those mirrors, I had to step through the code to learn what they were.
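The arithmetic for checking a map file against that area is simple enough to sketch. The 0x73C0 figure is from the post above; the RAM top of 0x7400 (RAM ending at 0x73FF) and a stack starting at the very top of RAM are assumptions here, and the function names are made up.

```c
#include <stdint.h>

#define BIOS_WORK_AREA 0x73C0u  /* start of the BIOS data mentioned above */
#define RAM_TOP        0x7400u  /* assumption: RAM ends at 0x73FF */

/* Bytes left between the end of your own data (first free address, as
   read from the map file) and the BIOS work area. Zero or negative means
   your data reaches into it. */
static int headroom_below_bios(uint16_t data_end)
{
    return (int)BIOS_WORK_AREA - (int)data_end;
}

/* Bytes of stack available before it reaches the start of the BIOS work
   area, assuming the stack starts at the very top of RAM. */
static int stack_room_above_bios(void)
{
    return (int)(RAM_TOP - BIOS_WORK_AREA);
}
```

Under those assumptions the stack has 64 bytes before it reaches 0x73C0, in line with the "about 60 bytes" above.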

alekmaul · August 28, 2016

Thanks a lot for all the explanations, they are really nice and understandable!

As I'm also using cvlib and have some minor glitches on some levels, I will dig into cvlib to see how to manage the NMI correctly.
